Bug 1282713

Summary:	Guest hits filesystem errors during ping-pong live migration while the VM is being installed and storage runs out of space (ENOSPC)
Product:	Red Hat Enterprise Linux 6	Reporter:	Qianqian Zhu <qizhu>
Component:	qemu-kvm	Assignee:	Juan Quintela <quintela>
Status:	CLOSED WONTFIX	QA Contact:	Virtualization Bugs <virt-bugs>
Severity:	high	Docs Contact:
Priority:	high
Version:	6.8	CC:	amit.shah, chayang, mkenneth, quintela, rbalakri, virt-maint
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Last Closed:	2015-11-18 13:12:05 UTC	Type:	Bug

Description Qianqian Zhu 2015-11-17 08:32:13 UTC

Description of problem:
The guest hits filesystem errors during ping-pong live migration while it is being installed on an image whose underlying LV runs out of space (ENOSPC). The qemu-kvm monitor does not report a no-space error, so the installation cannot be continued after the LV is enlarged.

Version-Release number of selected component (if applicable):
kernel-2.6.32-584.el6.x86_64
qemu-kvm-0.12.1.2-2.481.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Make a 2G LV and create a 20G qcow2 image on it.
2. Start the VM on the source host:
# /usr/libexec/qemu-kvm -name linux -cpu Westmere,check -m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 7bef3814-631a-48bb-bae8-2b1de75f7a13 -nodefaults -monitor stdio -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot order=d,menu=on -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive file=/dev/mapper/my--volume--group-2G_LV,if=none,id=scsi0,format=qcow2  -device virtio-scsi-pci,id=scsi0 -device scsi-disk,drive=scsi0,scsi-id=0,lun=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on -spice port=5930,disable-ticketing -vga qxl -global qxl-vga.vram_size=33554432 -device qxl,id=video1,vram_size=67108864,bus=pci.0,addr=0x4 -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=3C:D9:2B:09:AB:44,bus=pci.0,addr=0x7 -drive file=/mnt/ISO/RHEL-Server-6.7/64/RHEL-Server-6.7-x86_64-latest.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,drive=drive-ide0-1-0,id=ide0-1-0

3. During guest installation, do ping-pong migration until the guest stops.
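
Step 1 could be sketched roughly as follows. The volume group and LV names are assumptions inferred from the -drive path in step 2 (device-mapper doubles any hyphen inside a name), and the `run` and `dm_escape` helpers are hypothetical. The script only prints the commands unless RUN=1 is set, since executing them needs root and an existing volume group.

```shell
#!/bin/sh
# Assumed names, inferred from /dev/mapper/my--volume--group-2G_LV in step 2.
VG="my-volume-group"
LV="2G_LV"

# device-mapper escapes "-" inside VG/LV names as "--" and joins them with "-".
dm_escape() { printf '%s' "$1" | sed 's/-/--/g'; }
DEV="/dev/mapper/$(dm_escape "$VG")-$(dm_escape "$LV")"

# Hypothetical helper: print the command by default, execute it only when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

run lvcreate -L 2G -n "$LV" "$VG"          # 2G logical volume
run qemu-img create -f qcow2 "$DEV" 20G    # 20G virtual size on the 2G device
```

Because the qcow2 virtual size (20G) is larger than the LV (2G), writes during installation eventually fail with ENOSPC on the host side, which is the condition this report depends on.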

Actual results:
The guest stops with filesystem errors:
 # dmesg 
EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: group 3: 32768 blocks in bitmap, 31743 in gd
JBD: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
JBD: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
EXT4-fs error (device sda1): ext4_mb_generate_buddy: EXT4-fs: group 32: 8192 blocks in bitmap, 4096 in gd
EXT4-fs error (device dm-0): ext4_add_entry: bad entry in directory #782932: rec_len is smaller than minimal - block=3154260offset=0(0), inode=0, rec_len=0, name_len=0

Expected results:
The guest stops due to ENOSPC without any filesystem error, the qemu-kvm monitor reports an I/O error (no space), and the installation can be continued after the LV is enlarged.

Additional info:

Comment 2 Juan Quintela 2015-11-18 09:59:10 UTC

This problem is fixed in RHEL 7.2 and upstream. There are several problems:

- You are not using libvirt.
- You are migrating with -S on the destination (related to the previous point).
- While the migration is in progress, you get an -ENOSPC error
  (on the source qemu).
- You don't check for it there, so the migration finishes and the guest runs on the destination.
- On the destination, qemu hasn't seen the error, so it continues, and then it hits the real error.

Possible solutions:
- backport all (or at least the required part) of the migration events series
- just ban running qemu without libvirt
- "hack" qcow2 to return -ENOSPC on restart
  (I am not sure whether that is easier or more difficult than backporting the upstream fix).

What do you think?

I am assuming that you can't reproduce it without migration, and that if you look for the error on the source qemu you will find it.

Comment 3 Juan Quintela 2015-11-18 13:12:05 UTC

Won't fix.
It is fixed in 7.2 and not seen if you use libvirt.

Comment 4 Qianqian Zhu 2015-11-27 04:44:46 UTC

Hi Quintela,

Just to make sure there is no misunderstanding:
no ENOSPC error is actually reported by EITHER the source or the destination qemu. Do you think it is still the same issue in this situation?

(In reply to Juan Quintela from comment #2)

Comment 5 Juan Quintela 2015-12-16 10:55:19 UTC

Hi

Yes, I think it is the same issue.

You run on machine A.
You migrate to machine B.
During the migration, you get an -ENOSPC on machine A.
The migration to machine B finishes correctly.
You run the guest on machine B
and migrate back to machine A.
Running on machine B causes the real error.
The guest sees the error.
The migration to machine A finishes.

You now have a guest that has seen the error,
and neither machine A nor machine B shows -ENOSPC.

Before RHEL 7.2 (and upstream 2.4, if I remember correctly), if you are using -S, you need to check that the source is in the paused state (without errors) before doing the cont on the destination. That way you detect the error when it first happens. From RHEL 7.2 on, we have added code that simply migrates the error state, so your current test should work/fail as you expect.
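
One possible way to script that check is sketched below. It is not taken from this bug. It assumes the source qemu also exposes a QMP monitor (the reproducer above only uses -monitor stdio), that the test harness saves the source's QMP output to a file with one JSON object per line, and that a disk error shows up there as a BLOCK_IO_ERROR event. Both function names are hypothetical.

```shell
#!/bin/sh
# Return 0 if the captured QMP stream of the source contains a BLOCK_IO_ERROR event.
# $1: path to the capture file (how it is produced is left to the harness).
source_saw_io_error() {
    grep -q '"event"[[:space:]]*:[[:space:]]*"BLOCK_IO_ERROR"' "$1"
}

# Gate the `cont` on the destination: refuse if the source reported an I/O error.
may_cont_on_destination() {
    if source_saw_io_error "$1"; then
        echo "source reported BLOCK_IO_ERROR; do not cont on destination" >&2
        return 1
    fi
    return 0
}
```

A harness would call `may_cont_on_destination /path/to/source-qmp.log` after each migration completes and only send `cont` to the destination when it returns 0.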

Is this clearer?

Later, Juan.