Red Hat Bugzilla – Bug 578869
VM is corrupted after snapshot when using virtio driver for virtual disk (cow sparse on iscsi)
Last modified: 2013-01-09 17:24:04 EST
Description of problem:
VDS: RHEL 5.5 (kernel 194); guest: RHEL 5.5
When creating a snapshot of the guest VM using RHEV-M, the VM becomes corrupted (kernel panic, segmentation faults). The same process done with the IDE driver works fine.
Host: Intel Xeon Core i7, 12 GB
Version-Release number of selected component (if applicable):
kernel: 2.6.18-194
kvm: 83-164
How reproducible:
Always, on that system
Steps to Reproduce:
1. Create a VM from blank with RHEL 5.5.
2. Create a template/snapshot from it.
3. The VM (from the template or after the snapshot) becomes corrupted.
qemu-img check went fine on problematic images.
Can you please post the panic message?
Also, what exactly do you mean by "segmentation faults"? qemu dies or random processes in the VM die? If the former, a backtrace would be helpful.
All problems occur in the VM itself and not on the host. Attached is a screenshot of the kernel panic and random failures; most of the problems seen are related to disk/fs.
Created attachment 404639
kernel panic screenshot
The subject line says that this is on iscsi (I missed this at first because it's not in the bug description). Is iscsi needed, or do you see the same with the image in a local file or LV?
(In reply to comment #5)
> The subject line says that this is on iscsi (I missed this at first because
> it's not in the bug description). Is iscsi needed, or do you see the same with
> the image in a local file or LV?
I have tried it with a local file and all worked well.
What about LVs? To qemu they should look the same as iscsi, I think - just a block device.
Moran, can you retest with kvm-83-179.el5? This is possibly a duplicate of bug 542954 which is fixed in this version.
(In reply to comment #8)
> Moran, can you retest with kvm-83-179.el5? This is possibly a duplicate of bug
> 542954 which is fixed in this version.
This will require a whole new kernel and stuff, which we don't have at the moment.
Lawrence, please have someone from your team take over this, if possible. Have you reproduced this?
The fix is in the userspace part, so you could just install that part and keep the old kernel. Even just extracting the binary from the new RPM should be enough.
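The suggestion above, taking only the userspace binary from the newer package, could look roughly like the following. This is a sketch, not a tested procedure for this host: extract_qemu_kvm and the destination directory are made-up names, the RPM file name is whatever the kvm-83-179.el5 package was downloaded as, and the ./usr/libexec/qemu-kvm member path assumes the package ships the binary at the same location used in the command lines in this bug.

```shell
#!/bin/sh
# Sketch: unpack only the qemu-kvm binary from a kvm RPM, without installing
# the package or touching the running kernel modules.
# extract_qemu_kvm is a hypothetical helper name, not part of any package.
extract_qemu_kvm() {
    rpm_file=$1
    dest_dir=$2

    [ -f "$rpm_file" ] || { echo "no such RPM: $rpm_file" >&2; return 1; }

    # Make the RPM path absolute because we change directory below.
    case $rpm_file in
        /*) ;;
        *) rpm_file=$PWD/$rpm_file ;;
    esac

    mkdir -p "$dest_dir" || return 1

    # rpm2cpio writes the package payload as a cpio archive to stdout;
    # cpio -i extracts, -d creates leading directories, -m keeps mtimes.
    # The member path below is an assumption about the package layout.
    ( cd "$dest_dir" && rpm2cpio "$rpm_file" | cpio -idm './usr/libexec/qemu-kvm' )
}
```

Called as, say, extract_qemu_kvm kvm-83-179.el5.x86_64.rpm /tmp/kvm-179 (file name assumed), this would leave the binary under /tmp/kvm-179/usr/libexec/qemu-kvm, which can then be started by its full path with the same arguments as before while the installed package stays untouched.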
Re-tested this issue on kernel 2.6.18-194, kvm 83-164; cannot reproduce.
1. Install a guest on iscsi.
/usr/libexec/qemu-kvm -no-hpet -usbdevice tablet -rtc-td-hack -no-kvm-pit-reinjection -startdate now -drive file=RHEL5.5-Server-20100322.0-x86_64-DVD.iso,media=cdrom -drive file=/dev/vgtest/lv-base,media=disk,format=qcow2,if=virtio,boot=on -net nic,vlan=0,macaddr=10:1a:4a:10:20:40,model=virtio -net tap,vlan=0,script=/etc/qemu-ifup -cpu qemu64,+sse2 -balloon none -vnc :10 -uuid `uuidgen` -monitor stdio -m 2G -smp 2 -boot dc
2. After installation, create the template.
#lvcreate -n lv-template -L 20G vgtest
#qemu-img create -f qcow2 /dev/vgtest/lv-template 20G
#qemu-img convert -f qcow2 /dev/vgtest/lv-base -O qcow2 /dev/vgtest/lv-template
3. Create snapshot from the template.
#lvs
  LV          VG         Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  LogVol00    VolGroup00 -wi-ao 292.28G
  LogVol01    VolGroup00 -wi-ao   5.69G
  lv-base     vgtest     -wi-a-  20.00G
  lv-template vgtest     -wi-a-  20.00G
#lvcreate -n lv-sn1 -L 20G vgtest
#qemu-img create -f qcow2 -F qcow2 -b /dev/vgtest/lv-template /dev/vgtest/lv-sn1
4. Boot snapshot 1 (lv-sn1) with the above command line.
Result: can boot up successfully.
PS: I tested with virtio block throughout and did not change the interface.
qzhang -> mgoldboi:
Have you changed the guest interface? I ask because there is a related bug:
Bug 561221 - Snapshot of guest suffers kernel panic when installed with virtio block and boot with ide block
mgoldboi has not provided input and it works for qzhang, so closing.
Repro steps and system details were provided to kwolf.
Adding the details:
Template location: /rhev/data-center/e80168ab-a912-4855-97ff-f778d5746432/8900978c-e842-4037-8f04-c9a740793a13/images/12cb47b1-3fcc-40f1-a17a-b5ccb0a17dd9
Instance location: /rhev/data-center/e80168ab-a912-4855-97ff-f778d5746432/8900978c-e842-4037-8f04-c9a740793a13/images/d0996fd9-4f06-4583-8bb8-0339084e1e83/2b4ce82a-e3d4-4086-95c8-2512fd4bed9d
Running command: /usr/libexec/qemu-kvm -name fst -smp 1,cores=1 -k en-us -m 1024 -boot cn -net nic,vlan=1,macaddr=00:1a:4a:16:89:0c,model=e1000 -net tap,vlan=1,ifname=e1000_13_1,script=no -drive file=/rhev/data-center/e80168ab-a912-4855-97ff-f778d5746432/8900978c-e842-4037-8f04-c9a740793a13/images/d0996fd9-4f06-4583-8bb8-0339084e1e83/2b4ce82a-e3d4-4086-95c8-2512fd4bed9d,media=disk,if=ide,cache=writeback,serial=83-8bb8-0339084e1e83,boot=on,format=qcow2,werror=stop -vnc 0:13,moran -cpu qemu64,+sse2
If I run it with if=ide it works fine, but if I change it to virtio we get the bug…
Are you sure this is the right one?
It does fail indeed, but never in the way shown in the screenshot you attached. Instead it fails to mount its root device, for which the very simple cause seems to be that there is no virtio-blk driver (even a copy of the base image fails this way, with no snapshots involved). At least I can't see any occurrence of "virt" in the kernel log.
So Moran provided me with a different image that actually does show the corruption issue. Thanks!
To test this, I created a new snapshot (in a file) and then just tried to boot the guest up:
# qemu-img create -f qcow2 -F qcow2 -b /rhev/data-center/e80168ab-a912-4855-97ff-f778d5746432/8900978c-e842-4037-8f04-c9a740793a13/images/7c140b58-0dc5-48af-b43f-6ac17fc3257e/../7c140b58-0dc5-48af-b43f-6ac17fc3257e/af8425d0-d63e-4d68-a1ec-2e0ca678caa1 overlay.qcow2
# /usr/libexec/qemu-kvm -no-hpet -usbdevice tablet -rtc-td-hack -startdate 2010-06-14T11:42:22 -name xxxft -smp 1,cores=1 -k en-us -m 512 -boot c -drive file=overlay.qcow2,media=disk,if=virtio,cache=writeback,serial=af-b43f-6ac17fc3257e,boot=on,format=qcow2,werror=stop -vnc 0:15 -cpu qemu64,+sse2 -M rhel5.5.0 -notify all -balloon none -k de -serial file:/tmp/serial.out
With the qemu-kvm binary from the package installed on this machine, I could reproduce the bug every time in three attempts. I tried the same three times with a binary compiled from the current rhel5/master branch and it succeeded each time. As a final test, I also created a fresh snapshot on the block device that Moran had used and ran it with the new binary, and it succeeded as well.
I therefore consider this fixed, and I have a strong suspicion that it is the fix for bug 542954 which fixes this as well. Marking as a duplicate of that bug.
*** This bug has been marked as a duplicate of bug 542954 ***
Comment from Kevin. QE, please take note and make sure Kevin's suggestions are well covered.
Anything that uses lots of synchronous reads/writes (i.e. metadata
operations). Long snapshot chains where a lot of COW happens seem to
be good candidates.
It's probably enough to test intensively with one backing file format,
preferably qcow2 which may issue synchronous metadata I/O again and
therefore makes the scenario more complex.
For verification of the fix, you need to use virtio-blk (multiple
requests running at once are required to even trigger this bug). On the
other hand, only IDE can directly call synchronous bdrv_read/write which
is touched by this patch, so in order to avoid regressions some tests on
IDE should be run, too.
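One way to set up the long snapshot chain suggested above is to stack qcow2 overlays with the same qemu-img create invocation used earlier in this bug. The following is only a sketch: make_chain, the overlay-N.qcow2 naming and the DRY_RUN switch are invented for illustration, and by default it only prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch: stack <depth> qcow2 overlays on top of an existing base image.
# make_chain and DRY_RUN are hypothetical names used only in this example.
# With DRY_RUN=1 (the default) the qemu-img commands are printed, not run.
make_chain() {
    base=$1     # existing qcow2 image (file or block device)
    dir=$2      # directory that receives the overlay files
    depth=$3    # number of overlays to stack

    backing=$base
    i=1
    while [ "$i" -le "$depth" ]; do
        overlay=$dir/overlay-$i.qcow2
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "qemu-img create -f qcow2 -F qcow2 -b $backing $overlay"
        else
            qemu-img create -f qcow2 -F qcow2 -b "$backing" "$overlay" || return 1
        fi
        # The overlay just created becomes the backing file of the next one.
        backing=$overlay
        i=$((i + 1))
    done

    # Report the image the guest should be booted from.
    echo "top: $backing"
}
```

To actually generate COW traffic, the guest would be booted from the top overlay with if=virtio and given a write-heavy workload before each further snapshot is added; repeating a shorter run with if=ide covers the regression side mentioned above.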
We can NOT reproduce this bug.
RHEV-H: 5.5-2.2 (4.1)
host1: intel xeon core i7
host2: intel xeon 45nm Core2
host3: AMD Opteron G2
guest OS: RHEL 5.5 32bit/64bit, RHEL 5.4 64bit.
1. Access RHEV-M as the vdcadmin user.
2. Create a VM guest on iscsi storage with a virtio disk and the rhevm network (cow sparse on iscsi).
3. After installation, create snapshot1 for this VM.
4. Boot the VM; the VM started successfully.
5. Stop the VM, then preview and commit snapshot1.
6. Boot snapshot1; the VM started successfully.
1. Command line in RHEV-H:
vdsm 13034 13025 2 09:56 ? 00:00:32 /usr/libexec/qemu-kvm -no-hpet -no-kvm-pit-reinjection -usbdevice tablet -rtc-td-hack -startdate 2010-06-17T02:56:08 -name rhel55-64 -smp 1,cores=1 -k en-us -m 1024 -boot cd -net nic,vlan=1,macaddr=00:1a:4a:42:41:0b,model=e1000 -net tap,vlan=1,ifname=e1000_10_1,script=no -drive file=/rhev/data-center/2e85b7a4-e36c-4a15-b3e0-e41f91fb965c/95a01a9f-4341-44db-b725-34f4d08eff11/images/8aeaca2f-04e5-4389-9956-96109dbfcbd7/c2ce9ef5-9d0e-4a53-a69f-4623a1eceab4,media=disk,if=virtio,cache=off,serial=89-9956-96109dbfcbd7,boot=on,format=qcow2,werror=stop -pidfile /var/vdsm/4f074e4f-7925-480f-97bd-e851e3adbd78.pid -vnc 0:10,password -cpu qemu64,+sse2,+cx16,+ssse3,+sse4.1,+sse4.2,+popcnt -M rhel5.5.0 -notify all -balloon none -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=5.5-2.2-4.1,serial=44454C4C-4800-1032-8033-C7C04F4D3258_00:21:9b:ff:b9:fe,uuid=4f074e4f-7925-480f-97bd-e851e3adbd78 -vmchannel di:0200,unix:/var/vdsm/4f074e4f-7925-480f-97bd-e851e3adbd78.guest.socket,server -monitor unix:/var/vdsm/4f074e4f-7925-480f-97bd-e851e3adbd78.monitor.socket,server
2. We also tested this on rhevm-backup.qa.lab.tlv.redhat.com, which ykaul provided, but could NOT reproduce the bug there with the same steps either.
3. We need to continue testing other scenarios for qcow2 virtual disks on iscsi storage.
We can always reproduce bug 578869 with the following environment:
Host: RHEL 5.5 Server
iscsi on Solaris
Verified this bug today:
Host: RHEL 5.5 Server
iscsi on Solaris
Note: We could not reproduce the bug earlier when we used iscsi on NetBSD v1.62. Now the bug can always be reproduced when we use iscsi on Solaris.
Created attachment 426181
QE reproduce this bug screenshot
Created attachment 426182
QE reproduce this bug screenshot2