Bug 578869
| Summary: | VM is corrupted after snapshot when using virtio driver for virtual disk (cow sparse on iscsi) | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Moran Goldboim <mgoldboi> |
| Component: | kvm | Assignee: | Kevin Wolf <kwolf> |
| Status: | CLOSED DUPLICATE | QA Contact: | Moran Goldboim <mgoldboi> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 5.5 | CC: | cpelland, llim, moli, mshao, qzhang, tburke, virt-maint, ycui, ykaul |
| Target Milestone: | rc | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2010-06-15 12:40:56 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 556823, 580948 | | |
| Attachments: | | | |
Description
Moran Goldboim
2010-04-01 15:53:50 UTC
qemu-img check went fine on the problematic images.

Can you please post the panic message? Also, what exactly do you mean by "segmentation faults"? Does qemu die, or do random processes in the VM die? If the former, a backtrace would be helpful.

All problems occur in the VM itself and not on the host. Attached is a screenshot of the kernel panic and random failures; most of the problems seen are related to disk/fs.

Created attachment 404639 [details]
kernel panic screenshot
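
The `qemu-img check` step mentioned above can be repeated over several images with a small loop. This is only a minimal sketch, assuming the images are qcow2 and their paths are passed as arguments; the function name `check_images` and the `QEMU_IMG` override variable are hypothetical and not part of the reporter's procedure.

```shell
#!/bin/sh
# Hypothetical helper: run "qemu-img check" on each image path given as an argument.
# QEMU_IMG may be overridden (for example for a dry run); it defaults to qemu-img.
QEMU_IMG="${QEMU_IMG:-qemu-img}"

check_images() {
    failed=0
    for img in "$@"; do
        # Any non-zero exit status is treated as a failure in this sketch.
        if "$QEMU_IMG" check -f qcow2 "$img" >/dev/null 2>&1; then
            echo "OK: $img"
        else
            echo "FAILED: $img"
            failed=1
        fi
    done
    return "$failed"
}
```

A call such as `check_images /dev/vgtest/lv-template /dev/vgtest/lv-sn1` would then report each image separately. As this bug shows, a clean check does not rule out corruption seen inside the guest.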
The subject line says that this is on iscsi (I missed this at first because it's not in the bug description). Is iscsi needed, or do you see the same with the image in a local file or LV?

(In reply to comment #5)
> The subject line says that this is on iscsi (I missed this at first because
> it's not in the bug description). Is iscsi needed, or do you see the same with
> the image in a local file or LV?

I have tried it with a local file and all worked well.

What about LVs? To qemu they should look the same as iscsi, I think - just a block device.

Moran, can you retest with kvm-83-179.el5? This is possibly a duplicate of bug 542954 which is fixed in this version.

(In reply to comment #8)
> Moran, can you retest with kvm-83-179.el5? This is possibly a duplicate of bug
> 542954 which is fixed in this version.

This will require a whole new kernel and stuff, which we don't have at the moment. Lawrence, please have someone from your team take over this, if possible.

Have you reproduced this? The fix is in the userspace part, so you could just install that part and keep the old kernel. Even just extracting the binary from the new RPM should be enough.

Re-tested this issue on kernel: 2.6.18-194, kvm: 83-164; cannot reproduce.

Steps:
1. Install a guest on iscsi.
/usr/libexec/qemu-kvm -no-hpet -usbdevice tablet -rtc-td-hack -no-kvm-pit-reinjection -startdate now -drive file=RHEL5.5-Server-20100322.0-x86_64-DVD.iso,media=cdrom -drive file=/dev/vgtest/lv-base,media=disk,format=qcow2,if=virtio,boot=on -net nic,vlan=0,macaddr=10:1a:4a:10:20:40,model=virtio -net tap,vlan=0,script=/etc/qemu-ifup -cpu qemu64,+sse2 -balloon none -vnc :10 -uuid `uuidgen` -monitor stdio -m 2G -smp 2 -boot dc
2. After installation, create the template.
# lvcreate -n lv-template -L 20G vgtest
# qemu-img create -f qcow2 /dev/vgtest/lv-template 20G
# qemu-img convert -f qcow2 /dev/vgtest/lv-base -O qcow2 /dev/vgtest/lv-template
3. Create snapshot from the template.
# lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
LogVol00 VolGroup00 -wi-ao 292.28G
LogVol01 VolGroup00 -wi-ao 5.69G
lv-base vgtest -wi-a- 20.00G
lv-template vgtest -wi-a- 20.00G
# lvcreate -n lv-sn1 -L 20G vgtest
# qemu-img create -f qcow2 -F qcow2 -b /dev/vgtest/lv-template /dev/vgtest/lv-sn1
4. Boot snapshot 1: lv-sn1 with the above command line.
Result: can boot up successfully.
PS: I tested with virtio block the whole time and have not changed the interface.

qzhang -> mgoldboi: Have you changed the guest interface? I ask because there is a bug: Bug 561221 - Snapshot of guest suffers kernel panic when installed with virtio block and boot with ide block

mgoldboi does not provide input, it does work for qzhang, closing.

Repro steps and system details were provided to kwolf; adding the details:
Server: silver-vdsd.qa.lab.tlv.redhat.com
Template location: /rhev/data-center/e80168ab-a912-4855-97ff-f778d5746432/8900978c-e842-4037-8f04-c9a740793a13/images/12cb47b1-3fcc-40f1-a17a-b5ccb0a17dd9
Instance location: /rhev/data-center/e80168ab-a912-4855-97ff-f778d5746432/8900978c-e842-4037-8f04-c9a740793a13/images/d0996fd9-4f06-4583-8bb8-0339084e1e83/2b4ce82a-e3d4-4086-95c8-2512fd4bed9d
Running command:
/usr/libexec/qemu-kvm -name fst -smp 1,cores=1 -k en-us -m 1024 -boot cn -net nic,vlan=1,macaddr=00:1a:4a:16:89:0c,model=e1000 -net tap,vlan=1,ifname=e1000_13_1,script=no -drive file=/rhev/data-center/e80168ab-a912-4855-97ff-f778d5746432/8900978c-e842-4037-8f04-c9a740793a13/images/d0996fd9-4f06-4583-8bb8-0339084e1e83/2b4ce82a-e3d4-4086-95c8-2512fd4bed9d,media=disk,if=ide,cache=writeback,serial=83-8bb8-0339084e1e83,boot=on,format=qcow2,werror=stop -vnc 0:13,moran -cpu qemu64,+sse2
If I run it with if=ide it works fine, but if I change it to virtio we get the bug…

Are you sure this is the right one? It does fail indeed, but never in the way shown in the screenshot you attached.
Instead it fails mounting its root device - for which the very simple cause seems to be that there is no virtio-blk driver (even a copy of the base image fails this way, with no snapshots involved). At least I can't see any occurrence of "virt" in the kernel log.

So Moran provided me with a different image that actually does show the corruption issue. Thanks!

To test this, I created a new snapshot (in a file) and then just tried to boot the guest up:

# qemu-img create -f qcow2 -F qcow2 -b /rhev/data-center/e80168ab-a912-4855-97ff-f778d5746432/8900978c-e842-4037-8f04-c9a740793a13/images/7c140b58-0dc5-48af-b43f-6ac17fc3257e/../7c140b58-0dc5-48af-b43f-6ac17fc3257e/af8425d0-d63e-4d68-a1ec-2e0ca678caa1 overlay.qcow2
# /usr/libexec/qemu-kvm -no-hpet -usbdevice tablet -rtc-td-hack -startdate 2010-06-14T11:42:22 -name xxxft -smp 1,cores=1 -k en-us -m 512 -boot c -drive file=overlay.qcow2,media=disk,if=virtio,cache=writeback,serial=af-b43f-6ac17fc3257e,boot=on,format=qcow2,werror=stop -vnc 0:15 -cpu qemu64,+sse2 -M rhel5.5.0 -notify all -balloon none -k de -serial file:/tmp/serial.out

With the qemu-kvm binary of the package installed on this machine, I could reproduce the bug every time in three attempts. Tried the same three times with a binary compiled from the current rhel5/master branch and succeeded. As a final test, I also created a fresh snapshot on the block device that Moran had used and ran it with the new binary and it succeeded as well.

I consider this fixed therefore, and I have a strong suspicion that it's the fix of bug 542954 which fixes this as well. Marking as a duplicate of that bug.

*** This bug has been marked as a duplicate of bug 542954 ***

Comment from Kevin, QE please take note and make sure the suggestions made by Kevin are well covered.
=======================
Anything that uses lots of synchronous reads/writes (i.e. metadata operations). Long snapshot chains where a lot of COW happens seem to be a good candidate.
It's probably enough to test intensively with one backing file format, preferably qcow2 which may issue synchronous metadata I/O again and therefore makes the scenario more complex.

For verification of the fix, you need to use virtio-blk (multiple requests running at once are required to even trigger this bug). On the other hand, only IDE can directly call synchronous bdrv_read/write which is touched by this patch, so in order to avoid regressions some tests on IDE should be run, too.

Kevin

Hi all,
We can NOT reproduce this bug.
kernel: 2.6.18-194.3.1.el5
kvm: 83-164
RHEV-H: 5.5-2.2 (4.1)
host1: intel xeon core i7
host2: intel xeon 45nm Core2
host3: AMD Opteron G2
guest OS: RHEL 5.5 32bit/64bit, RHEL 5.4 64bit.

Test steps:
1. Access RHEV-M with vdcadmin user.
2. Create a VM guest on iscsi storage with virtio disk and rhevm network (cow sparse on iscsi)
3. After installation, create a snapshot1 for this VM.
4. Boot the VM, the VM started successfully.
5. Stop the VM, preview and commit the snapshot1
6. Boot the snapshot1, the VM started successfully.

Additional info:
1. Command line in RHEV-H:
vdsm 13034 13025 2 09:56 ? 00:00:32 /usr/libexec/qemu-kvm -no-hpet -no-kvm-pit-reinjection -usbdevice tablet -rtc-td-hack -startdate 2010-06-17T02:56:08 -name rhel55-64 -smp 1,cores=1 -k en-us -m 1024 -boot cd -net nic,vlan=1,macaddr=00:1a:4a:42:41:0b,model=e1000 -net tap,vlan=1,ifname=e1000_10_1,script=no -drive file=/rhev/data-center/2e85b7a4-e36c-4a15-b3e0-e41f91fb965c/95a01a9f-4341-44db-b725-34f4d08eff11/images/8aeaca2f-04e5-4389-9956-96109dbfcbd7/c2ce9ef5-9d0e-4a53-a69f-4623a1eceab4,media=disk,if=virtio,cache=off,serial=89-9956-96109dbfcbd7,boot=on,format=qcow2,werror=stop -pidfile /var/vdsm/4f074e4f-7925-480f-97bd-e851e3adbd78.pid -vnc 0:10,password -cpu qemu64,+sse2,+cx16,+ssse3,+sse4.1,+sse4.2,+popcnt -M rhel5.5.0 -notify all -balloon none -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=5.5-2.2-4.1,serial=44454C4C-4800-1032-8033-C7C04F4D3258_00:21:9b:ff:b9:fe,uuid=4f074e4f-7925-480f-97bd-e851e3adbd78 -vmchannel di:0200,unix:/var/vdsm/4f074e4f-7925-480f-97bd-e851e3adbd78.guest.socket,server -monitor unix:/var/vdsm/4f074e4f-7925-480f-97bd-e851e3adbd78.monitor.socket,server
2. We also tested this bug on rhevm-backup.qa.lab.tlv.redhat.com, which ykaul provided, but we also can NOT reproduce this bug with the same steps.
3. We need to continue to test other scenarios for qcow2 virtual disk with iscsi storage

We can always reproduce bug 578869 with the following env.:
Host: RHEL 5.5 Server
Kernel: 2.6.18-194.el5
KVM Version: 83-164.el5_5.6
iscsi on Solaris

Verified this bug today:
Host: RHEL 5.5 Server
Kernel: 2.6.18-194.3.1.el5
KVM Version: 83-164.el5_5.12
iscsi on Solaris
Note: We could not reproduce the bug when we used iscsi on NetBSD v1.62 before. Now this bug can always be reproduced when we use iscsi on Solaris.

Created attachment 426181 [details]
QE reproduce this bug screenshot
Created attachment 426182 [details]
QE reproduce this bug screenshot2
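
Kevin's test suggestions earlier in this bug (long qcow2 snapshot chains, booted once with virtio-blk to verify the fix and once with IDE to check for regressions) could be scripted roughly as follows. This is a dry-run sketch under stated assumptions, not a procedure anyone in this thread ran: the function name `print_chain_cmds`, the `overlay-N.qcow2` file names, and the reduced set of qemu-kvm options are all hypothetical. It only prints the commands so they can be reviewed before being run by hand.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the commands that would stack a chain of
# qcow2 overlays on an existing base image. Nothing is executed here.
print_chain_cmds() {
    base="$1"       # existing qcow2 base image (assumed path)
    depth="$2"      # number of overlays to stack on top of it
    backing="$base"
    i=1
    while [ "$i" -le "$depth" ]; do
        overlay="overlay-$i.qcow2"
        # Same form of qemu-img invocation as used earlier in this bug.
        echo "qemu-img create -f qcow2 -F qcow2 -b $backing $overlay"
        backing="$overlay"
        i=$((i + 1))
    done
    # Boot the top of the chain once per disk interface:
    # virtio to verify the fix, ide to look for regressions.
    for iface in virtio ide; do
        echo "/usr/libexec/qemu-kvm -m 512 -boot c -drive file=$backing,media=disk,if=$iface,boot=on,format=qcow2 -vnc 0:15"
    done
}
```

For example, `print_chain_cmds /path/to/base.qcow2 3` prints three `qemu-img create` lines followed by two boot commands for `overlay-3.qcow2`. The guest would still need a write-heavy workload inside each overlay to make COW actually happen.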