Bug 578869

Summary: VM is corrupted after snapshot when using virtio driver for virtual disk (cow sparse on iscsi)
Product: Red Hat Enterprise Linux 5
Reporter: Moran Goldboim <mgoldboi>
Component: kvm
Assignee: Kevin Wolf <kwolf>
Status: CLOSED DUPLICATE
QA Contact: Moran Goldboim <mgoldboi>
Severity: urgent
Docs Contact:
Priority: high
Version: 5.5
CC: cpelland, llim, moli, mshao, qzhang, tburke, virt-maint, ycui, ykaul
Target Milestone: rc
Keywords: Reopened
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-06-15 12:40:56 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 556823, 580948    
Attachments (description, flags):
kernel panic screenshot, none
QE reproduce this bug screenshot, none
QE reproduce this bug screenshot2, none

Description Moran Goldboim 2010-04-01 15:53:50 UTC
Description of problem:
VDS: RHEL 5.5 (kernel 194); guest: RHEL 5.5.
When a snapshot of the guest VM is created using RHEV-M, the VM becomes corrupted (kernel panic, segmentation faults). The same process done with the IDE driver works fine.
Host: Intel Xeon Core i7, 12GB

Version-Release number of selected component (if applicable):
kernel: 2.6.18-194
kvm: 83-164

How reproducible:
always, on that system 

Steps to Reproduce:
1. Create a VM from blank with RHEL 5.5.
2. Create a template/snapshot from it.
3. The VM (from the template or after the snapshot) becomes corrupted.

Comment 1 Moran Goldboim 2010-04-01 15:55:07 UTC
qemu-img check went fine on problematic images.

Comment 2 Kevin Wolf 2010-04-01 17:08:57 UTC
Can you please post the panic message?

Also, what exactly do you mean by "segmentation faults"? qemu dies or random processes in the VM die? If the former, a backtrace would be helpful.

Comment 3 Moran Goldboim 2010-04-06 07:47:25 UTC
All problems occur in the VM itself and not on the host. Attached is a screenshot of the kernel panic and random failures; most of the problems seen are related to disk/fs.

Comment 4 Moran Goldboim 2010-04-06 07:48:44 UTC
Created attachment 404639
kernel panic screenshot

Comment 5 Kevin Wolf 2010-04-08 12:28:28 UTC
The subject line says that this is on iscsi (I missed this at first because it's not in the bug description). Is iscsi needed, or do you see the same with the image in a local file or LV?

Comment 6 Moran Goldboim 2010-04-12 15:35:08 UTC
(In reply to comment #5)
> The subject line says that this is on iscsi (I missed this at first because
> it's not in the bug description). Is iscsi needed, or do you see the same with
> the image in a local file or LV?    

I have tried it with local file and all worked well.

Comment 7 Kevin Wolf 2010-04-13 08:18:03 UTC
What about LVs? To qemu they should look the same as iscsi, I think - just a block device.

Comment 8 Kevin Wolf 2010-05-11 07:42:52 UTC
Moran, can you retest with kvm-83-179.el5? This is possibly a duplicate of bug 542954 which is fixed in this version.

Comment 9 Yaniv Kaul 2010-05-11 07:47:54 UTC
(In reply to comment #8)
> Moran, can you retest with kvm-83-179.el5? This is possibly a duplicate of bug
> 542954 which is fixed in this version.    

This will require a whole new kernel and stuff, which we don't have at the moment.
Lawrence, please have someone from your team take over this, if possible. Have you reproduced this?

Comment 10 Kevin Wolf 2010-05-11 08:46:07 UTC
The fix is in the userspace part, so you could just install that part and keep the old kernel. Even just extracting the binary from the new RPM should be enough.
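
A minimal sketch of the "extract the binary from the new RPM" approach, assuming rpm2cpio and cpio are available on the host. The helper name extract_qemu_kvm, the scratch directory, and the RPM file name in the usage note are illustrative assumptions, not taken from this report; only the /usr/libexec/qemu-kvm path comes from the command lines quoted here.

```shell
# Sketch: unpack a kvm RPM into a scratch directory instead of installing it.
# extract_qemu_kvm is a hypothetical helper name; the RPM path is whatever
# file was downloaded locally (its exact file name is an assumption).
extract_qemu_kvm() {
    rpm_file=$1
    dest_dir=${2:-./kvm-extracted}   # hypothetical scratch directory

    [ -r "$rpm_file" ] || { echo "cannot read $rpm_file" >&2; return 1; }

    # Make the RPM path absolute, because we change directory below.
    case $rpm_file in
        /*) ;;
        *)  rpm_file=$PWD/$rpm_file ;;
    esac

    mkdir -p "$dest_dir" || return 1

    # rpm2cpio turns the RPM payload into a cpio stream; cpio -idm unpacks it
    # (-i extract, -d create directories, -m keep modification times).
    ( cd "$dest_dir" && rpm2cpio "$rpm_file" | cpio -idm ) || return 1

    # Path of the binary inside the package, as used in this report.
    [ -x "$dest_dir/usr/libexec/qemu-kvm" ] || return 1
    echo "$dest_dir/usr/libexec/qemu-kvm"
}
```

The function prints the path of the unpacked binary, which can then be used in place of /usr/libexec/qemu-kvm in the original command line while the installed package and kernel stay untouched. For example, new_qemu=$(extract_qemu_kvm ./kvm-83-179.el5.x86_64.rpm /tmp/kvm-179), where the file name is a guess at what the downloaded package would be called.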

Comment 11 Qunfang Zhang 2010-05-17 09:12:11 UTC
Retested this issue on kernel 2.6.18-194, kvm 83-164; cannot reproduce.

Steps:
1. Install a guest on iscsi.
/usr/libexec/qemu-kvm -no-hpet -usbdevice tablet -rtc-td-hack -no-kvm-pit-reinjection -startdate now -drive file=RHEL5.5-Server-20100322.0-x86_64-DVD.iso,media=cdrom -drive file=/dev/vgtest/lv-base,media=disk,format=qcow2,if=virtio,boot=on -net nic,vlan=0,macaddr=10:1a:4a:10:20:40,model=virtio -net tap,vlan=0,script=/etc/qemu-ifup -cpu qemu64,+sse2 -balloon none -vnc :10 -uuid `uuidgen` -monitor stdio -m 2G -smp 2 -boot dc

2. After installation, create the template.
#lvcreate -n lv-template -L 20G vgtest
#qemu-img create -f qcow2 /dev/vgtest/lv-template 20G
#qemu-img convert -f qcow2 /dev/vgtest/lv-base -O qcow2 /dev/vgtest/lv-template 

3. Create snapshot from the template.
# lvs
  LV          VG         Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  LogVol00    VolGroup00 -wi-ao 292.28G                                      
  LogVol01    VolGroup00 -wi-ao   5.69G                                      
  lv-base     vgtest     -wi-a-  20.00G                                      
  lv-template vgtest     -wi-a-  20.00G   
#lvcreate -n lv-sn1 -L 20G vgtest
#qemu-img create -f qcow2 -F qcow2 -b /dev/vgtest/lv-template  /dev/vgtest/lv-sn1 

4. Boot snapshot 1 (lv-sn1) with the above command line.

Result: can boot up successfully.


PS: I tested with virtio block the whole time and did not change the interface.

qzhang -> mgoldboi:

Have you changed the guest interface? I ask because there is a bug:

Bug 561221 - Snapshot of guest suffers kernel panic when installed with virtio block and boot with ide block

Comment 12 Dor Laor 2010-05-26 09:54:28 UTC
mgoldboi has not provided input and it works for qzhang; closing.

Comment 13 Moran Goldboim 2010-06-10 10:25:12 UTC
Repro steps and system details were provided to kwolf.

Comment 14 Moran Goldboim 2010-06-10 10:44:08 UTC
adding the details:
Server- silver-vdsd.qa.lab.tlv.redhat.com
 
Template location: /rhev/data-center/e80168ab-a912-4855-97ff-f778d5746432/8900978c-e842-4037-8f04-c9a740793a13/images/12cb47b1-3fcc-40f1-a17a-b5ccb0a17dd9
 
Instance location: /rhev/data-center/e80168ab-a912-4855-97ff-f778d5746432/8900978c-e842-4037-8f04-c9a740793a13/images/d0996fd9-4f06-4583-8bb8-0339084e1e83/2b4ce82a-e3d4-4086-95c8-2512fd4bed9d
 
Running command: /usr/libexec/qemu-kvm -name fst -smp 1,cores=1 -k en-us -m 1024 -boot cn -net nic,vlan=1,macaddr=00:1a:4a:16:89:0c,model=e1000 -net tap,vlan=1,ifname=e1000_13_1,script=no -drive file=/rhev/data-center/e80168ab-a912-4855-97ff-f778d5746432/8900978c-e842-4037-8f04-c9a740793a13/images/d0996fd9-4f06-4583-8bb8-0339084e1e83/2b4ce82a-e3d4-4086-95c8-2512fd4bed9d,media=disk,if=ide,cache=writeback,serial=83-8bb8-0339084e1e83,boot=on,format=qcow2,werror=stop -vnc 0:13,moran -cpu qemu64,+sse2
 
If I run it with if=ide it works fine, but if I change it to if=virtio we get the bug…

Comment 15 Kevin Wolf 2010-06-10 13:17:52 UTC
Are you sure this is the right one?

It does fail indeed, but never in the way shown in the screenshot you attached. Instead it fails to mount its root device, and the very simple cause seems to be that there is no virtio-blk driver (even a copy of the base image fails this way, with no snapshots involved). At least I can't see any occurrence of "virt" in the kernel log.

Comment 16 Kevin Wolf 2010-06-15 12:40:56 UTC
So Moran provided me with a different image that actually does show the corruption issue. Thanks!

To test this, I created a new snapshot (in a file) and then just tried to boot the guest up:

# qemu-img create -f qcow2 -F qcow2 -b /rhev/data-center/e80168ab-a912-4855-97ff-f778d5746432/8900978c-e842-4037-8f04-c9a740793a13/images/7c140b58-0dc5-48af-b43f-6ac17fc3257e/../7c140b58-0dc5-48af-b43f-6ac17fc3257e/af8425d0-d63e-4d68-a1ec-2e0ca678caa1 overlay.qcow2

# /usr/libexec/qemu-kvm -no-hpet -usbdevice tablet -rtc-td-hack -startdate 2010-06-14T11:42:22 -name xxxft -smp 1,cores=1 -k en-us -m 512 -boot c -drive file=overlay.qcow2,media=disk,if=virtio,cache=writeback,serial=af-b43f-6ac17fc3257e,boot=on,format=qcow2,werror=stop -vnc 0:15 -cpu qemu64,+sse2 -M rhel5.5.0 -notify all -balloon none -k de -serial file:/tmp/serial.out 

With the qemu-kvm binary of the package installed on this machine, I could reproduce the bug every time in three attempts. I tried the same three times with a binary compiled from the current rhel5/master branch and it succeeded each time. As a final test, I also created a fresh snapshot on the block device that Moran had used and ran it with the new binary, and it succeeded as well.

I therefore consider this fixed, and I strongly suspect that it is the fix for bug 542954 that fixes this as well. Marking as a duplicate of that bug.

*** This bug has been marked as a duplicate of bug 542954 ***

Comment 17 Lawrence Lim 2010-06-17 02:19:59 UTC
Comment from Kevin follows. QE, please take note and make sure Kevin's suggestions are well covered.

=======================

Anything that uses lots of synchronous reads/writes (i.e. metadata
operations). Long snapshot chains where a lot of COW happens seem to
be a good candidate.

It's probably enough to test intensively with one backing file format,
preferably qcow2 which may issue synchronous metadata I/O again and
therefore makes the scenario more complex.

For verification of the fix, you need to use virtio-blk (multiple
requests running at once are required to even trigger this bug). On the
other hand, only IDE can directly call synchronous bdrv_read/write which
is touched by this patch, so in order to avoid regressions some tests on
IDE should be run, too.

Kevin
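
One way to set up the long snapshot chains suggested above, as a hedged sketch only. It repeats the qemu-img create form already used in this report (-f qcow2 -F qcow2 -b <backing> <overlay>). The helper name make_chain, the overlay naming scheme, and the QEMU_IMG override variable are assumptions made for illustration.

```shell
# Sketch: build a chain of qcow2 overlays, each backed by the previous one.
# make_chain BASE DEPTH [PREFIX]
#   BASE   - existing qcow2 image to start from
#   DEPTH  - number of overlays to stack on top of BASE
#   PREFIX - hypothetical file name prefix for the overlays
# Set QEMU_IMG to override the qemu-img command (e.g. for a dry run).
make_chain() {
    backing=$1
    depth=$2
    prefix=${3:-overlay}
    qemu_img=${QEMU_IMG:-qemu-img}

    i=1
    while [ "$i" -le "$depth" ]; do
        overlay="$prefix-$i.qcow2"
        # Same command form as in the comments above; left unquoted on
        # purpose so that an override such as "echo qemu-img" is split.
        $qemu_img create -f qcow2 -F qcow2 -b "$backing" "$overlay" || return 1
        backing=$overlay
        i=$((i + 1))
    done

    # Print the top of the chain, i.e. the image the guest should boot from.
    echo "$backing"
}
```

Running it with QEMU_IMG="echo qemu-img" only prints the commands that would be executed. The last line of output names the top overlay, which would then be attached with if=virtio (and, for the regression check, with if=ide) while the guest runs a write-heavy workload so that many COW operations hit the chain.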

Comment 18 Ying Cui 2010-06-17 10:28:13 UTC
Hi all,
   We can NOT reproduce this bug.

   kernel: 2.6.18-194.3.1.el5
   kvm: 83-164
   RHEV-H: 5.5-2.2 (4.1)
   host1: intel xeon core i7
   host2: intel xeon 45nm Core2
   host3: AMD Opteron G2
   guest OS: RHEL 5.5 32bit/64bit, RHEL 5.4 64bit.


Test steps:
  1. Access RHEV-M with vdcadmin user.
  2. Create a VM guest on iscsi storage with virtio disk and rhevm network (cow sparse on iscsi)
  3. After installation, create a snapshot1 for this VM.
  4. Boot the VM, the VM started successfully.
  5. Stop the VM, preview and commit snapshot1.
  6. Boot snapshot1, the VM started successfully.

Additional info:
 1. Command line in RHEV-H:
vdsm     13034 13025  2 09:56 ?        00:00:32 /usr/libexec/qemu-kvm -no-hpet -no-kvm-pit-reinjection -usbdevice tablet -rtc-td-hack -startdate 2010-06-17T02:56:08 -name rhel55-64 -smp 1,cores=1 -k en-us -m 1024 -boot cd -net nic,vlan=1,macaddr=00:1a:4a:42:41:0b,model=e1000 -net tap,vlan=1,ifname=e1000_10_1,script=no -drive file=/rhev/data-center/2e85b7a4-e36c-4a15-b3e0-e41f91fb965c/95a01a9f-4341-44db-b725-34f4d08eff11/images/8aeaca2f-04e5-4389-9956-96109dbfcbd7/c2ce9ef5-9d0e-4a53-a69f-4623a1eceab4,media=disk,if=virtio,cache=off,serial=89-9956-96109dbfcbd7,boot=on,format=qcow2,werror=stop -pidfile /var/vdsm/4f074e4f-7925-480f-97bd-e851e3adbd78.pid -vnc 0:10,password -cpu qemu64,+sse2,+cx16,+ssse3,+sse4.1,+sse4.2,+popcnt -M rhel5.5.0 -notify all -balloon none -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=5.5-2.2-4.1,serial=44454C4C-4800-1032-8033-C7C04F4D3258_00:21:9b:ff:b9:fe,uuid=4f074e4f-7925-480f-97bd-e851e3adbd78 -vmchannel di:0200,unix:/var/vdsm/4f074e4f-7925-480f-97bd-e851e3adbd78.guest.socket,server -monitor unix:/var/vdsm/4f074e4f-7925-480f-97bd-e851e3adbd78.monitor.socket,server

  2. We also tested this bug on rhevm-backup.qa.lab.tlv.redhat.com, which ykaul provided, but again we could NOT reproduce it with the same steps.

  3. We need to continue testing other scenarios for a qcow2 virtual disk on iscsi storage.

Comment 19 Ying Cui 2010-06-23 07:03:37 UTC
We can always reproduce bug 578869 with the following environment:
Host: RHEL 5.5 Server
Kernel: 2.6.18-194.el5
KVM Version: 83-164.el5_5.6
iscsi on Solaris

Verified this bug today:
Host: RHEL 5.5 Server
Kernel: 2.6.18-194.3.1.el5
KVM Version: 83-164.el5_5.12
iscsi on Solaris

Note: We could not reproduce the bug earlier when we used iscsi on NetBSD v1.62. Now the bug can always be reproduced when we use iscsi on Solaris.

Comment 20 Ying Cui 2010-06-23 07:24:51 UTC
Created attachment 426181
QE reproduce this bug screenshot

Comment 21 Ying Cui 2010-06-23 07:25:57 UTC
Created attachment 426182
QE reproduce this bug screenshot2