Red Hat Bugzilla – Bug 574495
QEMU: VM crash with qcow2 over block device error (and virtio-blk driver)
Last modified: 2013-07-03 21:48:29 EDT
Description of problem:
Ran an installation of Windows 2008 R2 with virtio-blk over a block device (FCP). During the file extraction, after an lvextend by VDSM (run via RHEVM 2.2), the VM crashed.
#0 0x0000003926e30265 in raise () from /lib64/libc.so.6
#1 0x0000003926e31d10 in abort () from /lib64/libc.so.6
#2 0x00000000004981ab in free_clusters (bs=<value optimized out>, offset=<value optimized out>, size=<value optimized out>) at block-qcow2.c:2673
#3 0x000000000049a0f4 in alloc_cluster_link_l2 (bs=0x1da96640, m=<value optimized out>) at block-qcow2.c:1105
#4 0x000000000049a7ae in qcow_aio_write_cb (opaque=0x1db41d90, ret=0) at block-qcow2.c:1547
#5 0x0000000000419707 in posix_aio_read (opaque=<value optimized out>) at block-raw-posix.c:512
#6 0x00000000004094b2 in main_loop_wait (timeout=<value optimized out>) at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/vl.c:3983
#7 0x00000000004ff1ea in kvm_main_loop () at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/qemu-kvm.c:596
#8 0x000000000040e425 in main_loop (argc=44, argv=0x7fff67a5ffc8, envp=<value optimized out>) at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/vl.c:4040
#9 main (argc=44, argv=0x7fff67a5ffc8, envp=<value optimized out>) at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/vl.c:6476
last stdout message: "qcow2: free_clusters failed: Invalid argument"
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Complete command line:
/usr/libexec/qemu-kvm -no-hpet -usb -rtc-td-hack -startdate 2010-3-17T9:11:24 -name W2K8_R2PV_COW_V -smp 1,cores=1 -k en-us -m 512 -boot cdn -net nic,vlan=1,macaddr=00:1a:4a:16:97:1d,model=virtio -net tap,vlan=1,ifname=virtio_10_1,script=no -drive file=/rhev/data-center/6bc3de76-9469-4852-91d0-b2bf668f941c/beddbe7d-e512-4d0c-b11f-a2ae5ae98e9e/images/725c6944-2fb2-49f2-b84d-edf3b7787270/fb5a14c2-224a-40fc-af5c-364efbb80861,media=disk,if=virtio,cache=off,serial=f2-b84d-edf3b7787270,boot=on,format=qcow2,werror=stop -drive file=/rhev/data-center/6bc3de76-9469-4852-91d0-b2bf668f941c/dfe7d294-0cc3-484c-a2d4-eb6908c49960/images/11111111-1111-1111-1111-111111111111/en_windows_server_2008_r2_standard_enterprise_datacenter_and_web_x64_dvd_x15-59754.iso,media=cdrom,index=2,if=ide -fda /rhev/data-center/6bc3de76-9469-4852-91d0-b2bf668f941c/dfe7d294-0cc3-484c-a2d4-eb6908c49960/images/11111111-1111-1111-1111-111111111111/virtio-drivers-1.0.0-8.vfd -pidfile /var/vdsm/557eb7ff-7e25-4dc5-a9f4-c60cf36afadf.pid -vnc 0:10,password -cpu qemu64,+sse2,+cx16,+ssse3,+sse4.1,+sse4.2,+popcnt -M rhel5.5.0 -notify all -balloon none -smbios type=1,manufacturer="Red Hat",product="RHEL",version=5Server-126.96.36.199,serial="9978F05A-B189-11DE-9BD8-00215EC7F8AC_00:21:5e:c7:f8:ac",uuid="557eb7ff-7e25-4dc5-a9f4-c60cf36afadf" -vmchannel di:0200,unix:/var/vdsm/557eb7ff-7e25-4dc5-a9f4-c60cf36afadf.guest.socket,server -monitor unix:/var/vdsm/557eb7ff-7e25-4dc5-a9f4-c60cf36afadf.monitor.socket,server 1>/var/vdsm/557eb7ff-7e25-4dc5-a9f4-c60cf36afadf.stdio.dump 2>&1; /usr/bin/sudo /usr/bin/tunctl -d virtio_10_1
Created attachment 400826
qemu-img check results
Created attachment 400828
Meh, I seem to have missed that my comment wasn't added yesterday because Yaniv added something else and now it's gone. Let me see if I can remember what it said...
One thing I noticed was that the failure happened at the very last clusters in the first refcount block, with the second refcount block not yet allocated. So the problem could be either that the refcount of these last clusters was already corrupted and would have become negative, or that the update crossed the boundary and tried to decrease the refcount there, which led to the allocation of a new refcount block, which might have gone wrong. Another observation I made is that alloc_refcount_block can turn any write error into EINVAL.
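To make the boundary case above concrete, here is a minimal sketch of how a host offset maps to a refcount block. It is not the QEMU code. It assumes 16-bit refcount entries (as in qcow2 version 2) and 64 KiB clusters; the report does not state the cluster size of the affected image, and the function names are made up for illustration.

```python
# Assumption: 64 KiB clusters; the actual image may use a different size.
CLUSTER_BITS = 16
# qcow2 version 2 stores each refcount as a 16-bit (2-byte) entry.
REFCOUNT_ENTRY_BYTES = 2


def refcount_block_index(offset, cluster_bits=CLUSTER_BITS):
    """Return the index of the refcount block that covers a host offset."""
    cluster_size = 1 << cluster_bits
    # One refcount block occupies one cluster, so it holds this many entries.
    entries_per_block = cluster_size // REFCOUNT_ENTRY_BYTES
    cluster_index = offset >> cluster_bits
    return cluster_index // entries_per_block


def crosses_refcount_block(offset, length, cluster_bits=CLUSTER_BITS):
    """Return True if [offset, offset + length) touches two refcount blocks."""
    if length <= 0:
        raise ValueError("length must be positive")
    first = refcount_block_index(offset, cluster_bits)
    last = refcount_block_index(offset + length - 1, cluster_bits)
    return first != last
```

Under these assumptions one refcount block covers 32768 clusters, that is 2 GiB of the image file, so a range that starts in the last cluster below 2 GiB and extends past it would need a second refcount block.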
Yaniv said that shortly before the crash a high watermark was reached and this might actually have been an ENOSPC.
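The following sketch illustrates why an ENOSPC could surface as "Invalid argument" if an allocation routine collapses every write failure into EINVAL. All function names here are hypothetical stand-ins, not the actual QEMU functions, and the real code path may differ.

```python
import errno


def write_refcount_block(fail_with=0):
    """Hypothetical low-level write: returns 0 on success or -errno on failure."""
    return -fail_with


def alloc_masking(fail_with=0):
    """Collapses any write failure into -EINVAL, hiding the original cause."""
    ret = write_refcount_block(fail_with)
    if ret < 0:
        return -errno.EINVAL
    return 0


def alloc_propagating(fail_with=0):
    """Passes the original error code up so the caller can see e.g. ENOSPC."""
    ret = write_refcount_block(fail_with)
    if ret < 0:
        return ret
    return 0
```

With the masking variant, a caller that would otherwise react to ENOSPC (for example by pausing the VM, as werror=stop is meant to allow) only ever sees EINVAL.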
Kevin, might the new refcount fix also fix this?
Good question. Without that fix I/O errors during refcount block allocation could lead to almost any kind of corruption. It might have fixed the problem completely (if a previous refcount block allocation error has caused the situation), it might have fixed part of it (the abort() would still happen, but with no additional image corruption) or it might be completely unrelated - we can't know.
Yaniv, something like this hasn't happened again since you reported the bug?
(In reply to comment #6)
> Good question. Without that fix I/O errors during refcount block allocation
> could lead to almost any kind of corruption. It might have fixed the problem
> completely (if a previous refcount block allocation error has caused the
> situation), it might have fixed part of it (the abort() would still happen, but
> with no additional image corruption) or it might be completely unrelated - we
> can't know.
> Yaniv, something like this hasn't happened again since you reported the bug?
As long as nobody can reproduce it, we can't do anything about it anyway. I'm closing this as a duplicate of the bug Dor mentioned. If it later turns out that the patches don't fix this case, please reopen.
*** This bug has been marked as a duplicate of bug 567940 ***
I can reproduce this problem with a Windows 2008 server VM running on RHEL-5.5 with the virtio disk and network drivers.
It happens when I use the "virsh attach-disk" command to attach a disk image file in qcow2 format to the VM. The log I got also shows:
qcow2: free_clusters failed: Invalid argument