Bug 574495

Summary:

QEMU: VM crash with qcow2 over block device error (and virtio-blk driver)

Product:

Red Hat Enterprise Linux 5

Reporter:

Yaniv Kaul <ykaul>

Component:

kvm

Assignee:

Kevin Wolf <kwolf>

Status:

CLOSED DUPLICATE

QA Contact:

Virtualization Bugs <virt-bugs>

Severity:

urgent

Docs Contact:

Priority:

high

Version:

5.5

CC:

akong, cpelland, jinzishuai, jkt, llim, tburke, virt-maint, ykaul

Target Milestone:

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2010-05-10 11:12:51 UTC

Bug Depends On:

Bug Blocks:

556823, 580948

Attachments:

Description	Flags
qemu-img check results	none
VDSM log	none

Description Yaniv Kaul 2010-03-17 16:34:41 UTC

Description of problem:
Ran an installation of Windows Server 2008 R2 with virtio-blk over a block device (FCP). During file extraction, after an lvextend by VDSM (run via RHEVM 2.2), the VM crashed.
bt:
#0  0x0000003926e30265 in raise () from /lib64/libc.so.6
#1  0x0000003926e31d10 in abort () from /lib64/libc.so.6
#2  0x00000000004981ab in free_clusters (bs=<value optimized out>, offset=<value optimized out>, size=<value optimized out>) at block-qcow2.c:2673
#3  0x000000000049a0f4 in alloc_cluster_link_l2 (bs=0x1da96640, m=<value optimized out>) at block-qcow2.c:1105
#4  0x000000000049a7ae in qcow_aio_write_cb (opaque=0x1db41d90, ret=0) at block-qcow2.c:1547
#5  0x0000000000419707 in posix_aio_read (opaque=<value optimized out>) at block-raw-posix.c:512
#6  0x00000000004094b2 in main_loop_wait (timeout=<value optimized out>) at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/vl.c:3983
#7  0x00000000004ff1ea in kvm_main_loop () at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/qemu-kvm.c:596
#8  0x000000000040e425 in main_loop (argc=44, argv=0x7fff67a5ffc8, envp=<value optimized out>) at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/vl.c:4040
#9  main (argc=44, argv=0x7fff67a5ffc8, envp=<value optimized out>) at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/vl.c:6476

last stdout message: "qcow2: free_clusters failed: Invalid argument"

Comment 1 Yaniv Kaul 2010-03-17 16:41:22 UTC

Complete command line:

/usr/libexec/qemu-kvm -no-hpet -usb  -rtc-td-hack  -startdate 2010-3-17T9:11:24  -name W2K8_R2PV_COW_V  -smp 1,cores=1  -k en-us  -m 512  -boot cdn  -net nic,vlan=1,macaddr=00:1a:4a:16:97:1d,model=virtio -net tap,vlan=1,ifname=virtio_10_1,script=no  -drive file=/rhev/data-center/6bc3de76-9469-4852-91d0-b2bf668f941c/beddbe7d-e512-4d0c-b11f-a2ae5ae98e9e/images/725c6944-2fb2-49f2-b84d-edf3b7787270/fb5a14c2-224a-40fc-af5c-364efbb80861,media=disk,if=virtio,cache=off,serial=f2-b84d-edf3b7787270,boot=on,format=qcow2,werror=stop -drive file=/rhev/data-center/6bc3de76-9469-4852-91d0-b2bf668f941c/dfe7d294-0cc3-484c-a2d4-eb6908c49960/images/11111111-1111-1111-1111-111111111111/en_windows_server_2008_r2_standard_enterprise_datacenter_and_web_x64_dvd_x15-59754.iso,media=cdrom,index=2,if=ide  -fda /rhev/data-center/6bc3de76-9469-4852-91d0-b2bf668f941c/dfe7d294-0cc3-484c-a2d4-eb6908c49960/images/11111111-1111-1111-1111-111111111111/virtio-drivers-1.0.0-8.vfd -pidfile /var/vdsm/557eb7ff-7e25-4dc5-a9f4-c60cf36afadf.pid -vnc 0:10,password  -cpu qemu64,+sse2,+cx16,+ssse3,+sse4.1,+sse4.2,+popcnt  -M rhel5.5.0  -notify all  -balloon none  -smbios type=1,manufacturer="Red Hat",product="RHEL",version=5Server-5.5.0.2,serial="9978F05A-B189-11DE-9BD8-00215EC7F8AC_00:21:5e:c7:f8:ac",uuid="557eb7ff-7e25-4dc5-a9f4-c60cf36afadf"  -vmchannel di:0200,unix:/var/vdsm/557eb7ff-7e25-4dc5-a9f4-c60cf36afadf.guest.socket,server -monitor unix:/var/vdsm/557eb7ff-7e25-4dc5-a9f4-c60cf36afadf.monitor.socket,server 1>/var/vdsm/557eb7ff-7e25-4dc5-a9f4-c60cf36afadf.stdio.dump 2>&1; /usr/bin/sudo /usr/bin/tunctl -d virtio_10_1

Comment 2 Yaniv Kaul 2010-03-17 16:49:12 UTC

Created attachment 400826 [details]
qemu-img check results

Comment 3 Yaniv Kaul 2010-03-17 16:51:03 UTC

Created attachment 400828 [details]
VDSM log

Comment 4 Kevin Wolf 2010-03-18 09:19:30 UTC

Meh, I seem to have missed that my comment wasn't added yesterday because Yaniv added something else and now it's gone. Let me see if I can remember what it said...

One thing I noticed was that the failure happened at the very last clusters in the first refcount block, with the second refcount block not yet allocated. So the problem could be either that the refcount of these last clusters was already corrupted and would have become negative, or that the update crossed the boundary and tried to decrease the refcount there, which led to the allocation of a new refcount block, which might have gone wrong. Another observation I made is that alloc_refcount_block can turn any write error into EINVAL.
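
To make the two hypotheses above concrete, here is a minimal sketch of the refcount-block arithmetic. It is a toy model, not QEMU's code: all function names are hypothetical, it assumes 16-bit refcount entries (as in qcow2 version 2), and it assumes a 64 KiB cluster size, since the cluster size of the affected image is not stated in this report.

```python
# Toy model of qcow2 refcount-block addressing (NOT QEMU's actual code).
# Assumptions: 16-bit refcount entries and a 64 KiB cluster size; the
# cluster size of the image in this report is unknown.

REFCOUNT_BYTES = 2  # one 16-bit refcount entry per cluster


def entries_per_refcount_block(cluster_size):
    """Number of clusters whose refcounts fit in one refcount block."""
    return cluster_size // REFCOUNT_BYTES


def refcount_block_index(offset, cluster_size):
    """Index of the refcount block that tracks the cluster at `offset`."""
    cluster_index = offset // cluster_size
    return cluster_index // entries_per_refcount_block(cluster_size)


def blocks_touched(offset, length, cluster_size):
    """Refcount block indexes touched when freeing [offset, offset + length)."""
    first = refcount_block_index(offset, cluster_size)
    last = refcount_block_index(offset + length - 1, cluster_size)
    return list(range(first, last + 1))


def decrement(refcount):
    """Drop one reference; a result below zero is rejected as invalid."""
    if refcount - 1 < 0:
        raise ValueError("refcount would become negative")
    return refcount - 1


if __name__ == "__main__":
    cs = 64 * 1024
    # Bytes of image tracked by a single refcount block.
    covered = entries_per_refcount_block(cs) * cs
    print(covered)
    # Freeing only the last cluster tracked by block 0 stays inside block 0.
    print(blocks_touched(covered - cs, cs, cs))
    # Freeing two clusters from the same offset crosses into block 1,
    # which in the scenario above had not been allocated yet.
    print(blocks_touched(covered - cs, 2 * cs, cs))
```

Under these assumptions one refcount block covers 2 GiB of image, so a free that starts in the last tracked cluster and runs past it needs a second refcount block. `decrement` models the other hypothesis: an entry that is already 0 because of earlier corruption cannot be decreased.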

Yaniv said that shortly before the crash a high watermark was reached and this might actually have been an ENOSPC.
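
The error-masking concern can be illustrated with a small sketch. This is not the QEMU code path; the helper names are hypothetical and only show how collapsing every failure into EINVAL would make an out-of-space condition look like "Invalid argument" in the log.

```python
import errno
import os


def write_masking_errors(write):
    """Collapse any failure into -EINVAL, hiding the real cause."""
    try:
        write()
        return 0
    except OSError:
        return -errno.EINVAL


def write_propagating_errors(write):
    """Return the negative errno of the failure that actually occurred."""
    try:
        write()
        return 0
    except OSError as exc:
        return -exc.errno


def failing_write():
    # Hypothetical stand-in for a write that runs out of space.
    raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))


if __name__ == "__main__":
    masked = write_masking_errors(failing_write)
    real = write_propagating_errors(failing_write)
    # The masked variant reports EINVAL although the cause was ENOSPC.
    print(errno.errorcode[-masked], os.strerror(-masked))
    print(errno.errorcode[-real], os.strerror(-real))
```

If the real error were propagated, a caller could tell an out-of-space condition apart from a genuinely invalid request and react to it differently.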

Comment 5 Dor Laor 2010-04-11 13:51:47 UTC

Kevin, might the new refcount fix also fix this?

Comment 6 Kevin Wolf 2010-04-12 07:53:29 UTC

Good question. Without that fix I/O errors during refcount block allocation could lead to almost any kind of corruption. It might have fixed the problem completely (if a previous refcount block allocation error has caused the situation), it might have fixed part of it (the abort() would still happen, but with no additional image corruption) or it might be completely unrelated - we can't know.

Yaniv, something like this hasn't happened again since you reported the bug?

Comment 7 Yaniv Kaul 2010-05-10 10:53:58 UTC

(In reply to comment #6)
> Good question. Without that fix I/O errors during refcount block allocation
> could lead to almost any kind of corruption. It might have fixed the problem
> completely (if a previous refcount block allocation error has caused the
> situation), it might have fixed part of it (the abort() would still happen, but
> with no additional image corruption) or it might be completely unrelated - we
> can't know.
> 
> Yaniv, something like this hasn't happened again since you reported the bug?    

Nope.

Comment 8 Kevin Wolf 2010-05-10 11:12:51 UTC

As long as nobody can reproduce it, we can't do anything about it anyway. I'm closing this as a duplicate of the bug Dor mentioned. If it later turns out that the patches don't fix this case, please reopen.

*** This bug has been marked as a duplicate of bug 567940 ***

Comment 9 Shi jin 2010-12-28 21:06:03 UTC

I can reproduce this problem with a Windows 2008 server VM running on RHEL-5.5 with the virtio disk and network drivers.

It happens when I try to attach a disk image file in qcow2 format to the VM using the "virsh attach-disk" command. The log I got also shows:

qcow2: free_clusters failed: Invalid argument