Bug 682174

Summary: qcow2 code falls into an endless loop in alloc_clusters_noref
Product: Red Hat Enterprise Linux 5
Component: kvm
Version: 5.6
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: high
Priority: high
Target Milestone: rc
Target Release: 5.7
Reporter: Mark Wu <dwu>
Assignee: Kevin Wolf <kwolf>
QA Contact: Virtualization Bugs <virt-bugs>
CC: areis, juzhang, jwest, mkenneth, pgervase, pmdyermms, rhod, shuang, tburke, virt-maint
Doc Type: Bug Fix
Last Closed: 2012-04-10 08:40:05 UTC
Bug Blocks: 580949, 807971

Description Mark Wu 2011-03-04 11:07:18 UTC
Description of problem:
After running the command "dd if=/dev/zero of=test.img bs=4k count=10000" on a guest OS that uses a qcow2 image, the guest OS becomes non-responsive. strace shows that the qemu thread keeps reading the guest image forever (fd 11 points to the guest image):

pread(11, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1"..., 4096, 5897314304) = 4096
pread(11, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1"..., 4096, 5905584128) = 4096
...


Version-Release number of selected component (if applicable):
Guest OS: rhel4.8
Host OS: rhel5.6
kvm: kvm-83-224.el5


How reproducible:
Almost always on the problematic guest image.

Steps to Reproduce:
It has only been observed on one guest image from a customer, so I don't know the exact steps to reproduce it.

Actual results:


Expected results:


Additional info:
# qemu-img info /data/fullvm.img 
image: /data/fullvm.img
file format: qcow2
virtual size: 70G (75161927680 bytes)
disk size: 9.1G
cluster_size: 4096

Qemu command line: 
root     10042  2.7  9.0 4474416 742796 ?      Dl   16:07   4:49 /usr/libexec/qemu-kvm -S -M rhel5.4.0 -m 4096 -smp 4,sockets=4,cores=1,threads=1 -name rh4-malongo -uuid 66bdac84-9faa-4f11-9b75-07a036f6f86e -monitor unix:/var/lib/libvirt/qemu/rh4-malongo.monitor,server,nowait -localtime -no-kvm-pit-reinjection -boot c -drive if=ide,media=cdrom,bus=1,unit=0,readonly=on,format=raw -drive file=/data/fullvm.img,if=virtio,boot=on,format=qcow2 -net nic,macaddr=54:52:00:1d:c0:f5,vlan=0,model=virtio -net tap,fd=16,vlan=0 -serial pty -parallel none -usb -vnc 127.0.0.1:4 -k en-us -vga cirrus -balloon virtio

Comment 4 Kevin Wolf 2011-03-07 09:18:19 UTC
qemu-img check prints a lot of errors, I suppose?

Comment 7 pmdyermms 2011-03-08 03:57:38 UTC
31 errors on the copy of the image I have locally.

Paul


qemu-img check malongo.img 
Leaked cluster 475439 refcount=1 reference=0
Leaked cluster 475461 refcount=1 reference=0
Leaked cluster 475462 refcount=1 reference=0
Leaked cluster 493070 refcount=1 reference=0
Leaked cluster 505351 refcount=1 reference=0
Leaked cluster 507836 refcount=1 reference=0
Leaked cluster 2156950 refcount=1 reference=0
Leaked cluster 2156951 refcount=1 reference=0
Leaked cluster 2246627 refcount=1 reference=0
Leaked cluster 2246628 refcount=1 reference=0
Leaked cluster 2246629 refcount=1 reference=0
Leaked cluster 2246630 refcount=1 reference=0
Leaked cluster 2277456 refcount=1 reference=0
Leaked cluster 2277457 refcount=1 reference=0
Leaked cluster 2277458 refcount=1 reference=0
Leaked cluster 2277459 refcount=1 reference=0
Leaked cluster 2277460 refcount=1 reference=0
Leaked cluster 2277461 refcount=1 reference=0
Leaked cluster 2277462 refcount=1 reference=0
Leaked cluster 2277463 refcount=1 reference=0
Leaked cluster 2277464 refcount=1 reference=0
Leaked cluster 2277465 refcount=1 reference=0
Leaked cluster 2277466 refcount=1 reference=0
Leaked cluster 2277467 refcount=1 reference=0
Leaked cluster 2277468 refcount=1 reference=0
Leaked cluster 2277469 refcount=1 reference=0
Leaked cluster 2277470 refcount=1 reference=0
Leaked cluster 2277471 refcount=1 reference=0
Leaked cluster 2277472 refcount=1 reference=0
Leaked cluster 2277473 refcount=1 reference=0
Leaked cluster 2277474 refcount=1 reference=0

31 leaked clusters were found on the image.
This means waste of disk space, but no harm to data.

Comment 8 Kevin Wolf 2011-03-08 08:52:50 UTC
Hm, okay, but those are only cluster leaks (clusters whose refcount says they are in use although nothing references them), so the metadata seems to be consistent.

The hexdump that Mark posted actually doesn't look too bad; it's just a fully used refcount block. Are you sure that the endless loop is in alloc_clusters_noref and not in a caller? You can use the 'finish' gdb command to avoid single-stepping all the way through the refcounts.
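
For reference, each qcow2 refcount entry is a big-endian 16-bit value, so the repeating "\0\1" bytes in the pread buffers above decode to refcount 1 for every cluster the block covers. A minimal standalone decoding sketch (my own illustration, not qemu code):

#include <arpa/inet.h>   /* ntohs */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CLUSTER_SIZE 4096                  /* matches qemu-img info above */
#define ENTRIES (CLUSTER_SIZE / 2)         /* 2048 16-bit refcounts per block */

int main(void)
{
    /* Rebuild the block seen in the trace: bytes 00 01 repeated. */
    uint8_t buf[CLUSTER_SIZE];
    for (int i = 0; i < CLUSTER_SIZE; i += 2) {
        buf[i] = 0x00;
        buf[i + 1] = 0x01;
    }

    uint16_t entries[ENTRIES];
    memcpy(entries, buf, sizeof(entries));

    int free_clusters = 0;
    for (int i = 0; i < ENTRIES; i++) {
        if (ntohs(entries[i]) == 0)        /* refcount 0 means free */
            free_clusters++;
    }
    printf("free clusters covered by this block: %d\n", free_clusters);
    return 0;
}

This prints 0: the block is fully used, so the allocator has to move on to the next refcount block.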

Comment 9 pmdyermms 2011-03-09 18:06:51 UTC
Hi,

I was able to run qemu-img check against the two production images that are the origin of this ticket. One image had 31 leaked clusters; the other had 64. The information I sent earlier, from the USB copy of the production image, is consistent with what we are seeing in production.

Paul

Comment 10 Mark Wu 2011-03-14 11:32:53 UTC
Kevin,
I think the problem here is that there's no zeroed entry available in the refcount table. Reading from the refcount table always seems to return the following data:
(gdb) x/2048h 0x17245630 
0x17245630:     0x0100  0x0100  0x0100  0x0100  0x0100  0x0100  0x0100  0x0100
0x17245640:     0x0100  0x0100  0x0100  0x0100  0x0100  0x0100  0x0100  0x0100
0x17245650:     0x0100  0x0100  0x0100  0x0100  0x0100  0x0100  0x0100  0x0100
0x17245660:     0x0100  0x0100  0x0100  0x0100  0x0100  0x0100  0x0100  0x0100
....

So it keeps retrying in alloc_clusters_noref:

retry:
    for(i = 0; i < nb_clusters; i++) {
        /* The cursor is bumped on every probe, even when retrying. */
        int64_t next_cluster_index = s->free_cluster_index++;
        refcount = get_refcount(bs, next_cluster_index);

        if (refcount < 0) {
            return refcount;            /* error reading the refcount */
        } else if (refcount != 0) {
            goto retry;                 /* in use: restart the free run */
        }
    }

But I have no idea how every cluster came to be referenced.

Comment 11 Kevin Wolf 2011-03-14 11:56:08 UTC
In every loop iteration s->free_cluster_index is increased by one, so we can't get an endless loop here. If no refcount block contains a free cluster, after scanning all clusters in vain a new refcount block would be appended at the end and initialized with zeros.

If you say that you do see an endless loop, what value does s->free_cluster_index have and how does it change while looping through this code?
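
For reference, a standalone toy model (my own sketch, not qemu code) of why the quoted loop terminates: the cursor is incremented on every probe, so each retry resumes further along and the scan can never revisit a cluster.

#include <stdint.h>
#include <stdio.h>

/* Toy refcounts for a 16-cluster image: 1 = in use, 0 = free. */
static int refcounts[16] = {1,1,1,0,1,0,0,1, 0,0,0,0,0,0,0,0};
static int64_t free_cluster_index = 0;

/* Mirrors the retry loop quoted in comment 10. */
static int64_t alloc_noref(int nb_clusters)
{
retry:
    for (int i = 0; i < nb_clusters; i++) {
        int64_t next = free_cluster_index++;
        if (next >= 16)
            return -1;              /* stand-in for growing the image */
        if (refcounts[next] != 0)
            goto retry;             /* in use: restart the run */
    }
    return free_cluster_index - nb_clusters;  /* first cluster of the run */
}

int main(void)
{
    /* Prints 8: the first run of 3 free clusters starts at cluster 8. */
    printf("run starts at cluster %lld\n", (long long)alloc_noref(3));
    return 0;
}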

Comment 12 Mark Wu 2011-03-17 11:12:14 UTC
Kevin, 
Yes, I was wrong; it's not an endless loop in alloc_clusters_noref. I found that under some conditions update_refcount moves free_cluster_index (2474903) back to the freed cluster (1058272), so the next alloc_clusters_noref has to traverse the refcount table from 1058272 all the way back up to 2474903. See the sketch below.
Could you shed some light on this for me? Many thanks! I will continue to look into it.
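
For context, the logic at block-qcow2.c:3132 appears to be roughly the following; this is a paraphrase based on the upstream qcow2 refcount code and the breakpoint below, not an exact quote of this kvm-83 build:

    /* In update_refcount(), when a cluster's refcount drops to zero: */
    if (refcount == 0 && cluster_index < s->free_cluster_index) {
        /* Freeing a cluster below the allocation cursor pulls the
         * cursor back so the hole can be reused. */
        s->free_cluster_index = cluster_index;   /* block-qcow2.c:3132 */
    }

The breakpoint output below shows this assignment firing, with the cursor jumping from 2474903 back to 1058272 and then from 2475063 back to 1058281: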


Breakpoint 31, update_refcount (bs=0xd898010, offset=4334685696, length=<value optimized out>, addend=-1) at block-qcow2.c:3132
3132	            s->free_cluster_index = cluster_index;
(gdb) p ((BDRVQcowState*)0xd895dd0)->free_cluster_index
$128 = 2474903
(gdb) n
54	    return bswap_16(x);
(gdb) n
3134	        s->refcount_block_cache[block_index] = cpu_to_be16(refcount);
(gdb) p ((BDRVQcowState*)0xd895dd0)->free_cluster_index
$129 = 1058272
(gdb) bt
#0  update_refcount (bs=0xd898010, offset=4334685696, length=<value optimized out>, addend=-1) at block-qcow2.c:3134
#1  0x000000000049982e in free_clusters (bs=0xd8af630, offset=3008, size=227106256) at block-qcow2.c:2740
#2  0x000000000049b2e0 in alloc_cluster_link_l2 (bs=0xd898010, m=<value optimized out>) at block-qcow2.c:1127
#3  0x000000000049b37e in qcow_aio_write_cb (opaque=0xee1bc10, ret=0) at block-qcow2.c:1592
#4  0x0000000000419c57 in posix_aio_read (opaque=<value optimized out>) at block-raw-posix.c:483
#5  0x0000000000409672 in main_loop_wait (timeout=<value optimized out>) at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/vl.c:4048
#6  0x00000000005005fa in kvm_main_loop () at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/qemu-kvm.c:596
#7  0x000000000040e795 in main_loop (argc=39, argv=0x7fff08c583b8, envp=<value optimized out>) at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/vl.c:4105
#8  main (argc=39, argv=0x7fff08c583b8, envp=<value optimized out>) at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/vl.c:6545
(gdb) cont
Continuing.
[Thread 0x66200940 (LWP 14430) exited]
[New Thread 0x66200940 (LWP 14439)]

Breakpoint 31, update_refcount (bs=0xd898010, offset=4334722048, length=<value optimized out>, addend=-1) at block-qcow2.c:3132
3132	            s->free_cluster_index = cluster_index;
(gdb) p ((BDRVQcowState*)0xd895dd0)->free_cluster_index
$130 = 2475063
(gdb) n
54	    return bswap_16(x);
(gdb) n
3134	        s->refcount_block_cache[block_index] = cpu_to_be16(refcount);
(gdb) p ((BDRVQcowState*)0xd895dd0)->free_cluster_index
$131 = 1058281
(gdb) bt
#0  update_refcount (bs=0xd898010, offset=4334722048, length=<value optimized out>, addend=-1) at block-qcow2.c:3134
#1  0x000000000049982e in free_clusters (bs=0xd8af630, offset=3026, size=227106256) at block-qcow2.c:2740
#2  0x000000000049b2e0 in alloc_cluster_link_l2 (bs=0xd898010, m=<value optimized out>) at block-qcow2.c:1127
#3  0x000000000049b37e in qcow_aio_write_cb (opaque=0xee1bc10, ret=0) at block-qcow2.c:1592
#4  0x0000000000419c57 in posix_aio_read (opaque=<value optimized out>) at block-raw-posix.c:483
#5  0x0000000000409672 in main_loop_wait (timeout=<value optimized out>) at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/vl.c:4048
#6  0x00000000005005fa in kvm_main_loop () at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/qemu-kvm.c:596
#7  0x000000000040e795 in main_loop (argc=39, argv=0x7fff08c583b8, envp=<value optimized out>) at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/vl.c:4105
#8  main (argc=39, argv=0x7fff08c583b8, envp=<value optimized out>) at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/vl.c:6545
(gdb) cont
Continuing.
[Thread 0x66200940 (LWP 14439) exited]
[New Thread 0x66200940 (LWP 14447)]
[New Thread 0x69405940 (LWP 14448)]

Comment 13 Mark Wu 2011-05-27 10:22:56 UTC
Kevin,
I think the problem here is the big jump backwards of free_cluster_index. After the jump, alloc_clusters_noref probably has to check every refcount entry from the new index back up to the original one, as shown in comment 12: from 1058272 to 2474903, and from 1058281 to 2475063. (Roughly: each 4 KiB refcount block holds 2048 16-bit entries, so rescanning ~1.4 million clusters means on the order of 700 refcount-block reads per allocation.) That's why I see qemu spending a lot of time in pread, with the data "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1" being read out repeatedly.

From the backtrace in comment 12, the jump backwards should happen when clusters are freed in alloc_cluster_link_l2. If there are two concurrent writes to a not-yet-allocated cluster, each will use a separate cluster; when the second write finishes, it copies the unmodified data from the cluster used by the first write and frees that old cluster (see the sketch below). That would mean the guest OS is issuing concurrent writes to the same location. However, the problem was reproduced simply by running dd in the guest, with no explicit concurrent writing. I added some debug logging to alloc_cluster_link_l2 to verify this and to see which location was being written, but unfortunately I can't reproduce it any more with the image.
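
To illustrate the suspected mechanism, a hypothetical timeline (my own sketch of the sequence described above; the cluster names X and Y are invented, not taken from this trace):

/* Guest issues two overlapping writes to the same not-yet-allocated
 * guest cluster G (hypothetical example):
 *
 *   write A: allocates host cluster X, submits its AIO
 *   write B: allocates host cluster Y, submits its AIO
 *   A completes: alloc_cluster_link_l2() points the L2 entry for G at X
 *   B completes: alloc_cluster_link_l2() copies the unmodified data,
 *                repoints the L2 entry for G at Y, and calls
 *                free_clusters() on X
 *   free_clusters() -> update_refcount(addend = -1), which pulls
 *                s->free_cluster_index back to X, as seen in comment 12
 */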

Any idea?

Comment 14 pmdyermms 2011-06-02 14:56:32 UTC
Mark,

I have updated case 390745 twice since I last heard from you. Could you send a response?

Thanks

Comment 17 Peter Gervase 2011-06-02 16:32:01 UTC
Paul, I have copied the postings from the 390745 ticket to here. We're going over the output and will update you with our findings.

Pete

Comment 18 pmdyermms 2011-06-17 00:12:53 UTC
Hi,

I ran qemu-img convert and moved the KVM guests to a raw image. That increased the image size by about 5 GB but corrected the problem with the file; the uninterruptible sleep state no longer occurs on disk writes.

I have the original problem image on www.ctghome.com. You can download it if more work needs to be done.

I am moving on with this. Thanks for the help.

Paul

Comment 19 Suqin Huang 2011-10-10 02:28:20 UTC
Hi pmdyermms,
Did you remove the image? I cannot find it.

Thanks
suqin

Comment 20 Suqin Huang 2011-10-10 07:56:22 UTC
Reproduced it with the special image.

kvm-83-241.el5
2.6.18-286.el5

cmd:

/usr/libexec/qemu-kvm -monitor stdio -drive file=/root/bug682174,index=0,if=virtio,media=disk,cache=none,boot=on,format=qcow2 -net nic,vlan=0,model=virtio,macaddr='9a:d0:37:77:25:21' -net tap,vlan=0,ifname='t0-043914-vo3C',script=/home/Auto/autotest/client/tests/kvm/scripts/qemu-ifup-switch,downscript='no' -m 2048 -smp 4,cores=2,threads=1,sockets=2 -cpu qemu64,+sse2 -vnc :1 -rtc-td-hack  -boot c  -usbdevice tablet -no-kvm-pit-reinjection

Comment 22 RHEL Program Management 2012-04-02 10:47:01 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 23 Ademar Reis 2012-04-09 22:40:03 UTC
The bug is 13 months old, GSS has moved on after using a workaround, and we need a special image to reproduce it. I recommend closing it as WONTFIX on RHEL5.