Bug 1524770

Summary: qemu-img convert hangs on converting qcow2 to raw
Product: Red Hat Enterprise Linux 7
Reporter: KOSAL RAJ I <kiyyappa>
Component: qemu-kvm-rhev
Assignee: Kevin Wolf <kwolf>
Status: CLOSED DUPLICATE
QA Contact: Ping Li <pingl>
Severity: high
Docs Contact:
Priority: unspecified
Version: 7.4
CC: berrange, brian.fife, coli, dasmith, dhill, eglynn, kchamart, kiyyappa, knoel, mbooth, michen, ngu, pingl, rbryant, sbauza, sferdjao, sgordon, shivapriya.o.hiremath, srevivo, virt-maint, vromanso
Target Milestone: pre-dev-freeze
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-12-21 18:28:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description KOSAL RAJ I 2017-12-12 02:00:26 UTC
Description of problem:
On some of the hypervisors, converting a qcow2 image to raw as part of instance creation hangs, and the instance remains in the 'BUILD' state. Attempting to delete the instance leaves the instance/stack stuck in the 'deleting' state. The only workaround so far has been to restart openstack-nova-compute on the affected hypervisor.

Version-Release number of selected component (if applicable):
RHOSP 10

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 8 Gu Nini 2017-12-19 10:17:29 UTC
Ping,

Could you try to reproduce this bug on the latest RHEL 7.4.z versions?

Comment 11 Brian Fife 2017-12-21 02:24:21 UTC
curl -O http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img
i=1; while true; do echo $i; qemu-img convert -O raw cirros-0.4.0-x86_64-disk.img cirros-0.4.0-x86_64-disk.raw -f qcow2; i=$((i+1)); done

The hang occurred on iteration 2604.

ps -ef | grep convert
root       24322    8602  0 21:13 pts/10   00:00:00 qemu-img convert -O raw cirros-0.4.0-x86_64-disk.img cirros-0.4.0-x86_64-disk.raw -f qcow2
[root@nspcloud-compute-43 ~]# pstack 24322
#0  0x00007ff16a943aff in ppoll () from /lib64/libc.so.6
#1  0x000056211520458b in qemu_poll_ns ()
#2  0x0000562115205378 in main_loop_wait ()
#3  0x000056211514efa3 in img_convert ()
#4  0x00005621151483a9 in main ()
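
The loop above has to be watched by hand until it stops advancing. A minimal sketch of a variant that stops on its own once a single run exceeds a time limit is shown here. run_until_hang is a hypothetical helper name, and the iteration cap and time limit in the usage line are assumed values, not figures from this report. It relies on GNU timeout, which terminates the stuck process, so the original loop is still the one to use when a live process is needed for pstack.

```shell
# Sketch only: run_until_hang is a hypothetical helper, not part of qemu-img.
# Usage: run_until_hang MAX_ITERATIONS TIMEOUT_SECONDS COMMAND [ARGS...]
# Repeats COMMAND until one run exceeds TIMEOUT_SECONDS.
# Prints the iteration that timed out and returns 1, or returns 0 if
# every iteration finished within the limit.
run_until_hang() {
    local max=$1 limit=$2 i=1 rc
    shift 2
    while [ "$i" -le "$max" ]; do
        # GNU timeout exits with status 124 when it had to stop the command.
        timeout "$limit" "$@"
        rc=$?
        if [ "$rc" -eq 124 ]; then
            echo "timed out on iteration $i"
            return 1
        fi
        i=$((i + 1))
    done
    return 0
}
```

Example use with the reproducer (5000 iterations and a 60 second limit are assumptions): run_until_hang 5000 60 qemu-img convert -f qcow2 -O raw cirros-0.4.0-x86_64-disk.img cirros-0.4.0-x86_64-disk.raw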

Comment 12 David Hill 2017-12-21 16:26:48 UTC
Hi guys,

    We're hitting this exact same problem with:

qemu-img convert -f qcow2 -O qcow2 /var/lib/nova/instances/99bea639-a7b4-43b9-a83a-37fdf5388eda/disk /var/lib/nova/instances/46ff755f1780407bbce1939b6971730c.test 

If we attach gdb to the process, we see the following:

(gdb) bt
#0  0x00007fe62138aaff in ppoll () from /lib64/libc.so.6
#1  0x0000558b9315b58b in qemu_poll_ns ()
#2  0x0000558b9315c378 in main_loop_wait ()
#3  0x0000558b930a5fa3 in img_convert ()
#4  0x0000558b9309f3a9 in main ()
(gdb) 

If we run it under strace ("strace -fffff qemu-img convert -f qcow2 -O qcow2 /var/lib/nova/instances/99bea639-a7b4-43b9-a83a-37fdf5388eda/disk /var/lib/nova/instances/46ff755f1780407bbce1939b6971730c.test > /root/strace.out 2>&1 &"), it completes successfully.

Dave
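
For anyone who needs to find stuck conversions on a compute node before attaching gdb or pstack, a rough sketch follows. stale_converts is a hypothetical helper, and the 600 second threshold in the usage line is an assumed value. It only filters text in the "PID ELAPSED_SECONDS COMMAND" layout, and it assumes a ps that supports the etimes column.

```shell
# Sketch only: stale_converts is a hypothetical helper.
# Usage: stale_converts MIN_SECONDS
# Reads "PID ELAPSED_SECONDS COMMAND..." lines on stdin and prints the PIDs
# of qemu-img convert processes that have run for at least MIN_SECONDS.
stale_converts() {
    awk -v min="$1" \
        '$2 >= min && $3 ~ /qemu-img$/ && $4 == "convert" { print $1 }'
}
```

Example use: ps -eo pid=,etimes=,args= | stale_converts 600
Each PID it prints can then be inspected with pstack or "gdb -p PID -batch -ex bt", as in the comments above. A long-running conversion of a large image is not necessarily hung, so the backtrace still has to be checked.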

Comment 13 Kevin Wolf 2017-12-21 18:28:07 UTC
After I had a chance to look at a core dump from David's customer, this seems to be a problem that we already have a fix for in qemu-kvm-rhev-2.9.0-16.el7_4.12.

What led me to this conclusion is that we have a single active coroutine in convert_do_copy(), and the only request in it is stuck while we have a ThreadPoolElement that is already in the THREAD_DONE state, but still in the list of thread pool requests. This means that the worker function has completed, but the callback never arrived. This is the same pattern as seen in bug 1513362.


(gdb) p *s
$1 = {src = 0x561db9196050, src_sectors = 0x561db9196060, src_num = 1, total_sectors = 429916160, allocated_sectors = 47303384, allocated_done = 30076272, sector_num = 153985792, 
  wr_offs = 153984768, status = BLK_DATA, sector_next_status = 153985792, target = 0x561db91ea3c0, has_zero_init = true, compressed = false, target_has_backing = false, 
  wr_in_order = true, min_sparse = 8, cluster_sectors = 128, buf_sectors = 4096, num_coroutines = 8, running_coroutines = 8, co = {0x561db996ab40, 0x561db996ac80, 0x561db996adc0, 
    0x561db996af00, 0x561db996b040, 0x561db996b180, 0x561db996b2c0, 0x561db996b400, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, wait_sector_num = {153985152, 153985664, 153985408, 153984896, 
    153985024, 153985536, 153985280, -1, 0, 0, 0, 0, 0, 0, 0, 0}, lock = {locked = 0, ctx = 0x0, from_push = {slh_first = 0x0}, to_pop = {slh_first = 0x0}, handoff = 0, sequence = 0, 
    holder = 0x0}, ret = -115}


(gdb) p *s.target.root.bs.aio_context.thread_pool.head.lh_first
$13 = {common = {aiocb_info = 0x561db7bd9750 <thread_pool_aiocb_info>, bs = 0x0, cb = 0x561db7950ad0 <thread_pool_co_cb>, opaque = 0x7fadd4697910, refcnt = 1}, pool = 0x561db9228000, 
  func = 0x561db78df850 <aio_worker>, arg = 0x561db9172f00, state = THREAD_DONE, ret = 0, reqs = {tqe_next = 0x0, tqe_prev = 0x0}, all = {le_next = 0x0, le_prev = 0x561db9228098}}
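
To check whether a host already runs a build at or after the fixed one named above (qemu-kvm-rhev-2.9.0-16.el7_4.12), something like the following sketch could be used. version_at_least is a hypothetical helper. It relies on GNU sort -V, which only approximates RPM's own version comparison, so treat the result as a hint rather than an authoritative answer.

```shell
# Sketch only: version_at_least is a hypothetical helper.
# Usage: version_at_least HAVE WANT
# Succeeds if HAVE sorts at or after WANT under GNU "sort -V".
# Note: sort -V approximates, but is not identical to, RPM version ordering.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

Example use: version_at_least "$(rpm -q --qf '%{VERSION}-%{RELEASE}' qemu-kvm-rhev)" 2.9.0-16.el7_4.12 && echo "fixed build or later"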

*** This bug has been marked as a duplicate of bug 1513362 ***

Comment 14 shivapriya.o.hiremath 2018-02-02 21:57:55 UTC
We are facing the same issue in an OSP 10 deployment, where the spawning of a huge VM gets stuck. We would like to know how to obtain a custom build with the patch mentioned in this Bugzilla.

We have downloaded a source RPM (.src.rpm) from http://ftp.redhat.com/pub/redhat/linux/enterprise/7Server/en/RHOS/SRPMS/, specifically qemu-kvm-rhev-2.9.0-16.el7_4.13.src.rpm.

Since this is a source RPM, we have yet to build the binary RPM from it. We followed the steps at https://wiki.centos.org/HowTos/RebuildSRPM on how to rebuild source RPMs, including installing dependencies such as gcc and kernel-headers, but there are a large number of dependencies.

We have used the 'yum-builddep <src rpm>' command to install some of the dependencies, but other packages are still not available. These are the following:
•	bluez-libs-devel
•	brlapi-devel
•	gperftools-devel
•	libfdt-devel >= 1.4.3
•	libiscsi-devel
•	libseccomp-devel >= 2.3.0
•	libssh2-devel
•	lzo-devel
•	pciutils-devel
•	snappy-devel

Can you guide us on how to add these dependencies on RHEL OSP and let us know if we are missing any repositories?
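
To re-check the dependency list above after enabling additional repositories, a small sketch such as this one can report which packages are still not installed. missing_packages is a hypothetical helper. It only tests whether a package is installed and does not verify the minimum versions (>= 1.4.3, >= 2.3.0) given in the list.

```shell
# Sketch only: missing_packages is a hypothetical helper.
# Usage: missing_packages QUERY_CMD PKG...
# Prints every PKG for which "QUERY_CMD PKG" fails.
# QUERY_CMD is split on whitespace, so "rpm -q" can be passed as one argument.
missing_packages() {
    local query=$1 pkg
    shift
    for pkg in "$@"; do
        $query "$pkg" >/dev/null 2>&1 || echo "$pkg"
    done
}
```

Example use: missing_packages "rpm -q" bluez-libs-devel brlapi-devel gperftools-devel libfdt-devel libiscsi-devel libseccomp-devel libssh2-devel lzo-devel pciutils-devel snappy-devel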