Bug 1508886

Summary: QEMU's AIO subsystem gets stuck, inhibiting all I/O operations on virtio-blk-pci devices

Product: Red Hat Enterprise Linux 7
Component: qemu-kvm-rhev
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Sergio Lopez <slopezpa>
Assignee: Stefan Hajnoczi <stefanha>
QA Contact: aihua liang <aliang>
CC: aliang, chayang, coli, dzheng, junli, juzhang, knoel, ljozsa, lolyu, michen, mrezanin, mschuppe, mtessun, ngu, pbonzini, pingl, qzhang, rhodain, stefanha, virt-maint, xuwei, yhong
Target Milestone: rc
Target Release: ---
Keywords: ZStream
Fixed In Version: qemu-kvm-rhev-2.10.0-7.el7
Clones: 1513362 (view as bug list)
Bug Blocks: 1513362, 1525497, 1525500, 1525502, 1525504, 1525505, 1525507, 1525510
Type: Bug
Last Closed: 2018-04-11 00:46:47 UTC

Description Sergio Lopez 2017-11-02 12:54:30 UTC
From the guest's perspective, when this bug appears, all I/O operations get stuck indefinitely, waiting for completion in the corresponding request queue.

From QEMU's perspective, we can make the following observations:

 - We have 7 threads: 4 backing vCPUs, 1 main loop, 1 RCU worker and 1 VNC worker. Even though we're running with 'aio=threads' and there are pending I/O operations, there are no worker threads:

Thread 7 (Thread 0x7f3b61237d00 (LWP 255248)):
#0  0x00007f3b487c2aff in __GI_ppoll (fds=0x55eb83bf5300, nfds=8, timeout=<optimized out>, 
    timeout@entry=0x7ffd7f4bae50, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:56
#1  0x000055eb81f003d9 in ppoll (__ss=0x0, __timeout=0x7ffd7f4bae50, __nfds=<optimized out>, 
    __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=999000000)
    at util/qemu-timer.c:334
#3  0x000055eb81f011e8 in os_host_main_loop_wait (timeout=999000000) at util/main-loop.c:255
#4  main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:517
#5  0x000055eb81bf2c2c in main_loop () at vl.c:1909
#6  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4733

Thread 6 (Thread 0x7f393bbff700 (LWP 255281)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000055eb81f040f9 in qemu_cond_wait (cond=cond@entry=0x55eb83bf4c00, 
    mutex=mutex@entry=0x55eb83bf4c30) at util/qemu-thread-posix.c:133
#2  0x000055eb81e423fb in vnc_worker_thread_loop (queue=queue@entry=0x55eb83bf4c00) at ui/vnc-jobs.c:205
#3  0x000055eb81e42938 in vnc_worker_thread (arg=0x55eb83bf4c00) at ui/vnc-jobs.c:312
#4  0x00007f3b48a9fe25 in start_thread (arg=0x7f393bbff700) at pthread_create.c:308
#5  0x00007f3b487cd34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 5 (Thread 0x7f3b3de6c700 (LWP 255279)):
#0  0x00007f3b487c4107 in ioctl () at ../sysdeps/unix/syscall-template.S:81
#1  0x000055eb81c42fa5 in kvm_vcpu_ioctl (cpu=cpu@entry=0x55eb84116000, type=type@entry=44672)
    at /usr/src/debug/qemu-2.9.0/kvm-all.c:2178
#2  0x000055eb81c4306b in kvm_cpu_exec (cpu=cpu@entry=0x55eb84116000)
    at /usr/src/debug/qemu-2.9.0/kvm-all.c:2016
#3  0x000055eb81c30432 in qemu_kvm_cpu_thread_fn (arg=0x55eb84116000)
    at /usr/src/debug/qemu-2.9.0/cpus.c:1118
#4  0x00007f3b48a9fe25 in start_thread (arg=0x7f3b3de6c700) at pthread_create.c:308
#5  0x00007f3b487cd34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 4 (Thread 0x7f3b3e66d700 (LWP 255278)):
#0  0x00007f3b487c4107 in ioctl () at ../sysdeps/unix/syscall-template.S:81
#1  0x000055eb81c42fa5 in kvm_vcpu_ioctl (cpu=cpu@entry=0x55eb840f6000, type=type@entry=44672)
    at /usr/src/debug/qemu-2.9.0/kvm-all.c:2178
#2  0x000055eb81c4306b in kvm_cpu_exec (cpu=cpu@entry=0x55eb840f6000)
    at /usr/src/debug/qemu-2.9.0/kvm-all.c:2016
#3  0x000055eb81c30432 in qemu_kvm_cpu_thread_fn (arg=0x55eb840f6000)
    at /usr/src/debug/qemu-2.9.0/cpus.c:1118
#4  0x00007f3b48a9fe25 in start_thread (arg=0x7f3b3e66d700) at pthread_create.c:308
#5  0x00007f3b487cd34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 3 (Thread 0x7f3b3ee6e700 (LWP 255277)):
#0  0x00007f3b487c4107 in ioctl () at ../sysdeps/unix/syscall-template.S:81
#1  0x000055eb81c42fa5 in kvm_vcpu_ioctl (cpu=cpu@entry=0x55eb840d6000, type=type@entry=44672)
    at /usr/src/debug/qemu-2.9.0/kvm-all.c:2178
#2  0x000055eb81c4306b in kvm_cpu_exec (cpu=cpu@entry=0x55eb840d6000)
    at /usr/src/debug/qemu-2.9.0/kvm-all.c:2016
#3  0x000055eb81c30432 in qemu_kvm_cpu_thread_fn (arg=0x55eb840d6000)
    at /usr/src/debug/qemu-2.9.0/cpus.c:1118
#4  0x00007f3b48a9fe25 in start_thread (arg=0x7f3b3ee6e700) at pthread_create.c:308
#5  0x00007f3b487cd34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 2 (Thread 0x7f3b3f66f700 (LWP 255276)):
#0  0x00007f3b487c4107 in ioctl () at ../sysdeps/unix/syscall-template.S:81
#1  0x000055eb81c42fa5 in kvm_vcpu_ioctl (cpu=cpu@entry=0x55eb8405c000, type=type@entry=44672)
    at /usr/src/debug/qemu-2.9.0/kvm-all.c:2178
#2  0x000055eb81c4306b in kvm_cpu_exec (cpu=cpu@entry=0x55eb8405c000)
    at /usr/src/debug/qemu-2.9.0/kvm-all.c:2016
#3  0x000055eb81c30432 in qemu_kvm_cpu_thread_fn (arg=0x55eb8405c000)
    at /usr/src/debug/qemu-2.9.0/cpus.c:1118
#4  0x00007f3b48a9fe25 in start_thread (arg=0x7f3b3f66f700) at pthread_create.c:308
#5  0x00007f3b487cd34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 1 (Thread 0x7f3b40e8a700 (LWP 255272)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x000055eb81f04406 in qemu_futex_wait (val=<optimized out>, f=<optimized out>)
    at /usr/src/debug/qemu-2.9.0/include/qemu/futex.h:26
#2  qemu_event_wait (ev=ev@entry=0x55eb82917784 <rcu_call_ready_event>) at util/qemu-thread-posix.c:399
#3  0x000055eb81f13a2e in call_rcu_thread (opaque=<optimized out>) at util/rcu.c:249
#4  0x00007f3b48a9fe25 in start_thread (arg=0x7f3b40e8a700) at pthread_create.c:308
#5  0x00007f3b487cd34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113


 - In the device's VQ, we have 128 slots, and all of them are in use:

(gdb) p ((VirtIODevice *)0x55eb85adc510)->vq[0]
$126 = {
  vring = {
    num = 128,                    <-- 128 slots in total
    num_default = 128, 
    align = 4096, 
    desc = 907452416, 
    avail = 907454464, 
    used = 907454784, 
    caches = 0x55eb86ad8d80
  }, 
  last_avail_idx = 43081, 
  shadow_avail_idx = 43081, 
  used_idx = 42953, 
  signalled_used = 42953, 
  signalled_used_valid = true, 
  notification = true, 
  queue_index = 0, 
  inuse = 128,                     <-- 128 slots in use
  vector = 1, 
  handle_output = 0x55eb81c5e320 <virtio_blk_handle_output>, 
  handle_aio_output = 0x55eb81c5e440 <virtio_blk_data_plane_handle_output>, 
  vdev = 0x55eb85adc510, 
  guest_notifier = {
    rfd = 30, 
    wfd = 30
  }, 
  host_notifier = {
    rfd = 31, 
    wfd = 31
  }, 
  node = {
    le_next = 0x0, 
    le_prev = 0x55eb85245e48
  }
}


 - We have 383 "in flight" requests in the QCOW2 layer, and 1 in the file layer:

(gdb) p ((VirtIOBlock *)0x557ea7ae8510)->blk->root->bs->in_flight
$162 = 383
(gdb) p ((VirtIOBlock *)0x557ea7ae8510)->blk->root->bs->file->bs->in_flight
$163 = 1


 - The coroutine servicing the first request from the QCOW2 layer is waiting on a lock held by coroutine 0x557ea751ad00.


(gdb) bt
#0  qemu_coroutine_switch (from_=<optimized out>, to_=0x7fc2d134dbd8,
    action=action@entry=COROUTINE_YIELD) at util/coroutine-ucontext.c:176
#1  0x0000557ea4493e6c in qemu_coroutine_yield () at util/qemu-coroutine.c:167
#2  0x0000557ea449422a in qemu_co_mutex_lock_slowpath (mutex=0x557ea4e97784 <rcu_call_ready_event>,
    ctx=0x0) at util/qemu-coroutine-lock.c:246
#3  qemu_co_mutex_lock (mutex=0x557ea4e97784 <rcu_call_ready_event>, mutex@entry=0x557ea5c06250)
    at util/qemu-coroutine-lock.c:284
#4  0x0000557ea43e7765 in qcow2_co_pwritev (bs=0x557ea5ccc000, offset=0, bytes=12288,
    qiov=0x557ea5c12c90, flags=<optimized out>) at block/qcow2.c:1597
#5  0x0000557ea4410811 in bdrv_driver_pwritev (bs=bs@entry=0x557ea5ccc000,
    offset=offset@entry=31570137088, bytes=bytes@entry=12288, qiov=qiov@entry=0x557ea5c12c90,
    flags=flags@entry=0) at block/io.c:888
#6  0x0000557ea4411bc6 in bdrv_aligned_pwritev (req=req@entry=0x7fc0a59c7ec0,
    offset=offset@entry=31570137088, bytes=bytes@entry=12288, align=align@entry=1, qiov=0x557ea5c12c90,
    flags=flags@entry=0, child=0x557ea5bcf9f0) at block/io.c:1396
#7  0x0000557ea4412702 in bdrv_co_pwritev (child=0x557ea5bcf9f0, offset=offset@entry=31570137088,
    bytes=bytes@entry=12288, qiov=qiov@entry=0x557ea5c12c90, flags=0) at block/io.c:1647
#8  0x0000557ea440404b in blk_co_pwritev (blk=0x557ea5c16000, offset=31570137088, bytes=12288,
    qiov=0x557ea5c12c90, flags=<optimized out>) at block/block-backend.c:995
#9  0x0000557ea44040da in blk_aio_write_entry (opaque=0x557ea7ce23f0) at block/block-backend.c:1186
#10 0x0000557ea4494c1a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>)
    at util/coroutine-ucontext.c:79
#11 0x00007fc2b8831d40 in ?? () from /lib64/libc.so.6
#12 0x00007fff432fb9c0 in ?? ()
#13 0x0000000000000000 in ?? ()

(gdb) p *(CoMutex *)0x557ea5c06250
$174 = {locked = 80, ctx = 0x557ea5c05700, from_push = {slh_first = 0x7fc09d3aaee0}, to_pop = {
    slh_first = 0x7fc0a4ac6b50}, handoff = 0, sequence = 0, holder = 0x557ea751ad00}


 - Coroutine 0x557ea751ad00, which holds the lock mentioned above, is servicing the only request in the file layer, and is waiting for its completion:

(gdb) bt
#0  qemu_coroutine_switch (from_=<optimized out>, to_=0x7fc2d134dbd8,
    action=action@entry=COROUTINE_YIELD) at util/coroutine-ucontext.c:176
#1  0x0000557ea4493e6c in qemu_coroutine_yield () at util/qemu-coroutine.c:167
#2  0x0000557ea447fd78 in thread_pool_submit_co (pool=0x557ea4e97784 <rcu_call_ready_event>, func=0x0,
    arg=0x0) at util/thread-pool.c:290
#3  0x0000557ea440ee90 in bdrv_driver_preadv (bs=bs@entry=0x557ea5ccf400, offset=offset@entry=25886720,
    bytes=bytes@entry=65536, qiov=0x7fc0a5ccaa70, qiov@entry=0x7fc2b0fa0700, flags=0) at block/io.c:847
#4  0x0000557ea441214d in bdrv_aligned_preadv (child=child@entry=0x557ea5bcfae0,
    req=req@entry=0x7fc0a5cca950, offset=offset@entry=25886720, bytes=bytes@entry=65536,
    align=140474169559488, align@entry=512, qiov=0x7fc2b0fa0700, qiov@entry=0x7fc0a5ccaa70,
    flags=flags@entry=0) at block/io.c:1096
#5  0x0000557ea4412557 in bdrv_co_preadv (child=<optimized out>, offset=25886720, bytes=65536,
    qiov=0x7fc0a5ccaa70, flags=0) at block/io.c:1190
#6  0x0000557ea4412e25 in bdrv_rw_co_entry (opaque=0x7fc0a5ccaa10) at block/io.c:604
#7  0x0000557ea4412f58 in bdrv_prwv_co (child=child@entry=0x557ea5bcfae0, offset=offset@entry=25886720,
    qiov=qiov@entry=0x7fc0a5ccaa70, is_write=is_write@entry=false, flags=flags@entry=0) at block/io.c:633
#8  0x0000557ea4413256 in bdrv_preadv (qiov=0x7fc0a5ccaa70, offset=25886720, child=0x557ea5bcfae0)
    at block/io.c:747
#9  bdrv_pread (child=0x557ea5bcfae0, offset=offset@entry=25886720, buf=<optimized out>,
    bytes=<optimized out>) at block/io.c:768
#10 0x0000557ea43f5d2f in qcow2_cache_do_get (bs=bs@entry=0x557ea5ccc000, c=0x557ea5c79b60,
    offset=offset@entry=25886720, table=table@entry=0x7fc0a5ccab48,
    read_from_disk=read_from_disk@entry=true) at block/qcow2-cache.c:360
#11 0x0000557ea43f5eab in qcow2_cache_get (bs=bs@entry=0x557ea5ccc000, c=<optimized out>,
    offset=offset@entry=25886720, table=table@entry=0x7fc0a5ccab48) at block/qcow2-cache.c:384
#12 0x0000557ea43f269a in l2_load (l2_table=0x7fc0a5ccab48, l2_offset=25886720, bs=0x557ea5ccc000)
    at block/qcow2-cluster.c:161
#13 get_cluster_table (bs=bs@entry=0x557ea5ccc000, offset=offset@entry=37855051776,
    new_l2_table=new_l2_table@entry=0x7fc0a5ccac10, new_l2_index=new_l2_index@entry=0x7fc0a5ccac0c)
    at block/qcow2-cluster.c:656
#14 0x0000557ea43f343f in handle_copied (m=0x7fc0a5ccacc0, bytes=<synthetic pointer>,
    host_offset=<synthetic pointer>, guest_offset=37855051776, bs=0x557ea5ccc000)
    at block/qcow2-cluster.c:999
#15 qcow2_alloc_cluster_offset (bs=bs@entry=0x557ea5ccc000, offset=offset@entry=37855051776,
    bytes=bytes@entry=0x7fc0a5ccacb4, host_offset=host_offset@entry=0x7fc0a5ccacb8,
    m=m@entry=0x7fc0a5ccacc0) at block/qcow2-cluster.c:1348
#16 0x0000557ea43e7808 in qcow2_co_pwritev (bs=0x557ea5ccc000, offset=37855051776, bytes=16384,
    qiov=0x557ea5c14b80, flags=<optimized out>) at block/qcow2.c:1612
#17 0x0000557ea4410811 in bdrv_driver_pwritev (bs=bs@entry=0x557ea5ccc000,
    offset=offset@entry=37855051776, bytes=bytes@entry=16384, qiov=qiov@entry=0x557ea5c14b80,
    flags=flags@entry=0) at block/io.c:888
#18 0x0000557ea4411bc6 in bdrv_aligned_pwritev (req=req@entry=0x7fc0a5ccaec0,
    offset=offset@entry=37855051776, bytes=bytes@entry=16384, align=align@entry=1, qiov=0x557ea5c14b80,
    flags=flags@entry=0, child=0x557ea5bcf9f0) at block/io.c:1396
#19 0x0000557ea4412702 in bdrv_co_pwritev (child=0x557ea5bcf9f0, offset=offset@entry=37855051776,
    bytes=bytes@entry=16384, qiov=qiov@entry=0x557ea5c14b80, flags=0) at block/io.c:1647
#20 0x0000557ea440404b in blk_co_pwritev (blk=0x557ea5c16000, offset=37855051776, bytes=16384,
    qiov=0x557ea5c14b80, flags=<optimized out>) at block/block-backend.c:995
#21 0x0000557ea44040da in blk_aio_write_entry (opaque=0x557ea7ce3de0) at block/block-backend.c:1186
#22 0x0000557ea4494c1a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>)
    at util/coroutine-ucontext.c:79
#23 0x00007fc2b8831d40 in ?? () from /lib64/libc.so.6
#24 0x00007fff432fb9c0 in ?? ()
#25 0x0000000000000000 in ?? ()


 - In the ThreadPool from the AIOContext corresponding to the file layer, we have one element in head, and request_list is empty:

(gdb) p *((struct ThreadPoolReal *)0x557ea5c844e0)
$128 = {ctx = 0x557ea5c05700, completion_bh = 0x557ea5c79b00, lock = {lock = {__data = {__lock = 0, 
        __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {
          __prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, 
  worker_stopped = {cond = {__data = {{__wseq = 0, __wseq32 = {__low = 0, __high = 0}}, {__g1_start = 0, 
          __g1_start32 = {__low = 0, __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0}, 
        __g1_orig_size = 0, __wrefs = 0, __g_signals = {0, 0}}, __size = '\000' <repeats 47 times>, 
      __align = 0}}, sem = {sem = {__size = '\000' <repeats 31 times>, __align = 0}}, max_threads = 64, 
  new_thread_bh = 0x557ea5c79b30, head = {lh_first = 0x557ea5bdf570}, request_list = {tqh_first = 0x0, 
    tqh_last = 0x557ea5c84580}, cur_threads = 0, idle_threads = 0, new_threads = 0, pending_threads = 0, 
  stopping = false}


 - The only ThreadPoolElement in head is in state THREAD_DONE, with ret == 0:

(gdb) p *((struct ThreadPoolReal *)0x557ea5c844e0)->head->lh_first
$160 = {common = {aiocb_info = 0x557ea49e7570 <thread_pool_aiocb_info>, bs = 0x0, 
    cb = 0x557ea447f940 <thread_pool_co_cb>, opaque = 0x7fc0a5cca7b0, refcnt = 1}, 
  pool = 0x557ea5c844e0, func = 0x557ea4409cd0 <aio_worker>, arg = 0x557ea853aac0, state = THREAD_DONE, 
  ret = 0, reqs = {tqe_next = 0x0, tqe_prev = 0x0}, all = {le_next = 0x0, le_prev = 0x557ea5c84578}}


 - The ThreadPoolCo corresponding to the above ThreadPoolElement is still -EINPROGRESS (-115):

(gdb) p *(ThreadPoolCo *) 0x7fc0a5cca7b0
$178 = {co = 0x557ea751ad00, ret = -115}


 - ThreadPool's completion BH is not scheduled:

(gdb) p *((struct ThreadPoolReal *)0x557ea5c844e0)->completion_bh
$156 = {ctx = 0x557ea5c05700, cb = 0x557ea447fb50 <thread_pool_completion_bh>, opaque = 0x557ea5c844e0, 
  next = 0x557ea5c79620, scheduled = false, idle = false, deleted = false}


########################################################
########################################################


The fact that ThreadPoolCo.ret is still -EINPROGRESS indicates that its callback (thread_pool_co_cb) has not yet been called. The element is still on the ThreadPool's head list, but the completion BH is not scheduled, and all worker threads appear to have exited after their idle timeout, so the system can't make any progress.

I suspect there may be a concurrency issue between worker_thread and thread_pool_completion_bh:

<snip from util/thread-pool.c>
 77 static void *worker_thread(void *opaque)
 78 {
 79     ThreadPool *pool = opaque;
 80 
 81     qemu_mutex_lock(&pool->lock);
 82     pool->pending_threads--;
 83     do_spawn_thread(pool);
 84 
 85     while (!pool->stopping) {
 86         ThreadPoolElement *req;
 87         int ret;
 88 
 89         do {
 90             pool->idle_threads++;
 91             qemu_mutex_unlock(&pool->lock);
 92             ret = qemu_sem_timedwait(&pool->sem, 10000);
 93             qemu_mutex_lock(&pool->lock);
 94             pool->idle_threads--;
 95         } while (ret == -1 && !QTAILQ_EMPTY(&pool->request_list));
 96         if (ret == -1 || pool->stopping) {
 97             break;
 98         }
 99 
100         req = QTAILQ_FIRST(&pool->request_list);
101         QTAILQ_REMOVE(&pool->request_list, req, reqs);
102         req->state = THREAD_ACTIVE;
103         qemu_mutex_unlock(&pool->lock);
104 
105         ret = req->func(req->arg);
106 
107         req->ret = ret;
108         /* Write ret before state.  */
109         smp_wmb();
110         req->state = THREAD_DONE;
111 
112         qemu_mutex_lock(&pool->lock);
113 
114         qemu_bh_schedule(pool->completion_bh);
115     }
116 
117     pool->cur_threads--;
118     qemu_cond_signal(&pool->worker_stopped);
119     qemu_mutex_unlock(&pool->lock);
120     return NULL;
121 }
</snip>

<snip from util/thread-pool.c>
163 static void thread_pool_completion_bh(void *opaque)
164 {
165     ThreadPool *pool = opaque;
166     ThreadPoolElement *elem, *next;
167 
168     aio_context_acquire(pool->ctx);
169 restart:
170     QLIST_FOREACH_SAFE(elem, &pool->head, all, next) {
171         if (elem->state != THREAD_DONE) {
172             continue;
173         }
174 
175         trace_thread_pool_complete(pool, elem, elem->common.opaque,
176                                    elem->ret);
177         QLIST_REMOVE(elem, all);
178 
179         if (elem->common.cb) {
180             /* Read state before ret.  */
181             smp_rmb();
182 
183             /* Schedule ourselves in case elem->common.cb() calls aio_poll() to
184              * wait for another request that completed at the same time.
185              */
186             qemu_bh_schedule(pool->completion_bh);
187 
188             aio_context_release(pool->ctx);
189             elem->common.cb(elem->common.opaque, elem->ret);
190             aio_context_acquire(pool->ctx);
191 
192             /* We can safely cancel the completion_bh here regardless of someone
193              * else having scheduled it meanwhile because we reenter the
194              * completion function anyway (goto restart).
195              */
196             qemu_bh_cancel(pool->completion_bh);
197 
198             qemu_aio_unref(elem);
199             goto restart;
200         } else {
201             qemu_aio_unref(elem);
202         }
203     }
204     aio_context_release(pool->ctx);
205 }
</snip>

The BH pool->completion_bh is scheduled at worker_thread:114, after executing the request's function. At thread_pool_completion_bh:196 this BH is cancelled, which is supposedly safe because the function immediately goes back to the start of the loop.

But I wonder whether it's possible that the elem->state comparison at thread_pool_completion_bh:171 still sees a stale THREAD_QUEUED value, even after it has been set to THREAD_DONE at worker_thread:110.

If this is indeed possible, and there are no more requests in the queue, then thread_pool_completion_bh would cancel the execution of the completion_bh, leaving the last request waiting forever, with its coroutine perpetually yielded.

Comment 5 Sergio Lopez 2017-11-03 07:50:36 UTC
To test the worker_thread vs. thread_pool_completion_bh concurrency issue hypothesis, I wrote this small patch:

From 057f9ed07937886c4ac3f8b8a408e7597f3da93d Mon Sep 17 00:00:00 2001
From: Sergio Lopez <slp>
Date: Fri, 3 Nov 2017 07:57:44 +0100
Subject: [PATCH] util/thread-pool: use atomic operations for TPE->state

---
 util/thread-pool.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/util/thread-pool.c b/util/thread-pool.c
index 610646d..a264c12 100644
--- a/util/thread-pool.c
+++ b/util/thread-pool.c
@@ -107,7 +107,7 @@ static void *worker_thread(void *opaque)
         req->ret = ret;
         /* Write ret before state.  */
         smp_wmb();
-        req->state = THREAD_DONE;
+        atomic_set(&req->state, THREAD_DONE);
 
         qemu_mutex_lock(&pool->lock);
 
@@ -168,7 +168,7 @@ static void thread_pool_completion_bh(void *opaque)
     aio_context_acquire(pool->ctx);
 restart:
     QLIST_FOREACH_SAFE(elem, &pool->head, all, next) {
-        if (elem->state != THREAD_DONE) {
+        if (atomic_read(&elem->state) != THREAD_DONE) {
             continue;
         }
 
-- 
2.13.5

Comment 9 Sergio Lopez 2017-11-06 10:55:45 UTC
Apparently, the patch in comment #5 didn't help. We have a new coredump from QEMU, with the same symptoms we've seen before.

Looking at each coroutine backtrace, we can see a total of 130 coroutines: 51 waiting for another operation on the same cluster (qemu_co_queue_wait), and 78 waiting for a mutex (qemu_co_mutex_lock_slowpath) held by this one:

#0  qemu_coroutine_switch (from_=<optimized out>, to_=0x7f1f311dfbd8, action=action@entry=COROUTINE_YIELD) at util/coroutine-ucontext.c:176
#1  0x0000560f395e333c in qemu_coroutine_yield () at util/qemu-coroutine.c:167
#2  0x0000560f395ced78 in thread_pool_submit_co (pool=0x560f39fe6784 <rcu_call_ready_event>, func=0x0, arg=0x0) at util/thread-pool.c:290
#3  0x0000560f3955de90 in bdrv_driver_preadv (bs=bs@entry=0x560f3c441400, offset=offset@entry=4128768, bytes=bytes@entry=65536, qiov=0x7f1cf1045940, qiov@entry=0x7f1f10e31700, flags=0) at block/io.c:847
#4  0x0000560f3956114d in bdrv_aligned_preadv (child=child@entry=0x560f3c341ae0, req=req@entry=0x7f1cf1045820, offset=offset@entry=4128768, bytes=bytes@entry=65536, align=139771404032448, align@entry=512, qiov=0x7f1f10e31700, qiov@entry=0x7f1cf1045940, flags=flags@entry=0) at block/io.c:1096
#5  0x0000560f39561557 in bdrv_co_preadv (child=<optimized out>, offset=4128768, bytes=65536, qiov=0x7f1cf1045940, flags=0) at block/io.c:1190
#6  0x0000560f39561e25 in bdrv_rw_co_entry (opaque=0x7f1cf10458e0) at block/io.c:604
#7  0x0000560f39561f58 in bdrv_prwv_co (child=child@entry=0x560f3c341ae0, offset=offset@entry=4128768, qiov=qiov@entry=0x7f1cf1045940, is_write=is_write@entry=false, flags=flags@entry=0) at block/io.c:633
#8  0x0000560f39562256 in bdrv_preadv (qiov=0x7f1cf1045940, offset=4128768, child=0x560f3c341ae0) at block/io.c:747
#9  bdrv_pread (child=0x560f3c341ae0, offset=offset@entry=4128768, buf=<optimized out>, bytes=<optimized out>) at block/io.c:768
#10 0x0000560f39544d2f in qcow2_cache_do_get (bs=bs@entry=0x560f3c43e000, c=0x560f3c3ebb60, offset=offset@entry=4128768, table=table@entry=0x7f1cf1045a10, read_from_disk=read_from_disk@entry=true) at block/qcow2-cache.c:360
#11 0x0000560f39544eab in qcow2_cache_get (bs=bs@entry=0x560f3c43e000, c=<optimized out>, offset=offset@entry=4128768, table=table@entry=0x7f1cf1045a10) at block/qcow2-cache.c:384
#12 0x0000560f39541c26 in l2_load (l2_table=0x7f1cf1045a10, l2_offset=4128768, bs=0x560f3c43e000) at block/qcow2-cluster.c:161
#13 qcow2_get_cluster_offset (bs=bs@entry=0x560f3c43e000, offset=offset@entry=37883887616, bytes=bytes@entry=0x7f1cf1045ab4, cluster_offset=cluster_offset@entry=0x7f1cf1045ab8) at block/qcow2-cluster.c:531
#14 0x0000560f395378a5 in qcow2_co_preadv (bs=0x560f3c43e000, offset=37883887616, bytes=49152, qiov=0x7f1cf1045b90, flags=<optimized out>) at block/qcow2.c:1444
#15 0x0000560f39541944 in do_perform_cow (bytes=49152, offset_in_cluster=16384, cluster_offset=<optimized out>, src_cluster_offset=<optimized out>, bs=0x560f3c43e000) at block/qcow2-cluster.c:424
#16 perform_cow (bs=bs@entry=0x560f3c43e000, r=r@entry=0x560f3e80e498, m=0x560f3e80e460, m=0x560f3e80e460) at block/qcow2-cluster.c:754
#17 0x0000560f39542067 in qcow2_alloc_cluster_link_l2 (bs=bs@entry=0x560f3c43e000, m=0x560f3e80e460) at block/qcow2-cluster.c:793
#18 0x0000560f395369e0 in qcow2_co_pwritev (bs=0x560f3c43e000, offset=37883871232, bytes=16384, qiov=0x560f3dcde040, flags=<optimized out>) at block/qcow2.c:1674
#19 0x0000560f3955f811 in bdrv_driver_pwritev (bs=bs@entry=0x560f3c43e000, offset=offset@entry=37883871232, bytes=bytes@entry=16384, qiov=qiov@entry=0x560f3dcde040, flags=flags@entry=0) at block/io.c:888
#20 0x0000560f39560bc6 in bdrv_aligned_pwritev (req=req@entry=0x7f1cf1045ec0, offset=offset@entry=37883871232, bytes=bytes@entry=16384, align=align@entry=1, qiov=0x560f3dcde040, flags=flags@entry=0, child=0x560f3c3419f0) at block/io.c:1396
#21 0x0000560f39561702 in bdrv_co_pwritev (child=0x560f3c3419f0, offset=offset@entry=37883871232, bytes=bytes@entry=16384, qiov=qiov@entry=0x560f3dcde040, flags=0) at block/io.c:1647
#22 0x0000560f3955304b in blk_co_pwritev (blk=0x560f3c388000, offset=37883871232, bytes=16384, qiov=0x560f3dcde040, flags=<optimized out>) at block/block-backend.c:995
#23 0x0000560f395530da in blk_aio_write_entry (opaque=0x560f3eee3220) at block/block-backend.c:1186
#24 0x0000560f395e40ea in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:79
#25 0x00007f1f186c2d40 in ?? () from /lib64/libc.so.6
#26 0x00007ffe590d9be0 in ?? ()
#27 0x0000000000000000 in ?? ()

And, once again, the corresponding ThreadPoolElement is marked as THREAD_DONE, and the corresponding completion_bh is not pending execution (scheduled = false):

(gdb) p *((struct ThreadPoolReal *)0x560f3c3f64e0)->head->lh_first
$14 = {common = {aiocb_info = 0x560f39b36570 <thread_pool_aiocb_info>, bs = 0x0, 
    cb = 0x560f395ce940 <thread_pool_co_cb>, opaque = 0x7f1cf1045680, refcnt = 0x1}, 
  pool = 0x560f3c3f64e0, func = 0x560f39558cd0 <aio_worker>, arg = 0x560f3de22bc0, state = THREAD_DONE, 
  ret = 0x0, reqs = {tqe_next = 0x0, tqe_prev = 0x0}, all = {le_next = 0x0, le_prev = 0x560f3c3f6578}}

(gdb) p *((struct ThreadPoolReal *)0x560f3c3f64e0)->completion_bh
$17 = {ctx = 0x560f3c377700, cb = 0x560f395ceb50 <thread_pool_completion_bh>, opaque = 0x560f3c3f64e0, 
  next = 0x560f3c3eb620, scheduled = false, idle = false, deleted = false}

So, basically, everything appears to be waiting for a thread_pool_completion_bh call that never happens.

Looking at the history of util/thread-pool.c, I found this interesting and relatively recent (in downstream terms) commit (b7a745dc33a18377bb4a8dfe54d1df01ea60bf66):

<snip>
 thread-pool: add missing qemu_bh_cancel in completion function

commit 3c80ca1 fixed a deadlock scenario with nested aio_poll invocations.

However, the rescheduling of the completion BH introduced unnecessary spinning
in the main-loop. On very fast file backends this can even lead to the
"WARNING: I/O thread spun for 1000 iterations" message popping up.

Callgrind reports about 3-4% less instructions with this patch running
qemu-img bench on a ramdisk based VMDK file.

Fixes: 3c80ca1
Cc: qemu-stable
Signed-off-by: Peter Lieven <pl>
Signed-off-by: Kevin Wolf <kwolf>
</snip>

While I wasn't able to come up with a concrete scenario in which the completion_bh can get cancelled without the element being processed immediately afterwards in the thread_pool_completion_bh loop, I think it's worth reverting this commit and seeing whether the issue is still reproducible.

Comment 13 Paolo Bonzini 2017-11-06 15:50:50 UTC
Yes, I believe the following scenario is possible:

    worker thread                          I/O thread (completion BH)
 --------------------------------------------------------------------------------
                                           speculatively read req->state
    req->state = THREAD_DONE;
    qemu_bh_schedule(pool->completion_bh)
       bh->scheduled = 1;
                                           qemu_bh_cancel(pool->completion_bh)
                                             bh->scheduled = 0;
                                           if (req->state == THREAD_DONE)
                                             // sees THREAD_QUEUED

Apart from reverting the fix, we could also:

1) add an smp_mb after qemu_bh_cancel

2) change qemu_bh_cancel to use atomic_mb_set(&bh->scheduled, 0)

both of which forbid the reordering.

Comment 14 Paolo Bonzini 2017-11-06 15:51:31 UTC
Reverting the patch, not the fix, of course.

Comment 17 Sergio Lopez 2017-11-07 07:04:51 UTC
Hi,

(In reply to Paolo Bonzini from comment #13)
> Yes, I believe the following scenario is possible:
> 
>     worker thread                          I/O thread (completion BH)
>  --------------------------------------------------------------------------------
>                                            speculatively read req->state
>     req->state = THREAD_DONE;
>     qemu_bh_schedule(pool->completion_bh)
>        bh->scheduled = 1;
>                                            qemu_bh_cancel(pool->completion_bh)
>                                              bh->scheduled = 0;
>                                            if (req->state == THREAD_DONE)
>                                              // sees THREAD_QUEUED
> 
> Apart from reverting the fix, we could also:
> 
> 1) add an smp_mb after qemu_bh_cancel
> 
> 2) change qemu_bh_cancel to use atomic_mb_set(&bh->scheduled, 0)
> 
> both of which forbid the reordering.

After reading this, I was wondering why the patch in comment #5 didn't work. Now I see that I wrongly assumed atomic_read/atomic_set would emit hardware memory barriers (I should have checked the generated code).

So, IIUC (please correct me if I'm wrong), we have an implicit memory barrier in worker_thread due to qemu_bh_schedule -> atomic_xchg -> (on Intel) the xchg instruction, and your proposal is to add another one in thread_pool_completion_bh at or after qemu_bh_cancel.

I think we should prepare another test build. Any preference between options (1) and (2)? Personally, I'm more inclined towards (2), as on Intel and s390x it would be optimized down to a single xchg instruction:

(gdb) disassemble/m qemu_bh_cancel 
Dump of assembler code for function qemu_bh_cancel:
176	{

177	    atomic_mb_set(&bh->scheduled, 0);
   0x0000000000586650 <+0>:	xor    %eax,%eax
   0x0000000000586652 <+2>:	xchg   %al,0x20(%rdi)

178	}
   0x0000000000586655 <+5>:	retq   
   0x0000000000586656:	nopw   %cs:0x0(%rax,%rax,1)

On the other hand, upstream is also affected, so probably the patch should be sent there too.

Comment 18 Paolo Bonzini 2017-11-07 11:32:47 UTC
Yes, I am not sure which of the two would be harder to follow and/or more likely to introduce bugs in the future.  I now think (2) is not just more efficient but also clearer; as Stefan said, it matches the way aio_bh_poll resets bh->scheduled.

Comment 19 Paolo Bonzini 2017-11-07 12:45:33 UTC
> So, IIUC (please correct if I'm wrong), we have an implicit MB at worker_thread 
> due to qemu_bh_schedule->atomic_xchg->(Intel ASM)xchg, and your proposal is to 
> add another one on thread_pool_completion_bh at or after qemu_bh_cancel.

Correct.

Another way to see it is that qemu_bh_schedule is a "release" operation (it is ordered after everything before it), while aio_bh_poll is an "acquire" operation (it is ordered before everything after it).  The only way to order stores before everything that follows (that is, to order store-load and store-store pairs) is a full memory barrier, and that's what aio_bh_poll uses.

Commit b7a745dc33a18377bb4a8dfe54d1df01ea60bf66 used qemu_bh_cancel in such a way that it must behave as an "acquire" operation too; since what must be ordered is the store to bh->scheduled followed by the load of elem->state (a store-load pair), it likewise needs a full memory barrier.

Comment 20 Sergio Lopez 2017-11-07 14:43:07 UTC
Thanks Paolo and Stefan,

I'm going to prepare a test build with this oneliner:

<snip>
commit 8dff3eeaf741722759afb5013b4722fbc45868ae (HEAD -> master)
Author: Sergio Lopez <slp>
Date:   Tue Nov 7 14:30:16 2017 +0100

    util/async: use atomic_mb_set on qemu_bh_cancel
    
    Commit b7a745d added a qemu_bh_cancel call to the completion function,
    as an optimization to prevent it from unnecessarily rescheduling itself.
    
    This completion function is scheduled from worker_thread, after setting
    the state of a ThreadPoolElement to THREAD_DONE.
    
    This was considered to be safe, as the completion function restarts the
    loop just after the call to qemu_bh_cancel. But, under certain access
    patterns and scheduling conditions, the loop may wrongly use a
    pre-fetched elem->state value, reading it as THREAD_QUEUED, and ending
    the completion function without having processed a pending TPE linked at
    pool->head.
    
    In some situations, if there are no other independent requests in the
    same aio context that could eventually trigger the scheduling of the
    completion function, the omitted TPE and all operations pending on it
    will get stuck forever.

diff --git a/util/async.c b/util/async.c
index 355af73..0e1bd87 100644
--- a/util/async.c
+++ b/util/async.c
@@ -174,7 +174,7 @@ void qemu_bh_schedule(QEMUBH *bh)
  */
 void qemu_bh_cancel(QEMUBH *bh)
 {
-    bh->scheduled = 0;
+    atomic_mb_set(&bh->scheduled, 0);
 }
 
 /* This func is async.The bottom half will do the delete action at the finial
</snip>

Comment 22 Stefan Hajnoczi 2017-11-07 15:52:39 UTC
Thanks, I have reviewed it upstream.  Please let me know when the customer has tested it.  It will probably be merged into QEMU 2.11-rc1.

Comment 27 Stefan Hajnoczi 2017-11-13 16:30:13 UTC
Note to QE:

This race condition is hard to reproduce.  Please verify this BZ by code inspection if there is no reproducer available.  The util/async.c:qemu_bh_cancel() function must look like this:

  void qemu_bh_cancel(QEMUBH *bh)
  {
      atomic_mb_set(&bh->scheduled, 0);
  }

Comment 34 Miroslav Rezanina 2017-11-22 15:21:12 UTC
Fix included in qemu-kvm-rhev-2.10.0-7.el7

Comment 36 Kevin Wolf 2017-11-23 11:53:12 UTC
*** Bug 1510728 has been marked as a duplicate of this bug. ***

Comment 37 aihua liang 2017-11-30 10:30:34 UTC
Code check on qemu-kvm-rhev-2.10.0-7.el7: PASS
Sanity Test on qemu-kvm-rhev-2.10.0-7.el7 + 3.10.0-796.el7.x86_64: PASS

So, set bug's status to "Verified".

Comment 42 errata-xmlrpc 2018-04-11 00:46:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1104