Bug 1508886
Summary: QEMU's AIO subsystem gets stuck inhibiting all I/O operations on virtio-blk-pci devices
Product: Red Hat Enterprise Linux 7
Component: qemu-kvm-rhev
Version: 7.4
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Sergio Lopez <slopezpa>
Assignee: Stefan Hajnoczi <stefanha>
QA Contact: aihua liang <aliang>
CC: aliang, chayang, coli, dzheng, junli, juzhang, knoel, ljozsa, lolyu, michen, mrezanin, mschuppe, mtessun, ngu, pbonzini, pingl, qzhang, rhodain, stefanha, virt-maint, xuwei, yhong
Target Milestone: rc
Keywords: ZStream
Hardware: Unspecified
OS: Unspecified
Fixed In Version: qemu-kvm-rhev-2.10.0-7.el7
Doc Type: If docs needed, set a value
Clones: 1513362 (view as bug list)
Last Closed: 2018-04-11 00:46:47 UTC
Type: Bug
Bug Blocks: 1513362, 1525497, 1525500, 1525502, 1525504, 1525505, 1525507, 1525510
Description — Sergio Lopez, 2017-11-02 12:54:30 UTC
To test the worker_thread vs. thread_pool_completion_bh concurrency issue hypothesis, I wrote this small patch:

<snip>
From 057f9ed07937886c4ac3f8b8a408e7597f3da93d Mon Sep 17 00:00:00 2001
From: Sergio Lopez <slp>
Date: Fri, 3 Nov 2017 07:57:44 +0100
Subject: [PATCH] util/thread-pool: use atomic operations for TPE->state

---
 util/thread-pool.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/util/thread-pool.c b/util/thread-pool.c
index 610646d..a264c12 100644
--- a/util/thread-pool.c
+++ b/util/thread-pool.c
@@ -107,7 +107,7 @@ static void *worker_thread(void *opaque)
         req->ret = ret;
         /* Write ret before state.  */
         smp_wmb();
-        req->state = THREAD_DONE;
+        atomic_set(&req->state, THREAD_DONE);
 
         qemu_mutex_lock(&pool->lock);
@@ -168,7 +168,7 @@ static void thread_pool_completion_bh(void *opaque)
     aio_context_acquire(pool->ctx);
 restart:
     QLIST_FOREACH_SAFE(elem, &pool->head, all, next) {
-        if (elem->state != THREAD_DONE) {
+        if (atomic_read(&elem->state) != THREAD_DONE) {
             continue;
         }
-- 
2.13.5
</snip>

Apparently, the patch in comment #5 didn't help. We have a new coredump from QEMU, with the same symptoms we've seen before.
Looking at each coroutine backtrace, we can see we have a total of 130 coroutines, with 51 waiting for another operation on the same cluster (qemu_co_queue_wait), and 78 waiting for a mutex (qemu_co_mutex_lock_slow) held by this one:

#0  qemu_coroutine_switch (from_=<optimized out>, to_=0x7f1f311dfbd8, action=action@entry=COROUTINE_YIELD) at util/coroutine-ucontext.c:176
#1  0x0000560f395e333c in qemu_coroutine_yield () at util/qemu-coroutine.c:167
#2  0x0000560f395ced78 in thread_pool_submit_co (pool=0x560f39fe6784 <rcu_call_ready_event>, func=0x0, arg=0x0) at util/thread-pool.c:290
#3  0x0000560f3955de90 in bdrv_driver_preadv (bs=bs@entry=0x560f3c441400, offset=offset@entry=4128768, bytes=bytes@entry=65536, qiov=0x7f1cf1045940, qiov@entry=0x7f1f10e31700, flags=0) at block/io.c:847
#4  0x0000560f3956114d in bdrv_aligned_preadv (child=child@entry=0x560f3c341ae0, req=req@entry=0x7f1cf1045820, offset=offset@entry=4128768, bytes=bytes@entry=65536, align=139771404032448, align@entry=512, qiov=0x7f1f10e31700, qiov@entry=0x7f1cf1045940, flags=flags@entry=0) at block/io.c:1096
#5  0x0000560f39561557 in bdrv_co_preadv (child=<optimized out>, offset=4128768, bytes=65536, qiov=0x7f1cf1045940, flags=0) at block/io.c:1190
#6  0x0000560f39561e25 in bdrv_rw_co_entry (opaque=0x7f1cf10458e0) at block/io.c:604
#7  0x0000560f39561f58 in bdrv_prwv_co (child=child@entry=0x560f3c341ae0, offset=offset@entry=4128768, qiov=qiov@entry=0x7f1cf1045940, is_write=is_write@entry=false, flags=flags@entry=0) at block/io.c:633
#8  0x0000560f39562256 in bdrv_preadv (qiov=0x7f1cf1045940, offset=4128768, child=0x560f3c341ae0) at block/io.c:747
#9  bdrv_pread (child=0x560f3c341ae0, offset=offset@entry=4128768, buf=<optimized out>, bytes=<optimized out>) at block/io.c:768
#10 0x0000560f39544d2f in qcow2_cache_do_get (bs=bs@entry=0x560f3c43e000, c=0x560f3c3ebb60, offset=offset@entry=4128768, table=table@entry=0x7f1cf1045a10, read_from_disk=read_from_disk@entry=true) at block/qcow2-cache.c:360
#11 0x0000560f39544eab in qcow2_cache_get (bs=bs@entry=0x560f3c43e000, c=<optimized out>, offset=offset@entry=4128768, table=table@entry=0x7f1cf1045a10) at block/qcow2-cache.c:384
#12 0x0000560f39541c26 in l2_load (l2_table=0x7f1cf1045a10, l2_offset=4128768, bs=0x560f3c43e000) at block/qcow2-cluster.c:161
#13 qcow2_get_cluster_offset (bs=bs@entry=0x560f3c43e000, offset=offset@entry=37883887616, bytes=bytes@entry=0x7f1cf1045ab4, cluster_offset=cluster_offset@entry=0x7f1cf1045ab8) at block/qcow2-cluster.c:531
#14 0x0000560f395378a5 in qcow2_co_preadv (bs=0x560f3c43e000, offset=37883887616, bytes=49152, qiov=0x7f1cf1045b90, flags=<optimized out>) at block/qcow2.c:1444
#15 0x0000560f39541944 in do_perform_cow (bytes=49152, offset_in_cluster=16384, cluster_offset=<optimized out>, src_cluster_offset=<optimized out>, bs=0x560f3c43e000) at block/qcow2-cluster.c:424
#16 perform_cow (bs=bs@entry=0x560f3c43e000, r=r@entry=0x560f3e80e498, m=0x560f3e80e460, m=0x560f3e80e460) at block/qcow2-cluster.c:754
#17 0x0000560f39542067 in qcow2_alloc_cluster_link_l2 (bs=bs@entry=0x560f3c43e000, m=0x560f3e80e460) at block/qcow2-cluster.c:793
#18 0x0000560f395369e0 in qcow2_co_pwritev (bs=0x560f3c43e000, offset=37883871232, bytes=16384, qiov=0x560f3dcde040, flags=<optimized out>) at block/qcow2.c:1674
#19 0x0000560f3955f811 in bdrv_driver_pwritev (bs=bs@entry=0x560f3c43e000, offset=offset@entry=37883871232, bytes=bytes@entry=16384, qiov=qiov@entry=0x560f3dcde040, flags=flags@entry=0) at block/io.c:888
#20 0x0000560f39560bc6 in bdrv_aligned_pwritev (req=req@entry=0x7f1cf1045ec0, offset=offset@entry=37883871232, bytes=bytes@entry=16384, align=align@entry=1, qiov=0x560f3dcde040, flags=flags@entry=0, child=0x560f3c3419f0) at block/io.c:1396
#21 0x0000560f39561702 in bdrv_co_pwritev (child=0x560f3c3419f0, offset=offset@entry=37883871232, bytes=bytes@entry=16384, qiov=qiov@entry=0x560f3dcde040, flags=0) at block/io.c:1647
#22 0x0000560f3955304b in blk_co_pwritev (blk=0x560f3c388000, offset=37883871232, bytes=16384, qiov=0x560f3dcde040, flags=<optimized out>) at block/block-backend.c:995
#23 0x0000560f395530da in blk_aio_write_entry (opaque=0x560f3eee3220) at block/block-backend.c:1186
#24 0x0000560f395e40ea in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:79
#25 0x00007f1f186c2d40 in ?? () from /lib64/libc.so.6
#26 0x00007ffe590d9be0 in ?? ()
#27 0x0000000000000000 in ?? ()

And, once again, the corresponding ThreadPoolElement is marked as THREAD_DONE, while the corresponding completion_bh is not pending execution (scheduled = false):

(gdb) p *((struct ThreadPoolReal *)0x560f3c3f64e0)->head->lh_first
$14 = {common = {aiocb_info = 0x560f39b36570 <thread_pool_aiocb_info>, bs = 0x0, cb = 0x560f395ce940 <thread_pool_co_cb>, opaque = 0x7f1cf1045680, refcnt = 0x1}, pool = 0x560f3c3f64e0, func = 0x560f39558cd0 <aio_worker>, arg = 0x560f3de22bc0, state = THREAD_DONE, ret = 0x0, reqs = {tqe_next = 0x0, tqe_prev = 0x0}, all = {le_next = 0x0, le_prev = 0x560f3c3f6578}}
(gdb) p *((struct ThreadPoolReal *)0x560f3c3f64e0)->completion_bh
$17 = {ctx = 0x560f3c377700, cb = 0x560f395ceb50 <thread_pool_completion_bh>, opaque = 0x560f3c3f64e0, next = 0x560f3c3eb620, scheduled = false, idle = false, deleted = false}

So, basically, everything appears to be waiting for a thread_pool_completion_bh call that never happens.

Looking at the history of util/thread-pool.c, I found this interesting and relatively recent (in downstream terms) commit (b7a745dc33a18377bb4a8dfe54d1df01ea60bf66):

<snip>
thread-pool: add missing qemu_bh_cancel in completion function

commit 3c80ca1 fixed a deadlock scenario with nested aio_poll invocations. However, the rescheduling of the completion BH introduced unnecessary spinning in the main-loop. On very fast file backends this can even lead to the "WARNING: I/O thread spun for 1000 iterations" message popping up.
Callgrind reports about 3-4% less instructions with this patch running qemu-img bench on a ramdisk based VMDK file.

Fixes: 3c80ca1
Cc: qemu-stable
Signed-off-by: Peter Lieven <pl>
Signed-off-by: Kevin Wolf <kwolf>
</snip>

While I wasn't able to come up with a concrete scenario in which we can get the completion_bh cancelled without the element being immediately processed in the thread_pool_completion_bh loop, I think it's worth a try to revert this commit and see if the issue is still reproducible.

Yes, I believe the following scenario is possible:

    I/O thread                            worker thread
    --------------------------------------------------------------------------
    speculatively read req->state
                                          req->state = THREAD_DONE;
                                          qemu_bh_schedule(pool->completion_bh)
                                            bh->scheduled = 1;
    qemu_bh_cancel(pool->completion_bh)
      bh->scheduled = 0;
    if (req->state == THREAD_DONE)
      // sees THREAD_QUEUED

Apart from reverting the fix, we could also:

1) add an smp_mb after qemu_bh_cancel

2) change qemu_bh_cancel to use atomic_mb_set(&bh->scheduled, 0)

both of which forbid the reordering.

Reverting the patch, not the fix, of course.

Hi,

(In reply to Paolo Bonzini from comment #13)
> Yes, I believe the following scenario is possible:
>
>     I/O thread                            worker thread
>     ------------------------------------------------------------------------
>     speculatively read req->state
>                                           req->state = THREAD_DONE;
>                                           qemu_bh_schedule(pool->completion_bh)
>                                             bh->scheduled = 1;
>     qemu_bh_cancel(pool->completion_bh)
>       bh->scheduled = 0;
>     if (req->state == THREAD_DONE)
>       // sees THREAD_QUEUED
>
> Apart from reverting the fix, we could also:
>
> 1) add an smp_mb after qemu_bh_cancel
>
> 2) change qemu_bh_cancel to use atomic_mb_set(&bh->scheduled, 0)
>
> both of which forbid the reordering.

After reading this, I was wondering why the patch at comment #5 didn't work. Now I see that I wrongly assumed atomic_[read|set] would emit HW memory barriers (I should have checked the generated code).
So, IIUC (please correct if I'm wrong), we have an implicit MB at worker_thread due to qemu_bh_schedule -> atomic_xchg -> (Intel ASM) xchg, and your proposal is to add another one in thread_pool_completion_bh at or after qemu_bh_cancel.

I think we should prepare another test build. Any preference between options (1) and (2)? Personally, I'm more inclined towards (2), as on Intel and s390x it'd be optimized to emit just a single xchg instruction:

    (gdb) disassemble/m qemu_bh_cancel
    Dump of assembler code for function qemu_bh_cancel:
    176     {
    177         atomic_mb_set(&bh->scheduled, 0);
       0x0000000000586650 <+0>:     xor    %eax,%eax
       0x0000000000586652 <+2>:     xchg   %al,0x20(%rdi)
    178     }
       0x0000000000586655 <+5>:     retq
       0x0000000000586656:          nopw   %cs:0x0(%rax,%rax,1)

On the other hand, upstream is also affected, so the patch should probably be sent there too.

Yes. I am not sure which of the two would be harder to follow and/or more likely to introduce bugs in the future, but I now think (2) is not just more efficient but also clearer; as Stefan said, it matches the way aio_bh_poll resets bh->scheduled.

> So, IIUC (please correct if I'm wrong), we have an implicit MB at worker_thread
> due to qemu_bh_schedule->atomic_xchg->(Intel ASM)xchg, and your proposal is to
> add another one on thread_pool_completion_bh at or after qemu_bh_cancel.
Correct.
Another way to see it is that qemu_bh_schedule is a "release" operation (everything before it is ordered before it), while aio_bh_poll is an "acquire" operation (it is ordered before everything after it). The only way to order stores before everything that follows (that is, to order both store-load and store-store pairs) is a full memory barrier, and that's what aio_bh_poll uses.
Commit b7a745dc33a18377bb4a8dfe54d1df01ea60bf66 used qemu_bh_cancel in such a way that it must act as an "acquire" operation too, and therefore it also needs a full memory barrier.
Thanks Paolo and Stefan. I'm going to prepare a test build with this one-liner:

<snip>
commit 8dff3eeaf741722759afb5013b4722fbc45868ae (HEAD -> master)
Author: Sergio Lopez <slp>
Date:   Tue Nov 7 14:30:16 2017 +0100

    util/async: use atomic_mb_set on qemu_bh_cancel

    Commit b7a745d added a qemu_bh_cancel call to the completion function
    as an optimization to prevent it from unnecessarily rescheduling itself.

    This completion function is scheduled from worker_thread, after setting
    the state of a ThreadPoolElement to THREAD_DONE.

    This was considered to be safe, as the completion function restarts the
    loop just after the call to qemu_bh_cancel. But, under certain access
    patterns and scheduling conditions, the loop may wrongly use a
    pre-fetched elem->state value, reading it as THREAD_QUEUED, and ending
    the completion function without having processed a pending TPE linked
    at pool->head.

    In some situations, if there are no other independent requests in the
    same aio context that could eventually trigger the scheduling of the
    completion function, the omitted TPE and all operations pending on it
    will get stuck forever.

diff --git a/util/async.c b/util/async.c
index 355af73..0e1bd87 100644
--- a/util/async.c
+++ b/util/async.c
@@ -174,7 +174,7 @@ void qemu_bh_schedule(QEMUBH *bh)
  */
 void qemu_bh_cancel(QEMUBH *bh)
 {
-    bh->scheduled = 0;
+    atomic_mb_set(&bh->scheduled, 0);
 }
 
 /* This func is async.The bottom half will do the delete action at the finial
</snip>

Thanks, I have reviewed it upstream. Please let me know when the customer has tested it. It will probably be merged into QEMU 2.11-rc1.

Note to QE: This race condition is hard to reproduce. Please verify this BZ by code inspection if there is no reproducer available. The util/async.c:qemu_bh_cancel() function must look like this:

    void qemu_bh_cancel(QEMUBH *bh)
    {
        atomic_mb_set(&bh->scheduled, 0);
    }

Fix included in qemu-kvm-rhev-2.10.0-7.el7

*** Bug 1510728 has been marked as a duplicate of this bug. ***
Code check on qemu-kvm-rhev-2.10.0-7.el7: PASS
Sanity test on qemu-kvm-rhev-2.10.0-7.el7 + kernel 3.10.0-796.el7.x86_64: PASS

So, setting the bug's status to "Verified".

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1104