Bug 2185688
Summary: | [qemu-kvm] no response with QMP command block_resize | |
---|---|---|---
Product: | Red Hat Enterprise Linux 9 | Reporter: | qing.wang <qinwang>
Component: | qemu-kvm | Assignee: | Kevin Wolf <kwolf>
qemu-kvm sub component: | virtio-blk,scsi | QA Contact: | qing.wang <qinwang>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high | |
Priority: | high | CC: | aliang, chayang, coli, hreitz, jinzhao, juzhang, kwolf, lijin, meili, mrezanin, qizhu, vgoyal, virt-maint, xuwei, zhenyzha
Version: | 9.3 | Keywords: | CustomerScenariosInitiative, Regression, Triaged
Target Milestone: | rc | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | All | |
Whiteboard: | | |
Fixed In Version: | qemu-kvm-8.0.0-4.el9 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2023-11-07 08:27:12 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
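
For orientation, block_resize is issued over the QMP monitor. A typical invocation looks like the following; the node name and size are illustrative and not taken from the original reproducer in comment 0:

```
{ "execute": "block_resize",
  "arguments": { "node-name": "drive0", "size": 10737418240 } }
```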
Description
qing.wang
2023-04-11 01:29:34 UTC
I think this is a different bug; here's the stack trace of what seems to hang:

```
(gdb) bt
#0  0x00007fd863a909a0 in () at /usr/lib/libc.so.6
#1  0x00007fd863a96efa in pthread_mutex_lock () at /usr/lib/libc.so.6
#2  0x000055dd05f770c3 in qemu_mutex_lock_impl (mutex=0x55dd09405ac0, file=0x55dd061a60c7 "../util/async.c", line=697) at ../util/qemu-thread-posix.c:94
#3  0x000055dd05e64900 in bdrv_co_drain_bh_cb (opaque=0x7fd85e6cfc90) at ../block/io.c:278
#4  0x000055dd05f88345 in aio_bh_call (bh=0x55dd09708d20) at ../util/async.c:155
#5  aio_bh_poll (ctx=ctx@entry=0x55dd09405a60) at ../util/async.c:184
#6  0x000055dd05f73e6b in aio_poll (ctx=0x55dd09405a60, blocking=blocking@entry=true) at ../util/aio-posix.c:721
#7  0x000055dd05e2ad66 in iothread_run (opaque=opaque@entry=0x55dd090e9090) at ../iothread.c:63
#8  0x000055dd05f76ce8 in qemu_thread_start (args=0x55dd094060a0) at ../util/qemu-thread-posix.c:541
```

Notably, as written in comment 0, I/O is completely optional. Even when you start an empty VM (no guest) with -S, block_resize still won't return. (When you do have a guest, as far as I can see, it doesn't fully hang, but I believe that starting from block_resize, all I/O to the resizee's iothread hangs, which basically makes the guest hang.)

Something seems to keep the AioContext acquired and isn't releasing it, but I don't know what yet. Sounds a bit like the secondary bdrv_drain_all_end() thing from bug 2186725.

Some debugging later, I'm not entirely sure what the problem is, but here's my best guess:

1. qmp_block_resize(), near its end, calls bdrv_co_lock(bs), locking the AioContext.
2. We go down this chain: blk_unref(blk) -> blk_delete() -> blk_remove_bs() -> bdrv_root_unref_child() -> ... -> bdrv_graph_wrlock() -> bdrv_drain_all_begin_nopoll().
3. Iterating over all nodes, we lock the AioContext again [1].
4. bdrv_do_drained_begin() -> bdrv_co_yield_to_drain().
5. We release the AioContext so we can schedule a BH to run in it, and it will actually be run; but it is still locked once by bdrv_co_lock() from qmp_block_resize().
6. The hang occurs with the stack trace shown in comment 4: iothread_run() -> aio_poll() -> aio_bh_call() wants to run the BH, but can't, because the context is still locked from bdrv_co_lock().

[1] I haven't quite understood at this point whether nested aio_context_acquire() is possible; AFAIR it was allowed in the past, but the implementation is just a mutex, so it doesn't look like it's still possible. Anyway, this nested lock is actually not where we hang, so it looks like it is possible.

So I think what has caused this bug to appear is the fact that bdrv_graph_wrlock() runs bdrv_drain_all(), which seems like it can't work while any AioContext is acquired; i.e. in the end, just like in bug 2186725, bdrv_graph_wrlock() mustn't be called with AioContexts acquired, only that there is an additional reason why it doesn't work (not just that read lock owners in any AioContexts will deadlock).

Hit the same issue on Red Hat Enterprise Linux release 9.3 Beta (Plow):

```
5.14.0-303.el9.x86_64
qemu-kvm-8.0.0-1.el9.x86_64
seabios-bin-1.16.1-1.el9.noarch
edk2-ovmf-20230301gitf80f052277c8-2.el9.noarch
libvirt-9.0.0-10.el9_2.x86_64
virtio-win-prewhql-0.1-235.iso
```

I actually already sent an upstream patch that fixes this; I just didn't make the connection with this BZ:
https://patchew.org/QEMU/20230504115750.54437-1-kwolf@redhat.com/20230504115750.54437-5-kwolf@redhat.com/

QE bot (pre-verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.
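To make the locking pattern above concrete, here is a minimal standalone sketch using plain pthreads, not QEMU code; all names are illustrative. ctx_lock stands in for the AioContext lock, worker() for the iothread's aio_poll() loop, bh_callback() for bdrv_co_drain_bh_cb(), and main() for the qmp_block_resize() path that schedules the BH and waits for it:

```
/*
 * Standalone sketch of the deadlock pattern described above, using plain
 * pthreads instead of QEMU internals.  All names are illustrative.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t ctx_lock = PTHREAD_MUTEX_INITIALIZER;   /* "AioContext lock" */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t queue_cond = PTHREAD_COND_INITIALIZER;
static bool bh_pending, bh_done, stop;

/* The queued callback: like bdrv_co_drain_bh_cb() in frame #3 of the
 * backtrace, it needs the context lock before it can make progress. */
static void bh_callback(void)
{
    pthread_mutex_lock(&ctx_lock);
    printf("worker: bottom half ran under the context lock\n");
    pthread_mutex_unlock(&ctx_lock);
}

/* Worker loop, standing in for iothread_run() -> aio_poll() -> aio_bh_call():
 * it picks up queued work and runs it. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        while (!bh_pending && !stop) {
            pthread_cond_wait(&queue_cond, &queue_lock);
        }
        if (stop) {
            pthread_mutex_unlock(&queue_lock);
            return NULL;
        }
        bh_pending = false;
        pthread_mutex_unlock(&queue_lock);

        bh_callback();      /* blocks forever if ctx_lock is still held elsewhere */

        pthread_mutex_lock(&queue_lock);
        bh_done = true;
        pthread_cond_broadcast(&queue_cond);
        pthread_mutex_unlock(&queue_lock);
    }
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);

    /* "qmp_block_resize": do the work that needs the context lock... */
    pthread_mutex_lock(&ctx_lock);
    printf("main: resizing under the context lock\n");
    /* ...and release it *before* scheduling work for the iothread and
     * waiting on it.  Moving this unlock after the queue-and-wait section
     * reproduces the hang: worker blocks in bh_callback() on ctx_lock,
     * while main waits forever for bh_done. */
    pthread_mutex_unlock(&ctx_lock);

    /* Schedule the "BH" and wait for it to complete. */
    pthread_mutex_lock(&queue_lock);
    bh_pending = true;
    pthread_cond_broadcast(&queue_cond);
    while (!bh_done) {
        pthread_cond_wait(&queue_cond, &queue_lock);
    }
    stop = true;
    pthread_cond_broadcast(&queue_cond);
    pthread_mutex_unlock(&queue_lock);

    pthread_join(t, NULL);
    printf("main: done\n");
    return 0;
}
```

As written, the sketch runs to completion; holding ctx_lock across the queue-and-wait section gives the same shape of hang as the backtrace above. It only illustrates the ordering constraint from the analysis; the actual fix is the upstream series linked above.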
Passed the test with the comment 0 steps on Red Hat Enterprise Linux release 9.3 Beta (Plow):

```
5.14.0-316.el9.x86_64
qemu-kvm-8.0.0-4.el9.x86_64
seabios-bin-1.16.1-1.el9.noarch
edk2-ovmf-20230301gitf80f052277c8-4.el9.noarch
libvirt-9.3.0-2.el9.x86_64
virtio-win-prewhql-0.1-237.iso
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: qemu-kvm security, bug fix, and enhancement update) and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6368