Bug 2099541
Summary: | qemu coredump with error Assertion `qemu_mutex_iothread_locked()' failed when repeatly hotplug/unplug disks in pause status | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 9 | Reporter: | qing.wang <qinwang> | |
Component: | qemu-kvm | Assignee: | Stefan Hajnoczi <stefanha> | |
qemu-kvm sub component: | virtio-blk,scsi | QA Contact: | qing.wang <qinwang> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | high | |||
Priority: | high | CC: | coli, eesposit, hreitz, jinzhao, juzhang, kkiwi, kwolf, lijin, mrezanin, pbonzini, pvlasin, sgarzare, stefanha, timao, virt-maint, xuwei | |
Version: | 9.1 | Keywords: | Regression, Triaged | |
Target Milestone: | rc | |||
Target Release: | --- | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | qemu-kvm-7.0.0-12.el9 | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 2121435 (view as bug list) | Environment: | ||
Last Closed: | 2022-11-15 09:54:42 UTC | Type: | --- | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 2121435 |
Description
qing.wang
2022-06-21 08:08:13 UTC
BT: #0 0x00007f67b360857c __pthread_kill_implementation (libc.so.6 + 0xa257c) #1 0x00007f67b35bbd56 raise (libc.so.6 + 0x55d56) #2 0x00007f67b358e833 abort (libc.so.6 + 0x28833) #3 0x00007f67b358e75b __assert_fail_base.cold (libc.so.6 + 0x2875b) #4 0x00007f67b35b4cd6 __assert_fail (libc.so.6 + 0x4ecd6) #5 0x000055ca87fd411b memory_region_transaction_commit (qemu-kvm + 0x67511b) #6 0x000055ca87e17811 virtio_pci_ioeventfd_assign (qemu-kvm + 0x4b8811) #7 0x000055ca87e14836 virtio_bus_set_host_notifier (qemu-kvm + 0x4b5836) #8 0x000055ca87f8e14e virtio_scsi_set_host_notifier (qemu-kvm + 0x62f14e) #9 0x000055ca87f8dd62 virtio_scsi_dataplane_start (qemu-kvm + 0x62ed62) #10 0x000055ca87e14610 virtio_bus_start_ioeventfd (qemu-kvm + 0x4b5610) #11 0x000055ca87f8c29a virtio_scsi_handle_ctrl (qemu-kvm + 0x62d29a) #12 0x000055ca87fa5902 virtio_queue_host_notifier_read (qemu-kvm + 0x646902) #13 0x000055ca882c099e aio_dispatch_handler (qemu-kvm + 0x96199e) #14 0x000055ca882c1761 aio_poll (qemu-kvm + 0x962761) #15 0x000055ca880e1052 iothread_run (qemu-kvm + 0x782052) #16 0x000055ca882c562a qemu_thread_start (qemu-kvm + 0x96662a) core dump: http://fileshare.englab.nay.redhat.com/pub/section2/images_backup/qbugs/2099541/core.qemu-kvm.0.394e0dafa04a4f489583d8d1282ef7da.112456.1655792680000000.zst This looks like a dataplane locking problem. Stefan, can you have a look? Re-assigning to Emanuele to keep going for now (Stefan is on PTO until Aug 1st). (In reply to Kevin Wolf from comment #2) > This looks like a dataplane locking problem. Stefan, can you have a look? Should be related to: f34e8d8b8d48d73f36a67b6d5e492ef9784b5012 ("virtio-scsi: prepare virtio_scsi_handle_cmd for dataplane") 2f743ef6366c2df4ef51ef3ae318138cdc0125ab ("virtio-scsi: fix ctrl and event handler functions in dataplane mode") I don't know the code very well, but comparing with virtio-blk where we call blk_set_aio_context() when we start the dataplane, in virtio-scsi we call blk_set_aio_context() in virtio_scsi_hotplug/virtio_scsi_hotunplug, so should we start the dataplane there? Or we should take the global mutex if it is not taken if we want to do it in virtio_scsi_defer_to_dataplane(). Otherwise, I think virtio_scsi_defer_to_dataplane() can be called from the iothread before starting the dataplane, so IIUC virtio_scsi_dataplane_start() is called without QEMU global mutex held in this case. > Otherwise, I think virtio_scsi_defer_to_dataplane() can be called from the iothread before starting
> the dataplane, so IIUC virtio_scsi_dataplane_start() is called without QEMU global mutex held in this case.
I think the problem is that virtio_scsi_defer_to_dataplane() is not checking s->dataplane_starting.
(In reply to Paolo Bonzini from comment #5) > > Otherwise, I think virtio_scsi_defer_to_dataplane() can be called from the iothread before starting > > the dataplane, so IIUC virtio_scsi_dataplane_start() is called without QEMU global mutex held in this case. > > I think the problem is that virtio_scsi_defer_to_dataplane() is not checking > s->dataplane_starting. But virtio_scsi_dataplane_start() checks s->dataplane_starting at the beginning, should this prevent the problem we see? Yeah it could be that these are done in the wrong order: s->dataplane_starting = false; s->dataplane_started = true; and using a single enum for the four states (STOPPED/STARTING/STARTED/STOPPING), plus s->dataplane_fenced separately, could work. (In reply to Paolo Bonzini from comment #8) > Yeah it could be that these are done in the wrong order: > > s->dataplane_starting = false; > s->dataplane_started = true; > What concerns me is the comment above virtio_scsi_dataplane_start: /* Context: QEMU global mutex held */ int virtio_scsi_dataplane_start(VirtIODevice *vdev) That is why I believe no locks had been used to set the state or atomic variables. But from the backtrace we see that now it can be called from an iothread. > and using a single enum for the four states > (STOPPED/STARTING/STARTED/STOPPING), plus s->dataplane_fenced separately, > could work. Or in virtio_scsi_defer_to_dataplane() we could return immediately if we are in an iothread. Patch posted upstream: https://patchwork.kernel.org/project/qemu-devel/patch/20220803162824.948023-1-stefanha@redhat.com/ (In reply to Stefan Hajnoczi from comment #16) > Patch posted upstream: > https://patchwork.kernel.org/project/qemu-devel/patch/20220803162824.948023- > 1-stefanha/ No issue found on Red Hat Enterprise Linux release 9.1 Beta (Plow) 5.14.0-136.el9.x86_64 qemu-kvm-7.0.0-6.el9.stefanha202208031157.x86_64 seabios-bin-1.16.0-3.el9.noarch edk2-ovmf-20220526git16779ede2d36-2.el9.noarch BTW, the comment 12 already provide a patch based on 7.0.0-9.el9. I found this patch in comment 15 it based on 7.0.0-6.el9, It looks likes a backport. Could you please help to confirm the final release will based on 7.0.0-6.el9 ? (In reply to qing.wang from comment #17) > (In reply to Stefan Hajnoczi from comment #16) > > Patch posted upstream: > > https://patchwork.kernel.org/project/qemu-devel/patch/20220803162824.948023- > > 1-stefanha/ > > No issue found on > > Red Hat Enterprise Linux release 9.1 Beta (Plow) > 5.14.0-136.el9.x86_64 > qemu-kvm-7.0.0-6.el9.stefanha202208031157.x86_64 > seabios-bin-1.16.0-3.el9.noarch > edk2-ovmf-20220526git16779ede2d36-2.el9.noarch Thank you! > BTW, the comment 12 already provide a patch based on 7.0.0-9.el9. > I found this patch in comment 15 it based on 7.0.0-6.el9, > It looks likes a backport. Could you please help to confirm the final > release will based on 7.0.0-6.el9 ? The final fix will be backported from qemu.git and released in a future 7.0.0-N.el9 rpm. I built a 7.0.0-6.el9-based rpm because that's what the original bug report was against but the final fix will be delivered in the latest rpm build as usual. This actually doesn't seem to have been posted downstream yet, fixing the status. Stefan, what are the upstream plans for this? It looks like this just missed upstream 7.1.0-rc3 because despite having been reviewed, there wasn't a pull request for it. If it can't be merged upstream soon, we may have to consider requesting exception? for downstream. QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass. Passed on Red Hat Enterprise Linux release 9.1 Beta (Plow) 5.14.0-152.el9.x86_64 qemu-kvm-7.0.0-12.el9.x86_64 seabios-bin-1.16.0-4.el9.noarch edk2-ovmf-20220526git16779ede2d36-3.el9.noarch virtio-win-prewhql-0.1-225.iso python ConfigTest.py --testcase=block_hotplug_in_pause.with_plug.multi_pci --iothread_scheme=roundrobin --nr_iothreads=2 --platform=x86_64 --guestname=RHEL.9.1.0 --driveformat=virtio_scsi --nicmodel=virtio_net --imageformat=qcow2 --machines=q35 --customsparams="vm_mem_limit = 12G\nimage_aio=native" --firmware=default_bios --clone=no --nrepeat=50 --netdst=virbr0 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: qemu-kvm security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7967 |