Bug 2099541 - qemu coredump with error Assertion `qemu_mutex_iothread_locked()' failed when repeatedly hotplugging/unplugging disks while the guest is paused
Summary: qemu coredump with error Assertion `qemu_mutex_iothread_locked()' failed when repeatedly hotplugging/unplugging disks while the guest is paused
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Stefan Hajnoczi
QA Contact: qing.wang
URL:
Whiteboard:
Depends On:
Blocks: 2121435
 
Reported: 2022-06-21 08:08 UTC by qing.wang
Modified: 2023-01-11 20:51 UTC
CC List: 16 users

Fixed In Version: qemu-kvm-7.0.0-12.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2121435
Environment:
Last Closed: 2022-11-15 09:54:42 UTC
Type: ---
Target Upstream Version:
Embargoed:


Attachments: none


Links
Red Hat Issue Tracker RHELPLAN-125811 (last updated 2022-06-21 08:14:39 UTC)
Red Hat Product Errata RHSA-2022:7967 (last updated 2022-11-15 09:55:16 UTC)

Description qing.wang 2022-06-21 08:08:13 UTC
Description of problem:
Repeatedly hotplug/unplug disks while the guest is paused.
There is a small probability that QEMU will crash with the error message:
qemu-kvm: ../softmmu/memory.c:1093: void memory_region_transaction_commit(void): Assertion `qemu_mutex_iothread_locked()' failed.


Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux release 9.1 Beta (Plow)
5.14.0-109.el9.x86_64
qemu-kvm-7.0.0-6.el9.x86_64
seabios-bin-1.16.0-3.el9.noarch
edk2-ovmf-20220221gitb24306f15d-2.el9.noarch
virtio-win-prewhql-0.1-221.iso


How reproducible:
< 10%

Steps to Reproduce:
1. Create image files:
/usr/bin/qemu-img create -f qcow2 /home/kvm_autotest_root/images/storage0.qcow2 1G
/usr/bin/qemu-img create -f qcow2 /home/kvm_autotest_root/images/storage1.qcow2 1G

2. Boot the VM:
/usr/libexec/qemu-kvm \
-name 'avocado-vt-vm1' \
-sandbox on \
-machine q35,memory-backend=mem-machine_mem \
-device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
-device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0 \
-nodefaults \
-device VGA,bus=pcie.0,addr=0x2 \
-m 12288 \
-object memory-backend-ram,size=12288M,id=mem-machine_mem \
-smp 10,maxcpus=10,cores=5,threads=1,dies=1,sockets=2 \
-cpu 'Cascadelake-Server-noTSX',+kvm_pv_unhalt \
-chardev socket,id=qmp_id_qmpmonitor1,wait=off,path=/var/tmp/avocado_l9_j5gs4/monitor-qmpmonitor1-20220621-022111-umUqWCgG,server=on \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-chardev socket,id=qmp_id_catch_monitor,wait=off,path=/var/tmp/avocado_l9_j5gs4/monitor-catch_monitor-20220621-022111-umUqWCgG,server=on \
-mon chardev=qmp_id_catch_monitor,mode=control \
-device pvpanic,ioport=0x505,id=idRHfelS \
-chardev socket,id=chardev_serial0,wait=off,path=/var/tmp/avocado_l9_j5gs4/serial-serial0-20220621-022111-umUqWCgG,server=on \
-device isa-serial,id=serial0,chardev=chardev_serial0 \
-chardev socket,id=seabioslog_id_20220621-022111-umUqWCgG,path=/var/tmp/avocado_l9_j5gs4/seabios-20220621-022111-umUqWCgG,server=on,wait=off \
-device isa-debugcon,chardev=seabioslog_id_20220621-022111-umUqWCgG,iobase=0x402 \
-device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
-device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-object iothread,id=iothread0 \
-object iothread,id=iothread1 \
-device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie-root-port-2,addr=0x0,iothread=iothread0 \
-blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=native,filename=/home/kvm_autotest_root/images/rhel900-64-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
-device scsi-hd,id=image1,drive=drive_image1,bootindex=0,write-cache=on \
-device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
-device virtio-net-pci,mac=9a:f0:a3:b2:43:86,id=idq2iKun,netdev=idRRmPQi,bus=pcie-root-port-3,addr=0x0 \
-netdev tap,id=idRRmPQi,vhost=on \
-vnc :0 \
-rtc base=utc,clock=host,driftfix=slew \
-boot menu=off,order=cdn,once=c,strict=off \
-enable-kvm \
-device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x3,chassis=5 \
-device pcie-root-port,id=pcie_extra_root_port_1,addr=0x3.0x1,bus=pcie.0,chassis=6

3. Pause the guest:
{"execute": "stop", "id": "Yrumnv4q"}
{"execute": "query-status", "id": "yekoV93y"}
{"return": {"status": "paused", "singlestep": false, "running": false}, "id": "yekoV93y"}

4. Hotplug the disks:
{"execute": "blockdev-add", "arguments": {"node-name": "file_stg0", "driver": "file", "auto-read-only": true, "discard": "unmap", "aio": "native", "filename": "/home/kvm_autotest_root/images/storage0.qcow2", "cache": {"direct": true, "no-flush": false}}, "id": "hIS6jm2f"}
{"execute": "blockdev-add", "arguments": {"node-name": "drive_stg0", "driver": "qcow2", "read-only": false, "cache": {"direct": true, "no-flush": false}, "file": "file_stg0"}, "id": "uuzktqFi"}
{"execute": "device_add", "arguments": {"driver": "scsi-hd", "id": "stg0", "drive": "drive_stg0", "write-cache": "on", "bus": "virtio_scsi_pci0.0"}, "id": "7tSDhfP1"}

{"execute": "blockdev-add", "arguments": {"node-name": "file_stg1", "driver": "file", "auto-read-only": true, "discard": "unmap", "aio": "native", "filename": "/home/kvm_autotest_root/images/storage1.qcow2", "cache": {"direct": true, "no-flush": false}}, "id": "90EkiVyp"}
{"execute": "blockdev-add", "arguments": {"node-name": "drive_stg1", "driver": "qcow2", "read-only": false, "cache": {"direct": true, "no-flush": false}, "file": "file_stg1"}, "id": "YoHqHH7d"}
{"execute": "device_add", "arguments": {"driver": "scsi-hd", "id": "stg1", "drive": "drive_stg1", "write-cache": "on", "bus": "virtio_scsi_pci0.0"}, "id": "7nFTRjHf"}

5. Resume the guest:
{"execute": "cont", "id": "z0sPJBdv"}
{"execute": "query-status", "id": "6YgUSJTC"}
{"return": {"status": "running", "singlestep": false, "running": true}, "id": "6YgUSJTC"}

6. Check the disks in the guest and do some I/O on the hotplugged disks:
dd if=/dev/sdb of=/dev/null bs=1k count=1000 iflag=direct && dd if=/dev/zero of=/dev/sdb bs=1k count=1000 oflag=direct


7. Pause the guest and unplug the disks:
{"execute": "stop", "id": "PxDgtBnR"}
{"execute": "query-status", "id": "hDZxLh72"}
{"return": {"status": "paused", "singlestep": false, "running": false}, "id": "hDZxLh72"}
{"execute": "device_del", "arguments": {"id": "stg1"}, "id": "cmyw8Blw"}
{"execute": "device_del", "arguments": {"id": "stg0"}, "id": "YgGXM3Bf"}

8. Resume the guest and check that the disks have been removed in the guest:

{"execute": "cont", "id": "LFLWlRPN"}

{"execute": "blockdev-del", "arguments": {"node-name": "drive_stg0"}, "id": "CBsfwMrz"}
{"execute": "blockdev-del", "arguments": {"node-name": "file_stg0"}, "id": "3atJcPmb"}
{"execute": "blockdev-del", "arguments": {"node-name": "drive_stg1"}, "id": "tnB0IHme"}
{"execute": "blockdev-del", "arguments": {"node-name": "file_stg1"}, "id": "RapUpnW1"}
lsblk

9. Repeat steps 3-8 100 times.
 

Actual results:
QEMU sometimes crashes with the assertion failure shown above.

Expected results:
All operations succeed
Additional info:
automation script:
python ConfigTest.py --testcase=block_hotplug_in_pause.with_plug.multi_pci --iothread_scheme=roundrobin --nr_iothreads=2 --platform=x86_64 --guestname=RHEL.9.0.0 --driveformat=virtio_scsi --nicmodel=virtio_net --imageformat=qcow2 --machines=q35  --customsparams="vm_mem_limit = 12G\nimage_aio=native" --firmware=default_bios --clone=no  --nrepeat=20 --netdst=virbr0
log:http://fileshare.englab.nay.redhat.com/pub/section2/images_backup/qbugs/2099541/2022-06-21/

The same error is also seen on qemu-kvm-7.0.0-6.el9.x86_64.
On qemu-kvm-7.0.0-3.el9.x86_64 or lower, the hotplug does not succeed after resuming the guest.

This issue does not reproduce on qemu-kvm-6.2.0-11.el9_0.3.x86_64.

Comment 1 qing.wang 2022-06-21 08:58:09 UTC
BT:

                #0  0x00007f67b360857c __pthread_kill_implementation (libc.so.6 + 0xa257c)
                #1  0x00007f67b35bbd56 raise (libc.so.6 + 0x55d56)
                #2  0x00007f67b358e833 abort (libc.so.6 + 0x28833)
                #3  0x00007f67b358e75b __assert_fail_base.cold (libc.so.6 + 0x2875b)
                #4  0x00007f67b35b4cd6 __assert_fail (libc.so.6 + 0x4ecd6)
                #5  0x000055ca87fd411b memory_region_transaction_commit (qemu-kvm + 0x67511b)
                #6  0x000055ca87e17811 virtio_pci_ioeventfd_assign (qemu-kvm + 0x4b8811)
                #7  0x000055ca87e14836 virtio_bus_set_host_notifier (qemu-kvm + 0x4b5836)
                #8  0x000055ca87f8e14e virtio_scsi_set_host_notifier (qemu-kvm + 0x62f14e)
                #9  0x000055ca87f8dd62 virtio_scsi_dataplane_start (qemu-kvm + 0x62ed62)
                #10 0x000055ca87e14610 virtio_bus_start_ioeventfd (qemu-kvm + 0x4b5610)
                #11 0x000055ca87f8c29a virtio_scsi_handle_ctrl (qemu-kvm + 0x62d29a)
                #12 0x000055ca87fa5902 virtio_queue_host_notifier_read (qemu-kvm + 0x646902)
                #13 0x000055ca882c099e aio_dispatch_handler (qemu-kvm + 0x96199e)
                #14 0x000055ca882c1761 aio_poll (qemu-kvm + 0x962761)
                #15 0x000055ca880e1052 iothread_run (qemu-kvm + 0x782052)
                #16 0x000055ca882c562a qemu_thread_start (qemu-kvm + 0x96662a)

core dump:
http://fileshare.englab.nay.redhat.com/pub/section2/images_backup/qbugs/2099541/core.qemu-kvm.0.394e0dafa04a4f489583d8d1282ef7da.112456.1655792680000000.zst

Comment 2 Kevin Wolf 2022-06-21 12:21:29 UTC
This looks like a dataplane locking problem. Stefan, can you have a look?

Comment 3 Klaus Heinrich Kiwi 2022-07-20 12:21:48 UTC
Re-assigning to Emanuele to keep going for now (Stefan is on PTO until Aug 1st).

Comment 4 Stefano Garzarella 2022-07-21 08:10:18 UTC
(In reply to Kevin Wolf from comment #2)
> This looks like a dataplane locking problem. Stefan, can you have a look?

Should be related to:
f34e8d8b8d48d73f36a67b6d5e492ef9784b5012 ("virtio-scsi: prepare virtio_scsi_handle_cmd for dataplane")
2f743ef6366c2df4ef51ef3ae318138cdc0125ab ("virtio-scsi: fix ctrl and event handler functions in dataplane mode")

I don't know the code very well, but comparing with virtio-blk, where we call blk_set_aio_context() when we start the dataplane: in virtio-scsi we call blk_set_aio_context() in virtio_scsi_hotplug()/virtio_scsi_hotunplug(), so should we start the dataplane there?

Alternatively, if we want to do it in virtio_scsi_defer_to_dataplane(), we should take the global mutex there when it is not already held. Otherwise, I think virtio_scsi_defer_to_dataplane() can be called from an iothread before the dataplane is started, so IIUC virtio_scsi_dataplane_start() is called without the QEMU global mutex held in this case.
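
For illustration, the "take the global mutex if it is not already held" option would look roughly like the sketch below. The helper name and its placement are hypothetical; this is not the actual QEMU code or the eventual fix, just the usual conditional-BQL pattern applied to the dataplane start:

    /* Hypothetical sketch only; start_dataplane_with_bql() is a made-up helper. */
    static void start_dataplane_with_bql(VirtIODevice *vdev)
    {
        bool was_locked = qemu_mutex_iothread_locked();

        if (!was_locked) {
            qemu_mutex_lock_iothread();
        }
        /* memory_region_transaction_commit() now sees the BQL held */
        virtio_scsi_dataplane_start(vdev);
        if (!was_locked) {
            qemu_mutex_unlock_iothread();
        }
    }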

Comment 5 Paolo Bonzini 2022-07-26 13:11:11 UTC
> Otherwise, I think virtio_scsi_defer_to_dataplane() can be called from the iothread before starting
> the dataplane, so IIUC virtio_scsi_dataplane_start() is called without QEMU global mutex held in this case.

I think the problem is that virtio_scsi_defer_to_dataplane() is not checking s->dataplane_starting.

Comment 6 Stefano Garzarella 2022-07-26 14:04:42 UTC
(In reply to Paolo Bonzini from comment #5)
> > Otherwise, I think virtio_scsi_defer_to_dataplane() can be called from the iothread before starting
> > the dataplane, so IIUC virtio_scsi_dataplane_start() is called without QEMU global mutex held in this case.
> 
> I think the problem is that virtio_scsi_defer_to_dataplane() is not checking
> s->dataplane_starting.

But virtio_scsi_dataplane_start() checks s->dataplane_starting at the beginning; shouldn't that prevent the problem we see?

Comment 8 Paolo Bonzini 2022-07-27 06:44:37 UTC
Yeah it could be that these are done in the wrong order:

    s->dataplane_starting = false;
    s->dataplane_started = true;

and using a single enum for the four states (STOPPED/STARTING/STARTED/STOPPING), plus s->dataplane_fenced separately, could work.
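
For illustration, a minimal sketch of that enum-based state is shown below; the type and value names are hypothetical and this is not the actual patch:

    /* Hypothetical sketch of the single-enum idea from this comment. */
    typedef enum {
        DATAPLANE_STOPPED,
        DATAPLANE_STARTING,
        DATAPLANE_STARTED,
        DATAPLANE_STOPPING,
    } VirtIOSCSIDataPlaneState;

    /* In VirtIOSCSI, the separate bools would collapse into:
     *     VirtIOSCSIDataPlaneState dataplane_state;
     * with dataplane_fenced kept as a separate bool, so code running in an
     * iothread observes one consistent transition (STARTING -> STARTED)
     * instead of two independently-ordered flag updates. */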

Comment 10 Stefano Garzarella 2022-07-27 07:55:12 UTC
(In reply to Paolo Bonzini from comment #8)
> Yeah it could be that these are done in the wrong order:
> 
>     s->dataplane_starting = false;
>     s->dataplane_started = true;
> 

What concerns me is the comment above virtio_scsi_dataplane_start:

  /* Context: QEMU global mutex held */
  int virtio_scsi_dataplane_start(VirtIODevice *vdev)

That is why I believe neither locks nor atomic variables had been used to set the state.
But the backtrace shows that it can now be called from an iothread.

> and using a single enum for the four states
> (STOPPED/STARTING/STARTED/STOPPING), plus s->dataplane_fenced separately,
> could work.

Or in virtio_scsi_defer_to_dataplane() we could return immediately if we are in an iothread.
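
For illustration, the early-return idea might look roughly like the following, using the BQL check as a proxy for "running in an iothread". The real function's signature, return type, and body differ, and this is only a sketch, not the eventual upstream fix:

    /* Hypothetical early-return variant of virtio_scsi_defer_to_dataplane(). */
    static bool virtio_scsi_defer_to_dataplane(VirtIOSCSI *s)
    {
        if (!qemu_mutex_iothread_locked()) {
            /* Already running in an iothread: don't try to start the dataplane
             * from here, since memory_region_transaction_commit() asserts that
             * the BQL is held.  Let the caller process the virtqueue directly. */
            return false;
        }

        /* ... existing logic: start ioeventfd/dataplane under the BQL ... */
        return true;
    }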

Comment 17 qing.wang 2022-08-04 06:57:00 UTC
(In reply to Stefan Hajnoczi from comment #16)
> Patch posted upstream:
> https://patchwork.kernel.org/project/qemu-devel/patch/20220803162824.948023-
> 1-stefanha/

No issue found on

Red Hat Enterprise Linux release 9.1 Beta (Plow)
5.14.0-136.el9.x86_64
qemu-kvm-7.0.0-6.el9.stefanha202208031157.x86_64
seabios-bin-1.16.0-3.el9.noarch
edk2-ovmf-20220526git16779ede2d36-2.el9.noarch


BTW, comment 12 already provides a patch based on 7.0.0-9.el9,
while the patch I found in comment 15 is based on 7.0.0-6.el9;
it looks like a backport. Could you please confirm whether the final release will be based on 7.0.0-6.el9?

Comment 18 Stefan Hajnoczi 2022-08-04 18:53:46 UTC
(In reply to qing.wang from comment #17)
> (In reply to Stefan Hajnoczi from comment #16)
> > Patch posted upstream:
> > https://patchwork.kernel.org/project/qemu-devel/patch/20220803162824.948023-
> > 1-stefanha/
> 
> No issue found on
> 
> Red Hat Enterprise Linux release 9.1 Beta (Plow)
> 5.14.0-136.el9.x86_64
> qemu-kvm-7.0.0-6.el9.stefanha202208031157.x86_64
> seabios-bin-1.16.0-3.el9.noarch
> edk2-ovmf-20220526git16779ede2d36-2.el9.noarch

Thank you!

> BTW, comment 12 already provides a patch based on 7.0.0-9.el9,
> while the patch I found in comment 15 is based on 7.0.0-6.el9;
> it looks like a backport. Could you please confirm whether the final
> release will be based on 7.0.0-6.el9?

The final fix will be backported from qemu.git and released in a future 7.0.0-N.el9 rpm.

I built a 7.0.0-6.el9-based rpm because that's what the original bug report was against but the final fix will be delivered in the latest rpm build as usual.

Comment 21 Kevin Wolf 2022-08-17 13:33:14 UTC
This actually doesn't seem to have been posted downstream yet, fixing the status.

Stefan, what are the upstream plans for this? It looks like this just missed upstream 7.1.0-rc3 because despite having been reviewed, there wasn't a pull request for it.

If it can't be merged upstream soon, we may have to consider requesting an exception for downstream.

Comment 27 Yanan Fu 2022-08-29 07:04:39 UTC
QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 30 qing.wang 2022-08-29 08:53:07 UTC
Passed on 
Red Hat Enterprise Linux release 9.1 Beta (Plow)
5.14.0-152.el9.x86_64
qemu-kvm-7.0.0-12.el9.x86_64
seabios-bin-1.16.0-4.el9.noarch
edk2-ovmf-20220526git16779ede2d36-3.el9.noarch
virtio-win-prewhql-0.1-225.iso


python ConfigTest.py --testcase=block_hotplug_in_pause.with_plug.multi_pci --iothread_scheme=roundrobin --nr_iothreads=2 --platform=x86_64 --guestname=RHEL.9.1.0 --driveformat=virtio_scsi --nicmodel=virtio_net --imageformat=qcow2 --machines=q35  --customsparams="vm_mem_limit = 12G\nimage_aio=native" --firmware=default_bios --clone=no  --nrepeat=50 --netdst=virbr0

Comment 33 errata-xmlrpc 2022-11-15 09:54:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: qemu-kvm security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7967

