Note: This bug is displayed in read-only format because
the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
(In reply to yilzhang from comment #4)
> Will test it on x86, please stay tuned.
If it's confirmed to be a power-only BZ, then we'll reassign it.
See also Bug 1503437, which might be related.
Jeff, some ideas to debug this:
1. blockjob.c:block_job_create() calls blk_add_aio_context_notifier() so that the job can respond to AioContext changes. This test case is exactly the type of scenario that this code was written for since device_del will destroy the device that is using an IOThread and move the BlockDriverState back into the global AioContext.
2. bdrv_drained_begin() -> bdrv_co_yield_to_drain() -> qemu_coroutine_yield() should be called with job->busy = true (I haven't checked if this is the case). The assertion failure is complaining about a spurious qemu_coroutine_enter(), so the first step is tracking down who is entering this coroutine.
This is fixed in upstream commit:
commit 481cad48e5e655746893e001af31c161f4587a02
Author: Manos Pitsidianakis <el13635.gr>
Date: Sat Sep 23 14:14:09 2017 +0300
block: add bdrv_co_drain_end callback
BlockDriverState has a bdrv_co_drain() callback but no equivalent for
the end of the drain. The throttle driver (block/throttle.c) needs a way
to mark the end of the drain in order to toggle io_limits_disabled
correctly, thus bdrv_co_drain_end is needed.
We should probably backport the whole series 78b62d3..b867eaa
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2018:1104
Description of problem: Hot-plug a virtio-scsi device with data-plane while the drive-mirror job is running, after the hot-plug is done, hot-unplug this block device, qemu-kvm will abort with core dumped. Version-Release number of selected component (if applicable): Host kernel: 3.10.0-747.el7.ppc64le qemu-kvm-rhev: qemu-kvm-rhev-2.10.0-3.el7 SLOF: SLOF-20170724-2.git89f519f.el7.noarch Guest kernel: 3.10.0-747.el7.ppc64le How reproducible: 100% Steps to Reproduce: 1. Start QEMU with an image drive, but without device /usr/libexec/qemu-kvm \ -smp 8,sockets=2,cores=4,threads=1 -m 8192 \ -serial unix:/tmp/2dp-serial.log,server,nowait \ -nodefaults \ -rtc base=localtime,clock=host \ -boot menu=on \ -monitor stdio \ -qmp tcp:0:888,server,nowait \ \ -device pci-bridge,id=bridge1,chassis_nr=1,bus=pci.0 \ -netdev tap,id=net0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,vhost=on \ -device virtio-net-pci,netdev=net0,id=nic0,mac=52:54:00:c3:e7:8b,bus=bridge1,addr=0x1e \ \ -object iothread,id=iothread0 \ -device virtio-scsi-pci,bus=bridge1,addr=0x1f,id=scsi0,iothread=iothread0 \ -drive file=second_guest_image.qcow2,media=disk,if=none,cache=none,id=drive_sysdisk,aio=native,format=qcow2,werror=stop,rerror=stop \ -device scsi-hd,drive=drive_sysdisk,bus=scsi0.0,id=sysdisk,bootindex=0 \ \ -drive file=/home/yilzhang/dataplane/DISK-image-2.qcow2,if=none,cache=none,id=drive_ddisk_1,aio=native,format=qcow2,werror=stop,rerror=stop 2. Now start mirroring this block device's writes QMP: { "execute": "drive-mirror", "arguments": { "device": "drive_ddisk_1", "target": "destination-1.image", "sync": "full"} } 3. Hot-plug device with data-plane QMP: {"execute":"device_add","arguments":{"driver":"scsi-hd","drive":"drive_ddisk_1", "bus":"scsi0.0","id":"ddisk_1"}} 4. After a while, hot-unplug this device QMP: { "execute": "device_del", "arguments": { "id": "ddisk_1" }} Actual results: Qemu-kvm aborted abnormally with core dumped. [root@ibm-p8-09 dataplane]# sh fail-9330.sh QEMU 2.10.0 monitor - type 'help' for more information (qemu) VNC server running on ::1:5901 (qemu) Formatting 'destination-1.image', fmt=qcow2 size=4294967296 cluster_size=65536 lazy_refcounts=off refcount_bits=16 qemu-kvm: block/io.c:250: bdrv_co_yield_to_drain: Assertion `data.done' failed. fail-9330.sh: line 21: 146467 Aborted (core dumped) /usr/libexec/qemu-kvm -smp 8,sockets=2,cores=4,threads=1 -m 8192 -serial unix:/tmp/2dp-serial.log,server,nowait -nodefaults -rtc base=localtime,clock=host -boot menu=on -monitor stdio -qmp tcp:0:888,server,nowait -device pci-bridge,id=bridge1,chassis_nr=1,bus=pci.0 -netdev tap,id=net0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,vhost=on -device virtio-net-pci,netdev=net0,id=nic0,mac=52:54:00:c3:e7:8b,bus=bridge1,addr=0x1e -object iothread,id=iothread0 -device virtio-scsi-pci,bus=bridge1,addr=0x1f,id=scsi0,iothread=iothread0 -drive file=second_guest_image.qcow2,media=disk,if=none,cache=none,id=drive_sysdisk,aio=native,format=qcow2,werror=stop,rerror=stop -device scsi-hd,drive=drive_sysdisk,bus=scsi0.0,id=sysdisk,bootindex=0 -drive file=/home/yilzhang/dataplane/DISK-image-2.qcow2,if=none,cache=none,id=drive_ddisk_1,aio=native,format=qcow2,werror=stop,rerror=stop Expected results: Qemu-kvm should not abort with core dumped. Additional info: [root@ibm-p8-09 dataplane]# gdb /usr/libexec/qemu-kvm core.147883 GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-103.el7 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "ppc64le-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /usr/libexec/qemu-kvm...Reading symbols from /usr/lib/debug/usr/libexec/qemu-kvm.debug...done. done. [New LWP 147885] [New LWP 147909] [New LWP 147884] [New LWP 147911] [New LWP 147905] [New LWP 147907] [New LWP 147915] [New LWP 147906] [New LWP 147883] [New LWP 147918] [New LWP 148228] [New LWP 147914] [New LWP 147910] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". Core was generated by `/usr/libexec/qemu-kvm -smp 8,sockets=2,cores=4,threads=1 -m 8192 -serial unix:/'. Program terminated with signal 6, Aborted. #0 0x00003fff861cf5f0 in raise () from /lib64/libc.so.6 Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.ppc64le cyrus-sasl-lib-2.1.26-22.el7.ppc64le cyrus-sasl-md5-2.1.26-22.el7.ppc64le elfutils-libelf-0.170-1.el7.ppc64le elfutils-libs-0.170-1.el7.ppc64le glib2-2.50.3-3.el7.ppc64le glibc-2.17-207.el7.ppc64le gmp-6.0.0-15.el7.ppc64le gnutls-3.3.26-9.el7.ppc64le gperftools-libs-2.6.1-1.el7.ppc64le keyutils-libs-1.5.8-3.el7.ppc64le krb5-libs-1.15.1-15.el7.ppc64le libaio-0.3.109-13.el7.ppc64le libattr-2.4.46-13.el7.ppc64le libcap-2.22-9.el7.ppc64le libcom_err-1.42.9-10.el7.ppc64le libcurl-7.29.0-44.el7.ppc64le libdb-5.3.21-20.el7.ppc64le libfdt-1.4.3-1.el7.ppc64le libffi-3.0.13-18.el7.ppc64le libgcc-4.8.5-20.el7.ppc64le libgcrypt-1.5.3-14.el7.ppc64le libgpg-error-1.12-3.el7.ppc64le libibverbs-15-1.el7.ppc64le libidn-1.28-4.el7.ppc64le libiscsi-1.9.0-7.el7.ppc64le libnl3-3.2.28-4.el7.ppc64le libpng-1.5.13-7.el7_2.ppc64le librdmacm-15-1.el7.ppc64le libseccomp-2.3.1-3.el7.ppc64le libselinux-2.5-11.el7.ppc64le libssh2-1.4.3-10.el7_2.1.ppc64le libstdc++-4.8.5-20.el7.ppc64le libtasn1-4.10-1.el7.ppc64le libusbx-1.0.21-1.el7.ppc64le lzo-2.06-8.el7.ppc64le nettle-2.7.1-8.el7.ppc64le nspr-4.17.0-1.el7.ppc64le nss-3.28.4-11.el7.ppc64le nss-softokn-freebl-3.28.3-8.el7_4.ppc64le nss-util-3.33.0-1.el7.ppc64le numactl-libs-2.0.9-7.el7.ppc64le openldap-2.4.44-5.el7.ppc64le openssl-libs-1.0.2k-8.el7.ppc64le p11-kit-0.23.5-3.el7.ppc64le pcre-8.32-17.el7.ppc64le pixman-0.34.0-1.el7.ppc64le snappy-1.1.0-3.el7.ppc64le systemd-libs-219-45.el7.ppc64le xz-libs-5.2.2-1.el7.ppc64le zlib-1.2.7-17.el7.ppc64le (gdb) bt #0 0x00003fff861cf5f0 in raise () from /lib64/libc.so.6 #1 0x00003fff861d196c in abort () from /lib64/libc.so.6 #2 0x00003fff861c5244 in __assert_fail_base () from /lib64/libc.so.6 #3 0x00003fff861c5334 in __assert_fail () from /lib64/libc.so.6 #4 0x00000000410b2768 in bdrv_co_yield_to_drain (bs=0x100392d6800) at block/io.c:250 #5 bdrv_drained_begin (bs=0x100392d6800) at block/io.c:256 #6 0x00000000410af47c in mirror_run (opaque=0x1003a0201c0) at block/mirror.c:874 #7 0x000000004105aad8 in block_job_co_entry (opaque=0x1003a0201c0) at blockjob.c:287 #8 0x0000000041176698 in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:79 #9 0x00003fff861e319c in makecontext () from /lib64/libc.so.6 #10 0x0000000000000000 in ?? () (gdb) bt full #0 0x00003fff861cf5f0 in raise () from /lib64/libc.so.6 No symbol table info available. #1 0x00003fff861d196c in abort () from /lib64/libc.so.6 No symbol table info available. #2 0x00003fff861c5244 in __assert_fail_base () from /lib64/libc.so.6 No symbol table info available. #3 0x00003fff861c5334 in __assert_fail () from /lib64/libc.so.6 No symbol table info available. #4 0x00000000410b2768 in bdrv_co_yield_to_drain (bs=0x100392d6800) at block/io.c:250 data = {co = 0x1003a685700, bs = 0x100392d6800, done = false} #5 bdrv_drained_begin (bs=0x100392d6800) at block/io.c:256 No locals. #6 0x00000000410af47c in mirror_run (opaque=0x1003a0201c0) at block/mirror.c:874 delay_ns = 0 cnt = <optimized out> delta = <optimized out> should_complete = true data = <optimized out> bs = 0x100392d6800 target_bs = <optimized out> need_drain = true length = <optimized out> bdi = {cluster_size = 65536, vm_state_offset = 4294967296, is_dirty = false, unallocated_blocks_are_zero = true, can_write_zeroes_with_unmap = true, needs_compressed_writes = false} backing_filename = "\000" ret = <optimized out> __PRETTY_FUNCTION__ = "mirror_run" #7 0x000000004105aad8 in block_job_co_entry (opaque=0x1003a0201c0) at blockjob.c:287 job = 0x1003a0201c0 #8 0x0000000041176698 in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:79 arg = {p = 0x1003a685700, i = {979916544, 256}} self = 0x1003a685700 co = 0x1003a685700 #9 0x00003fff861e319c in makecontext () from /lib64/libc.so.6 No symbol table info available. #10 0x0000000000000000 in ?? () No symbol table info available.