Bug 1508708

Summary:          [data plane] Qemu-kvm core dumped when doing block-stream and block-job-cancel to a data disk with data-plane enabled
Product:          Red Hat Enterprise Linux 7
Component:        qemu-kvm-rhev
Version:          7.5
Hardware:         All
OS:               Linux
Status:           CLOSED ERRATA
Severity:         high
Priority:         high
Target Milestone: rc
Type:             Bug
Reporter:         yilzhang
Assignee:         Kevin Wolf <kwolf>
QA Contact:       Gu Nini <ngu>
CC:               chayang, coli, gianluca.cecchi, juzhang, knoel, lmiksik, lolyu, ngu, qizhu, qzhang, virt-maint
Bug Blocks:       1506531
Last Closed:      2019-08-22 09:18:46 UTC

Attachments:      Reproducer test script (see comment 5)

Description yilzhang 2017-11-02 01:50:18 UTC
Description of problem:
Qemu-kvm core dumps when block-stream followed by block-job-cancel is run on a data disk with data-plane enabled.


Version-Release number of selected component (if applicable):
Host kernel:  3.10.0-760.el7.ppc64le
Guest kernel: 3.10.0-760.el7.ppc64le
qemu-kvm-rhev: qemu-kvm-rhev-2.10.0-3.el7

How reproducible: 10%


Steps to Reproduce:
1. Boot up a guest with one data disk and data-plane enabled
2. Take a live snapshot of the data disk with blockdev-snapshot-sync:
{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "drive_ddisk_1","snapshot-file": "/root/sn1", "format": "qcow2", "mode": "absolute-paths" } }

3. Log in to the guest and write some data to the data disk, waiting for the write to finish:
[Guest ~]#  dd if=/dev/urandom  of=/dev/sdb  bs=1M count=99   oflag=sync

4. Start block-stream, then issue block-job-cancel right after it to cancel the job:
{ "execute": "block-stream", "arguments": { "device": "drive_ddisk_1", "speed":10000000, "on-error": "report" } }  
{ "execute": "block-job-cancel", "arguments": { "device": "drive_ddisk_1" } }


Actual results:
Qemu-kvm sometimes dumps core.
Note: If qemu-kvm works fine after the four steps above, repeat step 4 (usually fewer than 20 iterations are needed, sometimes up to about 100); a sketch of a small driver that repeats step 4 over QMP follows this note.
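
For convenience, below is a minimal, hypothetical C driver for that repeat loop (it is NOT the reproducer script attached in comment 5). It assumes the QMP monitor is exposed on TCP port 777 as in the command line in comment 3, that the data disk is drive_ddisk_1 as in step 4, and that a short delay between starting and cancelling the stream is enough to hit the window; response parsing and error handling are deliberately minimal.

/* Hypothetical repro driver -- not the attached reproducer script.
 * Connects to the QMP monitor (tcp:0:777 as in comment 3), negotiates
 * capabilities, and then repeats step 4 in a loop. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static void qmp_send(int fd, const char *json)
{
    char buf[4096];

    (void)write(fd, json, strlen(json));
    (void)write(fd, "\r\n", 2);
    (void)read(fd, buf, sizeof(buf));       /* best-effort drain of reply/event */
}

int main(void)
{
    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(777) };
    char greeting[4096];
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }
    (void)read(fd, greeting, sizeof(greeting));          /* QMP greeting banner */
    qmp_send(fd, "{ \"execute\": \"qmp_capabilities\" }");

    for (int i = 0; i < 100; i++) {                      /* step 4, repeated */
        qmp_send(fd, "{ \"execute\": \"block-stream\", \"arguments\": "
                     "{ \"device\": \"drive_ddisk_1\", \"speed\": 10000000, "
                     "\"on-error\": \"report\" } }");
        usleep(200 * 1000);                              /* let the stream start */
        qmp_send(fd, "{ \"execute\": \"block-job-cancel\", \"arguments\": "
                     "{ \"device\": \"drive_ddisk_1\" } }");
        sleep(2);                              /* wait for cancel (or the crash) */
    }
    close(fd);
    return 0;
}

Build with something like "gcc -O2 -o qmp_loop qmp_loop.c" and run it against the guest from step 1 while watching the qemu-kvm process for the segfault.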

Expected results:
"block-job-cancel" should succeed with QMP prompts:
{"timestamp": {"seconds": 1509585004, "microseconds": 491783}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "drive_ddisk_1", "len": 2576980480, "offset": 0, "speed": 10000000, "type": "stream"}}



Additional info:
# gdb /usr/libexec/qemu-kvm  core.50994
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/libexec/qemu-kvm -smp 8,sockets=2,cores=4,threads=1 -m 32768 -serial unix:'.
Program terminated with signal 11, Segmentation fault.
#0  bdrv_get_flags (bs=0x0) at block.c:3753
3753	}
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.ppc64le cyrus-sasl-lib-2.1.26-21.el7.ppc64le elfutils-libelf-0.168-8.el7.ppc64le elfutils-libs-0.168-8.el7.ppc64le glib2-2.50.3-3.el7.ppc64le glibc-2.17-196.el7.ppc64le gmp-6.0.0-15.el7.ppc64le gnutls-3.3.26-9.el7.ppc64le gperftools-libs-2.4-8.el7.ppc64le keyutils-libs-1.5.8-3.el7.ppc64le krb5-libs-1.15.1-8.el7.ppc64le libaio-0.3.109-13.el7.ppc64le libattr-2.4.46-12.el7.ppc64le libcap-2.22-9.el7.ppc64le libcom_err-1.42.9-10.el7.ppc64le libcurl-7.29.0-42.el7.ppc64le libdb-5.3.21-20.el7.ppc64le libfdt-1.4.3-1.el7.ppc64le libffi-3.0.13-18.el7.ppc64le libgcc-4.8.5-16.el7.ppc64le libgcrypt-1.5.3-14.el7.ppc64le libgpg-error-1.12-3.el7.ppc64le libibverbs-13-7.el7.ppc64le libidn-1.28-4.el7.ppc64le libiscsi-1.9.0-7.el7.ppc64le libnl3-3.2.28-4.el7.ppc64le libpng-1.5.13-7.el7_2.ppc64le librdmacm-13-7.el7.ppc64le libseccomp-2.3.1-3.el7.ppc64le libselinux-2.5-11.el7.ppc64le libssh2-1.4.3-10.el7_2.1.ppc64le libstdc++-4.8.5-16.el7.ppc64le libtasn1-4.10-1.el7.ppc64le libusbx-1.0.20-1.el7.ppc64le lzo-2.06-8.el7.ppc64le nettle-2.7.1-8.el7.ppc64le nspr-4.13.1-1.0.el7_3.ppc64le nss-3.28.4-8.el7.ppc64le nss-softokn-freebl-3.28.3-6.el7.ppc64le nss-util-3.28.4-3.el7.ppc64le numactl-libs-2.0.9-6.el7_2.ppc64le openldap-2.4.44-5.el7.ppc64le openssl-libs-1.0.2k-8.el7.ppc64le p11-kit-0.23.5-3.el7.ppc64le pcre-8.32-17.el7.ppc64le pixman-0.34.0-1.el7.ppc64le snappy-1.1.0-3.el7.ppc64le systemd-libs-219-42.el7.ppc64le xz-libs-5.2.2-1.el7.ppc64le zlib-1.2.7-17.el7.ppc64le
(gdb) bt
#0  bdrv_get_flags (bs=0x0) at block.c:3753
#1  0x000000008534541c in stream_complete (job=0x10036300870, opaque=0x1003609fc20) at block/stream.c:92
#2  0x0000000085439890 in block_job_defer_to_main_loop_bh (opaque=0x1003724ab40) at blockjob.c:909
#3  0x0000000085536258 in aio_bh_call (bh=0x100360773e0) at util/async.c:90
#4  aio_bh_poll (ctx=0x10036171900) at util/async.c:118
#5  0x000000008553ac88 in aio_dispatch (ctx=0x10036171900) at util/aio-posix.c:429
#6  0x0000000085536040 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at util/async.c:261
#7  0x00003fffa6633ab0 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#8  0x0000000085539784 in glib_pollfds_poll () at util/main-loop.c:213
#9  os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:261
#10 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:515
#11 0x0000000085118038 in main_loop () at vl.c:1917
#12 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4804
(gdb) bt full
#0  bdrv_get_flags (bs=0x0) at block.c:3753
No locals.
#1  0x000000008534541c in stream_complete (job=0x10036300870, opaque=0x1003609fc20) at block/stream.c:92
        s = 0x10036300870
        data = 0x1003609fc20
        bs = 0x0
        base = <optimized out>
        local_err = 0x0
#2  0x0000000085439890 in block_job_defer_to_main_loop_bh (opaque=0x1003724ab40) at blockjob.c:909
        data = 0x1003724ab40
        aio_context = 0x10036171900
#3  0x0000000085536258 in aio_bh_call (bh=0x100360773e0) at util/async.c:90
No locals.
#4  aio_bh_poll (ctx=0x10036171900) at util/async.c:118
        bh = <optimized out>
        bhp = <optimized out>
        next = 0x10036076570
        ret = <optimized out>
        deleted = <optimized out>
#5  0x000000008553ac88 in aio_dispatch (ctx=0x10036171900) at util/aio-posix.c:429
No locals.
#6  0x0000000085536040 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at util/async.c:261
        ctx = <optimized out>
#7  0x00003fffa6633ab0 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
No symbol table info available.
#8  0x0000000085539784 in glib_pollfds_poll () at util/main-loop.c:213
        context = 0x100362c0790
        pfds = <optimized out>
#9  os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:261
        context = 0x100362c0790
        ret = <optimized out>
        spin_counter = 1
#10 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:515
        ret = 1
        timeout = 4294967295
        timeout_ns = 1780373897
#11 0x0000000085118038 in main_loop () at vl.c:1917
No locals.
#12 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4804
        i = <optimized out>
        snapshot = <optimized out>
        linux_boot = 0
        initrd_filename = <optimized out>
        kernel_filename = 0x0
        kernel_cmdline = <optimized out>
        boot_order = <optimized out>
        boot_once = <optimized out>
        cyls = 0
        heads = 0
        secs = 0
        translation = <optimized out>
---Type <return> to continue, or q <return> to quit---
        opts = <optimized out>
        machine_opts = <optimized out>
        hda_opts = <optimized out>
        icount_opts = <optimized out>
        accel_opts = <optimized out>
        olist = <optimized out>
        optind = 40
        optarg = 0x3fffc24bf3fc "scsi-hd,drive=drive_ddisk_1,bus=scsi0.0,id=ddisk_1"
        loadvm = <optimized out>
        machine_class = 0x0
        cpu_model = <optimized out>
        vga_model = 0x0
        qtest_chrdev = <optimized out>
        qtest_log = <optimized out>
        pid_file = <optimized out>
        incoming = 0x0
        defconfig = <optimized out>
        userconfig = <optimized out>
        nographic = <optimized out>
        display_type = <optimized out>
        display_remote = <optimized out>
        log_mask = <optimized out>
        log_file = <optimized out>
        trace_file = <optimized out>
        maxram_size = <optimized out>
        ram_slots = <optimized out>
        vmstate_dump_file = 0x0
        main_loop_err = 0x0
        err = 0x0
        list_data_dirs = <optimized out>
        bdo_queue = {sqh_first = 0x0, sqh_last = 0x3fffc24b4db8}
        __func__ = "main"
        __FUNCTION__ = "main"

Comment 2 yilzhang 2017-11-02 01:54:17 UTC
1.  x86 hits this bug as well.
2.  This bug cannot be reproduced when data-plane is not used.

Comment 3 yilzhang 2017-11-02 10:04:30 UTC
Adding one qemu command line that was used when reproducing this bug:

/usr/libexec/qemu-kvm \
 -smp 8,sockets=2,cores=4,threads=1 -m 32768 \
 -serial unix:/tmp/3dp-serial.log,server,nowait \
 -nodefaults \
 -rtc base=localtime,clock=host \
 -boot menu=on \
 -monitor stdio \
 -monitor unix:/tmp/monitor1,server,nowait \
 -qmp tcp:0:777,server,nowait \
 -vnc :1 \
 -device virtio-vga \
\
 -device pci-bridge,id=bridge1,chassis_nr=1,bus=pci.0 \
 -netdev tap,id=net0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,vhost=on \
 -device virtio-net-pci,netdev=net0,id=nic0,mac=52:54:00:c3:e7:8a,bus=bridge1,addr=0x1e \
\
 -object iothread,id=iothread0 \
 -device virtio-scsi-pci,bus=bridge1,addr=0x1f,id=scsi0,iothread=iothread0 \
-drive file=/home/yilzhang/rhel7.5__3.10.0-760.el7.ppc64le.qcow2,media=disk,if=none,cache=none,id=drive_sysdisk,aio=native,format=qcow2,werror=stop,rerror=stop \
-device scsi-hd,drive=drive_sysdisk,bus=scsi0.0,id=sysdisk,bootindex=0 \
\
-drive file=/home/images/disk-image_1.qcow2,if=none,cache=none,id=drive_ddisk_1,format=qcow2,werror=stop,rerror=stop \
-device scsi-hd,drive=drive_ddisk_1,bus=scsi0.0,id=ddisk_1

Comment 4 Jeff Cody 2017-11-14 16:21:16 UTC
I'm able to reproduce this fairly easily, thanks to the information provided in the Description and Comment #3.  I haven't yet tracked it to a definitive root cause, but what appears to be happening is that blk and/or blk->bs is freed before stream_complete executes in the main-loop BH.

Comment 5 Jeff Cody 2017-11-14 18:12:51 UTC
Created attachment 1352092 [details]
Reproducer test script

This is a self-contained script to reproduce this bug (I suggest running it in a test directory): it creates all the images needed, runs QEMU, sends the appropriate QMP commands, and drops into GDB on segfault.

On my test system, I can reproduce it 100% via this script against both upstream QEMU and qemu-kvm-rhev-2.10.0-6.el7.

Comment 6 Jeff Cody 2017-11-17 07:56:13 UTC
In commit 2f47da5f7f, the qemu coroutine sleep callback was changed to use aio_co_wake(), so that the coroutine can be properly scheduled via an AioContext BH if the current AioContext does not match the block job/coroutine context.

There is a race condition when entering the block job to cancel the coroutine.  Depending on the AioContext, the coroutine is either entered directly or scheduled for later.  We can end up with a coroutine that is scheduled while that same coroutine is deferring its cleanup to the main loop.  The cleanup happens and frees the block job, but the scheduled block job runs some time after (or perhaps during) this cleanup.  The BlockJob structure, and the coroutine itself, are freed.  This causes us to access invalid memory through dangling pointers (in this case, it happens when the late and defunct coroutine tries to run stream_complete a second time).
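
To make that ordering concrete, here is a deliberately simplified, self-contained C sketch of the pattern (this is not QEMU code; the queue, schedule() and the freed flag are invented for illustration). The completion callback gets queued twice; the first run performs the cleanup that frees the job, and the stale second entry then runs against state that is already gone -- which is where the real crash ends up in bdrv_get_flags() with bs == 0x0, called from stream_complete.

/* Illustrative sketch only -- not QEMU source. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct Job {
    const char *name;
    bool freed;         /* stand-in for "the allocation was already released" */
} Job;

typedef struct Pending {
    void (*cb)(Job *);
    Job *job;
} Pending;

static Pending queue[8];    /* toy model of the main-loop BH / scheduled work */
static int nqueued;

static void schedule(void (*cb)(Job *), Job *job)
{
    queue[nqueued++] = (Pending){ cb, job };
}

static void stream_complete(Job *job)
{
    if (job->freed) {
        /* Real QEMU has no such guard: the BlockJob and the coroutine are
         * simply freed, so this access is a use-after-free. */
        printf("BUG: completion ran on an already-freed job\n");
        return;
    }
    printf("%s: completed, freeing job\n", job->name);
    job->freed = true;      /* in the real code the block job is freed here */
}

int main(void)
{
    Job *job = calloc(1, sizeof(*job));
    job->name = "stream";

    /* Completion is deferred to the main loop for cleanup ...              */
    schedule(stream_complete, job);
    /* ... but the same coroutine was also left scheduled to run later.     */
    schedule(stream_complete, job);

    /* The main loop drains the queue: the second entry is already stale.   */
    for (int i = 0; i < nqueued; i++) {
        queue[i].cb(queue[i].job);
    }

    free(job);
    return 0;
}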

Comment 7 Jeff Cody 2017-11-18 07:41:56 UTC
I have a patch for the fix that works with my test script:

https://github.com/codyprime/qemu-kvm-jtc/commit/9d3cecfcdfb62434452cd2ef456a3ea80ac98e9d

It will be submitted to the upstream list once I get an iotest case worked out for it.

Comment 8 Jeff Cody 2017-11-20 03:47:08 UTC
Patches submitted to qemu-devel: http://lists.nongnu.org/archive/html/qemu-devel/2017-11/msg03485.html

Comment 9 Jeff Cody 2017-12-01 03:24:38 UTC
Patches posted as part of BZ 1506531.

Comment 10 Ademar Reis 2017-12-01 23:35:15 UTC
May be related to bug 1519721

Comment 11 Jeff Cody 2017-12-15 21:41:10 UTC
From the series for BZ 1506531:

Fix included in qemu-kvm-rhev-2.10.0-12.el7

Comment 12 Qianqian Zhu 2017-12-20 05:53:41 UTC
Tested with qemu-kvm-rhev-2.10.0-12.el7; still hit a core dump when quitting qemu:

Command line:
/usr/libexec/qemu-kvm -M q35,accel=kvm,kernel-irqchip=split -m 4G -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie.0,iothread=iothread1 -drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=writethrough,format=qcow2,file=gluster://10.66.8.119/gv0/win2016-64-virtio-scsi.qcow2 -device scsi-hd,id=image1,drive=drive_image1 -device virtio-net-pci,mac=9a:89:8a:8b:8c:8d,id=idTqeiKU,netdev=idU17IIx,bus=pcie.0 -object iothread,id=iothread1 -netdev tap,id=idU17IIx -cpu 'SandyBridge',+kvm_pv_unhalt -vnc :0 -enable-kvm -qmp tcp::5555,server,nowait -monitor stdio -object iothread,id=iothread2 -object iothread,id=iothread3 -device virtio-scsi-pci,id=virtio_scsi_pci1,bus=pcie.0,iothread=iothread2 -drive id=drive_image2,if=none,snapshot=off,aio=threads,cache=writethrough,format=qcow2,file=gluster://10.66.8.119/gv0/data1.qcow2 -device scsi-hd,id=image2,drive=drive_image2


Steps:
{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "drive_image2","snapshot-file": "sn1-data", "format": "qcow2", "mode": "absolute-paths" } }
{"return": {}}
{ "execute": "block-stream", "arguments": { "device": "drive_image2"}}
{"return": {}}
{ "execute": "block-job-cancel", "arguments": { "device": "drive_image2"}}
{"return": {}}

Result:
(qemu) quit
qemu-kvm: /builddir/build/BUILD/qemu-2.10.0/hw/scsi/virtio-scsi.c:246: virtio_scsi_ctx_check: Assertion `blk_get_aio_context(d->conf.blk) == s->ctx' failed.
Aborted (core dumped)

Comment 13 Qianqian Zhu 2017-12-21 09:29:10 UTC
(gdb) bt full
#0  0x00007fb7b36911a7 in raise () at /lib64/libc.so.6
#1  0x00007fb7b3692898 in abort () at /lib64/libc.so.6
#2  0x00007fb7b3689fc8 in __assert_fail_base () at /lib64/libc.so.6
#3  0x00007fb7b368a074 in  () at /lib64/libc.so.6
#4  0x000056216110a4c7 in virtio_scsi_ctx_check (s=<optimized out>, s=<optimized out>, d=0x5621658ad680) at /usr/src/debug/qemu-2.10.0/hw/scsi/virtio-scsi.c:246
#5  0x000056216118fb16 in virtio_scsi_handle_cmd_vq (s=<optimized out>, s=<optimized out>, d=0x5621658ad680) at /usr/src/debug/qemu-2.10.0/hw/scsi/virtio-scsi.c:246
        vs = 0x562165a68170
        rc = <optimized out>
        req = 0x5621654f6c00
        next = <optimized out>
        ret = <optimized out>
        progress = true
        reqs = {tqh_first = 0x0, tqh_last = 0x7fb7abdf67b0}
#6  0x000056216118fb16 in virtio_scsi_handle_cmd_vq (req=0x5621654f6c00, s=0x562165a68170) at /usr/src/debug/qemu-2.10.0/hw/scsi/virtio-scsi.c:559
        vs = 0x562165a68170
        rc = <optimized out>
        req = 0x5621654f6c00
        next = <optimized out>
        ret = <optimized out>
        progress = true
        reqs = {tqh_first = 0x0, tqh_last = 0x7fb7abdf67b0}
#7  0x000056216118fb16 in virtio_scsi_handle_cmd_vq (s=s@entry=0x562165a68170, vq=vq@entry=0x562165a70100) at /usr/src/debug/qemu-2.10.0/hw/scsi/virtio-scsi.c:599
        req = 0x5621654f6c00
        next = <optimized out>
        ret = <optimized out>
        progress = true
        reqs = {tqh_first = 0x0, tqh_last = 0x7fb7abdf67b0}
#8  0x00005621611906fa in virtio_scsi_data_plane_handle_cmd (vdev=<optimized out>, vq=0x562165a70100) at /usr/src/debug/qemu-2.10.0/hw/scsi/virtio-scsi-dataplane.c:60
        progress = <optimized out>
        s = 0x562165a68170
#9  0x000056216119ced6 in virtio_queue_host_notifier_aio_poll (vq=0x562165a70100) at /usr/src/debug/qemu-2.10.0/hw/virtio/virtio.c:1506
        n = 0x562165a70168
        vq = 0x562165a70100
        progress = <optimized out>
#10 0x000056216119ced6 in virtio_queue_host_notifier_aio_poll (opaque=0x562165a70168) at /usr/src/debug/qemu-2.10.0/hw/virtio/virtio.c:2420
        n = 0x562165a70168
        vq = 0x562165a70100
        progress = <optimized out>
#11 0x000056216142f55e in run_poll_handlers_once (ctx=ctx@entry=0x562163195cc0) at util/aio-posix.c:490
        progress = false
        node = 0x5621653c9740
#12 0x000056216142ff85 in aio_poll (blocking=true, ctx=0x562163195cc0) at util/aio-posix.c:566
        node = <optimized out>
        i = <optimized out>
        ret = 0
        progress = <optimized out>
        timeout = <optimized out>
        start = 183182899378971
        __PRETTY_FUNCTION__ = "aio_poll"
---Type <return> to continue, or q <return> to quit---
#13 0x000056216142ff85 in aio_poll (ctx=0x562163195cc0, blocking=blocking@entry=true) at util/aio-posix.c:595
        node = <optimized out>
        i = <optimized out>
        ret = 0
        progress = <optimized out>
        timeout = <optimized out>
        start = 183182899378971
        __PRETTY_FUNCTION__ = "aio_poll"
#14 0x000056216122415e in iothread_run (opaque=0x5621630e3500) at iothread.c:59
        iothread = 0x5621630e3500
#15 0x00007fb7b3a2fdd5 in start_thread () at /lib64/libpthread.so.0
#16 0x00007fb7b375994d in clone () at /lib64/libc.so.6

Comment 15 Qianqian Zhu 2018-01-03 06:20:54 UTC
With the latest qemu-kvm-rhev-2.10.0-14.el7.x86_64, I still hit a core dump once when performing block-stream alone, without even cancelling it. It is not 100% reproducible, only about 1 in 10.

Steps:
1. Create a disk:
qemu-img create -f qcow2 gluster://10.66.8.119/gv0/data1.qcow2 2G

2. Launch guest with the disk, with iothread:
/usr/libexec/qemu-kvm \
-M q35,accel=kvm,kernel-irqchip=split \
-m 4G \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie.0,iothread=iothread1 \
-drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=writethrough,format=qcow2,file=/home/kvm_autotest_root/images/rhel75-64-virtio-scsi.qcow2 \
-device scsi-hd,id=image1,drive=drive_image1 -device virtio-net-pci,mac=9a:89:8a:8b:8c:8d,id=idTqeiKU,netdev=idU17IIx,bus=pcie.0 \
-object iothread,id=iothread1  \
-netdev tap,id=idU17IIx  \
-cpu 'SandyBridge',+kvm_pv_unhalt  \
-vnc :0   \
-enable-kvm  \
-qmp tcp::5555,server,nowait \
-monitor stdio \
-object iothread,id=iothread2 \
-object iothread,id=iothread3 \
-device virtio-scsi-pci,id=virtio_scsi_pci1,bus=pcie.0,iothread=iothread2 \
-drive id=drive_image2,if=none,snapshot=off,aio=threads,cache=writethrough,format=qcow2,file=gluster://10.66.8.119/gv0/data1.qcow2 \
-device scsi-hd,id=image2,drive=drive_image2

3. Live snapshot and block-stream:
{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "drive_image2","snapshot-file": "sn1-data", "format": "qcow2", "mode": "absolute-paths" } }
{"return": {}}
{ "execute": "block-stream", "arguments": { "device": "drive_image2"}}
{"return": {}}
{"timestamp": {"seconds": 1514958794, "microseconds": 42977}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "drive_image2", "len": 2147483648, "offset": 2147483648, "speed": 0, "type": "stream"}}

Result:
(qemu) Formatting 'sn1-data', fmt=qcow2 size=2147483648 backing_file=gluster://10.66.8.119/gv0/data1.qcow2 backing_fmt=qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16

(qemu) quit
qemu-kvm: /builddir/build/BUILD/qemu-2.10.0/hw/scsi/virtio-scsi.c:246: virtio_scsi_ctx_check: Assertion `blk_get_aio_context(d->conf.blk) == s->ctx' failed.
Aborted (core dumped)

(gdb) bt full
#0  0x00007fb4856651a7 in raise () at /lib64/libc.so.6
#1  0x00007fb485666898 in abort () at /lib64/libc.so.6
#2  0x00007fb48565dfc8 in __assert_fail_base () at /lib64/libc.so.6
#3  0x00007fb48565e074 in  () at /lib64/libc.so.6
#4  0x000055be1e2ce877 in virtio_scsi_ctx_check (s=<optimized out>, s=<optimized out>, d=0x55be223dd680) at /usr/src/debug/qemu-2.10.0/hw/scsi/virtio-scsi.c:246
#5  0x000055be1e353ec6 in virtio_scsi_handle_cmd_vq (s=<optimized out>, s=<optimized out>, d=0x55be223dd680) at /usr/src/debug/qemu-2.10.0/hw/scsi/virtio-scsi.c:246
        vs = 0x55be22598170
        rc = <optimized out>
        req = 0x55be21718500
        next = <optimized out>
        ret = <optimized out>
        progress = true
        reqs = {tqh_first = 0x0, tqh_last = 0x7fb47ddca7b0}
#6  0x000055be1e353ec6 in virtio_scsi_handle_cmd_vq (req=0x55be21718500, s=0x55be22598170) at /usr/src/debug/qemu-2.10.0/hw/scsi/virtio-scsi.c:559
        vs = 0x55be22598170
        rc = <optimized out>
        req = 0x55be21718500
        next = <optimized out>
        ret = <optimized out>
        progress = true
        reqs = {tqh_first = 0x0, tqh_last = 0x7fb47ddca7b0}
#7  0x000055be1e353ec6 in virtio_scsi_handle_cmd_vq (s=s@entry=0x55be22598170, vq=vq@entry=0x55be225a2100) at /usr/src/debug/qemu-2.10.0/hw/scsi/virtio-scsi.c:599
        req = 0x55be21718500
        next = <optimized out>
        ret = <optimized out>
        progress = true
        reqs = {tqh_first = 0x0, tqh_last = 0x7fb47ddca7b0}
#8  0x000055be1e354aaa in virtio_scsi_data_plane_handle_cmd (vdev=<optimized out>, vq=0x55be225a2100) at /usr/src/debug/qemu-2.10.0/hw/scsi/virtio-scsi-dataplane.c:60
        progress = <optimized out>
        s = 0x55be22598170
#9  0x000055be1e3612d6 in virtio_queue_host_notifier_aio_poll (vq=0x55be225a2100) at /usr/src/debug/qemu-2.10.0/hw/virtio/virtio.c:1506
        n = 0x55be225a2168
        vq = 0x55be225a2100
        progress = <optimized out>
#10 0x000055be1e3612d6 in virtio_queue_host_notifier_aio_poll (opaque=0x55be225a2168) at /usr/src/debug/qemu-2.10.0/hw/virtio/virtio.c:2420
        n = 0x55be225a2168
        vq = 0x55be225a2100
        progress = <optimized out>
#11 0x000055be1e5f483e in run_poll_handlers_once (ctx=ctx@entry=0x55be1fcc5cc0) at util/aio-posix.c:497
        progress = false
        node = 0x55be20add9e0
#12 0x000055be1e5f5285 in aio_poll (blocking=true, ctx=0x55be1fcc5cc0) at util/aio-posix.c:573
        node = <optimized out>
        i = <optimized out>
        ret = 0
        progress = <optimized out>
        timeout = <optimized out>
        start = 84268908368309
        __PRETTY_FUNCTION__ = "aio_poll"
---Type <return> to continue, or q <return> to quit---
#13 0x000055be1e5f5285 in aio_poll (ctx=0x55be1fcc5cc0, blocking=blocking@entry=true) at util/aio-posix.c:602
        node = <optimized out>
        i = <optimized out>
        ret = 0
        progress = <optimized out>
        timeout = <optimized out>
        start = 84268908368309
        __PRETTY_FUNCTION__ = "aio_poll"
#14 0x000055be1e3e8876 in iothread_run (opaque=0x55be1fccf260) at iothread.c:59
        iothread = 0x55be1fccf260
#15 0x00007fb485a03dd5 in start_thread () at /lib64/libpthread.so.0
#16 0x00007fb48572d94d in clone () at /lib64/libc.so.6

Comment 16 Jeff Cody 2018-01-10 18:33:09 UTC
I'm not able to reproduce your failures on either qemu-kvm-rhev-2.10.0-14.el7.x86_64 or qemu-kvm-rhev-2.10.0-12.el7.x86_64 (or qemu-kvm-rhev-2.10.0-16.el7.x86_64).

Would it be possible for me to get access to your test machine?

Comment 18 Qianqian Zhu 2018-01-11 02:24:30 UTC
Hit the same core dump 3 times during automation testing on qemu-kvm-rhev-2.10.0-16.el7.x86_64.

Comment 24 Qianqian Zhu 2018-03-20 02:32:30 UTC
Hit a similar core dump once during block-commit on qemu-kvm-rhev-2.10.0-21.el7_5.1.x86_64:

(gdb) bt
#0  0x00007f53acc13207 in raise () at /lib64/libc.so.6
#1  0x00007f53acc148f8 in abort () at /lib64/libc.so.6
#2  0x00007f53acc0c026 in __assert_fail_base () at /lib64/libc.so.6
#3  0x00007f53acc0c0d2 in  () at /lib64/libc.so.6
#4  0x00005642aadea517 in virtio_scsi_ctx_check (s=<optimized out>, s=<optimized out>, d=0x5642ae7d8500) at /usr/src/debug/qemu-2.10.0/hw/scsi/virtio-scsi.c:246
#5  0x00005642aae6f936 in virtio_scsi_handle_cmd_vq (s=<optimized out>, s=<optimized out>, d=0x5642ae7d8500) at /usr/src/debug/qemu-2.10.0/hw/scsi/virtio-scsi.c:246
#6  0x00005642aae6f936 in virtio_scsi_handle_cmd_vq (req=0x5642aecd8d80, s=0x5642af61a170) at /usr/src/debug/qemu-2.10.0/hw/scsi/virtio-scsi.c:559
#7  0x00005642aae6f936 in virtio_scsi_handle_cmd_vq (s=s@entry=0x5642af61a170, vq=vq@entry=0x5642af624100) at /usr/src/debug/qemu-2.10.0/hw/scsi/virtio-scsi.c:599
#8  0x00005642aae7051a in virtio_scsi_data_plane_handle_cmd (vdev=<optimized out>, vq=0x5642af624100) at /usr/src/debug/qemu-2.10.0/hw/scsi/virtio-scsi-dataplane.c:60
#9  0x00005642aae7cd56 in virtio_queue_host_notifier_aio_poll (vq=0x5642af624100) at /usr/src/debug/qemu-2.10.0/hw/virtio/virtio.c:1506
#10 0x00005642aae7cd56 in virtio_queue_host_notifier_aio_poll (opaque=0x5642af624168) at /usr/src/debug/qemu-2.10.0/hw/virtio/virtio.c:2420
#11 0x00005642ab11106e in run_poll_handlers_once (ctx=ctx@entry=0x5642ad99bb80) at util/aio-posix.c:497
#12 0x00005642ab111ab5 in aio_poll (blocking=true, ctx=0x5642ad99bb80) at util/aio-posix.c:573
#13 0x00005642ab111ab5 in aio_poll (ctx=0x5642ad99bb80, blocking=blocking@entry=true) at util/aio-posix.c:602
#14 0x00005642aaf046a6 in iothread_run (opaque=0x5642ad9a5260) at iothread.c:59
#15 0x00007f53acfb1dd5 in start_thread () at /lib64/libpthread.so.0
#16 0x00007f53accdbb3d in clone () at /lib64/libc.so.6

Comment 27 Qianqian Zhu 2018-05-31 03:05:53 UTC
Hi Jeff, sorry, but I need to release ibm-x3250m6-10.lab.eng.pek2.redhat.com for other usage; is that okay with you? We can set up another reproduction environment whenever you need one.

Comment 34 Longxiang Lyu 2018-07-18 05:48:35 UTC
I could reproduce this issue, as ngu mentioned, with the following components:
qemu-kvm-rhev-2.12.0-7.el7.x86_64
kernel-3.10.0-918.el7.x86_64

1. Boot up the guest:
/usr/libexec/qemu-kvm \
-name guest=test-virt0 \
-machine pc,accel=kvm,usb=off,vmport=off,dump-guest-core=off \
-cpu SandyBridge \
-m 4G \
-smp 4,sockets=4,cores=1,threads=1 \
-boot strict=on \
-object iothread,id=iothread0 \
-device virtio-scsi-pci,bus=pci.0,addr=0x5,iothread=iothread0,id=scsi0 \
-drive file=/home/kvm_autotest_root/images/rhel76-64-virtio-scsi.qcow2,format=qcow2,snapshot=off,cache=none,if=none,aio=native,id=img0 \
-device scsi-hd,bus=scsi0.0,drive=img0,scsi-id=0,lun=0,id=scsi-disk0,bootindex=0 \
-netdev tap,id=hostnet0,vhost=on \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=51:54:12:b3:20:61,bus=pci.0,addr=0x3 \
-device qxl-vga \
-vnc :1 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 \
-monitor stdio \
-qmp tcp:0:4444,server,nowait \
-usbdevice tablet \

2. Create a snapshot, then start block-stream:
{ "execute": "qmp_capabilities" }
{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "img0","snapshot-file": "sn1.qcow2", "format": "qcow2" } }
{ "execute": "block-stream", "arguments": { "device": "img0"}}

Result:
guest hangs.

bt info:
# gdb -batch -ex bt -p 23305
[New LWP 23341]
[New LWP 23338]
[New LWP 23324]
[New LWP 23323]
[New LWP 23322]
[New LWP 23321]
[New LWP 23319]
[New LWP 23307]
[New LWP 23306]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f22cef032cf in __GI_ppoll (fds=0x55fd8e9383c0, nfds=2, timeout=<optimized out>, timeout@entry=0x0, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:56
56	      result = INLINE_SYSCALL (ppoll, 5, fds, nfds, timeout, sigmask,
#0  0x00007f22cef032cf in __GI_ppoll (fds=0x55fd8e9383c0, nfds=2, timeout=<optimized out>, timeout@entry=0x0, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:56
#1  0x000055fd8cc1a7cb in qemu_poll_ns (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  0x000055fd8cc1a7cb in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at util/qemu-timer.c:322
#3  0x000055fd8cc1c515 in aio_poll (ctx=0x55fd8e8677c0, blocking=blocking@entry=true) at util/aio-posix.c:629
#4  0x000055fd8cb97caa in bdrv_flush (bs=bs@entry=0x55fd8e93c000) at block/io.c:2531
#5  0x000055fd8cb489ab in bdrv_unref (bs=0x55fd8e93c000) at block.c:3326
#6  0x000055fd8cb489ab in bdrv_unref (bs=0x55fd8e93c000) at block.c:3514
#7  0x000055fd8cb489ab in bdrv_unref (bs=0x55fd8e93c000) at block.c:4614
#8  0x000055fd8cb4bf44 in block_job_remove_all_bdrv (job=job@entry=0x55fd8e848c40) at blockjob.c:177
#9  0x000055fd8cb4bf93 in block_job_free (job=0x55fd8e848c40) at blockjob.c:94
#10 0x000055fd8cb4d3dd in job_unref (job=0x55fd8e848c40) at job.c:367
#11 0x000055fd8cb4d5e8 in job_finalize_single (job=0x55fd8e848c40) at job.c:654
#12 0x000055fd8cb4d5e8 in job_finalize_single (job=0x55fd8e848c40) at job.c:722
#13 0x000055fd8cb4cc50 in job_txn_apply (fn=0x55fd8cb4d4e0 <job_finalize_single>, lock=true, txn=<optimized out>) at job.c:150
#14 0x000055fd8ca1aa67 in stream_complete (job=0x55fd8e848c40, opaque=0x55fd8ecb26a8) at block/stream.c:96
#15 0x000055fd8cb4caf2 in job_defer_to_main_loop_bh (opaque=0x55fd9082efc0) at job.c:968
#16 0x000055fd8cc19331 in aio_bh_poll (bh=0x55fd8e920030) at util/async.c:90
#17 0x000055fd8cc19331 in aio_bh_poll (ctx=ctx@entry=0x55fd8e8677c0) at util/async.c:118
#18 0x000055fd8cc1c3d0 in aio_dispatch (ctx=0x55fd8e8677c0) at util/aio-posix.c:436
#19 0x000055fd8cc1920e in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at util/async.c:261
#20 0x00007f22e75a2049 in g_main_context_dispatch (context=0x55fd8e8baa50) at gmain.c:3175
#21 0x00007f22e75a2049 in g_main_context_dispatch (context=context@entry=0x55fd8e8baa50) at gmain.c:3828
#22 0x000055fd8cc1b6d7 in main_loop_wait () at util/main-loop.c:215
#23 0x000055fd8cc1b6d7 in main_loop_wait (timeout=<optimized out>) at util/main-loop.c:238
#24 0x000055fd8cc1b6d7 in main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:497
#25 0x000055fd8c8c2f27 in main () at vl.c:1963
#26 0x000055fd8c8c2f27 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4768

Comment 35 Qianqian Zhu 2018-07-18 07:13:09 UTC
Hi Jeff, Nini,
Looks like the host ibm-x3250m6-10.lab.eng.pek2.redhat.com can be released, I am going to clean the environment, is that okay with you?

Comment 36 Gu Nini 2018-07-18 07:16:22 UTC
(In reply to Qianqian Zhu from comment #35)
> Hi Jeff, Nini,
> Looks like the host ibm-x3250m6-10.lab.eng.pek2.redhat.com can be released,
> I am going to clean the environment, is that okay with you?

Qianqian,

It's ok for me. I can reproduce current issue with any host.

Comment 37 Jeff Cody 2018-08-01 14:07:46 UTC
(In reply to Qianqian Zhu from comment #35)
> Hi Jeff, Nini,
> Looks like the host ibm-x3250m6-10.lab.eng.pek2.redhat.com can be released,
> I am going to clean the environment, is that okay with you?

Yes, go ahead and release, thanks!

Comment 44 errata-xmlrpc 2019-08-22 09:18:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2553