Bug 1316368 - QMP gives no response while repeatedly starting/cancelling a block job
Summary: QMP gives no response while repeatedly starting/cancelling a block job
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.3
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 7.3
Assignee: Fam Zheng
QA Contact: Qianqian Zhu
URL:
Whiteboard:
Depends On:
Blocks: RHV4.1PPC
 
Reported: 2016-03-10 05:42 UTC by mazhang
Modified: 2016-11-07 21:00 UTC (History)
CC: 12 users

Fixed In Version: QEMU 2.6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-07 21:00:01 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:2673 normal SHIPPED_LIVE qemu-kvm-rhev bug fix and enhancement update 2016-11-08 01:06:13 UTC

Description mazhang 2016-03-10 05:42:31 UTC
Description of problem:
QMP gives no response while repeatedly starting/cancelling a block job.

Version-Release number of selected component (if applicable):

Host:
3.10.0-357.el7.ppc64le
qemu-kvm-rhev-2.5.0-2.el7.ppc64le

Guest:
2.6.32-573.el6.ppc64

How reproducible:
5/5

Steps to Reproduce:
1. Boot guest with:
/usr/libexec/qemu-kvm \
    -name 'virt-tests-vm1'  \
    -sandbox off  \
    -machine pseries  \
    -nodefaults  \
    -vga std  \
    -qmp tcp:0:6666,server,nowait \
    -chardev socket,id=qmp_id_qmp1,path=/tmp/monitor-qmp1-20160309-024047-fOYMc277,server,nowait \
    -mon chardev=qmp_id_qmp1,mode=control  \
    -chardev socket,id=qmp_id_catch_monitor,path=/tmp/monitor-catch_monitor-20160309-024047-fOYMc277,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control  \
    -chardev socket,id=serial_id_serial0,path=/tmp/serial-serial0-20160309-024047-fOYMc277,server,nowait \
    -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 \
    -device pci-ohci,id=usb1,bus=pci.0,addr=03 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=04,disable-legacy=off,disable-modern=on \
    -drive id=drive_image1,if=none,cache=none,snapshot=off,aio=native,format=qcow2,file=/home/test/staf-kvm/autotest/client/tests/virt/shared/data/images/rhel67-ppc64-virtio-scsi.qcow2 \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -device virtio-net-pci,mac=9a:f6:f7:f8:f9:fa,id=idvCYunh,vectors=4,netdev=idLzhJx8,bus=pci.0,addr=05,disable-legacy=off,disable-modern=on  \
    -netdev tap,id=idLzhJx8 \
    -m 8192  \
    -smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1  \
    -vnc :11  \
    -rtc base=utc,clock=host  \
    -boot order=cdn,once=c,menu=off,strict=off \
    -enable-kvm \
    -monitor stdio \
    -S \

2. Start/cancel the block job with the following script:

#!/bin/bash

let i=0
exec 3<>/dev/tcp/localhost/6666
echo -e "{ 'execute': 'qmp_capabilities' }" >&3
read response <&3
echo $response
while [ $i -lt 500 ]
do
    echo -e "{'execute': 'query-cpus', 'id': 'PkjCdBBG'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block', 'id': 'YO1lkgJR'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block', 'id': 'cHd67fdS'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'drive-mirror', 'arguments': {'device': 'drive_image1', 'mode': 'absolute-paths', 'format': 'qcow2', 'target': '/home/test/staf-kvm/autotest/client/tests/virt/shared/data/images/target1.qcow2', 'sync': 'full'}, 'id': 'TS1SJ9WQ'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block-jobs', 'id': 'GA59wJVC'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'block-job-set-speed', 'arguments': {'device': 'drive_image1', 'speed': 10485760}, 'id': '2f76YsoS'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block-jobs', 'id': '1SACh0ms'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block-jobs', 'id': '4vKqzkG1'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'block-job-cancel', 'arguments': {'device': 'drive_image1'}, 'id': 'tzq3eBk3'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block-jobs', 'id': 'g43PqGfX'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block-jobs', 'id': 'GDKiSWWW'}" >&3
    read response <&3;  echo "$i: $response"
    let i=$i+1
    sleep 1
done


3.

Actual results:
QMP gives no response, qemu-kvm hangs, and the guest does not respond to ping.

5: {"return": [{"io-status": "ok", "device": "drive_image1", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 21474836480, "filename": "/home/test/staf-kvm/autotest/client/tests/virt/shared/data/images/rhel67-ppc64-virtio-scsi.qcow2", "cluster-size": 65536, "format": "qcow2", "actual-size": 12473073664, "format-specific": {"type": "qcow2", "data": {"compat": "1.1", "lazy-refcounts": false, "refcount-bits": 16, "corrupt": false}}, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block126", "backing_file_depth": 0, "drv": "qcow2", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/home/test/staf-kvm/autotest/client/tests/virt/shared/data/images/rhel67-ppc64-virtio-scsi.qcow2", "encryption_key_missing": false}, "type": "unknown"}], "id": "cHd67fdS"}
5: {"return": {}, "id": "TS1SJ9WQ"}
5: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 0, "ready": false, "type": "mirror"}], "id": "GA59wJVC"}
6: {"return": {}, "id": "2f76YsoS"}
6: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "1SACh0ms"}
6: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "4vKqzkG1"}
6: {"return": {}, "id": "tzq3eBk3"}
6: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "g43PqGfX"}
6: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "GDKiSWWW"}
     <--- The problem always happens after cancelling the block job and then sending "query-block-jobs" a second time.
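The hang manifests as a QMP read that never returns. A minimal way to detect it from a client (a hedged sketch, not part of the original reproducer; the command id below is taken from the script above, and the helper is demonstrated against an in-process socket pair rather than a live QEMU monitor):

```python
import json
import socket

def recv_line(sock, timeout=5.0):
    """Read one newline-terminated QMP response; return None on timeout."""
    sock.settimeout(timeout)
    buf = b""
    try:
        while not buf.endswith(b"\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    except socket.timeout:
        return None  # monitor stopped responding: stands in for the hang
    return json.loads(buf) if buf else None

# Demonstration against an in-process peer instead of a real QEMU monitor:
a, b = socket.socketpair()
b.sendall(b'{"return": {}, "id": "tzq3eBk3"}\n')
resp = recv_line(a, timeout=1.0)
print(resp["id"])            # responsive peer answers normally
hung = recv_line(a, timeout=0.2)
print(hung)                  # no further data: timeout models the QMP hang
```

Against the real reproducer, a None return after `block-job-cancel` plus the second `query-block-jobs` is exactly the symptom reported here.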


Expected results:
qemu-kvm works well.

Additional info:

Comment 2 mazhang 2016-03-10 10:18:14 UTC
Also tested this on qemu-kvm-rhev-2.3.0-31.el7.ppc64le; after 1000 iterations the problem was not hit.

Comment 4 mazhang 2016-03-11 05:28:44 UTC
This bug cannot be reproduced on the x86 platform; after 1000 test iterations, qemu-kvm still works well.

Host:
qemu-kvm-rhev-2.5.0-2.el7.x86_64
3.10.0-357.el7.x86_64

Guest:
2.6.32-573.18.1.el6.x86_64

Comment 6 Laurent Vivier 2016-03-29 12:58:53 UTC
The KVM thread seems to be hanging on the iothread lock:

#0  0x00003fff81e21ac8 in __lll_lock_wait () from /lib64/power8/libpthread.so.0
#1  0x00003fff81e1b004 in pthread_mutex_lock ()
   from /lib64/power8/libpthread.so.0
#2  0x0000000036f6c3e8 in qemu_mutex_lock ()
#3  0x0000000036bdb1fc in qemu_mutex_lock_iothread ()
#4  0x0000000036d63394 in kvm_arch_handle_exit ()
#5  0x0000000036bf42a8 in kvm_cpu_exec ()
#6  0x0000000036bdb400 in qemu_kvm_cpu_thread_fn ()
#7  0x00003fff81e18728 in start_thread () from /lib64/power8/libpthread.so.0
#8  0x00003fff81799240 in clone () from /lib64/power8/libc.so.6

The lock is owned by the thread executing the block job:

#0  0x00003fff81789914 in ppoll () from /lib64/power8/libc.so.6
#1  0x0000000036eca86c in qemu_poll_ns ()
#2  0x0000000036ecc58c in aio_poll ()
#3  0x0000000036f18e78 in bdrv_drain ()
#4  0x0000000036f11e90 in mirror_run ()
#5  0x0000000036f82608 in coroutine_trampoline ()
#6  0x00003fff816d1c9c in makecontext () from /lib64/power8/libc.so.6
#7  0x0000000000000000 in ?? ()

Comment 7 Laurent Vivier 2016-03-29 15:33:49 UTC
bisected to:

commit 176c36997fd4a94a7b919468d8967e0ad81fdf9c
Author: Fam Zheng <famz@redhat.com>
Date:   Mon Nov 23 10:28:04 2015 +0800

    mirror: Quiesce source during "mirror_exit"
    
    With dataplane, the ioeventfd events could be dispatched after
    mirror_run releases the dirty bitmap, but before mirror_exit actually
    does the device switch, because the iothread will still be running, and
    it will cause silent data loss.
    
    Fix this by adding a bdrv_drained_begin/end pair around the window, so
    that no new external request will be handled.


Fam, do you have any idea of what happens?

Comment 8 Laurent Vivier 2016-03-29 16:21:38 UTC
I've been able to reproduce this on x86_64 with qemu upstream (b68a801).

The lock is contended between:

#0  0x00007f7c5140789d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7c514019cd in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x000055c6808fd4d9 in qemu_mutex_lock (
    mutex=mutex@entry=0x55c680eb8320 <qemu_global_mutex>)
    at /home/lvivier/Projects/qemu-upstream/util/qemu-thread-posix.c:64
#3  0x000055c68060c081 in qemu_mutex_lock_iothread ()
    at /home/lvivier/Projects/qemu-upstream/cpus.c:1232
#4  0x000055c6805decbd in prepare_mmio_access (mr=0x55c68341c0a0, 
    mr=0x55c68341c0a0) at /home/lvivier/Projects/qemu-upstream/exec.c:2556
#5  0x000055c6805e2e5f in address_space_read_continue (
...

And

#0  0x00007f7c4f1cd0a1 in ppoll () from /lib64/libc.so.6
#1  0x000055c68086cddb in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at /home/lvivier/Projects/qemu-upstream/qemu-timer.c:313
#3  0x000055c68086e024 in aio_poll (ctx=0x55c681a5ae00, blocking=blocking@entry=true) at /home/lvivier/Projects/qemu-upstream/aio-posix.c:453
#4  0x000055c6808ad016 in bdrv_drain (bs=0x55c681a745c0) at /home/lvivier/Projects/qemu-upstream/block/io.c:274
#5  0x000055c6808ad1ed in bdrv_drained_begin (bs=<optimized out>) at /home/lvivier/Projects/qemu-upstream/block/io.c:2737
#6  0x000055c6808a7139 in mirror_run (opaque=0x55c683245c40) at /home/lvivier/Projects/qemu-upstream/block/mirror.c:700
#7  0x000055c68090cc7a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at /home/lvivier/Projects/qemu-upstream/util/coroutine-ucontext.c:78
...

With the same qmp script and the following qemu command:

QEMU=/home/lvivier/Projects/qemu-upstream/build/pc-origin/x86_64-softmmu/qemu-system-x86_64
IMAGE=/var/lib/libvirt/images/rhel7.2.qcow2

$QEMU \
    -name 'virt-tests-vm1'  \
    -sandbox off  \
    -machine pc-i440fx-2.3,accel=kvm,usb=off -cpu SandyBridge -m 2048 -realtime mlock=off \
    -nodefaults  \
    -vga std  \
    -qmp tcp:0:6666,server,nowait \
    -chardev socket,id=qmp_id_qmp1,path=/tmp/monitor-qmp1-20160309-024047-fOYMc277,server,nowait \
    -mon chardev=qmp_id_qmp1,mode=control  \
    -chardev socket,id=qmp_id_catch_monitor,path=/tmp/monitor-catch_monitor-20160309-024047-fOYMc277,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control  \
    -chardev socket,id=serial_id_serial0,path=/tmp/serial-serial0-20160309-024047-fOYMc277,server,nowait \
    -device isa-serial,id=serial0,chardev=serial_id_serial0 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=04,disable-legacy=off,disable-modern=on \
    -drive id=drive_image1,if=none,cache=none,snapshot=off,aio=native,format=qcow2,file=$IMAGE \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -m 2048  \
    -smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
    -usb \
    -device usb-tablet,id=usb-tablet1 \
    -vnc :11  \
    -rtc base=utc,clock=host  \
    -boot order=cdn,once=c,menu=off,strict=off \
    -enable-kvm \
    -monitor stdio \
    -S

How to reproduce:

1- start qemu
2- start qmp script
3- attach "nc" to console (to see if the VM boots):
   nc -U /tmp/serial-serial0-20160309-024047-fOYMc277
4- type "c" in the qemu monitor to start the VM. If the VM isn't running, the
   lock is never taken.
5- wait; the VM hangs during boot.

Reverting 176c36997fd4a94a7b919468d8967e0ad81fdf9c fixes the problem.

Comment 9 Laurent Vivier 2016-03-29 16:28:11 UTC
As this is not a ppc64le-specific bug, assigning back to the default assignee.

Comment 10 Fam Zheng 2016-03-30 02:51:43 UTC
I cannot reproduce this on upstream. The backtraces are expected, except that the blocking ppoll should return when the I/O request it is waiting for completes. In your case it seems the ppoll didn't return.

How reproducible is it?

Comment 11 Laurent Vivier 2016-03-30 07:39:57 UTC
(In reply to Fam Zheng from comment #10)
> How reproducible is it?

Doing what is described in comment 8, it is 100% reproducible (I just did it again).

You must boot the system while the QMP tests are running; it hangs just after GRUB.

The last logs from QMP script are:

14: {"return": {}, "id": "tzq3eBk3"}
14: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "g43PqGfX"}
14: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "GDKiSWWW"}
14: {"timestamp": {"seconds": 1459323013, "microseconds": 355102}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "drive_image1", "len": 0, "offset": 0, "speed": 10485760, "type": "mirror"}}
14: {"return": [{"arch": "x86", "current": true, "CPU": 0, "qom_path": "/machine/unattached/device[0]", "pc": 2146040702, "halted": false, "thread_id": 22405}, {"arch": "x86", "current": false, "CPU": 1, "qom_path": "/machine/unattached/device[3]", "pc": 1037083, "halted": true, "thread_id": 22406}, {"arch": "x86", "current": false, "CPU": 2, "qom_path": "/machine/unattached/device[5]", "pc": 1037083, "halted": true, "thread_id": 22407}, {"arch": "x86", "current": false, "CPU": 3, "qom_path": "/machine/unattached/device[7]", "pc": 1037083, "halted": true, "thread_id": 22408}], "id": "PkjCdBBG"}
14: {"return": [{"io-status": "ok", "device": "drive_image1", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 21474836480, "filename": "/var/lib/libvirt/images/rhel7.2.qcow2", "cluster-size": 65536, "format": "qcow2", "actual-size": 5331587072, "format-specific": {"type": "qcow2", "data": {"compat": "1.1", "lazy-refcounts": true, "refcount-bits": 16, "corrupt": false}}, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block110", "backing_file_depth": 0, "drv": "qcow2", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/var/lib/libvirt/images/rhel7.2.qcow2", "encryption_key_missing": false}, "type": "unknown"}], "id": "YO1lkgJR"}
14: {"return": [{"io-status": "ok", "device": "drive_image1", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 21474836480, "filename": "/var/lib/libvirt/images/rhel7.2.qcow2", "cluster-size": 65536, "format": "qcow2", "actual-size": 5331587072, "format-specific": {"type": "qcow2", "data": {"compat": "1.1", "lazy-refcounts": true, "refcount-bits": 16, "corrupt": false}}, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block110", "backing_file_depth": 0, "drv": "qcow2", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/var/lib/libvirt/images/rhel7.2.qcow2", "encryption_key_missing": false}, "type": "unknown"}], "id": "cHd67fdS"}
14: {"return": {}, "id": "TS1SJ9WQ"}
14: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 0, "ready": false, "type": "mirror"}], "id": "GA59wJVC"}
14: {"return": {}, "id": "2f76YsoS"}
14: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "1SACh0ms"}
15: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "4vKqzkG1"}
15: {"return": {}, "id": "tzq3eBk3"}
15: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "g43PqGfX"}
15: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "GDKiSWWW"}
15: {"timestamp": {"seconds": 1459323014, "microseconds": 343062}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "drive_image1", "len": 0, "offset": 0, "speed": 10485760, "type": "mirror"}}
15: {"return": [{"arch": "x86", "current": true, "CPU": 0, "qom_path": "/machine/unattached/device[0]", "pc": 998095, "halted": false, "thread_id": 22405}, {"arch": "x86", "current": false, "CPU": 1, "qom_path": "/machine/unattached/device[3]", "pc": 1037083, "halted": true, "thread_id": 22406}, {"arch": "x86", "current": false, "CPU": 2, "qom_path": "/machine/unattached/device[5]", "pc": 1037083, "halted": true, "thread_id": 22407}, {"arch": "x86", "current": false, "CPU": 3, "qom_path": "/machine/unattached/device[7]", "pc": 1037083, "halted": true, "thread_id": 22408}], "id": "PkjCdBBG"}
15: {"return": [{"io-status": "ok", "device": "drive_image1", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 21474836480, "filename": "/var/lib/libvirt/images/rhel7.2.qcow2", "cluster-size": 65536, "format": "qcow2", "actual-size": 5331587072, "format-specific": {"type": "qcow2", "data": {"compat": "1.1", "lazy-refcounts": true, "refcount-bits": 16, "corrupt": false}}, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block110", "backing_file_depth": 0, "drv": "qcow2", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/var/lib/libvirt/images/rhel7.2.qcow2", "encryption_key_missing": false}, "type": "unknown"}], "id": "YO1lkgJR"}
15: {"return": [{"io-status": "ok", "device": "drive_image1", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 21474836480, "filename": "/var/lib/libvirt/images/rhel7.2.qcow2", "cluster-size": 65536, "format": "qcow2", "actual-size": 5331587072, "format-specific": {"type": "qcow2", "data": {"compat": "1.1", "lazy-refcounts": true, "refcount-bits": 16, "corrupt": false}}, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block110", "backing_file_depth": 0, "drv": "qcow2", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/var/lib/libvirt/images/rhel7.2.qcow2", "encryption_key_missing": false}, "type": "unknown"}], "id": "cHd67fdS"}
15: {"return": {}, "id": "TS1SJ9WQ"}
15: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 0, "ready": false, "type": "mirror"}], "id": "GA59wJVC"}
15: {"return": {}, "id": "2f76YsoS"}
16: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "1SACh0ms"}
16: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "4vKqzkG1"}
16: {"return": {}, "id": "tzq3eBk3"}
16: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "g43PqGfX"}
16: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "GDKiSWWW"}

My test script is as follows (the mirror target path differs from mazhang's in the description; check that your own path is accessible):

#!/bin/bash

let i=0
exec 3<>/dev/tcp/localhost/6666
echo -e "{ 'execute': 'qmp_capabilities' }" >&3
read response <&3
echo $response
while [ $i -lt 500 ]
do
    echo -e "{'execute': 'query-cpus', 'id': 'PkjCdBBG'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block', 'id': 'YO1lkgJR'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block', 'id': 'cHd67fdS'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'drive-mirror', 'arguments': {'device': 'drive_image1', 'mode': 'absolute-paths', 'format': 'qcow2', 'target': '/home/lvivier/target1.qcow2', 'sync': 'full'}, 'id': 'TS1SJ9WQ'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block-jobs', 'id': 'GA59wJVC'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'block-job-set-speed', 'arguments': {'device': 'drive_image1', 'speed': 10485760}, 'id': '2f76YsoS'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block-jobs', 'id': '1SACh0ms'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block-jobs', 'id': '4vKqzkG1'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'block-job-cancel', 'arguments': {'device': 'drive_image1'}, 'id': 'tzq3eBk3'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block-jobs', 'id': 'g43PqGfX'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block-jobs', 'id': 'GDKiSWWW'}" >&3
    read response <&3;  echo "$i: $response"
    let i=$i+1
    sleep 1
done

Comment 12 Fam Zheng 2016-03-30 10:15:30 UTC
It's not clear how this could hang, and I double-checked that I cannot reproduce it. Can you try this patch to rule out the event loop?

---

diff --git a/block/io.c b/block/io.c
index 41d954ca..6e484b4 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2732,7 +2732,7 @@ void bdrv_flush_io_queue(BlockDriverState *bs)
 void bdrv_drained_begin(BlockDriverState *bs)
 {
     if (!bs->quiesce_counter++) {
-        aio_disable_external(bdrv_get_aio_context(bs));
+        /*aio_disable_external(bdrv_get_aio_context(bs));*/
     }
     bdrv_drain(bs);
 }
@@ -2743,5 +2743,5 @@ void bdrv_drained_end(BlockDriverState *bs)
     if (--bs->quiesce_counter > 0) {
         return;
     }
-    aio_enable_external(bdrv_get_aio_context(bs));
+    /*aio_enable_external(bdrv_get_aio_context(bs));*/
 }
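For readers unfamiliar with the code being patched: `bdrv_drained_begin/end` use a quiesce counter so that external event sources are disabled on the first `begin` and re-enabled only on the matching last `end`, letting drained sections nest. A hedged Python sketch of that counter pattern (the class and names here are illustrative, not QEMU API):

```python
class DrainedSection:
    """Models QEMU's quiesce-counter pattern for nested drained sections."""
    def __init__(self):
        self.quiesce_counter = 0
        self.external_enabled = True

    def begin(self):
        if self.quiesce_counter == 0:
            self.external_enabled = False   # models aio_disable_external()
        self.quiesce_counter += 1

    def end(self):
        self.quiesce_counter -= 1
        if self.quiesce_counter == 0:
            self.external_enabled = True    # models aio_enable_external()

s = DrainedSection()
s.begin(); s.begin()        # nested begin: sources stay disabled
print(s.external_enabled)   # False
s.end()
print(s.external_enabled)   # False: the outer section is still active
s.end()
print(s.external_enabled)   # True: last end re-enables external sources
```

The diagnostic patch above simply comments out the disable/enable calls to test whether suppressing external events is what wedges the event loop.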

Comment 13 Laurent Vivier 2016-03-30 12:18:17 UTC
(In reply to Fam Zheng from comment #12)
> It's not clear how this could hang and I double checked that I cannot
> reproduce. Can you try this patch to rule out the event loop?


The result is the same with this patch: it hangs on ppc64le and x86_64. On x86_64, the hang seems to happen a little later.

Comment 15 Fam Zheng 2016-04-01 08:38:27 UTC
The hang is an ABBA deadlock between two coroutines. One coroutine is mirror, which has just finished a request (a qcow2 read) - the block job is about to complete. It calls bdrv_drain(), which waits for the other coroutine to complete. The other coroutine is a scsi-disk read request. The deadlock happens when the second coroutine is waiting for the first to yield/terminate in order to acquire the qcow2 lock in qemu_co_mutex_lock(). The state is:


mirror                         scsi-disk
-------------------------------------------------------------
do last write
  qcow2:qemu_co_mutex_lock()
  ...
                               scsi disk read
                                 tracked request begin
                                 qcow2:qemu_co_mutex_lock.enter
  qcow2:qemu_co_mutex_unlock()
bdrv_drain
  while (req pending)
    aio_poll

In the scsi-disk coroutine, the qemu_co_mutex_lock() will never return because the mirror coroutine is blocked in the aio_poll(blocking=true).

Upstream has this too.
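The key point in the analysis above is that the scheduler is cooperative: even after mirror releases the qcow2 lock, the scsi-disk coroutine can only resume if mirror yields, and a blocking aio_poll() inside a coroutine never does. A minimal model with Python generators standing in for QEMU coroutines (a sketch of the scheduling pattern, not QEMU code; the bounded spin loop stands in for the real aio_poll(blocking=true) so the demo terminates instead of hanging):

```python
qcow2_locked = True        # mirror holds the qcow2 lock during its last write
pending = {"scsi-read"}    # the scsi-disk read is already a tracked request

def mirror():
    global qcow2_locked
    qcow2_locked = False   # qemu_co_mutex_unlock(): the lock is free now...
    # bdrv_drain(): busy-wait for pending requests WITHOUT yielding to the
    # scheduler. Bounded so the demonstration finishes instead of hanging.
    spins = 0
    while pending and spins < 100_000:
        spins += 1
    yield "drained" if not pending else "stuck"

def scsi_disk():
    while qcow2_locked:    # qemu_co_mutex_lock(): needs to be rescheduled
        yield "waiting"
    pending.discard("scsi-read")
    yield "done"

disk = scsi_disk()         # never advanced: the scheduler is single-threaded
result = next(mirror())    # mirror runs first and never yields control back
print(result)              # -> stuck: pending can only shrink if disk runs
```

Even though the lock is released before the drain (as in the real trace), the outcome is "stuck" because draining and releasing happen in a context that never gives the waiter a chance to run.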

Comment 20 Qianqian Zhu 2016-09-09 06:17:25 UTC
Moving to Verified per comment 17

Comment 23 errata-xmlrpc 2016-11-07 21:00:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2673.html

