Bug 1316368 - QMP no response while starting/cancelling block job repeatedly
Summary: QMP no response while starting/cancelling block job repeatedly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.3
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 7.3
Assignee: Fam Zheng
QA Contact: Qianqian Zhu
URL:
Whiteboard:
Depends On:
Blocks: RHV4.1PPC
 
Reported: 2016-03-10 05:42 UTC by mazhang
Modified: 2016-11-07 21:00 UTC
CC List: 12 users

Fixed In Version: QEMU 2.6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-07 21:00:01 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
Red Hat Product Errata RHBA-2016:2673 (priority: normal, status: SHIPPED_LIVE): qemu-kvm-rhev bug fix and enhancement update. Last updated: 2016-11-08 01:06:13 UTC

Description mazhang 2016-03-10 05:42:31 UTC
Description of problem:
QMP stops responding while repeatedly starting and cancelling a block job.

Version-Release number of selected component (if applicable):

Host:
3.10.0-357.el7.ppc64le
qemu-kvm-rhev-2.5.0-2.el7.ppc64le

Guest:
2.6.32-573.el6.ppc64

How reproducible:
5/5

Steps to Reproduce:
1. Boot the guest with:
/usr/libexec/qemu-kvm \
    -name 'virt-tests-vm1'  \
    -sandbox off  \
    -machine pseries  \
    -nodefaults  \
    -vga std  \
    -qmp tcp:0:6666,server,nowait \
    -chardev socket,id=qmp_id_qmp1,path=/tmp/monitor-qmp1-20160309-024047-fOYMc277,server,nowait \
    -mon chardev=qmp_id_qmp1,mode=control  \
    -chardev socket,id=qmp_id_catch_monitor,path=/tmp/monitor-catch_monitor-20160309-024047-fOYMc277,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control  \
    -chardev socket,id=serial_id_serial0,path=/tmp/serial-serial0-20160309-024047-fOYMc277,server,nowait \
    -device spapr-vty,reg=0x30000000,chardev=serial_id_serial0 \
    -device pci-ohci,id=usb1,bus=pci.0,addr=03 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=04,disable-legacy=off,disable-modern=on \
    -drive id=drive_image1,if=none,cache=none,snapshot=off,aio=native,format=qcow2,file=/home/test/staf-kvm/autotest/client/tests/virt/shared/data/images/rhel67-ppc64-virtio-scsi.qcow2 \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -device virtio-net-pci,mac=9a:f6:f7:f8:f9:fa,id=idvCYunh,vectors=4,netdev=idLzhJx8,bus=pci.0,addr=05,disable-legacy=off,disable-modern=on  \
    -netdev tap,id=idLzhJx8 \
    -m 8192  \
    -smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1  \
    -vnc :11  \
    -rtc base=utc,clock=host  \
    -boot order=cdn,once=c,menu=off,strict=off \
    -enable-kvm \
    -monitor stdio \
    -S

2. Start and cancel the block job repeatedly with the following script:

#!/bin/bash

let i=0
exec 3<>/dev/tcp/localhost/6666
echo -e "{ 'execute': 'qmp_capabilities' }" >&3
read response <&3
echo $response
while [ $i -lt 500 ]
do
    echo -e "{'execute': 'query-cpus', 'id': 'PkjCdBBG'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block', 'id': 'YO1lkgJR'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block', 'id': 'cHd67fdS'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'drive-mirror', 'arguments': {'device': 'drive_image1', 'mode': 'absolute-paths', 'format': 'qcow2', 'target': '/home/test/staf-kvm/autotest/client/tests/virt/shared/data/images/target1.qcow2', 'sync': 'full'}, 'id': 'TS1SJ9WQ'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block-jobs', 'id': 'GA59wJVC'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'block-job-set-speed', 'arguments': {'device': 'drive_image1', 'speed': 10485760}, 'id': '2f76YsoS'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block-jobs', 'id': '1SACh0ms'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block-jobs', 'id': '4vKqzkG1'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'block-job-cancel', 'arguments': {'device': 'drive_image1'}, 'id': 'tzq3eBk3'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block-jobs', 'id': 'g43PqGfX'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block-jobs', 'id': 'GDKiSWWW'}" >&3
    read response <&3;  echo "$i: $response"
    let i=$i+1
    sleep 1
done


3.

Actual results:
QMP gives no response, qemu-kvm hangs, and the guest does not respond to ping.

5: {"return": [{"io-status": "ok", "device": "drive_image1", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 21474836480, "filename": "/home/test/staf-kvm/autotest/client/tests/virt/shared/data/images/rhel67-ppc64-virtio-scsi.qcow2", "cluster-size": 65536, "format": "qcow2", "actual-size": 12473073664, "format-specific": {"type": "qcow2", "data": {"compat": "1.1", "lazy-refcounts": false, "refcount-bits": 16, "corrupt": false}}, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block126", "backing_file_depth": 0, "drv": "qcow2", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/home/test/staf-kvm/autotest/client/tests/virt/shared/data/images/rhel67-ppc64-virtio-scsi.qcow2", "encryption_key_missing": false}, "type": "unknown"}], "id": "cHd67fdS"}
5: {"return": {}, "id": "TS1SJ9WQ"}
5: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 0, "ready": false, "type": "mirror"}], "id": "GA59wJVC"}
6: {"return": {}, "id": "2f76YsoS"}
6: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "1SACh0ms"}
6: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "4vKqzkG1"}
6: {"return": {}, "id": "tzq3eBk3"}
6: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "g43PqGfX"}
6: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "GDKiSWWW"}
     <--- the problem always happens after cancelling the block job and then sending "query-block-jobs" a second time.


Expected results:
qemu-kvm works well.

Additional info:

Comment 2 mazhang 2016-03-10 10:18:14 UTC
Also tested this on qemu-kvm-rhev-2.3.0-31.el7.ppc64le; after 1000 repetitions the problem was not hit.

Comment 4 mazhang 2016-03-11 05:28:44 UTC
This bug cannot be reproduced on the x86 platform; after repeating the test 1000 times, qemu-kvm works well.

Host:
qemu-kvm-rhev-2.5.0-2.el7.x86_64
3.10.0-357.el7.x86_64

Guest:
2.6.32-573.18.1.el6.x86_64

Comment 6 Laurent Vivier 2016-03-29 12:58:53 UTC
The KVM thread seems to be hanging on the iothread lock:

#0  0x00003fff81e21ac8 in __lll_lock_wait () from /lib64/power8/libpthread.so.0
#1  0x00003fff81e1b004 in pthread_mutex_lock ()
   from /lib64/power8/libpthread.so.0
#2  0x0000000036f6c3e8 in qemu_mutex_lock ()
#3  0x0000000036bdb1fc in qemu_mutex_lock_iothread ()
#4  0x0000000036d63394 in kvm_arch_handle_exit ()
#5  0x0000000036bf42a8 in kvm_cpu_exec ()
#6  0x0000000036bdb400 in qemu_kvm_cpu_thread_fn ()
#7  0x00003fff81e18728 in start_thread () from /lib64/power8/libpthread.so.0
#8  0x00003fff81799240 in clone () from /lib64/power8/libc.so.6

The lock is owned by the thread executing the block job:

#0  0x00003fff81789914 in ppoll () from /lib64/power8/libc.so.6
#1  0x0000000036eca86c in qemu_poll_ns ()
#2  0x0000000036ecc58c in aio_poll ()
#3  0x0000000036f18e78 in bdrv_drain ()
#4  0x0000000036f11e90 in mirror_run ()
#5  0x0000000036f82608 in coroutine_trampoline ()
#6  0x00003fff816d1c9c in makecontext () from /lib64/power8/libc.so.6
#7  0x0000000000000000 in ?? ()
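
For reference, a minimal sketch of how per-thread backtraces like the ones above can be captured from the hung process (assumes gdb and matching debuginfo are installed; the pgrep pattern is illustrative):

pid=$(pgrep -f qemu-kvm | head -n1)              # PID of the hung qemu-kvm process
gdb -batch -ex 'thread apply all bt' -p "$pid"   # dump a backtrace of every thread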

Comment 7 Laurent Vivier 2016-03-29 15:33:49 UTC
bisected to:

commit 176c36997fd4a94a7b919468d8967e0ad81fdf9c
Author: Fam Zheng <famz>
Date:   Mon Nov 23 10:28:04 2015 +0800

    mirror: Quiesce source during "mirror_exit"
    
    With dataplane, the ioeventfd events could be dispatched after
    mirror_run releases the dirty bitmap, but before mirror_exit actually
    does the device switch, because the iothread will still be running, and
    it will cause silent data loss.
    
    Fix this by adding a bdrv_drained_begin/end pair around the window, so
    that no new external request will be handled.


Fam, do you have any idea what is happening?
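
For reference, a minimal sketch of how a bisect like this is typically driven (the known-good tag is an assumption for illustration; at each step QEMU is rebuilt and the QMP reproducer rerun):

cd ~/Projects/qemu-upstream     # upstream checkout (path as in comment 8)
git bisect start
git bisect bad                  # current HEAD reproduces the hang
git bisect good v2.4.0          # assumed-good earlier release (illustrative)
# rebuild, rerun the reproducer, then mark the result and repeat:
#   git bisect good    # or: git bisect bad
git bisect reset                # once git reports the first bad commit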

Comment 8 Laurent Vivier 2016-03-29 16:21:38 UTC
I've been able to reproduce this on x86_64 with qemu upstream (b68a801).

Lock between:

#0  0x00007f7c5140789d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7c514019cd in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x000055c6808fd4d9 in qemu_mutex_lock (
    mutex=mutex@entry=0x55c680eb8320 <qemu_global_mutex>)
    at /home/lvivier/Projects/qemu-upstream/util/qemu-thread-posix.c:64
#3  0x000055c68060c081 in qemu_mutex_lock_iothread ()
    at /home/lvivier/Projects/qemu-upstream/cpus.c:1232
#4  0x000055c6805decbd in prepare_mmio_access (mr=0x55c68341c0a0, 
    mr=0x55c68341c0a0) at /home/lvivier/Projects/qemu-upstream/exec.c:2556
#5  0x000055c6805e2e5f in address_space_read_continue (
...

And

#0  0x00007f7c4f1cd0a1 in ppoll () from /lib64/libc.so.6
#1  0x000055c68086cddb in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at /home/lvivier/Projects/qemu-upstream/qemu-timer.c:313
#3  0x000055c68086e024 in aio_poll (ctx=0x55c681a5ae00, blocking=blocking@entry=true) at /home/lvivier/Projects/qemu-upstream/aio-posix.c:453
#4  0x000055c6808ad016 in bdrv_drain (bs=0x55c681a745c0) at /home/lvivier/Projects/qemu-upstream/block/io.c:274
#5  0x000055c6808ad1ed in bdrv_drained_begin (bs=<optimized out>) at /home/lvivier/Projects/qemu-upstream/block/io.c:2737
#6  0x000055c6808a7139 in mirror_run (opaque=0x55c683245c40) at /home/lvivier/Projects/qemu-upstream/block/mirror.c:700
#7  0x000055c68090cc7a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at /home/lvivier/Projects/qemu-upstream/util/coroutine-ucontext.c:78
...

With the same qmp script and the following qemu command:

QEMU=/home/lvivier/Projects/qemu-upstream/build/pc-origin/x86_64-softmmu/qemu-system-x86_64
IMAGE=/var/lib/libvirt/images/rhel7.2.qcow2

$QEMU \
    -name 'virt-tests-vm1'  \
    -sandbox off  \
    -machine pc-i440fx-2.3,accel=kvm,usb=off -cpu SandyBridge -m 2048 -realtime mlock=off \
    -nodefaults  \
    -vga std  \
    -qmp tcp:0:6666,server,nowait \
    -chardev socket,id=qmp_id_qmp1,path=/tmp/monitor-qmp1-20160309-024047-fOYMc277,server,nowait \
    -mon chardev=qmp_id_qmp1,mode=control  \
    -chardev socket,id=qmp_id_catch_monitor,path=/tmp/monitor-catch_monitor-20160309-024047-fOYMc277,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control  \
    -chardev socket,id=serial_id_serial0,path=/tmp/serial-serial0-20160309-024047-fOYMc277,server,nowait \
    -device isa-serial,id=serial0,chardev=serial_id_serial0 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=04,disable-legacy=off,disable-modern=on \
    -drive id=drive_image1,if=none,cache=none,snapshot=off,aio=native,format=qcow2,file=$IMAGE \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -m 2048  \
    -smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
    -usb \
    -device usb-tablet,id=usb-tablet1 \
    -vnc :11  \
    -rtc base=utc,clock=host  \
    -boot order=cdn,once=c,menu=off,strict=off \
    -enable-kvm \
    -monitor stdio \
    -S

How to reproduce:

1- start qemu
2- start qmp script
3- attach "nc" to console (to see if the VM boots):
   nc -U /tmp/serial-serial0-20160309-024047-fOYMc277
4- type "c" in qemu monitor to start the VM. If the VM doesn't run, the lock 
   cannot be not taken.
5- wait, during the boot VM hangs.

Reverting 176c36997fd4a94a7b919468d8967e0ad81fdf9c fixes the problem.
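
A minimal sketch of that check in an upstream tree (the configure options are illustrative; the commit hash is the one bisected in comment 7):

cd ~/Projects/qemu-upstream
git revert 176c36997fd4a94a7b919468d8967e0ad81fdf9c             # undo the suspect commit
./configure --target-list=x86_64-softmmu && make -j"$(nproc)"   # rebuild
# rerun the reproducer from this comment against the rebuilt binary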

Comment 9 Laurent Vivier 2016-03-29 16:28:11 UTC
As this is not a ppc64le-specific bug, assigning back to the default assignee.

Comment 10 Fam Zheng 2016-03-30 02:51:43 UTC
I cannot reproduce this on upstream. The backtrace is expected, except that the blocking ppoll should return when the I/O request it is waiting for completes. In your case it seems the ppoll didn't return.

How reproducible is it?

Comment 11 Laurent Vivier 2016-03-30 07:39:57 UTC
(In reply to Fam Zheng from comment #10)
> How reproducible is it?

Doing what is described in comment 8, it is 100% reproducible (I just did it again).

You must boot the system while the QMP tests are running; it hangs just after GRUB.

The last logs from QMP script are:

14: {"return": {}, "id": "tzq3eBk3"}
14: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "g43PqGfX"}
14: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "GDKiSWWW"}
14: {"timestamp": {"seconds": 1459323013, "microseconds": 355102}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "drive_image1", "len": 0, "offset": 0, "speed": 10485760, "type": "mirror"}}
14: {"return": [{"arch": "x86", "current": true, "CPU": 0, "qom_path": "/machine/unattached/device[0]", "pc": 2146040702, "halted": false, "thread_id": 22405}, {"arch": "x86", "current": false, "CPU": 1, "qom_path": "/machine/unattached/device[3]", "pc": 1037083, "halted": true, "thread_id": 22406}, {"arch": "x86", "current": false, "CPU": 2, "qom_path": "/machine/unattached/device[5]", "pc": 1037083, "halted": true, "thread_id": 22407}, {"arch": "x86", "current": false, "CPU": 3, "qom_path": "/machine/unattached/device[7]", "pc": 1037083, "halted": true, "thread_id": 22408}], "id": "PkjCdBBG"}
14: {"return": [{"io-status": "ok", "device": "drive_image1", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 21474836480, "filename": "/var/lib/libvirt/images/rhel7.2.qcow2", "cluster-size": 65536, "format": "qcow2", "actual-size": 5331587072, "format-specific": {"type": "qcow2", "data": {"compat": "1.1", "lazy-refcounts": true, "refcount-bits": 16, "corrupt": false}}, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block110", "backing_file_depth": 0, "drv": "qcow2", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/var/lib/libvirt/images/rhel7.2.qcow2", "encryption_key_missing": false}, "type": "unknown"}], "id": "YO1lkgJR"}
14: {"return": [{"io-status": "ok", "device": "drive_image1", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 21474836480, "filename": "/var/lib/libvirt/images/rhel7.2.qcow2", "cluster-size": 65536, "format": "qcow2", "actual-size": 5331587072, "format-specific": {"type": "qcow2", "data": {"compat": "1.1", "lazy-refcounts": true, "refcount-bits": 16, "corrupt": false}}, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block110", "backing_file_depth": 0, "drv": "qcow2", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/var/lib/libvirt/images/rhel7.2.qcow2", "encryption_key_missing": false}, "type": "unknown"}], "id": "cHd67fdS"}
14: {"return": {}, "id": "TS1SJ9WQ"}
14: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 0, "ready": false, "type": "mirror"}], "id": "GA59wJVC"}
14: {"return": {}, "id": "2f76YsoS"}
14: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "1SACh0ms"}
15: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "4vKqzkG1"}
15: {"return": {}, "id": "tzq3eBk3"}
15: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "g43PqGfX"}
15: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "GDKiSWWW"}
15: {"timestamp": {"seconds": 1459323014, "microseconds": 343062}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "drive_image1", "len": 0, "offset": 0, "speed": 10485760, "type": "mirror"}}
15: {"return": [{"arch": "x86", "current": true, "CPU": 0, "qom_path": "/machine/unattached/device[0]", "pc": 998095, "halted": false, "thread_id": 22405}, {"arch": "x86", "current": false, "CPU": 1, "qom_path": "/machine/unattached/device[3]", "pc": 1037083, "halted": true, "thread_id": 22406}, {"arch": "x86", "current": false, "CPU": 2, "qom_path": "/machine/unattached/device[5]", "pc": 1037083, "halted": true, "thread_id": 22407}, {"arch": "x86", "current": false, "CPU": 3, "qom_path": "/machine/unattached/device[7]", "pc": 1037083, "halted": true, "thread_id": 22408}], "id": "PkjCdBBG"}
15: {"return": [{"io-status": "ok", "device": "drive_image1", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 21474836480, "filename": "/var/lib/libvirt/images/rhel7.2.qcow2", "cluster-size": 65536, "format": "qcow2", "actual-size": 5331587072, "format-specific": {"type": "qcow2", "data": {"compat": "1.1", "lazy-refcounts": true, "refcount-bits": 16, "corrupt": false}}, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block110", "backing_file_depth": 0, "drv": "qcow2", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/var/lib/libvirt/images/rhel7.2.qcow2", "encryption_key_missing": false}, "type": "unknown"}], "id": "YO1lkgJR"}
15: {"return": [{"io-status": "ok", "device": "drive_image1", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 21474836480, "filename": "/var/lib/libvirt/images/rhel7.2.qcow2", "cluster-size": 65536, "format": "qcow2", "actual-size": 5331587072, "format-specific": {"type": "qcow2", "data": {"compat": "1.1", "lazy-refcounts": true, "refcount-bits": 16, "corrupt": false}}, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block110", "backing_file_depth": 0, "drv": "qcow2", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/var/lib/libvirt/images/rhel7.2.qcow2", "encryption_key_missing": false}, "type": "unknown"}], "id": "cHd67fdS"}
15: {"return": {}, "id": "TS1SJ9WQ"}
15: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 0, "ready": false, "type": "mirror"}], "id": "GA59wJVC"}
15: {"return": {}, "id": "2f76YsoS"}
16: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "1SACh0ms"}
16: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "4vKqzkG1"}
16: {"return": {}, "id": "tzq3eBk3"}
16: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "g43PqGfX"}
16: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 0, "offset": 0, "paused": false, "speed": 10485760, "ready": false, "type": "mirror"}], "id": "GDKiSWWW"}

My test script follows (the path of the mirror target is different from the one in mazhang's comment; check that your own path is accessible):
#!/bin/bash

let i=0
exec 3<>/dev/tcp/localhost/6666
echo -e "{ 'execute': 'qmp_capabilities' }" >&3
read response <&3
echo $response
while [ $i -lt 500 ]
do
    echo -e "{'execute': 'query-cpus', 'id': 'PkjCdBBG'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block', 'id': 'YO1lkgJR'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block', 'id': 'cHd67fdS'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'drive-mirror', 'arguments': {'device': 'drive_image1', 'mode': 'absolute-paths', 'format': 'qcow2', 'target': '/home/lvivier/target1.qcow2', 'sync': 'full'}, 'id': 'TS1SJ9WQ'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block-jobs', 'id': 'GA59wJVC'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'block-job-set-speed', 'arguments': {'device': 'drive_image1', 'speed': 10485760}, 'id': '2f76YsoS'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block-jobs', 'id': '1SACh0ms'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block-jobs', 'id': '4vKqzkG1'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'block-job-cancel', 'arguments': {'device': 'drive_image1'}, 'id': 'tzq3eBk3'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block-jobs', 'id': 'g43PqGfX'}" >&3
    read response <&3;  echo "$i: $response"
    echo -e "{'execute': 'query-block-jobs', 'id': 'GDKiSWWW'}" >&3
    read response <&3;  echo "$i: $response"
    let i=$i+1
    sleep 1
done

Comment 12 Fam Zheng 2016-03-30 10:15:30 UTC
It's not clear how this could hang, and I double-checked that I cannot reproduce it. Can you try this patch to rule out the event loop?

---

diff --git a/block/io.c b/block/io.c
index 41d954ca..6e484b4 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2732,7 +2732,7 @@ void bdrv_flush_io_queue(BlockDriverState *bs)
 void bdrv_drained_begin(BlockDriverState *bs)
 {
     if (!bs->quiesce_counter++) {
-        aio_disable_external(bdrv_get_aio_context(bs));
+        /*aio_disable_external(bdrv_get_aio_context(bs));*/
     }
     bdrv_drain(bs);
 }
@@ -2743,5 +2743,5 @@ void bdrv_drained_end(BlockDriverState *bs)
     if (--bs->quiesce_counter > 0) {
         return;
     }
-    aio_enable_external(bdrv_get_aio_context(bs));
+    /*aio_enable_external(bdrv_get_aio_context(bs));*/
 }
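
For reference, a minimal sketch of applying a test patch like this before rebuilding (assumes the diff is saved as /tmp/rule-out-event-loop.patch; the filename is illustrative):

cd ~/Projects/qemu-upstream
patch -p1 < /tmp/rule-out-event-loop.patch   # apply the diff above to block/io.c
make -j"$(nproc)"                            # rebuild and rerun the reproducer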

Comment 13 Laurent Vivier 2016-03-30 12:18:17 UTC
(In reply to Fam Zheng from comment #12)
> It's not clear how this could hang and I double checked that I cannot
> reproduce. Can you try this patch to rule out the event loop?


The result is the same with this patch: it hangs on both ppc64le and x86_64. On x86_64 it seems to happen a little later.

Comment 15 Fam Zheng 2016-04-01 08:38:27 UTC
The hang is an ABBA deadlock between two coroutines. One coroutine is mirror, which has just finished a request (a qcow2 read); the block job is about to complete. It calls bdrv_drain(), which waits for the other coroutine to complete. The other coroutine is a scsi-disk read request. The deadlock happens when the second coroutine is waiting for the first to yield/terminate in order to acquire the qcow2 lock in qemu_co_mutex_lock(). The state looks like this:


mirror                         scsi-disk
-------------------------------------------------------------
do last write
  qcow2:qemu_co_mutex_lock()
  ...
                               scsi disk read
                                 tracked request begin
                                 qcow2:qemu_co_mutex_lock.enter
  qcow2:qemu_co_mutex_unlock()
bdrv_drain
  while (req pending)
    aio_poll

In the scsi-disk coroutine, the qemu_co_mutex_lock() will never return because the mirror coroutine is blocked in the aio_poll(blocking=true).

Upstream has this too.

Comment 20 Qianqian Zhu 2016-09-09 06:17:25 UTC
Moving to Verified per comment 17

Comment 23 errata-xmlrpc 2016-11-07 21:00:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2673.html

