Bug 1404137

Summary: 'block-job-cancel' cannot cancel a "block-stream" job normally
Product: Red Hat Enterprise Linux 7
Component: qemu-kvm-rhev
Version: 7.3
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Yanan Fu <yfu>
Assignee: John Snow <jsnow>
QA Contact: Qianqian Zhu <qizhu>
CC: chayang, jsnow, juzhang, knoel, michen, mrezanin, qizhu, virt-bugs, virt-maint, yfu
Target Milestone: rc
Target Release: ---
Fixed In Version: qemu-kvm-rhev-2.9.0-1.el7
Doc Type: If docs needed, set a value
Clones: 1456088 (view as bug list)
Bug Blocks: 1456088
Type: Bug
Regression: ---
Last Closed: 2017-08-01 23:39:45 UTC

Attachments: debug log for this bug (comment 1)

Description Yanan Fu 2016-12-13 07:34:25 UTC
Description of problem:

Via QMP, execute "block-stream" and then "block-job-cancel". The cancel command does not return; only after the "BLOCK_JOB_COMPLETED" event arrives does "block-job-cancel" return, and then with the error "No active block job on device 'drive_image1'".

Hit this issue with autotest, not manually. It is not related to the guest: both Windows and Linux guests can hit it.

autotest casename: block_stream.simple_test.cancel_sync (per the job log in "Additional info")

Version-Release number of selected component (if applicable):
qemu : qemu-kvm-rhev-2.6.0-28.el7_3.2.x86_64
kernel: kernel-3.10.0-514.2.2.el7.x86_64

How reproducible:
1/15

Steps to Reproduce:
1. Boot a VM.
2. Via QMP, execute 'blockdev-snapshot-sync' ---> 'block-stream' ---> 'block-job-cancel'.
3. The job cannot be canceled (a minimal QMP sketch of these steps follows).
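
For reference, a minimal Python sketch of the steps above. This is a sketch only: the QMP socket path /tmp/qmp.sock and snapshot path /tmp/sn1 are assumptions, not taken from the test environment; the drive id drive_image1 matches the CLI in comment 8.

#!/usr/bin/env python3
import json
import socket

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.connect("/tmp/qmp.sock")   # assumes -qmp unix:/tmp/qmp.sock,server,nowait
rfile = sock.makefile("r")

def send(cmd):
    sock.sendall((json.dumps(cmd) + "\r\n").encode())

def wait_reply():
    # QMP emits one JSON object per line; print async events, return replies.
    while True:
        msg = json.loads(rfile.readline())
        print(msg)
        if "return" in msg or "error" in msg:
            return msg

json.loads(rfile.readline())            # consume the QMP greeting banner
send({"execute": "qmp_capabilities"})   # mandatory handshake
wait_reply()

send({"execute": "blockdev-snapshot-sync",
      "arguments": {"device": "drive_image1", "snapshot-file": "/tmp/sn1",
                    "mode": "absolute-paths", "format": "qcow2"}})
wait_reply()

send({"execute": "block-stream",
      "arguments": {"device": "drive_image1", "speed": 0}})
wait_reply()

# On an affected build this call blocks until BLOCK_JOB_COMPLETED and then
# fails with DeviceNotActive; on a fixed build it returns promptly.
send({"execute": "block-job-cancel", "arguments": {"device": "drive_image1"}})
wait_reply()

On the affected build, the last call is expected to behave as shown in "Actual results" below.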

Actual results:
2016-12-12 14:19:57: {"execute": "blockdev-snapshot-sync", "arguments": {"device": "drive_image1", "snapshot-file": "/usr/share/avocado/data/avocado-vt/images/sn1", "mode": "absolute-paths", "format": "qcow2"}, "id": "q2TGym54"}
2016-12-12 14:19:57: {"return": {}, "id": "q2TGym54"}

2016-12-12 14:19:57: {"execute": "block-stream", "arguments": {"device": "drive_image1", "speed": 0}, "id": "5l8CWxlc"}
2016-12-12 14:19:57: {"return": {}, "id": "5l8CWxlc"}

2016-12-12 14:19:57: {"execute": "query-block-jobs", "id": "Q9CESh2g"}
2016-12-12 14:19:57: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 32212254720, "offset": 0, "paused": false, "speed": 0, "ready": false, "type": "stream"}], "id": "Q9CESh2g"}

2016-12-12 14:19:57: {"execute": "block-job-cancel", "arguments": {"device": "drive_image1"}, "id": "v9VZofcJ"}
2016-12-12 14:23:31: {"timestamp": {"seconds": 1481570611, "microseconds": 661760}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "drive_image1", "len": 32212254720, "offset": 32212254720, "speed": 0, "type": "stream"}}
2016-12-12 14:23:31: {"id": "v9VZofcJ", "error": {"class": "DeviceNotActive", "desc": "No active block job on device 'drive_image1'"}}
2016-12-12 14:23:50: {"timestamp": {"seconds": 1481570617, "microseconds": 930977}, "event": "NIC_RX_FILTER_CHANGED", "data": {"name": "idKuzNdu", "path": "/machine/peripheral/idKuzNdu/virtio-backend"}}

Expected results:
block job can be canceled normally.

Additional info:
job log:
http://10.66.4.244/autotest_static_job_log/1632055/test-results/083-smp_8.8192m.repeat1.Host_RHEL.m7.u3.spice.qcow2.virtio_scsi.up.virtio_net.Win10.x86_64.io-github-autotest-qemu.block_stream.simple_test.cancel_sync/

Comment 1 Yanan Fu 2016-12-13 12:00:23 UTC
Created attachment 1231173 [details]
debug log for this bug

Comment 3 Yanan Fu 2017-01-06 02:33:04 UTC
Hit another issue in the acceptance test, possibly for the same reason.

qemu version: qemu-kvm-rhev-2.6.0-28.el7_3.3.x86_64
kernel version: kernel-3.10.0-514.6.1.el7.x86_64


2017-01-05 06:18:19: {"execute": "query-block", "id": "VE5G6oWn"}
2017-01-05 06:18:19: {"return": [{"io-status": "ok", "device": "drive_image1", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 21474836480, "filename": "/usr/share/avocado/data/avocado-vt/images/rhel73-64-virtio-scsi.qcow2", "cluster-size": 65536, "format": "qcow2", "actual-size": 3972464640, "format-specific": {"type": "qcow2", "data": {"compat": "1.1", "lazy-refcounts": false, "refcount-bits": 16, "corrupt": false}}, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block105", "backing_file_depth": 0, "drv": "qcow2", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/usr/share/avocado/data/avocado-vt/images/rhel73-64-virtio-scsi.qcow2", "encryption_key_missing": false}, "type": "unknown"}], "id": "VE5G6oWn"}

2017-01-05 06:18:19: {"execute": "blockdev-snapshot-sync", "arguments": {"device": "drive_image1", "snapshot-file": "/usr/share/avocado/data/avocado-vt/images/sn1", "mode": "absolute-paths", "format": "qcow2"}, "id": "tQoC4edJ"}
2017-01-05 06:18:19: {"return": {}, "id": "tQoC4edJ"}
2017-01-05 06:18:19: {"execute": "query-block", "id": "vQ1kHOeN"}
2017-01-05 06:18:19: {"return": [{"io-status": "ok", "device": "drive_image1", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"backing-image": {"virtual-size": 21474836480, "filename": "/usr/share/avocado/data/avocado-vt/images/rhel73-64-virtio-scsi.qcow2", "cluster-size": 65536, "format": "qcow2", "actual-size": 3972464640, "format-specific": {"type": "qcow2", "data": {"compat": "1.1", "lazy-refcounts": false, "refcount-bits": 16, "corrupt": false}}, "dirty-flag": false}, "backing-filename-format": "qcow2", "virtual-size": 21474836480, "filename": "/usr/share/avocado/data/avocado-vt/images/sn1", "cluster-size": 65536, "format": "qcow2", "actual-size": 200704, "format-specific": {"type": "qcow2", "data": {"compat": "1.1", "lazy-refcounts": false, "refcount-bits": 16, "corrupt": false}}, "full-backing-filename": "/usr/share/avocado/data/avocado-vt/images/rhel73-64-virtio-scsi.qcow2", "backing-filename": "/usr/share/avocado/data/avocado-vt/images/rhel73-64-virtio-scsi.qcow2", "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block818", "backing_file_depth": 1, "drv": "qcow2", "iops": 0, "bps_wr": 0, "write_threshold": 0, "backing_file": "/usr/share/avocado/data/avocado-vt/images/rhel73-64-virtio-scsi.qcow2", "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/usr/share/avocado/data/avocado-vt/images/sn1", "encryption_key_missing": false}, "type": "unknown"}], "id": "vQ1kHOeN"}

2017-01-05 06:18:19: {"execute": "block-stream", "arguments": {"device": "drive_image1", "speed": 0}, "id": "ZZ0YQrX1"}
2017-01-05 06:18:19: {"return": {}, "id": "ZZ0YQrX1"}
2017-01-05 06:18:19: {"execute": "query-block-jobs", "id": "J5hfm3xT"}
2017-01-05 06:18:19: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 21474836480, "offset": 0, "paused": false, "speed": 0, "ready": false, "type": "stream"}], "id": "J5hfm3xT"}

2017-01-05 06:18:19: {"execute": "block-job-set-speed", "arguments": {"device": "drive_image1", "speed": 10485760}, "id": "LxsGdjL5"}
2017-01-05 06:20:47: {"timestamp": {"seconds": 1483615247, "microseconds": 872427}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "drive_image1", "len": 21474836480, "offset": 21474836480, "speed": 0, "type": "stream"}}
2017-01-05 06:20:47: {"id": "LxsGdjL5", "error": {"class": "DeviceNotActive", "desc": "No active block job on device 'drive_image1'"}}
2017-01-05 06:21:11: {"timestamp": {"seconds": 1483615268, "microseconds": 522056}, "event": "NIC_RX_FILTER_CHANGED", "data": {"name": "id7PgQyi", "path": "/machine/peripheral/id7PgQyi/virtio-backend"}}


Execute "block-stream", then execute "block-job-set-speed" (06:18:19), but no return, and after 2 minutes, get "BLOCK_JOB_COMPLETED" event (06:20:47), then get the result of "block-job-set speed":
06:20:47: {"id": "LxsGdjL5", "error": {"class": "DeviceNotActive", "desc": "No active block job on device 'drive_image1'"}}

Comment 4 Yanan Fu 2017-01-06 02:59:01 UTC
With the same qemu and kernel versions as in comment 3.
Another failing scenario:

2017-01-05 03:16:28: {"execute": "query-block-jobs", "id": "luZRSqrk"}
2017-01-05 03:16:28: {"return": [{"io-status": "ok", "device": "drive_image1", "busy": true, "len": 32212254720, "offset": 562036736, "paused": false, "speed": 0, "ready": false, "type": "stream"}], "id": "luZRSqrk"}
2017-01-05 03:16:29: {"execute": "query-block-jobs", "id": "NbdasH0s"}
2017-01-05 03:32:29: {"execute": "quit", "id": "KxPD7TaU"}
2017-01-05 03:33:08: {"timestamp": {"seconds": 1483605188, "microseconds": 983567}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "drive_image1", "len": 32212254720, "offset": 32212254720, "speed": 0, "type": "stream"}}
2017-01-05 03:33:08: {"return": [], "id": "NbdasH0s"}
2017-01-05 03:33:08: {"return": {}, "id": "KxPD7TaU"}


Execute "query-block-jobs" (03:16:29) has no return, and after several minutes, get "BLOCK_JOB_COMPLETED" event (03:33:08), then get the return for "query-block-jobs".

Hi jsnow,
Could you please have a look at this issue? We have hit it more than once. Thanks!

Comment 5 John Snow 2017-01-12 22:53:43 UTC
OK, will investigate.

Comment 6 John Snow 2017-02-17 01:03:19 UTC
Trying to reproduce now, do you have any tricks for reproducing? Should I let the stream run for quite some time first, or try to cancel it immediately?

What kind of storage media is the VM image on, and what kind of storage media is the snapshot on? (Is the stream operation slow, or fast?)

Comment 7 John Snow 2017-02-17 01:23:44 UTC
None of the logs I can still access appear to include the command line, which I would need in order to try to reproduce this. If you still have the CLI, it would help recreate the conditions necessary for this softlock.

I'll contact Yash Mankad in the meantime for some advice on replicating avocado setups for automating this test myself.

Thanks!
--js

Comment 8 Yanan Fu 2017-02-20 03:29:35 UTC
Hi John,
We hit this issue from time to time. Please refer to the CLI below; it comes from the avocado test, case name "block_stream.simple_test.start_stream".

CLI: 
/usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -machine pc  \
    -nodefaults  \
    -vga qxl \
    -device pci-bridge,id=pci_bridge,bus=pci.0,addr=03,chassis_nr=1 \
    -device intel-hda,bus=pci.0,addr=04 \
    -device hda-duplex  \
    -chardev socket,id=qmp_id_qmp1,path=/var/tmp/avocado_oegWUL/monitor-qmp1-20170216-115906-eZezw2uY,server,nowait \
    -mon chardev=qmp_id_qmp1,mode=control  \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_oegWUL/monitor-catch_monitor-20170216-115906-eZezw2uY,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idMyOH7w  \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/avocado_oegWUL/serial-serial0-20170216-115906-eZezw2uY,server,nowait \
    -device isa-serial,chardev=serial_id_serial0 \
    -device virtio-serial-pci,id=virtio_serial_pci0,bus=pci.0,addr=05 \
    -chardev socket,path=/var/tmp/avocado_oegWUL/virtio_port-vs-20170216-115906-eZezw2uY,nowait,id=idTf6FMg,server \
    -device virtserialport,id=idlnaE32,name=vs,bus=virtio_serial_pci0.0,chardev=idTf6FMg \
    -object rng-random,filename=/dev/random,id=passthrough-EzXGP1It \
    -device virtio-rng-pci,id=virtio-rng-pci-H621ETef,rng=passthrough-EzXGP1It,bus=pci.0,addr=06  \
    -chardev socket,id=seabioslog_id_20170216-115906-eZezw2uY,path=/var/tmp/avocado_oegWUL/seabios-20170216-115906-eZezw2uY,server,nowait \
    -device isa-debugcon,chardev=seabioslog_id_20170216-115906-eZezw2uY,iobase=0x402 \
    -device ich9-usb-ehci1,id=usb1,addr=1d.7,multifunction=on,bus=pci.0 \
    -device ich9-usb-uhci1,id=usb1.0,multifunction=on,masterbus=usb1.0,addr=1d.0,firstport=0,bus=pci.0 \
    -device ich9-usb-uhci2,id=usb1.1,multifunction=on,masterbus=usb1.0,addr=1d.2,firstport=2,bus=pci.0 \
    -device ich9-usb-uhci3,id=usb1.2,multifunction=on,masterbus=usb1.0,addr=1d.4,firstport=4,bus=pci.0 \
    -device nec-usb-xhci,id=usb2,bus=pci.0,addr=07 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=08 \
    -drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=/home/kvm_autotest_root/images/win10-64-virtio-scsi.qcow2 \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -device virtio-net-pci,mac=9a:20:21:22:23:24,id=idyQ6zAf,vectors=4,netdev=idX3NVY3,bus=pci.0,addr=09  \
    -netdev tap,id=idX3NVY3,vhost=on,vhostfd=21,fd=20 \
    -m 8192  \
    -smp 8,maxcpus=8,cores=4,threads=1,sockets=2  \
    -cpu 'Opteron_G4',+kvm_pv_unhalt,hv_spinlocks=0x1fff,hv_vapic,hv_time \
    -drive id=drive_cd1,if=none,snapshot=off,aio=native,cache=none,media=cdrom,file=/home/kvm_autotest_root/iso/windows/winutils.iso \
    -device scsi-cd,id=cd1,drive=drive_cd1 \
    -device usb-tablet,id=usb-tablet1,bus=usb2.0,port=1  \
    -vnc :0 \
    -rtc base=localtime,clock=host,driftfix=slew  \
    -boot order=cdn,once=c,menu=off,strict=off  \
    -no-hpet \
    -enable-kvm  \
    -watchdog i6300esb \
    -watchdog-action reset \
    -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0a \
    -monitor stdio \


Boot the VM and do block_stream:
11:59:12 DEBUG| (monitor qmp1) Sending command 'blockdev-snapshot-sync'
11:59:12 DEBUG| Send command: {'execute': 'blockdev-snapshot-sync', 'arguments': {'device': u'drive_image1', 'snapshot-file': '/home/kvm_autotest_root/images/sn1', 'mode': 'absolute-paths', 'format': 'qcow2'}, 'id': 'ua7vPoev'}
11:59:12 INFO | [qemu output] Formatting '/home/kvm_autotest_root/images/sn1', fmt=qcow2 size=32212254720 backing_file=/home/kvm_autotest_root/images/win10-64-virtio-scsi.qcow2 backing_fmt=qcow2 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16

11:59:12 DEBUG| (monitor qmp1) Sending command 'block-stream'
11:59:12 DEBUG| Send command: {'execute': 'block-stream', 'arguments': {'device': u'drive_image1', 'speed': 0}, 'id': 'rSubpE9K'}
11:59:12 DEBUG| (monitor qmp1) Sending command 'query-block-jobs'
11:59:12 DEBUG| Send command: {'execute': 'query-block-jobs', 'id': 'AhH7RKBc'}


Keep executing "query-block-jobs" and you can hit this issue: QMP gives no return, the "BLOCK_JOB_COMPLETED" event arrives after about 15 minutes, and only then does "query-block-jobs" return an empty list (see the sketch after the log below).

2017-02-16 12:00:02: {"execute": "query-block-jobs", "id": "iQMNcLKk"}
2017-02-16 12:15:28: {"timestamp": {"seconds": 1487265310, "microseconds": 506756}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "drive_image1", "len": 32212254720, "offset": 32212254720, "speed": 0, "type": "stream"}}
2017-02-16 12:15:28: {"return": [], "id": "iQMNcLKk"}
2017-02-16 12:15:28: {"timestamp": {"seconds": 1487265320, "microseconds": 707396}, "event": "NIC_RX_FILTER_CHANGED", "data": {"name": "idyQ6zAf", "path": "/machine/peripheral/idyQ6zAf/virtio-backend"}}

Comment 9 Yanan Fu 2017-02-20 10:14:15 UTC
(In reply to John Snow from comment #6)
> Trying to reproduce now, do you have any tricks for reproducing? Should I
> let the stream run for quite some time first, or try to cancel it
> immediately?
> 
> What kind of storage media is the VM image on, and what kind of storage
> media is the snapshot on? (Is the stream operation slow, or fast?)

Hi John,
This issue is not limited to "block-job-cancel".
According to the description and comments 3 and 4, there are three scenarios:

1. "block-job-cancel" gives no return until the "BLOCK_JOB_COMPLETED" event arrives, then returns an error. Both cancelling the job immediately and letting it run for quite some time first can reproduce this.

2. "block-job-set-speed" gives no return until "BLOCK_JOB_COMPLETED" arrives, then returns an error.

3. "query-block-jobs" gives no return until "BLOCK_JOB_COMPLETED" arrives, then returns (with an empty job list).


These issues were reproduced in the avocado test (case name block_stream.simple_test.start_stream can reproduce it); the storage medium is a local qcow2 image, and the snapshot is a local qcow2 image too.

The qemu command line is provided in comment 8; hope it is useful. Thanks!

Comment 10 John Snow 2017-03-30 00:08:03 UTC
This may be related to https://bugzilla.redhat.com/show_bug.cgi?id=1367369 and I may ask you to re-test once the fix for that issue goes into the tree.

Comment 11 John Snow 2017-04-26 22:58:10 UTC
Can we re-test under rhv 2.9.0 branch, please? I'd like to follow-up on comment #10.

Comment 12 Yanan Fu 2017-05-02 02:58:10 UTC
(In reply to John Snow from comment #11)
> Can we re-test under rhv 2.9.0 branch, please? I'd like to follow-up on
> comment #10.

Hi John,
Test version: qemu-kvm-rhev-2.9.0-1.el7. Ran the test with avocado, case name "drive_mirror.simple_test..on_localfs.2qcow2", repeated 10 times (60 cases in total); all passed. Job link:
http://10.66.4.244/kvm_autotest_job_log/?jobid=1835106

So I think it is OK with the rhv 2.9 branch!

Comment 13 John Snow 2017-05-02 16:43:23 UTC
(In reply to Yanan Fu from comment #12)
> (In reply to John Snow from comment #11)
> > Can we re-test under rhv 2.9.0 branch, please? I'd like to follow-up on
> > comment #10.
> 
> Hi John,
> Test version: qemu-kvm-rhev-2.9.0-1.el7. Ran the test with avocado, case
> name "drive_mirror.simple_test..on_localfs.2qcow2", repeated 10 times (60
> cases in total); all passed. Job link:
> http://10.66.4.244/kvm_autotest_job_log/?jobid=1835106
> 
> So I think it is OK with the rhv 2.9 branch!

AWESOME! Thanks, Yanan! :)

Mirek, I'm moving this to POST to indicate that this was fixed by the 2.9.0 rebase.

Comment 15 Qianqian Zhu 2017-05-17 01:53:02 UTC
Per comment 12, moving to VERIFIED.

Comment 16 Yanan Fu 2017-05-25 08:16:23 UTC
Hi John,

I hit this issue again in the acceptance test for the latest rhev 7.3.z build.
version: qemu-kvm-rhev-2.6.0-28.el7_3.10.x86_64


This issue was verified after the rebase to the rhv 2.9 branch, and it may be related to bz 1367369 per comment 10. Do we need to clone this bug and fix it on the 7.3.z branch?

Best wishes
Yanan Fu

Comment 17 John Snow 2017-05-26 18:51:30 UTC
Yes, if the issue is still being observed in 7.3.z, please fork this BZ, assign me directly, and I will prepare a backport for you. I'll get started on that now.

Comment 19 errata-xmlrpc 2017-08-01 23:39:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2392
