Bug 811683 - deal with change from RHEL 6.2 sync block_job_cancel to RHEL 6.3 async block-job-cancel
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt
Version: 6.3
Hardware: All Linux
Priority: medium Severity: medium
Target Milestone: rc
Target Release: 6.2
Assigned To: Eric Blake
QA Contact: Virtualization Bugs
Depends On: 582475 812085 813953 814080
Blocks: 525307 580954 638506 638508 638509 748534 756082 769496 786141 799055 802284 806280 806432 815791 830861 831532 835344 835345 835722 865384
Reported: 2012-04-11 13:33 EDT by Eric Blake
Modified: 2013-01-09 19:51 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 582475
: 815791 (view as bug list)
Environment:
Last Closed: 2012-06-20 02:54:20 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Comment 1 Eric Blake 2012-04-11 13:42:01 EDT
The initial patches were done under the auspices of bug 638506, but getting async block_job_cancel to work correctly with libvirt is important whether or not we also get live block migration working.
Comment 8 Eric Blake 2012-04-18 13:05:08 EDT
Upstream raised another issue where a semantic difference would be desirable:

https://lists.gnu.org/archive/html/qemu-devel/2012-04/msg02273.html

If upstream indeed goes with block-job-set-speed being callable at any time, and not just when a block job is active, then this semantic change from block_job_set_speed would be another thing that libvirt would like to differentiate on based on the spelling of the monitor command.  I'm not sure whether to clone this into another libvirt BZ, but depending on whether qemu 1.1 gets the semantics fixed in time, this is something that libvirt should be aware of.  (I suppose that libvirt could blindly try to set speed in advance, and fall back to setting it after the job, as a mitigation if we cannot rely on the spelling of the command to tell the difference).
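
The mitigation mentioned in the last sentence could look roughly like the sketch below. It is only an illustration, not libvirt code: `send`, `MonitorError`, and `strict_monitor` are hypothetical names, and the argument names passed to block-job-set-speed are assumptions.

```python
class MonitorError(Exception):
    """Hypothetical error raised when the monitor rejects a command."""


def start_pull_with_speed(send, device, speed):
    """Start a block pull and apply a speed limit, trying the early set first.

    `send(command, arguments)` is a hypothetical callable that runs one
    monitor command and raises MonitorError if qemu rejects it.
    Returns "before" or "after" to show when the speed was applied.
    """
    speed_args = {"device": device, "speed": speed}  # argument names assumed
    try:
        # Blindly try to set the speed before any job exists.
        send("block-job-set-speed", speed_args)
        applied = "before"
    except MonitorError:
        applied = None
    send("block-stream", {"device": device})
    if applied is None:
        # Fall back to setting the speed once the job is running.
        send("block-job-set-speed", speed_args)
        applied = "after"
    return applied


def strict_monitor():
    """Fake monitor that rejects set-speed while no job is active."""
    state = {"job": False, "log": []}

    def send(command, arguments):
        state["log"].append(command)
        if command == "block-stream":
            state["job"] = True
        elif command == "block-job-set-speed" and not state["job"]:
            raise MonitorError("no active block job")

    return send, state


send, state = strict_monitor()
print(start_pull_with_speed(send, "drive-virtio-disk0", 1048576))
print(state["log"])
```

In the fallback path the job runs briefly without a limit, which is why being able to tell the two semantics apart from the command spelling would be preferable.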
Comment 9 Wayne Sun 2012-04-19 03:21:22 EDT
pkgs:
libvirt-0.9.10-13.el6.x86_64
qemu-kvm-0.12.1.2-2.275.el6.x86_64
kernel-2.6.32-251.el6.x86_64

prepare a running domain with qed img
# qemu-img info /var/lib/libvirt/images/libvirt_test_api 
image: /var/lib/libvirt/images/libvirt_test_api
file format: qed
virtual size: 10G (10737418240 bytes)
disk size: 1.2G
cluster_size: 65536
You have new mail in /var/spool/mail/root

1. run blockpull
# virsh blockpull libvirt_test_api vda 1
Block Pull started

2. check blockjob info
# virsh blockjob libvirt_test_api vda --info
Block Pull: [ 14 %]    Bandwidth limit: 1 MB/s

3. abort block job with --async
# virsh blockjob libvirt_test_api vda --abort --async
immediately returned

check in libvirtd.log:
2012-04-19 06:58:53.632+0000: 10291: debug : virJSONValueToString:1102 : result={"execute":"block_job_cancel","arguments":{"device":"drive-virtio-disk0"},"id":"libvirt-8"}
2012-04-19 06:58:53.632+0000: 10291: debug : virEventPollUpdateHandle:151 : EVENT_POLL_UPDATE_HANDLE: watch=8 events=15
2012-04-19 06:58:53.632+0000: 10291: debug : virEventPollInterruptLocked:706 : Interrupting
2012-04-19 06:58:53.632+0000: 10291: debug : qemuMonitorSend:823 : QEMU_MONITOR_SEND_MSG: mon=0x7f59b0000cf0 msg={"execute":"block_job_cancel","arguments":{"device":"drive-virtio-disk0"},"id":"libvirt-8"}^M
 fd=-1

With --async, libvirt still passes the block_job_cancel command to qemu. This is not desired; it should send block-job-cancel, right? (P.S. The test shows no difference from using only --abort.)
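
The log check above can be automated with a small helper. This is a minimal sketch that assumes libvirtd.log lines have the QEMU_MONITOR_SEND_MSG shape shown in the excerpt; `sent_commands` is a hypothetical name, not part of libvirt.

```python
import json
import re

# Matches the monitor-send lines shown in the log excerpt above.
SEND_RE = re.compile(r"QEMU_MONITOR_SEND_MSG: mon=\S+ msg=(\{.*\})")


def sent_commands(log_text):
    """Return the monitor command names libvirt sent, in log order."""
    names = []
    for line in log_text.splitlines():
        match = SEND_RE.search(line)
        if match is None:
            continue
        try:
            names.append(json.loads(match.group(1))["execute"])
        except (ValueError, KeyError):
            continue  # not a well-formed JSON command; skip the line
    return names


sample = (
    '2012-04-19 06:58:53.632+0000: 10291: debug : qemuMonitorSend:823 : '
    'QEMU_MONITOR_SEND_MSG: mon=0x7f59b0000cf0 '
    'msg={"execute":"block_job_cancel","arguments":'
    '{"device":"drive-virtio-disk0"},"id":"libvirt-8"}\r\n fd=-1'
)
print(sent_commands(sample))
```

Running it over the whole log makes it easy to see whether block_job_cancel or block-job-cancel was sent.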
Comment 10 Eric Blake 2012-04-19 10:33:43 EDT
(In reply to comment #9)
> pkgs:
> libvirt-0.9.10-13.el6.x86_64
> qemu-kvm-0.12.1.2-2.275.el6.x86_64
> kernel-2.6.32-251.el6.x86_64

There's your problem. According to bug 812085, RHEV didn't supply the name block-job-cancel until qemu-kvm-rhev-0.12.1.2-2.278.el6

> 
> with --async libvirt still pass block_job_cancel command to qemu, this is not
> desired, should send block-job-cancel, right? (p.s. the test shows no
> difference with only use --abort)

You are seeing that libvirt _correctly_ detected the spelling provided by the build; however, as qemu-kvm*.275 has an asynchronous cancel but only the older synchronous name, you would also notice that libvirt ends up emitting double events from qemu (one synthesized by libvirt, since libvirt thinks qemu-kvm won't emit the event due to the wrong spelling, and one directly from the qemu event).
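
The decision described above can be sketched as follows. This is an illustration only, not the libvirt implementation: `plan_cancel` and `events_seen` are hypothetical names, and the set of supported command names is assumed to come from probing the monitor.

```python
def plan_cancel(supported):
    """Pick the cancel spelling and whether libvirt synthesizes the event.

    `supported` is the set of command names the monitor reports
    (assumed to come from a capability probe of the running qemu).
    """
    if "block-job-cancel" in supported:
        # New spelling: qemu is expected to emit the event itself.
        return "block-job-cancel", False
    if "block_job_cancel" in supported:
        # Old spelling: libvirt assumes qemu stays silent and
        # synthesizes the event on its own.
        return "block_job_cancel", True
    raise LookupError("no block job cancel command available")


def events_seen(libvirt_synthesizes, qemu_emits):
    """Count the abort events a client would observe."""
    return int(libvirt_synthesizes) + int(qemu_emits)


# A build like qemu-kvm*.275: old name only, but qemu still emits an event.
name, synthesize = plan_cancel({"block_job_cancel"})
print(name, events_seen(synthesize, qemu_emits=True))
```

With the old name and a qemu that emits its own event, the count comes out as two, which is the double event described above.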
Comment 11 Eric Blake 2012-04-19 17:20:47 EDT
Also, Alex Jia found a bug detected by valgrind in 'virsh blockpull --wait ...', so I'm moving this back to ASSIGNED.
Comment 12 Wayne Sun 2012-04-20 02:32:06 EDT
(In reply to comment #10)

> 
> There's your problem. According to bug 812085, RHEV didn't supply the name
> block-job-cancel until qemu-kvm-rhev-0.12.1.2-2.278.el6
> 
My fault. After updating to qemu-kvm-rhev-0.12.1.2-2.282.el6.x86_64, I retested and checked the log:

2012-04-20 03:03:39.448+0000: 2157: debug : virJSONValueToString:1102 : result={"execute":"block-job-cancel","arguments":{"device":"drive-virtio-disk0"},"id":"libvirt-10"}
2012-04-20 03:03:39.448+0000: 2157: debug : virEventPollUpdateHandle:151 : EVENT_POLL_UPDATE_HANDLE: watch=16 events=15
2012-04-20 03:03:39.448+0000: 2157: debug : virEventPollInterruptLocked:706 : Interrupting
2012-04-20 03:03:39.448+0000: 2157: debug : qemuMonitorSend:823 : QEMU_MONITOR_SEND_MSG: mon=0x7ff57c007aa0 msg={"execute":"block-job-cancel","arguments":{"device":"drive-virtio-disk0"},"id":"libvirt-10"}^M
 fd=-1 

It is working as expected.
> > 
> > with --async libvirt still pass block_job_cancel command to qemu, this is not
> > desired, should send block-job-cancel, right? (p.s. the test shows no
> > difference with only use --abort)
> 
> You are seeing that libvirt _correctly_ detected the spelling provided by the
> build; however, as qemu-kvm*.275 has an asynchronous cancel but only the older
> synchronous name, you would also notice that libvirt ends up emitting double
> events from qemu (one synthesized by libvirt, since libvirt thinks qemu-kvm
> won't emit the event due to the wrong spelling, and one directly from the qemu
> event).
Thanks for explaining this.

-----
Other steps:
1. test with --wait with blockpull
# valgrind -v virsh blockpull libvirt_test_api vda --wait

2. partial blockpull 
# virsh blockpull libvirt_test_api vda --base /var/lib/libvirt/images/qed1.img 
Block Pull started
Comment 14 Eric Blake 2012-04-23 23:38:42 EDT
Bug 813953 may require one further patch for this BZ.
Comment 15 Eric Blake 2012-04-24 00:20:20 EDT
*** Bug 814080 has been marked as a duplicate of this bug. ***
Comment 16 Eric Blake 2012-04-24 00:22:16 EDT
Back to ASSIGNED while we wait on bug 813953; the memory leak has been split off to bug 814080.
Comment 17 EricLee 2012-04-24 04:39:24 EDT
Does the new version only support "block-job-cancel", but not support "block_job_cancel"? 
Because I have tested without '--async' option, and got the same result as with '--async'.

The versions I used:
# rpm -qa libvirt qemu-kvm-rhev kernel
kernel-2.6.32-262.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.282.el6.x86_64
libvirt-0.9.10-13.el6.x86_64

# virsh blockpull qed /var/lib/libvirt/images/qed_backing.img 1
Block Pull started
# virsh blockjob qed /var/lib/libvirt/images/qed_backing.img --abort
returned immediately

And the log in libvirtd.log:
2012-04-24 08:16:00.380+0000: 1393: debug : qemuDomainObjBeginJobInternal:753 : Starting job: modify (async=none)
2012-04-24 08:16:00.416+0000: 1393: debug : qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7f6f70002960 refs=3
2012-04-24 08:16:00.416+0000: 1393: debug : qemuMonitorBlockJob:2745 : mon=0x7f6f70002960, device=drive-virtio-disk0, base=(null), bandwidth=0, info=(nil), mode=0, async=1
2012-04-24 08:16:00.416+0000: 1393: debug : qemuMonitorSend:823 : QEMU_MONITOR_SEND_MSG: mon=0x7f6f70002960 msg={"execute":"block-job-cancel","arguments":{"device":"drive-virtio-disk0"},"id":"libvirt-28"}^M
 fd=-1
2012-04-24 08:16:00.416+0000: 1392: debug : qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7f6f70002960 refs=4
2012-04-24 08:16:00.416+0000: 1392: debug : qemuMonitorIOWrite:432 : QEMU_MONITOR_IO_WRITE: mon=0x7f6f70002960 buf={"execute":"block-job-cancel","arguments":{"device":"drive-virtio-disk0"},"id":"libvirt-28"}^M
 len=94 ret=94 errno=11
2012-04-24 08:16:00.416+0000: 1392: debug : qemuMonitorUnref:210 : QEMU_MONITOR_UNREF: mon=0x7f6f70002960 refs=3
Comment 18 Paolo Bonzini 2012-04-24 04:43:34 EDT
> Does the new version only support "block-job-cancel", but not support
> "block_job_cancel"? 

Correct.
Comment 19 Eric Blake 2012-04-24 08:10:10 EDT
(In reply to comment #17)
> Does the new version only support "block-job-cancel", but not support
> "block_job_cancel"? 
> Because I have tested without '--async' option, and got the same result as with
> '--async'.

In practice, the window where async matters is very small.  But the general idea is that with RHEL 6.2, libvirt will issue 'block_job_cancel' in isolation, regardless of the --async flag.  In RHEL 6.3, libvirt will issue 'block-job-cancel' in isolation with the --async flag; without the --async flag, libvirt will issue 'block-job-cancel' followed by one or more 'query-block-jobs' calls in succession (the first query-block-jobs will be as soon as possible, and any additional calls will be at 500ms intervals).

Another thing to test is how many block job events are issued.  With libvirt from RHEL 6.2, you would not get an event on a block job abort from either 6.2 or 6.3 qemu.  With the new libvirt semantics, you should now get exactly one event on block job abort; and that event will either come from qemu (if you are using RHEV 6.3 qemu with block-job-cancel) or be synthesized by libvirt (if you are using RHEL 6.2 qemu with block_job_cancel).  If you ever get double events from libvirt, that's a sign of an impedance mismatch between libvirt and qemu.  Furthermore, if you are testing the RHEL 6.2 interface, remember that you have to test with QED images, as RHEL 6.2 didn't support block pull on qcow2.
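
The RHEL 6.3 cancel sequence described in this comment can be sketched as below. It is a simplified model rather than libvirt code: `send` and `FakeMonitor` are hypothetical, and the job list is assumed to be a list of dictionaries that each carry a "device" key.

```python
import time


def abort_block_job(send, device, async_flag, poll_interval=0.5, sleep=time.sleep):
    """Cancel a block job; without async_flag, wait until the job is gone.

    `send(command, arguments)` is a hypothetical callable that runs one
    monitor command and returns its result.
    """
    send("block-job-cancel", {"device": device})
    if async_flag:
        return  # the caller learns about completion from the event
    while True:
        # First query happens immediately, later ones after each interval.
        jobs = send("query-block-jobs", None)
        if not any(job.get("device") == device for job in jobs):
            return
        sleep(poll_interval)


class FakeMonitor:
    """Fake monitor whose job disappears after a given number of queries."""

    def __init__(self, polls_until_gone):
        self.remaining = polls_until_gone
        self.log = []

    def __call__(self, command, arguments):
        self.log.append(command)
        if command == "query-block-jobs":
            self.remaining -= 1
            if self.remaining > 0:
                return [{"device": "drive-virtio-disk0", "type": "stream"}]
            return []
        return {}


mon = FakeMonitor(polls_until_gone=2)
abort_block_job(mon, "drive-virtio-disk0", async_flag=False, sleep=lambda _: None)
print(mon.log)
```

With async_flag=True the log would contain only block-job-cancel, which is what a tester should expect to see in libvirtd.log for --abort --async.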
Comment 20 Eric Blake 2012-04-24 10:10:15 EDT
Moving this back to ON_QA; any remaining changes that depend on the outcome of bug 813953 will be split into a new patch, and we can already test that libvirt targets the names 'block-stream', 'block-job-set-speed', and 'block-job-cancel' when testing against qemu-kvm-rhev-0.12.1.2-2.282.el6.x86_64 or newer.
Comment 21 EricLee 2012-04-25 01:42:01 EDT
(In reply to comment #19)
> (In reply to comment #17)
> > Does the new version only support "block-job-cancel", but not support
> > "block_job_cancel"? 
> > Because I have tested without '--async' option, and got the same result as with
> > '--async'.
> 
> In practice, the window where async matters is very small.  But the general
> idea is that with RHEL 6.2, libvirt will issue 'block_job_cancel' in isolation,
> regardless of the --async flag; in RHEL 6.3, libvirt will issue
> 'block-job-cancel' in isolation with the --async flag, but without the --async
> flag libvirt will issue 'block-job-cancel' followed by one or more
> 'query-block-jobs' in succession (the first query-block-jobs will be as soon as
> possible, any additional calls will be in 500ms intervals).
> 

Thanks for explaining.

> Another thing to test is how many block job events are issued.  With libvirt
> from RHEL 6.2, you would not get an event on a block job abort from either 6.2
> or 6.3 qemu.  With the new libvirt semantics, you should now get exactly one
> event on block job abort; and that event will either come from qemu (if you are
> using RHEV 6.3 qemu with block-job-cancel) or be synthesized by libvirt (if you
> are using RHEL 6.2 qemu with block_job_cancel).  If you ever get double events
> from libvirt, that's a sign of an impedance mismatch between libvirt and qemu. 
> Furthermore, if you are testing the RHEL 6.2 interface, remember that you have
> to test with QED images, as RHEL 6.2 didn't support block pull on qcow2.
Comment 22 Wayne Sun 2012-04-25 03:38:55 EDT
pkgs:
libvirt-0.9.10-14.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.285.el6.x86_64
kernel-2.6.32-262.el6.x86_64

1. prepare a domain with a qed image disk
# virsh dumpxml dom
...
    <disk type='file' device='disk'>
      <driver name='qemu' type='qed' cache='none'/>
      <source file='/var/lib/libvirt/images/qed.img'/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
...

2. create an image with a backing file
# qemu-img create -f qed -b /var/lib/libvirt/images/qed.img /var/lib/libvirt/images/qed1.img
Formatting '/var/lib/libvirt/images/qed1.img', fmt=qed size=8388608000 backing_file='/var/lib/libvirt/images/qed.img' cluster_size=0 table_size=0 

3. edit the domain disk to use the new image
# virsh edit dom
...
    <disk type='file' device='disk'>
      <driver name='qemu' type='qed' cache='none'/>
      <source file='/var/lib/libvirt/images/qed1.img'/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
...

4. start domain
# virsh start dom

5. check with/without async
5.1 check without --async
# virsh blockpull dom vda 1
Block Pull started
# virsh blockjob dom vda --abort

check log:
2012-04-25 05:14:49.601+0000: 2594: debug : virDomainBlockJobAbort:17883 : dom=0x7fe960054e20, (VM: name=dom, uuid=e0027b60-e4ed-8f4e-ee17-27a3159cd8f3), disk=vda, flags=0
2012-04-25 05:14:49.601+0000: 2594: debug : qemuDomainObjBeginJobInternal:753 : Starting job: modify (async=none)
2012-04-25 05:14:49.681+0000: 2594: debug : qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7fe960002e30 refs=3
2012-04-25 05:14:49.681+0000: 2594: debug : qemuMonitorBlockJob:2782 : mon=0x7fe960002e30, device=drive-virtio-disk0, base=(null), bandwidth=0, info=(nil), mode=0, async=1
...

2012-04-25 05:14:49.690+0000: 2594: debug : qemuMonitorSend:823 : QEMU_MONITOR_SEND_MSG: mon=0x7fe960002e30 msg={"execute":"block-job-cancel","arguments":{"device":"drive-virtio-disk0"},"id":"libvirt-9"}^M
 fd=-1

...
2012-04-25 05:14:49.690+0000: 2592: debug : qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7fe960002e30 refs=4
2012-04-25 05:14:49.691+0000: 2592: debug : qemuMonitorIOWrite:432 : QEMU_MONITOR_IO_WRITE: mon=0x7fe960002e30 buf={"execute":"block-job-cancel","arguments":{"device":"drive-virtio-disk0"},"id":"libvirt-9"}^M
 len=93 ret=93 errno=11
...

2012-04-25 05:14:49.717+0000: 2594: debug : virJSONValueToString:1105 : result={"execute":"query-block-jobs","id":"libvirt-10"}
2012-04-25 05:14:49.717+0000: 2594: debug : virEventPollUpdateHandle:151 : EVENT_POLL_UPDATE_HANDLE: watch=10 events=15
2012-04-25 05:14:49.717+0000: 2594: debug : virEventPollInterruptLocked:706 : Interrupting
2012-04-25 05:14:49.717+0000: 2594: debug : qemuMonitorSend:823 : QEMU_MONITOR_SEND_MSG: mon=0x7fe960002e30 msg={"execute":"query-block-jobs","id":"libvirt-10"}^M
 fd=-1

...

2012-04-25 05:14:49.718+0000: 2592: debug : qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7fe960002e30 refs=4
2012-04-25 05:14:49.718+0000: 2592: debug : qemuMonitorIOWrite:432 : QEMU_MONITOR_IO_WRITE: mon=0x7fe960002e30 buf={"execute":"query-block-jobs","id":"libvirt-10"}^M
 len=50 ret=50 errno=11

One query-block-jobs command followed block-job-cancel, and there was only one event on block job abort.


5.2 check with --async
# virsh blockpull dom vda 1
Block Pull started
# virsh blockjob dom vda --abort --async

check libvirtd.log:

As in comment 12, only one block-job-cancel command was found, with no query-block-jobs following it.

6. test on 6.2
pkgs:
libvirt-0.9.4-23.el6.x86_64
qemu-kvm-0.12.1.2-2.209.el6.x86_64

# virsh blockpull dom vda 1

# virsh blockjob dom vda --abort

check in libvirtd.log:
14:28:20.856: 4272: debug : virDomainBlockJobAbort:16260 : dom=0x7fdf78009db0, (VM: name=dom, uuid=5b5fee7b-5e4d-ff9c-6a54-df6a51f75572), path=0x7fdf78008d00, flags=0
14:28:20.856: 4272: debug : qemuMonitorBlockJob:2562 : mon=0x7fdf74005de0, device=0x7fdf78001940, bandwidth=0, info=(nil), mode=0
14:28:20.871: 4272: debug : virDomainFree:2153 : dom=0x7fdf78009db0, (VM: name=dom, uuid=5b5fee7b-5e4d-ff9c-6a54-df6a51f75572), 
14:28:20.872: 4267: debug : virConnectClose:1323 : conn=0x7fdf7c0962d0
14:28:20.873: 4267: debug : qemuProcessAutoDestroyRun:3706 : conn=0x7fdf7c0962d0


No event on block job abort found (neither block-job-cancel nor block_job_cancel)

7. test with libvirt on 6.2 but qemu on 6.3
pkgs:
libvirt-0.9.4-23.el6.x86_64
qemu-kvm-0.12.1.2-2.275.el6.x86_64

# virsh blockpull dom vda 1

# virsh blockjob dom vda --abort

check in libvirtd.log:

15:30:26.441: 6367: debug : virDomainBlockJobAbort:16260 : dom=0x7f0d38000b20, (VM: name=dom, uuid=5b5fee7b-5e4d-ff9c-6a54-df6a51f75572), path=0x7f0d38000970, flags=0
15:30:26.441: 6367: debug : qemuMonitorBlockJob:2562 : mon=0x7f0d40000ce0, device=0x7f0d380009b0, bandwidth=0, info=(nil), mode=0
15:30:26.442: 6367: debug : virDomainFree:2153 : dom=0x7f0d38000b20, (VM: name=dom, uuid=5b5fee7b-5e4d-ff9c-6a54-df6a51f75572), 
15:30:26.443: 6363: debug : virConnectClose:1323 : conn=0x7f0d30000a90
15:30:26.444: 6363: debug : qemuProcessAutoDestroyRun:3706 : conn=0x7f0d30000a90

Also no event on block job abort found (neither block-job-cancel nor block_job_cancel)


It works as expected. 
So, is this enough to verify this bug?
Comment 23 Eric Blake 2012-04-25 09:57:31 EDT
Yes, I think you've verified it.
Comment 24 Wayne Sun 2012-04-25 22:10:32 EDT
Thanks, marking this as verified.
Comment 26 errata-xmlrpc 2012-06-20 02:54:20 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0748.html
