Bug 1210903 - Blockcopy --wait fails using qemu 2.2.x
Summary: Blockcopy --wait fails using qemu 2.2.x
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Virtualization Tools
Classification: Community
Component: libvirt
Version: unspecified
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: ---
Assignee: Libvirt Maintainers
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2015-04-10 21:02 UTC by Matthew Schumacher
Modified: 2016-04-10 21:02 UTC (History)
10 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2016-04-10 21:02:59 UTC
Embargoed:


Attachments (Terms of Use)
debug output with timestamps (638.33 KB, text/plain)
2015-04-28 23:57 UTC, Matthew Schumacher
no flags Details

Description Matthew Schumacher 2015-04-10 21:02:49 UTC
Description of problem:

Blockcopy --wait doesn't seem to wait for the mirroring phase when using qemu 2.2.x which causes scripts and other functions such as blockcopy --pivot to fail.

Version-Release number of selected component (if applicable):
qemu-2.2.1
libvirt-1.2.14
python 2.7.5
compiled on slackware 14.1

How reproducible:

Always:

Steps to Reproduce:

1. install qemu-2.2.1, libvirt-1.2.14

root@wasvirt2:~# virsh blockcopy test vda /datastore/test.qcow2 --verbose --wait
Now in mirroring phase

Immediately reports it's in mirroring phase without showing the block copy progress.  You can check the block copy progress like this:

root@wasvirt2:~#  virsh blockjob test vda --info
Block Copy: [ 58 %]

We should never see this given we passed a --wait.

Expected results:

virsh blockcopy test vda /datastore/test.qcow2 --verbose --wait
Block Copy: [100 %]

Then virsh unblocks, and we can do stuff with the mirrored blockstore.

Additional info:

This breaks --pivot, because libvirt doesn't wait for the mirror to complete before trying to pivot:

root@wasvirt2:~# virsh blockcopy test vda /glustervol1/vm/test/test.qcow2 --verbose --wait --pivot
error: failed to pivot job for disk vda
error: internal error: unable to execute QEMU command 'block-job-complete': The active block job for device 'drive-virtio-disk0' cannot be completed

If I revert back to qemu 2.1.x then this works fine.

If I do the block copy without --wait, and then call --pivot it works fine:

root@wasvirt2:~# virsh blockcopy test vda /datastore/test.qcow2 --verbose --wait
Now in mirroring phase

root@wasvirt2:~#  virsh blockjob test vda --info
Block Copy: [ 96 %]

root@wasvirt2:~#  virsh blockjob test vda --info
Block Copy: [100 %]

root@wasvirt2:~#  virsh blockjob test vda --pivot

root@wasvirt2:~#  virsh blockjob test vda --info
No current block job for vda

Someone else referenced it here:  
https://www.redhat.com/archives/libvirt-users/2015-January/msg00010.html

Comment 2 Matthew Schumacher 2015-04-28 23:56:31 UTC
I can confirm that this issue exists while using qemu 2.3.0 as well. 

I did a little more debugging today running these commands:

root@wasvirt2:/# date; virsh blockcopy test vda /datastore/test.qcow2  --wait --verbose
Tue Apr 28 15:46:06 AKDT 2015

Now in mirroring phase
root@wasvirt2:/# date
Tue Apr 28 15:46:09 AKDT 2015
root@wasvirt2:/# virsh blockjob test vda 
Block Copy: [100 %]

root@wasvirt2:/# date; virsh blockjob test vda --pivot
Tue Apr 28 15:46:59 AKDT 2015


You can see that libvirt announces that it's in mirroring phase without waiting.  Attached is the debug with timestamps showing that it didn't actually get to mirrored phase until 2015-04-28 23:46:25.

I can provide backtraces if requested, just need to know where to break.  Hopefully this debug will help.

Comment 3 Matthew Schumacher 2015-04-28 23:57:38 UTC
Created attachment 1019911 [details]
debug output with timestamps

Comment 4 Eric Blake 2015-04-29 02:27:10 UTC
This sounds a lot like bug 1197592; it may be the same data race.  Immediately after creating a job, qemu reports job stats as 0 complete of 0 total; only after a little time has elapsed does it start reporting 0 complete of non-zero total.  However, libvirt is treating complete == total as done, even though that only makes sense for a non-zero size (or for a no-op job that really has nothing to do).  I'm still trying to figure out if the fix should be in qemu, libvirt, or both.

Comment 6 Cole Robinson 2016-04-10 21:02:59 UTC
The RHEL bug Eric linked to mentions a few commits that should fix this, starting with:

commit 86c4df83b913dd73b79caeed2038291374384dc5
Author: Michael Chapman <mike.org>
Date:   Wed Jan 27 13:24:54 2016 +1100

    virsh: improve waiting for block job readiness

So closing as CURRENTRELEASE


Note You need to log in before you can comment on or make changes to this bug.