Bug 1135169

Summary:	blockcopy job cancelled with "CTRL+C" still shows as an active block job in the background
Product:	Red Hat Enterprise Linux 7	Reporter:	Shanzhi Yu <shyu>
Component:	libvirt	Assignee:	Erik Skultety <eskultet>
Status:	CLOSED ERRATA	QA Contact:	Virtualization Bugs <virt-bugs>
Severity:	medium	Docs Contact:
Priority:	high
Version:	7.1	CC:	dyuan, eblake, eskultet, jdenemar, jsc, mzhan, nerijus, rbalakri, xuzhang, yanyang, zpeng
Target Milestone:	rc
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	libvirt-1.2.8-10.el7	Doc Type:	Bug Fix
Doc Text:	Cause: A blockcopy/blockcommit job is aborted, either with CTRL+C or with the virsh abort command. Consequence: The job reports that it was aborted successfully, but the cleanup routine is skipped and the reference to the active block job is not destroyed, so any further block job call returns an error stating that the disk is still in an active block job. Fix: A check for an additional status (VIR_DOMAIN_BLOCK_JOB_CANCELED) was added, so the cleanup routine is executed in this case as well. Result: All block jobs can be aborted successfully.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2015-03-05 07:43:25 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Shanzhi Yu 2014-08-29 02:27:10 UTC

Description of problem:

A blockcopy job was cancelled with "CTRL+C", but one block job still appears to be active in the background.

Version-Release number of selected component (if applicable):

libvirt-1.2.7-2.el7.x86_64
qemu-kvm-rhev-2.1.0-2.el7.x86_64


How reproducible:

100%

Steps to Reproduce:

1. Prepare a transient guest
# virsh list --transient
 Id    Name                           State
----------------------------------------------------
 4     rhel6                          running


2. Start a blockcopy job with --wait and --verbose, then cancel it before the copy finishes, using "CTRL+C", the "timeout" option, or "blockjob --abort"

# virsh blockcopy rhel6 vda /var/lib/libvirt/images/copy.img --verbose  --wait
Block Copy: [ 32 %]^C
Copy aborted

3. Check the block job info

# virsh blockjob rhel6 vda


4. Run the block copy again

# virsh blockcopy rhel6 vda /var/lib/libvirt/images/copy.img --verbose  --wait
error: block copy still active: disk 'vda' already in active block job

# virsh dumpxml rhel6 |grep mirror -A 3
      <mirror type='file' file='/var/lib/libvirt/images/copy.img' format='qcow2' job='copy' ready='abort'>
        <format type='qcow2'/>
        <source file='/var/lib/libvirt/images/copy.img'/>
      </mirror>
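
The leftover job is visible in the live XML as a <mirror> element whose ready attribute stays at 'abort'. A minimal Python sketch for spotting that state in "virsh dumpxml" output might look like the following. The function name find_stale_mirrors is hypothetical, and the sample XML is trimmed from the output above and wrapped in a minimal <domain> element so that it parses; the bus='virtio' value is an assumption. Note that ready='abort' is expected briefly while a job is being ended, so it is only a symptom when it persists.

```python
import xml.etree.ElementTree as ET


def find_stale_mirrors(domain_xml):
    """Return (target dev, job type) for each disk whose <mirror> has ready='abort'."""
    root = ET.fromstring(domain_xml)
    stale = []
    # iter() finds <disk> elements at any depth, so full dumpxml output works too.
    for disk in root.iter("disk"):
        mirror = disk.find("mirror")
        if mirror is None or mirror.get("ready") != "abort":
            continue
        target = disk.find("target")
        dev = target.get("dev") if target is not None else None
        stale.append((dev, mirror.get("job")))
    return stale


# Trimmed sample based on the report; the <target> bus value is assumed.
SAMPLE_XML = """
<domain type='kvm'>
  <devices>
    <disk type='file' device='disk'>
      <mirror type='file' file='/var/lib/libvirt/images/copy.img' format='qcow2' job='copy' ready='abort'>
        <format type='qcow2'/>
        <source file='/var/lib/libvirt/images/copy.img'/>
      </mirror>
      <target dev='vda' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""

print(find_stale_mirrors(SAMPLE_XML))
```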


Actual results:


Expected results:

It should be possible to cancel a blockcopy job during its first phase (copying data from the source).

Additional info:

Comment 1 Yang Yang 2014-08-29 08:49:43 UTC

The issue also reproduces with the virsh blockcommit command.
Steps:
1. Run blockcommit, then cancel it with CTRL+C, a timeout, or an abort before the job completes

# virsh blockcommit test1 hda --top /var/lib/libvirt/images/test1.s4 --shallow --active --wait --verbose --async
Block Commit: [ 63 %]^C      ---- press ctrl+c  
Commit aborted

OR abort the job with the virsh blockjob command
 # virsh blockjob test1 hda --abort
 At the same time, the commit job returns with the following messages:
 Block Commit: [100 %]
 Now in synchronized phase
 
OR run blockcommit with a timeout

2. Check block job info

# virsh blockjob test1 hda

3. Abort the job again
# virsh blockjob test1 hda --abort
error: Requested operation is not valid: another job on disk 'hda' is still being ended

4. Check the XML
# virsh dumpxml test1 | grep disk -a6
<disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/var/lib/libvirt/images/test1.s3'/>
      <mirror type='file' job='active-commit' ready='abort'>
        <format type='qcow2'/>
        <source file='/var/lib/libvirt/images/test1.s2'/>
      </mirror>
      <target dev='hda' bus='ide'/>
      <alias name='ide0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>

Comment 2 Jan Schumacher 2014-10-30 17:25:13 UTC

This issue also reproduces when the blockcopy job is not cancelled but finishes successfully. This is a big problem, because the domain has to be restarted entirely to get rid of the phantom block job and make it persistent again.


Linux host 3.16-3-amd64 #1 SMP Debian 3.16.5-1 (2014-10-10) x86_64 GNU/Linux

libvirt-bin:
  Installed: 1.2.9-3

qemu-kvm:
  Installed: 2.1+dfsg-5+b1


Steps to reproduce:

1. Make the guest transient
# virsh undefine guest


2. Start blockcopy
# virsh blockcopy guest hda /vm/guest-copy.qcow2 --wait --verbose --finish

Output:

Block Copy: [100 %]
Successfully copied


3. Make the domain persistent again
# virsh define guest.xml

Output:

error: Failed to define domain from guest.xml
error: block copy still active: domain has active block job


4. Check active jobs
# virsh blockjob guest hda --info

Output:

No current block job for hda


5. Try to abort the job for the guest
# virsh blockjob guest hda --abort

Output:

error: Requested operation is not valid: another job on disk 'hda' is still being ended


6. Check guest XML
# virsh dumpxml guest > guest.xml

    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/vm/guest.qcow2'/>
      <backingStore/>
      <mirror type='file' file='/vm/guest-copy.qcow2' format='qcow2' job='copy' ready='abort'>
        <format type='qcow2'/>
        <source file='/vm/guest-copy.qcow2'/>
      </mirror>
      <target dev='hda' bus='ide'/>
      <alias name='ide0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>




Additional info: I have six guests running, pulling blockcopy backups every night. The behaviour described above only affects the same two or three guests, sometimes not even immediately (meaning the 1st or 2nd nightly blockcopy might leave the guest without an active block job, but not the one after). This seems rather erratic.

More info: This behaviour did not occur with libvirt-bin 1.2.4-3 / qemu-kvm 2.0.0+dfsg-6 (before dist-upgrade ..)

Comment 3 Erik Skultety 2014-11-27 12:50:27 UTC

Fixed upstream:

commit 35ce5abcdeef51fdde89983a3f1650ba6904ff34
Author: Erik Skultety <eskultet>
Date:   Thu Nov 27 10:17:44 2014 +0100

    qemu: fix block{commit,copy} abort handling
    
    When a block{commit,copy} job was aborted on a domain, block job handler
    did not process it correctly, leaving a phantom job in the background.
    Any further calls to any blockjob causes "block <jobtype> still active"
    error. This patch fixes the blockjob handler so that it checks not only
    for VIR_DOMAIN_BLOCK_JOB_FAILED status, but VIR_DOMAIN_BLOCK_JOB_CANCELED
    status as well, followed by our existing cleanup routine.

v1.2.10-209-g35ce5ab

Comment 4 Jiri Denemark 2014-12-01 09:13:37 UTC

Comment 3 is incorrect; the patch was not upstream yet. It is now upstream as v1.2.10-218-g8e23e0e:

commit 8e23e0e977fbcc4a7880e187a63c509d6e6879c6
Author: Erik Skultety <eskultet>
Date:   Thu Nov 27 13:29:42 2014 +0100

    qemu: fix block{commit,copy} abort handling
    
    When a block{commit,copy} job was aborted on a domain, block job handler
    did not process it correctly, leaving a phantom job in the background.
    Any further calls to any blockjob causes "block <jobtype> still active"
    error. This patch fixes the blockjob handler so that it checks not only
    for VIR_DOMAIN_BLOCK_JOB_FAILED status, but VIR_DOMAIN_BLOCK_JOB_CANCELED
    status as well, followed by our existing cleanup routine.
    
    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1135169
    
    Signed-off-by: Jiri Denemark <jdenemar>
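
The change described in the commit message can be modelled in a few lines. This is a simplified Python sketch, not the libvirt C code: the Disk class, the handle_block_job_event function and its check_canceled switch are hypothetical, and only the status constants match libvirt's virConnectDomainEventBlockJobStatus enum. It shows why checking only for FAILED leaves the mirror reference behind after a cancel.

```python
# Status values as defined by libvirt's virConnectDomainEventBlockJobStatus enum.
VIR_DOMAIN_BLOCK_JOB_COMPLETED = 0
VIR_DOMAIN_BLOCK_JOB_FAILED = 1
VIR_DOMAIN_BLOCK_JOB_CANCELED = 2
VIR_DOMAIN_BLOCK_JOB_READY = 3


class Disk:
    """Hypothetical stand-in for the per-disk state kept by the driver."""

    def __init__(self, mirror=None):
        self.mirror = mirror  # destination of an active copy/commit, or None


def handle_block_job_event(disk, status, check_canceled=True):
    """Simplified model of the block job event handler.

    check_canceled=False models the old behaviour (cleanup on FAILED only);
    check_canceled=True models the fix (cleanup on FAILED or CANCELED).
    Other statuses are ignored in this sketch.
    """
    needs_cleanup = status == VIR_DOMAIN_BLOCK_JOB_FAILED or (
        check_canceled and status == VIR_DOMAIN_BLOCK_JOB_CANCELED
    )
    if needs_cleanup:
        disk.mirror = None  # drop the reference to the aborted job
    return disk


before_fix = handle_block_job_event(
    Disk("copy.img"), VIR_DOMAIN_BLOCK_JOB_CANCELED, check_canceled=False
)
after_fix = handle_block_job_event(Disk("copy.img"), VIR_DOMAIN_BLOCK_JOB_CANCELED)

print("before fix:", before_fix.mirror)  # mirror reference survives the cancel
print("after fix:", after_fix.mirror)
```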

Comment 7 Shanzhi Yu 2014-12-04 09:08:11 UTC

I am verifying this bug after testing with the blockcopy, blockcommit, and blockpull commands.
All three commands can be cancelled correctly with libvirt-1.2.8-10.el7.x86_64.

Comment 9 errata-xmlrpc 2015-03-05 07:43:25 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0323.html

Comment 10 Nerijus Baliūnas 2015-10-21 13:48:29 UTC

I have a similar problem with RH 7.2 beta:

# virsh snapshot-create-as --domain rasa sn1 --diskspec \
 vda,file=/var/lib/libvirt/images/rasa-sn1.qcow2 --disk-only \
 --atomic --no-metadata

Now copy the original /var/lib/libvirt/images/rasa.qcow2, which is no longer being updated, to another place.

# virsh blockcommit rasa vda --active --verbose --pivot
Block commit: [100 %]error: failed to pivot job for disk vda
error: block copy still active: disk 'vda' not ready for pivot yet

# virsh domblklist rasa
Target     Source
------------------------------------------------
vda        /var/lib/libvirt/images/rasa-sn1.qcow2

If blockcommit had succeeded, it would now be:
vda        /var/lib/libvirt/images/rasa.qcow2

Now both files rasa.qcow2 and rasa-sn1.qcow2 are written to, and
# virsh blockjob rasa vda
Active Block Commit: [100 %]

But running virsh blockcommit rasa vda --active --verbose --pivot once more gives:
error: block copy still active: disk 'vda' already in active block job

How do I make rasa.qcow2 the only active vda?
Now both the original rasa.qcow2 and snapshot rasa-sn1.qcow2 are updated.
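
The "not ready for pivot yet" error suggests the pivot was requested before the mirror had been marked ready. One way a script could make that precondition explicit is to poll the live XML until the disk's <mirror> reports ready='yes' and only then request the pivot. The following is a minimal Python sketch under assumptions, not a confirmed fix for this case: is_ready_to_pivot, wait_until_ready, make_xml and the get_xml callback are hypothetical names, and the XML is a reduced stand-in for real "virsh dumpxml" output.

```python
import time
import xml.etree.ElementTree as ET


def is_ready_to_pivot(domain_xml, dev):
    """True if the disk with target `dev` has a <mirror> with ready='yes'."""
    root = ET.fromstring(domain_xml)
    for disk in root.iter("disk"):
        target = disk.find("target")
        if target is None or target.get("dev") != dev:
            continue
        mirror = disk.find("mirror")
        return mirror is not None and mirror.get("ready") == "yes"
    return False


def wait_until_ready(get_xml, dev, timeout=60.0, interval=1.0, sleep=time.sleep):
    """Poll get_xml() until the mirror on `dev` is ready; False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if is_ready_to_pivot(get_xml(), dev):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


def make_xml(ready_attr):
    """Build a reduced domain XML for the demo (bus value is assumed)."""
    return (
        "<domain><devices><disk type='file' device='disk'>"
        "<mirror type='file' job='active-commit'%s/>"
        "<target dev='vda' bus='virtio'/>"
        "</disk></devices></domain>" % ready_attr
    )


# Demo: the mirror becomes ready on the third poll; sleeping is stubbed out.
responses = iter([make_xml(""), make_xml(""), make_xml(" ready='yes'")])
result = wait_until_ready(lambda: next(responses), "vda", sleep=lambda s: None)
print(result)
```

In real use, get_xml could wrap a call to "virsh dumpxml rasa", and a True result would be followed by "virsh blockjob rasa vda --pivot". This does not help when the job is stuck in the state described earlier in this report.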