Bug 2057067

Summary:	`virsh blockjob --abort' logs error when cancelling a copy job started with '--reuse-external --shallow', where the target image has a backing file
Product:	Red Hat Enterprise Linux 9	Reporter:	Kashyap Chamarthy <kchamart>
Component:	libvirt	Assignee:	Peter Krempa <pkrempa>
libvirt sub component:	Storage	QA Contact:	Meina Li <meili>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	low
Priority:	low	CC:	astupnik, chhu, dzheng, jdenemar, lmen, nanli, pkrempa, virt-maint, xuzhang
Version:	9.0	Keywords:	Triaged
Target Milestone:	rc	Flags:	pm-rhel: mirror+
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	libvirt-8.1.0-1.el9	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-11-15 10:03:40 UTC	Type:	Bug
Cloudforms Team:	---	Target Upstream Version:	8.1.0
Embargoed:

Description Kashyap Chamarthy 2022-02-22 16:55:05 UTC

Description of problem
----------------------

[Thanks to Peter Krempa for the bug title summary.]

The test here is an OpenStack CI test.  And below is the rough
libvirt/QEMU sequence:

`virsh blockjob --abort' fails when cancelling a copy/mirror job that is
started with '--reuse-external --shallow', where the target image has a
backing image.

And the failure is:

        internal error: unable to execute QEMU command 'blockdev-del':
        Failed to find node with node-name='libvirt-4-storage'

Where:

* "--reuse-external" == reuse an existing external file on the
  destination host for the mirror/copy job

* "--shallow" == the copy shares the backing chain

And the rough underlying QEMU call sequence here is:

    - blockdev-add,
    - blockdev-mirror,
    - block-job-cancel, 
    - job-dismiss, 
    - blockdev-del ... which fails with the above "internal error"
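
The sequence above can be sketched as QMP messages. This is only a rough
illustration, not what libvirt literally sends: the job id, the node names
and the exact `blockdev-add' arguments are hypothetical, chosen to show
that the final `blockdev-del' targets a node that was never added.

```python
import json

# Hypothetical identifiers, for illustration only. libvirt generates the
# real job ids and node names (e.g. 'libvirt-4-storage') at runtime.
JOB_ID = "example-copy-job"
SOURCE_NODE = "example-source-format"
TARGET_NODE = "example-target-format"
BACKING_NODE = "example-target-backing"


def qmp(execute, arguments):
    """Build one QMP command as a plain dict."""
    return {"execute": execute, "arguments": arguments}


def copy_then_cancel_sequence():
    """Return the rough QMP sequence from the description, in order."""
    return [
        # Open the reused destination image without its backing file
        # ("backing": None is serialized as JSON null).
        qmp("blockdev-add", {
            "driver": "qcow2",
            "node-name": TARGET_NODE,
            "file": {"driver": "file", "filename": "/tmp/copy.qcow2"},
            "backing": None,
        }),
        # sync=top copies only the top layer, i.e. a shallow copy.
        qmp("blockdev-mirror", {
            "job-id": JOB_ID,
            "device": SOURCE_NODE,
            "target": TARGET_NODE,
            "sync": "top",
        }),
        qmp("block-job-cancel", {"device": JOB_ID}),
        qmp("job-dismiss", {"id": JOB_ID}),
        # BACKING_NODE was never added above, so QEMU would reject this
        # with "Failed to find node with node-name=...".
        qmp("blockdev-del", {"node-name": BACKING_NODE}),
    ]


if __name__ == "__main__":
    for command in copy_then_cancel_sequence():
        print(json.dumps(command))
```

Running the script prints one JSON command per line, in the same order as
the list above.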


Root cause analysis
-------------------

This is based on an IRC chat with Peter:

   libvirt has a piece of code which ensures that the backing image of
   the reused destination image is added only when finishing the job.
   On cancellation of the [copy] job, we want to unplug the image, but
   the backing image was not yet plugged in.
   
   However, the test is doing a `block-job-cancel' here, which most
   likely still expects that the backing image was already plugged in.

Version
-------

  - libvirt version is 7.10.0;
  - QEMU is 6.1.0-5

How reproducible: Consistently (in the OpenStack CI)


Steps to Reproduce
------------------

The bug was triggered by OpenStack test code here:
https://bugs.launchpad.net/tripleo/+bug/1959014/ 

The test roughly boots the server, snapshots it, and tries to upload
the image to Glance (the image template storage service).


Actual results
--------------

Copy job cancellation fails with:

    internal error: unable to execute QEMU command 'blockdev-del':
    Failed to find node with node-name='libvirt-4-storage' 


Expected results
----------------

The call to `blockdev-del` doesn't fail on [copy] job cancel.

Comment 1 Peter Krempa 2022-02-23 12:17:41 UTC

My original assumption was that aborting the block job propagates the error, but at the point where the error happens we no longer propagate it to the caller, so it ends up only as a log entry.

The cancellation of the block job was actually successful, and the error is spurious because the image was not actually inserted. Thus it can be safely ignored until libvirt is fixed.

The problems described in the Launchpad issue are caused by QEMU crashing and have nothing to do with the block job cancellation reporting errors.

Comment 2 Peter Krempa 2022-02-23 12:25:58 UTC

To reproduce the issue the following steps are necessary:

1) create a VM with a disk image which has at least one backing image, or create a snapshot. E.g.:

    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/tmp/img.qcow2' index='1'/>
      <backingStore type='file' index='5'>
        <format type='qcow2'/>
        <source file='/tmp/copybase.qcow2'/>
        <backingStore/>
      </backingStore>
      <target dev='hdd' bus='ide'/>
      <alias name='ide0-1-1'/>
      <address type='drive' controller='0' bus='1' target='0' unit='1'/>
    </disk>

2) create the destination images:

cp /tmp/copybase.qcow2 /tmp/copycopy.qcow2
qemu-img create -f qcow2 -F qcow2 -b /tmp/copycopy.qcow2 /tmp/copy.qcow2

(There is no need to actually copy the original image; a dummy one will do. The data will not be consistent, but the job is going to be cancelled anyway.)

3) start the copy job
virsh blockcopy $VM --path $DISKTARGET --dest /tmp/copy.qcow2 --reuse-external --shallow --transient-job

4) abort the blockjob
virsh blockjob --abort $VM $DISKTARGET

The log file will have the error mentioned in the description.

Comment 3 Peter Krempa 2022-02-23 12:26:45 UTC

Fixed upstream:

commit 14851cff117a5cb77f0543f0ca5b72d10b83b8e5
Author: Peter Krempa <pkrempa>
Date:   Tue Feb 22 17:34:46 2022 +0100

    qemu: blockjob: Avoid spurious log errors when cancelling a shallow copy with reused images
    
    In case when a user starts a block copy operation with
    VIR_DOMAIN_BLOCK_COPY_SHALLOW and VIR_DOMAIN_BLOCK_COPY_REUSE_EXT and
    both the reused image and the original disk have a backing image libvirt
    specifically does not insert the backing image until after the job is
    asked to be completed via virBlockJobAbort with
    VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT.
    
    This is so that management applications can copy the backing image on
    the background.
    
    Now when a user aborts the block job instead of cancelling it we'd
    ignore the fact that we didn't insert the backing image yet and the
    cancellation would result into a 'blockdev-del' of a invalid node name
    and thus an 'error' severity entry in the log.
    
    To solve this issue we use the same conditions when the backing image
    addition is avoided to remove the internal state for them prior to the
    call to unplug the mirror destination.
    
    Reported-by: Kashyap Chamarthy <kchamart>
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

v8.0.0-469-g14851cff11
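
The logic of the fix can be illustrated with a toy model. This is not
libvirt's code: `FakeQemu', `cancel_copy' and the node names are
hypothetical, and the condition is simplified to "shallow, reused image,
and the target has a backing image". It only shows why skipping the
never-inserted backing node avoids the spurious error.

```python
class FakeQemu:
    """Toy stand-in for QEMU's set of named block nodes (not a real API)."""

    def __init__(self):
        self.nodes = set()

    def blockdev_add(self, name):
        self.nodes.add(name)

    def blockdev_del(self, name):
        if name not in self.nodes:
            raise LookupError(f"Failed to find node with node-name='{name}'")
        self.nodes.remove(name)


def cancel_copy(qemu, target_chain, shallow, reuse_ext, fixed):
    """Unplug the mirror destination after a cancelled copy.

    target_chain lists node names top-first; entries after the first are
    backing images. Returns the error messages that would be logged.
    """
    # Same condition under which the backing image is not inserted
    # before the job is pivoted.
    backing_deferred = shallow and reuse_ext and len(target_chain) > 1

    if fixed and backing_deferred:
        # Fixed behaviour: forget the backing nodes, unplug only the top.
        to_remove = target_chain[:1]
    else:
        # Old behaviour: try to unplug the whole chain.
        to_remove = target_chain

    errors = []
    for node in to_remove:
        try:
            qemu.blockdev_del(node)
        except LookupError as exc:
            errors.append(str(exc))
    return errors


def run(fixed):
    """Start a shallow copy onto a reused image, then cancel it."""
    qemu = FakeQemu()
    qemu.blockdev_add("copy-top")  # the backing node is deliberately not added
    return cancel_copy(qemu, ["copy-top", "copy-backing"],
                       shallow=True, reuse_ext=True, fixed=fixed)
```

In this model `run(fixed=False)' reports one "Failed to find node" error
for the backing node, while `run(fixed=True)' reports none.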

Comment 4 Meina Li 2022-02-25 03:27:55 UTC

Reproduced version:
libvirt-8.0.0-5.el9.x86_64
qemu-kvm-6.2.0-10.el9.x86_64

Reproduced Steps:
1. Prepare a running guest.
# virsh domstate lmn
running
2. Create snapshot for the guest.
# virsh snapshot-create-as lmn s1 --disk-only
Domain snapshot s1 created
# virsh dumpxml lmn | grep /disk -B10
......
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/lmn.s1' index='2'/>
      <backingStore type='file' index='1'>
        <format type='qcow2'/>
        <source file='/var/lib/libvirt/images/lmn.qcow2'/>
        <backingStore/>
      </backingStore>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </disk>
3. Create a disk image which has another backing file.
# qemu-img create -f qcow2 /var/lib/libvirt/images/test.img 500M
Formatting '/var/lib/libvirt/images/test.img', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=524288000 lazy_refcounts=off refcount_bits=16
# qemu-img create -f qcow2 -F qcow2 -b /var/lib/libvirt/images/test.img /tmp/copy.qcow2
Formatting '/tmp/copy.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=524288000 backing_file=/var/lib/libvirt/images/test.img backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
4. Do blockcopy to /tmp/copy.qcow2
# virsh blockcopy lmn vda /tmp/copy.qcow2 --reuse-external --shallow --transient-job
Block Copy started
5. Abort the blockjob.
# virsh blockjob lmn vda --abort
error: invalid argument: disk vda does not have an active block job
6. Check the libvirtd.log:
......
2022-02-25 02:45:57.723+0000: 2401: debug : qemuMonitorJSONIOProcessLine:222 : Line [{"id": "libvirt-20", "error": {"class": "GenericError", "desc": "Failed to find node with node-name='libvirt-8-format'"}}]
2022-02-25 02:45:57.723+0000: 2401: info : qemuMonitorJSONIOProcessLine:241 : QEMU_MONITOR_RECV_REPLY: mon=0x7f302c082460 reply={"id": "libvirt-20", "error": {"class": "GenericError", "desc": "Failed to find node with node-name='libvirt-8-format'"}}
2022-02-25 02:45:57.723+0000: 2554: debug : qemuMonitorJSONCheckErrorFull:387 : unable to execute QEMU command {"execute":"blockdev-del","arguments":{"node-name":"libvirt-8-format"},"id":"libvirt-20"}: {"id":"libvirt-20","error":{"class":"GenericError","desc":"Failed to find node with node-name='libvirt-8-format'"}}
2022-02-25 02:45:57.723+0000: 2554: error : qemuMonitorJSONCheckErrorFull:399 : internal error: unable to execute QEMU command 'blockdev-del': Failed to find node with node-name='libvirt-8-format'
......
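
Checking the log by hand can be replaced by a small scanner. The sketch
below assumes the line layout seen in the excerpt above (timestamp,
thread id, severity, function:line, message); the function name
`find_blockdev_del_errors' is made up for this example.

```python
import re

# Layout assumed from the libvirtd.log excerpt:
#   <date> <time>: <thread>: <severity> : <function>:<line> : <message>
LOG_LINE = re.compile(
    r"^(?P<ts>\S+ \S+): (?P<tid>\d+): (?P<level>\w+) : "
    r"(?P<func>\w+):(?P<lineno>\d+) : (?P<msg>.*)$"
)
NODE_NAME = re.compile(r"node-name='([^']+)'")


def find_blockdev_del_errors(lines):
    """Return node names from error-severity 'blockdev-del' failures."""
    nodes = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match is None or match.group("level") != "error":
            continue  # skip debug/info lines and anything unparsable
        msg = match.group("msg")
        if "unable to execute QEMU command 'blockdev-del'" not in msg:
            continue
        node = NODE_NAME.search(msg)
        if node is not None:
            nodes.append(node.group(1))
    return nodes
```

Passing an open log file (for example the libvirtd.log checked in step 6)
to the function yields an empty list on a fixed build.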

Comment 5 Meina Li 2022-02-25 06:43:35 UTC

Pre-verified in libvirt-8.1.0-1.fc35.x86_64 and qemu-kvm-6.1.0-14.fc35.x86_64: PASSED

Comment 8 Meina Li 2022-05-10 06:55:09 UTC

Verified Version:
libvirt-8.3.0-1.el9.x86_64
qemu-kvm-7.0.0-2.el9.x86_64

Verified Steps:
S1: Do blockcopy to file disk with backing file
1. Prepare a running guest.
# virsh domstate lmn
running
2. Create snapshot for the guest.
# virsh snapshot-create-as lmn s1 --disk-only
Domain snapshot s1 created
# virsh dumpxml lmn | xmllint --xpath //disk -
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/lmn.s1' index='2'/>
      <backingStore type='file' index='1'>
        <format type='qcow2'/>
        <source file='/var/lib/libvirt/images/lmn.qcow2'/>
        <backingStore/>
      </backingStore>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </disk>
3. Create a disk image which has another backing file.
# qemu-img create -f qcow2 /var/lib/libvirt/images/test.img 10G
Formatting '/var/lib/libvirt/images/test.img', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=10737418240 lazy_refcounts=off refcount_bits=16
# qemu-img create -f qcow2 -F qcow2 -b /var/lib/libvirt/images/test.img /tmp/copy.qcow2
Formatting '/tmp/copy.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=10737418240 backing_file=/var/lib/libvirt/images/test.img backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
4. Do blockcopy and then abort the blockjob.
# virsh blockcopy lmn vda /tmp/copy.qcow2 --reuse-external --shallow --transient-job
Block Copy started
# virsh blockjob lmn vda --abort
# virsh dumpxml lmn | xmllint --xpath //disk -
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/lmn.s1' index='2'/>
      <backingStore type='file' index='1'>
        <format type='qcow2'/>
        <source file='/var/lib/libvirt/images/lmn.qcow2'/>
        <backingStore/>
      </backingStore>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </disk>
5. Do blockcopy and then pivot the blockjob.
# virsh blockcopy lmn vda /tmp/copy.qcow2 --reuse-external --shallow --transient-job
Block Copy started
# virsh blockjob lmn vda --pivot
# virsh dumpxml lmn | xmllint --xpath //disk -
<disk type="file" device="disk">
      <driver name="qemu" type="qcow2"/>
      <source file="/tmp/copy.qcow2" index="9"/>
      <backingStore type="file" index="10">
        <format type="qcow2"/>
        <source file="/var/lib/libvirt/images/test.img"/>
        <backingStore/>
      </backingStore>
      <target dev="vda" bus="virtio"/>
      <alias name="virtio-disk0"/>
      <address type="pci" domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
    </disk>

S2: Do blockcopy to block disk with backing file
1. Prepare a running guest.
# virsh domstate lmn
running
2. Create snapshot for the guest.
# virsh snapshot-create-as lmn --no-metadata --reuse-external --disk-only --diskspec vdb,file=/dev/vg0/lv1,stype=block
Domain snapshot 1652164878 created
# virsh dumpxml lmn | xmllint --xpath //disk -
<disk type="block" device="disk">
      <driver name="qemu" type="qcow2" cache="none"/>
      <source dev="/dev/vg0/lv1" index="2"/>
      <backingStore type="block" index="1">
        <format type="raw"/>
        <source dev="/dev/vg0/lv0"/>
        <backingStore/>
      </backingStore>
      <target dev="vdb" bus="virtio"/>
      <alias name="virtio-disk1"/>
      <address type="pci" domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
    </disk>
3. Create a block disk which has another backing file.
# qemu-img create -f qcow2 -F qcow2 -b /dev/vg0/lv3  /dev/vg0/lv4
Formatting '/dev/vg0/lv4', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=104857600 backing_file=/dev/vg0/lv3 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
4. Do blockcopy and then abort the blockjob.
# virsh blockcopy lmn vdb /dev/vg0/lv4 --reuse-external --shallow --transient-job --blockdev
Block Copy started
# virsh blockjob lmn vdb --abort
# virsh dumpxml lmn | xmllint --xpath //disk -
<disk type="block" device="disk">
      <driver name="qemu" type="qcow2" cache="none"/>
      <source dev="/dev/vg0/lv1" index="2"/>
      <backingStore type="block" index="1">
        <format type="raw"/>
        <source dev="/dev/vg0/lv0"/>
        <backingStore/>
      </backingStore>
      <target dev="vdb" bus="virtio"/>
      <alias name="virtio-disk1"/>
      <address type="pci" domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
    </disk>
5. Do blockcopy and then pivot the blockjob.
# virsh blockcopy lmn vdb /dev/vg0/lv4 --reuse-external --shallow --transient-job --blockdev
Block Copy started
# virsh blockjob lmn vdb --pivot
# virsh dumpxml lmn | xmllint --xpath //disk -
<disk type="block" device="disk">
      <driver name="qemu" type="qcow2" cache="none"/>
      <source dev="/dev/vg0/lv4" index="5"/>
      <backingStore type="block" index="6">
        <format type="qcow2"/>
        <source dev="/dev/vg0/lv3"/>
        <backingStore/>
      </backingStore>
      <target dev="vdb" bus="virtio"/>
      <alias name="virtio-disk1"/>
      <address type="pci" domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
    </disk>

Neither scenario produces an error in the libvirtd log.
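
The dumpxml comparisons in both scenarios can also be automated. The
following is a minimal sketch using Python's standard XML parser; the
helper name `backing_chain' is hypothetical. It walks a <disk> element
and returns the image paths top-first, so that the chain after --abort
can be compared with the original and the chain after --pivot with the
copy target.

```python
import xml.etree.ElementTree as ET


def backing_chain(disk_xml):
    """Return the image paths of a <disk> element, top image first."""
    chain = []
    node = ET.fromstring(disk_xml)
    while node is not None:
        # Only the direct <source> child belongs to this layer.
        source = node.find("source")
        if source is None:
            break  # an empty <backingStore/> terminates the chain
        # File-backed layers use file=, block-backed layers use dev=.
        chain.append(source.get("file") or source.get("dev"))
        node = node.find("backingStore")
    return chain
```

For the S1 output, the chain is expected to stay at lmn.s1 -> lmn.qcow2
after the abort and to become /tmp/copy.qcow2 -> test.img after the pivot.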

Comment 10 errata-xmlrpc 2022-11-15 10:03:40 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Low: libvirt security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:8003