Bug 1951507 - blockCopy succeeds, but target volume "not assigned to domain" after pivot
Summary: blockCopy succeeds, but target volume "not assigned to domain" after pivot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.3
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 8.4
Assignee: Peter Krempa
QA Contact: Meina Li
URL:
Whiteboard:
Duplicates: 1978713 1979015 (view as bug list)
Depends On:
Blocks: 1955667
 
Reported: 2021-04-20 10:01 UTC by Nir Soffer
Modified: 2024-10-01 17:57 UTC
17 users

Fixed In Version: libvirt-7.3.0-1.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned to: 1955667 (view as bug list)
Environment:
Last Closed: 2021-11-16 07:52:40 UTC
Type: Bug
Target Upstream Version: 7.3.0
Embargoed:


Attachments
libvirt debug log (332.32 KB, application/x-xz)
2021-04-20 10:01 UTC, Nir Soffer
no flags Details
libvirt/qemu vm log (9.13 KB, text/plain)
2021-04-20 10:02 UTC, Nir Soffer
no flags Details
vdsm log (191.97 KB, application/x-xz)
2021-04-20 10:03 UTC, Nir Soffer
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 6191881 0 None None None 2021-09-02 12:18:47 UTC
Red Hat Product Errata RHBA-2021:4684 0 None None None 2021-11-16 07:53:23 UTC

Description Nir Soffer 2021-04-20 10:01:46 UTC
Created attachment 1773723 [details]
libvirt debug log

Description of problem:

After a successful blockCopy, libvirt fails blockInfo for the target drive with:

    invalid argument: invalid path /path... not assigned to domain

In the libvirt log we can see that libvirt completes the job successfully, and then
it tries to remove the target nodes. After that, blockInfo for that drive fails.

This breaks extending volumes in RHV, so the VM will pause when the disk becomes
full and is unlikely to recover, since blockInfo() is called when extending a
volume.

This also breaks live merge for the same reason, and I guess any API trying to use
the drive will fail.
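
For reference, a minimal sketch of the libvirt API sequence involved, using the
libvirt Python bindings. This is not vdsm code: the target path and disk XML are
placeholders, while the drive name and the blockCopy/blockJobAbort flags mirror
the logs below.

import time
import libvirt

DISK = "vdb"                               # drive name from this report
DST_PATH = "/dev/example/target-lv"        # placeholder block target
DEST_XML = """
<disk type='block' device='disk'>
  <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='native'/>
  <source dev='%s'/>
</disk>
""" % DST_PATH

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("vm0")

# Start the copy job; the flags correspond to the 0x7 seen in the debug log
# (SHALLOW | REUSE_EXT | TRANSIENT_JOB).
dom.blockCopy(DISK, DEST_XML,
              flags=(libvirt.VIR_DOMAIN_BLOCK_COPY_SHALLOW |
                     libvirt.VIR_DOMAIN_BLOCK_COPY_REUSE_EXT |
                     libvirt.VIR_DOMAIN_BLOCK_COPY_TRANSIENT_JOB))

# Wait until the mirror has caught up (cur == end).
while True:
    info = dom.blockJobInfo(DISK, 0)
    if info and info["cur"] == info["end"]:
        break
    time.sleep(1)

# Pivot to the copy target (virDomainBlockJobAbort with flags=0x2).
dom.blockJobAbort(DISK, libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)

# After a successful pivot this should report the target volume's sizes, but in
# the buggy case it raises libvirtError:
# "invalid path ... not assigned to domain".
capacity, allocation, physical = dom.blockInfo(DST_PATH)

Both the copy and the pivot report success here; only the follow-up blockInfo()
(and any other API addressed by the new path) fails.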


Version-Release number of selected component (if applicable):
libvirt-daemon 6.6.0-13.el8
qemu-kvm-5.1.0-20.el8

How reproducible:
Fails randomly in oVirt system tests.

Steps to Reproduce:
1. Live storage migration in RHV when the source disk is on a file storage domain
   and the target disk is on a block storage domain.

Actual results:
The blockCopy job succeeds, but after that the drive path is
"not assigned to domain"

Expected results:
The drive path is recognized after a successful blockCopy, or blockCopy
fails so that RHV can continue to use the old drive.

Additional info:
Looks like libvirt is confused when completing the block job, so this may
also be a qemu bug changing the behavior in an incompatible way.

Tested with CentOS, not with RHEL. This is the environment used by oVirt system
tests. This may affect current RHV customers using the same libvirt and qemu-kvm
versions in RHEL AV 8.3.1.


## vdsm.log

1. Stopping disk replication, pivoting from file volume to block volume

2021-04-18 15:41:17,264+0200 DEBUG (jsonrpc/2) [virt.vm] (vmId='8020459c-ba31-48f5-9ad1-3af792151dbf') Stopping the disk repli cation switching to the destination drive: {'imageID': 'd9fe3172-512d-45d6-b433-a834140fd268', 'poolID': '044320e3-7070-4cee-a 654-8d4b19ce2cfd', 'volumeID': 'dd9b9a5b-65ae-4463-9a62-2f01e73cebf8', 'device': 'disk', 'domainID': '44fd974e-6466-45aa-b932- 67c2732d0ae8'} (vm:4382)

2. Libvirt reports that the block job is ready

2021-04-18 15:41:17,301+0200 INFO  (libvirt/events) [virt.vm] (vmId='8020459c-ba31-48f5-9ad1-3af792151dbf') Block job (untrack ed) type COPY for drive vdb is ready (vm:5675)

3. Libvirt reports the job has completed

2021-04-18 15:41:17,326+0200 INFO  (libvirt/events) [virt.vm] (vmId='8020459c-ba31-48f5-9ad1-3af792151dbf') Block job (untrack ed) type COPY for drive vdb has completed (vm:5666)

4. The libvirt call succeeded, vdsm updates the drive type

2021-04-18 15:41:17,351+0200 DEBUG (jsonrpc/2) [virt.vm] (vmId='8020459c-ba31-48f5-9ad1-3af792151dbf') Drive vdb type changed from 'file' to 'block' (storage:528)

2021-04-18 15:41:17,351+0200 DEBUG (jsonrpc/2) [virt.vm] (vmId='8020459c-ba31-48f5-9ad1-3af792151dbf') Drive vdb move from '/rhev/data-center/mnt/192.168.202.2:_exports_nfs_share1/e80b1b07-12c2-4308-a219-d430065dc40e/images/d9fe3172-512d-45d6-b433-a834140fd268/dd9b9a5b-65ae-4463-9a62-2f01e73cebf8' to '/rhev/data-center/mnt/blockSD/44fd974e-6466-45aa-b932-67c2732d0ae8/images/d9fe3172-512d-45d6-b433-a834140fd268/dd9b9a5b-65ae-4463-9a62-2f01e73cebf8', unsetting threshold (storage:440)

5. Vdsm reports success

2021-04-18 15:41:17,360+0200 DEBUG (jsonrpc/2) [api] FINISH diskReplicateFinish response={'status': {'code': 0, 'message': 'Done'}} (api:137)

6. When vdsm tries to set the block threshold, libvirt fails

2021-04-18 15:41:17,999+0200 ERROR (periodic/1) [virt.vm] (vmId='8020459c-ba31-48f5-9ad1-3af792151dbf') Unable to get watermarks for drive vdb: invalid argument: invalid path /rhev/data-center/mnt/blockSD/44fd974e-6466-45aa-b932-67c2732d0ae8/images/d9fe3172-512d-45d6-b433-a834140fd268/dd9b9a5b-65ae-4463-9a62-2f01e73cebf8 not assigned to domain (vm:1318)
...
2021-04-18 15:41:20,003+0200 ERROR (periodic/0) [virt.vm] (vmId='8020459c-ba31-48f5-9ad1-3af792151dbf') Unable to get watermarks for drive vdb: invalid argument: invalid path /rhev/data-center/mnt/blockSD/44fd974e-6466-45aa-b932-67c2732d0ae8/images/d9fe3172-512d-45d6-b433-a834140fd268/dd9b9a5b-65ae-4463-9a62-2f01e73cebf8 not assigned to domain (vm:1318)
...
2021-04-18 15:41:22,000+0200 ERROR (periodic/1) [virt.vm] (vmId='8020459c-ba31-48f5-9ad1-3af792151dbf') Unable to get watermarks for drive vdb: invalid argument: invalid path /rhev/data-center/mnt/blockSD/44fd974e-6466-45aa-b932-67c2732d0ae8/images/d9fe3172-512d-45d6-b433-a834140fd268/dd9b9a5b-65ae-4463-9a62-2f01e73cebf8 not assigned to domain (vm:1318)
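
For context, a rough sketch of what this periodic watermark check boils down to
with the libvirt Python bindings. This is not actual vdsm code; the function and
constant names are made up, and the extend policy is heavily simplified.

import libvirt

ONE_GB = 1024 * 1024 * 1024

def check_watermark(dom: libvirt.virDomain, drive_name: str, drive_path: str) -> None:
    # In the failing case this call raises libvirtError:
    # "invalid path ... not assigned to domain".
    capacity, allocation, physical = dom.blockInfo(drive_path)

    if capacity - allocation < ONE_GB:
        # Real RHV would request a volume extend here (out of scope for this sketch).
        return
    # Re-arm the write-threshold event for the next high-water mark.
    dom.setBlockThreshold(drive_name, allocation + ONE_GB)

Since this check runs periodically, the error keeps repeating and the volume can no
longer be extended, so the VM pauses once the disk fills up.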


## libvirtd.log

1. Creating a snapshot of the source disk (file volume on NFS)

2021-04-18 13:40:34.235+0000: 10892: debug : virDomainSnapshotCreateXML:221 : dom=0x7f01400140b0, (VM: name=vm0, uuid=8020459c -ba31-48f5-9ad1-3af792151dbf), xmlDesc=<?xml version='1.0' encoding='utf-8'?> <domainsnapshot><disks><disk name="vdb" snapshot="external" type="file"><source file="/rhev/data-center/mnt/192.168.202.2:_exports_nfs_share1/e80b1b07-12c2-4308-a219-d430065dc40e/images/d9fe3172-512d-45d6-b433-a834140fd268/dd9b9a5b-65ae-4463-9a62-2f01e73cebf8" type="file"><seclabel model="dac" relabel="no" type="none" /></source></disk></disks></domainsnapshot>, flags=0x34


2. Adding new device as libvirt-5-storage (source)

2021-04-18 13:40:34.263+0000: 10892: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f0120022180 msg={"execute":"blockdev-add","arguments":{"driver":"file","filename":"/rhev/data-center/mnt/192.168.202.2:_exports_nfs_share1/e80b1b07-12c2-4308-a219-d430065dc40e/images/d9fe3172-512d-45d6-b433-a834140fd268/dd9b9a5b-65ae-4463-9a62-2f01e73cebf8","aio":"threads","node-name":"libvirt-5-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"},"id":"libvirt-444"} fd=-1

3. Starting block copy from source to target

2021-04-18 13:41:10.813+0000: 10891: debug : virDomainBlockCopy:10374 : dom=0x7f0130005330, (VM: name=vm0, uuid=8020459c-ba31- 48f5-9ad1-3af792151dbf), disk=vdb, destxml=<?xml version='1.0' encoding='utf-8'?> <disk device="disk" snapshot="no" type="block"><source dev="/rhev/data-center/mnt/blockSD/44fd974e-6466-45aa-b932-67c2732d0ae8/images/d9fe3172-512d-45d6-b433-a834140fd268/dd9b9a5b-65ae-4463-9a62-2f01e73cebf8"><seclabel model="dac" relabel="no" type="none" /></source><driver cache="none" error_policy="stop" io="native" name="qemu" type="qcow2" /></disk>, params=(nil), nparams=0, flags=0x7

4. Adding libvirt-6-storage (target)

2021-04-18 13:41:10.874+0000: 10891: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f0120022180 msg={"execute":"blockdev-add","arguments":{"driver":"host_device","filename":"/rhev/data-center/mnt/blockSD/44fd974e-6466-45aa-b932-67c2732d0ae8/images/d9fe3172-512d-45d6-b433-a834140fd268/dd9b9a5b-65ae-4463-9a62-2f01e73cebf8","aio":"threads","node-name":"libvirt-6-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"},"id":"libvirt-456"} fd=-1

5. Starting block job copy-vdb-libvirt-5-format

2021-04-18 13:41:10.896+0000: 10891: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f0120022180 msg={"execute":"blockdev-mirror","arguments":{"job-id":"copy-vdb-libvirt-5-format","device":"libvirt-5-format","target":"libvirt-6-format","sync":"top","auto-finalize":true,"auto-dismiss":false},"id":"libvirt-458"} fd=-1


6. Trying to abort the block copy block job, switching to target
   flags=VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT

2021-04-18 13:41:17.265+0000: 10893: debug : virDomainBlockJobAbort:9898 : dom=0x7f013800b500, (VM: name=vm0, uuid=8020459c-ba31-48f5-9ad1-3af792151dbf), disk=vdb, flags=0x2

2021-04-18 13:41:17.265+0000: 10893: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f0120022180 msg={"execute":"blockdev-add","arguments":{"driver":"host_device","filename":"/rhev/data-center/mnt/blockSD/44fd974e-6466-45aa-b932-67c2732d0ae8/images/d9fe3172-512d-45d6-b433-a834140fd268/a8f66889-b221-43b8-a02c-43c983a0451c","aio":"threads","node-name":"libvirt-7-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"},"id":"libvirt-469"}
 fd=-1

2021-04-18 13:41:17.275+0000: 19074: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"return": {}, "id": "libvirt-469"}]

2021-04-18 13:41:17.275+0000: 19074: info : qemuMonitorJSONIOProcessLine:240 : QEMU_MONITOR_RECV_REPLY: mon=0x7f0120022180 reply={"return": {}, "id": "libvirt-469"}

2021-04-18 13:41:17.276+0000: 10893: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f0120022180 msg={"execute":"blockdev-add","arguments":{"node-name":"libvirt-7-format","read-only":true,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-7-storage","backing":null},"id":"libvirt-470"} fd=-1

2021-04-18 13:41:17.286+0000: 19074: info : qemuMonitorJSONIOProcessLine:240 : QEMU_MONITOR_RECV_REPLY: mon=0x7f0120022180 reply={"return": {}, "id": "libvirt-470"}

2021-04-18 13:41:17.286+0000: 10893: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f0120022180 msg={"execute":"transaction","arguments":{"actions":[{"type":"blockdev-snapshot","data":{"node":"libvirt-7-format","overlay":"libvirt-6-format"}}]},"id":"libvirt-471"} fd=-1

7. Copy block job moves to standby state?

2021-04-18 13:41:17.289+0000: 19074: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f0120022180 event={"timestamp": {"seconds": 1618753277, "microseconds": 288831}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "copy-vdb-libvirt-5-format"}}

8. Copy block job moves back to ready state

2021-04-18 13:41:17.289+0000: 19074: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f0120022180 event={"timestamp": {"seconds": 1618753277, "microseconds": 289019}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "copy-vdb-libvirt-5-format"}}

2021-04-18 13:41:17.292+0000: 19074: info : qemuMonitorJSONIOProcessLine:240 : QEMU_MONITOR_RECV_REPLY: mon=0x7f0120022180 reply={"return": {}, "id": "libvirt-471"}

2021-04-18 13:41:17.293+0000: 10893: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f0120022180 msg={"execute":"transaction","arguments":{"actions":[{"type":"block-dirty-bitmap-add","data":{"node":"libvirt-6-format","name":"libvirt-tmp-activewrite","persistent":false,"disabled":false}}]},"id":"libvirt-472"}
 fd=-1

2021-04-18 13:41:17.296+0000: 19074: info : qemuMonitorJSONIOProcessLine:240 : QEMU_MONITOR_RECV_REPLY: mon=0x7f0120022180 reply={"return": {}, "id": "libvirt-472"}

9. Libvirt completes the copy block job

2021-04-18 13:41:17.297+0000: 10893: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f0120022180 msg={"execute":"job-complete","arguments":{"id":"copy-vdb-libvirt-5-format"},"id":"libvirt-473"} fd=-1

2021-04-18 13:41:17.298+0000: 19074: info : qemuMonitorJSONIOProcessLine:240 : QEMU_MONITOR_RECV_REPLY: mon=0x7f0120022180 reply={"return": {}, "id": "libvirt-473"}

2021-04-18 13:41:17.301+0000: 19074: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f0120022180 event={"timestamp": {"seconds": 1618753277, "microseconds": 300701}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "copy-vdb-libvirt-5-format"}}

2021-04-18 13:41:17.301+0000: 19074: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f0120022180 event={"timestamp": {"seconds": 1618753277, "microseconds": 300762}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "copy-vdb-libvirt-5-format"}}

2021-04-18 13:41:17.301+0000: 19074: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f0120022180 event={"timestamp": {"seconds": 1618753277, "microseconds": 300879}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "copy-vdb-libvirt-5-format", "len": 0, "offset": 0, "speed": 0, "type": "mirror"}}

2021-04-18 13:41:17.301+0000: 19074: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f0120022180 event={"timestamp": {"seconds": 1618753277, "microseconds": 300917}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "copy-vdb-libvirt-5-format"}}

2021-04-18 13:41:17.302+0000: 10893: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f0120022180 msg={"execute":"job-dismiss","arguments":{"id":"copy-vdb-libvirt-5-format"},"id":"libvirt-475"} fd=-1

2021-04-18 13:41:17.305+0000: 19074: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f0120022180 event={"timestamp": {"seconds": 1618753277, "microseconds": 304257}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "copy-vdb-libvirt-5-format"}}

2021-04-18 13:41:17.306+0000: 10893: debug : qemuBlockJobProcessEventConcludedCopyAbort:1402 : copy job 'copy-vdb-libvirt-5-format' on VM 'vm0' aborted

10. Libvirt tries to delete the target node?!

2021-04-18 13:41:17.306+0000: 10893: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f0120022180 msg={"execute":"blockdev-del","arguments":{"node-name":"libvirt-6-format"},"id":"libvirt-476"} fd=-1

11. The operation failed: the device is in use after the successful pivot.

2021-04-18 13:41:17.308+0000: 19074: info : qemuMonitorJSONIOProcessLine:240 : QEMU_MONITOR_RECV_REPLY: mon=0x7f0120022180 reply={"id": "libvirt-476", "error": {"class": "GenericError", "desc": "Node libvirt-6-format is in use"}}

2021-04-18 13:41:17.308+0000: 10893: error : qemuMonitorJSONCheckErrorFull:416 : internal error: unable to execute QEMU command 'blockdev-del': Node libvirt-6-format is in use

12. Libvirt tries to delete the target node backing chain?!

2021-04-18 13:41:17.310+0000: 10893: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f0120022180 msg={"execute":"blockdev-del","arguments":{"node-name":"libvirt-7-format"},"id":"libvirt-478"} fd=-1

13. The operation failed since the device is in use.

2021-04-18 13:41:17.311+0000: 19074: info : qemuMonitorJSONIOProcessLine:240 : QEMU_MONITOR_RECV_REPLY: mon=0x7f0120022180 reply={"id": "libvirt-478", "error": {"class": "GenericError", "desc": "Node 'libvirt-7-format' is busy: node is used as backing hd of 'libvirt-6-format'"}}

14. Accessing the new device fails:

2021-04-18 13:41:17.998+0000: 10894: error : qemuDomainGetBlockInfo:11935 : invalid argument: invalid path /rhev/data-center/mnt/blockSD/44fd974e-6466-45aa-b932-67c2732d0ae8/images/d9fe3172-512d-45d6-b433-a834140fd268/dd9b9a5b-65ae-4463-9a62-2f01e73cebf8 not assigned to domain

This error repeats many times.

Comment 1 Nir Soffer 2021-04-20 10:02:43 UTC
Created attachment 1773724 [details]
libvirt/qemu vm log

Comment 2 Nir Soffer 2021-04-20 10:03:30 UTC
Created attachment 1773725 [details]
vdsm log

Comment 3 Peter Krempa 2021-04-20 10:50:00 UTC
In this instance libvirt is getting confused by an unfortunate serialization of events.

The bug happens due to the re-issued 'ready' state, which is re-instated after a 'standby' state that occurs after we have already entered qemuDomainBlockJobAbort and switched the blockjob into synchronous mode. In synchronous mode the ordering of operations allows a late 'ready' event, once processed, to clear the QEMU_BLOCKJOB_STATE_PIVOTING state.

The block job finalization code then uses the QEMU_BLOCKJOB_STATE_PIVOTING state to determine what to do next, and since the state is NOT PIVOTING we call the wrong handler, which attempts to execute the steps as if the job were aborted.
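
To make the ordering concrete, here is an abstract mock of the race in Python. This
is not libvirt's C code; the class and state names are simplified stand-ins for the
qemuBlockJob state machine described above.

RUNNING, READY, PIVOTING = "running", "ready", "pivoting"

class MockBlockJob:
    def __init__(self):
        self.state = READY              # the copy job has already reported 'ready'
        self.newstate = None

    def qemu_event(self, state):        # events are only queued while in sync mode
        self.newstate = state

    def update(self):                   # simplified qemuBlockJobUpdate
        if self.newstate is not None:
            self.state, self.newstate = self.newstate, None

    def finalize(self):                 # simplified ConcludedTransition for COPY
        return "pivot-cleanup" if self.state == PIVOTING else "abort-cleanup"

job = MockBlockJob()
job.qemu_event(READY)                   # 'standby' -> 'ready' blip, not yet processed
job.state = PIVOTING                    # job-complete issued, state set to PIVOTING
job.update()                            # sync wait processes the stale 'ready' event
print(job.finalize())                   # -> "abort-cleanup", i.e. the wrong handler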

Comment 4 Peter Krempa 2021-04-21 13:08:08 UTC
Fixed upstream:

commit 81e770c2f152635219cec76585c770fb9f6ce030
Author: Peter Krempa <pkrempa>
Date:   Tue Apr 20 12:58:28 2021 +0200

    qemu: blockjob: Transition into 'ready' state only from expected states
    
    In certain rare occasions qemu can transition a block job which was
    already 'ready' into 'standby' and then back. If this happens in the
    following order libvirt will get confused about the actual job state:
    
    1) the block copy job is 'ready' (job->state == QEMU_BLOCKJOB_STATE_READY)
    
    2) user calls qemuDomainBlockJobAbort with VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT
       flag but without VIR_DOMAIN_BLOCK_JOB_ABORT_ASYNC
    
    3) the block job is switched to synchronous event handling
    
    4) the block job blips to 'standby' and back to 'ready', the event is
       not processed since the blockjob is in sync mode for now
    
    5) qemuDomainBlockJobPivot is called:
        5.1) 'job-complete' QMP command is issued
        5.2) job->state is set to QEMU_BLOCKJOB_STATE_PIVOTING
    
    6) code for synchronous-wait for the job completion in qemuDomainBlockJobAbort
       is invoked
    
    7) the waiting loop calls qemuBlockJobUpdate:
    
        7.1) job->newstate is QEMU_BLOCKJOB_STATE_READY due to 4)
        7.2) qemuBlockJobEventProcess is called
        7.3) the handler for QEMU_BLOCKJOB_STATE_READY overwrites
             job->state from QEMU_BLOCKJOB_STATE_PIVOTING to QEMU_BLOCKJOB_STATE_READY
    
    8) qemuDomainBlockJobAbort is looking for a finished job, so waits again
    
    9) qemu finishes the blockjob and transitions it into 'concluded' state
    
    10) qemuBlockJobUpdate is triggered again, this time finalizing the job.
        10.1) job->newstate is = QEMU_BLOCKJOB_STATE_CONCLUDED
              job->state is = QEMU_BLOCKJOB_STATE_READY
        10.2) qemuBlockJobEventProcessConcluded is called, the function
              checks whether there was an error with the blockjob. Since
              there was no error job->newstate becomes
              QEMU_BLOCKJOB_STATE_COMPLETED.
        10.3) qemuBlockJobEventProcessConcludedTransition selects the action
              for the appropriate block job type where we have:
    
        case QEMU_BLOCKJOB_TYPE_COPY:
            if (job->state == QEMU_BLOCKJOB_STATE_PIVOTING && success)
                qemuBlockJobProcessEventConcludedCopyPivot(driver, vm, job, asyncJob);
            else
                qemuBlockJobProcessEventConcludedCopyAbort(driver, vm, job, asyncJob);
            break;
    
              Since job->state is QEMU_BLOCKJOB_STATE_READY,
              qemuBlockJobProcessEventConcludedCopyAbort is called.
    
    This patch forbids transitions to QEMU_BLOCKJOB_STATE_READY if the
    previous job state isn't QEMU_BLOCKJOB_STATE_RUNNING or
    QEMU_BLOCKJOB_STATE_NEW.
    
    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1951507
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Michal Privoznik <mprivozn>
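
For illustration, the effect of the guard in an abstract Python sketch (again not
the actual C change): a 'ready' notification is only honoured when the job was
previously running or new, so a late 'ready' blip can no longer clobber the
pivoting state.

NEW, RUNNING, READY, PIVOTING = "new", "running", "ready", "pivoting"

def process_ready_event(current_state: str) -> str:
    """State after a 'ready' notification from qemu, with the patch's guard."""
    if current_state in (RUNNING, NEW):
        return READY                    # legitimate transition into 'ready'
    return current_state                # ignore the late blip, e.g. keep 'pivoting'

assert process_ready_event(RUNNING) == READY
assert process_ready_event(PIVOTING) == PIVOTING    # the fix: pivoting is preserved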

Comment 5 Nir Soffer 2021-04-22 13:01:42 UTC
Peter thanks for the quick fix!

Do you known when we can have a build (maybe scratch build?)
with this fix, and when it should be released?

Comment 6 Eric Hadley 2021-04-23 15:40:13 UTC
Could you please add the upstream version (fixed in which release)?

Comment 8 Peter Krempa 2021-04-26 12:51:06 UTC
(In reply to Eric Hadley from comment #6)
> Could you please add the upstream version (fixed in which release)?

The commit mentioned above will be released in libvirt-7.3 which is due to be released on 2021/05/03.

Comment 10 Meina Li 2021-04-27 06:24:09 UTC
Still can't reproduce it on:
libvirt-6.6.0-13.2.module+el8.3.1+10483+85317cf0.x86_64
qemu-kvm-5.1.0-21.module+el8.3.1+10464+8ad18d1a.x86_64

Test Steps:
# cat test.sh
#!/bin/bash

DOMAIN=lmn
DISK=vda
TDISK="/dev/sdb"
LOG_FILE="/var/log/libvirt/libvirtd.log"
FIND_STR="not assigned to domain"

rm -rf /tmp/$DOMAIN.s*
for i in {1..100}
do
    echo "------$i------"
    virsh destroy $DOMAIN
    virsh create $DOMAIN.xml;
    sleep 3;
    virsh snapshot-create-as $DOMAIN --disk-only --diskspec $DISK,file=/tmp/$DOMAIN.s$i;
    virsh blockcopy $DOMAIN $DISK $TDISK --wait --verbose;
    sleep 2;
    virsh blockjob $DOMAIN $DISK --pivot;
    virsh dumpxml $DOMAIN | grep /disk -B8
    virsh domblkthreshold $DOMAIN $DISK 100 
    virsh domstats $DOMAIN --block --backing
    if [ `grep -c "$FIND_STR" $LOG_FILE` -ne '0' ]; then
        break
    else
        continue
    fi
done

# sh test.sh
...
------100------
Domain lmn destroyed

Domain lmn created from lmn.xml

Domain snapshot 1619503529 created
Block Copy: [100 %]
Now in mirroring phase

    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/dev/sdb' index='3'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
    </disk>

Domain: 'lmn'
  block.count=1
  block.0.name=vda
  block.0.path=/dev/sdb
  block.0.backingIndex=3
  block.0.rd.reqs=1
  block.0.rd.bytes=512
  block.0.rd.times=575213
  block.0.wr.reqs=0
  block.0.wr.bytes=0
  block.0.wr.times=0
  block.0.fl.reqs=0
  block.0.fl.times=0
  block.0.allocation=210042880
  block.0.capacity=209715200
  block.0.physical=0
  block.0.threshold=100

Comment 11 Peter Krempa 2021-04-27 12:51:24 UTC
Firstly the bug happens only when pivoting to the new image, which your script actually doesn't contain. You are missing either '--pivot' with 'virsh blockcopy' or a separate 'virsh blockjob --pivot' call.

Secondly, the original report outlines a _very_ rare race condition consisting of two issues intersecting at exactly the right time, which will probably be almost impossible to reproduce.

It's caused by the block job blipping to 'standby' and back to 'ready' at a very precise point. The blipping is described in https://bugzilla.redhat.com/show_bug.cgi?id=1945635, which is by itself rare, and it must be timed properly to hit inside the qemuDomainBlockJobAbort function after the job has been switched to synchronous handling. This will be almost impossible to achieve with a test script, as it requires a stressed system and even then it's rather sporadic.

Since the bug happens only when two 'ready' events are delivered to libvirt in specific places, I don't think I can offer any form of instrumentation that would make it more likely to happen, as qemu doesn't normally switch the job to the 'standby' state.

Comment 12 Nir Soffer 2021-04-27 13:10:01 UTC
This bug was found in ovirt system tests. I don't have exact data but it fails
about once per 50 runs. We don't know how to reproduce this reliably.

The issue of flipping between the "ready" and "standby" states is easier to reproduce:
with ovirt-stress[1] you can reproduce it in 1-2 hours in a virtualized environment.
Reproducing it on a real server with fast storage takes about one day.

I think the only way to test such issues is a stress test.

[1] https://gitlab.com/nirs/ovirt-stress/-/tree/master/delete-snapshot

Comment 14 yalzhang@redhat.com 2021-05-10 01:35:56 UTC
Hi Nir, could you please help to test the scratch build for rhel 8.4 on bug 1955667#c1? As we cannot reproduce it now, we will focus on the related regression test. Thank you in advance.

Comment 15 Nir Soffer 2021-05-10 13:16:58 UTC
(In reply to yalzhang from comment #14)
> Hi Nir, Could you please help to test the scratch build for rhel 8.4 on bug
> 1955667#c1? As we can not reproduce it now, and we will focus on the related
> regression test.Thank you in advance.

I cannot reproduce the issue. My logs came from ovirt system test runs where
it fails randomly (about once per 50 runs).

Maybe RHV QE can help by using this build in the automated tests.

Comment 19 Peter Krempa 2021-05-21 13:32:47 UTC
Note that https://bugzilla.redhat.com/show_bug.cgi?id=1962053 claims to be able to always reproduce the same issue with rhel-9 packages.

Comment 20 Meina Li 2021-05-25 03:08:49 UTC
Basic test on:
libvirt-7.3.0-1.module+el8.5.0+11004+f4810536.x86_64
qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d.x86_64

The test was passed with the following script:
# cat test.sh
#!/bin/bash

DOMAIN=lmn
DISK=vda
TDISK="/dev/sdc"
LOG_FILE="/var/log/libvirt/libvirtd.log"
FIND_STR="not assigned to domain"

rm -rf /tmp/$DOMAIN.s*
for i in {1..100}
do
    echo "------$i------"
    virsh destroy $DOMAIN
    virsh create $DOMAIN.xml;
    sleep 3;
    virsh snapshot-create-as $DOMAIN --disk-only --diskspec $DISK,file=/tmp/$DOMAIN.s$i;
    virsh blockcopy $DOMAIN $DISK --blockdev $TDISK --wait --verbose;
    sleep 2;
    virsh blockjob $DOMAIN $DISK --pivot;
    virsh dumpxml $DOMAIN | grep /disk -B8
    virsh domblkthreshold $DOMAIN $DISK 100 
    virsh domstats $DOMAIN --block --backing
    if [ `grep -c "$FIND_STR" $LOG_FILE` -ne '0' ]; then
        break
    else
        continue
    fi
done

Comment 23 Meina Li 2021-05-27 01:29:46 UTC
According to comment 20 and comment 21, moving this bug to verified.

Comment 24 Peter Krempa 2021-07-23 07:46:56 UTC
*** Bug 1979015 has been marked as a duplicate of this bug. ***

Comment 25 Eyal Shenitzky 2021-09-02 12:18:48 UTC
*** Bug 1978713 has been marked as a duplicate of this bug. ***

Comment 27 errata-xmlrpc 2021-11-16 07:52:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4684

