Created attachment 1773723 [details]
libvirt debug log

Description of problem:

After a successful blockCopy, libvirt fails blockInfo for the target drive with:

    invalid argument: invalid path /path... not assigned to domain

In the libvirt log we can see that libvirt completes the job successfully and then tries to remove the target nodes. After that, blockInfo for that drive fails.

This breaks extending volumes in RHV, so the VM will pause when the disk becomes full and is unlikely to recover, since blockInfo() is called when extending a volume. This also breaks live merge for the same reason, and likely any API call referring to the drive path will fail.

Version-Release number of selected component (if applicable):
libvirt-daemon-6.6.0-13.el8
qemu-kvm-5.1.0-20.el8

How reproducible:
Fails randomly in oVirt system tests.

Steps to Reproduce:
1. Perform live storage migration in RHV when the source disk is on a file storage domain and the target disk is on a block storage domain.

Actual results:
The blockCopy job succeeds, but after that the drive path is "not assigned to domain".

Expected results:
The drive path is recognized after a successful blockCopy, or blockCopy fails so RHV can continue to use the old drive.

Additional info:
It looks like libvirt is confused when completing the block job, so this may also be a qemu bug changing the behavior in an incompatible way.

Tested with CentOS, not with RHEL; this is the environment used by oVirt system tests. This may affect current RHV customers using the same libvirt and qemu-kvm versions in RHEL AV 8.3.1.

## vdsm.log

1. Stopping disk replication, pivoting from file volume to block volume

2021-04-18 15:41:17,264+0200 DEBUG (jsonrpc/2) [virt.vm] (vmId='8020459c-ba31-48f5-9ad1-3af792151dbf') Stopping the disk replication switching to the destination drive: {'imageID': 'd9fe3172-512d-45d6-b433-a834140fd268', 'poolID': '044320e3-7070-4cee-a654-8d4b19ce2cfd', 'volumeID': 'dd9b9a5b-65ae-4463-9a62-2f01e73cebf8', 'device': 'disk', 'domainID': '44fd974e-6466-45aa-b932-67c2732d0ae8'} (vm:4382)

2. Libvirt reports that the block job is ready

2021-04-18 15:41:17,301+0200 INFO (libvirt/events) [virt.vm] (vmId='8020459c-ba31-48f5-9ad1-3af792151dbf') Block job (untracked) type COPY for drive vdb is ready (vm:5675)

3. Libvirt reports the job has completed

2021-04-18 15:41:17,326+0200 INFO (libvirt/events) [virt.vm] (vmId='8020459c-ba31-48f5-9ad1-3af792151dbf') Block job (untracked) type COPY for drive vdb has completed (vm:5666)

4. The libvirt call succeeded, vdsm updates the drive type

2021-04-18 15:41:17,351+0200 DEBUG (jsonrpc/2) [virt.vm] (vmId='8020459c-ba31-48f5-9ad1-3af792151dbf') Drive vdb type changed from 'file' to 'block' (storage:528)

2021-04-18 15:41:17,351+0200 DEBUG (jsonrpc/2) [virt.vm] (vmId='8020459c-ba31-48f5-9ad1-3af792151dbf') Drive vdb move from '/rhev/data-center/mnt/192.168.202.2:_exports_nfs_share1/e80b1b07-12c2-4308-a219-d430065dc40e/images/d9fe3172-512d-45d6-b433-a834140fd268/dd9b9a5b-65ae-4463-9a62-2f01e73cebf8' to '/rhev/data-center/mnt/blockSD/44fd974e-6466-45aa-b932-67c2732d0ae8/images/d9fe3172-512d-45d6-b433-a834140fd268/dd9b9a5b-65ae-4463-9a62-2f01e73cebf8', unsetting threshold (storage:440)

5. Vdsm reports success

2021-04-18 15:41:17,360+0200 DEBUG (jsonrpc/2) [api] FINISH diskReplicateFinish response={'status': {'code': 0, 'message': 'Done'}} (api:137)
6. When vdsm tries to set the block threshold, libvirt fails

2021-04-18 15:41:17,999+0200 ERROR (periodic/1) [virt.vm] (vmId='8020459c-ba31-48f5-9ad1-3af792151dbf') Unable to get watermarks for drive vdb: invalid argument: invalid path /rhev/data-center/mnt/blockSD/44fd974e-6466-45aa-b932-67c2732d0ae8/images/d9fe3172-512d-45d6-b433-a834140fd268/dd9b9a5b-65ae-4463-9a62-2f01e73cebf8 not assigned to domain (vm:1318)
...
2021-04-18 15:41:20,003+0200 ERROR (periodic/0) [virt.vm] (vmId='8020459c-ba31-48f5-9ad1-3af792151dbf') Unable to get watermarks for drive vdb: invalid argument: invalid path /rhev/data-center/mnt/blockSD/44fd974e-6466-45aa-b932-67c2732d0ae8/images/d9fe3172-512d-45d6-b433-a834140fd268/dd9b9a5b-65ae-4463-9a62-2f01e73cebf8 not assigned to domain (vm:1318)
...
2021-04-18 15:41:22,000+0200 ERROR (periodic/1) [virt.vm] (vmId='8020459c-ba31-48f5-9ad1-3af792151dbf') Unable to get watermarks for drive vdb: invalid argument: invalid path /rhev/data-center/mnt/blockSD/44fd974e-6466-45aa-b932-67c2732d0ae8/images/d9fe3172-512d-45d6-b433-a834140fd268/dd9b9a5b-65ae-4463-9a62-2f01e73cebf8 not assigned to domain (vm:1318)

## libvirtd.log

1. Creating a snapshot on the block device

2021-04-18 13:40:34.235+0000: 10892: debug : virDomainSnapshotCreateXML:221 : dom=0x7f01400140b0, (VM: name=vm0, uuid=8020459c-ba31-48f5-9ad1-3af792151dbf), xmlDesc=<?xml version='1.0' encoding='utf-8'?>
<domainsnapshot><disks><disk name="vdb" snapshot="external" type="file"><source file="/rhev/data-center/mnt/192.168.202.2:_exports_nfs_share1/e80b1b07-12c2-4308-a219-d430065dc40e/images/d9fe3172-512d-45d6-b433-a834140fd268/dd9b9a5b-65ae-4463-9a62-2f01e73cebf8" type="file"><seclabel model="dac" relabel="no" type="none" /></source></disk></disks></domainsnapshot>, flags=0x34

2. Adding the new device as libvirt-5-storage (source)

2021-04-18 13:40:34.263+0000: 10892: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f0120022180 msg={"execute":"blockdev-add","arguments":{"driver":"file","filename":"/rhev/data-center/mnt/192.168.202.2:_exports_nfs_share1/e80b1b07-12c2-4308-a219-d430065dc40e/images/d9fe3172-512d-45d6-b433-a834140fd268/dd9b9a5b-65ae-4463-9a62-2f01e73cebf8","aio":"threads","node-name":"libvirt-5-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"},"id":"libvirt-444"} fd=-1

3. Starting the block copy from source to target

2021-04-18 13:41:10.813+0000: 10891: debug : virDomainBlockCopy:10374 : dom=0x7f0130005330, (VM: name=vm0, uuid=8020459c-ba31-48f5-9ad1-3af792151dbf), disk=vdb, destxml=<?xml version='1.0' encoding='utf-8'?>
<disk device="disk" snapshot="no" type="block"><source dev="/rhev/data-center/mnt/blockSD/44fd974e-6466-45aa-b932-67c2732d0ae8/images/d9fe3172-512d-45d6-b433-a834140fd268/dd9b9a5b-65ae-4463-9a62-2f01e73cebf8"><seclabel model="dac" relabel="no" type="none" /></source><driver cache="none" error_policy="stop" io="native" name="qemu" type="qcow2" /></disk>, params=(nil), nparams=0, flags=0x7

4. Adding libvirt-6-storage (target)

2021-04-18 13:41:10.874+0000: 10891: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f0120022180 msg={"execute":"blockdev-add","arguments":{"driver":"host_device","filename":"/rhev/data-center/mnt/blockSD/44fd974e-6466-45aa-b932-67c2732d0ae8/images/d9fe3172-512d-45d6-b433-a834140fd268/dd9b9a5b-65ae-4463-9a62-2f01e73cebf8","aio":"threads","node-name":"libvirt-6-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"},"id":"libvirt-456"} fd=-1
5. Starting block job copy-vdb-libvirt-5-format

2021-04-18 13:41:10.896+0000: 10891: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f0120022180 msg={"execute":"blockdev-mirror","arguments":{"job-id":"copy-vdb-libvirt-5-format","device":"libvirt-5-format","target":"libvirt-6-format","sync":"top","auto-finalize":true,"auto-dismiss":false},"id":"libvirt-458"} fd=-1

6. Trying to abort the block copy block job, switching to the target (flags=VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)

2021-04-18 13:41:17.265+0000: 10893: debug : virDomainBlockJobAbort:9898 : dom=0x7f013800b500, (VM: name=vm0, uuid=8020459c-ba31-48f5-9ad1-3af792151dbf), disk=vdb, flags=0x2

2021-04-18 13:41:17.265+0000: 10893: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f0120022180 msg={"execute":"blockdev-add","arguments":{"driver":"host_device","filename":"/rhev/data-center/mnt/blockSD/44fd974e-6466-45aa-b932-67c2732d0ae8/images/d9fe3172-512d-45d6-b433-a834140fd268/a8f66889-b221-43b8-a02c-43c983a0451c","aio":"threads","node-name":"libvirt-7-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"},"id":"libvirt-469"} fd=-1

2021-04-18 13:41:17.275+0000: 19074: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"return": {}, "id": "libvirt-469"}]

2021-04-18 13:41:17.275+0000: 19074: info : qemuMonitorJSONIOProcessLine:240 : QEMU_MONITOR_RECV_REPLY: mon=0x7f0120022180 reply={"return": {}, "id": "libvirt-469"}

2021-04-18 13:41:17.276+0000: 10893: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f0120022180 msg={"execute":"blockdev-add","arguments":{"node-name":"libvirt-7-format","read-only":true,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-7-storage","backing":null},"id":"libvirt-470"} fd=-1

2021-04-18 13:41:17.286+0000: 19074: info : qemuMonitorJSONIOProcessLine:240 : QEMU_MONITOR_RECV_REPLY: mon=0x7f0120022180 reply={"return": {}, "id": "libvirt-470"}

2021-04-18 13:41:17.286+0000: 10893: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f0120022180 msg={"execute":"transaction","arguments":{"actions":[{"type":"blockdev-snapshot","data":{"node":"libvirt-7-format","overlay":"libvirt-6-format"}}]},"id":"libvirt-471"} fd=-1

7. The copy block job moves to the standby state?

2021-04-18 13:41:17.289+0000: 19074: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f0120022180 event={"timestamp": {"seconds": 1618753277, "microseconds": 288831}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "copy-vdb-libvirt-5-format"}}

8. The copy block job moves back to the ready state

2021-04-18 13:41:17.289+0000: 19074: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f0120022180 event={"timestamp": {"seconds": 1618753277, "microseconds": 289019}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "copy-vdb-libvirt-5-format"}}

2021-04-18 13:41:17.292+0000: 19074: info : qemuMonitorJSONIOProcessLine:240 : QEMU_MONITOR_RECV_REPLY: mon=0x7f0120022180 reply={"return": {}, "id": "libvirt-471"}

2021-04-18 13:41:17.293+0000: 10893: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f0120022180 msg={"execute":"transaction","arguments":{"actions":[{"type":"block-dirty-bitmap-add","data":{"node":"libvirt-6-format","name":"libvirt-tmp-activewrite","persistent":false,"disabled":false}}]},"id":"libvirt-472"} fd=-1

2021-04-18 13:41:17.296+0000: 19074: info : qemuMonitorJSONIOProcessLine:240 : QEMU_MONITOR_RECV_REPLY: mon=0x7f0120022180 reply={"return": {}, "id": "libvirt-472"}
9. Libvirt completes the copy block job

2021-04-18 13:41:17.297+0000: 10893: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f0120022180 msg={"execute":"job-complete","arguments":{"id":"copy-vdb-libvirt-5-format"},"id":"libvirt-473"} fd=-1

2021-04-18 13:41:17.298+0000: 19074: info : qemuMonitorJSONIOProcessLine:240 : QEMU_MONITOR_RECV_REPLY: mon=0x7f0120022180 reply={"return": {}, "id": "libvirt-473"}

2021-04-18 13:41:17.301+0000: 19074: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f0120022180 event={"timestamp": {"seconds": 1618753277, "microseconds": 300701}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "copy-vdb-libvirt-5-format"}}

2021-04-18 13:41:17.301+0000: 19074: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f0120022180 event={"timestamp": {"seconds": 1618753277, "microseconds": 300762}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "copy-vdb-libvirt-5-format"}}

2021-04-18 13:41:17.301+0000: 19074: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f0120022180 event={"timestamp": {"seconds": 1618753277, "microseconds": 300879}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "copy-vdb-libvirt-5-format", "len": 0, "offset": 0, "speed": 0, "type": "mirror"}}

2021-04-18 13:41:17.301+0000: 19074: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f0120022180 event={"timestamp": {"seconds": 1618753277, "microseconds": 300917}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "copy-vdb-libvirt-5-format"}}

2021-04-18 13:41:17.302+0000: 10893: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f0120022180 msg={"execute":"job-dismiss","arguments":{"id":"copy-vdb-libvirt-5-format"},"id":"libvirt-475"} fd=-1

2021-04-18 13:41:17.305+0000: 19074: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f0120022180 event={"timestamp": {"seconds": 1618753277, "microseconds": 304257}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "copy-vdb-libvirt-5-format"}}

2021-04-18 13:41:17.306+0000: 10893: debug : qemuBlockJobProcessEventConcludedCopyAbort:1402 : copy job 'copy-vdb-libvirt-5-format' on VM 'vm0' aborted

10. Libvirt tries to delete the target node?!

2021-04-18 13:41:17.306+0000: 10893: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f0120022180 msg={"execute":"blockdev-del","arguments":{"node-name":"libvirt-6-format"},"id":"libvirt-476"} fd=-1

11. The operation fails: the device is in use after the successful pivot.

2021-04-18 13:41:17.308+0000: 19074: info : qemuMonitorJSONIOProcessLine:240 : QEMU_MONITOR_RECV_REPLY: mon=0x7f0120022180 reply={"id": "libvirt-476", "error": {"class": "GenericError", "desc": "Node libvirt-6-format is in use"}}

2021-04-18 13:41:17.308+0000: 10893: error : qemuMonitorJSONCheckErrorFull:416 : internal error: unable to execute QEMU command 'blockdev-del': Node libvirt-6-format is in use

12. Libvirt tries to delete the target node's backing chain?!

2021-04-18 13:41:17.310+0000: 10893: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f0120022180 msg={"execute":"blockdev-del","arguments":{"node-name":"libvirt-7-format"},"id":"libvirt-478"} fd=-1

13. The operation fails since the node is in use.
2021-04-18 13:41:17.311+0000: 19074: info : qemuMonitorJSONIOProcessLine:240 : QEMU_MONITOR_RECV_REPLY: mon=0x7f0120022180 reply={"id": "libvirt-478", "error": {"class": "GenericError", "desc": "Node 'libvirt-7-format' is busy: node is used as backing hd of 'libvirt-6-format'"}}

14. Accessing the new device fails:

2021-04-18 13:41:17.998+0000: 10894: error : qemuDomainGetBlockInfo:11935 : invalid argument: invalid path /rhev/data-center/mnt/blockSD/44fd974e-6466-45aa-b932-67c2732d0ae8/images/d9fe3172-512d-45d6-b433-a834140fd268/dd9b9a5b-65ae-4463-9a62-2f01e73cebf8 not assigned to domain

This error repeats many times.
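For reference, a rough sketch of the libvirt API sequence involved, using the Python bindings. This is illustrative only, not the actual vdsm code; the domain name, disk name, XML and path are placeholders based on the logs above.

import libvirt

DOM_NAME = "vm0"                      # placeholder
DISK = "vdb"                          # placeholder
NEW_PATH = "/rhev/data-center/mnt/blockSD/.../dd9b9a5b-65ae-4463-9a62-2f01e73cebf8"  # placeholder
DEST_XML = "<disk device='disk' type='block'>...</disk>"  # placeholder target disk XML

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName(DOM_NAME)

# Start the copy (flags=0x7 in the log: SHALLOW | REUSE_EXT | TRANSIENT_JOB).
dom.blockCopy(DISK, DEST_XML,
              flags=libvirt.VIR_DOMAIN_BLOCK_COPY_SHALLOW |
                    libvirt.VIR_DOMAIN_BLOCK_COPY_REUSE_EXT |
                    libvirt.VIR_DOMAIN_BLOCK_COPY_TRANSIENT_JOB)

# ... wait for the block job to become ready, then pivot (flags=0x2 in the log).
dom.blockJobAbort(DISK, flags=libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)

# After a successful pivot both calls should work against the new path; with
# this bug they fail with "invalid path ... not assigned to domain".
dom.blockInfo(NEW_PATH)            # used by RHV when extending a volume
dom.setBlockThreshold(DISK, 100)   # used by vdsm to get threshold events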
Created attachment 1773724 [details]
libvirt/qemu vm log
Created attachment 1773725 [details]
vdsm log
In this instance libvirt is getting confused by an unfortunate serialization of events. The bug is triggered by the re-issued 'ready' state, which is re-instated after a 'standby' state that happens after we have already entered qemuDomainBlockJobAbort and switched the block job into synchronous mode. In synchronous mode the ordering allows the late 'ready' event to clear the QEMU_BLOCKJOB_STATE_PIVOTING state. The block job finalization code then uses QEMU_BLOCKJOB_STATE_PIVOTING to determine what to do next, and since the state is NOT PIVOTING we call the wrong handler, which executes the steps as if the job had been aborted.
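To make the ordering concrete, here is a simplified Python model of the state handling described above. It is a toy sketch, not the actual libvirt C code; the names only mirror the QEMU_BLOCKJOB_STATE_* constants.

# Toy model of the race described above -- not the actual libvirt code.
RUNNING, READY, STANDBY, PIVOTING, CONCLUDED = range(5)

class CopyJob:
    def __init__(self):
        self.state = READY            # job already reported 'ready' (step 1)

    def process_ready_event(self):
        # Pre-fix behaviour: a buffered 'ready' event overwrites the current
        # state even if we already switched to PIVOTING (step 7.3).
        self.state = READY

    def concluded_action(self, success=True):
        # Mirrors the COPY case in qemuBlockJobEventProcessConcludedTransition:
        if self.state == PIVOTING and success:
            return "pivot"            # keep the new (target) image
        return "abort"                # tear down the target nodes

job = CopyJob()
job.state = PIVOTING                  # step 5.2: 'job-complete' issued, state set to PIVOTING
job.process_ready_event()             # step 7: stale 'ready' from the standby/ready blip
print(job.concluded_action())         # prints "abort" -> the copy target is torn down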
Fixed upstream:

commit 81e770c2f152635219cec76585c770fb9f6ce030
Author: Peter Krempa <pkrempa>
Date:   Tue Apr 20 12:58:28 2021 +0200

    qemu: blockjob: Transition into 'ready' state only from expected states

    In certain rare occasions qemu can transition a block job which was
    already 'ready' into 'standby' and then back. If this happens in the
    following order libvirt will get confused about the actual job state:

    1) the block copy job is 'ready' (job->state == QEMU_BLOCKJOB_STATE_READY)

    2) user calls qemuDomainBlockJobAbort with VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT
       flag but without VIR_DOMAIN_BLOCK_JOB_ABORT_ASYNC

    3) the block job is switched to synchronous event handling

    4) the block job blips to 'standby' and back to 'ready', the event is
       not processed since the blockjob is in sync mode for now

    5) qemuDomainBlockJobPivot is called:
       5.1) 'job-complete' QMP command is issued
       5.2) job->state is set to QEMU_BLOCKJOB_STATE_PIVOTING

    6) code for synchronous-wait for the job completion in
       qemuDomainBlockJobAbort is invoked

    7) the waiting loop calls qemuBlockJobUpdate:
       7.1) job->newstate is QEMU_BLOCKJOB_STATE_READY due to 4)
       7.2) qemuBlockJobEventProcess is called
       7.3) the handler for QEMU_BLOCKJOB_STATE_READY overwrites job->state
            from QEMU_BLOCKJOB_STATE_PIVOTING to QEMU_BLOCKJOB_STATE_READY

    8) qemuDomainBlockJobAbort is looking for a finished job, so waits again

    9) qemu finishes the blockjob and transitions it into 'concluded' state

    10) qemuBlockJobUpdate is triggered again, this time finalizing the job.
        10.1) job->newstate is QEMU_BLOCKJOB_STATE_CONCLUDED
              job->state is QEMU_BLOCKJOB_STATE_READY
        10.2) qemuBlockJobEventProcessConcluded is called, the function checks
              whether there was an error with the blockjob. Since there was no
              error job->newstate becomes QEMU_BLOCKJOB_STATE_COMPLETED.
        10.3) qemuBlockJobEventProcessConcludedTransition selects the action
              for the appropriate block job type where we have:

              case QEMU_BLOCKJOB_TYPE_COPY:
                  if (job->state == QEMU_BLOCKJOB_STATE_PIVOTING && success)
                      qemuBlockJobProcessEventConcludedCopyPivot(driver, vm, job, asyncJob);
                  else
                      qemuBlockJobProcessEventConcludedCopyAbort(driver, vm, job, asyncJob);
                  break;

              Since job->state is QEMU_BLOCKJOB_STATE_READY,
              qemuBlockJobProcessEventConcludedCopyAbort is called.

    This patch forbids transitions to QEMU_BLOCKJOB_STATE_READY if the
    previous job state isn't QEMU_BLOCKJOB_STATE_RUNNING or
    QEMU_BLOCKJOB_STATE_NEW.

    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1951507
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Michal Privoznik <mprivozn>
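In terms of a toy model, the guard added by the patch behaves roughly like this. This is a hedged sketch only; the real change lives in libvirt's C block job event handling.

# Self-contained sketch of the guarded transition (toy model, not libvirt C):
NEW, RUNNING, READY, PIVOTING = "new", "running", "ready", "pivoting"

def on_ready_event(state):
    # The fix: only transition to READY from RUNNING or NEW; otherwise keep
    # the current state, so a stale 'ready' event cannot clobber PIVOTING.
    return READY if state in (RUNNING, NEW) else state

assert on_ready_event(RUNNING) == READY
assert on_ready_event(PIVOTING) == PIVOTING   # the stale blip no longer resets the state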
Peter, thanks for the quick fix! Do you know when we can have a build (maybe a scratch build?) with this fix, and when it should be released?
Could you please add the upstream version (fixed in which release)?
(In reply to Eric Hadley from comment #6)
> Could you please add the upstream version (fixed in which release)?

The commit mentioned above will be part of libvirt-7.3, which is due to be released on 2021/05/03.
Still can't reproduce it on:
libvirt-6.6.0-13.2.module+el8.3.1+10483+85317cf0.x86_64
qemu-kvm-5.1.0-21.module+el8.3.1+10464+8ad18d1a.x86_64

Test Steps:

# cat test.sh
#!/bin/bash

DOMAIN=lmn
DISK=vda
TDISK="/dev/sdb"
LOG_FILE="/var/log/libvirt/libvirtd.log"
FIND_STR="not assigned to domain"

rm -rf /tmp/$DOMAIN.s*

for i in {1..100}
do
    echo "------$i------"
    virsh destroy $DOMAIN
    virsh create $DOMAIN.xml; sleep 3;
    virsh snapshot-create-as $DOMAIN --disk-only --diskspec $DISK,file=/tmp/$DOMAIN.s$i;
    virsh blockcopy $DOMAIN $DISK $TDISK --wait --verbose; sleep 2;
    virsh blockjob $DOMAIN $DISK --pivot;
    virsh dumpxml $DOMAIN | grep /disk -B8
    virsh domblkthreshold $DOMAIN $DISK 100
    virsh domstats $DOMAIN --block --backing
    if [ `grep -c "$FIND_STR" $LOG_FILE` -ne '0' ]; then
        break
    else
        continue
    fi
done

# sh test.sh
...
------100------
Domain lmn destroyed

Domain lmn created from lmn.xml

Domain snapshot 1619503529 created
Block Copy: [100 %]
Now in mirroring phase

    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/dev/sdb' index='3'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
    </disk>
Domain: 'lmn'
  block.count=1
  block.0.name=vda
  block.0.path=/dev/sdb
  block.0.backingIndex=3
  block.0.rd.reqs=1
  block.0.rd.bytes=512
  block.0.rd.times=575213
  block.0.wr.reqs=0
  block.0.wr.bytes=0
  block.0.wr.times=0
  block.0.fl.reqs=0
  block.0.fl.times=0
  block.0.allocation=210042880
  block.0.capacity=209715200
  block.0.physical=0
  block.0.threshold=100
Firstly, the bug happens only when pivoting to the new image, which your script actually doesn't contain. You are missing either '--pivot' with 'virsh blockcopy' or a separate 'virsh blockjob --pivot' call.

Secondly, the original report outlines a _very_ rare race condition consisting of two issues intersecting at exactly the right time, which will probably be almost impossible to reproduce. It's caused by the block job blipping to 'standby' and back to 'ready' at a very precise point. The blipping is described in https://bugzilla.redhat.com/show_bug.cgi?id=1945635, which by itself is rare, and it must be timed properly to hit inside the qemuDomainBlockJobAbort function while the job is switched to synchronous handling. This will be almost impossible to achieve with a test script, as it requires a stressed system and even then it's rather sporadic.

Since the bug happens only when two 'ready' events are delivered to libvirt at specific points, I don't think I can offer any form of instrumentation that would make it more likely to happen, as qemu normally doesn't switch the job to the 'standby' state.
This bug was found in oVirt system tests. I don't have exact data, but it fails about once per 50 runs. We don't know how to reproduce this reliably.

The issue of flipping between the "ready" and "standby" states is easier to reproduce: with ovirt-stress[1] you can reproduce it in 1-2 hours in a virtualized environment. Reproducing it on a real server with fast storage takes about one day. I think the only way to test such issues is a stress test.

[1] https://gitlab.com/nirs/ovirt-stress/-/tree/master/delete-snapshot
Hi Nir, could you please help test the scratch build for RHEL 8.4 on bug 1955667#c1? As we cannot reproduce it now, we will focus on the related regression testing. Thank you in advance.
(In reply to yalzhang from comment #14)
> Hi Nir, could you please help test the scratch build for RHEL 8.4 on bug
> 1955667#c1? As we cannot reproduce it now, we will focus on the related
> regression testing. Thank you in advance.

I cannot reproduce the issue. My logs came from oVirt system test runs where it fails randomly (about once per 50 runs).

Maybe RHV QE can help by using this build in the automated tests.
Note that https://bugzilla.redhat.com/show_bug.cgi?id=1962053 claims to be able to always reproduce the same issue with rhel-9 packages.
Basic test on:
libvirt-7.3.0-1.module+el8.5.0+11004+f4810536.x86_64
qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d.x86_64

The test passed with the following script:

# cat test.sh
#!/bin/bash

DOMAIN=lmn
DISK=vda
TDISK="/dev/sdc"
LOG_FILE="/var/log/libvirt/libvirtd.log"
FIND_STR="not assigned to domain"

rm -rf /tmp/$DOMAIN.s*

for i in {1..100}
do
    echo "------$i------"
    virsh destroy $DOMAIN
    virsh create $DOMAIN.xml; sleep 3;
    virsh snapshot-create-as $DOMAIN --disk-only --diskspec $DISK,file=/tmp/$DOMAIN.s$i;
    virsh blockcopy $DOMAIN $DISK --blockdev $TDISK --wait --verbose; sleep 2;
    virsh blockjob $DOMAIN $DISK --pivot;
    virsh dumpxml $DOMAIN | grep /disk -B8
    virsh domblkthreshold $DOMAIN $DISK 100
    virsh domstats $DOMAIN --block --backing
    if [ `grep -c "$FIND_STR" $LOG_FILE` -ne '0' ]; then
        break
    else
        continue
    fi
done
According to comment 20 and comment 21, moving this bug to VERIFIED.
*** Bug 1979015 has been marked as a duplicate of this bug. ***
*** Bug 1978713 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:4684