Bug 1892672 - Backup hangs after the user powers off the VM from inside the guest OS
Summary: Backup hangs after the user powers off the VM from inside the guest OS
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backup-Restore.VMs
Version: 4.4.1.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.6
Target Release: ---
Assignee: Nir Soffer
QA Contact: Amit Sharir
Docs Contact: bugs@ovirt.org
URL:
Whiteboard:
Depends On: 1900505 1918968
Blocks:
 
Reported: 2020-10-29 12:29 UTC by Yuriy Khokhlov (Veeam)
Modified: 2021-06-02 15:03 UTC (History)
CC List: 8 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2021-05-05 05:36:36 UTC
oVirt Team: Storage
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 114307 0 None MERGED spec: update qemu-kvm requirement 2021-04-19 06:41:20 UTC

Description Yuriy Khokhlov (Veeam) 2020-10-29 12:29:44 UTC
Description of problem:
Backup hangs after the user powers off the VM from inside the guest OS.

Version-Release number of selected component (if applicable):
Two RHEV 4.4 nodes with layer rhvh-4.4.1.1-0.20200722.0+1 
oVirt Manager 4.4.1
NFS 4.1 storage domain
ImageIO version:
# rpm -qa | grep image
ovirt-imageio-client-2.0.9-1.el8ev.x86_64
genisoimage-1.1.11-39.el8.x86_64
ovirt-imageio-common-2.0.9-1.el8ev.x86_64
ovirt-imageio-daemon-2.0.9-1.el8ev.x86_64
redhat-virtualization-host-image-update-placeholder-4.4.1-1.el8ev.noarch


How reproducible:


Steps to Reproduce:
1. Run a full backup and wait until the backup starts processing the disk.
2. Run the poweroff command inside the guest.

Actual results:
The backup hangs. The image transfer stops sending data.

Expected results:
The backup should complete successfully and the image transfer should send data normally.

Additional info:
We will provide logs soon.

Comment 1 RHEL Program Management 2020-10-29 21:39:51 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 2 Eyal Shenitzky 2020-11-03 11:55:35 UTC
Nir,

Should we add a mechanism that detects whether the VM state changed during an image transfer for backup disks?

Comment 3 Nir Soffer 2020-11-03 12:31:06 UTC
(In reply to Eyal Shenitzky from comment #2)
qemu will terminate when the guest is powered off from within the guest.
This is something we do not control. We need to file an RFE for libvirt
to support keeping qemu running in this case. That would allow completing
the backup.

For now, if qemu terminates during a backup, the image transfer will fail,
and the user can retry the transfer. If retrying fails, the user must
finalize the backup and try again later.

We can document this limitation.
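
A minimal client-side sketch of that retry-then-finalize flow (an assumption, not the SDK example's actual logic; download_disk is a stand-in for a helper like the one in sdk/examples/backup_vm.py):

import time


def download_with_retry(connection, vm_id, backup_id, disk, download_disk,
                        retries=3, delay=10):
    """Retry a failed image transfer a few times; if it keeps failing,
    finalize the backup so it can be tried again later.

    download_disk is a stand-in callable that creates an image transfer
    and downloads the disk, like sdk/examples/backup_vm.py does.
    """
    backup_service = (
        connection.system_service()
        .vms_service()
        .vm_service(vm_id)
        .backups_service()
        .backup_service(backup_id))

    for attempt in range(1, retries + 1):
        try:
            download_disk(connection, backup_id, disk)
            return
        except Exception as e:
            # The download fails (e.g. with a RuntimeError from ovirt-imageio)
            # when qemu goes away during the transfer.
            print("Download attempt %d failed: %s" % (attempt, e))
            time.sleep(delay)

    # Retrying did not help: finalize the backup and try again later.
    backup_service.finalize()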

Comment 4 Nir Soffer 2020-11-22 11:45:28 UTC
(In reply to Yuriy Khokhlov (Veeam) from comment #0)

I reproduced it like this:

1. Start a VM with a raw disk
2. Start a backup
3. Wait until the download has started
4. In the guest, run "poweroff"

Here is the client log:

$ ./backup_vm.py -c engine-dev full --backup-dir /var/tmp/backups/qcow2 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0
[   0.0 ] Starting full backup for VM 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0
[   0.7 ] Waiting until backup 5c4407d2-f0ba-4381-968f-6077c2787d0b is ready
[   1.8 ] Creating image transfer for disk 126eea31-c5a2-4c01-a18d-9822b0c05c2a
[   3.1 ] Image transfer efa8a20d-d766-486b-b29d-23cb2f82eb5d is ready
[  79.57% ] 4.77 GiB, 4.98 seconds, 982.38 MiB/s                               
[   8.1 ] Finalizing image transfer

(This hung for several minutes, until I interrupted the script)

^CTraceback (most recent call last):
  File "./backup_vm.py", line 177, in cmd_full
    download_backup(connection, backup.id, args)
  File "./backup_vm.py", line 355, in download_backup
    download_disk(connection, backup_uuid, disk, disk_path, args, incremental=incremental)
  File "./backup_vm.py", line 390, in download_disk
    client.download(
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/client/_api.py", line 172, in download
    io.copy(
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 69, in copy
    log.debug("Executor failed")
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 189, in __exit__
    self.stop()
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 166, in stop
    raise self._errors[0]
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 238, in _run
    handler.copy(req)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 282, in copy
    self._src.write_to(self._dst, req.length, self._buf)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 213, in write_to
    raise RuntimeError(
RuntimeError: Expected 134217728 bytes, got 33554432 bytes

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./backup_vm.py", line 429, in <module>
    main()
  File "./backup_vm.py", line 161, in main
    args.command(args)
  File "./backup_vm.py", line 179, in cmd_full
    stop_backup(connection, backup.id, args)
  File "./backup_vm.py", line 332, in stop_backup
    time.sleep(1)
KeyboardInterrupt

So:
- the download failed with the expected error when qemu terminated (OK)
- the client finalized the transfer (OK)
- the client hung trying to stop the backup


Looking at the engine log we see:

1. The user finalizes the transfer

2020-11-22 12:53:26,915+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (default task-1) [f0103c0
6-7c1a-480b-9db7-669db1ddf9b7] Updating image transfer efa8a20d-d766-486b-b29d-23cb2f82eb5d (image 126eea31-c5a2-4c01-a18d-982
2b0c05c2a) phase to Finalizing Success

2. The image transfer completes successfully

2020-11-22 12:53:36,352+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledEx
ecutorService-engineScheduledThreadPool-Thread-44) [733680b9-4c72-452f-8732-f0a697bc6cde] EVENT_ID: TRANSFER_IMAGE_SUCCEEDED(1
,032), Image Download with disk fedora-32.raw succeeded.

3. The engine stops the NBD server in vdsm

2020-11-22 12:53:27,175+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.StopNbdServerVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-24) [733680b9-4c72-452f-8732-f0a697bc6cde] START, StopNbdServerVDSCommand(HostName
 = host4, NbdServerVDSParameters:{hostId='9de3098d-89e8-444c-95ed-0fc2ea74e170', serverId='b04bf14e-a0d1-4ed6-9f54-a50c391a82ae', storageDomainId='null', imageId='null', volumeId='null', readonly='false', discard='false', backingChain='true'}), log id: 38aa9060                                                                                                                    
2020-11-22 12:53:27,192+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.StopNbdServerVDSCommand] (EE-ManagedScheduledExecu
torService-engineScheduledThreadPool-Thread-24) [733680b9-4c72-452f-8732-f0a697bc6cde] FINISH, StopNbdServerVDSCommand, return: , log id: 38aa9060

(This is a bug; the NBD server is not used in online backup)

4. The engine fails to terminate the backup

2020-11-22 12:53:58,015+02 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.StopVmBackupVDSCommand] (default task-1) [cf8e1bb2
-e32b-44df-ba0b-aa349fc8851f] Unexpected return value: Status [code=1600, message=Backup Error: {'vm_id': '4dc3bb16-f8d1-4f59-
9388-a93f68da7cf0', 'backup_id': '5c4407d2-f0ba-4381-968f-6077c2787d0b', 'reason': 'Failed to end VM backup: Timed out during
operation: cannot acquire state change lock (held by monitor=qemu-event)'}]

5. The engine reports that the backup is finalized

2020-11-22 12:53:58,032+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1) [cf8e
1bb2-e32b-44df-ba0b-aa349fc8851f] EVENT_ID: VM_BACKUP_FINALIZED(10,794), Backup <UNKNOWN> for VM backup-raw finalized (User: a
dmin@internal-authz).


In the engine event log we see:

VDSM host4 command StopVmBackupVDS failed: Backup Error: {'vm_id': '4dc3bb16-f8d1-4f59-9388-a93f68da7cf0', 'backup_id': '5c4407d2-f0ba-4381-968f-6077c2787d0b', 'reason': 'Failed to end VM backup: Timed out during operation: cannot acquire state change lock (held by monitor=qemu-event)'}

Backup <UNKNOWN> for VM backup-raw finalized (User: admin@internal-authz).


In the vdsm log we see:

2020-11-22 12:53:57,995+0200 ERROR (jsonrpc/7) [api] FINISH stop_backup error=Backup Error: {'vm_id': '4dc3bb16-f8d1-4f59-9388
-a93f68da7cf0', 'backup_id': '5c4407d2-f0ba-4381-968f-6077c2787d0b', 'reason': 'Failed to end VM backup: Timed out during oper
ation: cannot acquire state change lock (held by monitor=qemu-event)'} (api:131)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 553, in abortJob
    if ret == -1: raise libvirtError ('virDomainAbortJob() failed', dom=self)
libvirt.libvirtError: Timed out during operation: cannot acquire state change lock (held by monitor=qemu-event)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/backup.py", line 220, in stop_backup
    dom.abortJob()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 159, in call
    return getattr(self._vm._dom, name)(*a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 109, in f
    raise toe
vdsm.virt.virdomain.TimeoutError: Timed out during operation: cannot acquire state change lock (held by monitor=qemu-event)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 124, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/API.py", line 671, in stop_backup
    return self.vm.stop_backup(backup_id)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/backup.py", line 70, in wrapper
    return f(*a, **kw)
  File "<decorator-gen-265>", line 2, in stop_backup
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 101, in method
    return func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 4006, in stop_backup
    return backup.stop_backup(self, dom, backup_id=backup_id)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/backup.py", line 226, in stop_backup
    backup_id=backup_id)
vdsm.common.exception.BackupError: Backup Error: {'vm_id': '4dc3bb16-f8d1-4f59-9388-a93f68da7cf0', 'backup_id': '5c4407d2-f0ba
-4381-968f-6077c2787d0b', 'reason': 'Failed to end VM backup: Timed out during operation: cannot acquire state change lock (he
ld by monitor=qemu-event)'}

2020-11-22 12:53:58,008+0200 INFO  (jsonrpc/7) [api.virt] FINISH stop_backup return={'status': {'code': 1600, 'message': "Backup Error: {'vm_id': '4dc3bb16-f8d1-4f59-9388-a93f68da7cf0', 'backup_id': '5c4407d2-f0ba-4381-968f-6077c2787d0b', 'reason': 'Failed to end VM backup: Timed out during operation: cannot acquire state change lock (held by monitor=qemu-event)'}"}} from=::ffff:192.168.122.11,57176, flow_id=cf8e1bb2-e32b-44df-ba0b-aa349fc8851f, vmId=4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 (api:54)

We don't see any other attempt to stop this backup, and the scratch
disk was not removed.

# tree /var/lib/vdsm/storage/transient_disks/4dc3bb16-f8d1-4f59-9388-a93f68da7cf0/
/var/lib/vdsm/storage/transient_disks/4dc3bb16-f8d1-4f59-9388-a93f68da7cf0/
└── 5c4407d2-f0ba-4381-968f-6077c2787d0b.sda


The user code is stuck in this loop:

329     # TODO: Wait until backup is finalized.
330     backup = backup_service.get()
331     while backup.phase != types.BackupPhase.FINALIZING:
332         time.sleep(1)
333         backup = backup_service.get()


So we had:
- The guest powered off; we cannot prevent that and we cannot keep qemu
  running in this case, so the image transfer fails (Doc BUG)
- The image transfer fails with the expected error (OK)
- The engine stops the NBD server during an online backup (BUG)
- vdsm fails to stop the backup because of a timeout in libvirt (OK)
- The engine fails to stop the backup because vdsm failed (OK)
- The engine reports that the backup was finalized, without stopping
  the backup, leaving a leftover scratch disk (BUG)
- backup_vm.py (our example code) waits incorrectly for completion of the
  backup (BUG); see the sketch below
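
A minimal sketch of a more defensive wait loop (an assumption, not the upstream fix): keep the same FINALIZING condition the example uses, but bound the wait with a deadline and tolerate the backup entity disappearing, assuming the SDK raises ovirtsdk4.NotFoundError in that case:

import time

import ovirtsdk4 as sdk
from ovirtsdk4 import types


def wait_for_finalize(backup_service, timeout=300, poll_interval=1):
    # Wait until the backup reaches FINALIZING, but never forever.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            backup = backup_service.get()
        except sdk.NotFoundError:
            # The backup entity is gone from the engine; nothing to wait for.
            return
        if backup.phase == types.BackupPhase.FINALIZING:
            return
        time.sleep(poll_interval)
    raise RuntimeError("Timed out waiting for backup to finalize")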

Comment 5 Nir Soffer 2020-11-22 13:14:49 UTC
It is more interesting: qemu closed the NBD server and failed the backup,
but it did not terminate; it hung in shutdown state:

# virsh -r list
 Id   Name         State
--------------------------------
 1    backup-raw   in shutdown

# ps auxf | grep qemu-kvm | grep -v grep
qemu        9990  0.7 14.9 2776992 573688 ?      Sl   12:46   0:32 /usr/libexec/qemu-kvm -name guest=backup-raw,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-backup-raw/master-key.aes -machine pc-q35-rhel8.1.0,accel=kvm,usb=off,dump-guest-core=off -cpu Skylake-Client,hle=off,rtm=off,mpx=off -m size=1048576k,slots=16,maxmem=4194304k -overcommit mem-lock=off -smp 1,maxcpus=16,sockets=16,dies=1,cores=1,threads=1 -object iothread,id=iothread1 -numa node,nodeid=0,cpus=0-15,mem=1024 -uuid 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 -smbios type=1,manufacturer=oVirt,product=RHEL,version=8.3-1.0.el8,serial=d8635601-ea8b-4c5c-a624-41bd72b862d6,uuid=4dc3bb16-f8d1-4f59-9388-a93f68da7cf0,family=oVirt -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=37,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2020-11-22T10:46:53,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global ICH9-LPC.disable_s3=1 -global ICH9-LPC.disable_s4=1 -boot strict=on -device pcie-root-port,port=0x10,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x2 -device pcie-root-port,port=0x11,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1 -device pcie-root-port,port=0x12,chassis=3,id=pci.3,bus=pcie.0,addr=0x2.0x2 -device pcie-root-port,port=0x13,chassis=4,id=pci.4,bus=pcie.0,addr=0x2.0x3 -device pcie-root-port,port=0x14,chassis=5,id=pci.5,bus=pcie.0,addr=0x2.0x4 -device pcie-root-port,port=0x15,chassis=6,id=pci.6,bus=pcie.0,addr=0x2.0x5 -device pcie-root-port,port=0x16,chassis=7,id=pci.7,bus=pcie.0,addr=0x2.0x6 -device pcie-root-port,port=0x17,chassis=8,id=pci.8,bus=pcie.0,addr=0x2.0x7 -device pcie-root-port,port=0x18,chassis=9,id=pci.9,bus=pcie.0,multifunction=on,addr=0x3 -device pcie-root-port,port=0x19,chassis=10,id=pci.10,bus=pcie.0,addr=0x3.0x1 -device pcie-root-port,port=0x1a,chassis=11,id=pci.11,bus=pcie.0,addr=0x3.0x2 -device pcie-root-port,port=0x1b,chassis=12,id=pci.12,bus=pcie.0,addr=0x3.0x3 -device pcie-root-port,port=0x1c,chassis=13,id=pci.13,bus=pcie.0,addr=0x3.0x4 -device pcie-root-port,port=0x1d,chassis=14,id=pci.14,bus=pcie.0,addr=0x3.0x5 -device pcie-root-port,port=0x1e,chassis=15,id=pci.15,bus=pcie.0,addr=0x3.0x6 -device pcie-root-port,port=0x1f,chassis=16,id=pci.16,bus=pcie.0,addr=0x3.0x7 -device qemu-xhci,p2=8,p3=8,id=ua-eda09d92-c79c-4a6e-851c-70ec0cb902ce,bus=pci.4,addr=0x0 -device virtio-scsi-pci,iothread=iothread1,id=ua-6d18cf44-c8f5-4cd2-8de1-38b2b9fccc1c,bus=pci.2,addr=0x0 -device virtio-serial-pci,id=ua-73ba5fe8-c6ca-446e-8975-3af807045085,max_ports=16,bus=pci.3,addr=0x0 -device ide-cd,bus=ide.2,id=ua-0e36cc67-69e9-4e18-9ee4-ec78a93ceb79,werror=report,rerror=report -blockdev {"driver":"file","filename":"/rhev/data-center/mnt/storage:_export_1/839db77f-fde3-4e13-bfb6-56be604631ed/images/126eea31-c5a2-4c01-a18d-9822b0c05c2a/c7c8040e-597a-402f-b371-3f74510b6a27","aio":"threads","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-1-format","read-only":false,"discard":"unmap","cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"} -device scsi-hd,bus=ua-6d18cf44-c8f5-4cd2-8de1-38b2b9fccc1c.0,channel=0,scsi-id=0,lun=0,device_id=126eea31-c5a2-4c01-a18d-9822b0c05c2a,drive=libvirt-1-format,id=ua-126eea31-c5a2-4c01-a18d-9822b0c05c2a,bootindex=1,write-cache=on,serial=126eea31-c5a2-4c01-a18d-9822b0c05c2a,werror=stop,rerror=stop -netdev 
tap,fd=39,id=hostua-ae4ec1f0-b1d6-4b0c-ac9a-7e28c12a883b,vhost=on,vhostfd=40 -device virtio-net-pci,host_mtu=1500,netdev=hostua-ae4ec1f0-b1d6-4b0c-ac9a-7e28c12a883b,id=ua-ae4ec1f0-b1d6-4b0c-ac9a-7e28c12a883b,mac=56:6f:7a:5a:00:00,bus=pci.1,addr=0x0 -chardev socket,id=charchannel0,fd=41,server,nowait -device virtserialport,bus=ua-73ba5fe8-c6ca-446e-8975-3af807045085.0,nr=1,chardev=charchannel0,id=channel0,name=ovirt-guest-agent.0 -chardev socket,id=charchannel1,fd=42,server,nowait -device virtserialport,bus=ua-73ba5fe8-c6ca-446e-8975-3af807045085.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=ua-73ba5fe8-c6ca-446e-8975-3af807045085.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -device usb-tablet,id=input0,bus=ua-eda09d92-c79c-4a6e-851c-70ec0cb902ce.0,port=1 -spice port=5900,tls-port=5901,addr=192.168.122.23,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -object tls-creds-x509,id=vnc-tls-creds0,dir=/etc/pki/vdsm/libvirt-vnc,endpoint=server,verify-peer=no -vnc 192.168.122.23:2,password,tls-creds=vnc-tls-creds0 -k en-us -device qxl-vga,id=ua-7f3bd26a-a9d8-449e-8e32-70110d026f87,ram_size=67108864,vram_size=8388608,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pcie.0,addr=0x1 -device virtio-balloon-pci,id=ua-c98b71de-459a-428b-9e54-e53d8b6c0acd,bus=pci.5,addr=0x0 -object rng-random,id=objua-3b051a0c-bfa1-494f-9101-f9cac2cd54ed,filename=/dev/urandom -device virtio-rng-pci,rng=objua-3b051a0c-bfa1-494f-9101-f9cac2cd54ed,id=ua-3b051a0c-bfa1-494f-9101-f9cac2cd54ed,bus=pci.6,addr=0x0 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on


Logging more info in backup_vm.py:

diff --git a/sdk/examples/backup_vm.py b/sdk/examples/backup_vm.py
index fd40154..f871cb8 100755
--- a/sdk/examples/backup_vm.py
+++ b/sdk/examples/backup_vm.py
@@ -329,6 +329,7 @@ def stop_backup(connection, backup_uuid, args):
     # TODO: Wait until backup is finalized.
     backup = backup_service.get()
     while backup.phase != types.BackupPhase.FINALIZING:
+        print("backup.phase = %r, waiting" % backup.phase)
         time.sleep(1)
         backup = backup_service.get()
 

This shows that the backup still exists and did not change
its state, even though the engine logged the event:

   Backup <UNKNOWN> for VM backup-raw finalized (User: admin@internal-authz).

So the engine lied about this backup; it was not finalized.

$ ./backup_vm.py -c engine-dev stop 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 5c4407d2-f0ba-4381-968f-6077c2787d0b
[   0.0 ] Finalizing backup 5c4407d2-f0ba-4381-968f-6077c2787d0b
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting

Unlike other calls, the finalize call is handled synchronously.
It returns only when the operation is completed on the vdsm side
(timed out in this case), but it succeeded even though the actual
operation failed.

Since the VM is stuck, I tried to terminate it, but trying to
shut down or power off the VM from the UI fails with:

   Error while executing action:

   backup-raw:
   Cannot shutdown VM. The VM is during a backup operation.

So we cannot shut down the VM since a backup is in progress, and we
cannot stop the backup since libvirt times out when acquiring the state
change lock.

Why is qemu stuck? It looks like a deadlock in qemu_mutex_lock_iothread_impl:

Thread 7 (Thread 0x7efc52bfe700 (LWP 10005)):
#0  0x00007efcaf04c2fc in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
#1  0x0000563a370c21fd in qemu_cond_wait_impl (cond=<optimized out>, mutex=0x563a38f7fce8, 
    file=0x563a37243868 "/builddir/build/BUILD/qemu-5.1.0/ui/vnc-jobs.c", line=214)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/qemu-thread-posix.c:174
#2  0x0000563a36fe92d1 in vnc_worker_thread_loop (queue=queue@entry=0x563a38f7fcb0)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/ui/vnc-jobs.c:214
#3  0x0000563a36fe9c00 in vnc_worker_thread (arg=0x563a38f7fcb0)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/ui/vnc-jobs.c:324
#4  0x0000563a370c1e44 in qemu_thread_start (args=0x563a38903660)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/qemu-thread-posix.c:521
#5  0x00007efcaf04614a in start_thread () from target:/lib64/libpthread.so.0
#6  0x00007efcaed77f23 in clone () from target:/lib64/libc.so.6

Thread 6 (Thread 0x7efc533ff700 (LWP 10004)):
#0  0x00007efcaed6cca1 in poll () from target:/lib64/libc.so.6
#1  0x00007efcb336d9b6 in g_main_context_iterate.isra () from target:/lib64/libglib-2.0.so.0
#2  0x00007efcb336dd72 in g_main_loop_run () from target:/lib64/libglib-2.0.so.0
#3  0x00007efcb0f528db in red_worker_main () from target:/lib64/libspice-server.so.1
#4  0x00007efcaf04614a in start_thread () from target:/lib64/libpthread.so.0
#5  0x00007efcaed77f23 in clone () from target:/lib64/libc.so.6

Thread 5 (Thread 0x7efca2419700 (LWP 10001)):
#0  0x00007efcaf04c2fc in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
#1  0x0000563a370c21fd in qemu_cond_wait_impl (cond=<optimized out>, mutex=0x563a378f91c0 <qemu_global_mutex>, 
    file=0x563a371875d8 "/builddir/build/BUILD/qemu-5.1.0/softmmu/cpus.c", line=1145)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/qemu-thread-posix.c:174
#2  0x0000563a36e1e277 in qemu_wait_io_event (cpu=cpu@entry=0x563a3865f440)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/softmmu/cpus.c:1145
#3  0x0000563a36e1faa8 in qemu_kvm_cpu_thread_fn (arg=0x563a3865f440)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/softmmu/cpus.c:1193
#4  qemu_kvm_cpu_thread_fn (arg=0x563a3865f440)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/softmmu/cpus.c:1160
#5  0x0000563a370c1e44 in qemu_thread_start (args=0x563a38685c60)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/qemu-thread-posix.c:521
#6  0x00007efcaf04614a in start_thread () from target:/lib64/libpthread.so.0
#7  0x00007efcaed77f23 in clone () from target:/lib64/libc.so.6

Thread 4 (Thread 0x7efca2c1a700 (LWP 10000)):
#0  0x00007efcaed6cca1 in poll () from target:/lib64/libc.so.6
#1  0x00007efcb336d9b6 in g_main_context_iterate.isra () from target:/lib64/libglib-2.0.so.0
#2  0x00007efcb336dd72 in g_main_loop_run () from target:/lib64/libglib-2.0.so.0
#3  0x0000563a36e9b931 in iothread_run (opaque=0x563a384f2e60)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/iothread.c:82
#4  0x0000563a370c1e44 in qemu_thread_start (args=0x563a3865ccc0)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/qemu-thread-posix.c:521
#5  0x00007efcaf04614a in start_thread () from target:/lib64/libpthread.so.0
#6  0x00007efcaed77f23 in clone () from target:/lib64/libc.so.6

Thread 3 (Thread 0x7efca3fff700 (LWP 9995)):
#0  0x00007efcaed6cd96 in ppoll () from target:/lib64/libc.so.6
#1  0x0000563a370d8bf9 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>)
    at /usr/include/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/qemu-timer.c:335
#3  0x0000563a370bfe21 in fdmon_poll_wait (ctx=0x563a385e3300, ready_list=0x7efca3ffe6b0, timeout=-1)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/fdmon-poll.c:79
#4  0x0000563a370bf62e in aio_poll (ctx=0x563a385e3300, blocking=blocking@entry=true)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/aio-posix.c:600
#5  0x0000563a36e9b904 in iothread_run (opaque=0x563a38500800)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/iothread.c:75
#6  0x0000563a370c1e44 in qemu_thread_start (args=0x563a385e3810)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/qemu-thread-posix.c:521
#7  0x00007efcaf04614a in start_thread () from target:/lib64/libpthread.so.0
#8  0x00007efcaed77f23 in clone () from target:/lib64/libc.so.6

Thread 2 (Thread 0x7efca8e03700 (LWP 9994)):
#0  0x00007efcaf04f65d in __lll_lock_wait () from target:/lib64/libpthread.so.0
#1  0x00007efcaf048979 in pthread_mutex_lock () from target:/lib64/libpthread.so.0
#2  0x0000563a370c1f2d in qemu_mutex_lock_impl (mutex=0x563a378f91c0 <qemu_global_mutex>, 
    file=0x563a37272ea8 "/builddir/build/BUILD/qemu-5.1.0/util/rcu.c", line=266)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/qemu-thread-posix.c:79
#3  0x0000563a36e1f99e in qemu_mutex_lock_iothread_impl (
    file=file@entry=0x563a37272ea8 "/builddir/build/BUILD/qemu-5.1.0/util/rcu.c", line=line@entry=266)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/softmmu/cpus.c:1782
#4  0x0000563a370ca916 in call_rcu_thread (opaque=<optimized out>)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/rcu.c:266
#5  0x0000563a370c1e44 in qemu_thread_start (args=0x563a383f7760)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/qemu-thread-posix.c:521
#6  0x00007efcaf04614a in start_thread () from target:/lib64/libpthread.so.0
#7  0x00007efcaed77f23 in clone () from target:/lib64/libc.so.6

Thread 1 (Thread 0x7efcb407a680 (LWP 9990)):
#0  0x00007efcaed6cd96 in ppoll () from target:/lib64/libc.so.6
#1  0x0000563a370d8bf9 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>)
    at /usr/include/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/qemu-timer.c:335
#3  0x0000563a370bfe21 in fdmon_poll_wait (ctx=0x563a385dea20, ready_list=0x7ffc7f1c9cf0, timeout=-1)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/fdmon-poll.c:79
#4  0x0000563a370bf62e in aio_poll (ctx=0x563a385dea20, blocking=blocking@entry=true)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/aio-posix.c:600
#5  0x0000563a37057db2 in nbd_export_close_all ()
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/nbd/server.c:1751
#6  0x0000563a36ff58a9 in bdrv_close_all ()
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/block.c:4427
#7  0x0000563a36e332bb in qemu_cleanup ()
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/softmmu/vl.c:4531
#8  0x0000563a36d4bbd7 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
    at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/softmmu/main.c:50


Why is libvirtd stuck? It seems to hang in qemuMonitorSend waiting for a reply
from qemu that never arrives, since qemu is hung on the iothread lock.

Thread 19 (Thread 0x7fc4fcbf6700 (LWP 10062)):
#0  0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0
#2  0x00007fc50286a924 in qemuMonitorSend () from /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so
#3  0x00007fc50287a605 in qemuMonitorJSONCommandWithFd () from /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so
#4  0x00007fc50288aeac in qemuMonitorJSONGetJobInfo () from /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so
#5  0x00007fc5027ecfb9 in qemuBlockJobUpdate () from /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so
#6  0x00007fc5028c0968 in qemuProcessEventHandler () from /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so
#7  0x00007fc54c1cd46f in virThreadPoolWorker () from /lib64/libvirt.so.0
#8  0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0
#9  0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0
#10 0x00007fc548287f23 in clone () from /lib64/libc.so.6

Thread 18 (Thread 0x7fc4fef88700 (LWP 9993)):
#0  0x00007fc54827cca1 in poll () from /lib64/libc.so.6
#1  0x00007fc548ddb9b6 in g_main_context_iterate.isra () from /lib64/libglib-2.0.so.0
#2  0x00007fc548ddbd72 in g_main_loop_run () from /lib64/libglib-2.0.so.0
#3  0x00007fc54c1722be in virEventThreadWorker () from /lib64/libvirt.so.0
#4  0x00007fc548e03d4a in g_thread_proxy () from /lib64/libglib-2.0.so.0
#5  0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0
#6  0x00007fc548287f23 in clone () from /lib64/libc.so.6

Thread 17 (Thread 0x7fc4fd3f7700 (LWP 1602)):
#0  0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0
#2  0x00007fc5031c73fc in udevEventHandleThread () from /usr/lib64/libvirt/connection-driver/libvirt_driver_nodedev.so
#3  0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0
#4  0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc548287f23 in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x7fc500779700 (LWP 1514)):
#0  0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0
#2  0x00007fc54c1cd53b in virThreadPoolWorker () from /lib64/libvirt.so.0
#3  0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0
#4  0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc548287f23 in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x7fc500f7a700 (LWP 1513)):
#0  0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0
#2  0x00007fc54c1cd53b in virThreadPoolWorker () from /lib64/libvirt.so.0
#3  0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0
#4  0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc548287f23 in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7fc50177b700 (LWP 1511)):
#0  0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0
#2  0x00007fc54c1cd53b in virThreadPoolWorker () from /lib64/libvirt.so.0
#3  0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0
#4  0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc548287f23 in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7fc501f7c700 (LWP 1510)):
#0  0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0
#2  0x00007fc54c1cd53b in virThreadPoolWorker () from /lib64/libvirt.so.0
#3  0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0
#4  0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc548287f23 in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7fc50277d700 (LWP 1509)):
#0  0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0
#2  0x00007fc54c1cd53b in virThreadPoolWorker () from /lib64/libvirt.so.0
#3  0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0
#4  0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc548287f23 in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7fc53d547700 (LWP 1393)):
#0  0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0
#2  0x00007fc54c1cd4f8 in virThreadPoolWorker () from /lib64/libvirt.so.0
#3  0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0
#4  0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc548287f23 in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7fc53dd48700 (LWP 1392)):
#0  0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0
#2  0x00007fc54c1cd4f8 in virThreadPoolWorker () from /lib64/libvirt.so.0
#3  0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0
#4  0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc548287f23 in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7fc53e549700 (LWP 1391)):
#0  0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0
#2  0x00007fc54c1cd4f8 in virThreadPoolWorker () from /lib64/libvirt.so.0
#3  0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0
#4  0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc548287f23 in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7fc53ed4a700 (LWP 1390)):
#0  0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0
#2  0x00007fc54c1cd4f8 in virThreadPoolWorker () from /lib64/libvirt.so.0
#3  0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0
#4  0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc548287f23 in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fc53f54b700 (LWP 1389)):
#0  0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0
#2  0x00007fc54c1cd4f8 in virThreadPoolWorker () from /lib64/libvirt.so.0
#3  0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0
#4  0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc548287f23 in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fc53fd4c700 (LWP 1388)):
#0  0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0
#2  0x00007fc54c1cd53b in virThreadPoolWorker () from /lib64/libvirt.so.0
#3  0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0
#4  0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc548287f23 in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fc54054d700 (LWP 1387)):
#0  0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0
#2  0x00007fc54c1cd53b in virThreadPoolWorker () from /lib64/libvirt.so.0
#3  0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0
#4  0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc548287f23 in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fc540d4e700 (LWP 1386)):
#0  0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0
#2  0x00007fc54c1cd53b in virThreadPoolWorker () from /lib64/libvirt.so.0
#3  0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0
#4  0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc548287f23 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fc54154f700 (LWP 1385)):
#0  0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0
#2  0x00007fc54c1cd53b in virThreadPoolWorker () from /lib64/libvirt.so.0
#3  0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0
#4  0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc548287f23 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fc541d50700 (LWP 1384)):
#0  0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0
#2  0x00007fc54c1cd53b in virThreadPoolWorker () from /lib64/libvirt.so.0
#3  0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0
#4  0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0
#5  0x00007fc548287f23 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fc54ccfdc00 (LWP 1355)):
#0  0x00007fc54827cca1 in poll () from /lib64/libc.so.6
#1  0x00007fc548ddb9b6 in g_main_context_iterate.isra () from /lib64/libglib-2.0.so.0
#2  0x00007fc548ddbae0 in g_main_context_iteration () from /lib64/libglib-2.0.so.0
#3  0x00007fc54c172084 in virEventGLibRunOnce () from /lib64/libvirt.so.0
#4  0x00007fc54c2bd6c5 in virNetDaemonRun () from /lib64/libvirt.so.0
#5  0x000055eb8223e4df in main ()


Trying to kill qemu:

# kill 9990
# ps -p 9990
    PID TTY          TIME CMD
   9990 ?        00:00:32 qemu-kvm

The process is stuck, so the only way is:

# kill -9 9990

The system now detects that the VM was powered off.

Trying to stop the backup again:

$ ./backup_vm.py -c engine-dev stop 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 5c4407d2-f0ba-4381-968f-6077c2787d0b
[   0.0 ] Finalizing backup 5c4407d2-f0ba-4381-968f-6077c2787d0b
Traceback (most recent call last):
  File "./backup_vm.py", line 430, in <module>
    main()
  File "./backup_vm.py", line 161, in main
    args.command(args)
  File "./backup_vm.py", line 245, in cmd_stop
    stop_backup(connection, args.backup_uuid, args)
  File "./backup_vm.py", line 327, in stop_backup
    backup_service.finalize()
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/services.py", line 33306, in finalize
    return self._internal_action(action, 'finalize', None, headers, query, wait)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 299, in _internal_action
    return future.wait() if wait else future
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 296, in callback
    self._check_fault(response)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Host not found.]". HTTP response code is 400.

So we have no backup, good.

Looking at the engine log:

1. The engine tries to end the backup

2020-11-22 14:55:15,153+02 INFO  [org.ovirt.engine.core.bll.StopVmBackupCommand] (default task-1) [79461714-65f1-4141-8aad-3d356
4345907] Stopping VmBackup '5c4407d2-f0ba-4381-968f-6077c2787d0b

2. The engine fails to find the host that runs the VM

564345907] Cannot get vdsManager for vdsid='00000000-0000-0000-0000-000000000000'.
2020-11-22 14:55:15,156+02 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (default task-1) [79461714-65f1-4141-8aad-3d3564345907] createCommand failed: EngineException: Vds with id: 00000000-0000-0000-0000-000000000000 was not found (Failed with e
rror RESOURCE_MANAGER_VDS_NOT_FOUND and code 5004)

Looks like the engine uses an empty Guid to locate the host, probably because it
did not save the host in the db, and it assumes that we always have a VM
object when a backup is running.

3. The engine logs a failure to finalize the backup

2020-11-22 14:55:15,175+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1) [79461714-65f1-4141-8aad-3d3564345907] EVENT_ID: VM_BACKUP_FAILED_TO_FINALIZE(10,795), Backup <UNKNOWN> for VM backup-raw failed to finalize (User: admin@internal-authz).


Looking at the host:

# tree /var/lib/vdsm/storage/transient_disks/4dc3bb16-f8d1-4f59-9388-a93f68da7cf0/
/var/lib/vdsm/storage/transient_disks/4dc3bb16-f8d1-4f59-9388-a93f68da7cf0/
└── 5c4407d2-f0ba-4381-968f-6077c2787d0b.sda

Scratch disk was not removed.


So we have more bugs:
- qemu deadlocks after poweroff during backup (BUG)
- libvirt hangs on the qemu monitor without a timeout (BUG)
- The engine logs that the backup was finalized while it was not finalized (BUG)
- The finalize action is synchronous instead of asynchronous (BUG?)
- Finalize succeeded even though the actual backup was not finalized (BUG)
- The backup is reported as ready after calling finalize. Is it really ready?
  Looks like we need to move to the FINALIZING state when calling finalize
  to make sure that no new transfer can be started once finalize was called.
- There is no way to terminate the VM because a backup is running. The user
  should be able to abort a backup (BUG)
- There is no way to shut down or power off the VM; the only way is to kill qemu
- After killing qemu, the engine fails to stop the backup since the host is not found (BUG)
- The scratch disk is left on the host side and never deleted

Comment 6 Nir Soffer 2020-11-22 14:27:28 UTC
Last issue: after shutting down the VM and failing to finalize
the backup, the VM remains locked and cannot be started again:

    Error while executing action:

    backup-raw:
    Cannot run VM: VM is locked. Please try again in a few minutes.

I think the issue is that the backup command did not clean up
the memory locks.

Comment 7 Nir Soffer 2020-11-22 14:32:20 UTC
A leftover backup command is in the db:

engine=# select * from vm_backups;
              backup_id               | from_checkpoint_id | to_checkpoint_id |                vm_id                 | phase |  
      _create_date        
--------------------------------------+--------------------+------------------+--------------------------------------+-------+--
--------------------------
 5c4407d2-f0ba-4381-968f-6077c2787d0b |                    |                  | 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 | Ready | 2
020-11-22 12:53:19.422+02

The backup cannot be finalized because it does not find the host the VM
is running on, since the VM is not running. The issue is that we assume
that the VM must be running during the entire backup. If the VM is powered
off from the guest, or qemu is killed, we don't have the host id in the
command so we cannot clean up the backup.

The host id must be kept in the backup job to avoid this issue.

Comment 8 Nir Soffer 2020-11-22 15:28:39 UTC
Second attempt to reproduce, qemu did not get stuck, and the backup
did not hang:

$ ./backup_vm.py -c engine-dev full 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0
[   0.0 ] Starting full backup for VM 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0
[   0.3 ] Waiting until backup 4ab7aec5-cc29-4a15-921b-d78180473b14 is ready
[   1.3 ] Creating image transfer for disk 126eea31-c5a2-4c01-a18d-9822b0c05c2a
[   2.6 ] Image transfer 1b113c65-3638-4ecd-b9c5-1cf9348c00ce is ready
[  63.68% ] 3.82 GiB, 3.46 seconds, 1.10 GiB/s                                 
[   6.0 ] Finalizing image transfer
[   9.1 ] Finalizing backup
Traceback (most recent call last):
  File "./backup_vm.py", line 177, in cmd_full
    download_backup(connection, backup.id, args)
  File "./backup_vm.py", line 357, in download_backup
    download_disk(connection, backup_uuid, disk, disk_path, args, incremental=incremental)
  File "./backup_vm.py", line 392, in download_disk
    client.download(
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/client/_api.py", line 172, in download
    io.copy(
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 69, in copy
    log.debug("Executor failed")
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 189, in __exit__
    self.stop()
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 166, in stop
    raise self._errors[0]
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 238, in _run
    handler.copy(req)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 282, in copy
    self._src.write_to(self._dst, req.length, self._buf)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 204, in write_to
    res = self._get(length)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 431, in _get
    self._reraise(res.status, res.read())
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 585, in _reraise
    raise http.Error(status, msg)
ovirt_imageio._internal.http.Error: Server failed to perform the request, check logs

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./backup_vm.py", line 431, in <module>
    main()
  File "./backup_vm.py", line 161, in main
    args.command(args)
  File "./backup_vm.py", line 180, in cmd_full
    stop_backup(connection, backup.id, args)
  File "./backup_vm.py", line 329, in stop_backup
    backup_service.finalize()
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/services.py", line 33306, in finalize
    return self._internal_action(action, 'finalize', None, headers, query, wait)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 299, in _internal_action
    return future.wait() if wait else future
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 296, in callback
    self._check_fault(response)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Host not found.]". HTTP response code is 400.


In this case stopping the backup failed since the host was not found,
and the only way to clear this backup was to manually delete it
from the db:

$ sudo -u postgres psql -d engine
psql (12.1)
Type "help" for help.

engine=# select * from vm_backups;
              backup_id               | from_checkpoint_id | to_checkpoint_id |                vm_id                 | phase |  
      _create_date        
--------------------------------------+--------------------+------------------+--------------------------------------+-------+--
--------------------------
 4ab7aec5-cc29-4a15-921b-d78180473b14 |                    |                  | 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 | Ready | 2
020-11-22 17:20:06.204+02
(1 row)

engine=# delete from vm_backups where backup_id='4ab7aec5-cc29-4a15-921b-d78180473b14';
DELETE 1
engine=# \q

$ sudo systemctl restart ovirt-engine

Comment 9 Nir Soffer 2020-11-22 15:46:48 UTC
Third attempt, backup hangs:

$ ./backup_vm.py -c engine-dev full --backup-dir /var/tmp/backups/raw 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0
[   0.0 ] Starting full backup for VM 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0
[   0.8 ] Waiting until backup f01630c5-943c-4845-b404-29e5ff0b509b is ready
[   1.8 ] Creating image transfer for disk 126eea31-c5a2-4c01-a18d-9822b0c05c2a
[   3.1 ] Image transfer 2c150870-915b-4cc2-a4a1-ca32287099de is ready
[  45.55% ] 2.73 GiB, 2.84 seconds, 983.96 MiB/s                               
[   5.9 ] Finalizing image transfer
[   9.0 ] Finalizing backup
[  39.1 ] Waiting until backup is finalized
^CTraceback (most recent call last):
  File "./backup_vm.py", line 177, in cmd_full
    download_backup(connection, backup.id, args)
  File "./backup_vm.py", line 357, in download_backup
    download_disk(connection, backup_uuid, disk, disk_path, args, incremental=incremental)
  File "./backup_vm.py", line 392, in download_disk
    client.download(
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/client/_api.py", line 172, in download
    io.copy(
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 69, in copy
    log.debug("Executor failed")
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 189, in __exit__
    self.stop()
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 166, in stop
    raise self._errors[0]
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 238, in _run
    handler.copy(req)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 282, in copy
    self._src.write_to(self._dst, req.length, self._buf)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 204, in write_to
    res = self._get(length)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 431, in _get
    self._reraise(res.status, res.read())
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 585, in _reraise
    raise http.Error(status, msg)
ovirt_imageio._internal.http.Error: Server failed to perform the request, check logs

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./backup_vm.py", line 431, in <module>
    main()
  File "./backup_vm.py", line 161, in main
    args.command(args)
  File "./backup_vm.py", line 180, in cmd_full
    stop_backup(connection, backup.id, args)
  File "./backup_vm.py", line 335, in stop_backup
    time.sleep(1)
KeyboardInterrupt


Qemu is deadlocked. After killing qemu I could start the VM, but the
backup was still in the db. Looks like we have a locking bug: the backup
should prevent starting the VM. Once the VM was started, it could not
be stopped because of the backup.

Comment 10 Nir Soffer 2020-11-22 16:42:51 UTC
I tried a few more times to collect logs and check how reproducible
this issue is:

Attempt 4 - qemu did not hang
Attempt 5 - qemu did not hang but the backup hung; the engine failed to
            stop the backup since the VM was destroyed.

So this failed in 3 of 5 attempts with various failures. I'll file
bugs for qemu, libvirt, ovirt-engine and ovirt-engine-sdk.

Comment 11 Nir Soffer 2020-11-22 16:51:43 UTC
In the last attempt, vdsm failed stop_backup with:

2020-11-22 18:37:25,071+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.StopVmBackupVDSCommand] (default task-2) [72d778a8
-fe49-441e-8b5d-167606b85e75] Command 'StopVmBackupVDSCommand(HostName = host3, VmBackupVDSParameters:{hostId='f1593c12-5d93-4
44b-99f4-cd31ffd45506', backupId='76073f2e-4709-404c-bd81-82ee89f85e24'})' execution failed: VDSGenericException: VDSErrorExce
ption: Failed to StopVmBackupVDS, error = General Exception: ("VM '4dc3bb16-f8d1-4f59-9388-a93f68da7cf0' was not defined yet o
r was undefined",), code = 100

This error is not useful for the engine. If stopping the backup fails because
the VM is undefined, vdsm should treat this as success. An undefined VM cannot
have a running backup.

In this case vdsm should have removed the transient disk and returned
success; see the sketch below.

I'll also file a vdsm bug for this.
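
A rough sketch of that idea (illustrative names only, not vdsm's actual code; the transient-disk path is the one seen earlier in this bug):

import logging
import os
import shutil

log = logging.getLogger("backup-sketch")

# Transient (scratch) disk location seen earlier in this bug.
TRANSIENT_DIR = "/var/lib/vdsm/storage/transient_disks"


class VMNotDefinedError(Exception):
    """Illustrative stand-in for the "VM was not defined yet or was
    undefined" error reported by vdsm."""


def stop_backup(vm_id, abort_job):
    """abort_job is an illustrative callable that aborts the libvirt
    backup job for this VM."""
    try:
        abort_job()
    except VMNotDefinedError:
        # An undefined VM cannot have a running backup, so treat this
        # as success instead of failing the engine flow.
        log.info("VM %s already undefined, stop_backup is a no-op", vm_id)
    # In both cases, remove the leftover scratch disks for this VM.
    scratch_dir = os.path.join(TRANSIENT_DIR, vm_id)
    if os.path.isdir(scratch_dir):
        shutil.rmtree(scratch_dir)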

Comment 12 Nir Soffer 2020-11-22 17:04:32 UTC
Attempt 6 was successful: qemu exited cleanly and the engine stopped the
backup successfully.

So this is 50% reproducible. I think it is more likely when powering
off right after the download starts.

$ ./backup_vm.py -c engine-dev full --backup-dir /var/tmp/backups/raw 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0
[   0.0 ] Starting full backup for VM 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0
[   0.7 ] Waiting until backup 370a3445-62e0-4b8a-b2b6-6a3662a52413 is ready
[   1.8 ] Creating image transfer for disk 126eea31-c5a2-4c01-a18d-9822b0c05c2a
[   3.1 ] Image transfer f4fc43f9-dd7d-47d9-9fca-6697b4cf7a23 is ready
[  78.12% ] 4.69 GiB, 4.89 seconds, 981.80 MiB/s                               
[   8.0 ] Finalizing image transfer
[   9.0 ] Finalizing backup
[   9.1 ] Waiting until backup is finalized
Traceback (most recent call last):
  File "./backup_vm.py", line 431, in <module>
    main()
  File "./backup_vm.py", line 161, in main
    args.command(args)
  File "./backup_vm.py", line 177, in cmd_full
    download_backup(connection, backup.id, args)
  File "./backup_vm.py", line 357, in download_backup
    download_disk(connection, backup_uuid, disk, disk_path, args, incremental=incremental)
  File "./backup_vm.py", line 392, in download_disk
    client.download(
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/client/_api.py", line 172, in download
    io.copy(
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 69, in copy
    log.debug("Executor failed")
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 189, in __exit__
    self.stop()
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 166, in stop
    raise self._errors[0]
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 238, in _run
    handler.copy(req)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 282, in copy
    self._src.write_to(self._dst, req.length, self._buf)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 204, in write_to
    res = self._get(length)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 431, in _get
    self._reraise(res.status, res.read())
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 585, in _reraise
    raise http.Error(status, msg)
ovirt_imageio._internal.http.Error: Server failed to perform the request, check logs

Comment 13 Nir Soffer 2020-11-23 08:53:11 UTC
I created bug 1900505 for the qemu deadlock. Until this is fixed, we cannot
fix the hang during backup.

Comment 14 Nir Soffer 2020-11-23 09:24:55 UTC
I filed bug 1900518 for the vdsm issue: VM.stop_backup failing when the VM was
destroyed before VM.stop_backup was called.

Comment 15 Eyal Shenitzky 2021-04-19 06:41:21 UTC
The qemu version with the fix is now required by VDSM, after patch https://gerrit.ovirt.org/#/c/vdsm/+/114307/ was merged.
Moving to QA for testing.

Comment 16 Amit Sharir 2021-04-29 15:12:38 UTC
Version used for validation:

# rpm -q ovirt-engine
ovirt-engine-4.4.6.5-0.17.el8ev.noarch

# rpm -qa | grep release
rhv-release-4.4.6-6-001.noarch
redhat-release-8.4-0.6.el8.x86_64
redhat-release-eula-8.4-0.6.el8.x86_64

# rpm -q vdsm
vdsm-4.40.60.5-1.el8ev.x86_64



The script invocation mentioned in comment #4 didn't work for me (./backup_vm.py -c engine-dev full --backup-dir /var/tmp/backups/qcow2 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0); I received many errors when I tried to run it in this manner.


For now, I validated this bug using the following steps:

1. python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c engine start [vmuuid]
2. python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c engine download [vmuuid] --backup-dir . --backup-uuid [backup]
3. I connected directly to the relevant VM via SSH, and before the download completed I ran the "poweroff" command in the VM terminal.


Verification summary and conclusions:

This flow resulted in a timeout error (RuntimeError: Timed out waiting for transfer 89ba7382-be4e-4102-9f71-4d9003cb6afd to finalize; I also got additional errors in my terminal), although the expected result was that the backup should succeed and the image transfer should send data normally.

Comment 17 Eyal Shenitzky 2021-05-02 11:02:52 UTC
(In reply to Amit Sharir from comment #16)
> version for validation:
> 
> # rpm -q ovirt-engine
> ovirt-engine-4.4.6.5-0.17.el8ev.noarch
> 
> # rpm -qa | grep release
> rhv-release-4.4.6-6-001.noarch
> redhat-release-8.4-0.6.el8.x86_64
> redhat-release-eula-8.4-0.6.el8.x86_64
> 
> # rpm -q vdsm
> vdsm-4.40.60.5-1.el8ev.x86_64
> 
> 
> 
> The script mentioned in comment #4 didn't work for me. (./backup_vm.py -c
> engine-dev full --backup-dir /var/tmp/backups/qcow2
> 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 - I received many errors when I tried
> to run it in this manner.

I don't have the errors so I can't tell why it failed, but I assume that you don't have the config file (engine-dev) and that you tried to use the same VM ID that Nir mentioned, which doesn't exist in your environment.

> 
> 
> For now, I validated this bug using the following steps.
> 
> 1. python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c
> engine start [vmuuid]
> 2. python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c
> engine download [vmuuid] --backup-dir . --backup-uuid [backup]
> 3. I connected directly to the relevant VM via SSH connection and before the
> download was complete I did the "poweroff" command via the VM terminal.

Looks OK.

> 
> 
> Verification Summary and conclusions -
> 
> The following flow resulted in a timeout error : (RuntimeError: Timed out
> waiting for transfer 89ba7382-be4e-4102-9f71-4d9003cb6afd to finalize, I
> also got additional errors in my terminal), although the expected result was
> that the backup should be successful and Image transfer sends data normally.

Makes sense; QEMU terminates due to the power-off, so the transfer will eventually fail with a timeout.
Please see comment #3.

The expected result is that if the user retries the backup download (cold backup download) it will succeed, and the user should be able to finalize the backup.
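
In terms of the example script used in comment #16, the retry flow amounts to re-running the download step for the same backup and then finalizing it (command forms taken from comment #16 and comment #5; [vmuuid] and [backup] are placeholders):

# Retry the disk download after the VM powered off (cold backup download):
python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c engine download [vmuuid] --backup-dir . --backup-uuid [backup]

# Then finalize the backup:
python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c engine stop [vmuuid] [backup]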

Comment 18 Amit Sharir 2021-05-04 08:34:56 UTC
Eyal thanks for the clarifications.

Regarding the error I received with ./backup_vm.py -c engine-dev full --backup-dir /var/tmp/backups/qcow2 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0: the VM ID and config file that I used were correct. We found an issue with the SDK script when using the "full" parameter, and we opened bug https://bugzilla.redhat.com/show_bug.cgi?id=1956167 for this specific issue.



I ran the verification flow mentioned in comment #16 again and got the expected results mentioned in comment #4 and comment #17.

Version used for validation:

# rpm -q vdsm
vdsm-4.40.60.6-1.el8ev.x86_64

# rpm -q ovirt-engine
ovirt-engine-4.4.6.6-0.10.el8ev.noarch

# rpm -qa | grep release
rhv-release-4.4.6-7-001.noarch
redhat-release-8.4-0.6.el8.x86_64
redhat-release-eula-8.4-0.6.el8.x86_64

Verification summary and conclusions:

The actual and expected results were the same.
I was able to retry the backup download (cold backup download) successfully and to finalize the backup.

Bug verified.

Comment 19 Sandro Bonazzola 2021-05-05 05:36:36 UTC
This bug is included in the oVirt 4.4.6 release, published on May 4th 2021.

Since the problem described in this bug report should be resolved in the oVirt 4.4.6 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

