Description of problem:

Backup hangs after the user powers off the VM from inside the guest OS.

Version-Release number of selected component (if applicable):

Two RHEV 4.4 nodes with layer rhvh-4.4.1.1-0.20200722.0+1
Ovirt-manager 4.4.1
NFS 4.1 storage domain

ImageIO version:
# rpm -qa | grep image
ovirt-imageio-client-2.0.9-1.el8ev.x86_64
genisoimage-1.1.11-39.el8.x86_64
ovirt-imageio-common-2.0.9-1.el8ev.x86_64
ovirt-imageio-daemon-2.0.9-1.el8ev.x86_64
redhat-virtualization-host-image-update-placeholder-4.4.1-1.el8ev.noarch

How reproducible:

Steps to Reproduce:
1. Run a full backup and wait until the backup starts processing the disk.
2. Run the poweroff command inside the guest.

Actual results:
The backup hangs. The image transfer stops sending data.

Expected results:
The backup succeeds. The image transfer sends data normally.

Additional info:
We will provide logs soon.
Nir, should we add a mechanism that detects whether the state of the VM changed during an image transfer in the case of disk backups?
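To illustrate the kind of mechanism being asked about, here is a minimal sketch (not engine code, just an illustration using the public ovirtsdk4 API) that polls the VM status during a transfer and flags a state change. The transfer_done callable is an assumption of this sketch:

import time

from ovirtsdk4 import types


def vm_is_up(connection, vm_id):
    # Query the current VM status via the SDK.
    vm_service = connection.system_service().vms_service().vm_service(vm_id)
    return vm_service.get().status == types.VmStatus.UP


def watch_vm_during_transfer(connection, vm_id, transfer_done, poll_interval=5):
    # Poll until transfer_done() reports completion; raise if the VM left UP.
    while not transfer_done():
        if not vm_is_up(connection, vm_id):
            raise RuntimeError("VM state changed during image transfer")
        time.sleep(poll_interval)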
(In reply to Eyal Shenitzky from comment #2)

qemu terminates when the guest is powered off from inside the guest. This is something we do not control. We need to file an RFE for libvirt to support keeping qemu running in this case, which would allow the backup to complete.

For now, if qemu terminates during a backup, the image transfer will fail, and the user can retry the transfer. If retrying fails, the user must finalize the backup and try again later. We can document this limitation.
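As a minimal sketch of this retry-then-finalize flow (download_disk and finalize_backup are hypothetical stand-ins for the SDK/imageio calls used in backup_vm.py, not real library functions):

import time


def download_with_retry(download_disk, finalize_backup, retries=3, delay=10):
    # Try the disk download a few times; qemu termination surfaces as a
    # transfer error on the client side.
    for attempt in range(1, retries + 1):
        try:
            download_disk()
            return True
        except Exception as e:
            print("download attempt %d failed: %s" % (attempt, e))
            time.sleep(delay)
    # Retrying did not help: finalize the backup so it can be retried later.
    finalize_backup()
    return False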
(In reply to Yuriy Khokhlov (Veeam) from comment #0)

I reproduced this as follows:
1. Start a VM with a raw disk
2. Start a backup
3. Wait until the download starts
4. In the guest, run "poweroff"

Here is the client log:

$ ./backup_vm.py -c engine-dev full --backup-dir /var/tmp/backups/qcow2 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0
[ 0.0 ] Starting full backup for VM 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0
[ 0.7 ] Waiting until backup 5c4407d2-f0ba-4381-968f-6077c2787d0b is ready
[ 1.8 ] Creating image transfer for disk 126eea31-c5a2-4c01-a18d-9822b0c05c2a
[ 3.1 ] Image transfer efa8a20d-d766-486b-b29d-23cb2f82eb5d is ready
[ 79.57% ] 4.77 GiB, 4.98 seconds, 982.38 MiB/s
[ 8.1 ] Finalizing image transfer

(This hung for several minutes, until I interrupted the script)

^CTraceback (most recent call last):
  File "./backup_vm.py", line 177, in cmd_full
    download_backup(connection, backup.id, args)
  File "./backup_vm.py", line 355, in download_backup
    download_disk(connection, backup_uuid, disk, disk_path, args, incremental=incremental)
  File "./backup_vm.py", line 390, in download_disk
    client.download(
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/client/_api.py", line 172, in download
    io.copy(
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 69, in copy
    log.debug("Executor failed")
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 189, in __exit__
    self.stop()
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 166, in stop
    raise self._errors[0]
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 238, in _run
    handler.copy(req)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 282, in copy
    self._src.write_to(self._dst, req.length, self._buf)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 213, in write_to
    raise RuntimeError(
RuntimeError: Expected 134217728 bytes, got 33554432 bytes

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./backup_vm.py", line 429, in <module>
    main()
  File "./backup_vm.py", line 161, in main
    args.command(args)
  File "./backup_vm.py", line 179, in cmd_full
    stop_backup(connection, backup.id, args)
  File "./backup_vm.py", line 332, in stop_backup
    time.sleep(1)
KeyboardInterrupt

So:
- the download failed with the expected error when qemu terminated (OK)
- the client finalized the transfer (OK)
- the client hung trying to stop the backup

Looking in the engine log we see:

1. User finalizes the transfer

2020-11-22 12:53:26,915+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (default task-1) [f0103c06-7c1a-480b-9db7-669db1ddf9b7] Updating image transfer efa8a20d-d766-486b-b29d-23cb2f82eb5d (image 126eea31-c5a2-4c01-a18d-9822b0c05c2a) phase to Finalizing Success

2. Image transfer completes successfully

2020-11-22 12:53:36,352+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-44) [733680b9-4c72-452f-8732-f0a697bc6cde] EVENT_ID: TRANSFER_IMAGE_SUCCEEDED(1,032), Image Download with disk fedora-32.raw succeeded.

3. Engine stops the NBD server in vdsm

2020-11-22 12:53:27,175+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.StopNbdServerVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-24) [733680b9-4c72-452f-8732-f0a697bc6cde] START, StopNbdServerVDSCommand(HostName = host4, NbdServerVDSParameters:{hostId='9de3098d-89e8-444c-95ed-0fc2ea74e170', serverId='b04bf14e-a0d1-4ed6-9f54-a50c391a82ae', storageDomainId='null', imageId='null', volumeId='null', readonly='false', discard='false', backingChain='true'}), log id: 38aa9060
2020-11-22 12:53:27,192+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.StopNbdServerVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-24) [733680b9-4c72-452f-8732-f0a697bc6cde] FINISH, StopNbdServerVDSCommand, return: , log id: 38aa9060

(This is a bug, the NBD server is not used in online backup)

4. Engine fails to terminate the backup

2020-11-22 12:53:58,015+02 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.StopVmBackupVDSCommand] (default task-1) [cf8e1bb2-e32b-44df-ba0b-aa349fc8851f] Unexpected return value: Status [code=1600, message=Backup Error: {'vm_id': '4dc3bb16-f8d1-4f59-9388-a93f68da7cf0', 'backup_id': '5c4407d2-f0ba-4381-968f-6077c2787d0b', 'reason': 'Failed to end VM backup: Timed out during operation: cannot acquire state change lock (held by monitor=qemu-event)'}]

5. Engine reports that the backup is finalized

2020-11-22 12:53:58,032+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1) [cf8e1bb2-e32b-44df-ba0b-aa349fc8851f] EVENT_ID: VM_BACKUP_FINALIZED(10,794), Backup <UNKNOWN> for VM backup-raw finalized (User: admin@internal-authz).

In the engine event log we see:

VDSM host4 command StopVmBackupVDS failed: Backup Error: {'vm_id': '4dc3bb16-f8d1-4f59-9388-a93f68da7cf0', 'backup_id': '5c4407d2-f0ba-4381-968f-6077c2787d0b', 'reason': 'Failed to end VM backup: Timed out during operation: cannot acquire state change lock (held by monitor=qemu-event)'}

Backup <UNKNOWN> for VM backup-raw finalized (User: admin@internal-authz).
In the vdsm log we see:

2020-11-22 12:53:57,995+0200 ERROR (jsonrpc/7) [api] FINISH stop_backup error=Backup Error: {'vm_id': '4dc3bb16-f8d1-4f59-9388-a93f68da7cf0', 'backup_id': '5c4407d2-f0ba-4381-968f-6077c2787d0b', 'reason': 'Failed to end VM backup: Timed out during operation: cannot acquire state change lock (held by monitor=qemu-event)'} (api:131)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 553, in abortJob
    if ret == -1: raise libvirtError ('virDomainAbortJob() failed', dom=self)
libvirt.libvirtError: Timed out during operation: cannot acquire state change lock (held by monitor=qemu-event)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/backup.py", line 220, in stop_backup
    dom.abortJob()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 159, in call
    return getattr(self._vm._dom, name)(*a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 109, in f
    raise toe
vdsm.virt.virdomain.TimeoutError: Timed out during operation: cannot acquire state change lock (held by monitor=qemu-event)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 124, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/API.py", line 671, in stop_backup
    return self.vm.stop_backup(backup_id)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/backup.py", line 70, in wrapper
    return f(*a, **kw)
  File "<decorator-gen-265>", line 2, in stop_backup
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 101, in method
    return func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 4006, in stop_backup
    return backup.stop_backup(self, dom, backup_id=backup_id)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/backup.py", line 226, in stop_backup
    backup_id=backup_id)
vdsm.common.exception.BackupError: Backup Error: {'vm_id': '4dc3bb16-f8d1-4f59-9388-a93f68da7cf0', 'backup_id': '5c4407d2-f0ba-4381-968f-6077c2787d0b', 'reason': 'Failed to end VM backup: Timed out during operation: cannot acquire state change lock (held by monitor=qemu-event)'}
2020-11-22 12:53:58,008+0200 INFO  (jsonrpc/7) [api.virt] FINISH stop_backup return={'status': {'code': 1600, 'message': "Backup Error: {'vm_id': '4dc3bb16-f8d1-4f59-9388-a93f68da7cf0', 'backup_id': '5c4407d2-f0ba-4381-968f-6077c2787d0b', 'reason': 'Failed to end VM backup: Timed out during operation: cannot acquire state change lock (held by monitor=qemu-event)'}"}} from=::ffff:192.168.122.11,57176, flow_id=cf8e1bb2-e32b-44df-ba0b-aa349fc8851f, vmId=4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 (api:54)

We don't see any other attempt to stop this backup, and the scratch disk was not removed.

# tree /var/lib/vdsm/storage/transient_disks/4dc3bb16-f8d1-4f59-9388-a93f68da7cf0/
/var/lib/vdsm/storage/transient_disks/4dc3bb16-f8d1-4f59-9388-a93f68da7cf0/
└── 5c4407d2-f0ba-4381-968f-6077c2787d0b.sda

The user code is stuck in this loop:

329     # TODO: Wait until backup is finalized.
330     backup = backup_service.get()
331     while backup.phase != types.BackupPhase.FINALIZING:
332         time.sleep(1)
333         backup = backup_service.get()

So we had:
- The guest powered off; we cannot prevent that and we cannot keep qemu running in this case, so the image transfer fails (Doc BUG)
- The image transfer fails with the expected error (OK)
- Engine stops the NBD server during an online backup (BUG)
- vdsm fails to stop the backup because of a timeout in libvirt (OK)
- Engine fails to stop the backup because vdsm failed (OK)
- Engine reports that the backup was finalized without actually stopping it, leaving a leftover scratch disk (BUG)
- backup_vm.py (our example code) waits incorrectly for completion of the backup (BUG); see the sketch below
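Regarding the last point, a minimal sketch of a safer wait loop follows. It assumes the ovirtsdk4 backup_service used in backup_vm.py, and that the SDK/engine either exposes a terminal phase (SUCCEEDED/FAILED, if defined in the installed SDK) or removes the backup entity, raising ovirtsdk4.NotFoundError. The timeout value is an illustrative assumption:

import time

import ovirtsdk4
from ovirtsdk4 import types

# Terminal phases, if this SDK version defines them.
TERMINAL_PHASES = {
    getattr(types.BackupPhase, name)
    for name in ("SUCCEEDED", "FAILED")
    if hasattr(types.BackupPhase, name)
}


def wait_for_backup_finalized(backup_service, timeout=300, poll_interval=1):
    deadline = time.monotonic() + timeout
    while True:
        try:
            backup = backup_service.get()
        except ovirtsdk4.NotFoundError:
            # The backup entity was removed, so finalization is done.
            return
        if backup.phase in TERMINAL_PHASES:
            return
        if time.monotonic() > deadline:
            raise RuntimeError(
                "Timed out waiting for backup to finalize (phase=%s)" % backup.phase)
        time.sleep(poll_interval)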
It is more interesting - qemu closed the nbd server and failed the backup, but it did not terminate, but hanged in shutdown state: # virsh -r list Id Name State -------------------------------- 1 backup-raw in shutdown # ps auxf | grep qemu-kvm | grep -v grep qemu 9990 0.7 14.9 2776992 573688 ? Sl 12:46 0:32 /usr/libexec/qemu-kvm -name guest=backup-raw,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-backup-raw/master-key.aes -machine pc-q35-rhel8.1.0,accel=kvm,usb=off,dump-guest-core=off -cpu Skylake-Client,hle=off,rtm=off,mpx=off -m size=1048576k,slots=16,maxmem=4194304k -overcommit mem-lock=off -smp 1,maxcpus=16,sockets=16,dies=1,cores=1,threads=1 -object iothread,id=iothread1 -numa node,nodeid=0,cpus=0-15,mem=1024 -uuid 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 -smbios type=1,manufacturer=oVirt,product=RHEL,version=8.3-1.0.el8,serial=d8635601-ea8b-4c5c-a624-41bd72b862d6,uuid=4dc3bb16-f8d1-4f59-9388-a93f68da7cf0,family=oVirt -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=37,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2020-11-22T10:46:53,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global ICH9-LPC.disable_s3=1 -global ICH9-LPC.disable_s4=1 -boot strict=on -device pcie-root-port,port=0x10,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x2 -device pcie-root-port,port=0x11,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1 -device pcie-root-port,port=0x12,chassis=3,id=pci.3,bus=pcie.0,addr=0x2.0x2 -device pcie-root-port,port=0x13,chassis=4,id=pci.4,bus=pcie.0,addr=0x2.0x3 -device pcie-root-port,port=0x14,chassis=5,id=pci.5,bus=pcie.0,addr=0x2.0x4 -device pcie-root-port,port=0x15,chassis=6,id=pci.6,bus=pcie.0,addr=0x2.0x5 -device pcie-root-port,port=0x16,chassis=7,id=pci.7,bus=pcie.0,addr=0x2.0x6 -device pcie-root-port,port=0x17,chassis=8,id=pci.8,bus=pcie.0,addr=0x2.0x7 -device pcie-root-port,port=0x18,chassis=9,id=pci.9,bus=pcie.0,multifunction=on,addr=0x3 -device pcie-root-port,port=0x19,chassis=10,id=pci.10,bus=pcie.0,addr=0x3.0x1 -device pcie-root-port,port=0x1a,chassis=11,id=pci.11,bus=pcie.0,addr=0x3.0x2 -device pcie-root-port,port=0x1b,chassis=12,id=pci.12,bus=pcie.0,addr=0x3.0x3 -device pcie-root-port,port=0x1c,chassis=13,id=pci.13,bus=pcie.0,addr=0x3.0x4 -device pcie-root-port,port=0x1d,chassis=14,id=pci.14,bus=pcie.0,addr=0x3.0x5 -device pcie-root-port,port=0x1e,chassis=15,id=pci.15,bus=pcie.0,addr=0x3.0x6 -device pcie-root-port,port=0x1f,chassis=16,id=pci.16,bus=pcie.0,addr=0x3.0x7 -device qemu-xhci,p2=8,p3=8,id=ua-eda09d92-c79c-4a6e-851c-70ec0cb902ce,bus=pci.4,addr=0x0 -device virtio-scsi-pci,iothread=iothread1,id=ua-6d18cf44-c8f5-4cd2-8de1-38b2b9fccc1c,bus=pci.2,addr=0x0 -device virtio-serial-pci,id=ua-73ba5fe8-c6ca-446e-8975-3af807045085,max_ports=16,bus=pci.3,addr=0x0 -device ide-cd,bus=ide.2,id=ua-0e36cc67-69e9-4e18-9ee4-ec78a93ceb79,werror=report,rerror=report -blockdev {"driver":"file","filename":"/rhev/data-center/mnt/storage:_export_1/839db77f-fde3-4e13-bfb6-56be604631ed/images/126eea31-c5a2-4c01-a18d-9822b0c05c2a/c7c8040e-597a-402f-b371-3f74510b6a27","aio":"threads","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-1-format","read-only":false,"discard":"unmap","cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"} -device 
scsi-hd,bus=ua-6d18cf44-c8f5-4cd2-8de1-38b2b9fccc1c.0,channel=0,scsi-id=0,lun=0,device_id=126eea31-c5a2-4c01-a18d-9822b0c05c2a,drive=libvirt-1-format,id=ua-126eea31-c5a2-4c01-a18d-9822b0c05c2a,bootindex=1,write-cache=on,serial=126eea31-c5a2-4c01-a18d-9822b0c05c2a,werror=stop,rerror=stop -netdev tap,fd=39,id=hostua-ae4ec1f0-b1d6-4b0c-ac9a-7e28c12a883b,vhost=on,vhostfd=40 -device virtio-net-pci,host_mtu=1500,netdev=hostua-ae4ec1f0-b1d6-4b0c-ac9a-7e28c12a883b,id=ua-ae4ec1f0-b1d6-4b0c-ac9a-7e28c12a883b,mac=56:6f:7a:5a:00:00,bus=pci.1,addr=0x0 -chardev socket,id=charchannel0,fd=41,server,nowait -device virtserialport,bus=ua-73ba5fe8-c6ca-446e-8975-3af807045085.0,nr=1,chardev=charchannel0,id=channel0,name=ovirt-guest-agent.0 -chardev socket,id=charchannel1,fd=42,server,nowait -device virtserialport,bus=ua-73ba5fe8-c6ca-446e-8975-3af807045085.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=ua-73ba5fe8-c6ca-446e-8975-3af807045085.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -device usb-tablet,id=input0,bus=ua-eda09d92-c79c-4a6e-851c-70ec0cb902ce.0,port=1 -spice port=5900,tls-port=5901,addr=192.168.122.23,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -object tls-creds-x509,id=vnc-tls-creds0,dir=/etc/pki/vdsm/libvirt-vnc,endpoint=server,verify-peer=no -vnc 192.168.122.23:2,password,tls-creds=vnc-tls-creds0 -k en-us -device qxl-vga,id=ua-7f3bd26a-a9d8-449e-8e32-70110d026f87,ram_size=67108864,vram_size=8388608,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pcie.0,addr=0x1 -device virtio-balloon-pci,id=ua-c98b71de-459a-428b-9e54-e53d8b6c0acd,bus=pci.5,addr=0x0 -object rng-random,id=objua-3b051a0c-bfa1-494f-9101-f9cac2cd54ed,filename=/dev/urandom -device virtio-rng-pci,rng=objua-3b051a0c-bfa1-494f-9101-f9cac2cd54ed,id=ua-3b051a0c-bfa1-494f-9101-f9cac2cd54ed,bus=pci.6,addr=0x0 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on Loggign more info in backup_vm.py: diff --git a/sdk/examples/backup_vm.py b/sdk/examples/backup_vm.py index fd40154..f871cb8 100755 --- a/sdk/examples/backup_vm.py +++ b/sdk/examples/backup_vm.py @@ -329,6 +329,7 @@ def stop_backup(connection, backup_uuid, args): # TODO: Wait until backup is finalized. backup = backup_service.get() while backup.phase != types.BackupPhase.FINALIZING: + print("backup.phase = %r, waiting" % backup.phase) time.sleep(1) backup = backup_service.get() Show that the backup still exists, and it did not change its state, even when engine logged the event: Backup <UNKNOWN> for VM backup-raw finalized (User: admin@internal-authz). So engine lied about this backup, it was not finalized. 
$ ./backup_vm.py -c engine-dev stop 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 5c4407d2-f0ba-4381-968f-6077c2787d0b [ 0.0 ] Finalizing backup 5c4407d2-f0ba-4381-968f-6077c2787d0b backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting Unlike other calls, the finalize call is handled synchronously. It returns only when the operation is completed on vdsm side (timed out in this case), but it succeeded even when the actual operation failed. Since the VM is stuck, I tried to terminate it, but trying to shutdown or poweroff the VM from the UI fails with: Error while executing action: backup-raw: Cannot shutdown VM. The VM is during a backup operation. So we cannot shutdown the VM since backup is in progress, and we cannot stop backup since libvirt times out when getting state chagne lock. Why qemu is stuck? looks a deadlock in qemu_mutex_lock_iothread_impl: Thread 7 (Thread 0x7efc52bfe700 (LWP 10005)): #0 0x00007efcaf04c2fc in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0 #1 0x0000563a370c21fd in qemu_cond_wait_impl (cond=<optimized out>, mutex=0x563a38f7fce8, file=0x563a37243868 "/builddir/build/BUILD/qemu-5.1.0/ui/vnc-jobs.c", line=214) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/qemu-thread-posix.c:174 #2 0x0000563a36fe92d1 in vnc_worker_thread_loop (queue=queue@entry=0x563a38f7fcb0) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/ui/vnc-jobs.c:214 #3 0x0000563a36fe9c00 in vnc_worker_thread (arg=0x563a38f7fcb0) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/ui/vnc-jobs.c:324 #4 0x0000563a370c1e44 in qemu_thread_start (args=0x563a38903660) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/qemu-thread-posix.c:521 #5 0x00007efcaf04614a in start_thread () from target:/lib64/libpthread.so.0 #6 0x00007efcaed77f23 in clone () from target:/lib64/libc.so.6 Thread 6 (Thread 0x7efc533ff700 (LWP 10004)): #0 0x00007efcaed6cca1 in poll () from target:/lib64/libc.so.6 #1 0x00007efcb336d9b6 in g_main_context_iterate.isra () from target:/lib64/libglib-2.0.so.0 #2 0x00007efcb336dd72 in g_main_loop_run () from target:/lib64/libglib-2.0.so.0 #3 0x00007efcb0f528db in red_worker_main () from target:/lib64/libspice-server.so.1 #4 0x00007efcaf04614a in start_thread () from target:/lib64/libpthread.so.0 #5 0x00007efcaed77f23 in clone () from target:/lib64/libc.so.6 Thread 5 (Thread 0x7efca2419700 (LWP 10001)): #0 0x00007efcaf04c2fc in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0 #1 0x0000563a370c21fd in qemu_cond_wait_impl (cond=<optimized out>, mutex=0x563a378f91c0 <qemu_global_mutex>, file=0x563a371875d8 "/builddir/build/BUILD/qemu-5.1.0/softmmu/cpus.c", line=1145) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/qemu-thread-posix.c:174 #2 0x0000563a36e1e277 in qemu_wait_io_event (cpu=cpu@entry=0x563a3865f440) at 
/usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/softmmu/cpus.c:1145 #3 0x0000563a36e1faa8 in qemu_kvm_cpu_thread_fn (arg=0x563a3865f440) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/softmmu/cpus.c:1193 #4 qemu_kvm_cpu_thread_fn (arg=0x563a3865f440) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/softmmu/cpus.c:1160 #5 0x0000563a370c1e44 in qemu_thread_start (args=0x563a38685c60) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/qemu-thread-posix.c:521 #6 0x00007efcaf04614a in start_thread () from target:/lib64/libpthread.so.0 #7 0x00007efcaed77f23 in clone () from target:/lib64/libc.so.6 Thread 4 (Thread 0x7efca2c1a700 (LWP 10000)): #0 0x00007efcaed6cca1 in poll () from target:/lib64/libc.so.6 #1 0x00007efcb336d9b6 in g_main_context_iterate.isra () from target:/lib64/libglib-2.0.so.0 #2 0x00007efcb336dd72 in g_main_loop_run () from target:/lib64/libglib-2.0.so.0 #3 0x0000563a36e9b931 in iothread_run (opaque=0x563a384f2e60) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/iothread.c:82 #4 0x0000563a370c1e44 in qemu_thread_start (args=0x563a3865ccc0) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/qemu-thread-posix.c:521 #5 0x00007efcaf04614a in start_thread () from target:/lib64/libpthread.so.0 #6 0x00007efcaed77f23 in clone () from target:/lib64/libc.so.6 Thread 3 (Thread 0x7efca3fff700 (LWP 9995)): #0 0x00007efcaed6cd96 in ppoll () from target:/lib64/libc.so.6 #1 0x0000563a370d8bf9 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77 #2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/qemu-timer.c:335 #3 0x0000563a370bfe21 in fdmon_poll_wait (ctx=0x563a385e3300, ready_list=0x7efca3ffe6b0, timeout=-1) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/fdmon-poll.c:79 #4 0x0000563a370bf62e in aio_poll (ctx=0x563a385e3300, blocking=blocking@entry=true) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/aio-posix.c:600 #5 0x0000563a36e9b904 in iothread_run (opaque=0x563a38500800) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/iothread.c:75 #6 0x0000563a370c1e44 in qemu_thread_start (args=0x563a385e3810) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/qemu-thread-posix.c:521 #7 0x00007efcaf04614a in start_thread () from target:/lib64/libpthread.so.0 #8 0x00007efcaed77f23 in clone () from target:/lib64/libc.so.6 Thread 2 (Thread 0x7efca8e03700 (LWP 9994)): #0 0x00007efcaf04f65d in __lll_lock_wait () from target:/lib64/libpthread.so.0 #1 0x00007efcaf048979 in pthread_mutex_lock () from target:/lib64/libpthread.so.0 #2 0x0000563a370c1f2d in qemu_mutex_lock_impl (mutex=0x563a378f91c0 <qemu_global_mutex>, file=0x563a37272ea8 "/builddir/build/BUILD/qemu-5.1.0/util/rcu.c", line=266) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/qemu-thread-posix.c:79 #3 0x0000563a36e1f99e in qemu_mutex_lock_iothread_impl ( file=file@entry=0x563a37272ea8 "/builddir/build/BUILD/qemu-5.1.0/util/rcu.c", line=line@entry=266) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/softmmu/cpus.c:1782 #4 0x0000563a370ca916 in call_rcu_thread (opaque=<optimized out>) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/rcu.c:266 #5 
0x0000563a370c1e44 in qemu_thread_start (args=0x563a383f7760) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/qemu-thread-posix.c:521 #6 0x00007efcaf04614a in start_thread () from target:/lib64/libpthread.so.0 #7 0x00007efcaed77f23 in clone () from target:/lib64/libc.so.6 Thread 1 (Thread 0x7efcb407a680 (LWP 9990)): #0 0x00007efcaed6cd96 in ppoll () from target:/lib64/libc.so.6 #1 0x0000563a370d8bf9 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77 #2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/qemu-timer.c:335 #3 0x0000563a370bfe21 in fdmon_poll_wait (ctx=0x563a385dea20, ready_list=0x7ffc7f1c9cf0, timeout=-1) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/fdmon-poll.c:79 #4 0x0000563a370bf62e in aio_poll (ctx=0x563a385dea20, blocking=blocking@entry=true) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/util/aio-posix.c:600 #5 0x0000563a37057db2 in nbd_export_close_all () at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/nbd/server.c:1751 #6 0x0000563a36ff58a9 in bdrv_close_all () at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/block.c:4427 #7 0x0000563a36e332bb in qemu_cleanup () at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/softmmu/vl.c:4531 #8 0x0000563a36d4bbd7 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /usr/src/debug/qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64/softmmu/main.c:50 Why libvirtd is stuck? seems that it hangs in qemuMonitorSend waiting for reply from qemu, that never received since qemu is hanged on iothread lock. 
Thread 19 (Thread 0x7fc4fcbf6700 (LWP 10062)): #0 0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0 #2 0x00007fc50286a924 in qemuMonitorSend () from /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so #3 0x00007fc50287a605 in qemuMonitorJSONCommandWithFd () from /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so #4 0x00007fc50288aeac in qemuMonitorJSONGetJobInfo () from /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so #5 0x00007fc5027ecfb9 in qemuBlockJobUpdate () from /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so #6 0x00007fc5028c0968 in qemuProcessEventHandler () from /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so #7 0x00007fc54c1cd46f in virThreadPoolWorker () from /lib64/libvirt.so.0 #8 0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0 #9 0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0 #10 0x00007fc548287f23 in clone () from /lib64/libc.so.6 Thread 18 (Thread 0x7fc4fef88700 (LWP 9993)): #0 0x00007fc54827cca1 in poll () from /lib64/libc.so.6 #1 0x00007fc548ddb9b6 in g_main_context_iterate.isra () from /lib64/libglib-2.0.so.0 #2 0x00007fc548ddbd72 in g_main_loop_run () from /lib64/libglib-2.0.so.0 #3 0x00007fc54c1722be in virEventThreadWorker () from /lib64/libvirt.so.0 #4 0x00007fc548e03d4a in g_thread_proxy () from /lib64/libglib-2.0.so.0 #5 0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0 #6 0x00007fc548287f23 in clone () from /lib64/libc.so.6 Thread 17 (Thread 0x7fc4fd3f7700 (LWP 1602)): #0 0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0 #2 0x00007fc5031c73fc in udevEventHandleThread () from /usr/lib64/libvirt/connection-driver/libvirt_driver_nodedev.so #3 0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0 #4 0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0 #5 0x00007fc548287f23 in clone () from /lib64/libc.so.6 Thread 16 (Thread 0x7fc500779700 (LWP 1514)): #0 0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0 #2 0x00007fc54c1cd53b in virThreadPoolWorker () from /lib64/libvirt.so.0 #3 0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0 #4 0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0 #5 0x00007fc548287f23 in clone () from /lib64/libc.so.6 Thread 15 (Thread 0x7fc500f7a700 (LWP 1513)): #0 0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0 #2 0x00007fc54c1cd53b in virThreadPoolWorker () from /lib64/libvirt.so.0 #3 0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0 #4 0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0 #5 0x00007fc548287f23 in clone () from /lib64/libc.so.6 Thread 14 (Thread 0x7fc50177b700 (LWP 1511)): #0 0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0 #2 0x00007fc54c1cd53b in virThreadPoolWorker () from /lib64/libvirt.so.0 #3 0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0 #4 0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0 #5 0x00007fc548287f23 in clone () from /lib64/libc.so.6 Thread 13 (Thread 0x7fc501f7c700 (LWP 1510)): #0 0x00007fc5489782fc in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0 #2 0x00007fc54c1cd53b in virThreadPoolWorker () from /lib64/libvirt.so.0 #3 0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0 #4 0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0 #5 0x00007fc548287f23 in clone () from /lib64/libc.so.6 Thread 12 (Thread 0x7fc50277d700 (LWP 1509)): #0 0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0 #2 0x00007fc54c1cd53b in virThreadPoolWorker () from /lib64/libvirt.so.0 #3 0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0 #4 0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0 #5 0x00007fc548287f23 in clone () from /lib64/libc.so.6 Thread 11 (Thread 0x7fc53d547700 (LWP 1393)): #0 0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0 #2 0x00007fc54c1cd4f8 in virThreadPoolWorker () from /lib64/libvirt.so.0 #3 0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0 #4 0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0 #5 0x00007fc548287f23 in clone () from /lib64/libc.so.6 Thread 10 (Thread 0x7fc53dd48700 (LWP 1392)): #0 0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0 #2 0x00007fc54c1cd4f8 in virThreadPoolWorker () from /lib64/libvirt.so.0 #3 0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0 #4 0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0 #5 0x00007fc548287f23 in clone () from /lib64/libc.so.6 Thread 9 (Thread 0x7fc53e549700 (LWP 1391)): #0 0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0 #2 0x00007fc54c1cd4f8 in virThreadPoolWorker () from /lib64/libvirt.so.0 #3 0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0 #4 0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0 #5 0x00007fc548287f23 in clone () from /lib64/libc.so.6 Thread 8 (Thread 0x7fc53ed4a700 (LWP 1390)): #0 0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0 #2 0x00007fc54c1cd4f8 in virThreadPoolWorker () from /lib64/libvirt.so.0 #3 0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0 #4 0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0 #5 0x00007fc548287f23 in clone () from /lib64/libc.so.6 Thread 7 (Thread 0x7fc53f54b700 (LWP 1389)): #0 0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0 #2 0x00007fc54c1cd4f8 in virThreadPoolWorker () from /lib64/libvirt.so.0 #3 0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0 #4 0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0 #5 0x00007fc548287f23 in clone () from /lib64/libc.so.6 Thread 6 (Thread 0x7fc53fd4c700 (LWP 1388)): #0 0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0 #2 0x00007fc54c1cd53b in virThreadPoolWorker () from /lib64/libvirt.so.0 #3 0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0 #4 0x00007fc54897214a in 
start_thread () from /lib64/libpthread.so.0 #5 0x00007fc548287f23 in clone () from /lib64/libc.so.6 Thread 5 (Thread 0x7fc54054d700 (LWP 1387)): #0 0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0 #2 0x00007fc54c1cd53b in virThreadPoolWorker () from /lib64/libvirt.so.0 #3 0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0 #4 0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0 #5 0x00007fc548287f23 in clone () from /lib64/libc.so.6 Thread 4 (Thread 0x7fc540d4e700 (LWP 1386)): #0 0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0 #2 0x00007fc54c1cd53b in virThreadPoolWorker () from /lib64/libvirt.so.0 #3 0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0 #4 0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0 #5 0x00007fc548287f23 in clone () from /lib64/libc.so.6 Thread 3 (Thread 0x7fc54154f700 (LWP 1385)): #0 0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0 #2 0x00007fc54c1cd53b in virThreadPoolWorker () from /lib64/libvirt.so.0 #3 0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0 #4 0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0 #5 0x00007fc548287f23 in clone () from /lib64/libc.so.6 Thread 2 (Thread 0x7fc541d50700 (LWP 1384)): #0 0x00007fc5489782fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fc54c1cc91a in virCondWait () from /lib64/libvirt.so.0 #2 0x00007fc54c1cd53b in virThreadPoolWorker () from /lib64/libvirt.so.0 #3 0x00007fc54c1cca6b in virThreadHelper () from /lib64/libvirt.so.0 #4 0x00007fc54897214a in start_thread () from /lib64/libpthread.so.0 #5 0x00007fc548287f23 in clone () from /lib64/libc.so.6 Thread 1 (Thread 0x7fc54ccfdc00 (LWP 1355)): #0 0x00007fc54827cca1 in poll () from /lib64/libc.so.6 #1 0x00007fc548ddb9b6 in g_main_context_iterate.isra () from /lib64/libglib-2.0.so.0 #2 0x00007fc548ddbae0 in g_main_context_iteration () from /lib64/libglib-2.0.so.0 #3 0x00007fc54c172084 in virEventGLibRunOnce () from /lib64/libvirt.so.0 #4 0x00007fc54c2bd6c5 in virNetDaemonRun () from /lib64/libvirt.so.0 #5 0x000055eb8223e4df in main () Trying to kill qemu: # kill 9990 # ps -p 9990 PID TTY TIME CMD 9990 ? 00:00:32 qemu-kvm The process is stuck, so the only way is: # kill -9 9990 System detect that VM was powered off now. 
Trying to stop the backup again:

$ ./backup_vm.py -c engine-dev stop 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 5c4407d2-f0ba-4381-968f-6077c2787d0b
[ 0.0 ] Finalizing backup 5c4407d2-f0ba-4381-968f-6077c2787d0b
Traceback (most recent call last):
  File "./backup_vm.py", line 430, in <module>
    main()
  File "./backup_vm.py", line 161, in main
    args.command(args)
  File "./backup_vm.py", line 245, in cmd_stop
    stop_backup(connection, args.backup_uuid, args)
  File "./backup_vm.py", line 327, in stop_backup
    backup_service.finalize()
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/services.py", line 33306, in finalize
    return self._internal_action(action, 'finalize', None, headers, query, wait)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 299, in _internal_action
    return future.wait() if wait else future
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 296, in callback
    self._check_fault(response)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Host not found.]". HTTP response code is 400.

So we have no backup, good.

Looking in the engine log:

1. Engine tries to end the backup

2020-11-22 14:55:15,153+02 INFO  [org.ovirt.engine.core.bll.StopVmBackupCommand] (default task-1) [79461714-65f1-4141-8aad-3d3564345907] Stopping VmBackup '5c4407d2-f0ba-4381-968f-6077c2787d0b

2. Engine fails to find the host that runs the VM

564345907] Cannot get vdsManager for vdsid='00000000-0000-0000-0000-000000000000'.
2020-11-22 14:55:15,156+02 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (default task-1) [79461714-65f1-4141-8aad-3d3564345907] createCommand failed: EngineException: Vds with id: 00000000-0000-0000-0000-000000000000 was not found (Failed with error RESOURCE_MANAGER_VDS_NOT_FOUND and code 5004)

It looks like the engine uses an empty Guid to locate the host, probably because it did not save the host in the db, and it assumes that we always have a VM object when a backup is running.

3. Engine logs a failure to finalize the backup

2020-11-22 14:55:15,175+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1) [79461714-65f1-4141-8aad-3d3564345907] EVENT_ID: VM_BACKUP_FAILED_TO_FINALIZE(10,795), Backup <UNKNOWN> for VM backup-raw failed to finalize (User: admin@internal-authz).

Looking on the host:

# tree /var/lib/vdsm/storage/transient_disks/4dc3bb16-f8d1-4f59-9388-a93f68da7cf0/
/var/lib/vdsm/storage/transient_disks/4dc3bb16-f8d1-4f59-9388-a93f68da7cf0/
└── 5c4407d2-f0ba-4381-968f-6077c2787d0b.sda

The scratch disk was not removed.

So we have more bugs:
- qemu deadlocks after poweroff during backup (BUG)
- libvirt hangs on the qemu monitor without a timeout (BUG)
- Engine logs that the backup was finalized while it was not finalized (BUG)
- The finalize action is synchronous instead of asynchronous (BUG?)
- Finalize succeeded even when the actual backup was not finalized (BUG)
- The backup is reported as ready after calling finalize. Is it really ready? It looks like we need to move to the FINALIZING state when finalize is called, to make sure that no new transfer can be started once finalize was called.
- There is no way to terminate the VM because a backup is running. The user should be able to abort a backup (BUG)
- There is no way to shutdown/poweroff the VM; the only way is to kill qemu
- After killing qemu, the engine fails to stop the backup since the host is not found (BUG)
- The scratch disk is left on the host side and never deleted
Last issue - after shutting down the VM and failing to finalize the backup, the VM remains locked and cannot be started again:

Error while executing action: backup-raw: Cannot run VM: VM is locked. Please try again in a few minutes.

I think the issue is that the backup command did not clean up the in-memory locks.
A leftover backup command is in the db:

engine=# select * from vm_backups;
              backup_id               | from_checkpoint_id | to_checkpoint_id |                vm_id                 | phase |       _create_date
--------------------------------------+--------------------+------------------+--------------------------------------+-------+----------------------------
 5c4407d2-f0ba-4381-968f-6077c2787d0b |                    |                  | 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 | Ready | 2020-11-22 12:53:19.422+02

The backup cannot be finalized because the engine does not find the host the VM is running on, since the VM is not running. The issue is that we assume the VM must be running during the entire backup. If the VM is powered off from the guest, or qemu is killed, we don't have the host id in the command, so we cannot clean up the backup. The host id must be kept in the backup job to avoid this issue.
Second attempt to reproduce, qemu did not get stuck, and the backup did not hang: $ ./backup_vm.py -c engine-dev full 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 [ 0.0 ] Starting full backup for VM 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 [ 0.3 ] Waiting until backup 4ab7aec5-cc29-4a15-921b-d78180473b14 is ready [ 1.3 ] Creating image transfer for disk 126eea31-c5a2-4c01-a18d-9822b0c05c2a [ 2.6 ] Image transfer 1b113c65-3638-4ecd-b9c5-1cf9348c00ce is ready [ 63.68% ] 3.82 GiB, 3.46 seconds, 1.10 GiB/s [ 6.0 ] Finalizing image transfer [ 9.1 ] Finalizing backup Traceback (most recent call last): File "./backup_vm.py", line 177, in cmd_full download_backup(connection, backup.id, args) File "./backup_vm.py", line 357, in download_backup download_disk(connection, backup_uuid, disk, disk_path, args, incremental=incremental) File "./backup_vm.py", line 392, in download_disk client.download( File "/usr/lib64/python3.8/site-packages/ovirt_imageio/client/_api.py", line 172, in download io.copy( File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 69, in copy log.debug("Executor failed") File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 189, in __exit__ self.stop() File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 166, in stop raise self._errors[0] File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 238, in _run handler.copy(req) File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 282, in copy self._src.write_to(self._dst, req.length, self._buf) File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 204, in write_to res = self._get(length) File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 431, in _get self._reraise(res.status, res.read()) File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 585, in _reraise raise http.Error(status, msg) ovirt_imageio._internal.http.Error: Server failed to perform the request, check logs During handling of the above exception, another exception occurred: Traceback (most recent call last): File "./backup_vm.py", line 431, in <module> main() File "./backup_vm.py", line 161, in main args.command(args) File "./backup_vm.py", line 180, in cmd_full stop_backup(connection, backup.id, args) File "./backup_vm.py", line 329, in stop_backup backup_service.finalize() File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/services.py", line 33306, in finalize return self._internal_action(action, 'finalize', None, headers, query, wait) File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 299, in _internal_action return future.wait() if wait else future File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 55, in wait return self._code(response) File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 296, in callback self._check_fault(response) File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 132, in _check_fault self._raise_error(response, body) File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 118, in _raise_error raise error ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Host not found.]". HTTP response code is 400. 
In this case stopping the backup failed since the host was not found and the only way to clear this backup is to manually delete the backup from the db: $ sudo -u postgres psql -d engine psql (12.1) Type "help" for help. engine=# select * from vm_backups; backup_id | from_checkpoint_id | to_checkpoint_id | vm_id | phase | _create_date --------------------------------------+--------------------+------------------+--------------------------------------+-------+-- -------------------------- 4ab7aec5-cc29-4a15-921b-d78180473b14 | | | 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 | Ready | 2 020-11-22 17:20:06.204+02 (1 row) engine=# delete from vm_backups where backup_id='4ab7aec5-cc29-4a15-921b-d78180473b14'; DELETE 1 engine=# \q $ sudo systemctl restart ovirt-engine
Third attempt, backup hangs: $ ./backup_vm.py -c engine-dev full --backup-dir /var/tmp/backups/raw 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 [ 0.0 ] Starting full backup for VM 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 [ 0.8 ] Waiting until backup f01630c5-943c-4845-b404-29e5ff0b509b is ready [ 1.8 ] Creating image transfer for disk 126eea31-c5a2-4c01-a18d-9822b0c05c2a [ 3.1 ] Image transfer 2c150870-915b-4cc2-a4a1-ca32287099de is ready [ 45.55% ] 2.73 GiB, 2.84 seconds, 983.96 MiB/s [ 5.9 ] Finalizing image transfer [ 9.0 ] Finalizing backup [ 39.1 ] Waiting until backup is finalized ^CTraceback (most recent call last): File "./backup_vm.py", line 177, in cmd_full download_backup(connection, backup.id, args) File "./backup_vm.py", line 357, in download_backup download_disk(connection, backup_uuid, disk, disk_path, args, incremental=incremental) File "./backup_vm.py", line 392, in download_disk client.download( File "/usr/lib64/python3.8/site-packages/ovirt_imageio/client/_api.py", line 172, in download io.copy( File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 69, in copy log.debug("Executor failed") File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 189, in __exit__ self.stop() File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 166, in stop raise self._errors[0] File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 238, in _run handler.copy(req) File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 282, in copy self._src.write_to(self._dst, req.length, self._buf) File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 204, in write_to res = self._get(length) File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 431, in _get self._reraise(res.status, res.read()) File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 585, in _reraise raise http.Error(status, msg) ovirt_imageio._internal.http.Error: Server failed to perform the request, check logs During handling of the above exception, another exception occurred: Traceback (most recent call last): File "./backup_vm.py", line 431, in <module> main() File "./backup_vm.py", line 161, in main args.command(args) File "./backup_vm.py", line 180, in cmd_full stop_backup(connection, backup.id, args) File "./backup_vm.py", line 335, in stop_backup time.sleep(1) KeyboardInterrupt Qemu is deadlocked. After killing qemu I could start the vm - but the backup was still in the db. Looks like we have a locking bug - the backup should prevent starting the VM. Once the VM was started, you could not stop it because of the backup.
I tried more times to collect logs and check how reproducible this issue is:

Attempt 4 - qemu did not hang.
Attempt 5 - qemu did not hang, but the backup hung; the engine failed to stop the backup since the VM was destroyed.

So this failed 3 of 5 attempts, with various failures. I'll file bugs for qemu, libvirt, ovirt-engine and ovirt-engine-sdk.
In the last attempt, vdsm failed stop_backup with:

2020-11-22 18:37:25,071+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.StopVmBackupVDSCommand] (default task-2) [72d778a8-fe49-441e-8b5d-167606b85e75] Command 'StopVmBackupVDSCommand(HostName = host3, VmBackupVDSParameters:{hostId='f1593c12-5d93-444b-99f4-cd31ffd45506', backupId='76073f2e-4709-404c-bd81-82ee89f85e24'})' execution failed: VDSGenericException: VDSErrorException: Failed to StopVmBackupVDS, error = General Exception: ("VM '4dc3bb16-f8d1-4f59-9388-a93f68da7cf0' was not defined yet or was undefined",), code = 100

This error is not useful for the engine. If stopping the backup fails because the VM is undefined, vdsm should treat this as success: an undefined VM cannot have a running backup. In this case vdsm should have removed the transient disk and returned success (see the sketch below). I'll also file a vdsm bug for this.
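A minimal sketch of the suggested vdsm-side behavior (this is not vdsm's actual code; all names here are placeholders): treat an already-undefined VM as success and still clean up the transient scratch disk.

import os
import shutil


class NoSuchVM(Exception):
    """Stand-in for the 'VM was not defined yet or was undefined' error."""


TRANSIENT_DIR = "/var/lib/vdsm/storage/transient_disks"


def remove_scratch_disks(vm_id):
    # Remove leftover scratch disks for this VM, if any.
    path = os.path.join(TRANSIENT_DIR, vm_id)
    if os.path.isdir(path):
        shutil.rmtree(path)


def stop_backup(abort_backup_job, vm_id, backup_id):
    # abort_backup_job is a placeholder for the call that aborts the libvirt
    # backup job; it raises NoSuchVM when the VM is already gone.
    try:
        abort_backup_job(vm_id, backup_id)
    except NoSuchVM:
        # An undefined VM cannot have a running backup, so nothing to abort.
        pass
    finally:
        remove_scratch_disks(vm_id)
    return {"status": {"code": 0, "message": "Done"}}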
Attempt 6 was successful - qemu exit cleanly and engine stopped the backup successfully. So this is 50% reproducible. I think it is more likey when powering off right after the download starts. $ ./backup_vm.py -c engine-dev full --backup-dir /var/tmp/backups/raw 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 [ 0.0 ] Starting full backup for VM 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 [ 0.7 ] Waiting until backup 370a3445-62e0-4b8a-b2b6-6a3662a52413 is ready [ 1.8 ] Creating image transfer for disk 126eea31-c5a2-4c01-a18d-9822b0c05c2a [ 3.1 ] Image transfer f4fc43f9-dd7d-47d9-9fca-6697b4cf7a23 is ready [ 78.12% ] 4.69 GiB, 4.89 seconds, 981.80 MiB/s [ 8.0 ] Finalizing image transfer [ 9.0 ] Finalizing backup [ 9.1 ] Waiting until backup is finalized Traceback (most recent call last): File "./backup_vm.py", line 431, in <module> main() File "./backup_vm.py", line 161, in main args.command(args) File "./backup_vm.py", line 177, in cmd_full download_backup(connection, backup.id, args) File "./backup_vm.py", line 357, in download_backup download_disk(connection, backup_uuid, disk, disk_path, args, incremental=incremental) File "./backup_vm.py", line 392, in download_disk client.download( File "/usr/lib64/python3.8/site-packages/ovirt_imageio/client/_api.py", line 172, in download io.copy( File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 69, in copy log.debug("Executor failed") File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 189, in __exit__ self.stop() File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 166, in stop raise self._errors[0] File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 238, in _run handler.copy(req) File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 282, in copy self._src.write_to(self._dst, req.length, self._buf) File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 204, in write_to res = self._get(length) File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 431, in _get self._reraise(res.status, res.read()) File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 585, in _reraise raise http.Error(status, msg) ovirt_imageio._internal.http.Error: Server failed to perform the request, check logs
I created bug 1900505 for the qemu deadlock. Until this is fixed, we cannot fix the hang during backup.
I filed bug 1900518 for the vdsm issue: VM.stop_backup failing when the VM was destroyed before VM.stop_backup was called.
The qemu version with the fix is now used by VDSM, after patch https://gerrit.ovirt.org/#/c/vdsm/+/114307/ was merged. Moving to QA for testing.
Version for validation:

# rpm -q ovirt-engine
ovirt-engine-4.4.6.5-0.17.el8ev.noarch

# rpm -qa | grep release
rhv-release-4.4.6-6-001.noarch
redhat-release-8.4-0.6.el8.x86_64
redhat-release-eula-8.4-0.6.el8.x86_64

# rpm -q vdsm
vdsm-4.40.60.5-1.el8ev.x86_64

The script invocation mentioned in comment #4 didn't work for me (./backup_vm.py -c engine-dev full --backup-dir /var/tmp/backups/qcow2 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0); I received many errors when I tried to run it in this manner.

For now, I validated this bug using the following steps:
1. python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c engine start [vmuuid]
2. python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c engine download [vmuuid] --backup-dir . --backup-uuid [backup]
3. I connected directly to the relevant VM via SSH and, before the download was complete, ran the "poweroff" command in the VM terminal.

Verification summary and conclusions:

The flow above resulted in a timeout error (RuntimeError: Timed out waiting for transfer 89ba7382-be4e-4102-9f71-4d9003cb6afd to finalize, plus additional errors in my terminal), although the expected result was that the backup should succeed and the image transfer should send data normally.
(In reply to Amit Sharir from comment #16)
> version for validation:
>
> # rpm -q ovirt-engine
> ovirt-engine-4.4.6.5-0.17.el8ev.noarch
>
> # rpm -qa | grep release
> rhv-release-4.4.6-6-001.noarch
> redhat-release-8.4-0.6.el8.x86_64
> redhat-release-eula-8.4-0.6.el8.x86_64
>
> # rpm -q vdsm
> vdsm-4.40.60.5-1.el8ev.x86_64
>
> The script mentioned in comment #4 didn't work for me. (./backup_vm.py -c
> engine-dev full --backup-dir /var/tmp/backups/qcow2
> 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 - I received many errors when I tried
> to run it in this manner.

I don't have the errors, so I can't tell why it fails, but I assume that you don't have the config file (engine-dev) and that you tried to use the same VM ID Nir mentioned, which doesn't exist in your environment.

> For now, I validated this bug using the following steps.
>
> 1. python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c
> engine start [vmuuid]
> 2. python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c
> engine download [vmuuid] --backup-dir . --backup-uuid [backup]
> 3. I connected directly to the relevant VM via SSH connection and before the
> download was complete I did the "poweroff" command via the VM terminal.

Looks OK.

> Verification Summary and conclusions -
>
> The following flow resulted in a timeout error : (RuntimeError: Timed out
> waiting for transfer 89ba7382-be4e-4102-9f71-4d9003cb6afd to finalize, I
> also got additional errors in my terminal), although the expected result was
> that the backup should be successful and Image transfer sends data normally.

Makes sense: QEMU terminates due to the power-off, so the transfer will eventually fail with a timeout. Please see comment #3.

The expected result is that if the user retries downloading the backup (a cold backup download), it will succeed, and the user should be able to finalize the backup.
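For reference, with the same example script the retry flow would look roughly like this (the UUIDs are placeholders, and "stop" finalizes the backup):

$ python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c engine download [vmuuid] --backup-dir . --backup-uuid [backup]
$ python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c engine stop [vmuuid] [backup]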
Eyal, thanks for the clarifications.

Regarding the error I received with (./backup_vm.py -c engine-dev full --backup-dir /var/tmp/backups/qcow2 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0): the VM id and config file that I used were correct. We found an issue with the SDK script when using the "full" parameter, and opened bug https://bugzilla.redhat.com/show_bug.cgi?id=1956167 for that specific issue.

I ran the verification flow mentioned in comment #16 again and got the expected results mentioned in comment #4 and comment #17.

Version for validation:

# rpm -q vdsm
vdsm-4.40.60.6-1.el8ev.x86_64

# rpm -q ovirt-engine
ovirt-engine-4.4.6.6-0.10.el8ev.noarch

# rpm -qa | grep release
rhv-release-4.4.6-7-001.noarch
redhat-release-8.4-0.6.el8.x86_64
redhat-release-eula-8.4-0.6.el8.x86_64

Verification summary and conclusions:

The actual and expected results were the same. I was able to retry the download of the backup (cold backup download) successfully, and I was able to finalize the backup.

Bug verified.
This bugzilla is included in the oVirt 4.4.6 release, published on May 4th 2021. Since the problem described in this bug report should be resolved in the oVirt 4.4.6 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.