Created attachment 1732487 [details]
Logs from 3 backups showing this issue

Description of problem:

If a VM was destroyed during backup, the engine fails to stop the backup with this error:

$ ./backup_vm.py -c engine-dev stop 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 b7de98d7-914c-4c24-a9dd-a3202505e511
[   0.0 ] Finalizing backup b7de98d7-914c-4c24-a9dd-a3202505e511
Traceback (most recent call last):
  File "./backup_vm.py", line 431, in <module>
    main()
  File "./backup_vm.py", line 161, in main
    args.command(args)
  File "./backup_vm.py", line 247, in cmd_stop
    stop_backup(connection, args.backup_uuid, args)
  File "./backup_vm.py", line 329, in stop_backup
    backup_service.finalize()
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/services.py", line 33306, in finalize
    return self._internal_action(action, 'finalize', None, headers, query, wait)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 299, in _internal_action
    return future.wait() if wait else future
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 296, in callback
    self._check_fault(response)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Host not found.]". HTTP response code is 400.
In the engine log we see:

2020-11-22 18:08:36,409+02 ERROR [org.ovirt.engine.core.bll.StopVmBackupCommand] (default task-5) [59e594ba-825c-41c6-9b9e-0259ca331393] Command 'org.ovirt.engine.core.bll.StopVmBackupCommand' failed: EngineException: Vds with id: 00000000-0000-0000-0000-000000000000 was not found (Failed with error RESOURCE_MANAGER_VDS_NOT_FOUND and code 5004)
2020-11-22 18:08:36,409+02 ERROR [org.ovirt.engine.core.bll.StopVmBackupCommand] (default task-5) [59e594ba-825c-41c6-9b9e-0259ca331393] Exception: java.lang.RuntimeException: EngineException: Vds with id: 00000000-0000-0000-0000-000000000000 was not found (Failed with error RESOURCE_MANAGER_VDS_NOT_FOUND and code 5004)

Version-Release number of selected component (if applicable):
4.4.4.2_master

How reproducible:
90%

Steps to Reproduce:
1. Start a full backup
2. Wait until the download starts
3. In the guest, power off

Actual results:
In most cases qemu hangs and the VM moves to an unknown state. In some cases qemu exits and the system detects that the VM is not running before trying to stop the backup. Stopping the backup fails with a "Host not found" error. The scratch disks are left on the host, which may cause the host disk to become full.

Expected results:
The engine succeeds in stopping the backup even if the VM is not running, and the scratch disks are removed.

Additional info:
The issue is that we assume we always have a running VM during backup, so storing the VM id in the vm_backups table is enough to locate the host when stopping the backup. But this assumption breaks when a VM is destroyed in the middle of a backup, for example when a user powers off the VM from inside the guest. The system detects that the VM was terminated, so trying to access the VM's VDS id returns an empty id.
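The failure mode above can be sketched as follows. This is a minimal, hypothetical Python rendering of the host-resolution logic (the actual engine code is Java); the function and parameter names are assumptions, not the engine's API. The all-zero GUID corresponds to the "Vds with id: 00000000-..." value seen in the log when the VM is gone.

```python
from uuid import UUID

# "No host" sentinel, matching the all-zero id in the engine log above.
EMPTY_GUID = UUID("00000000-0000-0000-0000-000000000000")

def resolve_backup_host(vm_run_host_id, backup_host_id):
    """Pick the host to send the stop-backup verb to.

    vm_run_host_id: host the VM currently runs on, or EMPTY_GUID if the
    VM was destroyed mid-backup.
    backup_host_id: host id persisted when the backup started (the fix
    proposed in this report).
    """
    if vm_run_host_id != EMPTY_GUID:
        return vm_run_host_id
    # The VM is gone: fall back to the host recorded at backup start
    # time, so the scratch disks can still be torn down there.
    return backup_host_id
```

Without the persisted fallback, the second branch has nothing to return, which is exactly why StopVmBackupCommand fails with RESOURCE_MANAGER_VDS_NOT_FOUND.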
engine=# select * from vm_backups;
              backup_id               | from_checkpoint_id | to_checkpoint_id |                vm_id                 | phase |        _create_date
--------------------------------------+--------------------+------------------+--------------------------------------+-------+----------------------------
 4ab7aec5-cc29-4a15-921b-d78180473b14 |                    |                  | 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 | Ready | 2020-11-22 17:20:06.204+02

We need to add the host id to the vm_backups table, so stopping a backup does not depend on having a running VM. This must be fixed to make incremental backup fully supported.
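A minimal sketch of the proposed schema change, using sqlite3 as a stand-in for the engine's PostgreSQL database. The column name host_id and the values are assumptions for illustration, not the actual engine DDL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# vm_backups with the proposed host_id column (hypothetical DDL),
# recorded when the backup starts.
conn.execute("""
    CREATE TABLE vm_backups (
        backup_id TEXT PRIMARY KEY,
        vm_id     TEXT,
        host_id   TEXT,   -- proposed addition
        phase     TEXT
    )
""")
conn.execute(
    "INSERT INTO vm_backups VALUES (?, ?, ?, ?)",
    ("4ab7aec5-cc29-4a15-921b-d78180473b14",   # backup_id from the report
     "4dc3bb16-f8d1-4f59-9388-a93f68da7cf0",   # vm_id from the report
     "hypothetical-host-id",                   # host running the backup
     "Ready"),
)

# Stopping the backup looks up the host from the table itself,
# with no dependency on a running VM.
(host_id,) = conn.execute(
    "SELECT host_id FROM vm_backups WHERE backup_id = ?",
    ("4ab7aec5-cc29-4a15-921b-d78180473b14",),
).fetchone()
```

With the host id persisted at backup start, StopVmBackupCommand can always find the host that holds the scratch disks, even after the VM row no longer points at any running host.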
Verified on rhv-4.4.4-6

> Steps to reproduce with backup_vm.py:
>
> 1. Create a VM with a disk enabled for incremental backup.
> 2. Start the VM.
> 3. Start a backup for the VM using: backup_vm.py -c engine start <vm-id>

[root@storage-ge5-vdsm3 examples]# python3 backup_vm.py -c engine start 89502d53-710a-4cd2-85ab-35a934f5f306
[   0.0 ] Starting full backup for VM '89502d53-710a-4cd2-85ab-35a934f5f306'
[   2.7 ] Waiting until backup bfbb9494-dc76-47d6-ac02-25a619bc6f8c is ready
[   5.8 ] Created checkpoint 'd7523bd4-dd15-45ce-9cef-7430693344b8' (to use in --from-checkpoint-uuid for the next incremental backup)
[   6.1 ] Backup bfbb9494-dc76-47d6-ac02-25a619bc6f8c is ready

> 4. Power off the VM.
> 5. Stop the VM backup using: backup_vm.py -c engine stop <vm-id> <backup-id>

[root@storage-ge5-vdsm3 examples]# python3 backup_vm.py -c engine stop 89502d53-710a-4cd2-85ab-35a934f5f306 bfbb9494-dc76-47d6-ac02-25a619bc6f8c
[   0.0 ] Finalizing backup bfbb9494-dc76-47d6-ac02-25a619bc6f8c
[   0.4 ] Backup bfbb9494-dc76-47d6-ac02-25a619bc6f8c has finalized

Actual result (as expected): The engine succeeds in stopping the backup even though the VM is not running.

engine=# select * from vm_backups;
(0 rows)

Engine reports:

EVENT_ID: VM_BACKUP_FINALIZED(10,794), Backup <UNKNOWN> for VM test finalized
.
.
2020-12-16 16:31:27,190+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-4) [d3716982-6cf4-44c6-9503-ebfea5323545] EVENT_ID: VM_BACKUP_SUCCEEDED(10,793), Backup bfbb9494-dc76-47d6-ac02-25a619bc6f8c for VM test completed successfully (User: admin@internal-authz).
This bugzilla is included in the oVirt 4.4.4 release, published on December 21st 2020.

Since the problem described in this bug report should be resolved in the oVirt 4.4.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.