Created attachment 1732443 [details] logs from 3 backups showing this issue Description of problem: When powering of a vm from the guest during backup, qemu hangs and stopping the backup fails in vdsm (see bug 1892672). Engine reports that the backup was finalized while the backup was not finalized, and leftover scratch disks are left on the host running the backup. Here are logs from a run showing this issue: 1. Starting backup 2020-11-22 17:54:20,367+02 INFO [org.ovirt.engine.core.bll.StartVmBackupCommand] (default task-5) [33bd8b12-d604-4882-a193-4beecf1cf41e] Created VmBackup entity 'b7de98d7-914c-4c24-a9dd-a3202505e511' 2. Starting transfer for this backup 2020-11-22 17:54:21,714+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default task-5) [353218d4-d310-4c35-ba70-5dcea5c5d99a] Successfully added Download disk 'fedora-32.raw' (disk id: '126eea31-c5a2-4c01-a18d-9822b0c05c2a', image id: 'c7c8040e-597a-402f-b371-3f74510b6a27') for image transfer command 'f0754c8a-7b66-4ac9-8b77-7a5857588e71' 2020-11-22 17:54:21,777+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.AddImageTicketVDSCommand] (default task-5) [353218d4-d310-4c35-ba70-5dcea5c5d99a] START, AddImageTicketVDSCommand(HostName = host4, AddImageTicketVDSCommandParameters:{hostId='9de3098d-89e8-444c-95ed-0fc2ea74e170', ticketId='1bec74cf-aa0c-45b9-aec0-e22bd56bcbcb', timeout='300', operations='[read]', size='6442450944', url='nbd:unix:/run/vdsm/backup/b7de98d7-914c-4c24-a9dd-a3202505e511:exportname=sda', filename='null', sparse='true', transferId='f0754c8a-7b66-4ac9-8b77-7a5857588e71', dirty='false'}), log id: 6960e215 3. Transfer finalized 2020-11-22 17:54:28,859+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledE xecutorService-engineScheduledThreadPool-Thread-86) [353218d4-d310-4c35-ba70-5dcea5c5d99a] Finalizing successful transfer for Download disk 'fedora-32.raw' (disk id: '126eea31-c5a2-4c01-a18d-9822b0c05c2a', image id: 'c7c8040e-597a-402f-b371-3f74510b6a2 7') 4. Stopping backup 2020-11-22 17:54:29,950+02 INFO [org.ovirt.engine.core.bll.StopVmBackupCommand] (default task-5) [a8f31e56-1a74-418c-9131-ee708e368f06] Stopping VmBackup 'b7de98d7-914c-4c24-a9dd-a3202505e511' 5. Stopping backup failed in vdsm 2020-11-22 17:54:59,962+02 WARN [org.ovirt.engine.core.vdsbroker.vdsbroker.StopVmBackupVDSCommand] (default task-5) [a8f31e56-1a74-418c-9131-ee708e368f06] Unexpected return value: Status [code=1600, message=Backup Error: {'vm_id': '4dc3bb16-f8d1-4f59-9388-a93f68da7cf0', 'backup_id': 'b7de98d7-914c-4c24-a9dd-a3202505e511', 'reason': 'Failed to end VM backup: Timed out during operation: cannot acquire state change lock (held by monitor=qemu-event)'}] 6. Engine report finalized event with missing backup id 2020-11-22 17:55:00,042+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-5) [a8f3 1e56-1a74-418c-9131-ee708e368f06] EVENT_ID: VM_BACKUP_FINALIZED(10,794), Backup <UNKNOWN> for VM backup-raw finalized (User: a dmin@internal-authz). Client accessing this backup is hang, waiting for the backup to change state from "ready" to "finalizing": diff --git a/sdk/examples/backup_vm.py b/sdk/examples/backup_vm.py index fd40154..f871cb8 100755 --- a/sdk/examples/backup_vm.py +++ b/sdk/examples/backup_vm.py @@ -329,6 +329,7 @@ def stop_backup(connection, backup_uuid, args): # TODO: Wait until backup is finalized. backup = backup_service.get() while backup.phase != types.BackupPhase.FINALIZING: + print("backup.phase = %r, waiting" % backup.phase) time.sleep(1) backup = backup_service.get() $ ./backup_vm.py -c engine-dev stop 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 5c4407d2-f0ba-4381-968f-6077c2787d0b [ 0.0 ] Finalizing backup 5c4407d2-f0ba-4381-968f-6077c2787d0b backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting backup.phase = <BackupPhase.READY: 'ready'>, waiting Clients using events to detect backup completion will consider this backup as finalized, while the backup was not. Version-Release number of selected component (if applicable): 4.4.4.2_master How reproducible: 37% (3 of 8 attempts) Steps to Reproduce: 1. Start full backup with backup_vm.py 2. Wait until download starts 3. In the guest, poweroff 4. qemu hangs (see bug 1900505) 5. backup_vm.py detect failure in the transfer and finalize the transfer 6. backup_vm.py try to finalize the backup Actual results: Engine reports that backup was finalized VM_BACKUP_FINALIZED(10,794) Expected results: Engine reports that backup failed to finalize VM_BACKUP_FAILED_TO_FINALIZE(10,795) Additional info: Engine cannot finalize the backup until the VM is destroyed, since qemu is hang, and libvirt cannot stop the backup while qemu is hang, see: https://bugzilla.redhat.com/show_bug.cgi?id=1900505#c4
According to the logs, the engine failed to finalize the backup as expected. 2020-11-22 17:55:00,030+02 INFO [org.ovirt.engine.core.bll.StopVmBackupCommand] (default task-5) [a8f31e56-1a74-418c-9131-ee708e368f06] Failed to stop VmBackup 'b7de98d7-914c-4c24-a9dd-a3202505e511' The problem here is that the audit log reports that the backup was finalized instead of failed to finalize. 2020-11-22 17:55:00,042+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-5) [a8f31e56-1a74-418c-9131-ee708e368f06] EVENT_ID: VM_BACKUP_FINALIZED(10,794), Backup <UNKNOWN> for VM backup-raw finalized (User: admin@internal-authz). The fix will be to propagate the VM_BACKUP_FAILED_TO_FINALIZE audit log instead of the VM_BACKUP_FINALIZED.
Created attachment 1739343 [details] failed attempt to verify - dumpxml
Created attachment 1739662 [details] engine + vdsm + dumpxml for the second attempt to verify
Verified on: ovirt-engine-4.4.4.4-0.9.el8ev.noarch rhv-release-4.4.4-5-001.noarch 0. Edit the /usr/lib/python3.6/site-packages/vdsm/virt/backup.py to raise the exception 1. Create a VM with a disk that enabled for incremental backup. 2. Start the VM. 3. Start a backup for the VM [root@storage-ge5-vdsm3 examples]# python3 backup_vm.py -c engine start 89502d53-710a-4cd2-85ab-35a934f5f306 [ 0.0 ] Starting full backup for VM '89502d53-710a-4cd2-85ab-35a934f5f306' [ 1.9 ] Waiting until backup 21c37d1d-1cb8-4aef-80f9-2ca0c7ca4c5b is ready [ 2.9 ] Created checkpoint '7d2debbb-c320-475e-96bc-676dfb390f86' (to use in --from-checkpoint-uuid for the next incremental backup) [ 3.0 ] Backup 21c37d1d-1cb8-4aef-80f9-2ca0c7ca4c5b is ready 5. Stop the VM backup [root@storage-ge5-vdsm3 examples]# python3 backup_vm.py -c engine stop 89502d53-710a-4cd2-85ab-35a934f5f306 21c37d1d-1cb8-4aef-80f9-2ca0c7ca4c5b [ 0.0 ] Finalizing backup 21c37d1d-1cb8-4aef-80f9-2ca0c7ca4c5b Traceback (most recent call last): File "backup_vm.py", line 429, in <module> main() File "backup_vm.py", line 161, in main args.command(args) File "backup_vm.py", line 245, in cmd_stop stop_backup(connection, args.backup_uuid, args) File "backup_vm.py", line 327, in stop_backup backup_service.finalize() File "/usr/lib64/python3.6/site-packages/ovirtsdk4/services.py", line 33306, in finalize return self._internal_action(action, 'finalize', None, headers, query, wait) File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 299, in _internal_action return future.wait() if wait else future File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 55, in wait return self._code(response) File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 296, in callback self._check_fault(response) File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 132, in _check_fault self._raise_error(response, body) File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 118, in _raise_error raise error ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[]". HTTP response code is 400. Expected results - Stop backup should fail and audit logs should include the following message - VM_BACKUP_FAILED_TO_FINALIZE Actual result: Stop backup should fail engine log include message VM_BACKUP_FAILED_TO_FINALIZE 2020-12-17 11:09:45,184+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-11) [feb0506c-9156-42a6-ba16-d9214d63b39d] EVENT_ID: VM_BACKUP_FAILED_TO_FINALIZE(10,795), Backup <UNKNOWN> for VM test failed to finalize (User: admin@internal-authz). Attaching engine + vdsm logs
Created attachment 1739931 [details] verification logs engine + vdsm
This bugzilla is included in oVirt 4.4.4 release, published on December 21st 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.4 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.