Today we restarted the ovirt-engine VM. Normally that should cause no problems, but it seems that is not the case here.

During the night a backup was started for a VM:

2022-07-15 00:59:41,659+02 INFO [org.ovirt.engine.core.bll.storage.backup.StartVmBackupCommand] (default task-190) [e3107634-cf1a-4733-9945-7d21335c0641] Lock Acquired to object 'EngineLock:{exclusiveLocks='[3aba5c17-4830-4a53-8cd2-c615e77571c0=VM, 89291790-7b3b-4a40-9ca6-a6c29ef7f005=DISK]', sharedLocks=''}'
2022-07-15 00:59:41,667+02 INFO [org.ovirt.engine.core.bll.storage.backup.StartVmBackupCommand] (default task-190) [e3107634-cf1a-4733-9945-7d21335c0641] Running command: StartVmBackupCommand internal: false. Entities affected : ID: 3aba5c17-4830-4a53-8cd2-c615e77571c0 Type: VMAction group BACKUP_DISK with role type ADMIN, ID: 89291790-7b3b-4a40-9ca6-a6c29ef7f005 Type: DiskAction group BACKUP_DISK with role type ADMIN
2022-07-15 00:59:41,669+02 INFO [org.ovirt.engine.core.bll.storage.backup.StartVmBackupCommand] (default task-190) [e3107634-cf1a-4733-9945-7d21335c0641] Creating VmBackup entity for VM '3aba5c17-4830-4a53-8cd2-c615e77571c0'
2022-07-15 00:59:41,675+02 INFO [org.ovirt.engine.core.bll.storage.backup.StartVmBackupCommand] (default task-190) [e3107634-cf1a-4733-9945-7d21335c0641] Created VmBackup entity '91c5cfbf-bcc1-4c45-9554-f21f21438b06'

At noon the ovirt-engine was restarted. Due to some issues on the client side, the backup was never finalized, so we did a manual finalize of the backup, which seems to have succeeded:

2022-07-15 13:35:27,278+02 INFO [org.ovirt.engine.core.bll.storage.backup.StopVmBackupCommand] (default task-10) [d9af576d-8bb3-4c40-95a1-a41dae1ea648] Running command: StopVmBackupCommand internal: false. Entities affected : ID: 3aba5c17-4830-4a53-8cd2-c615e77571c0 Type: VMAction group BACKUP_DISK with role type ADMIN, ID: 89291790-7b3b-4a40-9ca6-a6c29ef7f005 Type: DiskAction group BACKUP_DISK with role type ADMIN
2022-07-15 13:35:27,278+02 INFO [org.ovirt.engine.core.bll.storage.backup.StopVmBackupCommand] (default task-10) [d9af576d-8bb3-4c40-95a1-a41dae1ea648] Stopping backup '91c5cfbf-bcc1-4c45-9554-f21f21438b06'
2022-07-15 13:35:27,281+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.StopVmBackupVDSCommand] (default task-10) [d9af576d-8bb3-4c40-95a1-a41dae1ea648] START, StopVmBackupVDSCommand(HostName = ovn002, VmBackupVDSParameters:{hostId='24009aef-82e8-408c-9ef1-baab24688917', backupId='91c5cfbf-bcc1-4c45-9554-f21f21438b06', requireConsistency='false'}), log id: 79c1412
2022-07-15 13:35:27,287+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.StopVmBackupVDSCommand] (default task-10) [d9af576d-8bb3-4c40-95a1-a41dae1ea648] FINISH, StopVmBackupVDSCommand, return: , log id: 79c1412
2022-07-15 13:35:27,288+02 INFO [org.ovirt.engine.core.bll.storage.backup.StopVmBackupCommand] (default task-10) [d9af576d-8bb3-4c40-95a1-a41dae1ea648] Change VM '3aba5c17-4830-4a53-8cd2-c615e77571c0' backup '91c5cfbf-bcc1-4c45-9554-f21f21438b06' phase from 'READY' to 'FINALIZING'
2022-07-15 13:35:27,297+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-10) [d9af576d-8bb3-4c40-95a1-a41dae1ea648] EVENT_ID: VM_BACKUP_FINALIZED(10,794), Backup 91c5cfbf-bcc1-4c45-9554-f21f21438b06 for VM api001 finalized (User: admin@internal-authz).
VDSM entries:

2022-07-15 13:35:27,371+0200 INFO (jsonrpc/0) [api.virt] START stop_backup(backup_id='91c5cfbf-bcc1-4c45-9554-f21f21438b06') from=::ffff:172.18.128.93,42694, flow_id=d9af576d-8bb3-4c40-95a1-a41dae1ea648, vmId=3aba5c17-4830-4a53-8cd2-c615e77571c0 (api:48)
2022-07-15 13:35:27,372+0200 INFO (jsonrpc/0) [virt.vm] (vmId='3aba5c17-4830-4a53-8cd2-c615e77571c0') Stop monitoring scratch disk {'index': 27, 'sd_id': '6a2e5543-df94-4124-be14-489ee588cde0', 'img_id': 'da877c35-2ac0-4b83-a5a6-5b44578f602d', 'vol_id': 'c806e08c-0a87-497d-879d-56193b8339b9'} for drive sda (backup:378)
2022-07-15 13:35:27,372+0200 INFO (jsonrpc/0) [storage.backup] Removing scratch disks for backup id: 91c5cfbf-bcc1-4c45-9554-f21f21438b06 (backup:687)
2022-07-15 13:35:27,372+0200 INFO (jsonrpc/0) [vdsm.api] START list_transient_disks(owner_name='3aba5c17-4830-4a53-8cd2-c615e77571c0') from=::ffff:172.18.128.93,42694, flow_id=d9af576d-8bb3-4c40-95a1-a41dae1ea648, task_id=adfc95c5-a023-4f3c-b491-67a32bfcda7b (api:48)
2022-07-15 13:35:27,372+0200 INFO (jsonrpc/0) [vdsm.api] FINISH list_transient_disks return={'result': []} from=::ffff:172.18.128.93,42694, flow_id=d9af576d-8bb3-4c40-95a1-a41dae1ea648, task_id=adfc95c5-a023-4f3c-b491-67a32bfcda7b (api:54)
2022-07-15 13:35:27,372+0200 INFO (jsonrpc/0) [api.virt] FINISH stop_backup return={'status': {'code': 0, 'message': 'Done'}} from=::ffff:172.18.128.93,42694, flow_id=d9af576d-8bb3-4c40-95a1-a41dae1ea648, vmId=3aba5c17-4830-4a53-8cd2-c615e77571c0 (api:54)
2022-07-15 13:35:27,415+0200 ERROR (libvirt/events) [virt.vm] (vmId='3aba5c17-4830-4a53-8cd2-c615e77571c0') Block job (untracked) type BACKUP for drive sda was canceled (vm:5908)

But while the engine reported that the backup was finalized, it actually was not :)

engine=# select * from vm_backups;
backup_id | from_checkpoint_id | to_checkpoint_id | vm_id | phase | _create_date | host_id | description | _update_date | backup_type | snapshot_id | is_stopped
----------+--------------------+------------------+-------+-------+--------------+---------+-------------+--------------+-------------+-------------+-----------
84af32c8-4f00-4af1-a67d-8776a0c8a635 | 773962b2-0b4f-4929-b75d-2d3eceb73f49 | 2e20f16e-3143-4e73-8409-d3bc5e54213a | 3671b003-188a-4d8e-aec2-95dfbef5bda1 | Ready | 2022-07-15 00:59:41.603+02 | | | 2022-07-15 00:59:48.408+02 | hybrid | eaa3e340-7cca-4e47-9e6e-2d346d0f9270 | t
91c5cfbf-bcc1-4c45-9554-f21f21438b06 | e93c5883-592a-4f31-9525-8408ea04a9c5 | 6d9b2540-8c60-4c18-8446-0f0b461b2f68 | 3aba5c17-4830-4a53-8cd2-c615e77571c0 | Finalizing | 2022-07-15 00:59:41.67+02 | 24009aef-82e8-408c-9ef1-baab24688917 | | 2022-07-15 13:35:27.288+02 | live | | f
007e2e73-bd36-48f1-9aa9-fbdd0be9b86b | 4ed7c5be-661f-48b5-aa55-7caa03d4f149 | f4426ce5-f436-4ecc-9020-f06396cfc8b9 | fdce1412-6a01-4a3d-90f0-3e2bedc0dd7b | Finalizing | 2022-07-15 00:59:41.78+02 | 62844ed0-9409-4c8b-ba5a-6d51f58855cd | | 2022-07-15 13:48:56.968+02 | live | | f
03bc4ca1-6838-43b8-a24d-b9d27790acff | 5f5fcb28-cf92-4eb6-9c4c-c6016023d747 | c4d8cb84-4feb-4367-92f6-ac95fd091f9b | a2926e08-d19b-4e3a-98fa-cc1ee241a037 | Ready | 2022-07-15 01:00:08.859+02 | | | 2022-07-15 01:00:22.23+02 | hybrid | 6510c5a2-6fe3-4bdb-b1f6-82e82d865f71 | t
587f51a4-be7b-4513-97d5-071d9f849870 | 09cf0b73-44bd-4d84-b2e5-60ba3dcafe06 | 6049dca0-a7c0-488a-ae38-530728aedb5c | f237c8ad-74cc-460f-b3c6-8541ba099aa6 | Ready | 2022-07-15 00:59:41.348+02 | | | 2022-07-15 00:59:49.497+02 | hybrid | 00899b28-2656-4373-97ad-7b17a78e01df | t
439f3850-91ec-49ec-ad78-3db59895ef6d | 05c35317-e3af-4c54-ac8a-cfcbb7d0d720 | 5402676e-62fe-4770-bc6e-5c093ef93616 | 4fa83fcd-09d8-4d87-b9d6-70180ab7269b | Ready | 2022-07-15 00:59:53.544+02 | | | 2022-07-15 01:00:14.439+02 | hybrid | 3d331aa3-d176-42fb-8ac0-f9b751f2c85e | t
8efd4eb6-30a5-44d4-8d16-7373ad1f925d | 250f9948-04a4-4598-adc5-b72649a6581f | e0201e3d-62f3-4d1f-972e-1b776290190e | c7e92d82-5106-4bed-b78b-83c3eb92b299 | Ready | 2022-07-15 00:59:54.903+02 | | | 2022-07-15 01:00:17.812+02 | hybrid | f4e79af5-e83d-4433-bcce-906b9b2f5ae7 | t
(7 rows)

We've tried to finalize all of those backups. Every finalize call succeeded, but the backups were never really finalized: neither their snapshots (for the hybrid backups) nor their scratch disks (for the live backups) were cleaned up.

While investigating the code, it seems to me that this is caused by the fact that the StartVmBackupCommand only exists in memory and is lost during an ovirt-engine restart. So while StopVmBackupCommand sets the phase to Finalizing, the cleanup, which resides in StartVmBackupCommand, is never performed.
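For anyone hitting the same situation, the affected backups can be listed directly from the engine database. A minimal sketch; 'Ready' and 'Finalizing' are the phase strings visible in the output above, and other in-progress phases may exist depending on the engine version:

engine=# SELECT backup_id, vm_id, backup_type, phase, snapshot_id, is_stopped, _update_date
         FROM vm_backups
         WHERE phase IN ('Ready', 'Finalizing')
         ORDER BY _create_date;

For the hybrid backups this shows the snapshot_id that was never removed, and for the live backups the phase stuck at Finalizing with is_stopped = f.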
The documentation text flag should only be set after the 'Doc Text' field is provided. Please provide the documentation text and set the flag to '?' again.
Can you share the full log, including the entire engine startup sequence? The commands should be persisted in the command_entities table and reloaded if they were still active when ovirt-engine starts. Can you share the content of this table?
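A plain dump of that table from the engine database should be enough, something like the following (assuming the default 'engine' database name):

engine=# SELECT COUNT(*) FROM command_entities;
engine=# SELECT * FROM command_entities;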
After the discussion over email, this likely happened after an upgrade: the relevant command_entities entries were cleared, probably because more than 50 hours (the default AsyncTaskZombieTaskLifeInMinutes) had passed. Once the engine was restarted, no command was created to monitor the backup, so the /finalize calls were effectively ignored. To improve the situation, we can use the finished-backup monitoring thread to also check for unfinished backups that are not being monitored and fail them automatically.
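For reference, the currently configured lifetime can be read from the engine database as well. A small sketch; the vdc_options column names are assumed from a 4.x engine schema, and the 50 hours mentioned above corresponds to 3000 minutes:

engine=# SELECT option_name, option_value, version
         FROM vdc_options
         WHERE option_name = 'AsyncTaskZombieTaskLifeInMinutes';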
Took some time to further debug this. It does indeed seem to be related to the upgrade and running the engine-setup command. What happens is the following:

1. engine-setup is executed.
2. One of its steps is 'Cleaning stale zombie tasks and commands', which executes ('/usr/share/ovirt-engine/setup/dbutils/taskcleaner.sh', '-l', '/var/log/ovirt-engine/setup/ovirt-engine-setup-20220816110556-t4cu7b.log', '-q', '-R', '-z', '-A').
3. That command runs: SELECT DeleteAllJobs(); SELECT DeleteAllEntitySnapshot(); SELECT DeleteAsyncTasksZombies();
4. DeleteAllEntitySnapshot() does the following: DELETE FROM business_entity_snapshot; and also PERFORM DeleteAllCommandEntities();
5. PERFORM DeleteAllCommandEntities() -> DELETE FROM command_entities; (without any WHERE clause or lifetime check)
6. The command_entities table is now empty.

If you then restart ovirt-engine, you lose all the context of the backup and end up with dangling backup snapshots etc.

Multiple solutions I can think of:
1. Warn the user to have no backups running while running engine-setup (see the sketch after this list for a quick pre-check).
2. Make sure engine-setup does not remove backup-related entries.
3. StartVmBackupCommand should only handle the creation of the backup. All(?) data of the backup is saved in the vm_backups table. The removal of the backup should be completely handled by StopVmBackupCommand (and not partly by StartVmBackupCommand).
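The pre-check referenced in solution 1 could be as simple as the query below (the phase strings are taken from the data earlier in this report):

engine=# SELECT COUNT(*) AS running_backups
         FROM vm_backups
         WHERE phase IN ('Ready', 'Finalizing');

If this returns anything other than 0, finalize or stop those backups before running engine-setup, because the cleanup in step 5 would otherwise drop the command context they depend on.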
(In reply to Jean-Louis Dupond from comment #5)
> Multiple solutions I can think of:
> 1. Warn the user to have no backups running while running engine-setup
> 2. Make sure engine-setup does not remove backup-related entries
> 3. StartVmBackupCommand should only handle the creation of the backup. All(?) data of the backup is saved in the vm_backups table. The removal of the backup should be completely handled by StopVmBackupCommand (and not partly by StartVmBackupCommand)

Thanks for the follow-up, @jean-louis! Yes, that is sort of what we had in mind as well, and we are thinking about going forward with a variation of #3. Changing StartVmBackupCommand so that it no longer monitors the execution of the backup and finalizes it would be non-trivial, but making sure that we don't leave behind backups that are not monitored by an instance of StartVmBackupCommand is simpler and should cover this case (as a result, the backup and its associated entities, e.g. the VM snapshot, would be removed).
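To make that invariant concrete: the condition to detect is roughly "a vm_backups row in a non-terminal phase that no active command is tracking". Sketched as SQL purely for illustration; how the backup id appears inside command_entities.command_parameters and the 'ACTIVE' status string are assumptions, not verified against the schema, and the real fix would live in the engine's monitoring code rather than in SQL:

engine=# SELECT b.backup_id, b.phase
         FROM vm_backups b
         WHERE b.phase IN ('Ready', 'Finalizing')
           AND NOT EXISTS (
                 SELECT 1
                 FROM command_entities c
                 WHERE c.command_parameters LIKE '%' || b.backup_id::text || '%'  -- assumed serialization
                   AND c.status = 'ACTIVE'                                        -- assumed status value
           );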
This bug has low overall severity and is not going to be further verified by QE. If you believe special care is required, feel free to properly align the relevant severity, flags and keywords to raise the PM_Score, or use one of the bumps ('PrioBumpField', 'PrioBumpGSS', 'PrioBumpPM', 'PrioBumpQA') in Keywords to raise its PM_Score above the verification threshold (1000).
This bug has low overall severity, passed an automated regression suite, and is not going to be further verified by QE. If you believe special care is required, feel free to re-open it to ON_QA status.