Bug 1900546 - [CBT][incremental backup] Engine reports that backup was finalized when stopping backup failed
Summary: [CBT][incremental backup] Engine reports that backup was finalized when stopp...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.4.4.1
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ovirt-4.4.4
: 4.4.4.4
Assignee: Eyal Shenitzky
QA Contact: Ilan Zuckerman
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-11-23 11:00 UTC by Nir Soffer
Modified: 2020-12-21 12:35 UTC (History)
3 users (show)

Fixed In Version: ovirt-engine-4.4.4.4
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-21 12:35:53 UTC
oVirt Team: Storage
Embargoed:


Attachments (Terms of Use)
logs from 3 backups showing this issue (73.83 KB, application/gzip)
2020-11-23 11:00 UTC, Nir Soffer
no flags Details
failed attempt to verify - dumpxml (6.58 KB, application/zip)
2020-12-15 14:23 UTC, Ilan Zuckerman
no flags Details
engine + vdsm + dumpxml for the second attempt to verify (144.25 KB, application/zip)
2020-12-16 14:48 UTC, Ilan Zuckerman
no flags Details
verification logs engine + vdsm (359.86 KB, application/zip)
2020-12-17 09:50 UTC, Ilan Zuckerman
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 112548 0 master MERGED core: set StopVmBackupCommand succeeded only if backup stopped 2020-12-21 07:27:29 UTC

Description Nir Soffer 2020-11-23 11:00:09 UTC
Created attachment 1732443 [details]
logs from 3 backups showing this issue

Description of problem:

When powering of a vm from the guest during backup, qemu hangs and stopping
the backup fails in vdsm (see bug 1892672). Engine reports that the backup
was finalized while the backup was not finalized, and leftover scratch disks
are left on the host running the backup.

Here are logs from a run showing this issue:

1. Starting backup

2020-11-22 17:54:20,367+02 INFO  [org.ovirt.engine.core.bll.StartVmBackupCommand] (default task-5) [33bd8b12-d604-4882-a193-4beecf1cf41e] Created VmBackup entity 'b7de98d7-914c-4c24-a9dd-a3202505e511'

2. Starting transfer for this backup

2020-11-22 17:54:21,714+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default task-5) [353218d4-d310-4c35-ba70-5dcea5c5d99a] Successfully added Download disk 'fedora-32.raw' (disk id: '126eea31-c5a2-4c01-a18d-9822b0c05c2a', image id: 'c7c8040e-597a-402f-b371-3f74510b6a27') for image transfer command 'f0754c8a-7b66-4ac9-8b77-7a5857588e71'
2020-11-22 17:54:21,777+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.AddImageTicketVDSCommand] (default task-5) [353218d4-d310-4c35-ba70-5dcea5c5d99a] START, AddImageTicketVDSCommand(HostName = host4, AddImageTicketVDSCommandParameters:{hostId='9de3098d-89e8-444c-95ed-0fc2ea74e170', ticketId='1bec74cf-aa0c-45b9-aec0-e22bd56bcbcb', timeout='300', operations='[read]', size='6442450944', url='nbd:unix:/run/vdsm/backup/b7de98d7-914c-4c24-a9dd-a3202505e511:exportname=sda', filename='null', sparse='true', transferId='f0754c8a-7b66-4ac9-8b77-7a5857588e71', dirty='false'}), log id: 6960e215

3. Transfer finalized

2020-11-22 17:54:28,859+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledE
xecutorService-engineScheduledThreadPool-Thread-86) [353218d4-d310-4c35-ba70-5dcea5c5d99a] Finalizing successful transfer for 
Download disk 'fedora-32.raw' (disk id: '126eea31-c5a2-4c01-a18d-9822b0c05c2a', image id: 'c7c8040e-597a-402f-b371-3f74510b6a2
7')

4. Stopping backup

2020-11-22 17:54:29,950+02 INFO  [org.ovirt.engine.core.bll.StopVmBackupCommand] (default task-5) [a8f31e56-1a74-418c-9131-ee708e368f06] Stopping VmBackup 'b7de98d7-914c-4c24-a9dd-a3202505e511'

5. Stopping backup failed in vdsm

2020-11-22 17:54:59,962+02 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.StopVmBackupVDSCommand] (default task-5) [a8f31e56-1a74-418c-9131-ee708e368f06] Unexpected return value: Status [code=1600, message=Backup Error: {'vm_id': '4dc3bb16-f8d1-4f59-9388-a93f68da7cf0', 'backup_id': 'b7de98d7-914c-4c24-a9dd-a3202505e511', 'reason': 'Failed to end VM backup: Timed out during operation: cannot acquire state change lock (held by monitor=qemu-event)'}]

6. Engine report finalized event with missing backup id

2020-11-22 17:55:00,042+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-5) [a8f3
1e56-1a74-418c-9131-ee708e368f06] EVENT_ID: VM_BACKUP_FINALIZED(10,794), Backup <UNKNOWN> for VM backup-raw finalized (User: a
dmin@internal-authz).


Client accessing this backup is hang, waiting for the backup to change
state from "ready" to "finalizing":

diff --git a/sdk/examples/backup_vm.py b/sdk/examples/backup_vm.py
index fd40154..f871cb8 100755
--- a/sdk/examples/backup_vm.py
+++ b/sdk/examples/backup_vm.py
@@ -329,6 +329,7 @@ def stop_backup(connection, backup_uuid, args):
     # TODO: Wait until backup is finalized.
     backup = backup_service.get()
     while backup.phase != types.BackupPhase.FINALIZING:
+        print("backup.phase = %r, waiting" % backup.phase)
         time.sleep(1)
         backup = backup_service.get()

$ ./backup_vm.py -c engine-dev stop 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0 5c4407d2-f0ba-4381-968f-6077c2787d0b
[   0.0 ] Finalizing backup 5c4407d2-f0ba-4381-968f-6077c2787d0b
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting
backup.phase = <BackupPhase.READY: 'ready'>, waiting

Clients using events to detect backup completion will consider this backup
as finalized, while the backup was not.


Version-Release number of selected component (if applicable):
4.4.4.2_master

How reproducible:
37% (3 of 8 attempts)

Steps to Reproduce:
1. Start full backup with backup_vm.py
2. Wait until download starts
3. In the guest, poweroff
4. qemu hangs (see bug 1900505)
5. backup_vm.py detect failure in the transfer and finalize the transfer
6. backup_vm.py try to finalize the backup

Actual results:
Engine reports that backup was finalized
VM_BACKUP_FINALIZED(10,794)

Expected results:
Engine reports that backup failed to finalize VM_BACKUP_FAILED_TO_FINALIZE(10,795)

Additional info:
Engine cannot finalize the backup until the VM is destroyed, since qemu is
hang, and libvirt cannot stop the backup while qemu is hang, see:
https://bugzilla.redhat.com/show_bug.cgi?id=1900505#c4

Comment 1 Eyal Shenitzky 2020-12-08 12:37:01 UTC
According to the logs, the engine failed to finalize the backup as expected.

2020-11-22 17:55:00,030+02 INFO  [org.ovirt.engine.core.bll.StopVmBackupCommand] (default task-5) [a8f31e56-1a74-418c-9131-ee708e368f06] Failed to stop VmBackup 'b7de98d7-914c-4c24-a9dd-a3202505e511'

The problem here is that the audit log reports that the backup was finalized instead of failed to finalize.

2020-11-22 17:55:00,042+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-5) [a8f31e56-1a74-418c-9131-ee708e368f06] EVENT_ID: VM_BACKUP_FINALIZED(10,794), Backup <UNKNOWN> for VM backup-raw finalized (User: admin@internal-authz).

The fix will be to propagate the VM_BACKUP_FAILED_TO_FINALIZE audit log instead of the VM_BACKUP_FINALIZED.

Comment 3 Ilan Zuckerman 2020-12-15 14:23:49 UTC
Created attachment 1739343 [details]
failed attempt to verify - dumpxml

Comment 6 Ilan Zuckerman 2020-12-16 14:48:19 UTC
Created attachment 1739662 [details]
engine + vdsm + dumpxml for the second attempt to verify

Comment 8 Ilan Zuckerman 2020-12-17 09:48:43 UTC
Verified on:

ovirt-engine-4.4.4.4-0.9.el8ev.noarch
rhv-release-4.4.4-5-001.noarch

0. Edit the /usr/lib/python3.6/site-packages/vdsm/virt/backup.py  to raise the exception
1. Create a VM with a disk that enabled for incremental backup.
2. Start the VM.
3. Start a backup for the VM 

[root@storage-ge5-vdsm3 examples]# python3 backup_vm.py -c engine start 89502d53-710a-4cd2-85ab-35a934f5f306
[   0.0 ] Starting full backup for VM '89502d53-710a-4cd2-85ab-35a934f5f306'
[   1.9 ] Waiting until backup 21c37d1d-1cb8-4aef-80f9-2ca0c7ca4c5b is ready
[   2.9 ] Created checkpoint '7d2debbb-c320-475e-96bc-676dfb390f86' (to use in --from-checkpoint-uuid for the next incremental backup)
[   3.0 ] Backup 21c37d1d-1cb8-4aef-80f9-2ca0c7ca4c5b is ready


5. Stop the VM backup

[root@storage-ge5-vdsm3 examples]# python3 backup_vm.py -c engine stop 89502d53-710a-4cd2-85ab-35a934f5f306 21c37d1d-1cb8-4aef-80f9-2ca0c7ca4c5b
[   0.0 ] Finalizing backup 21c37d1d-1cb8-4aef-80f9-2ca0c7ca4c5b
Traceback (most recent call last):
  File "backup_vm.py", line 429, in <module>
    main()
  File "backup_vm.py", line 161, in main
    args.command(args)
  File "backup_vm.py", line 245, in cmd_stop
    stop_backup(connection, args.backup_uuid, args)
  File "backup_vm.py", line 327, in stop_backup
    backup_service.finalize()
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/services.py", line 33306, in finalize
    return self._internal_action(action, 'finalize', None, headers, query, wait)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 299, in _internal_action
    return future.wait() if wait else future
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 296, in callback
    self._check_fault(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[]". HTTP response code is 400.


Expected results - 
Stop backup should fail and audit logs should include the following message - VM_BACKUP_FAILED_TO_FINALIZE

Actual result:
Stop backup should fail
engine log include message VM_BACKUP_FAILED_TO_FINALIZE

2020-12-17 11:09:45,184+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-11) [feb0506c-9156-42a6-ba16-d9214d63b39d] EVENT_ID: VM_BACKUP_FAILED_TO_FINALIZE(10,795), Backup <UNKNOWN> for VM test failed to finalize (User: admin@internal-authz).


Attaching engine + vdsm logs

Comment 9 Ilan Zuckerman 2020-12-17 09:50:05 UTC
Created attachment 1739931 [details]
verification logs engine + vdsm

Comment 10 Sandro Bonazzola 2020-12-21 12:35:53 UTC
This bugzilla is included in oVirt 4.4.4 release, published on December 21st 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.