Bug 1900552 - [CBT][incremental backup] VmBackup.finalize synchronous instead of asynchronous
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.4.4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.1
: ---
Assignee: Mark Kemel
QA Contact: Evelina Shames
URL:
Whiteboard:
Duplicates: 2037277 2039717
Depends On:
Blocks:
Reported: 2020-11-23 11:31 UTC by Nir Soffer
Modified: 2022-08-23 19:39 UTC (History)

Fixed In Version: ovirt-engine-4.5.1.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-06-23 05:54:58 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.5?
eshames: testing_plan_complete+


Attachments
Logs showing backup flow when engine hangs 30 seconds in VM.stop_backup (73.83 KB, application/gzip)
2020-11-23 11:31 UTC, Nir Soffer


Links
System ID Private Priority Status Summary Last Updated
Github oVirt ovirt-engine pull 281 0 None Draft Make VmBackupStop async using flag 2022-04-18 09:29:01 UTC
Github oVirt ovirt-engine pull 367 0 None open core: handle image transfer during backup stop 2022-05-12 06:59:12 UTC

Internal Links: 2037334

Description Nir Soffer 2020-11-23 11:31:09 UTC
Created attachment 1732456
Logs showing backup flow when engine hangs 30 seconds in VM.stop_backup

Description of problem:

Calling VmBackup.finalize() blocks while waiting for vdsm to stop the
backup, but does not fail if the vdsm call fails.

Finalizing a backup may take time in vdsm, for example if libvirt is blocked
on a hung qemu monitor, or if the vdsm request is waiting in the jsonrpc
queue.

Here is an example backup_vm run in which finalizing the backup hangs:

$ ./backup_vm.py -c engine-dev full --backup-dir /var/tmp/backups/raw 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0
[   0.0 ] Starting full backup for VM 4dc3bb16-f8d1-4f59-9388-a93f68da7cf0
[   0.8 ] Waiting until backup 7b4df572-f664-4b2a-9d7e-b4be9b4ed667 is ready
[   1.9 ] Creating image transfer for disk 126eea31-c5a2-4c01-a18d-9822b0c05c2a
[   3.3 ] Image transfer a9c01706-dc77-417b-91c2-d1c8c53a5403 is ready
[  73.17% ] 4.39 GiB, 12.02 seconds, 373.96 MiB/s                              
[  15.3 ] Finalizing image transfer
[  17.3 ] Finalizing backup
[  47.4 ] Waiting until backup is finalized
...

In this case the finalize() call took 30 seconds: engine waited for the vdsm
response, and vdsm was blocked on libvirt until the call timed out after
30 seconds.

Users managing multiple backups do not want to wait for the response. They
want to be able to finalize multiple backups (e.g. when backing up thousands
of VMs during a backup window), and wait for backup completion separately.

It may be useful to provide an optional blocking interface that waits until
a backup is finalized, but in this case the API must fail if the backup fails
to finalize, so it cannot get stuck forever.
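
A minimal sketch of such a bounded wait, done on the client side by polling.
The names wait_for_finalize, FinalizeError and get_phase are hypothetical, and
the "succeeded" and "failed" phase names are assumptions; only "ready" and
"finalizing" are named in this report.

```python
import time


class FinalizeError(Exception):
    """The backup failed to finalize, or did not finish in time."""


def wait_for_finalize(get_phase, timeout=60.0, interval=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll get_phase() until the backup finishes, fails, or times out."""
    deadline = clock() + timeout
    while True:
        phase = get_phase()
        if phase == "succeeded":      # assumed name for the final phase
            return
        if phase == "failed":         # assumed name for the error phase
            raise FinalizeError("backup failed to finalize")
        if clock() >= deadline:
            raise FinalizeError(
                f"backup still in phase {phase!r} after {timeout} seconds")
        sleep(interval)
```

A caller would pass a callable that reads the backup phase from engine. The
clock and sleep parameters exist only so the loop can be tested without
real delays.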

The blocking API creates another issue - calling finalize() does not change
the state of the backup - it remains in "ready" state. If finalizing the
backup fails in vdsm, the backup is still considered "ready", and new 
transfers may be started.

It would be more useful if finalize was async and changed the backup state
to "finalizing" *before* doing anything in the backend, similar to
image transfer.

If the backup fails to finalize, its state can be left as "finalizing", so no
new transfers can be started for this backup.
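
The proposed state handling could look roughly like this toy model. It is
not engine code: the Backup class, its method names, and the "finalized" end
state are made up for illustration, and stop_backup stands in for the vdsm
call.

```python
class Backup:
    """Toy model of the proposed state handling; not engine code."""

    def __init__(self):
        self.phase = "ready"

    def finalize(self):
        # Change state first and return; the backend call happens later.
        if self.phase == "ready":
            self.phase = "finalizing"

    def start_transfer(self):
        if self.phase != "ready":
            raise RuntimeError(
                f"cannot start transfer in phase {self.phase!r}")
        return "transfer"

    def stop_in_backend(self, stop_backup):
        """Run later by the system; stop_backup stands in for the vdsm call."""
        try:
            stop_backup()
        except Exception:
            return False              # stay in "finalizing"
        self.phase = "finalized"      # hypothetical name for the end state
        return True
```

Because the state changes before the backend is touched, a failed or slow
stop never leaves the backup looking "ready".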

Version-Release number of selected component (if applicable):
4.4.4.2_master

How reproducible:
Always

Steps to Reproduce:
1. Start backup
2. Stop backup

Actual results:
Engine tries to stop the backup before returning a response to the API caller.

Expected results:
Engine switches the state to "finalizing" and returns a response to the API caller.
Then it tries to stop the backup.

Additional info:
Changing backup state may break users expecting the current behavior, but
since this feature is still tech preview we can still fix the API.
Once backup is released as fully supported API, we cannot make such API
changes.

Comment 1 Eyal Shenitzky 2021-08-29 08:54:10 UTC
Incremental backup is fully implemented and we shouldn't make any further changes to the API, to preserve backward compatibility.
Closing.

Comment 2 Nir Soffer 2022-03-30 21:15:52 UTC
Reopening since the current behavior is wrong and causes too much trouble.

The current API is implemented in the wrong way, and we can fix it
without affecting users of the API.

How stopping backup should work:

1. The user calls finalize()
2. The system sets a "stopped" flag for the backup
3. The system wakes up the backup command if it is not running
4. The user gets a response
5. The backup command checks the stopped flag in all phases, and cleans up
   as needed depending on the current phase.
6. The user polls the backup phase
7. The system marks the backup as finished when done

If the user invokes finalize() more than once, the system can safely ignore
the request, since the stopped flag is already set.
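
The flow above can be sketched with a small threaded model. This is only an
illustration of the flag idea, not the engine implementation: BackupCommand
and its attributes are hypothetical, and the "initializing" and "starting"
phase names are assumptions.

```python
import threading


class BackupCommand:
    """Toy model of a backup command that honors a stop flag."""

    def __init__(self):
        self.phase = "initializing"
        self.stopped = threading.Event()    # step 2: the "stopped" flag
        self.cleanups = []

    def finalize(self):
        # Steps 1-4: set the flag and return at once. Setting the event
        # also wakes a command waiting in run(). Repeated calls are no-ops.
        self.stopped.set()

    def run(self):
        for phase in ("initializing", "starting", "ready"):
            self.phase = phase
            if phase == "ready":
                self.stopped.wait()         # idle until finalize()
            if self.stopped.is_set():
                break
        # Step 5: clean up according to the phase the command stopped in.
        self.cleanups.append(self.phase)
        self.phase = "finished"             # step 7
```

The caller never waits for cleanup: finalize() only sets the flag, and the
user learns the outcome by polling the phase (step 6).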

Comment 4 Arik 2022-04-20 14:32:55 UTC
*** Bug 2039717 has been marked as a duplicate of this bug. ***

Comment 5 Arik 2022-05-24 15:14:33 UTC
*** Bug 2037277 has been marked as a duplicate of this bug. ***

Comment 6 Avihai 2022-05-26 10:56:27 UTC
Hi Mark,
Please provide the verification scenario for this bug.

Comment 7 Mark Kemel 2022-06-02 11:38:11 UTC
Verification steps:

1. Start a full backup
2. Send backup.finalize request before backup reaches status "Ready"
3. Make sure that the backup finalizes gracefully
4. Start another backup
5. When 'Ready', start image transfer
6. Send backup.finalize request while image transfer is running. Make sure the request fails
7. Cancel transfer/wait for it to finish
8. Send backup.finalize request again, make sure backup is finalized successfully
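
Steps 4-8 can be pictured with a toy model like the one below. It is a
sketch only: the Backup class, StopBackupError, and active_transfers are
hypothetical names, and the real check lives in engine. The error text is
the one engine returned during verification of this bug.

```python
class StopBackupError(Exception):
    """Raised when the backup cannot be stopped yet."""


class Backup:
    """Toy model of the verified behavior; not engine code."""

    def __init__(self):
        self.phase = "ready"
        self.active_transfers = 0

    def finalize(self):
        # Refuse to stop while an image transfer is still running.
        if self.active_transfers:
            raise StopBackupError(
                "Cannot stop VM backup. "
                "There is an active image transfer for VM backup")
        self.phase = "finalizing"
```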

Comment 8 Evelina Shames 2022-06-16 14:20:42 UTC
(In reply to Mark Kemel from comment #7)
> Verification steps:
> 
> 1. Start a full backup
> 2. Send backup.finalize request before backup reaches status "Ready"
> 3. Make sure that the backup finalizes gracefully
> 4. Start another backup
> 5. When 'Ready', start image transfer
> 6. Send backup.finalize request while image transfer is running. Make sure
> the request fails

<fault>
    <detail>[Cannot stop VM backup. There is an active image transfer for VM backup]</detail>
    <reason>Operation Failed</reason>
</fault>

> 7. Cancel transfer/wait for it to finish
> 8. Send backup.finalize request again, make sure backup is finalized
> successfully

<action>
    <status>complete</status>
</action>

The backup operation ended successfully.

Verified with the above steps on:
ovirt-engine-4.5.1.1-0.14.el8ev
vdsm-4.50.1.2-1.el8ev.x86_64

Comment 9 Sandro Bonazzola 2022-06-23 05:54:58 UTC
This bugzilla is included in oVirt 4.5.1 release, published on June 22nd 2022.
Since the problem described in this bug report should be resolved in oVirt 4.5.1 release, it has been closed with a resolution of CURRENT RELEASE.
If the solution does not work for you, please open a new bug report.

