Description of problem:
In certain error flows (i.e., when an exception occurred) the backup phase does not end as "FAILED", but instead is stuck in "FINALIZING".

Version-Release number of selected component (if applicable):
4.4.7

How reproducible:
Execute full backup multiple times until this happens. It doesn't happen all the time and is pretty hard to reproduce, but the issue is real.

Steps to Reproduce:
1. Execute full backup multiple times until this happens.

Actual results:
Being stuck in the "FINALIZING" phase causes the backup script to be stuck waiting indefinitely; the VM is also stuck in "backup in progress" and can neither be turned off nor have another backup operation executed on it.

Expected results:
The backup should always end in either the "Succeeded" or "Failed" status.
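The infinite waiting described above suggests a client-side mitigation while the engine bug exists: poll for a terminal phase with a deadline instead of waiting forever. A minimal sketch, assuming a hypothetical `get_backup_phase` callable standing in for the real API query (not the actual backup script code):

```python
import time

# Terminal phases per the expected results above.
TERMINAL_PHASES = {"SUCCEEDED", "FAILED"}


def wait_for_backup(get_backup_phase, timeout=300.0, poll_interval=1.0):
    """Poll until the backup reaches a terminal phase.

    Returns the final phase, or raises TimeoutError if the backup is
    stuck in a non-terminal phase (e.g. "FINALIZING") past the deadline.
    """
    deadline = time.monotonic() + timeout
    phase = None
    while time.monotonic() < deadline:
        phase = get_backup_phase()  # stub for the real status query
        if phase in TERMINAL_PHASES:
            return phase
        time.sleep(poll_interval)
    raise TimeoutError(f"backup stuck in non-terminal phase: {phase}")
```

With such a guard, a backup stuck in "FINALIZING" fails the script after the timeout instead of hanging it indefinitely.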
I managed to reproduce the issue in my environment, with many additional logs printed. See the attached log file. Some conclusions:
1) IMHO it doesn't have to be "stress testing", i.e., running backups immediately one after another using some kind of automatic script. It just happens from time to time when an internal command fails (e.g., "AddVolumeBitmapCommand").
2) There are 2 reasons for that:
2a) "StartVmBackupCommand" fails (since "AddVolumeBitmapCommand" fails), and it neither waits for "StopVmBackupCommand" to finish nor updates the backup phase to "FAILED".
2b) "StopVmBackupCommand" runs and changes the backup phase from 'READY' to 'FINALIZING'. But "StartVmBackupCommand" (which was supposed to run after that and handle the 'FINALIZING' phase) has already finished (with failure). Thus the backup stays in this intermediate, non-final phase, and additional attempts to work with the VM fail - can't perform another backup, can't turn off the VM.
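The race between the two commands can be sketched as a toy state machine. This is illustrative Python, not the actual engine code; the function names merely mirror the commands mentioned above:

```python
class Backup:
    """Toy model of a backup entity with a single phase field."""

    def __init__(self):
        self.phase = "READY"


def stop_vm_backup(backup):
    # Mirrors StopVmBackupCommand: begins finalization.
    backup.phase = "FINALIZING"


def start_vm_backup_buggy(backup, add_bitmap_ok):
    # Mirrors the reported behavior: on AddVolumeBitmapCommand failure the
    # command exits without waiting for stop or setting a terminal phase.
    if not add_bitmap_ok:
        return
    backup.phase = "SUCCEEDED"


def start_vm_backup_fixed(backup, add_bitmap_ok):
    # A fixed flow would always drive the backup to a terminal phase.
    if not add_bitmap_ok:
        stop_vm_backup(backup)
        backup.phase = "FAILED"
        return
    backup.phase = "SUCCEEDED"
```

In the buggy sequence, `start_vm_backup_buggy` returns early and a later `stop_vm_backup` leaves the phase at "FINALIZING" forever; the fixed variant always ends in "FAILED" or "SUCCEEDED".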
Hello Pavel, Can you increase the severity of this bug? In some cases it can block backup finalization, so a customer can't start a new backup.
Bumping the priority and severity of this. This is a severe issue since a lot of operations are blocked during backup mode.
Hi Pavel, Do you have any recommendations on how to verify this? Should I run a modified script? Simply executing the full backup script multiple times isn't deterministic enough for verification. Thanks,
(In reply to Pavel Bar from comment #0) > and can't be neither turned off ... It can from within the guest and by specifying the 'force' flag from the API, right? I'd say we should also consider letting the webadmin specify it by default, at least for the powering-off operation; after all, the admin should be able to power off the VM without going through the API. What do you think?
(In reply to Arik from comment #10) > (In reply to Pavel Bar from comment #0) > > and can't be neither turned off ... > > It can from within the guest and by specifying the 'force' flag from the > API, right? > I'd say we should also consider letting the webadmin specifying it by > default, at least for the powering-off operation, after all - the admin > should be able to power off the VM without going through the API, what do > you think? Right, we can shut down / power off the VM using the 'force' option via the API. We also have an open RFE to add that functionality to the UI - bug 1994663.
(In reply to Pavel Bar from comment #0) > Steps to Reproduce: > 1. Execute full backup multiple times until this happens. Does it happen in current master? I think this happens only during cold backup because we did not wait for the add bitmaps jobs, and they were running during the backup. If you have more detailed reproduction steps with current master, please share them here.
As Nir mentioned, the cold flow is probably the more likely scenario for the original bug, but I think it's better to test both cold and live backup flows.

Testing scenarios QE might want to test - for both cold & live backups, perform multiple times:

Positive flows:
1) Full backup (with / without the "--skip-download" option).
2) Start + [optional download] + stop.

Negative flows:
1) Start + [optional download] + multiple stops.
I mean try running 'backup stop' a few times in parallel *after* 'backup start' finished and the backup phase is "READY". 1 of them should succeed, the rest are expected to fail. The backup will end successfully (the DB/API status should be "SUCCEEDED", and the event log should include 1 successful 'backup finalized' message and a failure 'backup finalized' message for each failed 'backup stop').
2) Start + stop (or multiple stops).
The 'backup stop' operation(s) should be executed *before* 'backup start' finished - that is, before the backup reached the "READY" phase. The backup should end with failure (the DB/API status should be "FAILED", and the event log should include error messages for every 'backup stop' that was executed).
3) Full backup + stop (or multiple stops).
Same as above: the expected behavior depends on the 'full backup' phase in which the 'backup stop' operations were executed - before or after the full backup reached the "READY" phase - see the 2 cases above. (In those 2 test cases it was easier to control when the 'backup stop' operations are executed, since the flow is divided into sub-flows, each executed separately.)
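Negative flow 1 (parallel stops) can be sketched with a local stub in place of the real API, assuming the engine rejects a 'backup stop' unless the backup is in the "READY" phase (the `BackupStub` class and `run_parallel_stops` helper below are hypothetical test scaffolding, not SDK calls):

```python
import threading


class BackupStub:
    """Stub backup that accepts exactly one stop from the READY phase."""

    def __init__(self):
        self._lock = threading.Lock()
        self.phase = "READY"

    def stop_backup(self):
        with self._lock:
            if self.phase != "READY":
                raise RuntimeError("backup is not in READY phase")
            self.phase = "FINALIZING"


def run_parallel_stops(backup, n=5):
    """Fire n concurrent stop calls and collect per-call outcomes."""
    results = []
    results_lock = threading.Lock()

    def worker():
        try:
            backup.stop_backup()
            outcome = "ok"
        except RuntimeError:
            outcome = "failed"
        with results_lock:
            results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

A real QE test would replace the stub with actual API calls and additionally assert the final DB/API status and the expected event-log messages.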
Version:
ovirt-engine-4.4.10.4-0.1.el8ev.noarch
vdsm-4.40.100.2-1.el8ev.x86_64

Verification flow:
I ran multiple tests on cold + hot VMs, including the flows pbar mentioned in #c15.

Verification conclusions:
The expected output matched the actual output. The whole flow completed with no errors, and the backup was never stuck in the 'FINALIZING' status.

Bug verified.
This bugzilla is included in the oVirt 4.4.10 release, published on January 18th 2022. Since the problem described in this bug report should be resolved in the oVirt 4.4.10 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.
Added Polarion TC 27935