Description of problem:
In certain error flows (i.e., when an exception occurred) the backup phase does not end as "FAILED", but instead is stuck in "FINALIZING".

Version-Release number of selected component (if applicable):
4.4.7

How reproducible:
Execute full backup multiple times until this happens. It doesn't happen all the time and is pretty hard to reproduce, but the issue is real.

Steps to Reproduce:
1. Execute full backup multiple times until this happens.

Actual results:
Being stuck in the "FINALIZING" phase causes the backup script to be stuck waiting indefinitely; the VM is also stuck in "backup in progress" and can neither be turned off nor have another backup operation executed on it.

Expected results:
The backup should always end in either the "Succeeded" or "Failed" status.
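The infinite waiting described above suggests a client-side mitigation while the engine bug exists: poll for a terminal phase with a deadline instead of waiting forever. A minimal sketch, assuming a hypothetical `get_backup_phase` callable standing in for the real API query (not the actual backup script code):

```python
import time

# Terminal phases per the expected results above.
TERMINAL_PHASES = {"SUCCEEDED", "FAILED"}


def wait_for_backup(get_backup_phase, timeout=300.0, poll_interval=1.0):
    """Poll until the backup reaches a terminal phase.

    Returns the final phase, or raises TimeoutError if the backup is
    stuck in a non-terminal phase (e.g. "FINALIZING") past the deadline.
    """
    deadline = time.monotonic() + timeout
    phase = None
    while time.monotonic() < deadline:
        phase = get_backup_phase()  # stub for the real status query
        if phase in TERMINAL_PHASES:
            return phase
        time.sleep(poll_interval)
    raise TimeoutError(f"backup stuck in non-terminal phase: {phase}")
```

With such a guard, a backup stuck in "FINALIZING" fails the script after the timeout instead of hanging it indefinitely.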
I managed to reproduce the issue in my environment, with many additional logs printed. See the attached log file. Some conclusions:
1) IMHO it doesn't have to be "stress testing", i.e., running backups immediately one after another using some kind of automatic script. It just happens from time to time when an internal command fails (e.g., "AddVolumeBitmapCommand").
2) There are 2 reasons for that:
2a) "StartVmBackupCommand" fails (since "AddVolumeBitmapCommand" fails), and it neither waits for "StopVmBackupCommand" to finish nor updates the backup phase to "FAILED".
2b) "StopVmBackupCommand" runs and changes the backup phase from 'READY' to 'FINALIZING'. But "StartVmBackupCommand" (which was supposed to run after that and handle the 'FINALIZING' phase) has already finished (with failure). Thus the backup stays in this intermediate, non-final phase, and additional attempts to work with the VM fail - can't perform another backup, can't turn off the VM.
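The race between the two commands can be sketched as a toy state machine. This is illustrative Python, not the actual engine code; the function names merely mirror the commands mentioned above:

```python
class Backup:
    """Toy model of a backup entity with a single phase field."""

    def __init__(self):
        self.phase = "READY"


def stop_vm_backup(backup):
    # Mirrors StopVmBackupCommand: begins finalization.
    backup.phase = "FINALIZING"


def start_vm_backup_buggy(backup, add_bitmap_ok):
    # Mirrors the reported behavior: on AddVolumeBitmapCommand failure the
    # command exits without waiting for stop or setting a terminal phase.
    if not add_bitmap_ok:
        return
    backup.phase = "SUCCEEDED"


def start_vm_backup_fixed(backup, add_bitmap_ok):
    # A fixed flow would always drive the backup to a terminal phase.
    if not add_bitmap_ok:
        stop_vm_backup(backup)
        backup.phase = "FAILED"
        return
    backup.phase = "SUCCEEDED"
```

In the buggy sequence, `start_vm_backup_buggy` returns early and a later `stop_vm_backup` leaves the phase at "FINALIZING" forever; the fixed variant always ends in "FAILED" or "SUCCEEDED".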
Hello Pavel, Can you increase the severity of this bug? In some cases it can block backup finalization, so a customer can't start a new backup.
Bumping the priority and severity of this. This is a severe issue since a lot of operations are blocked during backup mode.
Hi Pavel, Do you have any recommendations on how to verify this? Should I run a modified script? Simply executing the full backup script multiple times isn't deterministic enough for verification. Thanks,
(In reply to Pavel Bar from comment #0) > and can't be neither turned off ... It can from within the guest and by specifying the 'force' flag from the API, right? I'd say we should also consider letting the webadmin specify it by default, at least for the powering-off operation; after all, the admin should be able to power off the VM without going through the API. What do you think?
(In reply to Arik from comment #10) > (In reply to Pavel Bar from comment #0) > > and can't be neither turned off ... > > It can from within the guest and by specifying the 'force' flag from the > API, right? > I'd say we should also consider letting the webadmin specifying it by > default, at least for the powering-off operation, after all - the admin > should be able to power off the VM without going through the API, what do > you think? Right, we can shut down / power off the VM using the 'force' option via the API. We also have an open RFE to add that functionality to the UI - bug 1994663.
(In reply to Pavel Bar from comment #0) > Steps to Reproduce: > 1. Execute full backup multiple times until this happens. Does it happen in current master? I think this happens only during cold backup because we did not wait for the add bitmaps jobs, and they were running during the backup. If you have more detailed reproduction steps with current master, please share them here.
As Nir mentioned, the cold flow is probably the more likely scenario for the original bug, but I think it's better to test both cold and live backup flows.

Testing scenarios QE might want to test - for both cold & live backups, perform multiple times:

Positive flows:
1) Full backup (with / without the "--skip-download" option).
2) Start + [optional download] + stop.

Negative flows:
1) Start + [optional download] + multiple stops.
I mean try running 'backup stop' a few times in parallel *after* 'backup start' finished and the backup phase is "READY". 1 of them should succeed, the rest are expected to fail. The backup will end successfully (the DB/API status should be "SUCCEEDED", and the event log should include 1 successful 'backup finalized' message and a failure 'backup finalized' message for each failed 'backup stop').
2) Start + stop (or multiple stops).
The 'backup stop' operation(s) should be executed *before* 'backup start' finished - that is, before the backup reached the "READY" phase. The backup should end with failure (the DB/API status should be "FAILED", and the event log should include error messages for every 'backup stop' that was executed).
3) Full backup + stop (or multiple stops).
Same as above: the expected behavior depends on the 'full backup' phase in which the 'backup stop' operations were executed - before or after the full backup reached the "READY" phase - see the 2 cases above. (In those 2 test cases it was easier to control when the 'backup stop' operations are executed, since the flow is divided into sub-flows, each executed separately.)
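Negative flow 1 (parallel stops) can be sketched with a local stub in place of the real API, assuming the engine rejects a 'backup stop' unless the backup is in the "READY" phase (the `BackupStub` class and `run_parallel_stops` helper below are hypothetical test scaffolding, not SDK calls):

```python
import threading


class BackupStub:
    """Stub backup that accepts exactly one stop from the READY phase."""

    def __init__(self):
        self._lock = threading.Lock()
        self.phase = "READY"

    def stop_backup(self):
        with self._lock:
            if self.phase != "READY":
                raise RuntimeError("backup is not in READY phase")
            self.phase = "FINALIZING"


def run_parallel_stops(backup, n=5):
    """Fire n concurrent stop calls and collect per-call outcomes."""
    results = []
    results_lock = threading.Lock()

    def worker():
        try:
            backup.stop_backup()
            outcome = "ok"
        except RuntimeError:
            outcome = "failed"
        with results_lock:
            results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

A real QE test would replace the stub with actual API calls and additionally assert the final DB/API status and the expected event-log messages.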
Version:
ovirt-engine-4.4.10.4-0.1.el8ev.noarch
vdsm-4.40.100.2-1.el8ev.x86_64

Verification flow:
I ran multiple tests on cold + hot VMs, including the flows pbar mentioned in #c15.

Verification conclusions:
The expected output matched the actual output. The whole flow completed with no errors, and the backup was never stuck in the 'FINALIZING' status.

Bug verified.
This bugzilla is included in the oVirt 4.4.10 release, published on January 18th 2022. Since the problem described in this bug report should be resolved in the oVirt 4.4.10 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.
Added Polarion TC 27935