Bug 1949470

Summary: Limit the number of pivot retries
Product: [oVirt] vdsm
Reporter: Nir Soffer <nsoffer>
Component: General
Assignee: Nir Soffer <nsoffer>
Status: CLOSED DEFERRED
QA Contact: Evelina Shames <eshames>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.40.60.3
CC: ahadas, bugs, eshames, sfishbai
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Flags: pm-rhel: ovirt-4.5?
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-23 11:09:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Nir Soffer 2021-04-14 11:13:57 UTC
Description of problem:

In bug 1857347 we tried to fix a case where a libvirt block commit job failed
with an unrecoverable error. Unfortunately the fix was not correct and made
the situation even worse (bug 1945675). The fix was reverted, and vdsm now
retries the pivot after unexpected errors.

Retrying proved very useful for mitigating temporary errors, for example
bug 1945635, where the libvirt block job flips between the "ready" and
"standby" states. Testing shows that in all cases the pivot succeeded on the
second retry.

However, if the libvirt error is not temporary, retrying will not help and
the operation will never complete. In this case vdsm needs to abort the
current libvirt block job and fail the merge operation.

We don't have a way to detect an unrecoverable error in libvirt. The error
is typically caused by a bug in qemu or libvirt, so libvirt reports an
internal error for all unexpected cases.

The only way to tell if the error is recoverable is to retry the operation,
and fail after several retries.

I think the best way to fix this is:

- Keep the cleanup method in the job (e.g. "pivot", "abort")
- Keep the number of pivot attempts in the job (like extend attempts)
- When pivot fails, increase the pivot attempt counter.
- When starting cleanup, if the pivot attempt counter exceeds the maximum
  value, change the job cleanup method to "abort". From this point, the job
  will try to abort the libvirt block job without the pivot flag.
- There is no limit on the number of abort attempts; we must not
  leave the libvirt block job running.

Expected flow, starting at the point we start the cleanup, assuming
maximum 3 pivot attempts (the actual number of retries may need to be
larger):

00:00 try to pivot, fail: wait for next update
00:15 try to pivot, fail: wait for next update
00:30 try to pivot, fail: switch job to cleanup="abort"
00:45 try to abort, fail: wait for next update
01:00 try to abort, fail: wait for next update
01:15 try to abort, success: untrack job
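
The proposal and the flow above can be sketched in Python as follows. This
is only an illustration under stated assumptions, not vdsm code: Job,
BlockJobError, try_cleanup, fail_times, simulate, MAX_PIVOT_ATTEMPTS and
UPDATE_INTERVAL are all hypothetical names, and the pivot and abort
callables stand in for the real libvirt calls. As in the flow above, the
job switches to cleanup="abort" as soon as the last allowed pivot attempt
fails.

```python
from dataclasses import dataclass

MAX_PIVOT_ATTEMPTS = 3   # hypothetical limit, taken from the example flow
UPDATE_INTERVAL = 15     # seconds between updates, as in the example flow


class BlockJobError(Exception):
    """Hypothetical stand-in for an unexpected libvirt error."""


@dataclass
class Job:
    cleanup: str = "pivot"    # current cleanup method: "pivot" or "abort"
    pivot_attempts: int = 0   # number of failed pivot attempts so far
    tracked: bool = True      # False once the job is untracked


def try_cleanup(job, pivot, abort):
    """Run one cleanup attempt and describe what happened."""
    if job.cleanup == "pivot":
        try:
            pivot()
        except BlockJobError:
            job.pivot_attempts += 1
            if job.pivot_attempts >= MAX_PIVOT_ATTEMPTS:
                # Give up on pivot; later attempts abort without pivot.
                job.cleanup = "abort"
                return 'try to pivot, fail: switch job to cleanup="abort"'
            return "try to pivot, fail: wait for next update"
        job.tracked = False
        return "try to pivot, success: untrack job"

    # Abort attempts are not limited: the block job must not keep running.
    try:
        abort()
    except BlockJobError:
        return "try to abort, fail: wait for next update"
    job.tracked = False
    return "try to abort, success: untrack job"


def fail_times(n):
    """Return a callable that raises BlockJobError n times, then succeeds."""
    remaining = [n]

    def call():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise BlockJobError("internal error")

    return call


def simulate(pivot, abort):
    """Run cleanup attempts every UPDATE_INTERVAL until the job is untracked."""
    job = Job()
    log = []
    now = 0
    while job.tracked:
        result = try_cleanup(job, pivot, abort)
        log.append(f"{now // 60:02d}:{now % 60:02d} {result}")
        now += UPDATE_INTERVAL
    return log


if __name__ == "__main__":
    # Pivot keeps failing; abort fails twice and then succeeds.
    for line in simulate(fail_times(1000), fail_times(2)):
        print(line)
```

With a pivot that keeps failing and an abort that fails twice before
succeeding, the simulation prints the same six steps as the expected flow.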

Comment 3 Eyal Shenitzky 2021-10-21 11:03:23 UTC
This fix can solve a theoretical loop of retries caused by a consistent failure to perform the pivot operation on the libvirt side.
Therefore, severity is lowered to medium.

Comment 5 Arik 2022-01-23 15:39:13 UTC
Nir, you suggested closing this one, right? If so, since you were the one who filed it, please explain why and close it.

Comment 6 Nir Soffer 2022-03-25 10:04:03 UTC
It would be better if we limited the number of retries, but the only case
where it can help is a libvirt bug, and in this case trying to stop the
pivot attempt and abort the merge may also fail.

I think we need to reconsider this for 4.5.1, since live merge is now
becoming a daily operation with hybrid incremental backup.

Comment 7 Arik 2022-05-23 11:09:37 UTC
As we have no known issue that requires this at this point, it's more of an RFE that we can do to be more robust to failures during the pivot phase due to platform issues,
so I'm closing this one in favor of: https://github.com/oVirt/vdsm/issues/197