Bug 1851531 - BMO can get into hot reconcile loop when changing Status
Summary: BMO can get into hot reconcile loop when changing Status
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: Zane Bitter
QA Contact: Raviv Bar-Tal
URL:
Whiteboard:
Depends On: 1851530
Blocks: 1851532
Reported: 2020-06-26 20:35 UTC by Zane Bitter
Modified: 2020-09-08 10:54 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The controller for BareMetalHost objects mirrored status data, including a timestamp of the latest status update, to an annotation (which was not needed by OpenShift). This could result in the BareMetalHost entering a state of continuous flux.
Consequence: Affected BareMetalHosts would be subject to longer and longer back-offs between reconciliations, to prevent the controller from overwhelming the Kubernetes API.
Fix: The annotation causing the problem is no longer written.
Result: BareMetalHost objects are reconciled as scheduled.
Clone Of: 1851530
Clones: 1851532
Environment:
Last Closed: 2020-09-08 10:54:03 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift baremetal-operator pull 84 | 0 | None | closed | Bug 1851531: Do not write status annotation | 2021-02-01 09:06:03 UTC
Red Hat Product Errata | RHBA-2020:3510 | 0 | None | None | None | 2020-09-08 10:54:25 UTC

Description Zane Bitter 2020-06-26 20:35:56 UTC
+++ This bug was initially created as a clone of Bug #1851530 +++

Description of problem:
As described in: https://github.com/metal3-io/baremetal-operator/pull/565

The code that writes the 'status' annotation (an annotation containing the Status data) whenever the status changes can cause an infinite hot loop. Since the annotation and the Status subresource cannot be written at the same time, we re-read the object after writing the annotation and before trying to write the Status. However, if we get a previously cached version then there will be an error and we'll begin the Reconcile cycle again. The Status changes generated by this new Reconcile may contain different timestamps, which will result in the annotation being updated and the whole cycle repeating. Rate limiting means that once this has happened, the timestamps only get further and further apart, so the loop is self-sustaining.

We don't actually need or want to write a status annotation. We want to be able to *read* one, and we backported the code to do so to both 4.5 (bug 1835457) and 4.4 (bug 1843230). However, the code to both read and create the annotation was in the same patch, so we ended up with both.
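
The loop described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the baremetal-operator code: FakeCluster, reconcile and run are hypothetical names, the error is modeled as a Kubernetes-style resourceVersion conflict, every Reconcile is assumed to produce a Status with a fresh timestamp, and a doubling delay stands in for the rate limiter.

```python
import copy


class Conflict(Exception):
    """Raised when a write carries a stale resourceVersion."""


class FakeCluster:
    """Hypothetical stand-in for the API server plus a lagging read cache."""

    def __init__(self):
        self.server = {"resourceVersion": 1, "annotation": None, "status": None}
        self.cache = copy.deepcopy(self.server)

    def _write(self, key, value, resource_version):
        # Optimistic concurrency: reject writes based on an old version.
        if resource_version != self.server["resourceVersion"]:
            raise Conflict(key)
        self.server[key] = value
        self.server["resourceVersion"] += 1

    def write_annotation(self, value, resource_version):
        self._write("annotation", value, resource_version)

    def write_status(self, value, resource_version):
        self._write("status", value, resource_version)

    def read_cached(self):
        # Reads come from the cache, which may lag behind the server.
        return copy.deepcopy(self.cache)

    def sync_cache(self):
        self.cache = copy.deepcopy(self.server)


def reconcile(cluster, now, write_annotation):
    """Return True if the Status was written, False if a requeue is needed."""
    host = cluster.read_cached()
    # Assumption: each Reconcile yields a Status with a new timestamp.
    new_status = {"state": "ready", "lastUpdated": now}
    if write_annotation and host["annotation"] != new_status:
        cluster.write_annotation(new_status, host["resourceVersion"])
        # Re-read before writing the Status; the cache is still stale here.
        host = cluster.read_cached()
    try:
        cluster.write_status(new_status, host["resourceVersion"])
    except Conflict:
        return False
    return True


def run(write_annotation, max_attempts=5):
    """Return the attempt number that succeeded, or None if none did."""
    cluster = FakeCluster()
    now, delay = 0, 1
    for attempt in range(1, max_attempts + 1):
        if reconcile(cluster, now, write_annotation):
            return attempt
        cluster.sync_cache()  # the cache catches up before the retry
        now += delay          # back-off moves the next timestamp further out
        delay *= 2
    return None


if __name__ == "__main__":
    print("with annotation write:", run(write_annotation=True))
    print("without annotation write:", run(write_annotation=False))
```

In this model the run that writes the annotation never succeeds within the attempt limit, because each retry carries a new timestamp, rewrites the annotation, and then hits the stale cache again. The run that skips the annotation write succeeds on the first attempt, which corresponds to the fix of no longer writing the annotation.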

Comment 4 errata-xmlrpc 2020-09-08 10:54:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.8 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3510

