Bug 1851531

Summary: BMO can get into hot reconcile loop when changing Status
Product: OpenShift Container Platform Reporter: Zane Bitter <zbitter>
Component: Bare Metal Hardware ProvisioningAssignee: Zane Bitter <zbitter>
Bare Metal Hardware Provisioning sub component: baremetal-operator QA Contact: Raviv Bar-Tal <rbartal>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: augol, beth.white, shardy
Version: 4.5Keywords: Triaged
Target Milestone: ---   
Target Release: 4.5.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The controller for BareMetalHost objects mirrored status data, including a timestamp of the latest status update, to an annotation (which was not needed by OpenShift). This could result in the BareMetalHost entering a state of continuous flux. Consequence: Affected BareMetalHosts would be subject to longer and longer back-offs between reconciliation to prevent the controller overwhelming the Kubernetes API. Fix: The annotation causing the problem is no longer written. Result: BareMetalHost objects are reconciled as scheduled.
Story Points: ---
Clone Of: 1851530
: 1851532 (view as bug list) Environment:
Last Closed: 2020-09-08 10:54:03 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On: 1851530    
Bug Blocks: 1851532    

Description Zane Bitter 2020-06-26 20:35:56 UTC
+++ This bug was initially created as a clone of Bug #1851530 +++

Description of problem:
As described in: https://github.com/metal3-io/baremetal-operator/pull/565

The code to write the 'status' annotation (an annotation containing the Status data) whenever the status changes can cause an infinite hot loop. Since the annotation and the Status subresource cannot be written at the same time, we re-read the object after writing the annotation and before trying to write the Status. However, if we get a previously cached version then there will be an error and we'll begin the Reconcile cycle again. The new Status changes generated by this new Reconcile may contain different timestamps, which will result in the annotation being updated and the whole cycle repeating. Rate limiting helps to ensure that once this happens once the timestamps only get further and further apart, so the loop is self-sustaining.

We don't actually need or want to write a status annotation. We want to be able to *read* one, and we backported the code to do so to both 4.5 (bug 1835457) and 4.4 (bug 1843230). However, the code to both read and create the annotation was in the same patch, so we ended up with both.

Comment 4 errata-xmlrpc 2020-09-08 10:54:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.8 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3510