Bug 1851530

Summary: BMO can get into hot reconcile loop when changing Status
Product: OpenShift Container Platform
Component: Bare Metal Hardware Provisioning
Sub Component: baremetal-operator
Version: 4.6
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Keywords: Triaged
Reporter: Zane Bitter <zbitter>
Assignee: Zane Bitter <zbitter>
QA Contact: Lubov <lshilin>
CC: lshilin, rbartal, stbenjam
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Clone Of:
: 1851531
Last Closed: 2020-10-27 16:09:46 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 1851531    

Description Zane Bitter 2020-06-26 20:34:49 UTC
Description of problem:
As described in: https://github.com/metal3-io/baremetal-operator/pull/565

The code that writes the 'status' annotation (an annotation containing the Status data) whenever the status changes can cause an infinite hot loop. Since the annotation and the Status subresource cannot be written at the same time, we re-read the object after writing the annotation and before trying to write the Status. However, if the re-read returns a previously cached version, there will be an error and we'll begin the Reconcile cycle again. The Status changes generated by this new Reconcile may contain different timestamps, which will result in the annotation being updated and the whole cycle repeating. Rate limiting only ensures that, once this has happened, the timestamps get further and further apart, so the loop is self-sustaining.
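
The cycle described above can be reproduced in a toy model. The sketch below is not the operator's code: `FakeCluster`, `reconcile`, `count_failed_reconciles`, the annotation key and the status fields are all hypothetical names invented for illustration. It assumes that writes are rejected when they carry an out-of-date resource version, and that the client cache only catches up between Reconcile calls.

```python
import copy
import json
from dataclasses import dataclass, field

STATUS_ANNOTATION = "example.com/status"  # hypothetical key, for illustration only


class Conflict(Exception):
    """A write carried an out-of-date resource version."""


@dataclass
class Host:
    resource_version: int = 1
    annotations: dict = field(default_factory=dict)
    status: dict = field(default_factory=dict)


class FakeCluster:
    """Toy API server with a client cache that only catches up between reconciles."""

    def __init__(self):
        self.server = Host()
        self.cache = copy.deepcopy(self.server)

    def sync_cache(self):
        self.cache = copy.deepcopy(self.server)

    def read_cached(self):
        return copy.deepcopy(self.cache)

    def _check_version(self, host):
        if host.resource_version != self.server.resource_version:
            raise Conflict("object has been modified")

    def update_annotations(self, host):
        self._check_version(host)
        self.server.annotations = dict(host.annotations)
        self.server.resource_version += 1

    def update_status(self, host):
        self._check_version(host)
        self.server.status = dict(host.status)
        self.server.resource_version += 1


def reconcile(cluster, now, write_annotation):
    host = cluster.read_cached()
    new_status = {"state": "ready", "lastUpdated": now}
    if write_annotation:
        encoded = json.dumps(new_status, sort_keys=True)
        if host.annotations.get(STATUS_ANNOTATION) != encoded:
            host.annotations[STATUS_ANNOTATION] = encoded
            cluster.update_annotations(host)
            # Re-read before the Status write; the cache has not caught up yet.
            host = cluster.read_cached()
    host.status = new_status
    cluster.update_status(host)  # raises Conflict if host is stale


def count_failed_reconciles(write_annotation, max_attempts=5):
    cluster = FakeCluster()
    now, backoff, failures = 0.0, 1.0, 0
    for _ in range(max_attempts):
        cluster.sync_cache()  # the cache catches up between reconciles
        try:
            reconcile(cluster, now, write_annotation)
            break
        except Conflict:
            failures += 1
            now += backoff  # rate limiting spreads the timestamps further apart
            backoff *= 2
    return failures
```

In this model, `count_failed_reconciles(True)` fails on every attempt until the cap is reached, because each retry carries a new timestamp and therefore rewrites the annotation. `count_failed_reconciles(False)` succeeds on the first attempt, which matches the reasoning for simply not writing the annotation.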

We don't actually need or want to write a status annotation. We want to be able to *read* one, and we backported the code to do so to both 4.5 (bug 1835457) and 4.4 (bug 1843230). However, the code to both read and create the annotation was in the same patch, so we ended up with both.
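
For contrast, the read-only path can be sketched in a few lines. This is an illustration only: the annotation key, the JSON encoding and the function name `status_from_annotation` are assumptions, not taken from the operator's source. The point is that reading an annotation never writes to the object, so it cannot trigger a new Reconcile by itself.

```python
import json

STATUS_ANNOTATION = "example.com/status"  # hypothetical key, for illustration only


def status_from_annotation(annotations):
    """Return the status stored in the annotation, or None if absent or unusable."""
    raw = annotations.get(STATUS_ANNOTATION)
    if not raw:
        return None
    try:
        status = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Only a JSON object is accepted as status data in this sketch.
    return status if isinstance(status, dict) else None
```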

Comment 1 Zane Bitter 2020-07-13 20:11:38 UTC
The offending code was removed from upstream in https://github.com/metal3-io/baremetal-operator/pull/566 and we picked it up in https://github.com/openshift/baremetal-operator/pull/82

Comment 7 Lubov 2020-08-17 14:23:17 UTC
Verified on
Client Version: 4.6.0-0.nightly-2020-08-16-072105
Server Version: 4.6.0-0.nightly-2020-08-16-072105
Kubernetes Version: v1.19.0-rc.2+99cb93a-dirty

Comment 8 Zane Bitter 2020-08-27 21:21:23 UTC
It turns out we picked up the original fix slightly earlier (in https://github.com/openshift/baremetal-operator/pull/82) and a couple of additional improvements to that fix later. I've linked the correct PRs.

Comment 9 Zane Bitter 2020-08-27 21:22:12 UTC
(In reply to Zane Bitter from comment #8)
> (in https://github.com/openshift/baremetal-operator/pull/82)

Grrr, no in https://github.com/openshift/baremetal-operator/pull/81

Comment 11 errata-xmlrpc 2020-10-27 16:09:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196