Bug 1851530 - BMO can get into hot reconcile loop when changing Status
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Zane Bitter
QA Contact: Lubov
URL:
Whiteboard:
Depends On:
Blocks: 1851531
Reported: 2020-06-26 20:34 UTC by Zane Bitter
Modified: 2020-10-27 16:10 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1851531
Environment:
Last Closed: 2020-10-27 16:09:46 UTC
Target Upstream Version:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift baremetal-operator pull 81 | 0 | None | closed | Merge changes from upstream as of 2020-07-02 | 2020-10-28 12:36:10 UTC
Github | openshift baremetal-operator pull 83 | 0 | None | closed | Merge upstream 20200713 103725 | 2020-10-28 12:35:55 UTC
Github | openshift baremetal-operator pull 86 | 0 | None | closed | Sync downstream 20200716 | 2020-10-28 12:35:55 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:10:13 UTC

Description Zane Bitter 2020-06-26 20:34:49 UTC
Description of problem:
As described in: https://github.com/metal3-io/baremetal-operator/pull/565

The code that writes the 'status' annotation (an annotation containing the Status data) whenever the status changes can cause an infinite hot loop. Since the annotation and the Status subresource cannot be written at the same time, we re-read the object after writing the annotation and before trying to write the Status. However, if the re-read returns a previously cached version, the Status write will fail with an error and we'll begin the Reconcile cycle again. The Status changes generated by this new Reconcile may contain different timestamps, which will result in the annotation being updated again and the whole cycle repeating. Rate limiting helps to ensure that, once this has happened, the timestamps only get further and further apart, so the loop is self-sustaining.
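
To illustrate the cycle, here is a minimal, self-contained simulation in Go. It is only a sketch under stated assumptions, not the actual baremetal-operator code: the `object`, `server`, `reconcile`, and `run` names are hypothetical, the API server is modelled as a struct with a version counter, and the client cache is assumed to lag exactly one write behind.

```go
package main

import "fmt"

// object is a hypothetical stand-in for a host resource as stored by the API server.
type object struct {
	resourceVersion int
	annotation      string // serialized copy of the status
	status          string
}

// server models the API server plus a client cache that lags one write behind (an assumption).
type server struct {
	live  object
	cache object
}

// update mimics optimistic concurrency: a write based on a stale version is rejected.
func (s *server) update(obj object) error {
	if obj.resourceVersion != s.live.resourceVersion {
		return fmt.Errorf("conflict: have version %d, server has %d",
			obj.resourceVersion, s.live.resourceVersion)
	}
	obj.resourceVersion++
	s.live = obj
	return nil
}

// read returns the cached copy, then lets the cache catch up with the server.
func (s *server) read() object {
	cached := s.cache
	s.cache = s.live
	return cached
}

// reconcile computes a status that embeds a timestamp and tries to persist it.
func reconcile(s *server, now int, writeAnnotation bool) error {
	obj := s.read()
	newStatus := fmt.Sprintf("state=ready lastUpdated=%d", now)
	if writeAnnotation && obj.annotation != newStatus {
		obj.annotation = newStatus
		if err := s.update(obj); err != nil {
			return err
		}
		obj = s.read() // re-read, but the cache has not seen the annotation write yet
	}
	obj.status = newStatus
	return s.update(obj) // conflicts whenever the re-read object was stale
}

// run retries reconcile up to limit times and reports how many attempts were made
// and whether the status was ever written successfully.
func run(writeAnnotation bool, limit int) (attempts int, converged bool) {
	s := &server{}
	for attempts = 1; attempts <= limit; attempts++ {
		if err := reconcile(s, attempts, writeAnnotation); err == nil {
			return attempts, true
		}
	}
	return limit, false
}

func main() {
	n, ok := run(true, 5)
	fmt.Println("with annotation write:", n, ok)
	n, ok = run(false, 5)
	fmt.Println("without annotation write:", n, ok)
}
```

In this model, every retry produces a new timestamp, so the annotation is rewritten each time and the subsequent Status write always hits a stale version. With the annotation write skipped, the first attempt succeeds.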

We don't actually need or want to write a status annotation. We want to be able to *read* one, and we backported the code to do so to both 4.5 (bug 1835457) and 4.4 (bug 1843230). However, the code to both read and create the annotation was in the same patch, so we ended up with both.

Comment 1 Zane Bitter 2020-07-13 20:11:38 UTC
The offending code was removed from upstream in https://github.com/metal3-io/baremetal-operator/pull/566 and we picked it up in https://github.com/openshift/baremetal-operator/pull/82

Comment 7 Lubov 2020-08-17 14:23:17 UTC
Verified on
Client Version: 4.6.0-0.nightly-2020-08-16-072105
Server Version: 4.6.0-0.nightly-2020-08-16-072105
Kubernetes Version: v1.19.0-rc.2+99cb93a-dirty

Comment 8 Zane Bitter 2020-08-27 21:21:23 UTC
It turns out we picked up the original fix slightly earlier (in https://github.com/openshift/baremetal-operator/pull/82) and a couple of additional improvements to that fix later. I've linked the correct PRs.

Comment 9 Zane Bitter 2020-08-27 21:22:12 UTC
(In reply to Zane Bitter from comment #8)
> (in https://github.com/openshift/baremetal-operator/pull/82)

Grrr, no in https://github.com/openshift/baremetal-operator/pull/81

Comment 11 errata-xmlrpc 2020-10-27 16:09:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

