Bug 2112321 - BMAC reconcile loop never stops after changes
Summary: BMAC reconcile loop never stops after changes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Advanced Cluster Management for Kubernetes
Classification: Red Hat
Component: Infrastructure Operator
Version: rhacm-2.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: rhacm-2.6
Assignee: Mat Kowalski
QA Contact: Derek
URL:
Whiteboard:
Depends On:
Blocks: 2112817
Reported: 2022-07-29 10:37 UTC by Mat Kowalski
Modified: 2022-09-06 22:35 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2112817
Environment:
Last Closed: 2022-09-06 22:34:44 UTC
Target Upstream Version:
Embargoed:
cbynum: rhacm-2.6+
cbynum: rhacm-2.6.z+


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift assisted-service pull 4201 | 0 | None | Merged | Bug 2112321: Stop reconcile loop after changing CR inside BMAC | 2022-08-01 08:30:50 UTC
Github | stolostron backlog issues 24676 | 0 | None | None | None | 2022-07-29 13:31:08 UTC
Red Hat Issue Tracker | MGMTBUGSM-490 | 0 | None | None | None | 2022-07-29 12:11:57 UTC
Red Hat Product Errata | RHSA-2022:6370 | 0 | None | None | None | 2022-09-06 22:35:08 UTC

Description Mat Kowalski 2022-07-29 10:37:31 UTC
In multiple places inside BMAC we change the state of the CR and mark the result as `dirty: true`. That result is then returned in one of the following two ways:

1) reconcileComplete{dirty: true}
2) reconcileComplete{dirty: true, stop: true}

Given that we should stop the reconcile loop after every change we make, in order to avoid errors like

```
Operation cannot be fulfilled on baremetalhosts.metal3.io [...]: the object has been modified; please apply your changes to the latest version and try again
```

we should only use `dirty: true` together with `stop: true`. Currently, using (1) leads to unexpected races when objects are modified (or not) multiple times in the same loop.
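
To illustrate the difference between (1) and (2), here is a minimal, self-contained sketch. It is not the assisted-service code: only the `reconcileComplete` field names `dirty` and `stop` come from this report, while `host`, `step`, `ensureLabel` and `reconcileOnce` are hypothetical names, and the write counter merely stands in for an update call against the API server.

```go
package main

import "fmt"

// host is a hypothetical stand-in for the CR being reconciled.
type host struct {
	labels map[string]string
}

// reconcileComplete carries the two flags named in this report.
type reconcileComplete struct {
	dirty bool // the step changed the CR, so it must be written back
	stop  bool // end this reconcile pass instead of running further steps
}

// step is one unit of work inside a reconcile pass (hypothetical type).
type step func(h *host) reconcileComplete

// ensureLabel returns a step that sets key=value when it is not set yet.
// stopOnChange selects approach (2) when true and approach (1) when false.
func ensureLabel(key, value string, stopOnChange bool) step {
	return func(h *host) reconcileComplete {
		if h.labels[key] == value {
			return reconcileComplete{} // nothing to do
		}
		h.labels[key] = value
		return reconcileComplete{dirty: true, stop: stopOnChange}
	}
}

// reconcileOnce runs the steps in order and returns how many writes a
// single pass would issue. Each dirty result counts as one write.
func reconcileOnce(h *host, steps []step) int {
	writes := 0
	for _, s := range steps {
		res := s(h)
		if res.dirty {
			writes++
		}
		if res.stop {
			return writes
		}
	}
	return writes
}

func main() {
	// Approach (1): both changes land in the same pass, so the second
	// write is issued from a copy that the first write already outdated.
	a := &host{labels: map[string]string{}}
	loose := []step{ensureLabel("x", "1", false), ensureLabel("y", "2", false)}
	fmt.Println("approach 1, pass 1 writes:", reconcileOnce(a, loose))

	// Approach (2): one change per pass; later passes pick up the rest.
	b := &host{labels: map[string]string{}}
	strict := []step{ensureLabel("x", "1", true), ensureLabel("y", "2", true)}
	for pass := 1; pass <= 3; pass++ {
		fmt.Printf("approach 2, pass %d writes: %d\n", pass, reconcileOnce(b, strict))
	}
}
```

In this model approach (2) trades a few extra reconcile passes for the guarantee that every write starts from a freshly read object.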

Comment 1 Trey West 2022-08-11 20:06:58 UTC
Hey @mko, looking at my cluster, I see only one log like the one you mentioned after running the cluster for about a week.

time="2022-08-10T16:21:51Z" level=error msg="Error updating hardwaredetails" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).Reconcile" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:272" bare_metal_host=mdhcp-master-0-0-bmh bare_metal_host_namespace=mdhcp-0 error="Operation cannot be fulfilled on baremetalhosts.metal3.io \"mdhcp-master-0-0-bmh\": the object has been modified; please apply your changes to the latest version and try again" go-id=721 request_id=9af6205c-acbe-4bad-931e-dc49d54d1ff9

Safe to say this is verified or do you think it needs more investigation?

Comment 2 Mat Kowalski 2022-08-14 17:44:06 UTC
After discussions with @mfilanov we reached the conclusion that `the object has been modified; please apply your changes to the latest version and try again` on its own is not yet an indication of a bug or an error, because in a healthy scenario the controller should succeed on its next attempt to reconcile.
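
As a rough illustration of why a single conflict is harmless, the sketch below models optimistic concurrency with an in-memory store. Everything in it (`store`, `object`, `reconcile`, the `interfere` hook) is hypothetical and only imitates the resourceVersion check performed by the API server; it is not how BMAC or client-go is implemented.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict reuses the message text quoted in this bug.
var errConflict = errors.New("the object has been modified; please apply your changes to the latest version and try again")

// object is a hypothetical CR with a version counter.
type object struct {
	resourceVersion int
	value           string
}

// store imitates the API server side of optimistic concurrency.
type store struct {
	current object
}

// get returns a copy of the latest stored object.
func (s *store) get() object { return s.current }

// update accepts the write only if it was based on the latest version.
func (s *store) update(o object) error {
	if o.resourceVersion != s.current.resourceVersion {
		return errConflict
	}
	o.resourceVersion++
	s.current = o
	return nil
}

// reconcile reads the object, sets the desired value and writes it back.
// interfere, when non-nil, runs between the read and the write to
// simulate another writer touching the same object.
func reconcile(s *store, desired string, interfere func()) error {
	obj := s.get()
	if interfere != nil {
		interfere()
	}
	obj.value = desired
	return s.update(obj)
}

func main() {
	s := &store{}

	// First attempt: another writer updates the object mid-reconcile,
	// so our write is rejected as stale.
	err := reconcile(s, "desired", func() {
		other := s.get()
		other.value = "changed-by-someone-else"
		_ = s.update(other)
	})
	fmt.Println("first attempt conflicted:", errors.Is(err, errConflict))

	// Next attempt: a fresh read gives the current version and the
	// write goes through.
	err = reconcile(s, "desired", nil)
	fmt.Println("second attempt error:", err, "value:", s.current.value)
}
```

The first attempt fails only because it wrote from a stale copy; the next pass re-reads the object and succeeds, which matches the healthy scenario described above.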

For me personally, the fact that you are seeing only 1 instance of this is a good indicator that the PR improved the situation, but as with any change that aims to fight race conditions, it is tricky to say when we have fully achieved the desired result.

I think the ultimate verification of this BZ will come from @dcain as part of https://bugzilla.redhat.com/show_bug.cgi?id=2099929 which we are working on verifying.

Comment 3 Trey West 2022-08-26 13:44:27 UTC
Verified on 2.1.0-DOWNANDBACK-2022-08-23-13-09-58

Comment 6 errata-xmlrpc 2022-09-06 22:34:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Advanced Cluster Management 2.6.0 security updates and bug fixes), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6370

