Bug 2112321

Summary: BMAC reconcile loop never stops after changes
Product: Red Hat Advanced Cluster Management for Kubernetes Reporter: Mat Kowalski <mko>
Component: Infrastructure Operator Assignee: Mat Kowalski <mko>
Status: CLOSED ERRATA QA Contact:
Severity: unspecified Docs Contact: Derek <dcadzow>
Priority: unspecified    
Version: rhacm-2.6 CC: cbynum, ccrum, dcain, fpercoco, hhamid, skoksal, trwest, yfirst
Target Milestone: --- Flags: cbynum: rhacm-2.6+, cbynum: rhacm-2.6.z+
Target Release: rhacm-2.6   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2112817 (view as bug list) Environment:
Last Closed: 2022-09-06 22:34:44 UTC Type: Bug
Bug Depends On:    
Bug Blocks: 2112817    

Description Mat Kowalski 2022-07-29 10:37:31 UTC
In multiple places inside BMAC we change the state of the CR and mark the result as `dirty: true`. That result is then returned in one of the following two ways:

1) reconcileComplete{dirty: true}
2) reconcileComplete{dirty: true, stop: true}

Given that we should stop the reconcile loop after every change we make, in order to avoid errors like

```
Operation cannot be fulfilled on baremetalhosts.metal3.io [...]: the object has been modified; please apply your changes to the latest version and try again
```

we should only use `dirty: true` together with `stop: true`. Currently, using (1) leads to unexpected races when objects are modified (or not) multiple times in the same loop.
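
To illustrate the intent, here is a minimal sketch of the "stop after the first change" pattern. It is not the actual BMAC code: `reconcileResult`, `host`, `step`, `ensureLabel` and `runOnce` are hypothetical names made up for this example, and only the `dirty` and `stop` fields come from the description above.

```go
package main

import "fmt"

// reconcileResult carries the two fields named in this bug.
// The type itself is hypothetical, not the real BMAC type.
type reconcileResult struct {
	dirty bool // the step changed the CR, so it must be written back
	stop  bool // no further steps should run in this reconcile pass
}

// host is a stand-in for the BareMetalHost CR (assumption for illustration).
type host struct {
	labels map[string]string
}

// step is one unit of reconcile work against the host.
type step func(h *host) reconcileResult

// ensureLabel returns a step that sets a label if it is missing.
// Whenever it changes the object it reports dirty and stop together.
func ensureLabel(key, value string) step {
	return func(h *host) reconcileResult {
		if h.labels[key] == value {
			return reconcileResult{} // nothing changed, keep going
		}
		h.labels[key] = value
		return reconcileResult{dirty: true, stop: true}
	}
}

// runOnce runs the steps in order and returns after the first step that
// asks to stop. It reports whether the object needs to be persisted and
// how many steps were executed.
func runOnce(h *host, steps []step) (dirty bool, ran int) {
	for _, s := range steps {
		ran++
		res := s(h)
		dirty = dirty || res.dirty
		if res.stop {
			break
		}
	}
	return dirty, ran
}

func main() {
	h := &host{labels: map[string]string{}}
	steps := []step{ensureLabel("a", "1"), ensureLabel("b", "2")}

	for pass := 1; ; pass++ {
		dirty, ran := runOnce(h, steps)
		fmt.Printf("pass %d: ran %d step(s), dirty=%v\n", pass, ran, dirty)
		if !dirty {
			break
		}
		// A real controller would persist the CR here and requeue, so the
		// next pass starts from the latest version of the object.
	}
}
```

With this shape, each pass writes at most one change, and every later step sees a freshly read object on the next pass instead of a copy that may already be stale.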

Comment 1 Trey West 2022-08-11 20:06:58 UTC
Hey @mko, looking at my cluster, I see only one log entry like the one you mentioned after running the cluster for about a week.

time="2022-08-10T16:21:51Z" level=error msg="Error updating hardwaredetails" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).Reconcile" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:272" bare_metal_host=mdhcp-master-0-0-bmh bare_metal_host_namespace=mdhcp-0 error="Operation cannot be fulfilled on baremetalhosts.metal3.io \"mdhcp-master-0-0-bmh\": the object has been modified; please apply your changes to the latest version and try again" go-id=721 request_id=9af6205c-acbe-4bad-931e-dc49d54d1ff9

Safe to say this is verified or do you think it needs more investigation?

Comment 2 Mat Kowalski 2022-08-14 17:44:06 UTC
After discussions with @mfilanov we reached the conclusion that `the object has been modified; please apply your changes to the latest version and try again` on its own is not an indication of a bug or an error, because in a healthy scenario the controller should succeed on the next attempt to reconcile.

Personally, I take the fact that you saw only one instance of this as a good indicator that the PR improved the situation, but as with any change that aims at fighting race conditions, it is tricky to say when we have fully achieved the desired result.

I think the ultimate verification of this BZ here will come from @dcain as part of https://bugzilla.redhat.com/show_bug.cgi?id=2099929 which we are working on verifying.

Comment 3 Trey West 2022-08-26 13:44:27 UTC
Verified on 2.1.0-DOWNANDBACK-2022-08-23-13-09-58

Comment 6 errata-xmlrpc 2022-09-06 22:34:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Advanced Cluster Management 2.6.0 security updates and bug fixes), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6370