In multiple places inside BMAC we change the state of the CR and mark the result as `dirty: true`. This is later handled in one of the two following ways:

1) `reconcileComplete{dirty: true}`
2) `reconcileComplete{dirty: true, stop: true}`

Given that we should stop the reconcile loop after every change made, in order to avoid errors like

```
Operation cannot be fulfilled on baremetalhosts.metal3.io [...]: the object has been modified; please apply your changes to the latest version and try again
```

we should only use `dirty: true` together with `stop: true`. Currently, using (1) leads to unexpected races when objects are modified (or not) multiple times in the same loop.
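For illustration, here is a minimal sketch of the pattern, assuming a simplified `reconcileComplete` struct and a hypothetical `applySteps` driver; the real BMAC `Reconcile` is structured differently, so treat this only as a demonstration of why a step that sets `dirty: true` should also set `stop: true`:

```go
// Illustrative sketch only; types and loop shape are assumptions,
// not the actual BMAC implementation.
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcileComplete mirrors the result struct referenced above:
// dirty means the object was mutated in memory, stop means the
// reconcile loop must end once the change has been persisted.
type reconcileComplete struct {
	dirty bool
	stop  bool
}

// applySteps is a hypothetical driver that runs a series of steps,
// each of which may mutate obj in memory and report dirty/stop.
func applySteps(ctx context.Context, c client.Client, obj client.Object,
	steps []func(client.Object) reconcileComplete) (ctrl.Result, error) {

	for _, step := range steps {
		res := step(obj)

		if res.dirty {
			// Persist the mutation. If another writer changed the object
			// on the server since it was read, this fails with the
			// "object has been modified" conflict quoted above and the
			// request gets requeued.
			if err := c.Update(ctx, obj); err != nil {
				return ctrl.Result{}, err
			}
		}

		if res.stop {
			// Recommended pattern: stop right after a successful update
			// so the next reconcile starts from a fresh copy of the object.
			return ctrl.Result{}, nil
		}

		// If a step returned dirty without stop, the loop keeps running and
		// may issue further updates in the same pass, widening the window
		// for conflicts with other controllers writing to the same object.
	}
	return ctrl.Result{}, nil
}
```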
Hey @mko, looking at my cluster, I only see one log like you mentioned after running the cluster for about a week:

```
time="2022-08-10T16:21:51Z" level=error msg="Error updating hardwaredetails" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).Reconcile" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:272" bare_metal_host=mdhcp-master-0-0-bmh bare_metal_host_namespace=mdhcp-0 error="Operation cannot be fulfilled on baremetalhosts.metal3.io \"mdhcp-master-0-0-bmh\": the object has been modified; please apply your changes to the latest version and try again" go-id=721 request_id=9af6205c-acbe-4bad-931e-dc49d54d1ff9
```

Safe to say this is verified, or do you think it needs more investigation?
After discussions with @mfilanov we reached the conclusion that `the object has been modified; please apply your changes to the latest version and try again` on its own is not an indication of a bug or an error, as in a healthy scenario the controller should succeed on the next reconcile attempt. Personally, the fact that you are seeing only one instance of this is a good indicator that the PR improved the situation, but as with any change that aims at mitigating race conditions, it is hard to say when we have completely achieved the desired result. I think the ultimate verification of this BZ will come from @dcain as part of https://bugzilla.redhat.com/show_bug.cgi?id=2099929, which we are working on verifying.
Verified on 2.1.0-DOWNANDBACK-2022-08-23-13-09-58
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Advanced Cluster Management 2.6.0 security updates and bug fixes), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:6370