2112321 – BMAC reconcile loop never stops after changes

Summary: BMAC reconcile loop never stops after changes

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Advanced Cluster Management for Kubernetes
Classification:	Red Hat
Component:	Infrastructure Operator
Sub Component:
Version:	rhacm-2.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	rhacm-2.6
Assignee:	Mat Kowalski
QA Contact:
Docs Contact:	Derek
URL:
Whiteboard:
Depends On:
Blocks:	2112817

Reported:	2022-07-29 10:37 UTC by Mat Kowalski
Modified:	2022-09-06 22:35 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	2112817
Environment:
Last Closed:	2022-09-06 22:34:44 UTC
Target Upstream Version:
Embargoed:
Flags:	cbynum: rhacm-2.6+ cbynum: rhacm-2.6.z+

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift assisted-service pull 4201	None	Merged	Bug 2112321: Stop reconcile loop after changing CR inside BMAC	2022-08-01 08:30:50 UTC
Github	stolostron backlog issues 24676	None	None	None	2022-07-29 13:31:08 UTC
Red Hat Issue Tracker	MGMTBUGSM-490	None	None	None	2022-07-29 12:11:57 UTC
Red Hat Product Errata	RHSA-2022:6370	None	None	None	2022-09-06 22:35:08 UTC

Description Mat Kowalski 2022-07-29 10:37:31 UTC

In multiple places inside BMAC we change the state of the CR and mark the result as `dirty: true`. This is later handled in one of the following two ways:

1) reconcileComplete{dirty: true}
2) reconcileComplete{dirty: true, stop: true}

Given that we should stop the reconcile loop after every change, in order to avoid errors like

```
Operation cannot be fulfilled on baremetalhosts.metal3.io [...]: the object has been modified; please apply your changes to the latest version and try again
```

we should only use `dirty: true` together with `stop: true`. Currently, using (1) leads to unexpected races when objects are modified (or not) multiple times in the same loop.

Comment 1 Trey West 2022-08-11 20:06:58 UTC

Hey @mko, looking at my cluster, I only see one log like the one you mentioned after running the cluster for about a week.

time="2022-08-10T16:21:51Z" level=error msg="Error updating hardwaredetails" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).Reconcile" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:272" bare_metal_host=mdhcp-master-0-0-bmh bare_metal_host_namespace=mdhcp-0 error="Operation cannot be fulfilled on baremetalhosts.metal3.io \"mdhcp-master-0-0-bmh\": the object has been modified; please apply your changes to the latest version and try again" go-id=721 request_id=9af6205c-acbe-4bad-931e-dc49d54d1ff9

Safe to say this is verified or do you think it needs more investigation?

Comment 2 Mat Kowalski 2022-08-14 17:44:06 UTC

After discussions with @mfilanov we reached the conclusion that `the object has been modified; please apply your changes to the latest version and try again` on its own is not an indication of a bug or an error, because in a healthy scenario the controller should succeed in the next attempt to reconcile.

For me personally, the fact that you are seeing only 1 instance of this is a good indicator that the PR improved the situation, but as with any change that aims to fight race conditions, it is tricky to say when we are completely done.

I think the ultimate verification of this BZ here will come from @dcain as part of https://bugzilla.redhat.com/show_bug.cgi?id=2099929 which we are working on verifying.

Comment 3 Trey West 2022-08-26 13:44:27 UTC

Verified on 2.1.0-DOWNANDBACK-2022-08-23-13-09-58

Comment 6 errata-xmlrpc 2022-09-06 22:34:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Advanced Cluster Management 2.6.0 security updates and bug fixes), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6370
