Bug 2019946

Summary: CephCluster updates might result in infinite reconciles
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Sébastien Han <shan>
Component: rook Assignee: Sébastien Han <shan>
Status: CLOSED CURRENTRELEASE QA Contact: Yosi Ben Shimon <ybenshim>
Severity: high Docs Contact:
Priority: high    
Version: 4.9 CC: madam, muagarwa, ocs-bugs, odf-bz-bot, rperiyas, tnielsen
Target Milestone: ---   
Target Release: ODF 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: v4.9.0-228.ci Doc Type: Bug Fix
Doc Text:
The monitor list embedded in the cluster peer token secret was not sorted, so on every reconcile the peer token secret's content was regenerated with the monitors in a randomized order. Whenever that order differed, the change matched our update predicate and triggered another reconcile, which rewrote the list again, and so on, potentially without end.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-01-07 17:46:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Sébastien Han 2021-11-03 16:53:00 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

When the CephCluster object is updated, the reconcile might run forever, because the list of monitors embedded in the mirroring peer token is regenerated in a different order on each pass.
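A minimal sketch of the underlying problem and the fix, not the actual Rook code: buildPeerToken, monAddrs, and the token layout below are hypothetical. Go randomizes map iteration order, so serializing the mon map directly yields a byte-different secret on every reconcile; sorting the mon names first makes the token deterministic, so the update predicate no longer fires spuriously.

    package main

    import (
    	"encoding/base64"
    	"encoding/json"
    	"fmt"
    	"sort"
    )

    // buildPeerToken serializes the mon endpoints into a deterministic token.
    func buildPeerToken(monAddrs map[string]string) (string, error) {
    	// Collect and sort the mon names so the output is stable across calls.
    	names := make([]string, 0, len(monAddrs))
    	for name := range monAddrs {
    		names = append(names, name)
    	}
    	sort.Strings(names)

    	mons := make([]string, 0, len(names))
    	for _, name := range names {
    		mons = append(mons, fmt.Sprintf("%s=%s", name, monAddrs[name]))
    	}

    	payload, err := json.Marshal(map[string]interface{}{"mons": mons})
    	if err != nil {
    		return "", err
    	}
    	return base64.StdEncoding.EncodeToString(payload), nil
    }

    func main() {
    	addrs := map[string]string{
    		"c": "10.0.0.3:6789",
    		"a": "10.0.0.1:6789",
    		"b": "10.0.0.2:6789",
    	}
    	// With sorting, repeated calls return identical tokens, so a
    	// content-based update predicate no longer sees a change on
    	// every reconcile.
    	t1, _ := buildPeerToken(addrs)
    	t2, _ := buildPeerToken(addrs)
    	fmt.Println(t1 == t2) // true
    }

Without the sort, the two tokens would differ whenever map iteration happened to visit the mons in a different order, which is what kept re-triggering the reconcile.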

Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex):


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 2 Sébastien Han 2021-11-04 14:53:33 UTC
downstream PR: https://github.com/red-hat-storage/rook/pull/312

Comment 6 Travis Nielsen 2021-11-19 16:35:16 UTC
To verify this BZ, you would really need to analyze the rook operator log to see whether it is reconciling the cluster multiple times even when there are no changes to the cephcluster CR. For example:
- Install OCS
- Wait for the ceph daemons to be created, including the OSD pods 
- Wait for a few more minutes to ensure the operator is done
- Grep the rook operator log for messages that indicate how many times the operator reconciled. You could grep for a specific message such as "done reconciling ceph cluster in namespace", which occurs only once per reconcile (a sketch of this check follows these steps).
- The reconcile should only occur once or maybe twice. If it's more than twice, the operator is finding a difference during each reconcile and keeps retrying when it should not.
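A minimal sketch of that log check in Go, assuming the operator log is piped in on stdin (for example from "oc logs deployment/rook-ceph-operator -n openshift-storage"); a plain "grep -c" on the same message works just as well:

    package main

    import (
    	"bufio"
    	"fmt"
    	"os"
    	"strings"
    )

    func main() {
    	// Count completed reconciles in the operator log read from stdin.
    	count := 0
    	scanner := bufio.NewScanner(os.Stdin)
    	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long log lines
    	for scanner.Scan() {
    		if strings.Contains(scanner.Text(), "done reconciling ceph cluster in namespace") {
    			count++
    		}
    	}
    	if err := scanner.Err(); err != nil {
    		fmt.Fprintln(os.Stderr, err)
    		os.Exit(1)
    	}
    	fmt.Printf("reconcile completions: %d\n", count)
    }

One or two completions are expected on a settled cluster; anything more suggests the operator keeps finding a spurious diff on each pass, which is exactly the loop this bug describes.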

Comment 7 Yosi Ben Shimon 2021-11-22 08:03:30 UTC
Following Travis's steps from comment #6 on a one-day-old cluster:
odf-operator.v4.9.0

In the rook-ceph-operator logs there is only one occurrence of:
2021-11-21 07:36:28.279416 I | ceph-cluster-controller: done reconciling ceph cluster in namespace "openshift-storage"

Moving to VERIFIED