Bug 2019946 - CephCluster updates might result in infinite reconciles
Summary: CephCluster updates might result in infinite reconciles
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ODF 4.9.0
Assignee: Sébastien Han
QA Contact: Yosi Ben Shimon
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-11-03 16:53 UTC by Sébastien Han
Modified: 2023-08-09 17:03 UTC
CC List: 6 users

Fixed In Version: v4.9.0-228.ci
Doc Type: Bug Fix
Doc Text:
The monitor list embedded in the cluster peer token secret was not sorted, so on every reconcile the peer token secret was rewritten with the monitors in a different, random order. That content change matched the controller's update predicate and triggered another reconcile, which rewrote the list again, and so on. Whenever the randomized order differed, this became an endless reconcile loop.
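For illustration, a minimal Go sketch of the fix (names are hypothetical, not Rook's actual API): serializing the monitors in sorted order makes the token content deterministic, so the secret only changes when the monitor set actually changes.

package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildMonHost joins monitor addresses into the string stored in the peer
// token secret. Go map iteration order is randomized, so without the sort
// the result can differ on every call even for an identical monitor set.
func buildMonHost(mons map[string]string) string {
	addrs := make([]string, 0, len(mons))
	for _, addr := range mons {
		addrs = append(addrs, addr)
	}
	sort.Strings(addrs)
	return strings.Join(addrs, ",")
}

func main() {
	mons := map[string]string{
		"a": "10.0.0.1:6789",
		"b": "10.0.0.2:6789",
		"c": "10.0.0.3:6789",
	}
	// Always prints "10.0.0.1:6789,10.0.0.2:6789,10.0.0.3:6789".
	fmt.Println(buildMonHost(mons))
}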
Clone Of:
Environment:
Last Closed: 2022-01-07 17:46:31 UTC
Embargoed:




Links:
- Github red-hat-storage/rook pull 312 (open): Bug 2019946: rbd-mirror: use a sorted list for peer token content (last updated 2021-11-04 14:53:09 UTC)
- Github rook/rook pull 9091 (open): rbd-mirror: use a sorted list for peer token content (last updated 2021-11-03 16:53:37 UTC)

Description Sébastien Han 2021-11-03 16:53:00 UTC
Description of problem (please be detailed as possible and provide log
snippets):

When the CephCluster object is updated, the reconcile might run forever, since the list of monitors used for the mirror peer token is written back in a different order on each pass.
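To make the failure mode concrete, here is a hedged Go sketch (hypothetical names, not Rook's actual code) of why the token content keeps changing, assuming the monitors are tracked in a map: Go deliberately randomizes map iteration order, so the joined string can differ across reconciles and the update predicate keeps firing.

package main

import (
	"fmt"
	"strings"
)

func main() {
	mons := map[string]string{
		"a": "10.0.0.1:6789",
		"b": "10.0.0.2:6789",
		"c": "10.0.0.3:6789",
	}
	addrs := make([]string, 0, len(mons))
	for _, addr := range mons {
		addrs = append(addrs, addr)
	}
	// Different iterations may yield e.g.
	// "10.0.0.2:6789,10.0.0.1:6789,10.0.0.3:6789" one time and
	// "10.0.0.1:6789,10.0.0.3:6789,10.0.0.2:6789" the next, even though
	// the monitor set is unchanged.
	fmt.Println(strings.Join(addrs, ","))
}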

Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 2 Sébastien Han 2021-11-04 14:53:33 UTC
downstream PR: https://github.com/red-hat-storage/rook/pull/312

Comment 6 Travis Nielsen 2021-11-19 16:35:16 UTC
To verify this BZ, you would really need to analyze the rook operator log to see if it is reconciling the cluster multiple times even while there are no changes to the cephcluster CR. For example:
- Install OCS
- Wait for the ceph daemons to be created, including the OSD pods 
- Wait for a few more minutes to ensure the operator is done
- Grep the rook operator log for messages that indicate how many times the operator reconciled. You could grep for a specific message such as "done reconciling ceph cluster in namespace", which occurs only once per reconcile (see the sketch after this list).
- The reconcile should only occur once, or maybe twice. If it happens more than twice, the operator is finding a difference on each reconcile and keeps retrying when it should not.
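For convenience, a small Go sketch that performs that count against a saved operator log (the filename and the way the log was captured are illustrative; `grep -c` on the same string is equivalent):

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	// Log saved beforehand, e.g. with something like:
	//   oc -n openshift-storage logs deploy/rook-ceph-operator > rook-ceph-operator.log
	f, err := os.Open("rook-ceph-operator.log")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	count := 0
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if strings.Contains(scanner.Text(), "done reconciling ceph cluster in namespace") {
			count++
		}
	}
	// Expect 1 (or at most 2) on an idle cluster; more suggests a reconcile loop.
	fmt.Println("reconcile completions:", count)
}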

Comment 7 Yosi Ben Shimon 2021-11-22 08:03:30 UTC
Following Travis's steps from comment #6 on a one-day-old cluster:
odf-operator.v4.9.0

In the rook-ceph-operator logs there is only one occurrence of:
2021-11-21 07:36:28.279416 I | ceph-cluster-controller: done reconciling ceph cluster in namespace "openshift-storage"

Moving to VERIFIED

