Description of problem (please be as detailed as possible and provide log snippets):

When the CephCluster object is updated, the reconcile might run forever, since the list of monitors for the mirror token is generated in a different order each time. The operator detects the reordered list as a change and keeps re-triggering the reconcile.

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
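For illustration only, here is a minimal Go sketch of the kind of fix this implies (this is not the actual code from the PR below; the function name and data shapes are hypothetical): compare the mon list as an unordered set, so a reordered-but-identical list does not register as a change.

// Hypothetical sketch: treat the monitor list as an unordered set
// when checking for changes, so that a reordered-but-identical list
// does not trigger another reconcile.
package main

import (
	"fmt"
	"sort"
)

// monsChanged reports whether two monitor endpoint lists differ,
// ignoring ordering.
func monsChanged(old, cur []string) bool {
	if len(old) != len(cur) {
		return true
	}
	// Copy before sorting so the callers' slices are not mutated.
	a := append([]string(nil), old...)
	b := append([]string(nil), cur...)
	sort.Strings(a)
	sort.Strings(b)
	for i := range a {
		if a[i] != b[i] {
			return true
		}
	}
	return false
}

func main() {
	old := []string{"a=10.0.0.1:6789", "b=10.0.0.2:6789"}
	cur := []string{"b=10.0.0.2:6789", "a=10.0.0.1:6789"}
	// Prints "false": same set of mons, so no spurious reconcile.
	fmt.Println(monsChanged(old, cur))
}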
downstream PR: https://github.com/red-hat-storage/rook/pull/312
To verify this BZ, you would really need to analyze the rook operator log to see whether it is reconciling the cluster multiple times even while there are no changes to the cephcluster CR. For example:

- Install OCS
- Wait for the ceph daemons to be created, including the OSD pods
- Wait for a few more minutes to ensure the operator is done
- Grep the rook operator log for messages that indicate how many times the operator reconciled (a command for this is sketched after this list). You could grep for a specific message such as "done reconciling ceph cluster in namespace" that only occurs once per reconcile.
- The reconcile should only occur once or maybe twice. If it's more than twice, the operator is finding a difference during each reconcile and keeps retrying when it should not.
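Assuming the default openshift-storage namespace and the standard rook-ceph-operator deployment name (adjust both if your install differs), a count along these lines should work:

oc -n openshift-storage logs deploy/rook-ceph-operator | grep -c "done reconciling ceph cluster in namespace"

A count of 1 or 2 is expected; a steadily growing count indicates the operator keeps finding a spurious difference on every reconcile.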
Following Travis's steps from comment #6 on a 1-day-old cluster: odf-operator.v4.9.0

From the rook-ceph-operator logs there is only one occurrence of:

2021-11-21 07:36:28.279416 I | ceph-cluster-controller: done reconciling ceph cluster in namespace "openshift-storage"

Moving to VERIFIED