Description of problem (please be as detailed as possible and provide log snippets):
=====================================================================
With Bug 1927338, events were added to the OCS operator, in particular for the cases where uninstall was stuck. However, users still had to check the rook logs to understand the cause when cephcluster deletion was stuck. Nitin has already added the events to rook upstream, and this bug tracks the backport of that code to the OCS 4.8 downstream branch.

For more details, see Bug 1927338#c12 and Bug 1927338#c6.

Version of all relevant components (if applicable):
==================================================
OCS 4.8
Not sure if we need to backport the fix to 4.7.

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
============================================================
No, but one needs to check the logs to get information on failures.

Is there any workaround available to the best of your knowledge?
===============================================================
Check logs.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
================================================================
3

Is this issue reproducible?
===============================
Yes

Can this issue be reproduced from the UI? If this is a regression, please provide more details to justify this:
=================================================
No

Steps to Reproduce:
======================
1. Create PVCs and OBCs.
2. With the default mode for the storagecluster (uninstall.ocs.openshift.io/mode: graceful), initiate storagecluster deletion.
3. In the ocs-operator logs, we only get the indication that deletion is waiting for the cephcluster to be removed.
4.
oc describe of the cephcluster also does not have these details.

Actual results:
===================
Need to go through multiple logs to understand what is causing cephcluster deletion to get stuck.

Expected results:
=====================
We already have events in the IMP CRs managed by the storagecluster (see Bug 1927338#c11), but we also need events in the cephcluster so that we can tell what is affecting uninstall via "oc describe <CR>" itself.
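With the fix, the blocking condition should be visible from the CLI alone. A minimal sketch of pulling the failure reason out of the Events section (the sample text below is copied from this bug's verification run; on a live cluster one would pipe real "oc describe cephcluster" output instead, and the resource name assumes a default OCS install):

```shell
# Sample "Events:" section as printed by "oc describe cephcluster";
# the text is copied from this bug's verification comment.
events='Type     Reason           Age    From               Message
----     ------           ----   ----               -------
Warning  ReconcileFailed  3m32s  ClusterController  failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: mon count 10 cannot be even, must be odd to support a healthy quorum'

# On a live cluster, the equivalent would be:
#   oc describe cephcluster ocs-storagecluster-cephcluster -n openshift-storage | sed -n '/^Events:/,$p'

# Keep only Warning events, i.e. the lines that explain what is blocking progress.
printf '%s\n' "$events" | awk '$1 == "Warning"'
```

This is exactly the workflow the expected results ask for: the cause of a stuck uninstall surfaces in "oc describe <CR>" itself, with no need to trawl operator logs.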
This will be picked up in the next downstream resync to release-4.8.
Included in the latest resync to release-4.8
Hi,

As mentioned in comment#11, I powered off two storage nodes and observed that no new events were seen through the CLI:

Events:
  Type    Reason              Age                From               Message
  ----    ------              ----               ----               -------
  Normal  ReconcileSucceeded  48m (x2 over 22h)  ClusterController  cluster has been configured successfully

The second scenario was suggested by @nigoyal: scale down the ocs-operator, then edit the cephcluster and change the mon count to 10. The cluster was kept in this state for 5 hours; events were generated and the event count was correct.

Steps performed to validate the fix:

1. Deployed an OCS 4.8 cluster. The pods, nodes, and ceph health were fine.

2. Scaled down the ocs-operator:

[root@localhost ocs4_8_aws]# oc scale deployment ocs-operator --replicas=0 -n openshift-storage
deployment.apps/ocs-operator scaled

3. Edited the cephcluster to change the mon count to 10:

[root@localhost ocs4_8_aws]# oc edit -n openshift-storage cephcluster ocs-storagecluster-cephcluster
cephcluster.ceph.rook.io/ocs-storagecluster-cephcluster edited

4. Observed the events for the next 5 hours using the command "oc describe cephcluster -n openshift-storage":
Events:
  Type     Reason           Age    From               Message
  ----     ------           ----   ----               -------
  Warning  ReconcileFailed  3m32s  ClusterController  failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: mon count 10 cannot be even, must be odd to support a healthy quorum

====================================================================================================

Events:
  Type     Reason           Age                From               Message
  ----     ------           ----               ----               -------
  Warning  ReconcileFailed  12m (x2 over 84m)  ClusterController  failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: mon count 10 cannot be even, must be odd to support a healthy quorum

====================================================================================================

Events:
  Type     Reason           Age                From               Message
  ----     ------           ----               ----               -------
  Warning  ReconcileFailed  29m (x3 over 168m)  ClusterController  failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: mon count 10 cannot be even, must be odd to support a healthy quorum

====================================================================================================

Events:
  Type     Reason           Age                 From               Message
  ----     ------           ----                ----               -------
  Warning  ReconcileFailed  12m (x6 over 5h51m)  ClusterController  failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: mon count 10 cannot be even, must be odd to support a healthy quorum
====================================================================================================

5. Changed the mon count to 3 and then again to 10, and observed that events were generated and the event count increased:

Events:
  Type     Reason           Age                 From               Message
  ----     ------           ----                ----               -------
  Warning  ReconcileFailed  9m25s               ClusterController  failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to create cluster: failed to start ceph monitors: failed to assign pods to mons: CANCELLING CURRENT ORCHESTRATION
  Warning  ReconcileFailed  45s (x7 over 6h7m)  ClusterController  failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: mon count 10 cannot be even, must be odd to support a healthy quorum

====================================================================================================

The events were generated, and the event count did not increase again before an hour had passed. Hence moving the bug to verified state.

Thanks
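The ReconcileFailed message above comes from rook's pre-creation validation of the requested mon count. A rough shell illustration of that check (this mirrors the event's message text; it is a sketch, not rook's actual Go code):

```shell
# Ceph monitors form a quorum by majority vote; an even count cannot
# break ties, so rook rejects even values before creating the cluster.
mon_count=10
if [ $((mon_count % 2)) -eq 0 ]; then
  echo "mon count $mon_count cannot be even, must be odd to support a healthy quorum"
fi
# -> mon count 10 cannot be even, must be odd to support a healthy quorum
```

This is why editing the CR to mon count 10 is a convenient way to force a persistent ReconcileFailed event for verification.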
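The Age values such as "12m (x6 over 5h51m)" above are Kubernetes event deduplication at work: when the same event recurs, the apiserver bumps a counter on the existing Event object instead of creating a new one, and describe renders it as "lastSeen (xCount over firstSeen)". The count the verification steps tracked can be pulled out of that field, e.g.:

```shell
# Age column format for a deduplicated event: "lastSeen (xCount over firstSeen)".
# Sample value taken from the verification output above.
age='12m (x6 over 5h51m)'
count=$(printf '%s\n' "$age" | sed -n 's/.*(x\([0-9][0-9]*\) over.*/\1/p')
echo "$count"   # -> 6
```

A count that keeps rising while the timestamps age confirms the operator is still retrying and re-emitting the same failure, which is the behavior verified in the steps above.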
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3003