The OCS QE team is following up on this BZ. GSS and the development team can contact us if any help or information is needed from our end.
Acking for 4.8, and we should also clone to 4.7.z after confirming the fix. Losing mon quorum as a side effect of this issue is too severe, and the workaround of resetting the mon quorum to a single mon for recovery is too involved.
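For clusters hit by this before the fix lands, a quick way to confirm whether the mons still have quorum is the Rook toolbox. A minimal sketch, assuming the default openshift-storage namespace and that the rook-ceph-tools deployment is enabled:

# Find the toolbox pod via its stock Rook label
$ TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)
# quorum_names should list all three mons; fewer means quorum is degraded
$ oc -n openshift-storage exec "$TOOLS_POD" -- ceph quorum_status -f json-pretty
# ceph status also surfaces mon-down and quorum warnings in its health output
$ oc -n openshift-storage exec "$TOOLS_POD" -- ceph status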
The fix is low risk; we should backport it to 4.7.z and 4.6.z.
This is merged downstream to 4.8 with https://github.com/openshift/rook/pull/235. I'll clone for 4.7.z and 4.6.z.
Setup:
OCP Version: 4.8.0-0.nightly-2021-07-01-185624
OCS Version: ocs-operator.v4.8.0-433.ci
Provider: VMware (type: LSO)

Ceph versions:
sh-4.4# ceph versions
{
    "mon": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "mds": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "rgw": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 8
    }
}

Test process:

1. Get the worker nodes where the mon pods run
$ oc get pods -o wide | grep mon
rook-ceph-mon-b-6d9f985498-6c75t   2/2   Running   0   25h   10.128.4.11   compute-1   <none>   <none>
rook-ceph-mon-e-784fc9db98-hl5kz   2/2   Running   0   18h   10.131.2.30   compute-0   <none>   <none>
rook-ceph-mon-g-7db748f8c4-h7p9b   2/2   Running   0   17m   10.130.2.27   compute-2   <none>   <none>

2. Verify PDB status: allowed_disruptions=1, max_unavailable_mon=1
$ oc get pdb
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     25h
rook-ceph-mon-pdb                                 N/A             1                 1                     25h
rook-ceph-osd                                     N/A             1                 1                     14h

3. Drain a node where a mon pod runs
$ oc adm drain compute-0 --force=true --ignore-daemonsets --delete-local-data
pod/rook-ceph-mgr-a-98c8b6586-4758q evicted
pod/rook-ceph-osd-1-5ffc4659b6-7tvw8 evicted
pod/rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-556cf965lc28p evicted
pod/rook-ceph-mon-e-784fc9db98-hl5kz evicted
pod/rook-ceph-crashcollector-compute-0-56bb7595-r8nwl evicted
node/compute-0 evicted

4. Verify PDB status: allowed_disruptions=0, max_unavailable_mon=1
$ oc get pdb
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     25h
rook-ceph-mon-pdb                                 N/A             1                 0                     25h
rook-ceph-osd-host-compute-1                      N/A             0                 0                     111s
rook-ceph-osd-host-compute-2                      N/A             0                 0                     111s

5. Verify the number of mon pods returns to 3 (polled for 1400 seconds)
$ oc get pods | grep mon
rook-ceph-mon-b-6d9f985498-6c75t   2/2   Running   0   25h
rook-ceph-mon-e-784fc9db98-56ztb   0/2   Pending   0   22m
rook-ceph-mon-g-7db748f8c4-h7p9b   2/2   Running   0   41m
$ oc get pods | grep mon
rook-ceph-mon-b-6d9f985498-6c75t          2/2   Running   0   25h
rook-ceph-mon-g-7db748f8c4-h7p9b          2/2   Running   0   41m
rook-ceph-mon-h-canary-76db9df589-v6vn6   0/2   Pending   0   0s
$ oc get pods | grep mon
rook-ceph-mon-b-6d9f985498-6c75t          2/2   Running   0   25h
rook-ceph-mon-g-7db748f8c4-h7p9b          2/2   Running   0   41m
rook-ceph-mon-h-canary-76db9df589-v6vn6   0/2   Pending   0   2s

6. Respin the rook-ceph-operator pod
$ oc get pods | grep rook-ceph-operator
rook-ceph-operator-7cbd4c6dcf-jsscw   1/1   Running   0   25h
$ oc delete pod rook-ceph-operator-7cbd4c6dcf-jsscw
pod "rook-ceph-operator-7cbd4c6dcf-jsscw" deleted
$ oc get pods | grep rook-ceph-operator
rook-ceph-operator-7cbd4c6dcf-8hzr6   1/1   Running   0   7s

7. Uncordon the node
$ oc adm uncordon compute-0
node/compute-0 uncordoned

8. Wait for the mon and OSD pods to reach Running state
$ oc get pods -o wide | grep mon
rook-ceph-mon-b-6d9f985498-6c75t   2/2   Running   0   25h    10.128.4.11   compute-1   <none>   <none>
rook-ceph-mon-e-784fc9db98-2d2xv   2/2   Running   0   5m2s   10.131.2.35   compute-0   <none>   <none>
rook-ceph-mon-g-7db748f8c4-h7p9b   2/2   Running   0   49m    10.130.2.27   compute-2   <none>   <none>

9. Verify PDB status: allowed_disruptions=1, max_unavailable_mon=1
$ oc get pdb
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     25h
rook-ceph-mon-pdb                                 N/A             1                 1                     25h
rook-ceph-osd-host-compute-1                      N/A             0                 0                     29m
rook-ceph-osd-host-compute-2                      N/A             0                 0                     29m
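For regression runs, the key check in steps 4 and 9 (the mon PDB's allowed disruptions dropping to 0 during the drain and returning to 1 after recovery) can be scripted. A minimal sketch, assuming the default openshift-storage namespace and the rook-ceph-mon-pdb name seen above; the ten-minute polling budget is an arbitrary choice:

# Poll the mon PDB until it again allows one disruption (up to ~10 minutes)
$ for i in $(seq 1 60); do
>   allowed=$(oc -n openshift-storage get pdb rook-ceph-mon-pdb -o jsonpath='{.status.disruptionsAllowed}')
>   [ "$allowed" = "1" ] && echo "mon PDB recovered" && break
>   sleep 10
> done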
Bug fixed, verified per the test process above [based on https://bugzilla.redhat.com/show_bug.cgi?id=1955831#c15].
I think the heading should be "unreliable mon quorum"; the rest looks good.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3003