Bug 1955831
Summary: | [GSS][mon] rook-operator scales mons to 4 after healthCheck timeout | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Randy Martinez <r.martinez> |
Component: | rook | Assignee: | Travis Nielsen <tnielsen> |
Status: | CLOSED ERRATA | QA Contact: | Oded <oviner> |
Severity: | urgent | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.6 | CC: | alitke, assingh, chhudson, hnallurv, kjosy, madam, muagarwa, ocs-bugs, olakra, sraghave, tdesala, tnielsen |
Target Milestone: | --- | Keywords: | AutomationBackLog |
Target Release: | OCS 4.8.0 | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | |||
Fixed In Version: | 4.8.0-402.ci | Doc Type: | Bug Fix |
Doc Text: |
.Reliable mon quorum during node drains and mon failover scenarios
Previously, if the operator was restarted during a mon failover, it could erroneously remove the new mon, putting the mon quorum at risk. With this update, the operator restores its state when a mon failover is in progress and properly completes the failover after the restart. Mon quorum is now more reliable during node drains and mon failover scenarios.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2021-08-03 18:15:57 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1959983, 1959985 |
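The fix described in the Doc Text (the operator restoring in-progress mon failover state after a restart) can be illustrated with a minimal sketch. This is hypothetical pseudocode-style Python, not Rook's actual Go implementation: the class, method names, and the dict standing in for the operator's persistent store (Rook persists mon state in a ConfigMap) are all illustrative.

```python
# Hypothetical sketch of the failover-state fix, not Rook's actual API.
# The idea: persist the failover intent *before* acting, so a restarted
# operator completes the interrupted failover instead of treating the
# new mon as an extra member and removing it (risking quorum).

class MonFailover:
    def __init__(self, store):
        # `store` stands in for the operator's persistent state
        # (a ConfigMap in Rook); here it is just a dict.
        self.store = store

    def start_failover(self, bad_mon, new_mon):
        # Record intent first, so a crash mid-failover is recoverable.
        self.store["failover"] = {"removing": bad_mon, "adding": new_mon}

    def resume_after_restart(self, mons):
        # On startup, check for an interrupted failover and finish it,
        # rather than erroneously removing the newly added mon.
        state = self.store.get("failover")
        if state is None:
            return mons
        mons = [m for m in mons if m != state["removing"]]
        if state["adding"] not in mons:
            mons.append(state["adding"])
        del self.store["failover"]
        return mons
```

For example, with a failover of mon `e` to mon `h` in progress across an operator restart, resuming yields the quorum `b`, `g`, `h` with `e` removed, matching the behavior verified in the QA comment below.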
Comment 3
Shrivaibavi Raghaventhiran
2021-05-03 07:09:00 UTC
Acking for 4.8; we should also clone to 4.7.z after confirming the fix. The side effect of this issue, losing quorum, is too severe, and the workaround of resetting the mon quorum to a single mon for recovery is too involved. The fix is low risk, so we should backport it to 4.7.z and 4.6.z.

This is merged downstream to 4.8 with https://github.com/openshift/rook/pull/235. I'll clone for 4.7.z and 4.6.z.

SetUp:
OCP Version: 4.8.0-0.nightly-2021-07-01-185624
OCS Version: ocs-operator.v4.8.0-433.ci
Provider: VMware, type: lso

ceph versions:

```
sh-4.4# ceph versions
{
    "mon": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "mds": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "rgw": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 8
    }
}
```

Test Process:

1. Get the worker node where each mon pod runs:

```
$ oc get pods -o wide | grep mon
rook-ceph-mon-b-6d9f985498-6c75t   2/2   Running   0   25h   10.128.4.11   compute-1   <none>   <none>
rook-ceph-mon-e-784fc9db98-hl5kz   2/2   Running   0   18h   10.131.2.30   compute-0   <none>   <none>
rook-ceph-mon-g-7db748f8c4-h7p9b   2/2   Running   0   17m   10.130.2.27   compute-2   <none>   <none>
```

2. Verify PDB status: allowed_disruptions=1, max_unavailable_mon=1

```
$ oc get pdb
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     25h
rook-ceph-mon-pdb                                 N/A             1                 1                     25h
rook-ceph-osd                                     N/A             1                 1                     14h
```

3. Drain the node where the mon pod runs:

```
$ oc adm drain compute-0 --force=true --ignore-daemonsets --delete-local-data
pod/rook-ceph-mgr-a-98c8b6586-4758q evicted
pod/rook-ceph-osd-1-5ffc4659b6-7tvw8 evicted
pod/rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-556cf965lc28p evicted
pod/rook-ceph-mon-e-784fc9db98-hl5kz evicted
pod/rook-ceph-crashcollector-compute-0-56bb7595-r8nwl evicted
node/compute-0 evicted
```

4. Verify PDB status: allowed_disruptions=0, max_unavailable_mon=1

```
$ oc get pdb
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     25h
rook-ceph-mon-pdb                                 N/A             1                 0                     25h
rook-ceph-osd-host-compute-1                      N/A             0                 0                     111s
rook-ceph-osd-host-compute-2                      N/A             0                 0                     111s
```

5. Verify the number of mon pods remains 3 (observed over 1400 seconds); after the failover timeout, the pending mon e is replaced by a canary pod for the new mon h:

```
$ oc get pods | grep mon
rook-ceph-mon-b-6d9f985498-6c75t          2/2   Running   0   25h
rook-ceph-mon-e-784fc9db98-56ztb          0/2   Pending   0   22m
rook-ceph-mon-g-7db748f8c4-h7p9b          2/2   Running   0   41m
$ oc get pods | grep mon
rook-ceph-mon-b-6d9f985498-6c75t          2/2   Running   0   25h
rook-ceph-mon-g-7db748f8c4-h7p9b          2/2   Running   0   41m
rook-ceph-mon-h-canary-76db9df589-v6vn6   0/2   Pending   0   0s
$ oc get pods | grep mon
rook-ceph-mon-b-6d9f985498-6c75t          2/2   Running   0   25h
rook-ceph-mon-g-7db748f8c4-h7p9b          2/2   Running   0   41m
rook-ceph-mon-h-canary-76db9df589-v6vn6   0/2   Pending   0   2s
```

6. Respin the rook-ceph-operator pod:

```
$ oc get pods | grep rook-ceph-operator
rook-ceph-operator-7cbd4c6dcf-jsscw   1/1   Running   0   25h
$ oc delete pod rook-ceph-operator-7cbd4c6dcf-jsscw
pod "rook-ceph-operator-7cbd4c6dcf-jsscw" deleted
$ oc get pods | grep rook-ceph-operator
rook-ceph-operator-7cbd4c6dcf-8hzr6   1/1   Running   0   7s
```

7. Uncordon the node:

```
$ oc adm uncordon compute-0
node/compute-0 uncordoned
```

8. Wait for the mon and osd pods to reach Running state:

```
$ oc get pods -o wide | grep mon
rook-ceph-mon-b-6d9f985498-6c75t   2/2   Running   0   25h    10.128.4.11   compute-1   <none>   <none>
rook-ceph-mon-e-784fc9db98-2d2xv   2/2   Running   0   5m2s   10.131.2.35   compute-0   <none>   <none>
rook-ceph-mon-g-7db748f8c4-h7p9b   2/2   Running   0   49m    10.130.2.27   compute-2   <none>   <none>
```

9. Verify PDB status: disruptions_allowed=1, max_unavailable_mon=1

```
$ oc get pdb
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     25h
rook-ceph-mon-pdb                                 N/A             1                 1                     25h
rook-ceph-osd-host-compute-1                      N/A             0                 0                     29m
rook-ceph-osd-host-compute-2                      N/A             0                 0                     29m
```

Bug fixed [based on https://bugzilla.redhat.com/show_bug.cgi?id=1955831#c15].

I think the heading should be "unreliable mon quorum"; the rest looks good.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3003
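The PDB checks in steps 2, 4, and 9 above all amount to reading the ALLOWED DISRUPTIONS column from `oc get pdb` for a given budget. Given the AutomationBackLog keyword, a small helper like the following could automate that check; this is a hypothetical sketch (the function name and approach are illustrative, not part of ocs-ci), parsing the plain-text output shown in the steps.

```python
def allowed_disruptions(oc_get_pdb_output: str, pdb_name: str) -> int:
    """Return the ALLOWED DISRUPTIONS value for `pdb_name` from the
    plain-text output of `oc get pdb` (hypothetical helper, assumes the
    default column layout: NAME, MIN AVAILABLE, MAX UNAVAILABLE,
    ALLOWED DISRUPTIONS, AGE)."""
    for line in oc_get_pdb_output.splitlines():
        fields = line.split()
        # Match on the first column; whitespace-split is safe because
        # each of the five columns is a single token (e.g. "N/A", "25h").
        if fields and fields[0] == pdb_name:
            return int(fields[3])
    raise ValueError(f"PDB {pdb_name!r} not found in output")
```

Fed the step-4 output, `allowed_disruptions(output, "rook-ceph-mon-pdb")` would return 0, and after the uncordon in step 9 it would return 1, matching the manual verification above.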