Bug 1955831 - [GSS][mon] rook-operator scales mons to 4 after healthCheck timeout
Summary: [GSS][mon] rook-operator scales mons to 4 after healthCheck timeout
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.6
Hardware: All
OS: All
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: OCS 4.8.0
Assignee: Travis Nielsen
QA Contact: Oded
URL:
Whiteboard:
Depends On:
Blocks: 1959983 1959985
 
Reported: 2021-04-30 22:43 UTC by Randy Martinez
Modified: 2024-06-14 01:25 UTC
CC: 12 users

Fixed In Version: 4.8.0-402.ci
Doc Type: Bug Fix
Doc Text:
.Reliable mon quorum in node drain and mon failover scenarios
Previously, if the operator was restarted during a mon failover, it could erroneously remove the new mon, which put the mon quorum at risk. With this update, the operator restores the in-progress failover state after a restart and properly completes the mon failover. As a result, mon quorum is more reliable in node drain and mon failover scenarios.
Clone Of:
Environment:
Last Closed: 2021-08-03 18:15:57 UTC
Embargoed:
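Regarding the fix in the Doc Text above: the expected mon endpoints that the operator persists can be inspected directly (a sketch only; assumes the default openshift-storage namespace and Rook's usual rook-ceph-mon-endpoints ConfigMap):

$ oc -n openshift-storage get configmap rook-ceph-mon-endpoints -o jsonpath='{.data.data}{"\n"}'
# Expected output is the comma-separated list of mons the operator currently expects,
# e.g. b=<ip>:6789,e=<ip>:6789,g=<ip>:6789, i.e. exactly three entries once a failover completes.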


Links
Github red-hat-storage/ocs-ci pull 4440 (closed): GSS bug, Verify the num of mon pods is 3 when drain node (last updated 2021-07-15 06:10:26 UTC)
Github rook/rook issue 7797 (open): Mon failover can cause mons to fall out of quorum if the operator is disrupted in the middle of the failover (last updated 2021-04-30 22:45:24 UTC)
Github rook/rook pull 7884 (open): ceph: Persist expected mon endpoints immediately during mon failover (last updated 2021-05-11 21:18:54 UTC)
Red Hat Product Errata RHBA-2021:3003 (last updated 2021-08-03 18:16:21 UTC)

Comment 3 Shrivaibavi Raghaventhiran 2021-05-03 07:09:00 UTC
The OCS QE team is following up on this BZ. The GSS and development teams can contact us if any help or information is needed from our end.

Comment 5 Travis Nielsen 2021-05-10 17:12:00 UTC
Acking for 4.8, and we should also clone to 4.7.z after confirming the fix. The side effect of this issue, losing quorum, is too severe, and the workaround of resetting the mon quorum to a single mon for recovery is too involved.

Comment 6 Travis Nielsen 2021-05-11 21:18:54 UTC
The fix is low risk; we should backport it to 4.7.z and 4.6.z.

Comment 8 Travis Nielsen 2021-05-12 18:33:47 UTC
This is merged downstream to 4.8 with https://github.com/openshift/rook/pull/235.
I'll clone for 4.7.z and 4.6.z.

Comment 15 Oded 2021-07-05 09:41:51 UTC
Setup:
OCP Version: 4.8.0-0.nightly-2021-07-01-185624
OCS Version: ocs-operator.v4.8.0-433.ci
Provider: VMware
Type: LSO
ceph versions:
sh-4.4# ceph versions
{
    "mon": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "mds": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "rgw": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 8
    }
}


Test Process:
1. Get the worker nodes where the mon pods run
$ oc get pods -o wide | grep mon
rook-ceph-mon-b-6d9f985498-6c75t                                  2/2     Running     0          25h   10.128.4.11    compute-1   <none>           <none>
rook-ceph-mon-e-784fc9db98-hl5kz                                  2/2     Running     0          18h   10.131.2.30    compute-0   <none>           <none>
rook-ceph-mon-g-7db748f8c4-h7p9b                                  2/2     Running     0          17m   10.130.2.27    compute-2   <none>           <none>


2. Verify PDB status: allowed_disruptions=1, max_unavailable_mon=1 (a scripted check is sketched after the output below)
$ oc get pdb 
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     25h
rook-ceph-mon-pdb                                 N/A             1                 1                     25h
rook-ceph-osd                                     N/A             1                 1                     14h
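The same values can also be read in a scripted way straight from the PDB status (a sketch; field names follow the standard PodDisruptionBudget API):

$ oc get pdb rook-ceph-mon-pdb -o jsonpath='maxUnavailable={.spec.maxUnavailable} disruptionsAllowed={.status.disruptionsAllowed}{"\n"}'
maxUnavailable=1 disruptionsAllowed=1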

3. Drain the node where the mon pod runs
$ oc adm drain compute-0 --force=true --ignore-daemonsets --delete-local-data
pod/rook-ceph-mgr-a-98c8b6586-4758q evicted
pod/rook-ceph-osd-1-5ffc4659b6-7tvw8 evicted
pod/rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-556cf965lc28p evicted
pod/rook-ceph-mon-e-784fc9db98-hl5kz evicted
pod/rook-ceph-crashcollector-compute-0-56bb7595-r8nwl evicted
node/compute-0 evicted

4. Verify PDB status: allowed_disruptions=0, max_unavailable_mon=1
$ oc get pdb
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     25h
rook-ceph-mon-pdb                                 N/A             1                 0                     25h
rook-ceph-osd-host-compute-1                      N/A             0                 0                     111s
rook-ceph-osd-host-compute-2                      N/A             0                 0                     111s

5. Verify that the number of mon pods stays at 3 for 1400 seconds (a scripted version of this check is sketched after the output below)
$ oc get pods | grep mon
rook-ceph-mon-b-6d9f985498-6c75t                                  2/2     Running     0          25h
rook-ceph-mon-e-784fc9db98-56ztb                                  0/2     Pending     0          22m
rook-ceph-mon-g-7db748f8c4-h7p9b                                  2/2     Running     0          41m
[odedviner@localhost auth]$ oc get pods | grep mon
rook-ceph-mon-b-6d9f985498-6c75t                                  2/2     Running     0          25h
rook-ceph-mon-g-7db748f8c4-h7p9b                                  2/2     Running     0          41m
rook-ceph-mon-h-canary-76db9df589-v6vn6                           0/2     Pending     0          0s
[odedviner@localhost auth]$ oc get pods | grep mon
rook-ceph-mon-b-6d9f985498-6c75t                                  2/2     Running     0          25h
rook-ceph-mon-g-7db748f8c4-h7p9b                                  2/2     Running     0          41m
rook-ceph-mon-h-canary-76db9df589-v6vn6                           0/2     Pending     0          2s
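A minimal scripted version of this check (an illustrative sketch only, not the ocs-ci test from PR 4440; assumes the current project is the OCS namespace) could be:

# Poll for ~1400 seconds and fail if the mon pod count (canary pods excluded) ever exceeds 3.
for i in $(seq 1 140); do
    count=$(oc get pods --no-headers | grep '^rook-ceph-mon-' | grep -cv canary)
    if [ "$count" -gt 3 ]; then
        echo "FAIL: $count mon pods found" >&2
        exit 1
    fi
    sleep 10
done
echo "PASS: mon pod count never exceeded 3"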


6. Respin the rook-ceph-operator pod (a label-based alternative is noted after this step)
$ oc get pods | grep rook-ceph-operator
rook-ceph-operator-7cbd4c6dcf-jsscw                               1/1     Running     0          25h

$ oc delete pod rook-ceph-operator-7cbd4c6dcf-jsscw 
pod "rook-ceph-operator-7cbd4c6dcf-jsscw" deleted

$ oc get pods | grep rook-ceph-operator
rook-ceph-operator-7cbd4c6dcf-8hzr6                               1/1     Running     0          7s
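If the exact pod name is not known, the operator pod can also be respun by label (a sketch; assumes the usual app=rook-ceph-operator label):

$ oc delete pod -l app=rook-ceph-operator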

7. Uncordon the node
$ oc adm uncordon compute-0
node/compute-0 uncordoned

8. Wait for the mon and OSD pods to reach Running state (a scripted wait is sketched after the output below)
$ oc get pods -o wide| grep mon
rook-ceph-mon-b-6d9f985498-6c75t                                  2/2     Running     0          25h     10.128.4.11    compute-1   <none>           <none>
rook-ceph-mon-e-784fc9db98-2d2xv                                  2/2     Running     0          5m2s    10.131.2.35    compute-0   <none>           <none>
rook-ceph-mon-g-7db748f8c4-h7p9b                                  2/2     Running     0          49m     10.130.2.27    compute-2   <none>           <none>
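This wait can also be scripted (a sketch; assumes the usual app=rook-ceph-mon and app=rook-ceph-osd labels on the pods):

$ oc wait pod -l app=rook-ceph-mon --for=condition=Ready --timeout=600s
$ oc wait pod -l app=rook-ceph-osd --for=condition=Ready --timeout=600s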

9. Verify PDB status: disruptions_allowed=1, max_unavailable_mon=1
$ oc get pdb
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     25h
rook-ceph-mon-pdb                                 N/A             1                 1                     25h
rook-ceph-osd-host-compute-1                      N/A             0                 0                     29m
rook-ceph-osd-host-compute-2                      N/A             0                 0                     29m

Comment 16 Oded 2021-07-05 09:47:36 UTC
Bug fixed, based on the verification in comment 15 (https://bugzilla.redhat.com/show_bug.cgi?id=1955831#c15).

Comment 18 Mudit Agarwal 2021-07-21 07:20:27 UTC
I think the heading should be "unreliable mon quorum".
The rest looks good.

Comment 21 errata-xmlrpc 2021-08-03 18:15:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3003

