Bug 1955831

Summary: [GSS][mon] rook-operator scales mons to 4 after healthCheck timeout
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Randy Martinez <r.martinez>
Component: rook
Assignee: Travis Nielsen <tnielsen>
Status: CLOSED ERRATA
QA Contact: Oded <oviner>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.6
CC: alitke, assingh, chhudson, hnallurv, kjosy, madam, muagarwa, ocs-bugs, olakra, sraghave, tdesala, tnielsen
Target Milestone: ---
Keywords: AutomationBackLog
Target Release: OCS 4.8.0
Hardware: All
OS: All
Whiteboard:
Fixed In Version: 4.8.0-402.ci
Doc Type: Bug Fix
Doc Text:
.Reliable mon quorum during node drain and mon failover scenarios
Previously, if the operator was restarted during a mon failover, it could erroneously remove the new mon, putting mon quorum at risk. With this update, the operator restores its state when a mon failover is in progress and properly completes the failover after the restart. As a result, mon quorum is more reliable during node drains and mon failovers.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-08-03 18:15:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1959983, 1959985    

Comment 3 Shrivaibavi Raghaventhiran 2021-05-03 07:09:00 UTC
The OCS QE team is following up on this BZ. The GSS and development teams can contact us if any help or information is needed from our end.

Comment 5 Travis Nielsen 2021-05-10 17:12:00 UTC
Acking for 4.8; we should also clone to 4.7.z after confirming the fix. The side effect of this issue, losing quorum, is too severe, and the workaround of resetting the mon quorum to a single mon for recovery is too involved.

Comment 6 Travis Nielsen 2021-05-11 21:18:54 UTC
The fix is low risk; we should backport it to 4.7.z and 4.6.z.

Comment 8 Travis Nielsen 2021-05-12 18:33:47 UTC
This is merged downstream to 4.8 with https://github.com/openshift/rook/pull/235.
I'll clone for 4.7.z and 4.6.z.

Comment 15 Oded 2021-07-05 09:41:51 UTC
Setup:
OCP Version: 4.8.0-0.nightly-2021-07-01-185624
OCS Version: ocs-operator.v4.8.0-433.ci
Provider: VMware
Type: LSO
ceph versions:
sh-4.4# ceph versions
{
    "mon": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "mds": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "rgw": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 8
    }
}


Test Process:
1. Get the worker node on which each mon pod runs (a jsonpath one-liner is sketched after the output)
$ oc get pods -o wide | grep mon
rook-ceph-mon-b-6d9f985498-6c75t                                  2/2     Running     0          25h   10.128.4.11    compute-1   <none>           <none>
rook-ceph-mon-e-784fc9db98-hl5kz                                  2/2     Running     0          18h   10.131.2.30    compute-0   <none>           <none>
rook-ceph-mon-g-7db748f8c4-h7p9b                                  2/2     Running     0          17m   10.130.2.27    compute-2   <none>           <none>
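
A jsonpath one-liner can print the same pod-to-node mapping directly (a minimal sketch; it assumes the default app=rook-ceph-mon label and that the commands run in the storage namespace):
$ oc get pods -l app=rook-ceph-mon \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}'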


2. Verify PDB status: allowed_disruptions=1, max_unavailable_mon=1 (a scripted check is sketched after the output)
$ oc get pdb 
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     25h
rook-ceph-mon-pdb                                 N/A             1                 1                     25h
rook-ceph-osd                                     N/A             1                 1                     14h
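
The same assertion can be scripted against the PDB fields (a minimal sketch; the field names assume the policy/v1 PodDisruptionBudget API and the rook-ceph-mon-pdb object shown above):
$ oc get pdb rook-ceph-mon-pdb \
    -o jsonpath='max_unavailable={.spec.maxUnavailable} allowed_disruptions={.status.disruptionsAllowed}{"\n"}'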

3. Drain the node where one of the mon pods runs (a cordon check is sketched after the output)
$ oc adm drain compute-0 --force=true --ignore-daemonsets --delete-local-data
pod/rook-ceph-mgr-a-98c8b6586-4758q evicted
pod/rook-ceph-osd-1-5ffc4659b6-7tvw8 evicted
pod/rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-556cf965lc28p evicted
pod/rook-ceph-mon-e-784fc9db98-hl5kz evicted
pod/rook-ceph-crashcollector-compute-0-56bb7595-r8nwl evicted
node/compute-0 evicted
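
To confirm the node is cordoned after the drain (a minimal sketch; compute-0 is the node drained above, and spec.unschedulable is expected to report true):
$ oc get node compute-0 -o jsonpath='{.spec.unschedulable}{"\n"}'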

4. Verify PDB status: allowed_disruptions=0, max_unavailable_mon=1
$ oc get pdb
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     25h
rook-ceph-mon-pdb                                 N/A             1                 0                     25h
rook-ceph-osd-host-compute-1                      N/A             0                 0                     111s
rook-ceph-osd-host-compute-2                      N/A             0                 0                     111s

5. Verify that the number of mon pods stays at 3 for 1400 seconds (a polling loop is sketched after the output)
$ oc get pods | grep mon
rook-ceph-mon-b-6d9f985498-6c75t                                  2/2     Running     0          25h
rook-ceph-mon-e-784fc9db98-56ztb                                  0/2     Pending     0          22m
rook-ceph-mon-g-7db748f8c4-h7p9b                                  2/2     Running     0          41m
[odedviner@localhost auth]$ oc get pods | grep mon
rook-ceph-mon-b-6d9f985498-6c75t                                  2/2     Running     0          25h
rook-ceph-mon-g-7db748f8c4-h7p9b                                  2/2     Running     0          41m
rook-ceph-mon-h-canary-76db9df589-v6vn6                           0/2     Pending     0          0s
[odedviner@localhost auth]$ oc get pods | grep mon
rook-ceph-mon-b-6d9f985498-6c75t                                  2/2     Running     0          25h
rook-ceph-mon-g-7db748f8c4-h7p9b                                  2/2     Running     0          41m
rook-ceph-mon-h-canary-76db9df589-v6vn6                           0/2     Pending     0          2s
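
Rather than re-running oc get pods by hand, the mon count can be polled in a small loop (a minimal sketch; it counts mon pods in any phase, excludes the transient canary pods by name, and checks every 30 seconds for roughly 1400 seconds):
# 47 iterations x 30 s ≈ 1400 s
for i in $(seq 1 47); do
    echo "$(date +%T) mon pods: $(oc get pods --no-headers | grep rook-ceph-mon | grep -cv canary)"
    sleep 30
done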


6. Respin the rook-ceph-operator pod (a label-based alternative is sketched after the output)
$ oc get pods | grep rook-ceph-operator
rook-ceph-operator-7cbd4c6dcf-jsscw                               1/1     Running     0          25h

$ oc delete pod rook-ceph-operator-7cbd4c6dcf-jsscw 
pod "rook-ceph-operator-7cbd4c6dcf-jsscw" deleted

$ oc get pods | grep rook-ceph-operator
rook-ceph-operator-7cbd4c6dcf-8hzr6                               1/1     Running     0          7s
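
The operator pod can also be respun by label instead of by name (a minimal sketch; it assumes the default app=rook-ceph-operator label):
$ oc delete pod -l app=rook-ceph-operator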

7. Uncordon the node
$ oc adm uncordon compute-0
node/compute-0 uncordoned

8. Wait for the mon and osd pods to reach Running state (an oc wait sketch follows the output)
$ oc get pods -o wide| grep mon
rook-ceph-mon-b-6d9f985498-6c75t                                  2/2     Running     0          25h     10.128.4.11    compute-1   <none>           <none>
rook-ceph-mon-e-784fc9db98-2d2xv                                  2/2     Running     0          5m2s    10.131.2.35    compute-0   <none>           <none>
rook-ceph-mon-g-7db748f8c4-h7p9b                                  2/2     Running     0          49m     10.130.2.27    compute-2   <none>           <none>
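
The wait can be automated with oc wait (a minimal sketch; it assumes the default app=rook-ceph-mon and app=rook-ceph-osd labels and a generous timeout):
$ oc wait pod -l app=rook-ceph-mon --for=condition=Ready --timeout=600s
$ oc wait pod -l app=rook-ceph-osd --for=condition=Ready --timeout=600s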

9. Verify PDB status: disruptions_allowed=1, max_unavailable_mon=1
$ oc get pdb
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     25h
rook-ceph-mon-pdb                                 N/A             1                 1                     25h
rook-ceph-osd-host-compute-1                      N/A             0                 0                     29m
rook-ceph-osd-host-compute-2                      N/A             0                 0                     29m

Comment 16 Oded 2021-07-05 09:47:36 UTC
Bug fixed [based on https://bugzilla.redhat.com/show_bug.cgi?id=1955831#c15].

Comment 18 Mudit Agarwal 2021-07-21 07:20:27 UTC
I think the heading should be "unreliable mon quorum".
The rest looks good.

Comment 21 errata-xmlrpc 2021-08-03 18:15:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3003