Bug 1955831 - [GSS][mon] rook-operator scales mons to 4 after healthCheck timeout
Summary: [GSS][mon] rook-operator scales mons to 4 after healthCheck timeout
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.6
Hardware: All
OS: All
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: OCS 4.8.0
Assignee: Travis Nielsen
QA Contact: Oded
URL:
Whiteboard:
Depends On:
Blocks: 1959983 1959985
 
Reported: 2021-04-30 22:43 UTC by Randy Martinez
Modified: 2024-06-14 01:25 UTC
CC: 12 users

Fixed In Version: 4.8.0-402.ci
Doc Type: Bug Fix
Doc Text:
.Reliable mon quorum in node drain and mon failover scenarios
Previously, if the operator was restarted during a mon failover, it could erroneously remove the new mon, which put the mon quorum at risk. With this update, the operator restores the in-progress failover state after a restart and properly completes the mon failover. As a result, mon quorum is more reliable in node drain and mon failover scenarios.
Clone Of:
Environment:
Last Closed: 2021-08-03 18:15:57 UTC
Embargoed:
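Regarding the fix in the Doc Text above: the expected mon endpoints that the operator persists can be inspected directly (a sketch only; assumes the default openshift-storage namespace and Rook's usual rook-ceph-mon-endpoints ConfigMap):

$ oc -n openshift-storage get configmap rook-ceph-mon-endpoints -o jsonpath='{.data.data}{"\n"}'
# Expected output is the comma-separated list of mons the operator currently expects,
# e.g. b=<ip>:6789,e=<ip>:6789,g=<ip>:6789, i.e. exactly three entries once a failover completes.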


Links
Github red-hat-storage/ocs-ci pull 4440 (closed): GSS bug, Verify the num of mon pods is 3 when drain node (last updated 2021-07-15 06:10:26 UTC)
Github rook/rook issue 7797 (open): Mon failover can cause mons to fall out of quorum if the operator is disrupted in the middle of the failover (last updated 2021-04-30 22:45:24 UTC)
Github rook/rook pull 7884 (open): ceph: Persist expected mon endpoints immediately during mon failover (last updated 2021-05-11 21:18:54 UTC)
Red Hat Product Errata RHBA-2021:3003 (last updated 2021-08-03 18:16:21 UTC)

Comment 3 Shrivaibavi Raghaventhiran 2021-05-03 07:09:00 UTC
The OCS QE team is following up on this BZ. The GSS and development teams can contact us if any help or information is needed from our end.

Comment 5 Travis Nielsen 2021-05-10 17:12:00 UTC
Acking for 4.8, and we should also clone to 4.7.z after confirming the fix. The side effect of this issue, losing quorum, is too severe, and the workaround of resetting the mon quorum to a single mon for recovery is too involved.

Comment 6 Travis Nielsen 2021-05-11 21:18:54 UTC
The fix is low risk; we should backport it to 4.7.z and 4.6.z.

Comment 8 Travis Nielsen 2021-05-12 18:33:47 UTC
This is merged downstream to 4.8 with https://github.com/openshift/rook/pull/235.
I'll clone for 4.7.z and 4.6.z.

Comment 15 Oded 2021-07-05 09:41:51 UTC
Setup:
OCP Version: 4.8.0-0.nightly-2021-07-01-185624
OCS Version: ocs-operator.v4.8.0-433.ci
Provider: VMware
Type: LSO
ceph versions:
sh-4.4# ceph versions
{
    "mon": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "mds": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "rgw": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 8
    }
}


Test Process:
1. Get the worker nodes where the mon pods run
$ oc get pods -o wide | grep mon
rook-ceph-mon-b-6d9f985498-6c75t                                  2/2     Running     0          25h   10.128.4.11    compute-1   <none>           <none>
rook-ceph-mon-e-784fc9db98-hl5kz                                  2/2     Running     0          18h   10.131.2.30    compute-0   <none>           <none>
rook-ceph-mon-g-7db748f8c4-h7p9b                                  2/2     Running     0          17m   10.130.2.27    compute-2   <none>           <none>


2. Verify PDB status: allowed_disruptions=1, max_unavailable_mon=1 (a scripted check is sketched after the output below)
$ oc get pdb 
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     25h
rook-ceph-mon-pdb                                 N/A             1                 1                     25h
rook-ceph-osd                                     N/A             1                 1                     14h
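The same values can also be read in a scripted way straight from the PDB status (a sketch; field names follow the standard PodDisruptionBudget API):

$ oc get pdb rook-ceph-mon-pdb -o jsonpath='maxUnavailable={.spec.maxUnavailable} disruptionsAllowed={.status.disruptionsAllowed}{"\n"}'
maxUnavailable=1 disruptionsAllowed=1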

3. Drain the node where the mon pod runs
$ oc adm drain compute-0 --force=true --ignore-daemonsets --delete-local-data
pod/rook-ceph-mgr-a-98c8b6586-4758q evicted
pod/rook-ceph-osd-1-5ffc4659b6-7tvw8 evicted
pod/rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-556cf965lc28p evicted
pod/rook-ceph-mon-e-784fc9db98-hl5kz evicted
pod/rook-ceph-crashcollector-compute-0-56bb7595-r8nwl evicted
node/compute-0 evicted

4. Verify PDB status: allowed_disruptions=0, max_unavailable_mon=1
$ oc get pdb
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     25h
rook-ceph-mon-pdb                                 N/A             1                 0                     25h
rook-ceph-osd-host-compute-1                      N/A             0                 0                     111s
rook-ceph-osd-host-compute-2                      N/A             0                 0                     111s

5. Verify that the number of mon pods stays at 3 for 1400 seconds (a scripted version of this check is sketched after the output below)
$ oc get pods | grep mon
rook-ceph-mon-b-6d9f985498-6c75t                                  2/2     Running     0          25h
rook-ceph-mon-e-784fc9db98-56ztb                                  0/2     Pending     0          22m
rook-ceph-mon-g-7db748f8c4-h7p9b                                  2/2     Running     0          41m
[odedviner@localhost auth]$ oc get pods | grep mon
rook-ceph-mon-b-6d9f985498-6c75t                                  2/2     Running     0          25h
rook-ceph-mon-g-7db748f8c4-h7p9b                                  2/2     Running     0          41m
rook-ceph-mon-h-canary-76db9df589-v6vn6                           0/2     Pending     0          0s
[odedviner@localhost auth]$ oc get pods | grep mon
rook-ceph-mon-b-6d9f985498-6c75t                                  2/2     Running     0          25h
rook-ceph-mon-g-7db748f8c4-h7p9b                                  2/2     Running     0          41m
rook-ceph-mon-h-canary-76db9df589-v6vn6                           0/2     Pending     0          2s
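A minimal scripted version of this check (an illustrative sketch only, not the ocs-ci test from PR 4440; assumes the current project is the OCS namespace) could be:

# Poll for ~1400 seconds and fail if the mon pod count (canary pods excluded) ever exceeds 3.
for i in $(seq 1 140); do
    count=$(oc get pods --no-headers | grep '^rook-ceph-mon-' | grep -cv canary)
    if [ "$count" -gt 3 ]; then
        echo "FAIL: $count mon pods found" >&2
        exit 1
    fi
    sleep 10
done
echo "PASS: mon pod count never exceeded 3"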


6. Respin the rook-ceph-operator pod (a label-based alternative is noted after this step)
$ oc get pods | grep rook-ceph-operator
rook-ceph-operator-7cbd4c6dcf-jsscw                               1/1     Running     0          25h

$ oc delete pod rook-ceph-operator-7cbd4c6dcf-jsscw 
pod "rook-ceph-operator-7cbd4c6dcf-jsscw" deleted

$ oc get pods | grep rook-ceph-operator
rook-ceph-operator-7cbd4c6dcf-8hzr6                               1/1     Running     0          7s
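If the exact pod name is not known, the operator pod can also be respun by label (a sketch; assumes the usual app=rook-ceph-operator label):

$ oc delete pod -l app=rook-ceph-operator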

7. Uncordon the node
$ oc adm uncordon compute-0
node/compute-0 uncordoned

8. Wait for the mon and OSD pods to reach Running state (a scripted wait is sketched after the output below)
$ oc get pods -o wide| grep mon
rook-ceph-mon-b-6d9f985498-6c75t                                  2/2     Running     0          25h     10.128.4.11    compute-1   <none>           <none>
rook-ceph-mon-e-784fc9db98-2d2xv                                  2/2     Running     0          5m2s    10.131.2.35    compute-0   <none>           <none>
rook-ceph-mon-g-7db748f8c4-h7p9b                                  2/2     Running     0          49m     10.130.2.27    compute-2   <none>           <none>
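This wait can also be scripted (a sketch; assumes the usual app=rook-ceph-mon and app=rook-ceph-osd labels on the pods):

$ oc wait pod -l app=rook-ceph-mon --for=condition=Ready --timeout=600s
$ oc wait pod -l app=rook-ceph-osd --for=condition=Ready --timeout=600s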

9. Verify PDB status: disruptions_allowed=1, max_unavailable_mon=1
$ oc get pdb
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     25h
rook-ceph-mon-pdb                                 N/A             1                 1                     25h
rook-ceph-osd-host-compute-1                      N/A             0                 0                     29m
rook-ceph-osd-host-compute-2                      N/A             0                 0                     29m

Comment 16 Oded 2021-07-05 09:47:36 UTC
Bug fixed, based on the verification in comment 15 (https://bugzilla.redhat.com/show_bug.cgi?id=1955831#c15).

Comment 18 Mudit Agarwal 2021-07-21 07:20:27 UTC
I think the heading should be "unreliable mon quorum".
The rest looks good.

Comment 21 errata-xmlrpc 2021-08-03 18:15:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3003

