Bug 1959976 - When a node is being drained, increase the mon failover timeout to prevent unnecessary mon failover
Summary: When a node is being drained, increase the mon failover timeout to prevent un...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: OCS 4.7.1
Assignee: Santosh Pillai
QA Contact: Shrivaibavi Raghaventhiran
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-12 18:23 UTC by Travis Nielsen
Modified: 2021-06-15 16:50 UTC (History)
4 users

Fixed In Version: 4.7.1-403.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-15 16:50:37 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift rook pull 240 0 None open bug 1959976: ceph: retry once before mon failover if mon pod is unscheduled 2021-05-17 08:51:12 UTC
Red Hat Product Errata RHBA-2021:2449 0 None None None 2021-06-15 16:50:53 UTC

Description Travis Nielsen 2021-05-12 18:23:34 UTC
This bug was initially created as a copy of Bug #1959964

I am copying this bug because: 



Description of problem (please be as detailed as possible and provide log
snippets):

During node drain it is expected for a mon to be down for 10-15 minutes while the node is being restarted. Since the default mon failover timeout is 10 minutes, this frequently leads to a mon failover that is unnecessary since the mon is very likely to come back online soon.

When Rook detects a node drain, the failover timeout will automatically be doubled to 20 minutes.


Version of all relevant components (if applicable):

Affects all versions of OCS


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

No


Is there any workaround available to the best of your knowledge?

No, since the mon timeout cannot be adjusted in OCS. See https://bugzilla.redhat.com/show_bug.cgi?id=1955834
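For reference, upstream Rook exposes this timeout on the CephCluster CR under `spec.healthCheck.daemonHealth.mon` roughly as below. In OCS the CephCluster resource is managed by the ocs-operator, so this field cannot be overridden by the user (per the linked BZ); the snippet is illustrative of the upstream schema.

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  healthCheck:
    daemonHealth:
      mon:
        disabled: false
        # How long a mon may be out of quorum before the operator
        # fails it over. Not user-adjustable in OCS (see BZ 1955834).
        timeout: 10m
```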

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3

Is this issue reproducible?

Yes

Can this issue be reproduced from the UI?

No

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install OCS
2. Drain a node where a mon is running and keep it down for more than 10 minutes
3. Observe the operator create a new mon, since the old mon is not coming back up.


Actual results:

Mon failover is triggered

Expected results:

Mon failover should not be triggered during a node drain until 20 minutes have elapsed.

Additional info:

Comment 1 Travis Nielsen 2021-05-12 18:25:10 UTC
Santosh please open a backport PR from release-1.5 to downstream release-4.7.

Comment 8 Shrivaibavi Raghaventhiran 2021-06-02 12:39:57 UTC
Environment:
------------

RHCOS VMWARE 3M 3W

Version:
----------

OCS - ocs-operator.v4.7.1-403.ci
OCP - 4.7.12

Testcases:
-----------

1. Perform a single node drain and wait for mon failover to happen
a. Less than 10 mins: no failover observed
b. Between 10-20 mins: no failover observed
c. More than 20 mins: failover observed
d. Uncordon the node and check the mon status
e. All mons start running and no error in ceph health is observed. 

2. Perform a single node drain and restart the rook-ceph operator while mon failover is in progress
a. Wait for >= 20 mins and restart the rook-ceph operator
b. Uncordon the drained node after 20 mins
c. Check that the mons are running in a healthy state; no mons should be observed in Pending state post recovery of the node

3. Delete the mon deployment
a. Create the cluster, wait for it to be initially deployed
b. Scale down a mon (e.g. mon-a) so it falls out of quorum
c. Wait for the mon failover to be initiated (10 min)
d. As soon as the new mon is created and before the bad mon deployment (rook-ceph-mon-a) is deleted, restart the operator
e. All 3 mons will be running

Performed all 3 of the above testcases on OCS 4.6.5 and OCS 4.7.1. Just curious: sometimes we see the mon-canary pod respinning and sometimes not, as it takes the same set of names again ("mon-a, mon-b and mon-c") when we expect "mon-a, mon-b and mon-d".

There is no impact functionality-wise, just that sometimes we see the canary pod respinning and sometimes not. Please clarify.

In OCS 4.7.1 we see canary pods 1/3 times, and 1/7 times in OCS 4.6.5.

rook-ceph-mon-a-59c5cfc549-x9xks                                  2/2     Running     0          100m   10.129.2.241   compute-2   <none>           <none>
rook-ceph-mon-b-687b765c8f-t642l                                  2/2     Running     0          18h    10.131.0.13    compute-1   <none>           <none>
rook-ceph-mon-d-canary-786fbb44db-m2x5c                           0/2     Pending     0          88s    <none>         <none>      <none>           <none>


Summary:
-------

Functionality-wise there is no impact on the cluster; all mons are up and running post recovery. The Ceph cluster was accessible throughout. Most importantly, the mons did not lose quorum.

Comment 9 Santosh Pillai 2021-06-03 07:37:49 UTC
(In reply to Shrivaibavi Raghaventhiran from comment #8)

> Just curious sometimes we see mon-canary pod respinning and sometimes not as


I looked into the Rook 4.6-100.9bbe471 and Rook 4.7-141.c7a26ab rook logs. Both show the same behavior: Rook is trying to start the `mon-d-canary` pod on both clusters. So there is no evidence of mon-canary pods not respinning. 

> it takes the same set of names again "mon-a mon-b and mon-c" when we expect
> "mon-a mon-b and mon-d"

In both the 4.6 and 4.7 logs, Rook tried to fail over to mon-d. 
 1) Rook retries the creation of the mon-d-canary pod for about 3 minutes. But the mon-d-canary pod didn't start because the node was drained and no other nodes were available. 
 2) So Rook falls back and tries to start mon-c again (again with a timeout of 20 minutes). 

If the drained node was uncordoned during [1] (see above), then mon-d will be created.
If the drained node was uncordoned during [2] (see above), then mon-c will be re-used. 

The above observations were made after reviewing the logs. 
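The two outcomes described above can be modeled as a small decision function. This is a sketch of the observed behavior only; `pickMonToStart` and the bounded retry loop are illustrative assumptions, not Rook's actual function names or retry mechanics.

```go
package main

import "fmt"

// pickMonToStart models the canary fallback: retry the new mon's canary
// pod for a bounded number of attempts (roughly the 3-minute window
// mentioned above). If it gets scheduled during that window, the new mon
// (mon-d) is created; otherwise Rook falls back to re-using the old mon
// (mon-c) with a fresh failover timeout.
func pickMonToStart(newMon, oldMon string, canaryScheduled func() bool, retries int) string {
	for i := 0; i < retries; i++ {
		if canaryScheduled() {
			return newMon // node uncordoned during the retry window: mon-d is created
		}
	}
	return oldMon // node still drained after all retries: mon-c is re-used
}

func main() {
	// Node stays drained for the whole retry window.
	stillDrained := func() bool { return false }
	fmt.Println(pickMonToStart("mon-d", "mon-c", stillDrained, 3)) // mon-c
}
```

This explains why the resulting mon set differs from run to run: it depends only on when the node is uncordoned relative to the canary retry window.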

> Summary:
> -------
> 
> Functionality wise no impact on the cluster, All mons are up and running
> post recovery. Ceph cluster was accessible throughout. Most importantly mons
> did not lose quorum.

Comment 10 Shrivaibavi Raghaventhiran 2021-06-03 09:11:44 UTC
Considering comment 8 and comment 9, moving this BZ to Verified state.

Comment 15 errata-xmlrpc 2021-06-15 16:50:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.7.1 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2449

