Bug 1959964

Summary:	When a node is being drained, increase the mon failover timeout to prevent unnecessary mon failover
Product:	[Red Hat Storage] Red Hat OpenShift Container Storage	Reporter:	Travis Nielsen <tnielsen>
Component:	rook	Assignee:	Santosh Pillai <sapillai>
Status:	CLOSED ERRATA	QA Contact:	Shrivaibavi Raghaventhiran <sraghave>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.7	CC:	ebenahar, madam, muagarwa, nberry, ocs-bugs, ratamir
Target Milestone:	---	Keywords:	AutomationBackLog
Target Release:	OCS 4.8.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-08-03 18:16:11 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Travis Nielsen 2021-05-12 17:59:28 UTC

Description of problem (please be as detailed as possible and provide log
snippets):

During a node drain, a mon is expected to be down for 10-15 minutes while the node restarts. Because the default mon failover timeout is 10 minutes, this frequently triggers an unnecessary mon failover, even though the mon is very likely to come back online soon.

When Rook detects a node drain, the mon failover timeout will be doubled automatically to 20 minutes.
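
As a rough illustration of the behavior described above (not Rook's actual code, which is written in Go), the timeout selection could be sketched as follows. The function and constant names are hypothetical; only the 10-minute default and the doubling during a drain come from this report, and the exact boundary comparison is an assumption.

```python
from datetime import timedelta

# From this report: the default mon failover timeout is 10 minutes.
DEFAULT_MON_FAILOVER_TIMEOUT = timedelta(minutes=10)
# From this report: the timeout is doubled while a node drain is detected.
# The constant name is hypothetical.
DRAIN_TIMEOUT_MULTIPLIER = 2


def effective_failover_timeout(node_draining, base=DEFAULT_MON_FAILOVER_TIMEOUT):
    """Return the timeout to apply to a mon that is currently down."""
    if node_draining:
        return base * DRAIN_TIMEOUT_MULTIPLIER
    return base


def should_fail_over(mon_down_for, node_draining):
    """Decide whether a down mon has exceeded the applicable timeout.

    Using >= at the boundary is an assumption made for this sketch.
    """
    return mon_down_for >= effective_failover_timeout(node_draining)
```

With these assumptions, a mon that has been down for 15 minutes would be failed over on an ordinary node but left alone on a node that is being drained.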


Version of all relevant components (if applicable):

Affects all versions of OCS


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

No


Is there any workaround available to the best of your knowledge?

No, since the mon timeout cannot be adjusted in OCS. See https://bugzilla.redhat.com/show_bug.cgi?id=1955834

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3

Is this issue reproducible?

Yes

Can this issue be reproduced from the UI?

No

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install OCS
2. Drain a node where a mon is running and keep it down for more than 10 minutes
3. Observe that the operator creates a new mon because the old mon has not come back up.


Actual results:

Mon failover is triggered

Expected results:

Mon failover should not be triggered during a node drain for up to 20 minutes.

Additional info:

Comment 1 Travis Nielsen 2021-05-12 18:00:38 UTC

This has already been fixed in OCS 4.8 with https://github.com/rook/rook/pull/7801; this BZ is just for tracking.

Comment 2 Neha Berry 2021-05-21 16:22:49 UTC

This bug was ON_QA even without QA_ACK :) 

Added it now.

Thanks for fixing the bug, Travis.

Apart from node drain, please let us know which other scenarios we need to cover for a complete verification of this bug. For example, we should test OCP upgrade too, right?

Comment 5 Santosh Pillai 2021-05-24 05:48:32 UTC

(In reply to Neha Berry from comment #2)

> Apart from node drain, please let us know which other scenarios we
> need to cover for a complete verification of this bug. For example, we
> should test OCP upgrade too, right?

The default mon failover timeout is 10 minutes; that is, mon failover starts if the mon is still down after 10 minutes. This fix doubles the failover timeout while a node is being drained, so mon failover only takes place after 20 minutes.

Testing should include:
- Drained node is not uncordoned --> ensure that mon failover starts only after 20 minutes.
- Drained node is uncordoned between ~10-20 minutes --> ensure that mon failover starts.
- Drained node is uncordoned before ~10 minutes --> no mon failover should start.
- Test above scenarios for at least 2-3 mons.
- Test after updating the default mon failover timeout from OCS (if that's possible)
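
The first three scenarios above can be written down as a small expectation table for a test harness. This is only a sketch of how the list reads, not a description of Rook's implementation: the function name is hypothetical, the boundaries are approximate (the list itself says "~10" and "~20"), and the 10-20 minute case assumes that failover starts at roughly the time of the uncordon because the doubled timeout no longer applies once the drain ends.

```python
BASE_TIMEOUT_MIN = 10      # default mon failover timeout, from this report
DRAIN_TIMEOUT_MIN = 20     # doubled timeout during a node drain, from this report


def expected_failover_minute(uncordon_at_min):
    """Return the minute (counted from the drain) at which mon failover is
    expected to start, or None if no failover is expected.

    uncordon_at_min is None when the node is never uncordoned.
    """
    # Node stays drained past the doubled timeout: failover only after 20 minutes.
    if uncordon_at_min is None or uncordon_at_min > DRAIN_TIMEOUT_MIN:
        return DRAIN_TIMEOUT_MIN
    # Uncordoned between ~10 and ~20 minutes: failover starts.
    # Returning the uncordon time is an assumption of this sketch.
    if uncordon_at_min >= BASE_TIMEOUT_MIN:
        return uncordon_at_min
    # Uncordoned before ~10 minutes: no failover should start.
    return None
```

A verification run could compare the observed failover time for each scenario against this table, allowing some slack for the operator's health-check interval.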

Yes, OCP upgrade should be tested as well. A regression test should suffice.

Comment 7 Shrivaibavi Raghaventhiran 2021-07-13 18:04:36 UTC

Environment:
------------

RHCOS VMWARE 3M 3W

Version:
----------
OCS - ocs-operator.v4.8.0-430.ci (also ran a few tests on 4.8.0-rc1)
OCP - 4.8.0-0.nightly-2021-06-25-182927

Testcases:
-----------

1. Perform a single node drain and wait for mon failover to happen
a. Less than 10 mins: no failover observed
b. Between 10-20 mins: no failover observed
c. More than 20 mins: failover observed
d. Uncordon the node and check the mon status
e. Let all mons start running; no error in ceph health is observed.

2. Perform a single node drain and restart the rook-ceph operator during mon failover
a. Wait for >= 20 mins and restart the rook-ceph operator
b. Uncordon the drained node after 20 mins
c. Check that the mons are running in a healthy state; no mons should be in pending state after the node recovers

3. Delete the mon deployment
a. Create the cluster, wait for it to be initially deployed
b. Scale down a mon (e.g. mon-a) so it falls out of quorum
c. Wait for the mon failover to be initiated (10 min)
d. As soon as the new mon is created and before the bad mon deployment (rook-ceph-mon-a) is deleted, restart the operator
e. All 3 mons will be running

All tests passed. Moving the BZ to VERIFIED state.

Comment 9 errata-xmlrpc 2021-08-03 18:16:11 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3003