Bug 1959985 - [GSS][mon] rook-operator scales mons to 4 after healthCheck timeout
Summary: [GSS][mon] rook-operator scales mons to 4 after healthCheck timeout
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.7.1
Assignee: Travis Nielsen
QA Contact: Shrivaibavi Raghaventhiran
URL:
Whiteboard:
Depends On: 1955831
Blocks:
 
Reported: 2021-05-12 18:37 UTC by Travis Nielsen
Modified: 2024-06-14 01:30 UTC
CC List: 9 users

Fixed In Version: 4.7.1-403.ci
Doc Type: Bug Fix
Doc Text:
Previously, if the operator was restarted during a mon failover, it could erroneously remove the new mon and put the mon quorum at risk. With this update, the operator restores the in-progress failover state after a restart and completes the mon failover properly, making the mon quorum more reliable during node drains and mon failovers.
Clone Of:
Environment:
Last Closed: 2021-06-15 16:50:37 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift rook pull 236 0 None open Bug 1959985: ceph: Persist expected mon endpoints immediately during mon failover 2021-05-13 18:32:59 UTC
Github red-hat-storage ocs-ci pull 4440 0 None closed GSS bug,Verify the num of mon pods is 3 when drain node 2021-07-15 05:51:50 UTC
Red Hat Product Errata RHBA-2021:2449 0 None None None 2021-06-15 16:50:53 UTC

Comment 8 Shrivaibavi Raghaventhiran 2021-06-03 11:22:02 UTC
Platform:
----------

Vmware 3M 3W RHCOS cluster

Versions:
----------

OCP - 4.7.12
OCS - ocs-operator.v4.7.1-403.ci

Testcases Executed:
----------------------

1. Drain a single node and restart the rook-ceph operator while the mon failover is in progress
a. Wait >= 20 minutes, then restart the rook-ceph operator
b. Uncordon the drained node after 20 minutes
c. Check that the mons are running and healthy, and that no mon is left in Pending state after the node recovers
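The drain-and-restart check above can be sketched as follows. The oc commands require a live cluster and are shown as comments; the node and namespace names are assumptions for a default OCS install, not taken from the bug. The pod-status check itself is plain shell and runs against captured sample output:

```shell
# Cluster-side steps (require a live OpenShift cluster; names are assumed):
#   oc adm drain <node> --ignore-daemonsets --delete-emptydir-data --force
#   # wait >= 20 min, then restart the operator mid-failover:
#   oc -n openshift-storage delete pod -l app=rook-ceph-operator
#   oc adm uncordon <node>

# check_mons: reads `oc get pods --no-headers` lines on stdin and reports
# how many rook-ceph-mon pods exist and how many are not Running.
check_mons() {
  awk '/^rook-ceph-mon-/ {
         total++
         if ($3 != "Running") bad++
       }
       END { printf "mons=%d not_running=%d\n", total + 0, bad + 0 }'
}

# Demo with sample output; on a live cluster you would instead pipe in:
#   oc -n openshift-storage get pods -l app=rook-ceph-mon --no-headers
result=$(printf '%s\n' \
  'rook-ceph-mon-a-6f8b   2/2   Running   0   3h' \
  'rook-ceph-mon-b-7c9d   2/2   Running   0   3h' \
  'rook-ceph-mon-c-5d4e   2/2   Running   0   3h' | check_mons)
echo "$result"
```

A healthy recovery reports three mons with none outside the Running state.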

2. Scale down a mon deployment and restart the operator during the failover
a. Create the cluster and wait for it to be fully deployed
b. Scale down a mon (e.g. mon-a) so it falls out of quorum
c. Wait for the mon failover to be initiated (~10 min)
d. As soon as the new mon is created, and before the bad mon deployment
(rook-ceph-mon-a) is deleted, restart the operator
e. Verify that all 3 mons are running
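The scale-down/restart sequence above can be sketched with oc; the deployment and namespace names are assumptions for a default install, and mon-d as the replacement name is only an illustration (Rook assigns the next unused letter). The cluster commands are shown as comments, and the end-state check runs against sample output:

```shell
NS=openshift-storage   # assumed namespace for a default OCS install

# Steps (b)-(d), requiring a live cluster:
#   oc -n "$NS" scale deployment rook-ceph-mon-a --replicas=0   # (b)
#   # (c) wait ~10 min for the operator to begin the failover
#   oc -n "$NS" delete pod -l app=rook-ceph-operator            # (d)

# (e) Expected end state: still exactly three mon deployments, with mon-a
# replaced by a new mon (mon-d assumed here). Checked against sample output
# of `oc -n "$NS" get deployments -l app=rook-ceph-mon -o name`:
deployments='rook-ceph-mon-b
rook-ceph-mon-c
rook-ceph-mon-d'
count=$(printf '%s\n' "$deployments" | grep -c '^rook-ceph-mon-')
echo "mon deployments: $count"
```

The bug would have shown up here as a fourth mon deployment, or as the new mon being removed and leaving only two.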

Restarted the rook-ceph operator right during the failover.
I never saw 4 mons; I always saw 3 mons, and no mons were stuck in Pending state. It is working as expected.

@santosh let me know if this BZ needs any more verifications or if it can be moved to verified state

Comment 9 Santosh Pillai 2021-06-03 15:02:41 UTC
(In reply to Shrivaibavi Raghaventhiran from comment #8)

> @santosh let me know if this BZ needs any more verifications or if it can be
> moved to verified state

Looks good to me. Can be moved to verified state.

Comment 10 Shrivaibavi Raghaventhiran 2021-06-03 16:04:22 UTC
Based on comment 8 and comment 9, moving the BZ to the verified state

Comment 12 Travis Nielsen 2021-06-08 17:34:01 UTC
The needinfo requests were answered separately, between comments 7 and 8.

Comment 16 errata-xmlrpc 2021-06-15 16:50:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.7.1 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2449

