Bug 1959983 - [GSS][mon] rook-operator scales mons to 4 after healthCheck timeout
Summary: [GSS][mon] rook-operator scales mons to 4 after healthCheck timeout
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.6.5
Assignee: Travis Nielsen
QA Contact: Shrivaibavi Raghaventhiran
URL:
Whiteboard:
Depends On: 1955831
Blocks:
 
Reported: 2021-05-12 18:36 UTC by Travis Nielsen
Modified: 2024-06-14 01:30 UTC
CC List: 7 users

Fixed In Version: 4.6.5-411.ci
Doc Type: Bug Fix
Doc Text:
Previously, the mon quorum was at risk, as the operator could erroneously remove the new mon if the operator was restarted during a mon failover. With this update, the operator completes the same mon failover after the operator is restarted, and hence the mon quorum is more reliable in the node drains and mon failover scenarios.
Clone Of:
Environment:
Last Closed: 2021-06-17 15:46:46 UTC
Embargoed:




Links
- Github openshift/rook pull 237 (open): Bug 1959983: ceph: Persist expected mon endpoints immediately during mon failover (last updated 2021-05-13 18:32:39 UTC)
- Github red-hat-storage/ocs-ci pull 4440 (closed): GSS bug, Verify the num of mon pods is 3 when drain node (last updated 2021-07-15 06:00:48 UTC)
- Red Hat Product Errata RHSA-2021:2479 (last updated 2021-06-17 15:46:53 UTC)

Comment 1 Elad 2021-05-13 08:01:39 UTC
Hi Travis, following the fix for bug 1959980, the steps to reproduce will change to a 20-minute wait instead of 10, right?

Comment 3 Travis Nielsen 2021-05-13 13:18:11 UTC
Correct, 20 minutes for mon failover if a node drain is detected.
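
For context, the timeout that drives mon failover can be inspected on the CephCluster CR. A minimal sketch, assuming the default openshift-storage namespace and the OCS CephCluster name ocs-storagecluster-cephcluster (both assumptions; adjust for your deployment), and using the healthCheck.daemonHealth.mon field path from the upstream Rook CRD:

# Show the mon health settings; the timeout is how long the operator
# waits on an unhealthy mon before starting failover (names assumed above)
$ oc -n openshift-storage get cephcluster ocs-storagecluster-cephcluster \
    -o jsonpath='{.spec.healthCheck.daemonHealth.mon}{"\n"}'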

Comment 4 Travis Nielsen 2021-05-13 14:46:34 UTC
Elad, correct: after the related BZ, the mon failover during a node drain will take 20 minutes. However, a node drain is not necessary to repro this issue. These repro steps would be sufficient (see the command sketch after the list):
1. Create the cluster, wait for it to be initially deployed
2. Scale down a mon (e.g. mon-a) so it falls out of quorum
3. Wait for the mon failover to be initiated (10 min)
4. As soon as the new mon (e.g. mon-d) is created and before the bad mon deployment (rook-ceph-mon-a) is deleted, restart the operator
5. After the operator restarts, it will be confused and remove mon-d, which can leave the mons out of quorum. "ceph status" shows that four mons are in the monmap, which means at least 3 of them must be online for quorum. 
6. In this test scenario, the operator will automatically scale mon-a back up after it restarts, but if mon-a can't start again, for example because of a node drain, the mons would stay out of quorum.
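
For reference, a minimal command sketch of the steps above, assuming the default openshift-storage namespace, the standard rook pod labels (app=rook-ceph-mon, app=rook-ceph-operator), and the rook-ceph-tools deployment for running ceph commands (all assumptions; adjust for your deployment):

# Step 2: scale down one mon so it falls out of quorum
$ oc -n openshift-storage scale deployment rook-ceph-mon-a --replicas=0

# Step 3: watch for the failover mon (e.g. rook-ceph-mon-d) after ~10 min
$ oc -n openshift-storage get pods -l app=rook-ceph-mon -w

# Step 4: restart the operator before rook-ceph-mon-a is deleted
$ oc -n openshift-storage delete pod -l app=rook-ceph-operator

# Step 5: check how many mons are in the monmap
$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph status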

For more details see the upstream issue: https://github.com/rook/rook/issues/7797

Comment 8 Shrivaibavi Raghaventhiran 2021-06-03 16:03:03 UTC
Platform:
----------

Vmware 3M 3W RHCOS cluster

Versions:
----------

OCP - 4.6.30
OCS - ocs-operator.v4.6.5-411.ci

Testcases Executed:
----------------------

1. Perform a single node drain and restart the rook-ceph operator while the mon failover is in progress
a. Wait >= 20 mins and restart the rook-ceph operator
b. Uncordon the drained node after 20 mins
c. Check that the mons are running in a healthy state and that no mons are left in Pending state after the node recovers

2. Delete the mon deployment
a. Create the cluster, wait for it to be initially deployed
b. Scale down a mon (e.g. mon-a) so it falls out of quorum
c. Wait for the mon failover to be initiated (10 min)
d. As soon as the new mon is created and before the bad mon deployment
(rook-ceph-mon-a) is deleted, restart the operator
e. All 3 mons will be running

Restarted the rook-ceph operator just during the failover (commands sketched below).
I did not see 4 mons; I always saw 3 mons. It's working as expected, and we don't see 2 pending mons.
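
For reference, testcase 1 reduces to commands like the following; a minimal sketch, assuming the default openshift-storage namespace, the standard rook pod labels, and OCP 4.6-era drain flags (all assumptions; adjust for your deployment):

# Drain a node hosting a mon, then restart the operator during the failover window
$ oc adm drain <node> --ignore-daemonsets --delete-local-data
$ oc -n openshift-storage delete pod -l app=rook-ceph-operator

# Uncordon and verify exactly 3 mons are Running and none are Pending
$ oc adm uncordon <node>
$ oc -n openshift-storage get pods -l app=rook-ceph-mon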


Moving the BZ to Verified state

Comment 11 Travis Nielsen 2021-06-09 13:07:07 UTC
How about cutting it down to a small phrase like this?

Previously, the mon quorum was at risk, as the operator could erroneously remove the new mon if the operator was restarted during a mon failover. With this update, the operator completes the same mon failover after the operator is restarted, and hence the mon quorum is more reliable in the node drains and mon failover scenarios.

Comment 19 errata-xmlrpc 2021-06-17 15:46:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.5 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2479

