Bug 1959980
| Summary: | When a node is being drained, increase the mon failover timeout to prevent unnecessary mon failover | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Travis Nielsen <tnielsen> |
| Component: | rook | Assignee: | Santosh Pillai <sapillai> |
| Status: | CLOSED ERRATA | QA Contact: | Shrivaibavi Raghaventhiran <sraghave> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.7 | CC: | madam, muagarwa, ocs-bugs, sapillai, tdesala |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | OCS 4.6.5 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Fixed In Version: | 4.6.5-411.ci | Doc Type: | Bug Fix |
| Last Closed: | 2021-06-17 15:46:46 UTC | Type: | --- |
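For context on the behavior named in the Summary: in Rook, the mon failover wait is governed by the operator setting ROOK_MON_OUT_TIMEOUT (10 minutes by default), and the test results below suggest the effective window is roughly doubled while the mon's node is being drained. A minimal sketch of how to inspect that setting, assuming the usual openshift-storage install namespace:

```sh
# A minimal sketch, assuming the default OCS namespace "openshift-storage".
# ROOK_MON_OUT_TIMEOUT controls how long a mon may stay out of quorum before
# Rook starts a failover (default 10m); if the env var is not set explicitly
# on the operator deployment, the jsonpath below prints nothing and the
# built-in default applies.
oc -n openshift-storage get deployment rook-ceph-operator \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="ROOK_MON_OUT_TIMEOUT")].value}'
```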
Description
Travis Nielsen
2021-05-12 18:26:17 UTC
Santosh please open a PR directly to release-4.6 (no need to first backport to upstream release-1.4)

(In reply to Travis Nielsen from comment #1)
> Santosh please open a PR directly to release-4.6 (no need to first backport
> to upstream release-1.4)

Downstream PR for the 4.6 release: https://github.com/openshift/rook/pull/238

Testing looks good.

Environment:
------------
RHCOS on VMware, 3 masters and 3 workers

Version:
--------
OCS: ocs-operator.v4.6.5-411.ci
OCP: 4.6.30

Testcases:
----------
1. Perform a single node drain and wait for mon failover (see the drain command sketch at the end of this report):
   a. Drain lasting less than 10 minutes: no failover observed.
   b. Drain lasting 10-20 minutes: no failover observed.
   c. Drain lasting more than 20 minutes: failover observed.
   d. Uncordon the node and check the mon status.
   e. All mons come back to Running and no error is observed in ceph health.

2. Perform a single node drain and restart the rook-ceph operator while the mon failover is in progress:
   a. Wait >= 20 minutes, then restart the rook-ceph operator.
   b. Uncordon the drained node after 20 minutes.
   c. Check that the mons are running in a healthy state and that no mons are left in Pending state after the node recovers.

3. Delete the mon deployment (see the mon scale-down sketch at the end of this report):
   a. Create the cluster and wait for it to be initially deployed.
   b. Scale down a mon (e.g. mon-a) so it falls out of quorum.
   c. Wait for the mon failover to be initiated (10 min).
   d. As soon as the new mon is created, and before the bad mon deployment (rook-ceph-mon-a) is deleted, restart the operator.
   e. All 3 mons end up Running.

Performed all 3 of the above testcases on OCS 4.6.5 and OCS 4.7.1.

Just curious: sometimes we see the mon-canary pod respinning and sometimes not, since failover sometimes reuses the same set of names ("mon-a, mon-b, mon-c") when we expect "mon-a, mon-b, mon-d". There is no impact functionality-wise; it is just that sometimes we see the canary pod respinning and sometimes not. Please clarify. In OCS 4.7.1 we always see canary pods; in OCS 4.6.5 we saw canary pods 1 out of 7 times.

```
rook-ceph-mon-a-59c5cfc549-x9xks          2/2   Running   0   100m   10.129.2.241   compute-2   <none>   <none>
rook-ceph-mon-b-687b765c8f-t642l          2/2   Running   0   18h    10.131.0.13    compute-1   <none>   <none>
rook-ceph-mon-d-canary-786fbb44db-m2x5c   0/2   Pending   0   88s    <none>         <none>      <none>   <none>
```

Summary:
--------
Functionality-wise there is no impact on the cluster: all mons are up and running post recovery, the Ceph cluster was accessible throughout, and, most importantly, the mons did not lose quorum.

@Travis Nielsen
Based on BZ https://bugzilla.redhat.com/show_bug.cgi?id=1959976#c9 and the observations above, moving this BZ to Verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.5 security and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2479
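For reproducibility, the command sketches referenced in the testcases above follow. First, the drain flow from testcase 1; this is a sketch under assumptions, not the QA team's exact commands: the node name compute-2 and the enabled rook-ceph-tools toolbox deployment are assumptions, and --delete-local-data is the drain flag of the OCP 4.6 era.

```sh
# Drain a worker hosting a mon, then watch whether Rook starts a failover
# (with the fix, none should start for a drain shorter than ~20 minutes):
oc adm drain compute-2 --ignore-daemonsets --delete-local-data --force
oc -n openshift-storage get pods -l app=rook-ceph-mon -o wide -w

# Recover the node, then confirm mon status and cluster health
# (assumes the rook-ceph-tools toolbox deployment is enabled):
oc adm uncordon compute-2
oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph health
```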
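And a sketch of the mon scale-down flow from testcase 3; deployment names are illustrative, following Rook's rook-ceph-mon-<letter> convention, and again the openshift-storage namespace is assumed.

```sh
# Take mon-a out of quorum and wait (~10 min) for Rook to begin failover:
oc -n openshift-storage scale deployment rook-ceph-mon-a --replicas=0

# As soon as the replacement mon deployment appears, and before
# rook-ceph-mon-a is deleted, restart the operator:
oc -n openshift-storage delete pod -l app=rook-ceph-operator

# After the operator reconciles, expect three mons in Running state:
oc -n openshift-storage get pods -l app=rook-ceph-mon
```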