Bug 2113062
| Summary: | [GSS] ceph cluster unresponsive when 2 nodes of same zone is down in stretch cluster | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Sonal <sarora> | |
| Component: | rook | Assignee: | Travis Nielsen <tnielsen> | |
| Status: | CLOSED ERRATA | QA Contact: | Mahesh Shetty <mashetty> | |
| Severity: | high | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 4.10 | CC: | bkunal, bniver, ebenahar, etamir, hnallurv, jfindysz, madam, mhackett, mmuench, muagarwa, ocs-bugs, odf-bz-bot, olakra, pdhange, pdhiran, racpatel, rcyriac, sheggodu, srai, tdesala, tnielsen, vkolli, vumrao | |
| Target Milestone: | --- | |||
| Target Release: | ODF 4.12.0 | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | Bug Fix | ||
| Doc Text: |
Previously, the Ceph cluster became unresponsive when two nodes in the same zone of a stretch cluster were down. If the operator restarted in the middle of a mon failover, multiple mons could be started on the same node instead of being spread across unique nodes, which reduced mon quorum availability.
With this update, the operator can now cancel the mon failover when the failover times out. If an extra mon is started during an operator restart, the operator removes it based on the topology, so that extra mons do not run on the same node or in the same zone and the optimal topology spread is maintained.
|
Story Points: | --- | |
| Clone Of: | ||||
| : | 2120598 2120601 (view as bug list) | Environment: | ||
| Last Closed: | 2023-01-31 00:19:40 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2120598, 2120601 | |||
|
Description
Sonal
2022-08-01 20:14:12 UTC
What is the next course of action here? We should consider these fixes for the 4.11.z release, given the criticality of the customer and the impact on the project.

Agreed on the critical nature of the backport after the fix is verified on 4.12.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.12.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:0551
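
The Doc Text says that an extra mon is removed "based on the topology". As an illustration only, here is a minimal Python sketch of one way such a choice could be made. It is not the Rook operator's code: the `Mon` type, the `pick_extra_mon` function, the node and zone names, and the tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Mon:
    """Hypothetical record of where a mon is running."""
    name: str
    node: str
    zone: str


def pick_extra_mon(mons, desired_count):
    """Return the mon to remove, or None when there is no extra mon.

    Assumed preference order (illustrative, not taken from Rook):
    1. a mon that shares its node with another mon,
    2. a mon in the zone that holds the most mons,
    3. the highest mon name, as a deterministic tie-breaker.
    """
    if len(mons) <= desired_count:
        return None
    per_node = Counter(m.node for m in mons)
    per_zone = Counter(m.zone for m in mons)
    return max(mons, key=lambda m: (per_node[m.node], per_zone[m.zone], m.name))


# Stretch cluster with five expected mons: two per data zone plus an arbiter.
# Mon "f" is the extra one, started on a node that already runs mon "b".
mons = [
    Mon("a", "node1", "zone1"),
    Mon("b", "node2", "zone1"),
    Mon("c", "node3", "zone2"),
    Mon("d", "node4", "zone2"),
    Mon("e", "node5", "arbiter"),
    Mon("f", "node2", "zone1"),
]

extra = pick_extra_mon(mons, desired_count=5)
print(extra.name)  # prints "f"
```

Ranking by node collisions before zone crowding mirrors the order in which the Doc Text names the two constraints (same node, then same zone); a real implementation may weigh them differently.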