Bug 2113062 - [GSS] ceph cluster unresponsive when 2 nodes of same zone is down in stretch cluster
Summary: [GSS] ceph cluster unresponsive when 2 nodes of same zone is down in stretch cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.10
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.12.0
Assignee: Travis Nielsen
QA Contact: Mahesh Shetty
URL:
Whiteboard:
Depends On:
Blocks: 2120598 2120601
 
Reported: 2022-08-01 20:14 UTC by Sonal
Modified: 2024-01-10 10:20 UTC
CC List: 23 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, the Ceph cluster could become unresponsive when two nodes in the same zone were down in a stretch cluster. If the operator restarted in the middle of a mon failover, multiple mons could be started on the same node, reducing mon quorum availability; two mons could end up on the same node instead of being spread across unique nodes. With this update, the operator cancels the mon failover when it times out, and if an extra mon is started during an operator restart, the extra mon is removed based on topology so that extra mons do not run on the same node or in the same zone, maintaining the optimal topology spread.
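As an illustration of the topology-based removal described above, here is a minimal sketch only, not Rook's actual implementation; the monInfo type and pickExtraMonToRemove function are hypothetical. The idea: when an extra mon exists, prefer removing one whose node already hosts another mon so the survivors stay spread across unique nodes.

package main

import "fmt"

// monInfo is a hypothetical record of where a mon pod is scheduled.
type monInfo struct {
	Name string
	Node string
}

// pickExtraMonToRemove returns the name of a mon that shares a node with
// another mon, or "" if every mon is on a unique node. A real operator
// would also weigh zone spread; this sketch only shows the node-level idea.
func pickExtraMonToRemove(mons []monInfo) string {
	seen := map[string]bool{}
	for _, m := range mons {
		if seen[m.Node] {
			return m.Name // this node already hosts a mon; remove the newcomer
		}
		seen[m.Node] = true
	}
	return ""
}

func main() {
	// Example: 6 mons after an operator restart, two of them on node4.
	mons := []monInfo{
		{"a", "node1"}, {"b", "node2"}, {"c", "node3"},
		{"d", "node4"}, {"e", "node4"}, {"f", "arbiter-node"},
	}
	fmt.Println("extra mon to remove:", pickExtraMonToRemove(mons)) // prints "e"
}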
Clone Of:
Clones: 2120598 2120601
Environment:
Last Closed: 2023-01-31 00:19:40 UTC
Embargoed:


Attachments


Links
System / ID | Status | Summary | Last Updated
Github red-hat-storage/ocs-ci pull 7271 | Merged | Stretch cluster: test shutdowns and crash scenarios | 2024-01-10 10:20:19 UTC
Github rook/rook pull 10717 | open | mon: Improve mon failover reliability to better handle failure and topology | 2022-08-12 21:25:00 UTC
Red Hat Product Errata RHBA-2023:0551 | None | None | 2023-01-31 00:19:59 UTC

Description Sonal 2022-08-01 20:14:12 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

- On shutting down 2 storage nodes of one zone, the Ceph cluster became unresponsive and ceph commands timed out.

- Once the two nodes were back up, there were 6 mon pods running. After a while, 5 remained; however, 2 of them were running on the same node.

Version of all relevant components (if applicable):
4.10.5

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes, it is blocking the platform from going live.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes, in the customer's environment.

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy an ODF stretch cluster.
2. Shut down 2 storage nodes from one zone.
3. Ceph commands become unresponsive.
4. Bring the 2 nodes back up.
5. 6 mon pods are running.
6. After a while, the 6th one disappears; of the remaining 5 mons, 2 are running on the same node (a placement check is sketched after these steps).
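A quick way to observe steps 5-6 is to list the rook-ceph-mon pods and group them by node. The following is a hedged sketch using client-go; the "openshift-storage" namespace and the "app=rook-ceph-mon" label are the usual ODF defaults and may differ in a given cluster.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (assumes ~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// List the mon pods; namespace and label are assumed ODF defaults.
	pods, err := client.CoreV1().Pods("openshift-storage").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "app=rook-ceph-mon"})
	if err != nil {
		panic(err)
	}

	// Count mons per node and flag any node hosting more than one.
	perNode := map[string]int{}
	for _, p := range pods.Items {
		perNode[p.Spec.NodeName]++
		fmt.Printf("%s -> %s\n", p.Name, p.Spec.NodeName)
	}
	for node, n := range perNode {
		if n > 1 {
			fmt.Printf("WARNING: %d mons on node %s\n", n, node)
		}
	}
}

The same placement can be seen with "oc get pods -n openshift-storage -l app=rook-ceph-mon -o wide".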


Actual results:
The Ceph cluster was unresponsive while the nodes were down.
2 mons ended up on the same storage node.

Expected results:
- Since more than 50% of the Ceph nodes (and therefore mons) are still up, the Ceph cluster should not become unresponsive and ceph commands should not time out.
- 1 mon pod on each storage node and one on the arbiter node.
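For context on the "more than 50%" expectation: Ceph monitors need a strict majority of the monmap to form quorum, so with the 5-mon stretch layout above (one mon per storage node plus the arbiter), losing the two mons in one data zone still leaves 3 of 5. A trivial sketch of that check (hasQuorum is an illustrative helper, not a Ceph or Rook API):

package main

import "fmt"

// hasQuorum reports whether enough mons are up to form a Ceph monitor
// quorum, which requires a strict majority of the monmap.
func hasQuorum(totalMons, monsUp int) bool {
	return monsUp > totalMons/2
}

func main() {
	// 5-mon stretch layout: losing the 2 mons in one data zone leaves 3 of 5,
	// which is still a majority, so quorum (and ceph commands) should survive.
	fmt.Println(hasQuorum(5, 3)) // true
	fmt.Println(hasQuorum(5, 2)) // false
}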


Additional info:
In the next (private) comment.

Comment 23 Venkat Kolli 2022-08-11 22:00:53 UTC
What is the next course of action here? We should consider these fixes for the 4.11.z release, given the criticality of the customer and the impact on the project.

Comment 24 Travis Nielsen 2022-08-11 22:51:03 UTC
Agreed on the critical nature of the backport after the fix is verified on 4.12.

Comment 48 errata-xmlrpc 2023-01-31 00:19:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.12.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:0551

