Bug 2113062 - [GSS] ceph cluster unresponsive when 2 nodes of same zone is down in stretch cluster
Summary: [GSS] ceph cluster unresponsive when 2 nodes of same zone is down in stretch cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.10
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.12.0
Assignee: Travis Nielsen
QA Contact: Mahesh Shetty
URL:
Whiteboard:
Depends On:
Blocks: 2120598 2120601
 
Reported: 2022-08-01 20:14 UTC by Sonal
Modified: 2024-01-10 10:20 UTC
CC List: 23 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, the Ceph cluster could become unresponsive when two nodes in the same zone were down in a stretch cluster. If the operator restarted in the middle of a mon failover, multiple mons could be started on the same node, reducing mon quorum availability; two mons could end up on the same node instead of being spread across unique nodes. With this update, the operator cancels the mon failover when it times out, and if an extra mon is started during an operator restart, the extra mon is removed based on topology so that extra mons do not run on the same node or in the same zone, maintaining the optimal topology spread.
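As an illustration of the topology-based removal described above, here is a minimal sketch only, not Rook's actual implementation; the monInfo type and pickExtraMonToRemove function are hypothetical. The idea: when an extra mon exists, prefer removing one whose node already hosts another mon so the survivors stay spread across unique nodes.

package main

import "fmt"

// monInfo is a hypothetical record of where a mon pod is scheduled.
type monInfo struct {
	Name string
	Node string
}

// pickExtraMonToRemove returns the name of a mon that shares a node with
// another mon, or "" if every mon is on a unique node. A real operator
// would also weigh zone spread; this sketch only shows the node-level idea.
func pickExtraMonToRemove(mons []monInfo) string {
	seen := map[string]bool{}
	for _, m := range mons {
		if seen[m.Node] {
			return m.Name // this node already hosts a mon; remove the newcomer
		}
		seen[m.Node] = true
	}
	return ""
}

func main() {
	// Example: 6 mons after an operator restart, two of them on node4.
	mons := []monInfo{
		{"a", "node1"}, {"b", "node2"}, {"c", "node3"},
		{"d", "node4"}, {"e", "node4"}, {"f", "arbiter-node"},
	}
	fmt.Println("extra mon to remove:", pickExtraMonToRemove(mons)) // prints "e"
}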
Clone Of:
Clones: 2120598 2120601
Environment:
Last Closed: 2023-01-31 00:19:40 UTC
Embargoed:


Attachments


Links
System / ID | Status | Summary | Last Updated
Github red-hat-storage/ocs-ci pull 7271 | Merged | Stretch cluster: test shutdowns and crash scenarios | 2024-01-10 10:20:19 UTC
Github rook/rook pull 10717 | open | mon: Improve mon failover reliability to better handle failure and topology | 2022-08-12 21:25:00 UTC
Red Hat Product Errata RHBA-2023:0551 | None | None | 2023-01-31 00:19:59 UTC

Description Sonal 2022-08-01 20:14:12 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

- On shutting down 2 storage nodes of one zone, the Ceph cluster became unresponsive and ceph commands timed out.

- Once the two nodes were back up, there were 6 mon pods running. After a while, 5 remained; however, 2 of them were running on the same node.

Version of all relevant components (if applicable):
4.10.5

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes, it is blocking the platform from going live.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes, in the customer's environment.

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy an ODF stretch cluster.
2. Shut down 2 storage nodes from one zone.
3. Ceph commands become unresponsive.
4. Bring the 2 nodes back up.
5. 6 mon pods are running.
6. After a while, the 6th one disappears; of the remaining 5 mons, 2 are running on the same node (a placement check is sketched after these steps).
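A quick way to observe steps 5-6 is to list the rook-ceph-mon pods and group them by node. The following is a hedged sketch using client-go; the "openshift-storage" namespace and the "app=rook-ceph-mon" label are the usual ODF defaults and may differ in a given cluster.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (assumes ~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// List the mon pods; namespace and label are assumed ODF defaults.
	pods, err := client.CoreV1().Pods("openshift-storage").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "app=rook-ceph-mon"})
	if err != nil {
		panic(err)
	}

	// Count mons per node and flag any node hosting more than one.
	perNode := map[string]int{}
	for _, p := range pods.Items {
		perNode[p.Spec.NodeName]++
		fmt.Printf("%s -> %s\n", p.Name, p.Spec.NodeName)
	}
	for node, n := range perNode {
		if n > 1 {
			fmt.Printf("WARNING: %d mons on node %s\n", n, node)
		}
	}
}

The same placement can be seen with "oc get pods -n openshift-storage -l app=rook-ceph-mon -o wide".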


Actual results:
The Ceph cluster was unresponsive while the nodes were down.
2 mons ended up on the same storage node.

Expected results:
- Since more than 50% of the Ceph nodes (and therefore mons) are still up, the Ceph cluster should not become unresponsive and ceph commands should not time out.
- 1 mon pod on each storage node and one on the arbiter node.
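For context on the "more than 50%" expectation: Ceph monitors need a strict majority of the monmap to form quorum, so with the 5-mon stretch layout above (one mon per storage node plus the arbiter), losing the two mons in one data zone still leaves 3 of 5. A trivial sketch of that check (hasQuorum is an illustrative helper, not a Ceph or Rook API):

package main

import "fmt"

// hasQuorum reports whether enough mons are up to form a Ceph monitor
// quorum, which requires a strict majority of the monmap.
func hasQuorum(totalMons, monsUp int) bool {
	return monsUp > totalMons/2
}

func main() {
	// 5-mon stretch layout: losing the 2 mons in one data zone leaves 3 of 5,
	// which is still a majority, so quorum (and ceph commands) should survive.
	fmt.Println(hasQuorum(5, 3)) // true
	fmt.Println(hasQuorum(5, 2)) // false
}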


Additional info:
In the next (private) comment.

Comment 23 Venkat Kolli 2022-08-11 22:00:53 UTC
What is the next course of action here? We should consider these fixes for the 4.11.z release, given the criticality of the customer and the impact on the project.

Comment 24 Travis Nielsen 2022-08-11 22:51:03 UTC
Agreed on the critical nature of the backport after the fix is verified on 4.12.

Comment 48 errata-xmlrpc 2023-01-31 00:19:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.12.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:0551

