2133683 – After shutting down 2 worker nodes on the MS provider cluster 2 mons are down and ceph health is not recovered

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	rook
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Travis Nielsen
QA Contact:	Neha Berry
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2112021

Reported:	2022-10-11 07:35 UTC by Dhruv Bindra
Modified:	2023-12-08 04:30 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	2112021
Environment:
Last Closed:	2022-11-07 16:19:50 UTC
Embargoed:

Comment 2 Travis Nielsen 2022-10-11 17:58:38 UTC

There is no must-gather, so please provide more details from the cluster:
1. Why are the mons in pending state? "oc describe pod" should show the reason. I suspect they have node affinity to the nodes that were just deleted.
2. Are you using host networking? If so, the mons will always be tied to their node, and once two mons are taken down at the same time the cluster cannot recover until at least one of them is brought back up. That's just not supported.
3. As long as two mons are down, the mons have no quorum, so everything else in the cluster will be down, and the rook operator will time out whenever it tries to run ceph commands.
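
The quorum reasoning behind point 3 can be sketched as follows. This is a minimal illustration, not taken from the cluster in this bug: it assumes the usual three-mon deployment (the bug does not state the mon count), and the function names has_quorum and max_tolerated_failures are hypothetical helpers, not part of rook or ceph. Ceph mons need a strict majority of all mons to be up in order to form quorum.

```python
def has_quorum(total_mons: int, mons_up: int) -> bool:
    """Return True if the running mons form a strict majority of all mons."""
    if not 0 <= mons_up <= total_mons:
        raise ValueError("mons_up must be between 0 and total_mons")
    return mons_up > total_mons // 2


def max_tolerated_failures(total_mons: int) -> int:
    """Largest number of mons that can be down while quorum still holds."""
    return (total_mons - 1) // 2


if __name__ == "__main__":
    total = 3  # assumption: three mons, the count is not stated in this bug
    for down in range(total + 1):
        up = total - down
        print(f"{down} mon(s) down, {up} up: quorum={has_quorum(total, up)}")
    print(f"tolerated failures with {total} mons: {max_tolerated_failures(total)}")
```

Under that assumption, one mon down still leaves quorum, while two mons down does not, which is consistent with ceph commands timing out until at least one of the two mons is back.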

Comment 3 Travis Nielsen 2022-10-24 15:17:13 UTC

Is this still an issue or shall we close this?

Comment 4 Dhruv Bindra 2022-10-28 05:41:13 UTC

Tagging @ikave to get more info about this bug as he is the QE assignee

Comment 6 Travis Nielsen 2022-11-03 19:19:00 UTC

Moving out of 4.12 while waiting for more details

Comment 7 Travis Nielsen 2022-11-07 16:19:50 UTC

Please reopen if there are more details to investigate

Comment 8 Red Hat Bugzilla 2023-12-08 04:30:55 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
