Bug 2113062

Summary: [GSS] ceph cluster unresponsive when 2 nodes of the same zone are down in stretch cluster
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Sonal <sarora>
Component: rook
Assignee: Travis Nielsen <tnielsen>
Status: CLOSED ERRATA
QA Contact: Mahesh Shetty <mashetty>
Severity: high
Priority: unspecified
Version: 4.10
CC: bkunal, bniver, ebenahar, etamir, hnallurv, jfindysz, madam, mhackett, mmuench, muagarwa, ocs-bugs, odf-bz-bot, olakra, pdhange, pdhiran, racpatel, rcyriac, sheggodu, srai, tdesala, tnielsen, vkolli, vumrao
Target Release: ODF 4.12.0
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Doc Text:
Previously, the Ceph cluster became unresponsive when two nodes of the same zone were down in a stretch cluster. If the operator restarted in the middle of a mon failover, multiple mons could be started on the same node, reducing mon quorum availability: two mons could end up on the same node instead of being spread across unique nodes. With this update, the operator cancels a mon failover that times out. If an extra mon is started during an operator restart, the extra mon is removed based on topology, ensuring that mons do not run on the same node or in the same zone, to maintain optimal topology spread.
Clones: 2120598, 2120601
Last Closed: 2023-01-31 00:19:40 UTC
Type: Bug
Bug Blocks: 2120598, 2120601
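The topology-based removal described in the Doc Text can be sketched as follows. This is only an illustration of the idea, not the actual Rook implementation; the function name and all mon/node/zone names are hypothetical.

```python
# Sketch (not the actual Rook code): when more mons exist than desired,
# pick one to remove, preferring a mon that shares a node with another
# mon, then one that shares a zone. All names here are hypothetical.
from collections import Counter

def pick_extra_mon_to_remove(mons, desired=5):
    """mons: dict of mon name -> (node, zone). Returns the name of the
    mon to remove when more than `desired` mons exist, or None if the
    mon count is already correct."""
    if len(mons) <= desired:
        return None
    node_counts = Counter(node for node, _ in mons.values())
    zone_counts = Counter(zone for _, zone in mons.values())
    # Prefer breaking up mons co-located on the same node ...
    for name, (node, _) in sorted(mons.items()):
        if node_counts[node] > 1:
            return name
    # ... then mons co-located in the same zone ...
    for name, (_, zone) in sorted(mons.items()):
        if zone_counts[zone] > 1:
            return name
    # ... otherwise any extra mon will do.
    return sorted(mons)[-1]

# The failure mode from this bug: a 6th mon ("f") was started on
# node-1, which already hosts mon "a".
mons = {
    "a": ("node-1", "zone-a"),
    "b": ("node-2", "zone-a"),
    "c": ("node-3", "zone-b"),
    "d": ("node-4", "zone-b"),
    "e": ("arbiter", "zone-c"),
    "f": ("node-1", "zone-a"),
}
print(pick_extra_mon_to_remove(mons))  # picks a mon from the doubled-up node-1
```

With the healthy five-mon layout (two mons per data zone plus the arbiter), the function returns None, since two mons per zone is the expected spread in a stretch cluster.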

Description Sonal 2022-08-01 20:14:12 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

- On shutting down 2 storage nodes of a zone, the Ceph cluster became unresponsive and ceph commands timed out.

- Once the two nodes were back up, there were 6 mon pods running. After a while, 5 remained; however, 2 of them were running on the same node.

Version of all relevant components (if applicable):
4.10.5

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, it is blocking the platform from going live.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes, in customer's environment

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy ODF stretch cluster
2. Shut down 2 storage nodes from a zone.
3. Ceph commands become unresponsive
4. Bring up 2 nodes
5. 6 mon pods are running
6. After a while, the 6th mon disappears; out of the remaining 5 mons, 2 are running on the same node


Actual results:
Ceph cluster unresponsive while the nodes were down.
2 mons running on one storage node.

Expected results:
- Since more than 50% of the Ceph nodes (and thus mons) remain up, quorum should hold: the cluster should not become unresponsive and commands should not time out.
- 1 mon pod on each storage node and one on the arbiter node.
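The expected placement above can be checked with a quick sketch like the following. The pod-to-node data would in practice come from something like `oc get pods -l app=rook-ceph-mon -o wide` in the storage namespace; the mon and node names used here are hypothetical.

```python
# Sketch: verify that no two mon pods share a node. The pod/node data
# is hypothetical; in a real cluster it would be collected from
# `oc get pods -l app=rook-ceph-mon -o wide`.

def mons_spread_ok(pod_nodes):
    """pod_nodes: dict of mon pod name -> node name.
    Returns True if every mon pod runs on a distinct node."""
    nodes = list(pod_nodes.values())
    return len(nodes) == len(set(nodes))

# Expected layout: one mon per storage node plus one on the arbiter.
healthy = {"mon-a": "node-1", "mon-b": "node-2", "mon-c": "node-3",
           "mon-d": "node-4", "mon-e": "arbiter"}
# The actual result in this bug: two mons ended up on the same node.
broken = {"mon-a": "node-1", "mon-b": "node-2", "mon-c": "node-3",
          "mon-d": "node-4", "mon-e": "node-1"}

print(mons_spread_ok(healthy))  # True
print(mons_spread_ok(broken))   # False
```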


Additional info:
In next private comment.

Comment 23 Venkat Kolli 2022-08-11 22:00:53 UTC
What is the next course of action here? We should consider these fixes for a 4.11.z release, given the criticality of the customer and the impact on the project.

Comment 24 Travis Nielsen 2022-08-11 22:51:03 UTC
Agreed on the critical nature of the backport after the fix is verified on 4.12.

Comment 48 errata-xmlrpc 2023-01-31 00:19:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.12.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:0551