Description of problem (please be as detailed as possible and provide log snippets):
- On shutting down 2 storage nodes of a zone, the Ceph cluster became unresponsive and ceph commands timed out.
- Once the two nodes were back up, there were 6 mon pods running. After a while, 5 remained; however, 2 of them were running on the same node.

Version of all relevant components (if applicable):
4.10.5

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
Yes, it is blocking the platform from going live.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes, in the customer's environment.

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy an ODF stretch cluster.
2. Shut down 2 storage nodes from a zone.
3. Ceph commands become unresponsive.
4. Bring the 2 nodes back up.
5. 6 mon pods are running.
6. After a while, the 6th one disappears; of the remaining 5 mons, 2 are running on the same node.

Actual results:
- The Ceph cluster was unresponsive while the nodes were down.
- 2 mons were running on one storage node.

Expected results:
- Since more than 50% of the Ceph nodes are up, the Ceph cluster should not become unresponsive and commands should not time out.
- 1 mon pod on each storage node and one on the arbiter node.

Additional info:
In the next private comment.
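For illustration, the expected results can be checked with simple majority arithmetic. Below is a minimal Python sketch, assuming the 5-monitor layout implied by the expected results (one mon on each of 4 storage nodes plus one on the arbiter). The pod and node names are hypothetical, and the sketch applies a plain majority rule only; it does not model any other stretch-mode behavior or explain the root cause.

```python
from typing import Dict, List


def has_quorum(total_mons: int, mons_up: int) -> bool:
    """Return True if the surviving monitors form a strict majority."""
    return mons_up > total_mons // 2


def colocated_mons(mon_to_node: Dict[str, str]) -> Dict[str, List[str]]:
    """Return every node that hosts more than one mon, with the mon names."""
    by_node: Dict[str, List[str]] = {}
    for mon, node in mon_to_node.items():
        by_node.setdefault(node, []).append(mon)
    return {node: sorted(mons) for node, mons in by_node.items() if len(mons) > 1}


# Hypothetical placement after the two nodes came back (names are made up).
placement = {
    "mon-a": "storage-node-1",
    "mon-b": "storage-node-2",
    "mon-c": "storage-node-3",
    "mon-d": "storage-node-3",  # second mon on the same node, as reported
    "mon-e": "arbiter-node",
}

# With 2 of 5 mons down, 3 remain, which is still a strict majority.
print(has_quorum(total_mons=5, mons_up=3))
print(colocated_mons(placement))
```

Under these assumptions, losing the two mons of one zone leaves 3 of 5, so monitor quorum alone would be expected to hold, and the second helper flags the node that ends up with two mons.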
What is the next course of action here? We should consider these fixes for the 4.11.z release, given the criticality of the customer and the impact on the project.
Agreed on the critical nature of the backport after the fix is verified on 4.12.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.12.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:0551