Bug 2142983

Summary: Ceph unresponsive after provoking failure in datacenter, no IO. Stretch Cluster internal mode.
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vikhyat Umrao <vumrao>
Component: RADOS
Assignee: Kamoltat (Junior) Sirivadhna <ksirivad>
Status: CLOSED ERRATA
QA Contact: Pawan <pdhiran>
Severity: high
Docs Contact: Eliska <ekristov>
Priority: unspecified
Version: 5.1
CC: akupczyk, amathuri, bhubbard, bkunal, bniver, ceph-eng-bugs, cephqe-warriors, choffman, ddomingu, ebenahar, ebonilla, Egarciad, ekristov, flucifre, gfarnum, jclaretm, kdreyer, ksirivad, lflores, mashetty, maugarci, mduasope, mgokhool, muagarwa, nojha, nravinas, ocs-bugs, pdhange, pdhiran, rfriedma, rzarzyns, sarora, sseshasa, sunnagar, tnielsen, tserlin, vereddy, vkolli, vumrao
Target Milestone: ---   
Target Release: 6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ceph-17.2.5-31.el9cp
Doc Type: Bug Fix
Doc Text:
.Ceph Monitors are no longer stuck during failover of a site
Previously, the `removed_ranks` variable did not discard its contents on every update of the monitor map. As a result, replacing monitors in a 2-site stretch cluster and then failing over one of the sites caused the connection scores, including the ranks associated with those scores, to become inconsistent. Inconsistent connection scores caused a deadlock during the monitor election period, which made Ceph unresponsive. Once this happened, there was no way for the monitor rank associated with a connection score to correct itself. With this fix, the `removed_ranks` variable is cleared on every update of the monitor map. Monitors are no longer stuck in the election period, and Ceph no longer becomes unresponsive when replacing monitors and failing over a site. In addition, the connection scores can be manually forced to correct themselves with the `ceph daemon mon._NAME_ connection scores reset` command (a brief usage sketch follows the metadata fields below).
Story Points: ---
Clone Of: 2142674
Environment:
Last Closed: 2023-03-20 18:59:13 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2142141, 2142174, 2142674    
Bug Blocks: 2126050    
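
The Doc Text above mentions the `ceph daemon mon._NAME_ connection scores reset` command. The following is a minimal usage sketch, not taken from this bug report: it assumes a monitor named mon.a (substitute each monitor's actual name) and shell access to the host where that monitor's admin socket lives. Both subcommands are part of the monitor admin-socket interface used with the connectivity election strategy.

    # Inspect the stored connection scores, including the rank each
    # peer monitor is recorded under:
    ceph daemon mon.a connection scores dump

    # If the scores or ranks disagree across monitors, force them to be
    # regenerated from scratch:
    ceph daemon mon.a connection scores reset

Since each monitor keeps its own copy of the scores, if they are inconsistent it is generally worth running the dump, and if needed the reset, on every monitor in the cluster rather than on a single one.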

Comment 41 errata-xmlrpc 2023-03-20 18:59:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:1360