Bug 2142674

Summary: Ceph unresponsive after provoking failure in datacenter, no IO. Stretch Cluster internal mode.
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Vikhyat Umrao <vumrao>
Component: RADOS Assignee: Kamoltat (Junior) Sirivadhna <ksirivad>
Status: CLOSED ERRATA QA Contact: Pawan <pdhiran>
Severity: high Docs Contact: Akash Raj <akraj>
Priority: unspecified    
Version: 5.1 CC: akraj, akupczyk, amathuri, bhubbard, bkunal, bniver, ceph-eng-bugs, cephqe-warriors, choffman, ddomingu, ebenahar, ebonilla, Egarciad, flucifre, gfarnum, jclaretm, ksirivad, lflores, madam, mashetty, maugarci, mduasope, mgokhool, muagarwa, nojha, nravinas, ocs-bugs, pdhange, pdhiran, rfriedma, rzarzyns, sarora, sostapov, sseshasa, sunnagar, tnielsen, tserlin, vereddy, vkolli, vumrao
Target Milestone: ---   
Target Release: 5.3   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ceph-16.2.10-88.el8cp Doc Type: Bug Fix
Doc Text:
.Performing a DR test with a two-site stretch cluster no longer causes Ceph to become unresponsive

Previously, when performing a DR test with a two-site stretch cluster, removing and adding new monitors to the cluster would cause an incorrect rank in the `ConnectionTracker` class. Due to this, the monitor would fail to identify itself in the `peer_tracker` copy and would never update its correct field, causing a deadlock in the election process which would lead to Ceph becoming unresponsive.

With this fix, the following corrections are made:

* Added an assert in the function `notify_rank_removed()` that, as a sanity check, compares the expected rank provided by the `Monmap` against the rank that is manually adjusted.
* Cleared the variable `removed_ranks` on every `Monmap` update.
* Added an action to manually reset `peer_tracker.rank` when executing the command `ceph connection scores reset` for each monitor. After the reset, `peer_tracker.rank` matches the current rank of the monitor.
* Added functions in the `Elector` and `ConnectionTracker` classes to check for a clean `peer_tracker` when upgrading the monitors, including when booting up. If found unclean, `peer_tracker` is cleared.
* In {storage-product}, the user can choose to manually remove a monitor rank before shutting down the monitor, causing inconsistency in `Monmap`. Therefore, `Monitor::notify_new_monmap()` now prevents the function from removing the monitor's own rank or ranks that do not exist in `Monmap`.

The cluster now works as expected and there is no unwarranted downtime. The cluster no longer becomes unresponsive when performing a DR test with a two-site stretch cluster.
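
The rank bookkeeping described in the Doc Text can be illustrated with a toy model. This is only a sketch under stated assumptions, not Ceph's C++ implementation: the class `ToyConnectionTracker`, its fields, and the `expected_rank` parameter are hypothetical names chosen for illustration. It shows the idea of shifting ranks down when a monitor rank is removed, asserting that the manually adjusted rank agrees with the rank the monitor map would report, and resetting the tracker to the current rank.

```python
class ToyConnectionTracker:
    """Toy stand-in for a per-monitor connection tracker (hypothetical, not Ceph code)."""

    def __init__(self, rank, peer_scores):
        self.rank = rank                      # this monitor's own rank
        self.peer_scores = dict(peer_scores)  # peer rank -> connection score

    def notify_rank_removed(self, removed_rank, expected_rank):
        # Drop the removed rank and shift every higher rank down by one.
        shifted = {}
        for rank, score in self.peer_scores.items():
            if rank == removed_rank:
                continue
            shifted[rank - 1 if rank > removed_rank else rank] = score
        self.peer_scores = shifted

        # Manually adjust our own rank the same way.
        if self.rank > removed_rank:
            self.rank -= 1

        # Sanity check: the adjusted rank must match the rank taken from the
        # monitor map (passed in here as the assumed parameter expected_rank).
        assert self.rank == expected_rank, (
            f"adjusted rank {self.rank} != monmap rank {expected_rank}"
        )

    def reset(self, current_rank):
        # Analogue of a manual score reset: forget all scores and
        # re-synchronise the stored rank with the monitor's current rank.
        self.peer_scores.clear()
        self.rank = current_rank


# Example: the monitor at rank 2 sees rank 1 removed, so it becomes rank 1.
tracker = ToyConnectionTracker(rank=2, peer_scores={0: 1.0, 1: 0.9, 2: 1.0})
tracker.notify_rank_removed(removed_rank=1, expected_rank=1)
```

In this model, a stale rank is caught at the point of adjustment by the assertion instead of surfacing later as a monitor that cannot find itself among its peers.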
Story Points: ---
Clone Of: 2121452
: 2142983 Environment:
Last Closed: 2023-01-11 17:42:24 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2142141, 2142174    
Bug Blocks: 2121452, 2126049, 2142983, 2150223    

Comment 50 errata-xmlrpc 2023-01-11 17:42:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.3 security update and Bug Fix), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0076