Bug 2142674

Summary:	Ceph unresponsive after provoking failure in datacenter, no IO. Stretch Cluster internal mode.
Product:	[Red Hat Storage] Red Hat Ceph Storage	Reporter:	Vikhyat Umrao <vumrao>
Component:	RADOS	Assignee:	Kamoltat (Junior) Sirivadhna <ksirivad>
Status:	CLOSED ERRATA	QA Contact:	Pawan <pdhiran>
Severity:	high	Docs Contact:	Akash Raj <akraj>
Priority:	unspecified
Version:	5.1	CC:	akraj, akupczyk, amathuri, bhubbard, bkunal, bniver, ceph-eng-bugs, cephqe-warriors, choffman, ddomingu, ebenahar, ebonilla, Egarciad, flucifre, gfarnum, jclaretm, ksirivad, lflores, madam, mashetty, maugarci, mduasope, mgokhool, muagarwa, nojha, nravinas, ocs-bugs, pdhange, pdhiran, rfriedma, rzarzyns, sarora, sostapov, sseshasa, sunnagar, tnielsen, tserlin, vereddy, vkolli, vumrao
Target Milestone:	---
Target Release:	5.3
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	ceph-16.2.10-88.el8cp	Doc Type:	Bug Fix
Doc Text:	.Performing a DR test with two sites stretch cluster no longer causes Ceph to become unresponsive

Previously, when performing a DR test with two sites stretch-cluster, removing and adding new monitors to the cluster would cause an incorrect rank in `ConnectionTracker` class. Due to this, the monitor would fail to identify itself in the `peer_tracker` copy and would never update its correct field, causing a deadlock in the election process which would lead to Ceph becoming unresponsive.

With this fix, the following corrections are made:

* Added an assert in the function `notify_rank_removed()`, to compare the expected rank provided by the `Monmap` against the rank that is manually adjusted as a sanity check.
* Cleared the variable `removed_ranks` from every `Monmap` update.
* Added an action to manually reset `peer_tracker.rank` when executing the command - `ceph connection scores reset` for each monitor. The `peer_tracker.rank` matches the current rank of the monitor.
* Added functions in the `Elector` and `ConnectionTracker` classes to check for clean `peer_tracker` when upgrading the monitors, including booting up. If found unclean, `peer_tracker` is cleared.
* In {storage-product}, the user can choose to manually remove a monitor rank before shutting down the monitor, causing inconsistency in `Monmap`. Therefore, in `Monitor::notify_new_monmap()` we prevent the function from removing our rank or ranks that don't exist in `Monmap`.

The cluster now works as expected and there is no unwarranted downtime. The cluster no longer becomes unresponsive when performing a DR test with two sites stretch-cluster.
Story Points:	---
Clone Of:	2121452
Clones:	2142983		Environment:
Last Closed:	2023-01-11 17:42:24 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	2142141, 2142174
Bug Blocks:	2121452, 2126049, 2142983, 2150223
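
Illustration for the Doc Text above: a minimal Python sketch of the rank bookkeeping it describes, not the Ceph C++ code. The class `PeerTracker`, its fields, and the method signatures are hypothetical stand-ins; only the names `notify_rank_removed`, `peer_tracker.rank`, and the idea of a "clean" tracker come from the Doc Text. The cleanliness rule used here (own rank matches the monmap, no scores for ranks outside the monmap) is an assumption.

```python
class PeerTracker:
    """Toy model of a monitor's connection-score tracker (hypothetical)."""

    def __init__(self, rank, scores):
        self.rank = rank            # this monitor's own rank
        self.scores = dict(scores)  # rank -> connection score

    def notify_rank_removed(self, removed_rank, monmap_rank):
        """Drop removed_rank and shift higher ranks down by one.

        monmap_rank is the rank the new monmap assigns to this monitor;
        it is compared with the manually adjusted rank as a sanity check.
        """
        if removed_rank == self.rank or removed_rank not in self.scores:
            # Mirrors the guard described for Monitor::notify_new_monmap():
            # never remove our own rank or a rank that does not exist.
            raise ValueError("refusing to remove rank %d" % removed_rank)
        shifted = {}
        for peer, score in self.scores.items():
            if peer == removed_rank:
                continue
            shifted[peer - 1 if peer > removed_rank else peer] = score
        self.scores = shifted
        if self.rank > removed_rank:
            self.rank -= 1
        assert self.rank == monmap_rank, "adjusted rank disagrees with monmap"

    def is_clean(self, monmap_rank, monmap_size):
        """Assumed rule: own rank matches and no score is out of range."""
        in_range = all(0 <= peer < monmap_size for peer in self.scores)
        return self.rank == monmap_rank and in_range

    def reset(self, monmap_rank):
        """Clear all scores and re-sync the rank, as a score reset would."""
        self.scores = {}
        self.rank = monmap_rank
```

For example, a tracker at rank 2 with scores for ranks 0, 1, and 2 ends up at rank 1 with scores for ranks 0 and 1 after `notify_rank_removed(1, 1)`. A tracker left at a stale rank fails `is_clean()` until `reset()` is called with the rank from the monmap.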

Comment 50 errata-xmlrpc 2023-01-11 17:42:24 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.3 security update and Bug Fix), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0076