Bug 2142674 - Ceph unresponsive after provoking failure in datacenter, no IO. Stretch Cluster internal mode.
Summary: Ceph unresponsive after provoking failure in datacenter, no IO. Stretch Cluster internal mode.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 5.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 5.3
Assignee: Kamoltat (Junior) Sirivadhna
QA Contact: Pawan
Docs Contact: Akash Raj
URL:
Whiteboard:
Depends On: 2142141 2142174
Blocks: 2121452 2126049 2142983 2150223
 
Reported: 2022-11-14 20:37 UTC by Vikhyat Umrao
Modified: 2023-01-17 09:00 UTC
CC: 40 users

Fixed In Version: ceph-16.2.10-88.el8cp
Doc Type: Bug Fix
Doc Text:
.Performing a DR test with a two-site stretch cluster no longer causes Ceph to become unresponsive

Previously, when performing a DR test with a two-site stretch cluster, removing and adding new monitors to the cluster would cause an incorrect rank in the `ConnectionTracker` class. Due to this, the monitor would fail to identify itself in its `peer_tracker` copy and would never update the correct field, causing a deadlock in the election process that would lead to Ceph becoming unresponsive.

With this fix, the following corrections are made:

* Added an assert in the function `notify_rank_removed()` to compare the expected rank provided by the `Monmap` against the rank that is manually adjusted, as a sanity check.
* Cleared the variable `removed_ranks` on every `Monmap` update.
* Added an action to manually reset `peer_tracker.rank` when executing the command `ceph connection scores reset` for each monitor, so that `peer_tracker.rank` matches the current rank of the monitor.
* Added functions in the `Elector` and `ConnectionTracker` classes to check for a clean `peer_tracker` when upgrading the monitors, including at boot. If the `peer_tracker` is found unclean, it is cleared.
* In {storage-product}, the user can choose to manually remove a monitor rank before shutting down the monitor, causing an inconsistency in the `Monmap`. Therefore, `Monitor::notify_new_monmap()` now prevents the function from removing our rank, or ranks that do not exist in the `Monmap`.

As a result, the cluster no longer becomes unresponsive when performing a DR test with a two-site stretch cluster, and there is no unwarranted downtime.
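To make the rank bookkeeping concrete, the following is a minimal, self-contained C++ sketch of the idea behind the fix. It is not the actual Ceph implementation: names such as `ConnectionTrackerSketch`, `peer_scores`, `is_clean()`, and `reset()` are simplified stand-ins for the real `ConnectionTracker` and `Elector` members. It shows a tracker adjusting its own rank when a peer rank is removed, asserting that the adjusted rank matches the rank expected from the monitor map, and clearing its state when found unclean instead of entering an election with a stale identity.

[source,c++]
----
#include <cassert>
#include <iostream>
#include <map>

// Illustrative sketch only; the real logic lives in Ceph's ConnectionTracker
// and Elector classes and is considerably more involved.
struct ConnectionTrackerSketch {
  int rank = -1;                      // this monitor's own rank
  std::map<int, double> peer_scores;  // connection scores keyed by peer rank

  // Called when a monitor rank is removed from the map. 'expected_rank' is
  // what the Monmap says our rank should be afterwards; it is used as a
  // sanity check against the manual adjustment.
  void notify_rank_removed(int removed_rank, int expected_rank) {
    peer_scores.erase(removed_rank);
    if (rank > removed_rank)  // ranks above the removed one shift down
      --rank;
    assert(rank == expected_rank);
  }

  // True if the tracker still identifies itself correctly.
  bool is_clean(int monmap_rank) const { return rank == monmap_rank; }

  // Manual reset, e.g. when an operator resets connection scores or the
  // tracker is found unclean while upgrading or booting a monitor.
  void reset(int monmap_rank) {
    rank = monmap_rank;
    peer_scores.clear();
  }
};

int main() {
  ConnectionTrackerSketch tracker;
  tracker.rank = 2;
  tracker.peer_scores = {{0, 1.0}, {1, 1.0}};

  // The monitor with rank 1 is removed; the Monmap now says our rank is 1.
  tracker.notify_rank_removed(1, 1);

  // If the tracker were still unclean, clear it rather than electing with
  // a stale identity.
  if (!tracker.is_clean(1))
    tracker.reset(1);

  std::cout << "rank=" << tracker.rank
            << " peers=" << tracker.peer_scores.size() << std::endl;
  return 0;
}
----

In a running cluster, the manual reset corresponds to the per-monitor `ceph connection scores reset` action referenced above, not to anything in this sketch.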
Clone Of: 2121452
Cloned To: 2142983 (view as bug list)
Environment:
Last Closed: 2023-01-11 17:42:24 UTC
Embargoed:




Links:
- GitHub ceph/ceph pull 49312 (open): "pacific: mon/Elector: Change how we handle removed_ranks and notify_rank_removed()" (last updated 2022-12-16 17:45:01 UTC)
- Red Hat Issue Tracker RHCEPH-5613 (last updated 2022-11-14 20:41:47 UTC)
- Red Hat Product Errata RHSA-2023:0076 (last updated 2023-01-11 17:43:38 UTC)

Comment 50 errata-xmlrpc 2023-01-11 17:42:24 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.3 security update and bug fix) and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0076

