Bug 2142674 - Ceph unresponsive after provoking failure in datacenter, no IO. Stretch Cluster internal mode.
Summary: Ceph unresponsive after provoking failure in datacenter, no IO. Stretch Cluster internal mode.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 5.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 5.3
Assignee: Kamoltat (Junior) Sirivadhna
QA Contact: Pawan
Docs Contact: Akash Raj
URL:
Whiteboard:
Depends On: 2142141 2142174
Blocks: 2121452 2126049 2142983 2150223
 
Reported: 2022-11-14 20:37 UTC by Vikhyat Umrao
Modified: 2023-01-17 09:00 UTC
CC: 40 users

Fixed In Version: ceph-16.2.10-88.el8cp
Doc Type: Bug Fix
Doc Text:
.Performing a DR test with a two-site stretch cluster no longer causes Ceph to become unresponsive

Previously, when performing a DR test with a two-site stretch cluster, removing and adding new monitors to the cluster would cause an incorrect rank in the `ConnectionTracker` class. Due to this, the monitor would fail to identify itself in its `peer_tracker` copy and would never update the correct field, causing a deadlock in the election process that would lead to Ceph becoming unresponsive.

With this fix, the following corrections are made:

* Added an assert in the function `notify_rank_removed()` to compare the expected rank provided by the `Monmap` against the rank that is manually adjusted, as a sanity check.
* Cleared the variable `removed_ranks` on every `Monmap` update.
* Added an action to manually reset `peer_tracker.rank` when executing the command `ceph connection scores reset` for each monitor, so that `peer_tracker.rank` matches the current rank of the monitor.
* Added functions in the `Elector` and `ConnectionTracker` classes to check for a clean `peer_tracker` when upgrading the monitors, including at boot. If the `peer_tracker` is found unclean, it is cleared.
* In {storage-product}, the user can choose to manually remove a monitor rank before shutting down the monitor, causing an inconsistency in the `Monmap`. Therefore, `Monitor::notify_new_monmap()` now prevents the function from removing our rank, or ranks that do not exist in the `Monmap`.

As a result, the cluster no longer becomes unresponsive when performing a DR test with a two-site stretch cluster, and there is no unwarranted downtime.
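To make the rank bookkeeping concrete, the following is a minimal, self-contained C++ sketch of the idea behind the fix. It is not the actual Ceph implementation: names such as `ConnectionTrackerSketch`, `peer_scores`, `is_clean()`, and `reset()` are simplified stand-ins for the real `ConnectionTracker` and `Elector` members. It shows a tracker adjusting its own rank when a peer rank is removed, asserting that the adjusted rank matches the rank expected from the monitor map, and clearing its state when found unclean instead of entering an election with a stale identity.

[source,c++]
----
#include <cassert>
#include <iostream>
#include <map>

// Illustrative sketch only; the real logic lives in Ceph's ConnectionTracker
// and Elector classes and is considerably more involved.
struct ConnectionTrackerSketch {
  int rank = -1;                      // this monitor's own rank
  std::map<int, double> peer_scores;  // connection scores keyed by peer rank

  // Called when a monitor rank is removed from the map. 'expected_rank' is
  // what the Monmap says our rank should be afterwards; it is used as a
  // sanity check against the manual adjustment.
  void notify_rank_removed(int removed_rank, int expected_rank) {
    peer_scores.erase(removed_rank);
    if (rank > removed_rank)  // ranks above the removed one shift down
      --rank;
    assert(rank == expected_rank);
  }

  // True if the tracker still identifies itself correctly.
  bool is_clean(int monmap_rank) const { return rank == monmap_rank; }

  // Manual reset, e.g. when an operator resets connection scores or the
  // tracker is found unclean while upgrading or booting a monitor.
  void reset(int monmap_rank) {
    rank = monmap_rank;
    peer_scores.clear();
  }
};

int main() {
  ConnectionTrackerSketch tracker;
  tracker.rank = 2;
  tracker.peer_scores = {{0, 1.0}, {1, 1.0}};

  // The monitor with rank 1 is removed; the Monmap now says our rank is 1.
  tracker.notify_rank_removed(1, 1);

  // If the tracker were still unclean, clear it rather than electing with
  // a stale identity.
  if (!tracker.is_clean(1))
    tracker.reset(1);

  std::cout << "rank=" << tracker.rank
            << " peers=" << tracker.peer_scores.size() << std::endl;
  return 0;
}
----

In a running cluster, the manual reset corresponds to the per-monitor `ceph connection scores reset` action referenced above, not to anything in this sketch.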
Clone Of: 2121452
Cloned To: 2142983 (view as bug list)
Environment:
Last Closed: 2023-01-11 17:42:24 UTC
Embargoed:




Links:
- GitHub ceph/ceph pull 49312 (open): "pacific: mon/Elector: Change how we handle removed_ranks and notify_rank_removed()" (last updated 2022-12-16 17:45:01 UTC)
- Red Hat Issue Tracker RHCEPH-5613 (last updated 2022-11-14 20:41:47 UTC)
- Red Hat Product Errata RHSA-2023:0076 (last updated 2023-01-11 17:43:38 UTC)

Comment 50 errata-xmlrpc 2023-01-11 17:42:24 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.3 security update and bug fix) and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0076

