Bug 2142674
Summary: | Ceph unresponsive after provoking failure in datacenter, no IO. Stretch Cluster internal mode. | |||
---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Vikhyat Umrao <vumrao> | |
Component: | RADOS | Assignee: | Kamoltat (Junior) Sirivadhna <ksirivad> | |
Status: | CLOSED ERRATA | QA Contact: | Pawan <pdhiran> | |
Severity: | high | Docs Contact: | Akash Raj <akraj> | |
Priority: | unspecified | |||
Version: | 5.1 | CC: | akraj, akupczyk, amathuri, bhubbard, bkunal, bniver, ceph-eng-bugs, cephqe-warriors, choffman, ddomingu, ebenahar, ebonilla, Egarciad, flucifre, gfarnum, jclaretm, ksirivad, lflores, madam, mashetty, maugarci, mduasope, mgokhool, muagarwa, nojha, nravinas, ocs-bugs, pdhange, pdhiran, rfriedma, rzarzyns, sarora, sostapov, sseshasa, sunnagar, tnielsen, tserlin, vereddy, vkolli, vumrao | |
Target Milestone: | --- | |||
Target Release: | 5.3 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | ceph-16.2.10-88.el8cp | Doc Type: | Bug Fix | |
Doc Text: |
.Performing a DR test with a two-site stretch cluster no longer causes Ceph to become unresponsive
Previously, when performing a DR test with a two-site stretch cluster, removing monitors from the cluster and adding new ones would leave an incorrect rank in the `ConnectionTracker` class. As a result, the monitor failed to identify itself in its `peer_tracker` copy and never updated the correct field. This deadlocked the election process and caused Ceph to become unresponsive.
With this fix, the following corrections are made:
* Added an assert in the function `notify_rank_removed()` that, as a sanity check, compares the expected rank provided by the `Monmap` against the manually adjusted rank.
* The variable `removed_ranks` is now cleared on every `Monmap` update.
* Executing the command `ceph connection scores reset` for each monitor now also resets `peer_tracker.rank` so that it matches the current rank of the monitor.
* Added functions in the `Elector` and `ConnectionTracker` classes that check for a clean `peer_tracker` when the monitors are upgraded, including when they boot up. If the `peer_tracker` is found to be unclean, it is cleared.
* In {storage-product}, the user can choose to manually remove a monitor rank before shutting down the monitor, which causes an inconsistency in `Monmap`. Therefore, `Monitor::notify_new_monmap()` is now prevented from removing the monitor's own rank or ranks that do not exist in `Monmap`.
The cluster now works as expected, with no unwarranted downtime, and no longer becomes unresponsive when performing a DR test with a two-site stretch cluster.
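
The rank bookkeeping described in the fix list can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Ceph's implementation: the class `ToyTracker`, its `scores` dictionary, and the method signatures are hypothetical stand-ins chosen for illustration. Only the ideas (shifting ranks when a monitor is removed, asserting against the rank expected from the `Monmap`, and clearing an unclean tracker) come from the text above.

```python
class ToyTracker:
    """Hypothetical stand-in for a monitor's peer tracker (not Ceph code)."""

    def __init__(self, rank, size):
        self.rank = rank
        # One score per peer rank; a monitor keeps no score for itself.
        self.scores = {r: 1.0 for r in range(size) if r != rank}

    def notify_rank_removed(self, removed, expected_rank):
        # Drop the removed peer and shift every higher rank down by one.
        self.scores = {
            (r - 1 if r > removed else r): s
            for r, s in self.scores.items()
            if r != removed
        }
        # Manually adjust our own rank if a lower rank disappeared.
        if removed < self.rank:
            self.rank -= 1
        # Sanity check: the adjusted rank must match the one from the map.
        assert self.rank == expected_rank, "rank diverged from the monmap"

    def is_clean(self, monmap_rank, monmap_size):
        # Clean means: our rank matches the map, we hold no entry for
        # ourselves, and every peer rank fits inside the current map.
        return (
            self.rank == monmap_rank
            and self.rank not in self.scores
            and all(0 <= r < monmap_size for r in self.scores)
        )

    def reset(self, monmap_rank, monmap_size):
        # Rebuild the tracker from the current map, discarding stale state.
        self.rank = monmap_rank
        self.scores = {r: 1.0 for r in range(monmap_size) if r != monmap_rank}


# A three-monitor map loses rank 0, so the monitor at rank 2 becomes rank 1.
tracker = ToyTracker(rank=2, size=3)
tracker.notify_rank_removed(removed=0, expected_rank=1)

# A tracker that missed the removal still believes it is rank 2.
stale = ToyTracker(rank=2, size=3)
if not stale.is_clean(monmap_rank=1, monmap_size=2):
    stale.reset(monmap_rank=1, monmap_size=2)
```

In this model, a tracker that missed a removal keeps a rank that no longer exists in the map, which is the kind of inconsistency the clean check is meant to catch before an election relies on it.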
|
Story Points: | --- | |
Clone Of: | 2121452 | |||
: | 2142983 | Environment: | ||
Last Closed: | 2023-01-11 17:42:24 UTC | Type: | --- | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 2142141, 2142174 | |||
Bug Blocks: | 2121452, 2126049, 2142983, 2150223 |
Comment 50
errata-xmlrpc
2023-01-11 17:42:24 UTC