Bug 1945266
Summary: | Monitor crash - ceph_assert(m < ranks.size()) - observed when number of monitors were reduced from 5 to 3 using ceph orchestrator | |||
---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Vasishta <vashastr> | |
Component: | RADOS | Assignee: | Kamoltat (Junior) Sirivadhna <ksirivad> | |
Status: | CLOSED ERRATA | QA Contact: | Pawan <pdhiran> | |
Severity: | high | Docs Contact: | Eliska <ekristov> | |
Priority: | unspecified | |||
Version: | 5.0 | CC: | akupczyk, anrao, bhubbard, bkunal, ceph-eng-bugs, ckulal, ekristov, gfarnum, hyelloji, kdreyer, ksirivad, mgowri, ngangadh, nojha, pasik, pdhiran, pnataraj, rmandyam, rzarzyns, sseshasa, tserlin, vumrao | |
Target Milestone: | --- | |||
Target Release: | 6.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | ceph-17.2.5-30.el9cp | Doc Type: | Bug Fix | |
Doc Text: |
.The Ceph Monitor no longer crashes after reducing the number of monitors
Previously, when the user reduced the number of monitors in the quorum using the `ceph orch apply mon _NUMBER_` command, `cephadm` would remove the monitor before shutting it down.
This would trigger an assertion because Ceph would assume that the monitor is shutting down before the monitor removal.
With this fix, a sanity check is added to handle the case when the current rank of the monitor is greater than or equal to the quorum rank.
The monitor no longer exists in the monitor map, therefore its peers do not ping this monitor, because the address no longer exists.
As a result, the assertion is not triggered if the monitor is removed before shutdown.
|
Story Points: | --- | |
Clone Of: | ||||
: | 2142141 | Environment: | |
Last Closed: | 2023-03-20 18:55:33 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1959686, 2126050, 2142141 |
Description
Vasishta
2021-03-31 15:00:38 UTC
*** Bug 1961132 has been marked as a duplicate of this bug. ***

*** Bug 2010836 has been marked as a duplicate of this bug. ***

Yeah no work here so far. I'll bump it up my priority list since it's actually getting seen.

Attached crashed logs for reference captured from MON nodes. We saw the issue while upgrading from 5.x to 5.1.

*** Bug 1945272 has been marked as a duplicate of this bug. ***

*** Bug 2111411 has been marked as a duplicate of this bug. ***

Hi Eliska, anywhere with <> is where I modified the text below:

*****
Previously, when the user reduced the number of monitors in the quorum using the `ceph orch apply mon _NUMBER_` command, `cephadm` would remove the monitor before shutting it down.
This would trigger an <assertion> because Ceph would assume that the monitor is shutting down before the monitor removal.
With this fix, a sanity check is added <to handle the case when> the current rank of the monitor is larger or equal to the quorum rank.
The monitor no longer exists in the monitor map, therefore <its peers do> not ping this monitor, because the address no longer exists.
As a result, the assertion is not triggered if the monitor is removed before shutdown.
*****

Let me know what you think

thank you,
Kamoltat

So the issue was because we are hitting this issue on a different code path that I didn't consider. Therefore, I've filed a new PR: https://github.com/ceph/ceph/pull/49259 that will cover the code paths that I've missed. Here is also a new upstream tracker for this: https://tracker.ceph.com/issues/58155.

Cherry-picked to downstream

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.0 Bug Fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:1360
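For readers who want to see the shape of the failure, here is a rough Python model of the rank check described in the Doc Text. It is only a sketch, not the Ceph code: the function names `peer_addr_strict` and `peer_addr_checked`, the monitor names, and the list-based monitor map are all hypothetical, and the real fix lives in the upstream PR linked above. The assumption is that a monitor's rank indexes into the list of ranks in the monitor map, so a monitor removed from the map before it shuts down can still hold a rank that is past the end of that list.

```python
from typing import Optional, Sequence


def peer_addr_strict(ranks: Sequence[str], m: int) -> str:
    """Modelled pre-fix behaviour: assume m is always a valid rank.

    Mirrors the shape of ceph_assert(m < ranks.size()); hypothetical helper.
    """
    if not (m < len(ranks)):
        raise AssertionError("m < ranks.size()")
    return ranks[m]


def peer_addr_checked(ranks: Sequence[str], m: int) -> Optional[str]:
    """Modelled post-fix behaviour: tolerate a rank that has left the map.

    Returns None instead of asserting when the rank is out of range, so the
    caller can skip the monitor that was removed before it shut down.
    """
    if m < 0 or m >= len(ranks):
        return None
    return ranks[m]


# Hypothetical scenario: five monitors reduced to three. The monitor that
# held rank 4 is still running but is no longer in the monitor map.
ranks_after_shrink = ["mon.a", "mon.b", "mon.c"]
stale_rank = 4

# The checked lookup reports "no such peer" instead of crashing.
print(peer_addr_checked(ranks_after_shrink, stale_rank))
```

With the strict variant, the same lookup raises, which is the modelled equivalent of the monitor crash reported here.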