1945266 – Monitor crash - ceph_assert(m < ranks.size()) - observed when number of monitors were reduced from 5 to 3 using ceph orchestrator

Bug 1945266 - Monitor crash - ceph_assert(m < ranks.size()) - observed when number of monitors were reduced from 5 to 3 using ceph orchestrator

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	RADOS
Sub Component:
Version:	5.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	6.0
Assignee:	Kamoltat (Junior) Sirivadhna
QA Contact:	Pawan
Docs Contact:	Eliska
URL:
Whiteboard:
Duplicates (4):	1945272 1961132 2010836 2111411
Depends On:
Blocks:	1959686 2126050 2142141

Reported:	2021-03-31 15:00 UTC by Vasishta
Modified:	2024-08-29 11:14 UTC
CC List:	22 users
Fixed In Version:	ceph-17.2.5-30.el9cp
Doc Type:	Bug Fix
Doc Text:	.The Ceph Monitor no longer crashes after reducing the number of monitors
	Previously, when the user reduced the number of monitors in the quorum using the `ceph orch apply mon _NUMBER_` command, `cephadm` would remove the monitor before shutting it down. This would trigger an assertion because Ceph would assume that the monitor is shutting down before the monitor removal.
	With this fix, a sanity check is added to handle the case when the current rank of the monitor is larger or equal to the quorum rank. The monitor no longer exists in the monitor map, therefore its peers do not ping this monitor, because the address no longer exists. As a result, the assertion is not triggered if the monitor is removed before shutdown.
Clone Of:
Clones:	2142141
Environment:
Last Closed:	2023-03-20 18:55:33 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Ceph Project Bug Tracker	50089	0	None	None	None	2021-04-01 00:40:54 UTC
Red Hat Product Errata	RHBA-2023:1360	0	None	None	None	2023-03-20 18:56:27 UTC

Description Vasishta 2021-03-31 15:00:38 UTC

Description of problem:
A monitor crashed when the number of monitors was reduced from 5 to 3.

Version-Release number of selected component (if applicable):
16.1.0-1323.el8cp

How reproducible:
Tried once

Steps to Reproduce:
1. Configure a 5.x cluster using cephadm
2. Add some daemons 
3. Reduce number of monitors to 3 from the default 5

Actual results:
"assert_condition": "m < ranks.size()",
    "assert_file": "/builddir/build/BUILD/ceph-16.1.0-1323-g7e7e1f4e/src/mon/MonMap.h",
    "assert_func": "const entity_addrvec_t& MonMap::get_addrs(unsigned int) const",
    "assert_line": 404,
    "assert_msg": "/builddir/build/BUILD/ceph-16.1.0-1323-g7e7e1f4e/src/mon/MonMap.h: In function 'const entity_addrvec_t& MonMap::get_addrs(unsigned int) const' thread 7f348ecf8700 time 2021-03-31T14:28:43.421216+0000\n/builddir/build/BUILD/ceph-16.1.0-1323-g7e7e1f4e/src/mon/MonMap.h: 404: FAILED ceph_assert(m < ranks.size())\n",

Expected results:
Monitor shouldn't crash
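
The failing check in the actual results can be illustrated with a small model. This is only a sketch in Python, not the Ceph C++ code: the class `MonMapModel`, the monitor names `mon.a` to `mon.e`, and the choice of which two monitors are removed are all assumptions made for illustration. It shows how a rank that was valid with 5 monitors falls outside `ranks` once the map holds 3.

```python
class MonMapModel:
    """Toy stand-in for a monitor map: rank i maps to the i-th address."""

    def __init__(self, addrs):
        self.ranks = list(addrs)

    def get_addrs(self, m):
        # Mirrors the failing check: the rank must index into `ranks`.
        if not (m < len(self.ranks)):
            raise AssertionError("FAILED assert(m < ranks.size())")
        return self.ranks[m]

    def remove(self, addr):
        self.ranks.remove(addr)


# Hypothetical 5-monitor cluster; the names are illustrative only.
monmap = MonMapModel(["mon.a", "mon.b", "mon.c", "mon.d", "mon.e"])
stale_rank = 4  # rank held by mon.e before the reduction

# Reduce from 5 monitors to 3 (assumed here to drop mon.e and mon.d).
monmap.remove("mon.e")
monmap.remove("mon.d")

# A lookup that still uses the old rank now fails the check.
try:
    monmap.get_addrs(stale_rank)
    crashed = False
except AssertionError as exc:
    crashed = True
    print("lookup with stale rank failed:", exc)
```

In this model the map itself is consistent after the removal; the failure comes only from a caller that still holds a rank from the larger map.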

Additional info:
The number of monitors was reduced using:
[ceph: root@pluto002 /]# ceph orch apply mon  3
Scheduled mon update...

Comment 4 Manasa 2021-05-17 11:34:18 UTC

*** Bug 1961132 has been marked as a duplicate of this bug. ***

Comment 14 Vikhyat Umrao 2021-10-05 17:58:25 UTC

*** Bug 2010836 has been marked as a duplicate of this bug. ***

Comment 17 Greg Farnum 2021-10-07 17:15:48 UTC

Yeah no work here so far. I'll bump it up my priority list since it's actually getting seen.

Comment 20 Preethi 2021-11-15 06:33:25 UTC

Attached crash logs, captured from the MON nodes, for reference. We saw the issue while upgrading from 5.x to 5.1.

Comment 33 Kamoltat (Junior) Sirivadhna 2022-03-18 01:26:25 UTC

*** Bug 1945272 has been marked as a duplicate of this bug. ***

Comment 34 Vikhyat Umrao 2022-07-27 15:19:01 UTC

*** Bug 2111411 has been marked as a duplicate of this bug. ***

Comment 45 Kamoltat (Junior) Sirivadhna 2022-11-01 13:20:29 UTC

Hi Eliska,

Anywhere marked with <> is where I modified the text below:


*****

Previously, when the user reduced the number of monitors in the quorum using the `ceph orch apply mon _NUMBER_` command, `cephadm` would remove the monitor before shutting it down.
This would trigger an <assertion> because Ceph would assume that the monitor is shutting down before the monitor removal.

With this fix, a sanity check is added <to handle the case when> the current rank of the monitor is larger or equal to the quorum rank. 
The monitor no longer exists in the monitor map, therefore <its peers do> not ping this monitor, because the address no longer exists.
As a result, the assertion is not triggered if the monitor is removed before shutdown.

*****


Let me know what you think. Thank you,

Kamoltat
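
The sanity check described in the text above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the actual patch: the helpers `get_addrs_checked` and `ping_targets` and the monitor names are hypothetical. It treats a rank that is greater than or equal to the number of ranks in the map as "no longer in the map" instead of asserting, and shows that peers iterating over the map never see the removed monitor's address.

```python
from typing import List, Optional, Sequence


def get_addrs_checked(ranks: Sequence[str], rank: int) -> Optional[str]:
    """Return the address for `rank`, or None if the rank is not in the map."""
    # Sanity check: a rank at or beyond the end of `ranks` belongs to a
    # monitor that has already been removed, so report that, don't assert.
    if rank < 0 or rank >= len(ranks):
        return None
    return ranks[rank]


def ping_targets(ranks: Sequence[str], my_rank: int) -> List[str]:
    """Addresses a monitor would ping: every rank in the map except its own."""
    return [addr for r, addr in enumerate(ranks) if r != my_rank]


# Hypothetical map after reducing from 5 monitors to 3.
ranks = ["mon.a", "mon.b", "mon.c"]

# The removed monitor still believes it holds rank 4.
removed_lookup = get_addrs_checked(ranks, 4)

# A surviving peer only pings addresses that are still in the map.
peer_targets = ping_targets(ranks, 0)
```

Returning `None` leaves the caller to decide what to do with a removed monitor; the point of the sketch is only that the out-of-range rank is handled rather than asserted on.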

Comment 50 Kamoltat (Junior) Sirivadhna 2022-12-05 20:13:13 UTC

The issue was that we are hitting this crash on a different code path that I didn't consider. Therefore,
I've filed a new PR, https://github.com/ceph/ceph/pull/49259, that covers the code paths that I missed.
Here is also the new upstream tracker for this: https://tracker.ceph.com/issues/58155.

Comment 52 Kamoltat (Junior) Sirivadhna 2022-12-15 22:33:27 UTC

Cherry-picked to downstream

Comment 75 errata-xmlrpc 2023-03-20 18:55:33 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:1360

akupczyk
anrao
bhubbard
bkunal
ceph-eng-bugs
ckulal
ekristov
gfarnum
hyelloji
kdreyer
ksirivad
mgowri
ngangadh
nojha
pasik
pdhiran
pnataraj
rmandyam
rzarzyns
sseshasa
tserlin
vumrao