Bug 1835563 - MON crash - src/mon/Monitor.cc: 267: FAILED ceph_assert(session_map.sessions.empty())
Summary: MON crash - src/mon/Monitor.cc: 267: FAILED ceph_assert(session_map.sessions.empty())
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 4.1
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 5.1
Assignee: Brad Hubbard
QA Contact: Pawan
Docs Contact: Ranjini M N
URL:
Whiteboard:
Duplicates: 1842536 1879962 1953345
Depends On:
Blocks: 1886056 2031073
 
Reported: 2020-05-14 04:37 UTC by Rachana Patel
Modified: 2025-04-04 12:26 UTC
CC List: 24 users

Fixed In Version: ceph-16.2.7-63.el8cp
Doc Type: Bug Fix
Doc Text:
.A check is added to prevent new sessions when the Ceph Monitor is shutting down
Previously, new sessions could be added while the Ceph Monitor was shutting down, leaving unexpected entries in the session map; this caused an assert failure and crashed the monitor. With this update, a check prevents new sessions from being added while the Ceph Monitor is shutting down, so the assert no longer fails and shutdown works as expected.
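
The mechanics of the race and the guard can be illustrated with a minimal, self-contained C++ sketch. This is not the actual Ceph code - Monitor, MonSession, and the locking in src/mon/Monitor.cc are far more involved, and the real change is in the GitHub PR 44337 linked below - but it shows the pattern:

#include <cassert>
#include <map>
#include <memory>
#include <mutex>

struct MonSession {};  // stand-in for the real MonSession

class Monitor {
  std::mutex lock;
  bool shutting_down = false;
  std::map<int, std::shared_ptr<MonSession>> sessions;  // stand-in for session_map

public:
  // Called when a new client connection arrives. The fix is the
  // shutting_down check: without it, a connection racing with shutdown
  // could repopulate the session map after it had been emptied.
  bool add_session(int id) {
    std::lock_guard<std::mutex> l(lock);
    if (shutting_down)
      return false;  // refuse new sessions during shutdown
    sessions.emplace(id, std::make_shared<MonSession>());
    return true;
  }

  void shutdown() {
    std::lock_guard<std::mutex> l(lock);
    shutting_down = true;
    sessions.clear();  // remove_all_sessions() in the real code
  }

  ~Monitor() {
    // Mirrors the failing ceph_assert(session_map.sessions.empty()):
    // with the guard above, no session can sneak in after shutdown().
    assert(sessions.empty());
  }
};

int main() {
  Monitor mon;
  mon.add_session(1);
  mon.shutdown();
  assert(!mon.add_session(2));  // rejected: the monitor is shutting down
}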
Clone Of:
Environment:
Last Closed: 2022-04-04 10:19:51 UTC
Embargoed:


Attachments
mon logs (220.80 KB, application/x-xz)
2020-10-13 11:56 UTC, Madhavi Kasturi


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 39150 0 None None None 2020-05-14 06:56:13 UTC
Github ceph ceph pull 44337 0 None Merged mon: prevent new sessions during shutdown 2021-12-17 20:04:17 UTC
Github ceph ceph pull 44543 0 None open pacific: mon: prevent new sessions during shutdown 2022-01-27 00:04:25 UTC
Red Hat Product Errata RHSA-2022:1174 0 None Closed RCA - OpenShift ARO upgrade issue on production 2022-06-06 05:07:18 UTC

Comment 4 Josh Durgin 2020-05-15 22:04:28 UTC
This is a crash during shutdown, so it has very little user impact. Additionally it is a race condition seen rarely in the thousands of runs upstream. Thus marking it low/low severity and priority.

Comment 5 Vikhyat Umrao 2020-06-01 17:23:55 UTC
*** Bug 1842536 has been marked as a duplicate of this bug. ***

Comment 6 Neha Ojha 2020-09-17 16:46:35 UTC
*** Bug 1879962 has been marked as a duplicate of this bug. ***

Comment 8 Yaniv Kaul 2020-10-12 06:32:43 UTC
(In reply to Josh Durgin from comment #4)
> This is a crash during shutdown, so it has very little user impact.
> Additionally it is a race condition seen rarely in the thousands of runs
> upstream. Thus marking it low/low severity and priority.

It was seen in OCS in a customer deployment, where 2 MONs crashed. I'm raising it to High/High for the time being, in the hope of understanding when it happens and how OCS recovers from it.

Comment 9 Scott Ostapovicz 2020-10-12 13:12:31 UTC
Assigning this to the 5.0 rc so it can be attached to the OCS 4.8 release.

Comment 10 Josh Durgin 2020-10-12 23:11:47 UTC
(In reply to Yaniv Kaul from comment #8)
> (In reply to Josh Durgin from comment #4)
> > This is a crash during shutdown, so it has very little user impact.
> > Additionally it is a race condition seen rarely in the thousands of runs
> > upstream. Thus marking it low/low severity and priority.
> 
> It was seen in OCS in a customer deployment, where 2 MONs crashed. I'm
> raising it to High/High for the time being, in the hope of understanding when
> it happens and how OCS recovers from it.

Obviously we should not crash; however, there's no user impact here.

It's an assert hit when the monitor is already shutting down.
OCS recovers by continuing to do what it was already going to do - start up new monitors.

That we expose things with no user impact as alerts in OCS is a supportability bug.

Comment 11 Yaniv Kaul 2020-10-13 06:53:41 UTC
(In reply to Josh Durgin from comment #10)
> (In reply to Yaniv Kaul from comment #8)
> > (In reply to Josh Durgin from comment #4)
> > > This is a crash during shutdown, so it has very little user impact.
> > > Additionally it is a race condition seen rarely in the thousands of runs
> > > upstream. Thus marking it low/low severity and priority.
> > 
> > It was seen in OCS in a customer deployment, where 2 MONs crashed. I'm
> > raising it to High/High for the time being, in the hope of understanding when
> > it happens and how OCS recovers from it.
> 
> Obviously we should not crash; however, there's no user impact here.
> 
> It's an assert hit when the monitor is already shutting down.
> OCS recovers by continuing to do what it was already going to do - start up
> new monitors.
> 
> That we expose things with no user impact as alerts in OCS is a
> supportability bug.

The impact is indeed indirect - the cluster health is not OK and cannot be resolved without support.

Comment 12 Madhavi Kasturi 2020-10-13 11:56:07 UTC
Created attachment 1721173 [details]
mon logs

Comment 14 Josh Durgin 2020-11-13 23:03:36 UTC
We'd need a coredump or logs with messenger debugging enabled to debug this. Is it reproducible?
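
For anyone attempting a reproducer: messenger debugging can be turned up with the standard debug_ms/debug_mon options (a general suggestion, not instructions from this BZ), for example at runtime:

ceph tell mon.<id> injectargs '--debug_ms 20 --debug_mon 20'

or persistently in ceph.conf:

[mon]
debug ms = 20
debug mon = 20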

Comment 15 Yaniv Kaul 2020-12-02 16:31:41 UTC
Raz - can we reproduce this as Josh asked in comment 14 above?

Comment 29 Neha Ojha 2021-04-30 22:13:11 UTC
*** Bug 1953345 has been marked as a duplicate of this bug. ***

Comment 34 Veera Raghava Reddy 2021-06-11 05:08:12 UTC
Pawan, adding needinfo on you to track the recreation of this BZ.

Comment 72 errata-xmlrpc 2022-04-04 10:19:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.1 Security, Enhancement, and Bug Fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1174

