Bug 1835563 - MON crash - src/mon/Monitor.cc: 267: FAILED ceph_assert(session_map.sessions.empty())
Summary: MON crash - src/mon/Monitor.cc: 267: FAILED ceph_assert(session_map.sessions.empty())
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 4.1
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 5.1
Assignee: Brad Hubbard
QA Contact: Pawan
Docs Contact: Ranjini M N
URL:
Whiteboard:
Duplicates: 1842536 1879962 1953345
Depends On:
Blocks: 1886056 2031073
 
Reported: 2020-05-14 04:37 UTC by Rachana Patel
Modified: 2025-04-04 12:26 UTC
CC List: 24 users

Fixed In Version: ceph-16.2.7-63.el8cp
Doc Type: Bug Fix
Doc Text:
.A check is added to prevent new sessions when the Ceph Monitor is shutting down
Previously, new sessions could be added while the Ceph Monitor was shutting down, leaving unexpected entries in the session map; this caused an assert failure and crashed the monitor. With this update, a check prevents new sessions from being added while the Ceph Monitor is shutting down, so the assert no longer fails and shutdown works as expected.
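
The mechanics of the race and the guard can be illustrated with a minimal, self-contained C++ sketch. This is not the actual Ceph code - Monitor, MonSession, and the locking in src/mon/Monitor.cc are far more involved, and the real change is in the GitHub PR 44337 linked below - but it shows the pattern:

#include <cassert>
#include <map>
#include <memory>
#include <mutex>

struct MonSession {};  // stand-in for the real MonSession

class Monitor {
  std::mutex lock;
  bool shutting_down = false;
  std::map<int, std::shared_ptr<MonSession>> sessions;  // stand-in for session_map

public:
  // Called when a new client connection arrives. The fix is the
  // shutting_down check: without it, a connection racing with shutdown
  // could repopulate the session map after it had been emptied.
  bool add_session(int id) {
    std::lock_guard<std::mutex> l(lock);
    if (shutting_down)
      return false;  // refuse new sessions during shutdown
    sessions.emplace(id, std::make_shared<MonSession>());
    return true;
  }

  void shutdown() {
    std::lock_guard<std::mutex> l(lock);
    shutting_down = true;
    sessions.clear();  // remove_all_sessions() in the real code
  }

  ~Monitor() {
    // Mirrors the failing ceph_assert(session_map.sessions.empty()):
    // with the guard above, no session can sneak in after shutdown().
    assert(sessions.empty());
  }
};

int main() {
  Monitor mon;
  mon.add_session(1);
  mon.shutdown();
  assert(!mon.add_session(2));  // rejected: the monitor is shutting down
}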
Clone Of:
Environment:
Last Closed: 2022-04-04 10:19:51 UTC
Embargoed:


Attachments
mon logs (220.80 KB, application/x-xz)
2020-10-13 11:56 UTC, Madhavi Kasturi


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 39150 0 None None None 2020-05-14 06:56:13 UTC
Github ceph ceph pull 44337 0 None Merged mon: prevent new sessions during shutdown 2021-12-17 20:04:17 UTC
Github ceph ceph pull 44543 0 None open pacific: mon: prevent new sessions during shutdown 2022-01-27 00:04:25 UTC
Red Hat Product Errata RHSA-2022:1174 0 None Closed RCA - OpenShift ARO upgrade issue on production 2022-06-06 05:07:18 UTC

Comment 4 Josh Durgin 2020-05-15 22:04:28 UTC
This is a crash during shutdown, so it has very little user impact. Additionally it is a race condition seen rarely in the thousands of runs upstream. Thus marking it low/low severity and priority.

Comment 5 Vikhyat Umrao 2020-06-01 17:23:55 UTC
*** Bug 1842536 has been marked as a duplicate of this bug. ***

Comment 6 Neha Ojha 2020-09-17 16:46:35 UTC
*** Bug 1879962 has been marked as a duplicate of this bug. ***

Comment 8 Yaniv Kaul 2020-10-12 06:32:43 UTC
(In reply to Josh Durgin from comment #4)
> This is a crash during shutdown, so it has very little user impact.
> Additionally it is a race condition seen rarely in the thousands of runs
> upstream. Thus marking it low/low severity and priority.

It was seen in OCS in a customer deployment, where 2 MONs crashed. I'm raising it to High/High for the time being, in the hope of understanding when it happens and how OCS recovers from it.

Comment 9 Scott Ostapovicz 2020-10-12 13:12:31 UTC
Assigning this to the 5.0 rc so it can be attached to the OCS 4.8 release.

Comment 10 Josh Durgin 2020-10-12 23:11:47 UTC
(In reply to Yaniv Kaul from comment #8)
> (In reply to Josh Durgin from comment #4)
> > This is a crash during shutdown, so it has very little user impact.
> > Additionally it is a race condition seen rarely in the thousands of runs
> > upstream. Thus marking it low/low severity and priority.
> 
> It was seen in OCS in a customer deployment, where 2 MONs crashed. I'm
> raising it to High/High for the time being, in the hope of understanding when
> it happens and how OCS recovers from it.

Obviously we should not crash; however, there's no user impact here.

It's an assert hit when the monitor is already shutting down.
OCS recovers by continuing to do what it was already going to do - start up new monitors.

That we expose things with no user impact as alerts in OCS is a supportability bug.

Comment 11 Yaniv Kaul 2020-10-13 06:53:41 UTC
(In reply to Josh Durgin from comment #10)
> (In reply to Yaniv Kaul from comment #8)
> > (In reply to Josh Durgin from comment #4)
> > > This is a crash during shutdown, so it has very little user impact.
> > > Additionally it is a race condition seen rarely in the thousands of runs
> > > upstream. Thus marking it low/low severity and priority.
> > 
> > It was seen in OCS in a customer deployment, where 2 MONs crashed. I'm
> > raising it to High/High for the time being, in the hope of understanding when
> > it happens and how OCS recovers from it.
> 
> Obviously we should not crash; however, there's no user impact here.
> 
> It's an assert hit when the monitor is already shutting down.
> OCS recovers by continuing to do what it was already going to do - start up
> new monitors.
> 
> That we expose things with no user impact as alerts in OCS is a
> supportability bug.

The impact is indeed indirect - the cluster health is not OK and cannot be resolved without support.

Comment 12 Madhavi Kasturi 2020-10-13 11:56:07 UTC
Created attachment 1721173 [details]
mon logs

Comment 14 Josh Durgin 2020-11-13 23:03:36 UTC
We'd need a coredump or logs with messenger debugging enabled to debug this. Is it reproducible?
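
For anyone attempting a reproducer: messenger debugging can be turned up with the standard debug_ms/debug_mon options (a general suggestion, not instructions from this BZ), for example at runtime:

ceph tell mon.<id> injectargs '--debug_ms 20 --debug_mon 20'

or persistently in ceph.conf:

[mon]
debug ms = 20
debug mon = 20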

Comment 15 Yaniv Kaul 2020-12-02 16:31:41 UTC
Raz - can we reproduce this as Josh asked in comment 14 above?

Comment 29 Neha Ojha 2021-04-30 22:13:11 UTC
*** Bug 1953345 has been marked as a duplicate of this bug. ***

Comment 34 Veera Raghava Reddy 2021-06-11 05:08:12 UTC
Pawan, adding needinfo on you to track the recreation of this BZ.

Comment 72 errata-xmlrpc 2022-04-04 10:19:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.1 Security, Enhancement, and Bug Fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1174

