Bug 1835563

Summary: MON crash - src/mon/Monitor.cc: 267: FAILED ceph_assert(session_map.sessions.empty())
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Rachana Patel <racpatel>
Component: RADOS Assignee: Brad Hubbard <bhubbard>
Status: CLOSED ERRATA QA Contact: Pawan <pdhiran>
Severity: medium Docs Contact: Ranjini M N <rmandyam>
Priority: medium    
Version: 4.1CC: agunn, akupczyk, amanzane, bhubbard, bkunal, bniver, ceph-eng-bugs, jdurgin, mkasturi, mmanjuna, nojha, nravinas, pdhange, pdhiran, prpandey, rmandyam, rzarzyns, sostapov, sseshasa, tserlin, twilkins, vereddy, vumrao, ykaul
Target Milestone: ---   
Target Release: 5.1   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: ceph-16.2.7-63.el8cp Doc Type: Bug Fix
Doc Text:
.A check is added to prevent new sessions while the Ceph Monitor is shutting down Previously, new sessions could be added while the Ceph Monitor was shutting down, leaving unexpected entries in the session map. This caused an assert failure, resulting in a crash. With this update, a check has been added that rejects new sessions once the Ceph Monitor begins shutting down, so the assert no longer fails and shutdown works as expected.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-04-04 10:19:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1886056, 2031073    
Attachments:
Description Flags
mon logs none
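The fix described in the Doc Text above follows a common pattern: gate session registration on a shutdown flag so a late-arriving connection cannot race with the shutdown-time assert that the session map is empty. The sketch below illustrates that pattern only; the type and member names (`MiniMon`, `add_session`, `shutting_down`) are hypothetical stand-ins, not the actual identifiers in src/mon/Monitor.cc.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a monitor session; not the real Ceph type.
struct Session {
  std::string addr;
};

struct MiniMon {
  std::vector<Session> sessions;
  bool shutting_down = false;

  // The guard: refuse to register a session once shutdown has begun,
  // so nothing can be added to the session map behind shutdown's back.
  bool add_session(const Session& s) {
    if (shutting_down)
      return false;  // drop the late session instead of tracking it
    sessions.push_back(s);
    return true;
  }

  void shutdown() {
    shutting_down = true;  // set before tearing down, closing the race window
    sessions.clear();      // remove existing sessions
    // The analogue of the assert in the bug summary: with the guard in
    // place, nothing can repopulate the map between clear() and here.
    assert(sessions.empty());
  }
};
```

Without the `shutting_down` check, a session added between `sessions.clear()` and the assert would trip it, which matches the rarely-seen race described in comment 4.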

Comment 4 Josh Durgin 2020-05-15 22:04:28 UTC
This is a crash during shutdown, so it has very little user impact. Additionally it is a race condition seen rarely in the thousands of runs upstream. Thus marking it low/low severity and priority.

Comment 5 Vikhyat Umrao 2020-06-01 17:23:55 UTC
*** Bug 1842536 has been marked as a duplicate of this bug. ***

Comment 6 Neha Ojha 2020-09-17 16:46:35 UTC
*** Bug 1879962 has been marked as a duplicate of this bug. ***

Comment 8 Yaniv Kaul 2020-10-12 06:32:43 UTC
(In reply to Josh Durgin from comment #4)
> This is a crash during shutdown, so it has very little user impact.
> Additionally it is a race condition seen rarely in the thousands of runs
> upstream. Thus marking it low/low severity and priority.

It was seen in OCS in a customer deployment, where 2 MONs crashed. I'm raising it to High/High for the time being, in the hope to understand when it happens and how OCS recovers from it.

Comment 9 Scott Ostapovicz 2020-10-12 13:12:31 UTC
Assigning this to the 5.0 rc so it can be attached to the OCS 4.8 release.

Comment 10 Josh Durgin 2020-10-12 23:11:47 UTC
(In reply to Yaniv Kaul from comment #8)
> (In reply to Josh Durgin from comment #4)
> > This is a crash during shutdown, so it has very little user impact.
> > Additionally it is a race condition seen rarely in the thousands of runs
> > upstream. Thus marking it low/low severity and priority.
> 
> It was seen in OCS in a customer deployment, where 2 MONs crashed. I'm
> raising it to High/High for the time being, in the hope to understand when
> it happens and how OCS recovers from it.

Obviously we should not crash, however there's no user impact here.

It's an assert hit when the monitor is already shutting down.
OCS recovers by continuing to do what it was already going to do - start up new monitors.

That we expose things with no user impact as alerts in OCS is a supportability bug.

Comment 11 Yaniv Kaul 2020-10-13 06:53:41 UTC
(In reply to Josh Durgin from comment #10)
> (In reply to Yaniv Kaul from comment #8)
> > (In reply to Josh Durgin from comment #4)
> > > This is a crash during shutdown, so it has very little user impact.
> > > Additionally it is a race condition seen rarely in the thousands of runs
> > > upstream. Thus marking it low/low severity and priority.
> > 
> > It was seen in OCS in a customer deployment, where 2 MONs crashed. I'm
> > raising it to High/High for the time being, in the hope to understand when
> > it happens and how OCS recovers from it.
> 
> Obviously we should not crash, however there's no user impact here.
> 
> It's an assert hit when the monitor is already shutting down.
> OCS recovers by continuing to do what it was already going to do - start up
> new monitors.
> 
> That we expose things with no user impact as alerts in OCS is a
> supportability bug.

The impact is indeed indirect - the health is not OK and cannot be solved without support.

Comment 12 Madhavi Kasturi 2020-10-13 11:56:07 UTC
Created attachment 1721173 [details]
mon logs

Comment 14 Josh Durgin 2020-11-13 23:03:36 UTC
We'd need a coredump or logs with messenger debugging to debug this. Is it reproducible?

Comment 15 Yaniv Kaul 2020-12-02 16:31:41 UTC
Raz - can we reproduce this as Josh asked in comment 14 above?

Comment 29 Neha Ojha 2021-04-30 22:13:11 UTC
*** Bug 1953345 has been marked as a duplicate of this bug. ***

Comment 34 Veera Raghava Reddy 2021-06-11 05:08:12 UTC
Pawan, adding needinfo on you to track recreation of this BZ.

Comment 72 errata-xmlrpc 2022-04-04 10:19:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.1 Security, Enhancement, and Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1174