Bug 1835563
Summary: MON crash - src/mon/Monitor.cc: 267: FAILED ceph_assert(session_map.sessions.empty())
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Rachana Patel <racpatel>
Component: RADOS
Assignee: Brad Hubbard <bhubbard>
Status: CLOSED ERRATA
QA Contact: Pawan <pdhiran>
Severity: medium
Docs Contact: Ranjini M N <rmandyam>
Priority: medium
Version: 4.1
CC: agunn, akupczyk, amanzane, bhubbard, bkunal, bniver, ceph-eng-bugs, jdurgin, mkasturi, mmanjuna, nojha, nravinas, pdhange, pdhiran, prpandey, rmandyam, rzarzyns, sostapov, sseshasa, tserlin, twilkins, vereddy, vumrao, ykaul
Target Milestone: ---
Target Release: 5.1
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: ceph-16.2.7-63.el8cp
Doc Type: Bug Fix
Doc Text:
.A check is added to prevent new sessions while the Ceph Monitor is shutting down
Previously, new sessions could be added while the Ceph Monitor was shutting down, leaving unexpected entries in the session map. This caused an assert failure and the monitor crashed.
With this update, a check prevents new sessions from being created while the Ceph Monitor is shutting down, so the assert no longer fails and shutdown works as expected. (A minimal sketch of this pattern follows the metadata fields below.)
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-04-04 10:19:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1886056, 2031073
Attachments: mon logs (attachment 1721173)
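The Doc Text above describes the fix as a guard that refuses new sessions once the monitor has started shutting down. Below is a minimal sketch of that pattern; the names (MiniMonitor, add_session, shutdown, and so on) are illustrative assumptions, not the actual Monitor.cc change.

```cpp
// Minimal sketch of "refuse new sessions while shutting down".
// All names here are illustrative; they are not the real Ceph symbols.
#include <cassert>
#include <map>
#include <mutex>
#include <string>

class MiniMonitor {
  enum class State { running, shutting_down };

  std::mutex lock_;
  State state_ = State::running;
  std::map<int, std::string> sessions_;  // stand-in for session_map.sessions

public:
  // Returns false instead of registering a session once shutdown has begun.
  bool add_session(int id, const std::string& peer) {
    std::lock_guard<std::mutex> g(lock_);
    if (state_ == State::shutting_down) {
      return false;  // the added check: no new sessions during shutdown
    }
    sessions_.emplace(id, peer);
    return true;
  }

  void shutdown() {
    {
      std::lock_guard<std::mutex> g(lock_);
      state_ = State::shutting_down;  // from here on, add_session() refuses
      sessions_.clear();              // tear down existing sessions
    }
    // ... other teardown work runs here without the lock held ...
    std::lock_guard<std::mutex> g(lock_);
    assert(sessions_.empty());  // analogue of the failing ceph_assert
  }
};
```

With the state check in place, no session can be registered in the window between clearing the map and the final assert, which is the scenario the ceph_assert in the summary tripped on.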
Comment 4
Josh Durgin
2020-05-15 22:04:28 UTC
*** Bug 1842536 has been marked as a duplicate of this bug. ***

*** Bug 1879962 has been marked as a duplicate of this bug. ***

(In reply to Josh Durgin from comment #4)
> This is a crash during shutdown, so it has very little user impact.
> Additionally, it is a race condition seen rarely in the thousands of runs
> upstream. Thus marking it low/low severity and priority.

It was seen in OCS in a customer deployment, where 2 MONs crashed. I'm raising it to High/High for the time being, in the hope of understanding when it happens and how OCS recovers from it.

Assigning this to the 5.0 rc so it can be attached to the OCS 4.8 release.

(In reply to Yaniv Kaul from comment #8)
> (In reply to Josh Durgin from comment #4)
> > This is a crash during shutdown, so it has very little user impact.
> > Additionally, it is a race condition seen rarely in the thousands of runs
> > upstream. Thus marking it low/low severity and priority.
>
> It was seen in OCS in a customer deployment, where 2 MONs crashed. I'm
> raising it to High/High for the time being, in the hope of understanding
> when it happens and how OCS recovers from it.

Obviously we should not crash; however, there's no user impact here.

It's an assert hit when the monitor is already shutting down. OCS recovers by continuing to do what it was already going to do: start up new monitors.

That we expose things with no user impact as alerts in OCS is a supportability bug.

(In reply to Josh Durgin from comment #10)
> Obviously we should not crash; however, there's no user impact here.
>
> It's an assert hit when the monitor is already shutting down. OCS recovers
> by continuing to do what it was already going to do: start up new monitors.
>
> That we expose things with no user impact as alerts in OCS is a
> supportability bug.

The impact is indeed indirect: the health is not OK and cannot be resolved without support.

Created attachment 1721173 [details]
mon logs
We'd need a coredump or logs with messenger debugging to debug this. Is it reproducible?

Raz, can we reproduce this as Josh asked in comment 14 above?

*** Bug 1953345 has been marked as a duplicate of this bug. ***

Pawan, adding needinfo on you for tracking the recreation of this BZ.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Ceph Storage 5.1 Security, Enhancement, and Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1174
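For context on why the assert could fail at all, the thread above describes a rare shutdown race: a session registered after existing sessions are torn down but before the final emptiness check. A toy reproduction under those assumptions, with purely illustrative names (not Ceph code), might look like this:

```cpp
// Toy illustration of the shutdown race: without a "shutting down" check,
// a connection handled during teardown can leave a session behind and
// trip the final emptiness assert. Names are illustrative only.
#include <cassert>
#include <chrono>
#include <mutex>
#include <set>
#include <thread>

std::mutex lock;
std::set<int> sessions;  // stand-in for session_map.sessions

void handle_new_connection(int id) {
  std::lock_guard<std::mutex> g(lock);
  sessions.insert(id);  // no shutdown check here: this is the bug
}

void monitor_shutdown() {
  {
    std::lock_guard<std::mutex> g(lock);
    sessions.clear();  // remove existing sessions
  }
  // Window: other teardown work runs without the lock held.
  std::this_thread::sleep_for(std::chrono::milliseconds(10));
  std::lock_guard<std::mutex> g(lock);
  assert(sessions.empty());  // analogue of the failing ceph_assert
}

int main() {
  std::thread client(handle_new_connection, 42);  // races with shutdown
  monitor_shutdown();                             // may abort on the assert
  client.join();
  return 0;
}
```

Adding the same shutting-down check to handle_new_connection(), as in the sketch after the metadata fields above, closes this window.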