Bug 1967164 - Silence crash warning in osd removal job.
Summary: Silence crash warning in osd removal job.
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Mgr Plugins
Version: 4.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 7.0
Assignee: Neha Ojha
QA Contact: Sayalee
Docs Contact: ceph-docs@redhat.com
URL:
Whiteboard:
Depends On: 1896810
Blocks: 1882359
Reported: 2021-06-02 14:19 UTC by Pulkit Kundra
Modified: 2023-09-12 16:35 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1896810
Environment:
Last Closed: 2023-09-12 16:35:07 UTC
Embargoed:



Comment 5 Blaine Gardner 2021-06-30 17:55:44 UTC
Sorry for the late reply.

A few clarifications and some questions:

> 1. the osd itself does not do the crash collection, it is done by the crash mgr module

To my understanding, there is a crash-collector utility which reports crashes to the crash mgr module, yes. The fix likely needs to happen in one of these two places.

> 2. not sure I understand the title of this BZ, why doesn't this not apply to other daemons

In my opinion, this also applies to other daemons. If a daemon has been removed from Ceph, then we don't need to track crashes for it. Not all daemons are tracked individually, however. Mons are, but several RGWs might share the same authentication key, and one of them could be down temporarily. I think the reason to fix this only for OSDs is that these are the only daemons that have been giving us repeated trouble.

We could extend this to ignore crashes from OSDs that are not in the CRUSH map, as well as crashes from mons that are not in the monmap, but I think other daemons will have to keep the existing behavior.
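
A minimal Python sketch of that rule, for illustration only. It is not code from the crash module or the crash-collector. The function name and its parameters are hypothetical, and it assumes the crash report identifies its daemon with an entity name of the form "osd.3" or "mon.a". The caller is assumed to supply the set of OSD ids currently in the CRUSH map and the set of mon names currently in the monmap.

```python
from typing import Set


def should_register_crash(
    entity_name: str,
    osd_ids_in_crush: Set[int],
    mon_names_in_monmap: Set[str],
) -> bool:
    """Decide whether a new crash report should be registered.

    Hypothetical helper: all names and inputs here are assumptions.
    """
    # Split "osd.3" into ("osd", "3"); "client.rgw.foo" into ("client", "rgw.foo").
    daemon_type, _, daemon_id = entity_name.partition(".")

    if daemon_type == "osd":
        if not daemon_id.isdigit():
            # Cannot classify a malformed id, so keep the existing behavior.
            return True
        # Ignore crashes from OSDs that are no longer in the CRUSH map.
        return int(daemon_id) in osd_ids_in_crush

    if daemon_type == "mon":
        # Ignore crashes from mons that are no longer in the monmap.
        return daemon_id in mon_names_in_monmap

    # Other daemons (for example RGWs sharing a key) keep the existing behavior.
    return True
```

With OSDs 1 and 2 in the CRUSH map and a single mon "a", a crash from "osd.3" or "mon.b" would be ignored, while a crash from "osd.1" or "client.rgw.foo" would still be registered.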

> 3. there are several options in crash module to prune/remove/archive crashes, do any of them serve the purpose of this BZ? https://docs.ceph.com/en/latest/mgr/crash/

None of the options fix the underlying issue, which is that Ceph should not register new crashes for daemons that have been removed from the cluster.

