Bug 1967164

Summary: Silence crash warning in osd removal job.
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Pulkit Kundra <pkundra>
Component: Ceph-Mgr Plugins Assignee: Neha Ojha <nojha>
Ceph-Mgr Plugins sub component: crash QA Contact: Sayalee <saraut>
Status: NEW --- Docs Contact: ceph-docs <ceph-docs>
Severity: medium    
Priority: high CC: akupczyk, bhubbard, bniver, brgardne, ceph-eng-bugs, ebenahar, edonnell, ikave, madam, muagarwa, nberry, ngangadh, nojha, owasserm, rzarzyns, sdudhgao, shan, sostapov, sseshasa, tnielsen, vereddy, vumrao, yhatuka
Version: 4.2 Keywords: AutomationBackLog
Target Milestone: ---   
Target Release: 7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1896810 Environment:
Last Closed: Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1896810    
Bug Blocks: 1882359    

Comment 5 Blaine Gardner 2021-06-30 17:55:44 UTC
Sorry for the late reply.

A few clarifications and some questions:

> 1. the osd itself does not do the crash collection, it is done by the crash mgr module

To my understanding, there is a crash-collector utility which reports crashes to the crash mgr module, yes. The fix likely needs to happen in one of these 2 places.

> 2. not sure I understand the title of this BZ, why doesn't this not apply to other daemons

In my opinion, this applies to other daemons as well. If a daemon has been removed from Ceph, then we don't need to track crashes for it. Not all daemons are tracked individually, however. Mons are, but several RGWs might share the same authentication key, and one of them could be down temporarily. I think the reason to fix this only for OSDs is that they are the only daemons that have been giving us repeated trouble.

We could extend this to ignore crashes from OSDs that are not in the CRUSH map as well as crashes from mons that are not in the monmap, but I think other daemons will have to keep the existing behavior.
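
To make that proposal concrete, here is a rough sketch of the check, not the actual crash module code. The function name and its parameters are hypothetical; it assumes the caller already has the set of OSD ids in the CRUSH map and the set of mon names in the monmap, and that the crash report carries an entity name such as "osd.3" or "mon.a".

```python
from typing import Set


def should_register_crash(entity_name: str,
                          osds_in_crush: Set[int],
                          mons_in_monmap: Set[str]) -> bool:
    """Hypothetical filter: drop crashes from OSDs/mons that were removed."""
    # Split "osd.3" into ("osd", "3"); "client.rgw.foo" into ("client", "rgw.foo").
    daemon_type, _, daemon_id = entity_name.partition(".")
    if daemon_type == "osd" and daemon_id.isdigit():
        # Only register the crash if the OSD is still in the CRUSH map.
        return int(daemon_id) in osds_in_crush
    if daemon_type == "mon":
        # Only register the crash if the mon is still in the monmap.
        return daemon_id in mons_in_monmap
    # Other daemons (e.g. RGWs sharing a key) keep the existing behavior.
    return True
```

With this shape, a crash from a removed "osd.3" would be skipped, while anything that is not an OSD or a mon would still be registered as it is today.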

> 3. there are several options in crash module to prune/remove/archive crashes, do any of them serve the purpose of this BZ? https://docs.ceph.com/en/latest/mgr/crash/

None of the options fix the underlying issue, which is that Ceph should not register new crashes for daemons which have been removed from the cluster.