Bug 2175307

Summary: [RFE] Catch MDS damage to the dentry's first snapid
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Patrick Donnelly <pdonnell>
Component: CephFS
Assignee: Patrick Donnelly <pdonnell>
Status: CLOSED ERRATA
QA Contact: Hemanth Kumar <hyelloji>
Severity: high
Docs Contact: Akash Raj <akraj>
Priority: urgent
Version: 6.0
CC: akraj, bkunal, ceph-eng-bugs, cephqe-warriors, flucifre, gfarnum, mcaldeir, tserlin, vshankar
Target Milestone: ---
Keywords: FutureFeature
Target Release: 6.1
Hardware: All
OS: All
Whiteboard:
Fixed In Version: ceph-17.2.6-29.el9cp
Doc Type: Bug Fix
Doc Text:
.A code assert is added to the Ceph Manager daemon service to detect metadata corruption

Previously, a type of snapshot-related metadata corruption would be introduced by the manager daemon service for workloads running Postgres, and possibly others.

With this fix, a code assert is added to the manager daemon service which is triggered if a new corruption is detected. This reduces the proliferation of the damage, and allows the collection of logs to ascertain the cause.

[NOTE]
====
If daemons crash after the cluster is upgraded to {storage-product} 6.1, contact link:https://access.redhat.com/support/contact/technicalSupport/[_Red Hat support_] for analysis and corrective action.
====
Story Points: ---
Clone Of:
: 2181949
Environment:
Last Closed: 2023-06-15 09:16:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2181949, 2192813

Description Patrick Donnelly 2023-03-03 20:22:56 UTC
Description of problem:

This RFE requests functionality in the MDS to detect a specific kind of damage to metadata dentries. The damage is associated with a long-standing bug (#38452).

This change will catch the damage before it is persisted. If **new** damage is about to be written to persistent storage (i.e. RADOS), the MDS will abort rather than persist it. This will hopefully also yield logs from the same time period in which the damage was created, for analysis.
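
A minimal Python sketch of the check-before-persist idea follows. It is not the MDS implementation (which is C++); every name here (`Dentry`, `is_corrupt`, `commit`, `NewlyCorruptDentry`) is hypothetical, and the invariant it tests (the first snapid must not exceed the last snapid or the next snapid to be allocated) is an assumption made for illustration. An exception stands in for the MDS abort. Dentries that were already damaged when loaded are let through, on the reading that only **new** damage triggers the abort.

```python
from dataclasses import dataclass


class NewlyCorruptDentry(Exception):
    """Stands in for the MDS abort (hypothetical name)."""


@dataclass
class Dentry:
    name: str
    first: int  # first snapid for which this dentry is valid
    last: int   # last snapid for which this dentry is valid
    corrupt_when_loaded: bool = False


def is_corrupt(dn: Dentry, next_snapid: int) -> bool:
    # Assumed invariant, for illustration only: first <= last, and
    # first never exceeds the next snapid that would be allocated.
    return dn.first > dn.last or dn.first > next_snapid


def load(dn: Dentry, next_snapid: int) -> Dentry:
    # Remember damage that already exists in storage so that it is
    # not later mistaken for damage created by this process.
    dn.corrupt_when_loaded = is_corrupt(dn, next_snapid)
    return dn


def commit(dn: Dentry, next_snapid: int, store: dict) -> None:
    # Check immediately before writing: refuse to persist new damage.
    if is_corrupt(dn, next_snapid) and not dn.corrupt_when_loaded:
        raise NewlyCorruptDentry(f"refusing to persist dentry {dn.name!r}")
    store[dn.name] = (dn.first, dn.last)
```

In this sketch, `store` is a plain dictionary standing in for RADOS. Calling `commit` on a dentry with `first=12, last=10` raises before anything is written, while the same dentry passed through `load` first is written unchanged.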

https://tracker.ceph.com/issues/38452
https://tracker.ceph.com/issues/58482

Support documentation for customers who encounter the abort is forthcoming and will be available before 6.1 is released.

Comment 9 Manny 2023-05-12 17:11:17 UTC
Linked KCS #7010978 (https://access.redhat.com/solutions/7010978)

BR
Manny

Comment 12 errata-xmlrpc 2023-06-15 09:16:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3623