Bug 2183294 - [RFE] Catch MDS damage to the dentry's first snapid
Summary: [RFE] Catch MDS damage to the dentry's first snapid
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 5.3
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 5.3z3
Assignee: Patrick Donnelly
QA Contact: Hemanth Kumar
Docs Contact: lysanche
URL:
Whiteboard:
Depends On:
Blocks: 2203283
 
Reported: 2023-03-30 19:20 UTC by Patrick Donnelly
Modified: 2023-05-23 00:19 UTC
CC List: 10 users

Fixed In Version: ceph-16.2.10-164.el8cp
Doc Type: Bug Fix
Doc Text:
.A code assert is added to the Ceph Metadata Server (MDS) daemon to detect metadata corruption

Previously, a type of snapshot-related metadata corruption could be introduced by the MDS daemon for workloads running Postgres, and possibly others. With this fix, a code assert is added to the MDS daemon which is triggered if new corruption is detected. This reduces the proliferation of the damage and allows the collection of logs to ascertain the cause.

[NOTE]
====
If daemons crash after the cluster is upgraded to {storage-product} 5.3z3, contact link:https://access.redhat.com/support/contact/technicalSupport/[_Red Hat support_] for analysis and corrective action.
====
Clone Of:
Environment:
Last Closed: 2023-05-23 00:19:10 UTC
Embargoed:




Links
Red Hat Issue Tracker RHCEPH-6353 (last updated 2023-03-30 19:23:25 UTC)
Red Hat Knowledge Base (Solution) 7010978 (last updated 2023-05-22 21:05:53 UTC)
Red Hat Product Errata RHBA-2023:3259 (last updated 2023-05-23 00:19:46 UTC)

Description Patrick Donnelly 2023-03-30 19:20:31 UTC
This bug was initially created as a copy of Bug #2175307

I am copying this bug because: 

This is the 5.3z2 clone of the 6.1 BZ.

Description of problem:

This RFE is for functionality in the MDS to detect specific damage to metadata dentries. The damage is associated with a long-standing bug (#38452).

This change will catch the damage before it is persisted. If **new** damage is about to be written to persistent storage (i.e., RADOS), the MDS will abort rather than persist it. This should have the added benefit of providing logs from the same time period in which the damage was created, for analysis (see the sketch after the tracker links below).

https://tracker.ceph.com/issues/38452
https://tracker.ceph.com/issues/58482

Documentation for support when customers encounter the abort will be forthcoming and available before 6.1 is released.

Comment 2 Amarnath 2023-03-31 15:48:27 UTC
Hi Patrick,

We are analyzing the issue.
Could you please share reproduction steps for it?

Regards,
Amarnath

Comment 3 Greg Farnum 2023-04-03 14:11:04 UTC
As initial backport tests did not pass and we are on a tight deadline, I am moving this to 5.3z3 for now. :(
We may issue an async erratum (or even grab it back) if we figure out and resolve the test issues fast enough.

Comment 4 Patrick Donnelly 2023-04-03 18:07:30 UTC
(In reply to Amarnath from comment #2)
> Hi Patrick,
> 
> We are analyzing the issue.
> Could you please share reproduction steps for it?

You would run these tests:

https://gitlab.cee.redhat.com/ceph/ceph/-/merge_requests/278/diffs#7b2dc3f617cfcca3e13c38ef537cd6355175ac6b_565_567

Comment 25 errata-xmlrpc 2023-05-23 00:19:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.3 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3259

