Bug 2147472 - CephFS corruption in ODF 4.9.11, running 'scrub / recursive repair' results in active MDS crash
Summary: CephFS corruption in ODF 4.9.11, running 'scrub / recursive repair' results in active MDS crash
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.9
Hardware: All
OS: All
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Venky Shankar
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-11-24 01:29 UTC by kelwhite
Modified: 2023-08-09 16:37 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-12-02 02:38:05 UTC
Embargoed:


Attachments

Comment 11 khover 2022-11-28 21:27:09 UTC
Hello,

I have linked a case where the customer hit the same issue: running 'scrub / recursive repair' results in an active MDS crash.


Current state:

cluster:
    id:     9af2f934-61de-462d-b5ab-25439dace333
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged

  services:
    mon: 3 daemons, quorum b,f,g (age 2d)
    mgr: a(active, since 2d)
    mds: ocs-storagecluster-cephfilesystem:0/1 2 up:standby, 1 damaged

Debug logs are uploaded to supportshell:

/cases/03370989

|    11 |  0110  | ceph-mds.log.tar.gz                                                     |     8346.13 | 2022-11-27 12:40 UTC | S3       |     Yes  |
|    12 |  0120  | ceph-mds.ocs-storagecluster-cephfilesystem-a.log.bz2                    |      187.72 | 2022-11-28 19:10 UTC | S3       |     Yes  |
|    13 |  0130  | ceph-mds.ocs-storagecluster-cephfilesystem-b.log.bz2                    |      591.04 | 2022-11-28 19:10 UTC | S3       |     Yes  |
|    14 |  0140  | ceph-mds.ocs-storagecluster-cephfilesystem-b.log.bz2                    |    11714.43 | 2022-11-28 19:14 UTC | S3       |     Yes  |
|    15 |  0150  | ceph-mds.ocs-storagecluster-cephfilesystem-a.log.bz2                    |     3834.11 | 2022-11-28 19:14 UTC | S3       |     Yes  |


Please let me know if the data set is incomplete or if additional logs are needed.

Comment 12 khover 2022-11-30 13:14:54 UTC
Customer update from my case 03370989:

From what we have gathered, our timeline was:



Fri, 25th ~14:20: MDS pod is OOM-killed and the standby pod went into up:replay

Fri, 25th ~20:45: resource limits patched and the MDS scrub starts -> MDS goes into down:damaged after liveness probes fail and the pod is once again killed by the kubelet

Sat/Sun, 26th/27th: marking the MDS as repaired leaves them in a standby state (command noted after the timeline)

Mon: We decided to recover from the journal, which got things back to normal
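
For context, "marking the MDS as repaired" above presumably refers to the standard command for clearing the damaged flag on a rank (rank 0 assumed here, matching the journal-tool commands below):

ceph mds repaired ocs-storagecluster-cephfilesystem:0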


To recover, we basically followed the Ceph documentation (https://docs.ceph.com/en/nautilus/cephfs/disaster-recovery-experts/).
Recover metadata from the journal:


cephfs-journal-tool --rank=ocs-storagecluster-cephfilesystem:0 event recover_dentries summary
Truncate the journal:


cephfs-journal-tool --rank=ocs-storagecluster-cephfilesystem:0 journal reset
Reset the session map:


cephfs-table-tool ocs-storagecluster-cephfilesystem:0 reset session
We then restarted the MDS pods and saw one go into up:replay. However, the liveness probe didn't complete in time, so we temporarily replaced the probe command with a simple echo so the pod would not be repeatedly killed during replay (a sketch of that workaround follows the scrub command below). Replay finished after some 15 minutes and the filesystem was up again.
Finally, we ran the MDS scrub:


ceph tell mds.0 scrub start / recursive repair
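
The probe workaround described above was presumably applied by patching the MDS deployments directly; a minimal sketch, assuming the standard rook-ceph deployment names in openshift-storage and an exec-style liveness probe (the operator is scaled down first so it does not revert the patch, and scaling it back up later should restore the original probe on reconcile):

oc scale deployment rook-ceph-operator -n openshift-storage --replicas=0
oc patch deployment rook-ceph-mds-ocs-storagecluster-cephfilesystem-a -n openshift-storage \
  --type=json -p='[{"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/exec/command","value":["echo","ok"]}]'

Scrub progress after the start command can then be checked with the same addressing used above:

ceph tell mds.0 scrub status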

Comment 16 khover 2022-12-01 21:21:04 UTC
Hi Venky,

Unfortunately, the customer was unwilling to wait and applied the upstream solution.

I think all the issues they encountered are part of a trend we are seeing more and more with MDS/Ceph:

Customer over-utilization of the CephFS storage class.

From this case, 03370989:

$ less namespaces/openshift-storage/oc_output/volumesnapshot_-A | grep k10-csi-snap | wc -l
133

This VolumeSnapshotClass, k10-clone-ocs-storagecluster-cephfsplugin-snapclass, has its deletion policy set to Retain.

NAME                                                  DRIVER                                  DELETIONPOLICY   AGE
k10-clone-ocs-storagecluster-cephfsplugin-snapclass   openshift-storage.cephfs.csi.ceph.com   Retain           38d
ocs-storagecluster-cephfsplugin-snapclass             openshift-storage.cephfs.csi.ceph.com   Delete           657d
ocs-storagecluster-rbdplugin-snapclass                openshift-storage.rbd.csi.ceph.com      Delete           657d


$ less namespaces/openshift-storage/oc_output/volumesnapshotcontent | grep k10-clone-ocs-storagecluster-cephfsplugin-snapclass | wc -l
1042
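
On a live cluster, the same breakdown can presumably be pulled per snapshot class straight from the VolumeSnapshotContent objects, e.g.:

oc get volumesnapshotcontent -o json | jq -r '.items[].spec.volumeSnapshotClassName' | sort | uniq -c | sort -rn

which would show how many of the retained contents belong to the k10 clone class versus the default ODF classes.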

RAW STORAGE:
    CLASS     SIZE       AVAIL       USED        RAW USED     %RAW USED 
    hdd       12 TiB     5.4 TiB     6.4 TiB      6.6 TiB         54.88 
    TOTAL     12 TiB     5.4 TiB     6.4 TiB      6.6 TiB         54.88 
 
POOLS:
    POOL                                                      ID     STORED      OBJECTS     USED        %USED     MAX AVAIL     QUOTA OBJECTS     QUOTA BYTES     DIRTY      USED COMPR     UNDER COMPR 
    ocs-storagecluster-cephblockpool                           1      52 GiB      13.83k     155 GiB      3.65       1.3 TiB     N/A               N/A             13.83k            0 B             0 B 
    .rgw.root                                                  2     4.6 KiB          16     2.8 MiB         0       1.3 TiB     N/A               N/A                 16            0 B             0 B 
    ocs-storagecluster-cephobjectstore.rgw.control             3         0 B           8         0 B         0       1.3 TiB     N/A               N/A                  8            0 B             0 B 
    ocs-storagecluster-cephfilesystem-metadata                 4      14 GiB      19.21k      15 GiB      0.38       1.3 TiB     N/A               N/A             19.21k            0 B             0 B 
    ocs-storagecluster-cephobjectstore.rgw.meta                5     3.7 KiB          14     2.3 MiB         0       1.3 TiB     N/A               N/A                 14            0 B             0 B 
    ocs-storagecluster-cephfilesystem-data0                    6     319 GiB      37.01M     5.4 TiB     57.24       1.3 TiB     N/A               N/A             37.01M            0 B             0 B

