Bug 2147472

Summary: CephFS corruption in ODF 4.9.11, running 'scrub / recursive repair' results in active MDS crash
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: kelwhite
Component: ceph
ceph sub component: CephFS
Assignee: Venky Shankar <vshankar>
QA Contact: Elad <ebenahar>
Status: CLOSED INSUFFICIENT_DATA
Docs Contact:
Severity: urgent
Priority: unspecified
CC: bhubbard, bniver, gfarnum, hnallurv, hyelloji, khover, madam, muagarwa, ocs-bugs, odf-bz-bot, vshankar, xiubli
Version: 4.9
Target Milestone: ---
Target Release: ---
Hardware: All
OS: All
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-12-02 02:38:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Comment 11 khover 2022-11-28 21:27:09 UTC
Hello,

I have linked a case in which the customer hit the same issue: running 'scrub / recursive repair' results in an active MDS crash.


Current state:

  cluster:
    id:     9af2f934-61de-462d-b5ab-25439dace333
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged

  services:
    mon: 3 daemons, quorum b,f,g (age 2d)
    mgr: a(active, since 2d)
    mds: ocs-storagecluster-cephfilesystem:0/1 2 up:standby, 1 damaged
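
(For reference, when a rank is reported damaged like this, the usual next step is to pull more detail from the cluster. A minimal sketch, assuming the commands are run from the rook-ceph toolbox pod and using the filesystem name shown above:

    ceph health detail                                  # which rank is damaged and why it is flagged
    ceph fs status ocs-storagecluster-cephfilesystem    # per-rank MDS states plus metadata/data pool usage
    ceph fs dump                                        # full FSMap, including the set of damaged ranks

These are read-only status commands and do not change cluster state.)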

Debug logs are uploaded to supportshell 

/cases/03370989

|    11 |  0110  | ceph-mds.log.tar.gz                                                     |     8346.13 | 2022-11-27 12:40 UTC | S3       |     Yes  |
|    12 |  0120  | ceph-mds.ocs-storagecluster-cephfilesystem-a.log.bz2                    |      187.72 | 2022-11-28 19:10 UTC | S3       |     Yes  |
|    13 |  0130  | ceph-mds.ocs-storagecluster-cephfilesystem-b.log.bz2                    |      591.04 | 2022-11-28 19:10 UTC | S3       |     Yes  |
|    14 |  0140  | ceph-mds.ocs-storagecluster-cephfilesystem-b.log.bz2                    |    11714.43 | 2022-11-28 19:14 UTC | S3       |     Yes  |
|    15 |  0150  | ceph-mds.ocs-storagecluster-cephfilesystem-a.log.bz2                    |     3834.11 | 2022-11-28 19:14 UTC | S3       |     Yes  |


Please let me know if the data set is incomplete or if additional logs are needed.
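
(If a more verbose log set turns out to be needed, MDS debug levels can be raised centrally before reproducing; a sketch only, since these settings are noisy and should be reverted once the logs are captured:

    ceph config set mds debug_mds 20    # verbose MDS logging
    ceph config set mds debug_ms 1      # messenger-level logging

Afterwards, 'ceph config rm mds debug_mds' and 'ceph config rm mds debug_ms' restore the defaults.)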

Comment 12 khover 2022-11-30 13:14:54 UTC
Customer update from my case 03370989:

From what we have gathered, our timeline was:

Fri 25th, ~14:20: the mds pod was OOM-killed and the standby pod went into up:replay

Fri 25th, ~20:45: resource limits were patched and the mds scrub started -> the mds went into down:damaged after the liveness probes failed and the pod was once again killed by the kubelet

Sat/Sun 26th/27th: marking the mds as repaired left them in a standby state (see the command sketch after this timeline)

Mon: we decided to recover from the journal, which got things back to normal
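
(The 'marking the mds as repaired' step above refers to clearing the damaged flag on the rank so that a standby can be assigned to it again; a minimal sketch for this cluster, assuming rank 0 of the filesystem named above:

    ceph mds repaired ocs-storagecluster-cephfilesystem:0

Note that this only clears the damaged flag in the FSMap; it does not repair any metadata by itself.)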


To recover we basically followed the Ceph documentation (https://docs.ceph.com/en/nautilus/cephfs/disaster-recovery-experts/):

Recover metadata from the journal:

cephfs-journal-tool --rank=ocs-storagecluster-cephfilesystem:0 event recover_dentries summary

Truncate the journal:

cephfs-journal-tool --rank=ocs-storagecluster-cephfilesystem:0 journal reset

Reset the session map:

cephfs-table-tool ocs-storagecluster-cephfilesystem:0 reset session

We then restarted the mds pods and saw one go into up:replay. However, the liveness probe didn't complete in time, so we temporarily replaced the probe command with a simple echo so the pod would not be repeatedly killed during replay (see the sketch after this paragraph). Replay finished after some 15 minutes and the filesystem was up again.
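
(For reference, one way to make that temporary probe replacement on an ODF cluster is to patch the MDS deployment directly. A rough sketch, assuming the deployment is named rook-ceph-mds-ocs-storagecluster-cephfilesystem-a, that the mds container is the first container in the pod spec, and that its liveness probe is an exec probe; note that the rook-ceph operator will normally reconcile this change away unless the operator is scaled down or the probe is disabled via the CRs:

    oc -n openshift-storage patch deployment rook-ceph-mds-ocs-storagecluster-cephfilesystem-a \
      --type=json \
      -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/exec/command", "value": ["echo", "ok"]}]'

This makes the probe always succeed; the original probe should be restored once replay has finished.)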
Finally we ran the mds scrub:


ceph tell mds.0 scrub start / recursive repair
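
(Scrub progress can then be watched from the same MDS; a small sketch, assuming the same mds.0 target as above and a Ceph release that has the scrub status command, as the Pacific build in ODF 4.9 does:

    ceph tell mds.0 scrub status    # whether a scrub is active, plus its tag and path
    ceph -s                         # overall cluster health while the scrub/repair runs

Neither command modifies anything; they only report state.)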

Comment 16 khover 2022-12-01 21:21:04 UTC
Hi Venky,

Unfortunately, the customer was unwilling to wait and applied the upstream solution.

I think all the issues they encountered are part of a trend we are seeing more and more with MDS/Ceph: customer over-utilization of the CephFS storage class.

In this case (03370989):

$ less namespaces/openshift-storage/oc_output/volumesnapshot_-A | grep k10-csi-snap | wc -l
133

The deletion policy of this VolumeSnapshotClass, k10-clone-ocs-storagecluster-cephfsplugin-snapclass, is set to Retain:

NAME                                                  DRIVER                                  DELETIONPOLICY   AGE
k10-clone-ocs-storagecluster-cephfsplugin-snapclass   openshift-storage.cephfs.csi.ceph.com   Retain           38d
ocs-storagecluster-cephfsplugin-snapclass             openshift-storage.cephfs.csi.ceph.com   Delete           657d
ocs-storagecluster-rbdplugin-snapclass                openshift-storage.rbd.csi.ceph.com      Delete           657d


$ less namespaces/openshift-storage/oc_output/volumesnapshotcontent | grep k10-clone-ocs-storagecluster-cephfsplugin-snapclass | wc -l
1042
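
(The same count can be taken from a live cluster rather than from the collected oc_output; a sketch, assuming jq is available and filtering VolumeSnapshotContents on .spec.volumeSnapshotClassName:

    oc get volumesnapshotcontent -o json \
      | jq '[.items[] | select(.spec.volumeSnapshotClassName == "k10-clone-ocs-storagecluster-cephfsplugin-snapclass")] | length'

Each retained VolumeSnapshotContent corresponds to a CephFS subvolume snapshot on the backend, which is why a four-digit count matters for MDS load.)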

RAW STORAGE:
    CLASS     SIZE       AVAIL       USED        RAW USED     %RAW USED 
    hdd       12 TiB     5.4 TiB     6.4 TiB      6.6 TiB         54.88 
    TOTAL     12 TiB     5.4 TiB     6.4 TiB      6.6 TiB         54.88 
 
POOLS:
    POOL                                                      ID     STORED      OBJECTS     USED        %USED     MAX AVAIL     QUOTA OBJECTS     QUOTA BYTES     DIRTY      USED COMPR     UNDER COMPR 
    ocs-storagecluster-cephblockpool                           1      52 GiB      13.83k     155 GiB      3.65       1.3 TiB     N/A               N/A             13.83k            0 B             0 B 
    .rgw.root                                                  2     4.6 KiB          16     2.8 MiB         0       1.3 TiB     N/A               N/A                 16            0 B             0 B 
    ocs-storagecluster-cephobjectstore.rgw.control             3         0 B           8         0 B         0       1.3 TiB     N/A               N/A                  8            0 B             0 B 
    ocs-storagecluster-cephfilesystem-metadata                 4      14 GiB      19.21k      15 GiB      0.38       1.3 TiB     N/A               N/A             19.21k            0 B             0 B 
    ocs-storagecluster-cephobjectstore.rgw.meta                5     3.7 KiB          14     2.3 MiB         0       1.3 TiB     N/A               N/A                 14            0 B             0 B 
    ocs-storagecluster-cephfilesystem-data0                    6     319 GiB      37.01M     5.4 TiB     57.24       1.3 TiB     N/A               N/A             37.01M            0 B             0 B
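
(To see where those retained snapshots land on the filesystem itself, the CSI-provisioned subvolumes and their snapshots can be listed from the toolbox; a sketch, assuming the default CSI subvolume group name "csi":

    ceph fs subvolumegroup ls ocs-storagecluster-cephfilesystem
    ceph fs subvolume ls ocs-storagecluster-cephfilesystem --group_name csi
    ceph fs subvolume snapshot ls ocs-storagecluster-cephfilesystem <subvolume_name> --group_name csi

Listing the snapshots per subvolume shows where the ~1000 retained k10 snapshots actually live, alongside the 37M objects on ocs-storagecluster-cephfilesystem-data0 shown above.)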