RDR is Regional Disaster Recovery, where we have 3 OCP clusters tied together. The 1st is the hub, where RHACM is installed; it manages the other 2 OCP clusters using the Submariner add-on. The 2nd and 3rd clusters have ODF installed and are connected via Submariner. In case of a disaster, failover/relocate operations are performed between them to recover and ensure business continuity with minimal data loss and minimal application downtime. At any given time, one of the managed clusters with ODF acts as the primary cluster where IOs are run, and the other acts as the secondary (standby) cluster to which data is continuously replicated and which can take over in case of such an event.
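For reference, a rough way to inspect this topology from the hub cluster; this is only a sketch assuming the usual RHACM/ODF DR resources are present, and output will differ per environment:

  # Run on the hub cluster
  oc get managedclusters              # the two ODF-backed managed clusters imported into RHACM
  oc get drpolicy                     # DRPolicy pairing the two managed clusters for failover/relocate
  oc get drplacementcontrol -A        # per-workload DR state (which cluster is primary, failover progress)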
Milind, PTAL.
Milind, did you get a chance to take a look?
Milind, this question should be directed to the ceph build team (Ken Dryer). We just consume the tag provided by the ceph build team when building ODF. If you want build details, this link can help: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2543192
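For completeness, a quick way to confirm which ceph build an ODF cluster is actually running before looking it up in Brew; the toolbox label and namespace below are the usual ODF defaults, so treat them as assumptions:

  # From the ODF toolbox pod
  oc -n openshift-storage rsh "$(oc get po -n openshift-storage -l app=rook-ceph-tools -o name)" ceph versions
  # The reported NVR (e.g. 18.2.1-136.el9cp) is the ceph build to search for in brewweb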
The full impact of this BZ on an RDR setup is yet to be assessed, as we couldn't perform failover/relocate operations on CephFS-based workloads due to submariner connectivity issues (https://issues.redhat.com/browse/ACM-7600). However, both the active and the standby MDS crash multiple times while running IOs, as reported above, and the issue reproduces consistently, which makes it crucial.
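For anyone trying to confirm the crash loop on a similar setup, these are roughly the checks we rely on; the namespace and labels are the ODF defaults, so adjust if your install differs:

  # Restart counts on the MDS pods keep climbing while IOs are running
  oc get pods -n openshift-storage -l app=rook-ceph-mds -o wide
  # New crash entries keep appearing for both filesystem-a and filesystem-b MDS daemons
  oc -n openshift-storage rsh "$(oc get po -n openshift-storage -l app=rook-ceph-tools -o name)" ceph crash ls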
@vshankar, would it be possible to provide an ETA for this fix?
This crash was reproduced as reported in https://bugzilla.redhat.com/show_bug.cgi?id=2282346#c3

ceph version 18.2.1-136.el9cp (e7edde2b655d0dd9f860dda675f9d7954f07e6e3) reef (stable)
OCP 4.16.0-0.nightly-2024-04-26-145258
ODF 4.16.0-89.stable
ACM 2.10.2
MCE 2.5.2

Steps to Reproduce:
*****Active hub co-situated with primary managed cluster*****
1. On an RDR setup, perform a site failure by bringing the active hub and the primary managed cluster down, then move to the passive hub by performing hub recovery.
2. Failover all the workloads running on the down managed cluster to the surviving managed cluster.
3. After successful failover, recover the down managed cluster.
4. Failover one of the CephFS workloads where PeerReady is marked as true but the replication destination isn't created because of the eviction period (24 hrs as of now).
5. Ensure the cluster is cleaned up after the eviction period times out, the failover is successful, and data sync is resumed between the managed clusters.

Actual results:
MDS crash is seen on the surviving cluster C2, to which the workloads were failed over.

pods | grep mds
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c   2/2   Running   1023 (9m33s ago)   7d13h   10.128.2.63    compute-2   <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd58665vqjsp   2/2   Running   1006 (17m ago)     7d13h   10.131.0.241   compute-1   <none>   <none>

oc -n openshift-storage rsh "$(oc get po -n openshift-storage -l app=rook-ceph-tools -o name)" ceph crash ls
ID                                                                 ENTITY                                    NEW
2024-05-14T17:33:05.811016Z_b5585b5b-3a3a-4838-93d3-13ccfb04b669   mds.ocs-storagecluster-cephfilesystem-a
2024-05-14T18:52:04.860762Z_c49454e8-2180-42d0-b247-6b31619ecd12   mds.ocs-storagecluster-cephfilesystem-b
2024-05-15T11:11:58.726060Z_bf8483a9-8f22-465f-83c3-93b2f34710f3   mds.ocs-storagecluster-cephfilesystem-a   *
2024-05-15T12:26:15.553277Z_eae1ba47-82fd-4bf1-88c1-8a816f67ab65   mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-16T07:07:03.233208Z_2a86dc2e-e6cf-4f07-aa8b-9d7b9eec803f   mds.ocs-storagecluster-cephfilesystem-a   *
2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd   mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-17T20:34:57.807764Z_dd0521e6-e92f-40e2-afe9-3c3f1768df07   mds.ocs-storagecluster-cephfilesystem-a   *
2024-05-17T21:56:39.158029Z_12c1efa9-ecfc-4c32-9024-77423ae09ecf   mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-18T00:36:35.787255Z_40eb624a-a7ed-4415-b17f-4085bd9eac9b   mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-18T03:32:14.163891Z_84c9b511-ac45-4bf2-9040-8014883e80a9   mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-19T12:56:35.497622Z_b3c014e8-cba6-496d-a673-e2b2f026be21   mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-20T00:27:27.924980Z_8085a329-6afb-4b49-a76c-b86db9d94109   mds.ocs-storagecluster-cephfilesystem-a   *
2024-05-20T08:06:13.398353Z_5ebabaee-807c-4910-ba27-14eeee4b4fba   mds.ocs-storagecluster-cephfilesystem-a   *
2024-05-20T17:27:47.267271Z_2c553b1c-d6db-43e8-9421-2994c1745e40   mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-20T18:18:11.530034Z_bf609bf3-25c1-4342-ad7b-f942579daeeb   mds.ocs-storagecluster-cephfilesystem-a   *
2024-05-21T22:21:43.087757Z_06bdcccb-e7f5-4f95-a310-a053aed2fc0a   mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-22T01:07:49.214137Z_bd01b408-376c-4c80-822e-ba3ae76e371f   mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-22T01:13:58.832386Z_9da0f105-acb7-4026-a497-0f0da1b77f08   mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-22T03:54:30.740192Z_0dc95829-aa4a-4e34-aaeb-39cd3cf4d7c7   mds.ocs-storagecluster-cephfilesystem-a   *

Out of all these crashes, the one below is the crash actually reported in this BZ.

bash-5.1$ ceph crash info 2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7ff74acafdb0]",
        "ceph-mds(+0x22d2ed) [0x557c354f42ed]",
        "ceph-mds(+0x5a7d02) [0x557c3586ed02]",
        "(EMetaBlob::fullbit::update_inode(MDSRank*, CInode*)+0x51) [0x557c35778fc1]",
        "(EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x79e) [0x557c3577fb6e]",
        "(EOpen::replay(MDSRank*)+0x55) [0x557c3578c6c5]",
        "(MDLog::_replay_thread()+0x75e) [0x557c356ea52e]",
        "ceph-mds(+0x16cf21) [0x557c35433f21]",
        "/lib64/libc.so.6(+0x9f802) [0x7ff74acfa802]",
        "/lib64/libc.so.6(+0x3f450) [0x7ff74ac9a450]"
    ],
    "ceph_version": "18.2.1-136.el9cp",
    "crash_id": "2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "534595eadbe3cd5e36a861179a9d229df6085a48ed4bf3ee7982825650a239f5",
    "timestamp": "2024-05-17T10:56:29.510669Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd58665vqjsp",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
}

This crash may have repeated on this setup, but I am pasting just one of its occurrences.

Must-gather logs from the cluster are kept here:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/20may24-mds-high-log-level/
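For context on the "mds-high-log-level" must-gather above: logs were collected with a raised MDS debug level. A rough sketch of the kind of commands used for that from the toolbox pod is below; the specific levels (20/1) are common triage values and should be treated as illustrative, not necessarily the exact settings applied on this cluster:

  # Run from the rook-ceph-tools pod
  ceph config set mds debug_mds 20
  ceph config set mds debug_ms 1
  # ...let the crash reproduce, then pull details for the new crash entry
  ceph crash ls
  ceph crash info <crash-id>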