Bug 2282346

Summary:	[RDR] Multiple MDS crashes seen on the surviving cluster post hub recovery
Product:	[Red Hat Storage] Red Hat OpenShift Data Foundation	Reporter:	Aman Agrawal <amagrawa>
Component:	ceph	Assignee:	Venky Shankar <vshankar>
ceph sub component:	CephFS	QA Contact:	Elad <ebenahar>
Status:	ASSIGNED ---	Docs Contact:
Severity:	high
Priority:	unspecified	CC:	bniver, khiremat, muagarwa, sheggodu, sostapov, vshankar
Version:	4.16	Flags:	amagrawa: needinfo? (vshankar)
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:		Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Aman Agrawal 2024-05-22 08:55:44 UTC

Description of problem (please be detailed as possible and provide log
snippests):


Version of all relevant components (if applicable):

ceph version 18.2.1-136.el9cp (e7edde2b655d0dd9f860dda675f9d7954f07e6e3) reef (stable)
OCP 4.16.0-0.nightly-2024-04-26-145258
ODF 4.16.0-89.stable
ACM 2.10.2
MCE 2.5.2


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Can this issue reproducible?


Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:

*****Active hub co-situated with primary managed cluster*****

1. On a RDR setup, perform site-failure by bringing active hub and primary managed cluster down and move to passive hub by performing hub recovery.
2. Then failover all the workloads running on down managed cluster to the surviving managed cluster.
3. After successful failover, recover the down managed cluster.
4. Now failover one of the cephfs workloads where peer ready is marked as true but replication destination isn't created due to eviction period which is 24hrs as of now. Successful failover should be possible after BZ2283038 is fixed.

During all these operations, keep an eye on health of MDS pods and crashes if any.


Actual results: MDS crash is seen on the surviving cluster C2 on which workloads were failedover. 

pods|grep mds
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c   2/2     Running     1023 (9m33s ago)   7d13h   10.128.2.63    compute-2   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd58665vqjsp   2/2     Running     1006 (17m ago)     7d13h   10.131.0.241   compute-1   <none>           <none>


oc -n openshift-storage rsh "$(oc get po -n openshift-storage -l app=rook-ceph-tools -o name)" ceph crash ls
ID                                                                ENTITY                                   NEW
2024-05-14T17:33:05.811016Z_b5585b5b-3a3a-4838-93d3-13ccfb04b669  mds.ocs-storagecluster-cephfilesystem-a
2024-05-14T18:52:04.860762Z_c49454e8-2180-42d0-b247-6b31619ecd12  mds.ocs-storagecluster-cephfilesystem-b
2024-05-15T11:11:58.726060Z_bf8483a9-8f22-465f-83c3-93b2f34710f3  mds.ocs-storagecluster-cephfilesystem-a   *
2024-05-15T12:26:15.553277Z_eae1ba47-82fd-4bf1-88c1-8a816f67ab65  mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-16T07:07:03.233208Z_2a86dc2e-e6cf-4f07-aa8b-9d7b9eec803f  mds.ocs-storagecluster-cephfilesystem-a   *
2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd  mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-17T20:34:57.807764Z_dd0521e6-e92f-40e2-afe9-3c3f1768df07  mds.ocs-storagecluster-cephfilesystem-a   *
2024-05-17T21:56:39.158029Z_12c1efa9-ecfc-4c32-9024-77423ae09ecf  mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-18T00:36:35.787255Z_40eb624a-a7ed-4415-b17f-4085bd9eac9b  mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-18T03:32:14.163891Z_84c9b511-ac45-4bf2-9040-8014883e80a9  mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-19T12:56:35.497622Z_b3c014e8-cba6-496d-a673-e2b2f026be21  mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-20T00:27:27.924980Z_8085a329-6afb-4b49-a76c-b86db9d94109  mds.ocs-storagecluster-cephfilesystem-a   *
2024-05-20T08:06:13.398353Z_5ebabaee-807c-4910-ba27-14eeee4b4fba  mds.ocs-storagecluster-cephfilesystem-a   *
2024-05-20T17:27:47.267271Z_2c553b1c-d6db-43e8-9421-2994c1745e40  mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-20T18:18:11.530034Z_bf609bf3-25c1-4342-ad7b-f942579daeeb  mds.ocs-storagecluster-cephfilesystem-a   *
2024-05-21T22:21:43.087757Z_06bdcccb-e7f5-4f95-a310-a053aed2fc0a  mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-22T01:07:49.214137Z_bd01b408-376c-4c80-822e-ba3ae76e371f  mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-22T01:13:58.832386Z_9da0f105-acb7-4026-a497-0f0da1b77f08  mds.ocs-storagecluster-cephfilesystem-b   *
2024-05-22T03:54:30.740192Z_0dc95829-aa4a-4e34-aaeb-39cd3cf4d7c7  mds.ocs-storagecluster-cephfilesystem-a   *



bash-5.1$ ceph crash info 2024-05-14T17:33:05.811016Z_b5585b5b-3a3a-4838-93d3-13ccfb04b669
{
    "archived": "2024-05-14 19:09:39.585770",
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f1a1abf3db0]",
        "(MDSTableClient::got_journaled_ack(unsigned long)+0x123) [0x56062dc65c43]",
        "(MDLog::_replay_thread()+0x75e) [0x56062dcb852e]",
        "ceph-mds(+0x16cf21) [0x56062da01f21]",
        "/lib64/libc.so.6(+0x9f802) [0x7f1a1ac3e802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f1a1abde450]"
    ],
    "ceph_version": "18.2.1-136.el9cp",
    "crash_id": "2024-05-14T17:33:05.811016Z_b5585b5b-3a3a-4838-93d3-13ccfb04b669",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "36c718b7130271b051731a63cab7a55ab268d2ea09f56572013c03a500e81a80",
    "timestamp": "2024-05-14T17:33:05.811016Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfd6f5m4",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
}



bash-5.1$ ceph crash info 2024-05-15T11:11:58.726060Z_bf8483a9-8f22-465f-83c3-93b2f34710f3
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f60d8013db0]",
        "(MDSTableClient::got_journaled_ack(unsigned long)+0x123) [0x55daad309c43]",
        "(MDLog::_replay_thread()+0x75e) [0x55daad35c52e]",
        "ceph-mds(+0x16cf21) [0x55daad0a5f21]",
        "/lib64/libc.so.6(+0x9f802) [0x7f60d805e802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f60d7ffe450]"
    ],
    "ceph_version": "18.2.1-136.el9cp",
    "crash_id": "2024-05-15T11:11:58.726060Z_bf8483a9-8f22-465f-83c3-93b2f34710f3",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "36c718b7130271b051731a63cab7a55ab268d2ea09f56572013c03a500e81a80",
    "timestamp": "2024-05-15T11:11:58.726060Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
}



bash-5.1$ ceph crash info 2024-05-15T12:26:15.553277Z_eae1ba47-82fd-4bf1-88c1-8a816f67ab65
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f7cbf29bdb0]",
        "(MDSTableClient::got_journaled_ack(unsigned long)+0x123) [0x55bd68c5dc43]",
        "(MDLog::_replay_thread()+0x75e) [0x55bd68cb052e]",
        "ceph-mds(+0x16cf21) [0x55bd689f9f21]",
        "/lib64/libc.so.6(+0x9f802) [0x7f7cbf2e6802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f7cbf286450]"
    ],
    "ceph_version": "18.2.1-136.el9cp",
    "crash_id": "2024-05-15T12:26:15.553277Z_eae1ba47-82fd-4bf1-88c1-8a816f67ab65",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "36c718b7130271b051731a63cab7a55ab268d2ea09f56572013c03a500e81a80",
    "timestamp": "2024-05-15T12:26:15.553277Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd58665vqjsp",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"



bash-5.1$ ceph crash info 2024-05-16T07:07:03.233208Z_2a86dc2e-e6cf-4f07-aa8b-9d7b9eec803f
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f8e450e2db0]",
        "(MDSTableClient::got_journaled_ack(unsigned long)+0x123) [0x5616ea8d8c43]",
        "(MDLog::_replay_thread()+0x75e) [0x5616ea92b52e]",
        "ceph-mds(+0x16cf21) [0x5616ea674f21]",
        "/lib64/libc.so.6(+0x9f802) [0x7f8e4512d802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f8e450cd450]"
    ],
    "ceph_version": "18.2.1-136.el9cp",
    "crash_id": "2024-05-16T07:07:03.233208Z_2a86dc2e-e6cf-4f07-aa8b-9d7b9eec803f",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "36c718b7130271b051731a63cab7a55ab268d2ea09f56572013c03a500e81a80",
    "timestamp": "2024-05-16T07:07:03.233208Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
}


Must-gather logs from the cluster is kept here- http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/20may24-mds-high-log-level/



Expected results: MDS shouldn't unexpectedly crash when cluster isn't heavily loaded and failover is performed.


Additional info:

Comment 3 Aman Agrawal 2024-05-22 09:38:11 UTC

As per offline discussion with Venky:

Since different crash info is seen on different crashes, we don't know yet which crash is causing the failover to remain stuck. We will use this BZ to investigate further and open new BZ if needed based upon the findings.


A few crash outputs for reference:


bash-5.1$ ceph crash info 2024-05-14T18:52:04.860762Z_c49454e8-2180-42d0-b247-6b31619ecd12
{
    "archived": "2024-05-14 19:09:40.190367",
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f6d122fcdb0]",
        "pthread_getname_np()",
        "(ceph::logging::Log::dump_recent()+0x5c5) [0x7f6d12c84bf5]",
        "(MDSDaemon::respawn()+0x15a) [0x55d7aeeb0c7a]",
        "ceph-mds(+0x143e5d) [0x55d7aee98e5d]",
        "(MDSRank::handle_write_error(int)+0x1af) [0x55d7aeece3ef]",
        "ceph-mds(+0x1a88e4) [0x55d7aeefd8e4]",
        "ceph-mds(+0x143e5d) [0x55d7aee98e5d]",
        "(Finisher::finisher_thread_entry()+0x175) [0x7f6d12a31145]",
        "/lib64/libc.so.6(+0x9f802) [0x7f6d12347802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f6d122e7450]"
    ],
    "ceph_version": "18.2.1-136.el9cp",
    "crash_id": "2024-05-14T18:52:04.860762Z_c49454e8-2180-42d0-b247-6b31619ecd12",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "238e13b05e2f9c033f80644a94b3324df6b15f78ec4772f7c900a997f8566e3f",
    "timestamp": "2024-05-14T18:52:04.860762Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd586658k6w8",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
}





bash-5.1$ ceph crash info 2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7ff74acafdb0]",
        "ceph-mds(+0x22d2ed) [0x557c354f42ed]",
        "ceph-mds(+0x5a7d02) [0x557c3586ed02]",
        "(EMetaBlob::fullbit::update_inode(MDSRank*, CInode*)+0x51) [0x557c35778fc1]",
        "(EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x79e) [0x557c3577fb6e]",
        "(EOpen::replay(MDSRank*)+0x55) [0x557c3578c6c5]",
        "(MDLog::_replay_thread()+0x75e) [0x557c356ea52e]",
        "ceph-mds(+0x16cf21) [0x557c35433f21]",
        "/lib64/libc.so.6(+0x9f802) [0x7ff74acfa802]",
        "/lib64/libc.so.6(+0x3f450) [0x7ff74ac9a450]"
    ],
    "ceph_version": "18.2.1-136.el9cp",
    "crash_id": "2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "534595eadbe3cd5e36a861179a9d229df6085a48ed4bf3ee7982825650a239f5",
    "timestamp": "2024-05-17T10:56:29.510669Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd58665vqjsp",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
}




bash-5.1$ ceph crash info 2024-05-17T20:34:57.807764Z_dd0521e6-e92f-40e2-afe9-3c3f1768df07
{
    "assert_condition": "(bool)_front == (bool)_size",
    "assert_file": "/builddir/build/BUILD/ceph-18.2.1/src/include/xlist.h",
    "assert_func": "size_t xlist<T>::size() const [with T = LRUObject*; size_t = long unsigned int]",
    "assert_line": 87,
    "assert_msg": "/builddir/build/BUILD/ceph-18.2.1/src/include/xlist.h: In function 'size_t xlist<T>::size() const [with T = LRUObject*; size_t = long unsigned int]' thread 7f3c3d3fc640 time 2024-05-17T20:34:57.806097+0000\n/builddir/build/BUILD/ceph-18.2.1/src/include/xlist.h: 87: FAILED ceph_assert((bool)_front == (bool)_size)\n",
    "assert_thread_name": "md_log_replay",
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7f3c49eccdb0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f3c49f1954c]",
        "raise()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7f3c4a549068]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x1631cc) [0x7f3c4a5491cc]",
        "ceph-mds(+0x1479fe) [0x56357b3779fe]",
        "(CDir::add_null_dentry(std::basic_string_view<char, std::char_traits<char> >, snapid_t, snapid_t)+0x29a) [0x56357b5a226a]",
        "(EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x1853) [0x56357b6e9c23]",
        "(EOpen::replay(MDSRank*)+0x55) [0x56357b6f56c5]",
        "(MDLog::_replay_thread()+0x75e) [0x56357b65352e]",
        "ceph-mds(+0x16cf21) [0x56357b39cf21]",
        "/lib64/libc.so.6(+0x9f802) [0x7f3c49f17802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f3c49eb7450]"
    ],
    "ceph_version": "18.2.1-136.el9cp",
    "crash_id": "2024-05-17T20:34:57.807764Z_dd0521e6-e92f-40e2-afe9-3c3f1768df07",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "ea1ea719ad2b0630e8fcf810ee54c1655e5bc626e9caddabb987b8b82f28bfea",
    "timestamp": "2024-05-17T20:34:57.807764Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
}



bash-5.1$ ceph crash info 2024-05-20T00:27:27.924980Z_8085a329-6afb-4b49-a76c-b86db9d94109
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54db0) [0x7fba580ebdb0]",
        "ceph-mds(+0x2e5548) [0x5617387fd548]",
        "(MDCache::finish_uncommitted_fragment(dirfrag_t, int)+0x8c) [0x5617387e8c5c]",
        "(EFragment::replay(MDSRank*)+0x26b) [0x5617389e374b]",
        "(MDLog::_replay_thread()+0x75e) [0x56173893b52e]",
        "ceph-mds(+0x16cf21) [0x561738684f21]",
        "/lib64/libc.so.6(+0x9f802) [0x7fba58136802]",
        "/lib64/libc.so.6(+0x3f450) [0x7fba580d6450]"
    ],
    "ceph_version": "18.2.1-136.el9cp",
    "crash_id": "2024-05-20T00:27:27.924980Z_8085a329-6afb-4b49-a76c-b86db9d94109",
    "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.3 (Plow)",
    "os_version_id": "9.3",
    "process_name": "ceph-mds",
    "stack_sig": "9146a54bbc96beab126447db7eb36673a320813b826ed34354414b4148a7d86c",
    "timestamp": "2024-05-20T00:27:27.924980Z",
    "utsname_hostname": "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
}



This covers all the crashes reported in the Description of problem above. Most of the crashes have similar crash info. so only unique ones are listed here.

Comment 4 Aman Agrawal 2024-05-22 12:40:01 UTC

(In reply to Aman Agrawal from comment #3)
> As per offline discussion with Venky:
> 
> Since different crash info is seen on different crashes, we don't know yet
> which crash is causing the failover to remain stuck. We will use this BZ to
> investigate further and open new BZ if needed based upon the findings.
> 
> 
> A few crash outputs for reference:
> 
> 
> bash-5.1$ ceph crash info
> 2024-05-14T18:52:04.860762Z_c49454e8-2180-42d0-b247-6b31619ecd12
> {
>     "archived": "2024-05-14 19:09:40.190367",
>     "backtrace": [
>         "/lib64/libc.so.6(+0x54db0) [0x7f6d122fcdb0]",
>         "pthread_getname_np()",
>         "(ceph::logging::Log::dump_recent()+0x5c5) [0x7f6d12c84bf5]",
>         "(MDSDaemon::respawn()+0x15a) [0x55d7aeeb0c7a]",
>         "ceph-mds(+0x143e5d) [0x55d7aee98e5d]",
>         "(MDSRank::handle_write_error(int)+0x1af) [0x55d7aeece3ef]",
>         "ceph-mds(+0x1a88e4) [0x55d7aeefd8e4]",
>         "ceph-mds(+0x143e5d) [0x55d7aee98e5d]",
>         "(Finisher::finisher_thread_entry()+0x175) [0x7f6d12a31145]",
>         "/lib64/libc.so.6(+0x9f802) [0x7f6d12347802]",
>         "/lib64/libc.so.6(+0x3f450) [0x7f6d122e7450]"
>     ],
>     "ceph_version": "18.2.1-136.el9cp",
>     "crash_id":
> "2024-05-14T18:52:04.860762Z_c49454e8-2180-42d0-b247-6b31619ecd12",
>     "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
>     "os_id": "rhel",
>     "os_name": "Red Hat Enterprise Linux",
>     "os_version": "9.3 (Plow)",
>     "os_version_id": "9.3",
>     "process_name": "ceph-mds",
>     "stack_sig":
> "238e13b05e2f9c033f80644a94b3324df6b15f78ec4772f7c900a997f8566e3f",
>     "timestamp": "2024-05-14T18:52:04.860762Z",
>     "utsname_hostname":
> "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd586658k6w8",
>     "utsname_machine": "x86_64",
>     "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
>     "utsname_sysname": "Linux",
>     "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
> }
> 
> 
> 
> 
> 
> bash-5.1$ ceph crash info
> 2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd
> {
>     "backtrace": [
>         "/lib64/libc.so.6(+0x54db0) [0x7ff74acafdb0]",
>         "ceph-mds(+0x22d2ed) [0x557c354f42ed]",
>         "ceph-mds(+0x5a7d02) [0x557c3586ed02]",
>         "(EMetaBlob::fullbit::update_inode(MDSRank*, CInode*)+0x51)
> [0x557c35778fc1]",
>         "(EMetaBlob::replay(MDSRank*, LogSegment*, int,
> MDPeerUpdate*)+0x79e) [0x557c3577fb6e]",
>         "(EOpen::replay(MDSRank*)+0x55) [0x557c3578c6c5]",
>         "(MDLog::_replay_thread()+0x75e) [0x557c356ea52e]",
>         "ceph-mds(+0x16cf21) [0x557c35433f21]",
>         "/lib64/libc.so.6(+0x9f802) [0x7ff74acfa802]",
>         "/lib64/libc.so.6(+0x3f450) [0x7ff74ac9a450]"
>     ],
>     "ceph_version": "18.2.1-136.el9cp",
>     "crash_id":
> "2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd",
>     "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
>     "os_id": "rhel",
>     "os_name": "Red Hat Enterprise Linux",
>     "os_version": "9.3 (Plow)",
>     "os_version_id": "9.3",
>     "process_name": "ceph-mds",
>     "stack_sig":
> "534595eadbe3cd5e36a861179a9d229df6085a48ed4bf3ee7982825650a239f5",
>     "timestamp": "2024-05-17T10:56:29.510669Z",
>     "utsname_hostname":
> "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd58665vqjsp",
>     "utsname_machine": "x86_64",
>     "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
>     "utsname_sysname": "Linux",
>     "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
> }
> 
> The above crash seems to be initially reported as part of BZ2218759 and there was an ask to repro it hence I added a comment here- https://bugzilla.redhat.com/show_bug.cgi?id=2218759#c74

All other crashes still need investigation.
> 
> 
> bash-5.1$ ceph crash info
> 2024-05-17T20:34:57.807764Z_dd0521e6-e92f-40e2-afe9-3c3f1768df07
> {
>     "assert_condition": "(bool)_front == (bool)_size",
>     "assert_file": "/builddir/build/BUILD/ceph-18.2.1/src/include/xlist.h",
>     "assert_func": "size_t xlist<T>::size() const [with T = LRUObject*;
> size_t = long unsigned int]",
>     "assert_line": 87,
>     "assert_msg": "/builddir/build/BUILD/ceph-18.2.1/src/include/xlist.h: In
> function 'size_t xlist<T>::size() const [with T = LRUObject*; size_t = long
> unsigned int]' thread 7f3c3d3fc640 time
> 2024-05-17T20:34:57.806097+0000\n/builddir/build/BUILD/ceph-18.2.1/src/
> include/xlist.h: 87: FAILED ceph_assert((bool)_front == (bool)_size)\n",
>     "assert_thread_name": "md_log_replay",
>     "backtrace": [
>         "/lib64/libc.so.6(+0x54db0) [0x7f3c49eccdb0]",
>         "/lib64/libc.so.6(+0xa154c) [0x7f3c49f1954c]",
>         "raise()",
>         "abort()",
>         "(ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x188) [0x7f3c4a549068]",
>         "/usr/lib64/ceph/libceph-common.so.2(+0x1631cc) [0x7f3c4a5491cc]",
>         "ceph-mds(+0x1479fe) [0x56357b3779fe]",
>         "(CDir::add_null_dentry(std::basic_string_view<char,
> std::char_traits<char> >, snapid_t, snapid_t)+0x29a) [0x56357b5a226a]",
>         "(EMetaBlob::replay(MDSRank*, LogSegment*, int,
> MDPeerUpdate*)+0x1853) [0x56357b6e9c23]",
>         "(EOpen::replay(MDSRank*)+0x55) [0x56357b6f56c5]",
>         "(MDLog::_replay_thread()+0x75e) [0x56357b65352e]",
>         "ceph-mds(+0x16cf21) [0x56357b39cf21]",
>         "/lib64/libc.so.6(+0x9f802) [0x7f3c49f17802]",
>         "/lib64/libc.so.6(+0x3f450) [0x7f3c49eb7450]"
>     ],
>     "ceph_version": "18.2.1-136.el9cp",
>     "crash_id":
> "2024-05-17T20:34:57.807764Z_dd0521e6-e92f-40e2-afe9-3c3f1768df07",
>     "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
>     "os_id": "rhel",
>     "os_name": "Red Hat Enterprise Linux",
>     "os_version": "9.3 (Plow)",
>     "os_version_id": "9.3",
>     "process_name": "ceph-mds",
>     "stack_sig":
> "ea1ea719ad2b0630e8fcf810ee54c1655e5bc626e9caddabb987b8b82f28bfea",
>     "timestamp": "2024-05-17T20:34:57.807764Z",
>     "utsname_hostname":
> "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c",
>     "utsname_machine": "x86_64",
>     "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
>     "utsname_sysname": "Linux",
>     "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
> }
> 
> 
> 
> bash-5.1$ ceph crash info
> 2024-05-20T00:27:27.924980Z_8085a329-6afb-4b49-a76c-b86db9d94109
> {
>     "backtrace": [
>         "/lib64/libc.so.6(+0x54db0) [0x7fba580ebdb0]",
>         "ceph-mds(+0x2e5548) [0x5617387fd548]",
>         "(MDCache::finish_uncommitted_fragment(dirfrag_t, int)+0x8c)
> [0x5617387e8c5c]",
>         "(EFragment::replay(MDSRank*)+0x26b) [0x5617389e374b]",
>         "(MDLog::_replay_thread()+0x75e) [0x56173893b52e]",
>         "ceph-mds(+0x16cf21) [0x561738684f21]",
>         "/lib64/libc.so.6(+0x9f802) [0x7fba58136802]",
>         "/lib64/libc.so.6(+0x3f450) [0x7fba580d6450]"
>     ],
>     "ceph_version": "18.2.1-136.el9cp",
>     "crash_id":
> "2024-05-20T00:27:27.924980Z_8085a329-6afb-4b49-a76c-b86db9d94109",
>     "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
>     "os_id": "rhel",
>     "os_name": "Red Hat Enterprise Linux",
>     "os_version": "9.3 (Plow)",
>     "os_version_id": "9.3",
>     "process_name": "ceph-mds",
>     "stack_sig":
> "9146a54bbc96beab126447db7eb36673a320813b826ed34354414b4148a7d86c",
>     "timestamp": "2024-05-20T00:27:27.924980Z",
>     "utsname_hostname":
> "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c",
>     "utsname_machine": "x86_64",
>     "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
>     "utsname_sysname": "Linux",
>     "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
> }
> 
> 
> 
> This covers all the crashes reported in the Description of problem above.
> Most of the crashes have similar crash info. so only unique ones are listed
> here.

Comment 5 Venky Shankar 2024-05-23 17:39:03 UTC

Hi Aman,

(In reply to Aman Agrawal from comment #3)
> As per offline discussion with Venky:
> 
> Since different crash info is seen on different crashes, we don't know yet
> which crash is causing the failover to remain stuck. We will use this BZ to
> investigate further and open new BZ if needed based upon the findings.
> 
> 
> A few crash outputs for reference:
> 
> 
> bash-5.1$ ceph crash info
> 2024-05-14T18:52:04.860762Z_c49454e8-2180-42d0-b247-6b31619ecd12
> {
>     "archived": "2024-05-14 19:09:40.190367",
>     "backtrace": [
>         "/lib64/libc.so.6(+0x54db0) [0x7f6d122fcdb0]",
>         "pthread_getname_np()",
>         "(ceph::logging::Log::dump_recent()+0x5c5) [0x7f6d12c84bf5]",
>         "(MDSDaemon::respawn()+0x15a) [0x55d7aeeb0c7a]",
>         "ceph-mds(+0x143e5d) [0x55d7aee98e5d]",
>         "(MDSRank::handle_write_error(int)+0x1af) [0x55d7aeece3ef]",
>         "ceph-mds(+0x1a88e4) [0x55d7aeefd8e4]",
>         "ceph-mds(+0x143e5d) [0x55d7aee98e5d]",
>         "(Finisher::finisher_thread_entry()+0x175) [0x7f6d12a31145]",
>         "/lib64/libc.so.6(+0x9f802) [0x7f6d12347802]",
>         "/lib64/libc.so.6(+0x3f450) [0x7f6d122e7450]"
>     ],
>     "ceph_version": "18.2.1-136.el9cp",
>     "crash_id":
> "2024-05-14T18:52:04.860762Z_c49454e8-2180-42d0-b247-6b31619ecd12",
>     "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
>     "os_id": "rhel",
>     "os_name": "Red Hat Enterprise Linux",
>     "os_version": "9.3 (Plow)",
>     "os_version_id": "9.3",
>     "process_name": "ceph-mds",
>     "stack_sig":
> "238e13b05e2f9c033f80644a94b3324df6b15f78ec4772f7c900a997f8566e3f",
>     "timestamp": "2024-05-14T18:52:04.860762Z",
>     "utsname_hostname":
> "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd586658k6w8",
>     "utsname_machine": "x86_64",
>     "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
>     "utsname_sysname": "Linux",
>     "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
> }
> 

We fixed a couple of bugs related to the above crash (backtrace). Do you have the MDS coredump?

> 
> 
> 
> 
> bash-5.1$ ceph crash info
> 2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd
> {
>     "backtrace": [
>         "/lib64/libc.so.6(+0x54db0) [0x7ff74acafdb0]",
>         "ceph-mds(+0x22d2ed) [0x557c354f42ed]",
>         "ceph-mds(+0x5a7d02) [0x557c3586ed02]",
>         "(EMetaBlob::fullbit::update_inode(MDSRank*, CInode*)+0x51)
> [0x557c35778fc1]",
>         "(EMetaBlob::replay(MDSRank*, LogSegment*, int,
> MDPeerUpdate*)+0x79e) [0x557c3577fb6e]",
>         "(EOpen::replay(MDSRank*)+0x55) [0x557c3578c6c5]",
>         "(MDLog::_replay_thread()+0x75e) [0x557c356ea52e]",
>         "ceph-mds(+0x16cf21) [0x557c35433f21]",
>         "/lib64/libc.so.6(+0x9f802) [0x7ff74acfa802]",
>         "/lib64/libc.so.6(+0x3f450) [0x7ff74ac9a450]"
>     ],
>     "ceph_version": "18.2.1-136.el9cp",
>     "crash_id":
> "2024-05-17T10:56:29.510669Z_e8ff0c2f-d956-4af8-a540-6cb64374e4fd",
>     "entity_name": "mds.ocs-storagecluster-cephfilesystem-b",
>     "os_id": "rhel",
>     "os_name": "Red Hat Enterprise Linux",
>     "os_version": "9.3 (Plow)",
>     "os_version_id": "9.3",
>     "process_name": "ceph-mds",
>     "stack_sig":
> "534595eadbe3cd5e36a861179a9d229df6085a48ed4bf3ee7982825650a239f5",
>     "timestamp": "2024-05-17T10:56:29.510669Z",
>     "utsname_hostname":
> "rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd58665vqjsp",
>     "utsname_machine": "x86_64",
>     "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
>     "utsname_sysname": "Linux",
>     "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
> }
> 
> 
> 
> 
> bash-5.1$ ceph crash info
> 2024-05-17T20:34:57.807764Z_dd0521e6-e92f-40e2-afe9-3c3f1768df07
> {
>     "assert_condition": "(bool)_front == (bool)_size",

This is (likely) a new crash - haven't seen this backtrace yet. Again, where can I find the MDS coredump?

>     "assert_file": "/builddir/build/BUILD/ceph-18.2.1/src/include/xlist.h",
>     "assert_func": "size_t xlist<T>::size() const [with T = LRUObject*;
> size_t = long unsigned int]",
>     "assert_line": 87,
>     "assert_msg": "/builddir/build/BUILD/ceph-18.2.1/src/include/xlist.h: In
> function 'size_t xlist<T>::size() const [with T = LRUObject*; size_t = long
> unsigned int]' thread 7f3c3d3fc640 time
> 2024-05-17T20:34:57.806097+0000\n/builddir/build/BUILD/ceph-18.2.1/src/
> include/xlist.h: 87: FAILED ceph_assert((bool)_front == (bool)_size)\n",
>     "assert_thread_name": "md_log_replay",
>     "backtrace": [
>         "/lib64/libc.so.6(+0x54db0) [0x7f3c49eccdb0]",
>         "/lib64/libc.so.6(+0xa154c) [0x7f3c49f1954c]",
>         "raise()",
>         "abort()",
>         "(ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x188) [0x7f3c4a549068]",
>         "/usr/lib64/ceph/libceph-common.so.2(+0x1631cc) [0x7f3c4a5491cc]",
>         "ceph-mds(+0x1479fe) [0x56357b3779fe]",
>         "(CDir::add_null_dentry(std::basic_string_view<char,
> std::char_traits<char> >, snapid_t, snapid_t)+0x29a) [0x56357b5a226a]",
>         "(EMetaBlob::replay(MDSRank*, LogSegment*, int,
> MDPeerUpdate*)+0x1853) [0x56357b6e9c23]",
>         "(EOpen::replay(MDSRank*)+0x55) [0x56357b6f56c5]",
>         "(MDLog::_replay_thread()+0x75e) [0x56357b65352e]",
>         "ceph-mds(+0x16cf21) [0x56357b39cf21]",
>         "/lib64/libc.so.6(+0x9f802) [0x7f3c49f17802]",
>         "/lib64/libc.so.6(+0x3f450) [0x7f3c49eb7450]"
>     ],
>     "ceph_version": "18.2.1-136.el9cp",
>     "crash_id":
> "2024-05-17T20:34:57.807764Z_dd0521e6-e92f-40e2-afe9-3c3f1768df07",
>     "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
>     "os_id": "rhel",
>     "os_name": "Red Hat Enterprise Linux",
>     "os_version": "9.3 (Plow)",
>     "os_version_id": "9.3",
>     "process_name": "ceph-mds",
>     "stack_sig":
> "ea1ea719ad2b0630e8fcf810ee54c1655e5bc626e9caddabb987b8b82f28bfea",
>     "timestamp": "2024-05-17T20:34:57.807764Z",
>     "utsname_hostname":
> "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c",
>     "utsname_machine": "x86_64",
>     "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
>     "utsname_sysname": "Linux",
>     "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
> }
> 
> 
> 
> bash-5.1$ ceph crash info
> 2024-05-20T00:27:27.924980Z_8085a329-6afb-4b49-a76c-b86db9d94109
> {
>     "backtrace": [
>         "/lib64/libc.so.6(+0x54db0) [0x7fba580ebdb0]",
>         "ceph-mds(+0x2e5548) [0x5617387fd548]",
>         "(MDCache::finish_uncommitted_fragment(dirfrag_t, int)+0x8c)
> [0x5617387e8c5c]",

This crash looks new.

>         "(EFragment::replay(MDSRank*)+0x26b) [0x5617389e374b]",
>         "(MDLog::_replay_thread()+0x75e) [0x56173893b52e]",
>         "ceph-mds(+0x16cf21) [0x561738684f21]",
>         "/lib64/libc.so.6(+0x9f802) [0x7fba58136802]",
>         "/lib64/libc.so.6(+0x3f450) [0x7fba580d6450]"
>     ],
>     "ceph_version": "18.2.1-136.el9cp",
>     "crash_id":
> "2024-05-20T00:27:27.924980Z_8085a329-6afb-4b49-a76c-b86db9d94109",
>     "entity_name": "mds.ocs-storagecluster-cephfilesystem-a",
>     "os_id": "rhel",
>     "os_name": "Red Hat Enterprise Linux",
>     "os_version": "9.3 (Plow)",
>     "os_version_id": "9.3",
>     "process_name": "ceph-mds",
>     "stack_sig":
> "9146a54bbc96beab126447db7eb36673a320813b826ed34354414b4148a7d86c",
>     "timestamp": "2024-05-20T00:27:27.924980Z",
>     "utsname_hostname":
> "rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c",
>     "utsname_machine": "x86_64",
>     "utsname_release": "5.14.0-427.13.1.el9_4.x86_64",
>     "utsname_sysname": "Linux",
>     "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024"
> }
> 

Wow - we are dealing with at least two new MDS crashes. The crash in the BZ description (got_journaled_ack) is a known issue which happens with standby-replay MDS and that's covered in https://tracker.ceph.com/issues/54741.

Aman, I assume the core dumps are still accessible since these crashes were seen sometime this week. Could you please find a way to share those with engineering.

Comment 6 Aman Agrawal 2024-05-23 17:57:58 UTC

The failover issue mentioned in this bug with discussed with Ramen team in today's RDR triage meeting and it was concluded that the stuck failover is due to https://bugzilla.redhat.com/show_bug.cgi?id=2283038 and should not be because of MDS crashes reported in this BZ.

However, MDS crash issue should still be prioritised and investigated. 

I am updating the bug title and details accordingly.

Venky, must-gather logs shared above should have the coredumps, please check and confirm- http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/20may24-mds-high-log-level/

Comment 7 Aman Agrawal 2024-05-23 18:06:17 UTC

We still have the setup in case logs are missing and we would need to re-visit it hence requesting an ack/nack on the logs provided earlier.

Comment 8 Aman Agrawal 2024-05-27 17:33:26 UTC

I tried collecting pstack for both mds-a and mds-b by making them active with log level 
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1

but I don't think much info. was collected. Refer- http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/27may24/

Pls note, only 1 daemon was available so I had to respin the pods and thereafter collect the logs.

Time- Somewhere close to Mon May 27 17:32:39 UTC 2024


bash-5.1$ ceph -s
  cluster:
    id:     119cc23a-0ffa-4ed8-ab81-bf8f48c88b8c
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            1 MDSs behind on trimming
            18 daemons have recently crashed

  services:
    mon:        3 daemons, quorum d,f,g (age 4h)
    mgr:        b(active, since 12d), standbys: a
    mds:        1/1 daemons up
    osd:        3 osds: 3 up (since 12d), 3 in (since 3w)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 524.65k objects, 37 GiB
    usage:   112 GiB used, 1.4 TiB / 1.5 TiB avail
    pgs:     169 active+clean

  io:
    client:   1.1 MiB/s rd, 33 MiB/s wr, 194 op/s rd, 26 op/s wr


pods|grep mds
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5bdf7cfdfzs7c   2/2     Running     1164 (4d13h ago)   12d    10.128.2.63    compute-2   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dd58665vqjsp   2/2     Running     1843 (10m ago)     12d    10.131.0.241   compute-1   <none>           <none>

Comment 9 Venky Shankar 2024-05-28 05:28:10 UTC

(In reply to Aman Agrawal from comment #6)
> The failover issue mentioned in this bug with discussed with Ramen team in
> today's RDR triage meeting and it was concluded that the stuck failover is
> due to https://bugzilla.redhat.com/show_bug.cgi?id=2283038 and should not be
> because of MDS crashes reported in this BZ.
> 
> However, MDS crash issue should still be prioritised and investigated. 
> 
> I am updating the bug title and details accordingly.
> 
> Venky, must-gather logs shared above should have the coredumps, please check
> and confirm-
> http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/20may24-mds-
> high-log-level/

Thank you. I will have a look.

Comment 10 Sunil Kumar Acharya 2024-06-07 15:32:47 UTC

(In reply to Venky Shankar from comment #9)
> (In reply to Aman Agrawal from comment #6)
> > The failover issue mentioned in this bug with discussed with Ramen team in
> > today's RDR triage meeting and it was concluded that the stuck failover is
> > due to https://bugzilla.redhat.com/show_bug.cgi?id=2283038 and should not be
> > because of MDS crashes reported in this BZ.
> > 
> > However, MDS crash issue should still be prioritised and investigated. 
> > 
> > I am updating the bug title and details accordingly.
> > 
> > Venky, must-gather logs shared above should have the coredumps, please check
> > and confirm-
> > http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/20may24-mds-
> > high-log-level/
> 
> Thank you. I will have a look.

Please share the latest status.