Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
This project is now read‑only. Starting Monday, February 2, please use https://ibm-ceph.atlassian.net/ for all bug tracking management.

Bug 2431960

Summary: [8.1z Backport][GSS]ceph-mds crashed - mds-rank-fin
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vivek Maurya <vmaurya>
Component: CephFS
Assignee: Vivek Maurya <vmaurya>
Status: CLOSED UPSTREAM
QA Contact: sumr
Severity: high
Docs Contact:
Priority: urgent
Version: 8.1
CC: bkunal, ceph-eng-bugs, cephqe-warriors, ngangadh, syeshwan
Target Milestone: ---
Target Release: 8.1z5
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ceph-19.2.1-321.el9cp
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2431983
Environment:
Last Closed: 2026-03-05 07:25:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2431983
Bug Blocks:

Description Vivek Maurya 2026-01-22 05:14:57 UTC
Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:

The partner observed a crash of one Ceph MDS daemon.

The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):

Baremetal

The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):

Internal

The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):

OCP 4.18.22

ODF 4.18.11




Does this issue impact your ability to continue to work with the product?

There is no functional impact, but a ceph-mds core dump was observed, which leads to a Ceph health warning and a permanent "CephClusterWarningState" alarm in OCP.

Is there any workaround available to the best of your knowledge?

No

Can this issue be reproduced? If so, please provide the hit rate

So far we have not been able to reproduce it.

Can this issue be reproduced from the UI?

no

If this is a regression, please provide more details to justify this:

n/a

Steps to Reproduce:

1. Run the partner's robustness tests (cordon/drain and uncordon nodes, node restarts) on an MNO cluster.

2. The MDS crashed and produced a core dump.




The exact date and time when the issue was observed, including timezone details:

coredumpctl list 
TIME                           PID UID GID SIG     COREFILE     EXE               SIZE 
Tue 2025-11-11 13:21:14 UTC 843353 167 167 SIGABRT inaccessible /usr/bin/ceph-mds    - 
[core@master1 ~]$ sudo coredumpctl dump 843353 
            PID: 843353 (ceph-mds) 
            UID: 167 (167) 
            GID: 167 (167) 
         Signal: 6 (ABRT) 
      Timestamp: Tue 2025-11-11 13:21:12 UTC (1h 18min ago) 
   Command Line: ceph-mds --fsid=caf6f764-620e-484a-be00-cc79ac74b231 '--mon-host=[v2:172.22.175.211:3300],[v2:172.22.197.232:3300],[v2:172.22.156.49:3300]' --mon-initial-members=b,a,c --id=ocs-storagecluster-cephfilesystem-b --setuser=ceph --setgroup=ceph --ms-bind-ipv4=true --ms-bind-ipv6=false --foreground --public-addr=172.21.2.20 
     Executable: /usr/bin/ceph-mds 
  Control Group: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podedc65d07_fc2b_4d10_8609_16f9bc02c64a.slice/crio-d63ad5eea8cfc5262805208871fc54fb94d802d7ee1e587c941ae24d6277b29d.scope 
           Unit: crio-d63ad5eea8cfc5262805208871fc54fb94d802d7ee1e587c941ae24d6277b29d.scope 
          Slice: kubepods-burstable-podedc65d07_fc2b_4d10_8609_16f9bc02c64a.slice 
        Boot ID: 4df85775901f4e209277566d72529439 
     Machine ID: b44a55e05e8d43ddb0445fe3b09b56e8 
       Hostname: rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-65fd95d6q6q5p 
        Storage: /var/lib/systemd/coredump/core.ceph-mds.167.4df85775901f4e209277566d72529439.843353.1762867272000000.zst (present) 
   Size on Disk: 7.1M 
        Message: Process 843353 (ceph-mds) of user 167 dumped core. 
                Stack trace of thread 5749: 
                 #0  0x00007f05ec750edc n/a (/usr/lib64/libc.so.6 + 0x8bedc) 
                 ELF object binary architecture: AMD x86-64 
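
The core file name in the Storage line above appears to encode the same facts as the other coredumpctl fields. The following is a minimal Python sketch that splits it apart, assuming the layout is core.<comm>.<uid>.<boot id>.<pid>.<timestamp in microseconds>, optionally followed by a compression suffix. That layout is inferred from this report (UID 167, the Boot ID, and PID 843353 all line up); the function name parse_core_name is hypothetical, and the sketch does not handle a command name that itself contains dots.

```python
from datetime import datetime, timezone

# Compression suffixes that may trail the name (assumption; only .zst is seen in this report).
COMPRESSION_SUFFIXES = {"zst", "xz", "lz4"}


def parse_core_name(name):
    """Split a core file name of the assumed form
    core.<comm>.<uid>.<boot_id>.<pid>.<timestamp_usec>[.<suffix>]."""
    parts = name.split(".")
    if parts[-1] in COMPRESSION_SUFFIXES:
        parts = parts[:-1]
    if len(parts) != 6 or parts[0] != "core":
        raise ValueError(f"unexpected core file name: {name!r}")
    _, comm, uid, boot_id, pid, ts_usec = parts
    # The last field is treated as microseconds since the Unix epoch.
    seconds, micros = divmod(int(ts_usec), 1_000_000)
    when = datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=micros)
    return {
        "comm": comm,
        "uid": int(uid),
        "boot_id": boot_id,
        "pid": int(pid),
        "time": when,
    }


if __name__ == "__main__":
    info = parse_core_name(
        "core.ceph-mds.167.4df85775901f4e209277566d72529439.843353.1762867272000000.zst"
    )
    print(info["comm"], info["pid"], info["time"].isoformat())
```

Under that assumption, the decoded time is 2025-11-11 13:21:12 UTC, which agrees with the Timestamp field reported by coredumpctl dump.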





Actual results:

ceph-mds crashed

Expected results:

No ceph-mds crash

Logs collected and log location:

The OCP must-gather and the core file were provided, but not the ODF must-gather.

Additional info:

Analysis of the core file:

podman run -v /root/work/odf:/cores:Z -it --entrypoint /bin/bash registry.redhat.io/rhceph/rhceph-8-rhel9@sha256:bcfa03b645e5a5a4dc5350afd65499ff3a6d3a052d1be96ca17870d0f25c4107
[root@619b69bdadbc /]# 
[root@619b69bdadbc /]# microdnf -y \
  --enablerepo='rhel-9-for-x86_64-baseos-debug-rpms' \
  --enablerepo='rhel-9-for-x86_64-appstream-debug-rpms' \
  --enablerepo='rhceph-8-tools-for-rhel-9-x86_64-debug-rpms' \
  install gdb ceph-mds-debuginfo-19.2.1-245.el9cp.x86_64 librados2-debuginfo-19.2.1-245.el9cp.x86_64 glibc-debuginfo-2.34-168.el9_6.23.x86_64 libstdc++-debuginfo-11.5.0-5.el9_5.x86_64 libgcc-debuginfo-11.5.0-5.el9_5.x86_64

[root@619b69bdadbc cores]# gdb /usr/bin/ceph-mds core.ceph-mds.167.4df85775901f4e209277566d72529439.843353.1762867272000000
GNU gdb (Red Hat Enterprise Linux) 16.3-2.el9
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/ceph-mds...
Reading symbols from /usr/lib/debug/usr/bin/ceph-mds-19.2.1-245.el9cp.x86_64.debug...
[New LWP 5749]
[New LWP 5723]
[New LWP 5725]
[New LWP 3]
[New LWP 5727]
[New LWP 5726]
[New LWP 5731]
[New LWP 5735]
[New LWP 5734]
[New LWP 5733]
[New LWP 5732]
[New LWP 5737]
[New LWP 5738]
[New LWP 5739]
[New LWP 5741]
[New LWP 5743]
[New LWP 5742]
[New LWP 5746]
[New LWP 5747]
[New LWP 5745]
[New LWP 5748]
[New LWP 5744]
[New LWP 5736]
[New LWP 5751]
[New LWP 10487]
[New LWP 10488]
[New LWP 10490]
[New LWP 10489]
[New LWP 5740]
warning: could not find '.gnu_debugaltlink' file for /usr/lib/debug/usr/lib64/libstdc++.so.6.0.29-11.5.0-5.el9_5.x86_64.debug
warning: could not find '.gnu_debugaltlink' file for /usr/lib/debug/usr/lib64/libgcc_s-11-20240719.so.1-11.5.0-5.el9_5.x86_64.debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/proc/self/exe --fsid=caf6f764-620e-484a-be00-cc79ac74b231 --keyring=/etc/ceph/keyring-store/keyring --default-log-to-stderr=true --default-err-to-stderr=true --default-mon-cluster-log-to-stderr=true --default-log-stderr-prefix=debug\  --default-log-to-file=false --default-mon-cluster-log-to-file=false --mon-host=\[v2:172.22.175.211:3300\],\[v2:172.22.197.232:3300\],\[v2:172.22.156.49:3300\] --mon-initial-members=b,a,c --id=ocs-storagecluster-cephfilesystem-b --setuser=ceph --setgroup=ceph --ms-bind-ipv4=true --ms-bind-ipv6=false --foreground --public-addr=172.21.2.20'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
44          return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
[Current thread is 1 (Thread 0x7f05e0b2a640 (LWP 5749))]
Missing rpms, try: dnf --enablerepo='*debug*' install gperftools-libs-debuginfo-2.9.1-5.el9cp.x86_64 lua-libs-debuginfo-5.4.4-4.el9.x86_64 libunwind-debuginfo-1.6.2-2.el9cp.x86_64 libblkid-debuginfo-2.37.4-21.el9.x86_64 openssl-libs-debuginfo-3.2.2-6.el9_5.1.x86_64 systemd-libs-debuginfo-252-51.el9_6.1.x86_64 libibverbs-debuginfo-54.0-1.el9.x86_64 librdmacm-debuginfo-54.0-1.el9.x86_64 zlib-debuginfo-1.2.11-40.el9.x86_64 libcurl-minimal-debuginfo-7.76.1-31.el9_6.1.x86_64 thrift-debuginfo-0.15.0-3.el9cp.x86_64 libnl3-debuginfo-3.11.0-1.el9.x86_64 libnghttp2-debuginfo-1.43.0-6.el9.x86_64 krb5-libs-debuginfo-1.21.1-8.el9_6.x86_64 libcom_err-debuginfo-1.46.5-7.el9.x86_64 keyutils-libs-debuginfo-1.6.3-1.el9.x86_64 libselinux-debuginfo-3.6-3.el9.x86_64 pcre2-debuginfo-10.40-6.el9.x86_64 lttng-ust-debuginfo-2.12.0-6.el9.x86_64 numactl-libs-debuginfo-2.0.19-1.el9.x86_64 userspace-rcu-debuginfo-0.12.1-6.el9.x86_64
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007f05ec750f43 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007f05ec703b46 in __GI_raise (sig=6) at ../sysdeps/posix/raise.c:26
#3  0x0000560213db6e47 in reraise_fatal (signum=6) at /usr/src/debug/ceph-19.2.1-245.el9cp.x86_64/src/global/signal_handler.cc:88
#4  handle_oneshot_fatal_signal (signum=6) at /usr/src/debug/ceph-19.2.1-245.el9cp.x86_64/src/global/signal_handler.cc:367
#5  <signal handler called>
#6  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#7  0x00007f05ec750f43 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#8  0x00007f05ec703b46 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#9  0x00007f05ec6ed833 in __GI_abort () at abort.c:79
#10 0x00007f05ec6ed75b in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x560213e86ba7 "px != 0", 
    file=file@entry=0x560213e86d50 "/builddir/build/BUILD/ceph-19.2.1/redhat-linux-build/boost/include/boost/smart_ptr/intrusive_ptr.hpp", line=line@entry=201, 
    function=function@entry=0x560213e90748 "T* boost::intrusive_ptr<T>::operator->() const [with T = MDRequestImpl]") at assert.c:123
#11 0x00007f05ec6fc886 in __assert_fail (assertion=0x560213e86ba7 "px != 0", file=0x560213e86d50 "/builddir/build/BUILD/ceph-19.2.1/redhat-linux-build/boost/include/boost/smart_ptr/intrusive_ptr.hpp", line=201, 
    function=0x560213e90748 "T* boost::intrusive_ptr<T>::operator->() const [with T = MDRequestImpl]") at assert.c:132
#12 0x0000560213a6a469 in boost::intrusive_ptr<MDRequestImpl>::operator->() const [clone .part.0] [clone .lto_priv.0] (this=<optimized out>)
    at /usr/src/debug/ceph-19.2.1-245.el9cp.x86_64/redhat-linux-build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:201
#13 0x0000560213bc8dd7 in boost::intrusive_ptr<MDRequestImpl>::operator-> (this=<optimized out>, this=<optimized out>) at /usr/src/debug/ceph-19.2.1-245.el9cp.x86_64/src/mds/MDCache.cc:13045
#14 C_MDS_RetryRequest::finish (this=<optimized out>, r=<optimized out>) at /usr/src/debug/ceph-19.2.1-245.el9cp.x86_64/src/mds/MDCache.cc:13046
#15 0x0000560213d0436c in Context::complete (r=0, this=0x560219521280) at /usr/src/debug/ceph-19.2.1-245.el9cp.x86_64/src/include/Context.h:99
#16 MDSContext::complete (this=0x560219521280, r=0) at /usr/src/debug/ceph-19.2.1-245.el9cp.x86_64/src/mds/MDSContext.cc:30
#17 0x00007f05ece7907d in Finisher::finisher_thread_entry (this=0x560215347180) at /usr/src/debug/ceph-19.2.1-245.el9cp.x86_64/src/common/Finisher.cc:72
#18 0x00007f05ec74f19a in start_thread (arg=<optimized out>) at pthread_create.c:443
#19 0x00007f05ec7d4240 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) bt full
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
        tid = <optimized out>
        ret = 0
        pd = <optimized out>
        old_mask = {__val = {16, 139659719999488, 139663221338176, 11011663608147867648, 94566923070112, 0, 139663221325824, 0, 0, 11011663608147867648, 18446744073709551615, 0, 94566962267144, 0, 4294967295, 
            139663422457464}}
        ret = <optimized out>
#1  0x00007f05ec750f43 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
No locals.
#2  0x00007f05ec703b46 in __GI_raise (sig=6) at ../sysdeps/posix/raise.c:26
        ret = <optimized out>
#3  0x0000560213db6e47 in reraise_fatal (signum=6) at /usr/src/debug/ceph-19.2.1-245.el9cp.x86_64/src/global/signal_handler.cc:88
        ret = <optimized out>
        buf = "/var/lib/ceph/crash/2025-11-11T13:21:11.977692Z_f6e15839-ce0a-4161-9642-833dc63c2215/log", '\000' <repeats 935 times>
#4  handle_oneshot_fatal_signal (signum=6) at /usr/src/debug/ceph-19.2.1-245.el9cp.x86_64/src/global/signal_handler.cc:367
        buf = "*** Caught signal (Aborted) **\n in thread 7f05e0b2a640 thread_name:mds-rank-fin\n", '\000' <repeats 943 times>
        pthread_name = "mds-rank-fin\000\000\000"
        r = <optimized out>
        bt = {<ceph::BackTrace> = {_vptr.BackTrace = 0x560213fea720 <vtable for ceph::ClibBackTrace+16>}, static max = 32, skip = 1, array = {0x560213db6ba9 <handle_oneshot_fatal_signal(int)+265>, 
            0x7f05ec703bf0 <__restore_rt>, 0x7f05ec750edc <__pthread_kill_implementation+284>, 0x7f05ec703b46 <__GI_raise+22>, 0x7f05ec6ed833 <__GI_abort+211>, 0x7f05ec6ed75b <_nl_load_domain.cold>, 0x7f05ec6fc886, 
            0x560213a6a469, 0x560213bc8dd7, 0x560213d0436c <MDSContext::complete(int)+92>, 0x7f05ece7907d <Finisher::finisher_thread_entry()+381>, 0x7f05ec74f19a <start_thread+794>, 0x7f05ec7d4240 <clone3+48>, 
            0x0 <repeats 19 times>}, size = 13, strings = 0x560218281800}
        oss = <error reading variable oss (could not read '.gnu_debugaltlink' section)>
        crash_base = "/var/lib/ceph/crash/2025-11-11T13:21:11.977692Z_f6e15839-ce0a-4161-9642-833dc63c2215", '\000' <repeats 4011 times>
        handler_tid = {<std::__atomic_base<int>> = {static _S_alignment = 4, _M_i = 5749}, static is_always_lock_free = true}
        NULL_TID = <optimized out>
#5  <signal handler called>
No locals.
#6  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
        tid = <optimized out>
        ret = 0
        pd = <optimized out>
        old_mask = {__val = {94566945678892, 94566945678592, 94566945678892, 0, 0, 0, 0, 0, 549755813888, 11011663608147867648, 94566962036736, 11011663608147867648, 139663420159648, 139663221355984, 
            94566923962184, 139663420159648}}
        ret = <optimized out>
#7  0x00007f05ec750f43 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
No locals.
#8  0x00007f05ec703b46 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
        ret = <optimized out>
#9  0x00007f05ec6ed833 in __GI_abort () at abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x560213e86d50, sa_sigaction = 0x560213e86d50}, sa_mask = {__val = {201, 94566945678592, 11543773184, 94566945678592, 320, 26587183336057, 94566945690096, 0, 
              94566965104864, 94566960089672, 139663426327871, 0, 15000000000, 26587, 139663419895028, 139663221356544}}, sa_flags = -326371680, sa_restorer = 0x7f05ec8bb5e0 <__GI__IO_file_jumps>}
        sigs = {__val = {32, 139663419888024, 139663221368384, 139663418564981, 94566923922343, 201, 139663419888024, 139663418429338, 206158430256, 139663419907256, 94566923922768, 139663418565274, 206158430232, 
            139663221356288, 139663221356096, 11011663608147867648}}
#10 0x00007f05ec6ed75b in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x560213e86ba7 "px != 0", 
    file=file@entry=0x560213e86d50 "/builddir/build/BUILD/ceph-19.2.1/redhat-linux-build/boost/include/boost/smart_ptr/intrusive_ptr.hpp", line=line@entry=201, 
    function=function@entry=0x560213e90748 "T* boost::intrusive_ptr<T>::operator->() const [with T = MDRequestImpl]") at assert.c:123
        str = 0x560215346500 ""
        total = 4096
#11 0x00007f05ec6fc886 in __assert_fail (assertion=0x560213e86ba7 "px != 0", file=0x560213e86d50 "/builddir/build/BUILD/ceph-19.2.1/redhat-linux-build/boost/include/boost/smart_ptr/intrusive_ptr.hpp", line=201, 
    function=0x560213e90748 "T* boost::intrusive_ptr<T>::operator->() const [with T = MDRequestImpl]") at assert.c:132
No locals.
#12 0x0000560213a6a469 in boost::intrusive_ptr<MDRequestImpl>::operator->() const [clone .part.0] [clone .lto_priv.0] (this=<optimized out>)
    at /usr/src/debug/ceph-19.2.1-245.el9cp.x86_64/redhat-linux-build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:201
        __PRETTY_FUNCTION__ = <optimized out>
#13 0x0000560213bc8dd7 in boost::intrusive_ptr<MDRequestImpl>::operator-> (this=<optimized out>, this=<optimized out>) at /usr/src/debug/ceph-19.2.1-245.el9cp.x86_64/src/mds/MDCache.cc:13045
        __PRETTY_FUNCTION__ = <optimized out>
#14 C_MDS_RetryRequest::finish (this=<optimized out>, r=<optimized out>) at /usr/src/debug/ceph-19.2.1-245.el9cp.x86_64/src/mds/MDCache.cc:13046
No locals.
#15 0x0000560213d0436c in Context::complete (r=0, this=0x560219521280) at /usr/src/debug/ceph-19.2.1-245.el9cp.x86_64/src/include/Context.h:99
No locals.
#16 MDSContext::complete (this=0x560219521280, r=0) at /usr/src/debug/ceph-19.2.1-245.el9cp.x86_64/src/mds/MDSContext.cc:30
        mds = 0x560215479608
        assert_data_ctx = {assertion = 0x560213e881ef "mds != nullptr", file = 0x560213ed88f0 "/builddir/build/BUILD/ceph-19.2.1/src/mds/MDSContext.cc", line = 26, 
          function = 0x560213ed8ab8 "virtual void MDSContext::complete(int)"}
        __PRETTY_FUNCTION__ = <optimized out>
        assert_data_ctx = <optimized out>
#17 0x00007f05ece7907d in Finisher::finisher_thread_entry (this=0x560215347180) at /usr/src/debug/ceph-19.2.1-245.el9cp.x86_64/src/common/Finisher.cc:72
        p = <optimized out>
        __for_range = <optimized out>
        __for_begin = <optimized out>
        __for_end = <optimized out>
        ul = {_M_device = 0x560215347188, _M_owns = false}
        start = {tv = {tv_sec = 1762867271, tv_nsec = 977027524}}
        count = 1
#18 0x00007f05ec74f19a in start_thread (arg=<optimized out>) at pthread_create.c:443
        ret = <optimized out>
        pd = <optimized out>
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139663330459280, -8442301368976500714, 139663221368384, 0, 139663418650240, 0, 8420933910906312726, 8420943277764941846}, mask_was_saved = 0}}, priv = {pad = {
              0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#19 0x00007f05ec7d4240 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
No locals. 
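
For triage, it can help to pull out only the frames that originate in the MDS sources. The following is a minimal Python sketch that does this for a saved gdb backtrace. The function name frames_under and the embedded SAMPLE are illustrative; the sample lines are copied from the backtrace above. The pattern only handles frames that gdb printed on a single line, so wrapped frames such as #10 to #12 would need to be joined first.

```python
import re

# Matches single-line gdb frames such as:
#   #9  0x00007f05ec6ed833 in __GI_abort () at abort.c:79
#   #14 C_MDS_RetryRequest::finish (this=<optimized out>, r=<optimized out>) at .../MDCache.cc:13046
FRAME_RE = re.compile(
    r"^#(\d+)\s+(?:0x[0-9a-f]+ in )?(\S+) \(.*\) at (\S+):(\d+)$"
)

# Excerpt of the backtrace in this report (frames 9, 13, 14 and 15).
SAMPLE = """\
#9  0x00007f05ec6ed833 in __GI_abort () at abort.c:79
#13 0x0000560213bc8dd7 in boost::intrusive_ptr<MDRequestImpl>::operator-> (this=<optimized out>, this=<optimized out>) at /usr/src/debug/ceph-19.2.1-245.el9cp.x86_64/src/mds/MDCache.cc:13045
#14 C_MDS_RetryRequest::finish (this=<optimized out>, r=<optimized out>) at /usr/src/debug/ceph-19.2.1-245.el9cp.x86_64/src/mds/MDCache.cc:13046
#15 0x0000560213d0436c in Context::complete (r=0, this=0x560219521280) at /usr/src/debug/ceph-19.2.1-245.el9cp.x86_64/src/include/Context.h:99
"""


def frames_under(bt_text, path_fragment):
    """Return (frame number, function, file, line) for every single-line
    frame whose source path contains path_fragment."""
    found = []
    for raw in bt_text.splitlines():
        match = FRAME_RE.match(raw.strip())
        if match and path_fragment in match.group(3):
            found.append(
                (int(match.group(1)), match.group(2), match.group(3), int(match.group(4)))
            )
    return found


if __name__ == "__main__":
    for number, function, path, line in frames_under(SAMPLE, "/src/mds/"):
        print(f"#{number} {function} ({path}:{line})")
```

On the excerpt, this selects frames #13 and #14, that is, the inlined boost::intrusive_ptr<MDRequestImpl>::operator-> call and C_MDS_RetryRequest::finish in MDCache.cc, which is where the "px != 0" assertion was reached.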

 
Similar known issues 

https://tracker.ceph.com/issues/70624 

https://tracker.ceph.com/issues/69695 

https://tracker.ceph.com/issues/70770 




We need help understanding whether this is a known issue related to the trackers listed above, or something new.

Comment 1 Storage PM bot 2026-01-22 05:15:07 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 8 Red Hat Bugzilla 2026-03-05 07:25:55 UTC
This product has been discontinued or is no longer tracked in Red Hat Bugzilla.