+++ This bug was initially created as a clone of Bug #2182564 +++

The issue was reported by an upstream community user. The cluster had two filesystems, and the active MDS of both filesystems was stuck in 'up:replay'. This was the case for around two days. Later, one of the active MDS daemons (stuck in the up:replay state) crashed with the stack trace below.

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc: In function 'void EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)' thread 7fccc7153700 time 2023-01-17T10:05:15.420191+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc: 1625: FAILED ceph_assert(g_conf()->mds_wipe_sessions)

 ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7fccd759943f]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605]
 3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c) [0x55fb2b98e89c]
 4: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0]
 5: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443]
 6: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31]
 7: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca]
 8: clone()

The upstream Ceph tracker is https://tracker.ceph.com/issues/58489.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 5.3 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3259