Bug 2182564 - mds: force replay sessionmap version
Summary: mds: force replay sessionmap version
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 6.0
Hardware: Unspecified
OS: Unspecified
unspecified
medium
Target Milestone: ---
: 6.1
Assignee: Xiubo Li
QA Contact: Hemanth Kumar
Akash Raj
URL:
Whiteboard:
Depends On:
Blocks: 2182566 2192813
TreeView+ depends on / blocked
 
Reported: 2023-03-29 03:18 UTC by Xiubo Li
Modified: 2023-06-15 09:16 UTC (History)
6 users (show)

Fixed In Version: ceph-17.2.6-11.el9cp
Doc Type: Bug Fix
Doc Text:
.MDS daemons no longer crash due to sessionmap version mismatch issue Previously, MDS sessionmap journal log would not correctly persist when MDS failover occurred. Due to this, when a new MDS was trying to replay the journal logs, the sessionmap journal logs would mismatch with the information in the MDCache or the information from other journal logs, causing the MDS daemons to trigger an assert to crash themselves. With this fix, trying to force replay the sessionmap version instead of crashing the MDS daemons results in no MDS daemon crashes due to sessionmap version mismatch issue.
Clone Of:
: 2182566 (view as bug list)
Environment:
Last Closed: 2023-06-15 09:16:51 UTC
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 58489 0 None None None 2023-03-29 03:18:57 UTC
Red Hat Issue Tracker RHCEPH-6339 0 None None None 2023-03-29 03:20:11 UTC

Description Xiubo Li 2023-03-29 03:18:58 UTC
The issue is reported by upstream community user.

The cluster had two filesystems and the active mds of both the filesystems were stuck in 'up:replay'.
This was the case for around 2 days. Later, one of the active mds (stuck in up:replay) state crashed
with below stack trace.

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc:
In function 'void EMetaBlob::replay(MDSRank*, LogSegment*,
MDPeerUpdate*)' thread 7fccc7153700 time 2023-01-17T10:05:15.420191+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc:
1625: FAILED ceph_assert(g_conf()->mds_wipe_sessions)

  ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy
(stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x135) [0x7fccd759943f]
  2: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605]
  3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c)
[0x55fb2b98e89c]
  4: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0]
  5: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443]
  6: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31]
  7: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca]
  8: clone()

The ceph tracker is https://tracker.ceph.com/issues/58489.

Comment 11 errata-xmlrpc 2023-06-15 09:16:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3623


Note You need to log in before you can comment on or make changes to this bug.