Bug 2016380

Summary: mds: crash when journaling during replay
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Venky Shankar <vshankar>
Component: CephFS
Assignee: Venky Shankar <vshankar>
Status: CLOSED ERRATA
QA Contact: Amarnath <amk>
Severity: medium
Docs Contact: Ranjini M N <rmandyam>
Priority: high
Version: 5.0
CC: agunn, ceph-eng-bugs, rmandyam, tserlin, vereddy
Target Milestone: ---
Target Release: 5.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ceph-16.2.6-19.el8cp
Doc Type: Bug Fix
Doc Text:
.The Ceph Metadata Server (MDS) no longer crashes after being promoted to an active rank

Previously, the Ceph MDS might crash in some circumstances after being promoted to an active rank and remain unavailable, resulting in downtime for clients accessing the system due to a failover. With this update, the MDS failover results in the file system being available after transitioning to an active rank.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-04-04 10:22:04 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2031073

Description Venky Shankar 2021-10-21 12:45:52 UTC
Description of problem:
A standby MDS transitioning to an active rank can crash during the transition phase (state) with the following backtrace:

  -1> 2021-07-08 15:14:13.283 7f3804255700 -1 /builddir/build/BUILD/ceph-14.2.20/src/mds/MDLog.cc: In function 'void MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)' thread 7f3804255700 time 2021-07-08 15:14:13.283719
/builddir/build/BUILD/ceph-14.2.20/src/mds/MDLog.cc: 288: FAILED ceph_assert(!segments.empty())

 ceph version 14.2.20 (36274af6eb7f2a5055f2d53ad448f2694e9046a0) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7f380d72cfe7]
 2: (()+0x25d1af) [0x7f380d72d1af]
 3: (MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)+0x599) [0x557471ec5959]
 4: (Server::journal_close_session(Session*, int, Context*)+0x9ed) [0x557471c7e02d]
 5: (Server::kill_session(Session*, Context*)+0x234) [0x557471c81914]
 6: (Server::apply_blacklist(std::set<entity_addr_t, std::less<entity_addr_t>, std::allocator<entity_addr_t> > const&)+0x14d) [0x557471c8449d]
 7: (MDSRank::reconnect_start()+0xcf) [0x557471c49c5f]
 8: (MDSRankDispatcher::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&, MDSMap const&)+0x1c29) [0x557471c57979]
 9: (MDSDaemon::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&)+0xa9b) [0x557471c3091b]
 10: (MDSDaemon::handle_core_message(boost::intrusive_ptr<Message const> const&)+0xed) [0x557471c3216d]
 11: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0xc3) [0x557471c32983]
 12: (DispatchQueue::entry()+0x1699) [0x7f380d952b79]
 13: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f380da008ed]
 14: (()+0x7ea5) [0x7f380b5eeea5]
 15: (clone()+0x6d) [0x7f380a29e96d]

(The above backtrace is from a Nautilus install; the bug still exists in other releases.)
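The failure mode, as the backtrace shows, is that MDSRank::reconnect_start() applies the blocklist and kills blocklisted client sessions, which tries to journal the session close via MDLog::_submit_entry() before the MDS log has any open segment, so ceph_assert(!segments.empty()) fires. The standalone sketch below is only an illustration of that ordering problem, not the actual Ceph code or the actual fix: the class and method names mirror the backtrace, but their bodies are simplified stand-ins.

```cpp
// Simplified, self-contained illustration of the crash path (not Ceph source).
// Journaling a session close before the first log segment exists trips the
// same kind of assertion seen in the backtrace.

#include <cassert>
#include <iostream>
#include <list>
#include <string>

struct LogEvent { std::string type; };
struct LogSegment { std::list<LogEvent> events; };

class MDLog {
  std::list<LogSegment> segments;   // journal segments; empty until the log is opened
public:
  void start_new_segment() { segments.emplace_back(); }

  // Mirrors the check that fails in the backtrace:
  // FAILED ceph_assert(!segments.empty())
  void _submit_entry(const LogEvent &e) {
    assert(!segments.empty());
    segments.back().events.push_back(e);
  }
};

class Server {
  MDLog &mdlog;
public:
  explicit Server(MDLog &log) : mdlog(log) {}

  // Journaling a session close requires an open log segment.
  void journal_close_session() { mdlog._submit_entry(LogEvent{"ESession(close)"}); }
  void kill_session() { journal_close_session(); }

  // Analogue of the apply_blacklist() call made from reconnect_start():
  // blocklisted clients get their sessions killed, which journals the close.
  void apply_blocklist() { kill_session(); }
};

int main() {
  MDLog mdlog;
  Server server(mdlog);

  // Crash path: applying the blocklist before any log segment exists
  // would abort on the assertion.
  // server.apply_blocklist();

  // Safe path: journal only once a segment is open, which is conceptually
  // what the fix guarantees during the transition to the active rank.
  mdlog.start_new_segment();
  server.apply_blocklist();
  std::cout << "session close journaled safely\n";
  return 0;
}
```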

Comment 10 Venky Shankar 2022-02-01 03:46:39 UTC
Clearing NI - doc text provided.

Comment 12 Venky Shankar 2022-02-21 04:55:44 UTC
minor reword - instead of

        "the Ceph MDS would crash after being promoted..."

change to:

        "the Ceph MDS might crash in some circumstances after being promoted..."

Comment 13 Venky Shankar 2022-02-21 07:09:43 UTC
Looks good!

Comment 15 errata-xmlrpc 2022-04-04 10:22:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.1 Security, Enhancement, and Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1174