Created attachment 1973454
stack dump in the mds log.

Description of problem:
A v5.3z3 cluster has seen many MDS issues. The most recent one is a failed ceph_assert in Locker.cc.

2023-06-16T09:16:13.555+0000 7f926e8b1700 -1 /builddir/build/BUILD/ceph-16.2.10/src/mds/Locker.cc: In function 'int Locker::issue_caps(CInode*, Capability*)' thread 7f926e8b1700 time 2023-06-16T09:16:13.554458+0000
/builddir/build/BUILD/ceph-16.2.10/src/mds/Locker.cc: 2357: FAILED ceph_assert(!cap->is_new())

 ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f92774fb7b8]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x2799d2) [0x7f92774fb9d2]
 3: (Locker::issue_caps(CInode*, Capability*)+0x1682) [0x5592893d0ca2]   <- int Locker::issue_caps(CInode *in, Capability *only_cap)
 4: (Locker::simple_sync(SimpleLock*, bool*)+0x4bf) [0x5592893d98ef]
 5: (Locker::_rdlock_kick(SimpleLock*, bool)+0x22f) [0x5592893ef35f]
 6: (Locker::rdlock_start(SimpleLock*, boost::intrusive_ptr<MDRequestImpl>&, bool)+0xdf) [0x5592893efe5f]
 7: (Locker::acquire_locks(boost::intrusive_ptr<MDRequestImpl>&, MutationImpl::LockOpVec&, CInode*, bool)+0x28d6) [0x5592893f2d26]
 8: (Server::handle_client_getattr(boost::intrusive_ptr<MDRequestImpl>&, bool)+0x329) [0x559289260269]
 9: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0x677) [0x5592892abe57]
 10: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x403) [0x5592892ace53]
 11: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x12b) [0x5592892b165b]
 12: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0xbb4) [0x5592892069f4]
 13: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7bb) [0x5592892093ab]
 14: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x55) [0x5592892099a5]
 15: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x5592891f9598]
 16: (DispatchQueue::entry()+0x126a) [0x7f927774435a]
 17: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f92777f77b1]
 18: /lib64/libpthread.so.0(+0x81ca) [0x7f92764da1ca]
 19: clone()

2023-06-16T09:16:13.557+0000 7f926e8b1700 -1 *** Caught signal (Aborted) **
 in thread 7f926e8b1700 thread_name:ms_dispatch

 ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f92764e4cf0]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f92774fb809]
 5: /usr/lib64/ceph/libceph-common.so.2(+0x2799d2) [0x7f92774fb9d2]
 6: (Locker::issue_caps(CInode*, Capability*)+0x1682) [0x5592893d0ca2]
 7: (Locker::simple_sync(SimpleLock*, bool*)+0x4bf) [0x5592893d98ef]
 8: (Locker::_rdlock_kick(SimpleLock*, bool)+0x22f) [0x5592893ef35f]
 9: (Locker::rdlock_start(SimpleLock*, boost::intrusive_ptr<MDRequestImpl>&, bool)+0xdf) [0x5592893efe5f]
 10: (Locker::acquire_locks(boost::intrusive_ptr<MDRequestImpl>&, MutationImpl::LockOpVec&, CInode*, bool)+0x28d6) [0x5592893f2d26]
 11: (Server::handle_client_getattr(boost::intrusive_ptr<MDRequestImpl>&, bool)+0x329) [0x559289260269]
 12: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0x677) [0x5592892abe57]
 13: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x403) [0x5592892ace53]
 14: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x12b) [0x5592892b165b]
 15: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0xbb4) [0x5592892069f4]
 16: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7bb) [0x5592892093ab]
 17: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x55) [0x5592892099a5]
 18: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x5592891f9598]
 19: (DispatchQueue::entry()+0x126a) [0x7f927774435a]
 20: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f92777f77b1]
 21: /lib64/libpthread.so.0(+0x81ca) [0x7f92764da1ca]
 22: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Version-Release number of selected component (if applicable):
16.2.10-172.el8cp
Storage <- 5.3.z3 - 5.3.3

How reproducible:
Has not been reproduced.

Steps to Reproduce:
1.
2.
3.

Actual results:
Abort

Expected results:
continue

Additional info:

--- Locker.cc

int Locker::issue_caps(CInode *in, Capability *only_cap)
{
  // count conflicts with
  int nissued = 0;
  int all_allowed = -1, loner_allowed = -1, xlocker_allowed = -1;

  ceph_assert(in->is_head());

  // client caps
.
.
.
    if (!(pending & ~allowed)) {
      // skip if suppress or new, and not revocation
      if (cap->is_new() || cap->is_suppress() || cap->is_stale()) {
        dout(20) << "  !revoke and new|suppressed|stale, skipping client." << it->first << dendl;
        continue;
      }
    } else {
      ceph_assert(!cap->is_new());    <- ptr from dump_stack.
      if (cap->is_stale()) {
        dout(20) << "  revoke stale cap from client." << it->first << dendl;
        ceph_assert(!cap->is_valid());
        cap->issue(allowed & pending, false);
        mds->queue_waiter_front(new C_Locker_RevokeStaleCap(this, in, it->first));
        continue;
      }

I will attach the ceph-mds log.
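To spell out the condition the assert guards, here is a minimal standalone sketch of the branch structure in the Locker.cc excerpt above. It is not the Ceph code: `CapState`, `Branch` and `classify` are hypothetical names used only for illustration, plain fields stand in for the `Capability` methods, and anything past the quoted excerpt is reported as `NotShown`. Under those assumptions, the assert is reached only when a cap still marked new already has pending bits outside the allowed set.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the Capability state the excerpt inspects.
struct CapState {
  uint32_t pending;   // cap bits currently pending for the client
  bool is_new;        // models cap->is_new()
  bool is_suppress;   // models cap->is_suppress()
  bool is_stale;      // models cap->is_stale()
};

enum class Branch {
  SkipClient,    // "!revoke and new|suppressed|stale, skipping client"
  AssertFires,   // ceph_assert(!cap->is_new()) would fail
  RevokeStale,   // "revoke stale cap from client"
  NotShown       // falls through to code not quoted in this report
};

// Mirrors only the control flow of the quoted excerpt (assumption: nothing
// before it changes which branch is taken).
Branch classify(const CapState& cap, uint32_t allowed) {
  const bool revoking = (cap.pending & ~allowed) != 0;
  if (!revoking) {
    if (cap.is_new || cap.is_suppress || cap.is_stale)
      return Branch::SkipClient;
    return Branch::NotShown;
  }
  // Revocation path: the real code asserts the cap is not new here.
  if (cap.is_new)
    return Branch::AssertFires;
  if (cap.is_stale)
    return Branch::RevokeStale;
  return Branch::NotShown;
}
```

For example, `classify(CapState{0x3, true, false, false}, 0x1)` lands on `AssertFires`, while the same new cap with `allowed = 0x3` is skipped.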
Shoot, this bz got missed when it came in. Venky, please triage.
(In reply to Greg Farnum from comment #1)
> Shoot, this bz got missed when it came in. Venky, please triage.

ACK. On it today.