Bug 2218985

Summary: [5.3z3 ceph cluster] MDS - ceph-16.2.10/src/mds/Locker.cc: 2357: FAILED ceph_assert(!cap->is_new())
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Brett Hull <bhull>
Component: CephFS
Assignee: Venky Shankar <vshankar>
Status: NEW ---
QA Contact: Hemanth Kumar <hyelloji>
Severity: high
Docs Contact:
Priority: high
Version: 5.3
CC: bhull, ceph-eng-bugs, cephqe-warriors, gfarnum, vshankar
Target Milestone: ---
Target Release: 6.1z2
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Brett Hull 2023-06-30 19:32:31 UTC
Created attachment 1973454
stack dump in the mds log.

Description of problem:
A 5.3z3 cluster has seen many MDS issues. The most recent is a failed ceph_assert in Locker.cc (Locker::issue_caps, line 2357), which aborts the MDS.

2023-06-16T09:16:13.555+0000 7f926e8b1700 -1 /builddir/build/BUILD/ceph-16.2.10/src/mds/Locker.cc: In function 'int Locker::issue_caps(CInode*, Capability*)' thread 7f926e8b1700 time 2023-06-16T09:16:13.554458+0000
/builddir/build/BUILD/ceph-16.2.10/src/mds/Locker.cc: 2357: FAILED ceph_assert(!cap->is_new())

 ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f92774fb7b8]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x2799d2) [0x7f92774fb9d2]
 3: (Locker::issue_caps(CInode*, Capability*)+0x1682) [0x5592893d0ca2]                                   <- int Locker::issue_caps(CInode *in, Capability *only_cap)
 4: (Locker::simple_sync(SimpleLock*, bool*)+0x4bf) [0x5592893d98ef]
 5: (Locker::_rdlock_kick(SimpleLock*, bool)+0x22f) [0x5592893ef35f]
 6: (Locker::rdlock_start(SimpleLock*, boost::intrusive_ptr<MDRequestImpl>&, bool)+0xdf) [0x5592893efe5f]
 7: (Locker::acquire_locks(boost::intrusive_ptr<MDRequestImpl>&, MutationImpl::LockOpVec&, CInode*, bool)+0x28d6) [0x5592893f2d26]
 8: (Server::handle_client_getattr(boost::intrusive_ptr<MDRequestImpl>&, bool)+0x329) [0x559289260269]
 9: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0x677) [0x5592892abe57]
 10: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x403) [0x5592892ace53]
 11: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x12b) [0x5592892b165b]
 12: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0xbb4) [0x5592892069f4]
 13: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7bb) [0x5592892093ab]
 14: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x55) [0x5592892099a5]
 15: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x5592891f9598]
 16: (DispatchQueue::entry()+0x126a) [0x7f927774435a]
 17: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f92777f77b1]
 18: /lib64/libpthread.so.0(+0x81ca) [0x7f92764da1ca]
 19: clone()

2023-06-16T09:16:13.557+0000 7f926e8b1700 -1 *** Caught signal (Aborted) **
 in thread 7f926e8b1700 thread_name:ms_dispatch

 ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f92764e4cf0]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f92774fb809]
 5: /usr/lib64/ceph/libceph-common.so.2(+0x2799d2) [0x7f92774fb9d2]
 6: (Locker::issue_caps(CInode*, Capability*)+0x1682) [0x5592893d0ca2]
 7: (Locker::simple_sync(SimpleLock*, bool*)+0x4bf) [0x5592893d98ef]
 8: (Locker::_rdlock_kick(SimpleLock*, bool)+0x22f) [0x5592893ef35f]
 9: (Locker::rdlock_start(SimpleLock*, boost::intrusive_ptr<MDRequestImpl>&, bool)+0xdf) [0x5592893efe5f]
 10: (Locker::acquire_locks(boost::intrusive_ptr<MDRequestImpl>&, MutationImpl::LockOpVec&, CInode*, bool)+0x28d6) [0x5592893f2d26]
 11: (Server::handle_client_getattr(boost::intrusive_ptr<MDRequestImpl>&, bool)+0x329) [0x559289260269]
 12: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0x677) [0x5592892abe57]
 13: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x403) [0x5592892ace53]
 14: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x12b) [0x5592892b165b]
 15: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0xbb4) [0x5592892069f4]
 16: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7bb) [0x5592892093ab]
 17: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x55) [0x5592892099a5]
 18: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x5592891f9598]
 19: (DispatchQueue::entry()+0x126a) [0x7f927774435a]
 20: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f92777f77b1]
 21: /lib64/libpthread.so.0(+0x81ca) [0x7f92764da1ca]
 22: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Version-Release number of selected component (if applicable):
16.2.10-172.el8cp (Red Hat Ceph Storage 5.3z3, i.e. 5.3.3)

How reproducible:
Has not been reproduced. 

Steps to Reproduce:
Unknown; the failure has not been reproduced.

Actual results:
The MDS aborts on the failed assertion.

Expected results:
The MDS continues running without hitting the assertion.

Additional info:
---
Locker.cc
int Locker::issue_caps(CInode *in, Capability *only_cap)
{
  // count conflicts with
  int nissued = 0;
  int all_allowed = -1, loner_allowed = -1, xlocker_allowed = -1;

  ceph_assert(in->is_head());

  // client caps
. . .
    if (!(pending & ~allowed)) {
      // skip if suppress or new, and not revocation
      if (cap->is_new() || cap->is_suppress() || cap->is_stale()) {
        dout(20) << "  !revoke and new|suppressed|stale, skipping client." << it->first << dendl;
        continue;
      }
    } else {
      ceph_assert(!cap->is_new());    <- the assertion that fired (Locker.cc:2357 in the stack dump)
      if (cap->is_stale()) {
        dout(20) << "  revoke stale cap from client." << it->first << dendl;
        ceph_assert(!cap->is_valid());
        cap->issue(allowed & pending, false);
        mds->queue_waiter_front(new C_Locker_RevokeStaleCap(this, in, it->first));
        continue;
      }

I will attach the ceph-mds log.
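
To make the failing condition easier to reason about, the branch quoted from Locker.cc above can be modelled in isolation. This is only a sketch of the quoted excerpt, not the MDS code: CapModel, Action, and classify are hypothetical names, and Proceed simply means "falls through to the rest of issue_caps", which the excerpt does not show. The assertion can only fire when the cap is still flagged new and its pending bits include something outside the allowed set (i.e. a revocation would be needed).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the per-client Capability state used in the excerpt.
struct CapModel {
  std::uint32_t pending;  // cap bits currently pending for the client
  bool is_new;
  bool is_suppressed;
  bool is_stale;
};

// Outcomes of the quoted branch only; names are illustrative.
enum class Action { Skip, AssertFail, RevokeStale, Proceed };

// Mirrors the branch structure of the quoted excerpt and nothing else.
inline Action classify(const CapModel& cap, std::uint32_t allowed) {
  const bool revoking = (cap.pending & ~allowed) != 0;
  if (!revoking) {
    // "skip if suppress or new, and not revocation"
    if (cap.is_new || cap.is_suppressed || cap.is_stale) {
      return Action::Skip;
    }
    return Action::Proceed;
  }
  if (cap.is_new) {
    return Action::AssertFail;  // where ceph_assert(!cap->is_new()) would fire
  }
  if (cap.is_stale) {
    return Action::RevokeStale;
  }
  return Action::Proceed;  // remainder of issue_caps is not shown in the excerpt
}
```

For example, classify(CapModel{0x3, true, false, false}, 0x1) yields Action::AssertFail, because bit 0x2 is pending but not allowed while the cap is still new. The same cap with allowed = 0x3 yields Action::Skip.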

Comment 1 Greg Farnum 2023-07-12 01:54:58 UTC
Shoot, this bz got missed when it came in. Venky, please triage.

Comment 2 Venky Shankar 2023-07-12 04:12:31 UTC
(In reply to Greg Farnum from comment #1)
> Shoot, this bz got missed when it came in. Venky, please triage.

ACK. On it today.