Bug 2230067

Summary: [GSS] ceph crash rocksdb::port::Mutex::Mutex
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: ceph
Sub component: RADOS
Version: 4.11
Reporter: kelwhite
Assignee: Radoslaw Zarzynski <rzarzyns>
QA Contact: Elad <ebenahar>
Status: NEW
Severity: high
Priority: unspecified
Flags: rzarzyns: needinfo? (kelwhite)
CC: bniver, mcaldeir, muagarwa, nojha, odf-bz-bot, rzarzyns, sostapov
Target Milestone: ---
Target Release: ---
Hardware: All
OS: All
Type: Bug

Description kelwhite 2023-08-08 15:44:12 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

ceph mon-b is crashing with the following assert:

    "archived": "2023-08-03 19:41:43.584662",
    "backtrace": [
        "[0x3ffc9df8fde]",
        "gsignal()",
        "abort()",
        "(rocksdb::port::Mutex::Mutex(bool)+0) [0x2aa1715ac68]",
        "ceph-mon(+0x75adba) [0x2aa1715adba]",
        "(rocksdb::InstrumentedMutex::Lock()+0xda) [0x2aa1709ddba]",
        "ceph-mon(+0x55e748) [0x2aa16f5e748]",
        "(rocksdb::Cleanable::~Cleanable()+0x2a) [0x2aa17108552]",
        "(rocksdb::DBIter::~DBIter()+0x520) [0x2aa16fd6120]",
        "(rocksdb::ArenaWrappedDBIter::~ArenaWrappedDBIter()+0x30) [0x2aa171718a0]",
        "(std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x5a) [0x2aa16c4e122]",
        "(std::_Sp_counted_ptr<MonitorDBStore::WholeStoreIteratorImpl*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x62) [0x2aa16cabe0a]",
        "(std::_Rb_tree<unsigned long, std::pair<unsigned long const, Monitor::SyncProvider>, std::_Select1st<std::pair<unsigned long const, Monitor::SyncProvider> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, Monitor::SyncProvider> > >::_M_erase(std::_Rb_tree_node<std::pair<unsigned long const, Monitor::SyncProvider> >*)+0xe6) [0x2aa16cb3646]",
        "(Monitor::~Monitor()+0x394) [0x2aa16c971a4]",
        "(Monitor::~Monitor()+0x16) [0x2aa16c979ce]",
        "main()",
        "__libc_start_main()",
        "ceph-mon(+0x247404) [0x2aa16c47404]",
        "[(nil)]"
    "ceph_version": "16.2.8-84.el8cp",
    "crash_id": "2023-07-28T20:42:49.295902Z_db4ed366-a88e-47f8-bfef-de0eb3c90660",
    "entity_name": "mon.b",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "8.6 (Ootpa)",
    "os_version_id": "8.6",
    "process_name": "ceph-mon",
    "stack_sig": "2a56dfb3ea296f126d13a277cc531950fd2f183e2c4a986b67436b8cbea6dba7",
    "timestamp": "2023-07-28T20:42:49.295902Z",
    "utsname_hostname": "rook-ceph-mon-b-bfbc4c9fd-xhtv8",
    "utsname_machine": "s390x",
    "utsname_release": "4.18.0-372.52.1.el8_6.s390x",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Fri Mar 31 06:14:27 EDT 2023"

Is there a way to prevent this crash?
What does this crash mean? It seems we're asking for another mutex when we already hold one?
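
Reading the backtrace bottom-up: ~Monitor() erases the sync_providers map, which disposes a shared_ptr holding a MonitorDBStore::WholeStoreIteratorImpl; destroying that runs ~ArenaWrappedDBIter and ~DBIter, and the iterator's cleanup takes a RocksDB mutex (InstrumentedMutex::Lock) that ends in abort(). The Mutex::Mutex frame is probably just the nearest symbol to RocksDB's internal pthread-error handler, which aborts when pthread_mutex_lock fails. So this looks less like taking a mutex we already hold and more like the iterator outliving the RocksDB state it points into, so the lock lands on an already-destroyed mutex. A minimal sketch of that destruction-order pattern, assuming this reading is correct (the Store/Iterator types below are illustrative stand-ins, not the actual Ceph or RocksDB code):

    #include <cstdio>
    #include <cstdlib>
    #include <memory>
    #include <pthread.h>

    // Store stands in for the RocksDB-side state that owns the mutex.
    struct Store {
        pthread_mutex_t mu;
        Store()  { pthread_mutex_init(&mu, nullptr); }
        ~Store() { pthread_mutex_destroy(&mu); }
    };

    // Iterator stands in for the DB iterator a SyncProvider keeps alive.
    struct Iterator {
        Store* store;  // non-owning; dangles once the Store is gone
        explicit Iterator(Store* s) : store(s) {}
        ~Iterator() {
            // Mirrors rocksdb::port::Mutex::Lock(): abort on pthread error.
            int rc = pthread_mutex_lock(&store->mu);  // destroyed mutex: UB,
            if (rc != 0) {                            // often EINVAL on glibc
                std::fprintf(stderr, "pthread_mutex_lock: %d\n", rc);
                std::abort();  // matches the abort()/gsignal() frames above
            }
            pthread_mutex_unlock(&store->mu);
        }
    };

    int main() {
        auto store = std::make_unique<Store>();
        auto iter  = std::make_shared<Iterator>(store.get());
        store.reset();  // DB-side state torn down first...
        iter.reset();   // ...then the iterator: locks a dead mutex and aborts
    }

If that is what is happening, there is no runtime workaround on our side; the fix would be ordering teardown in ceph-mon so sync-provider iterators are destroyed before the store, which is what the upstream tracker referenced below appears to cover.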

Version of all relevant components (if applicable):
ODF 4.11
Ceph 5.2

Is there any workaround available to the best of your knowledge?
No

Additional info:
Found an upstream tracker that seems to be the same issue: https://tracker.ceph.com/issues/60268