Bug 2119322 - ceph-mon pod crashes from time to time
Summary: ceph-mon pod crashes from time to time
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ceph
Version: 4.7
Hardware: x86_64
OS: All
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Scott Ostapovicz
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-08-18 10:29 UTC by Miguel Blach
Modified: 2023-08-03 08:29 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:


Description Miguel Blach 2022-08-18 10:29:50 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

The rook-ceph-mon pod crashes from time to time with the following crash report:
{
    "crash_id": "2022-08-02_01:23:25.395526Z_2338c1bd-06a0-4107-9cf3-313e0b7d1b99",
    "timestamp": "2022-08-02 01:23:25.395526Z",
    "process_name": "ceph-mon",
    "entity_name": "mon.b",
    "ceph_version": "14.2.11-199.el8cp",
    "utsname_hostname": "rook-ceph-mon-b-7b559d7fdb-ql6c5", <---[2]
    "utsname_sysname": "Linux",
    "utsname_release": "4.18.0-193.51.1.el8_2.x86_64",
    "utsname_version": "#1 SMP Thu Apr 8 13:59:36 EDT 2021",
    "utsname_machine": "x86_64",
    "os_name": "Red Hat Enterprise Linux",
    "os_id": "rhel",
    "os_version_id": "8.4",
    "os_version": "8.4 (Ootpa)",
    "backtrace": [ <---[3]
        "(()+0x12b20) [0x7fe7cf9e5b20]",
        "(gsignal()+0x10f) [0x7fe7ce64637f]",
        "(abort()+0x127) [0x7fe7ce630db5]",
        "(()+0x9009b) [0x7fe7ceffe09b]",
        "(()+0x9653c) [0x7fe7cf00453c]",
        "(()+0x96597) [0x7fe7cf004597]",
        "(()+0x973f5) [0x7fe7cf0053f5]",
        "(rocksdb::Cleanable::~Cleanable()+0x20) [0x560b223a1b20]",
        "(rocksdb::IndexBlockIter::~IndexBlockIter()+0x5d) [0x560b2244b5cd]",
        "(rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::~BlockBasedTableIterator()+0x2e) [0x560b223925ee]",
        "(()+0x5cdc9e) [0x560b22310c9e]",
        "(rocksdb::MergingIterator::~MergingIterator()+0xd1) [0x560b223a6591]",
        "(rocksdb::DBIter::~DBIter()+0x4d5) [0x560b222b0345]",
        "(rocksdb::ArenaWrappedDBIter::~ArenaWrappedDBIter()+0x27) [0x560b222ac937]",
        "(rocksdb::ArenaWrappedDBIter::~ArenaWrappedDBIter()+0x15) [0x560b222ac995]",
        "(std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x47) [0x560b21f9ca77]",
        "(std::_Sp_counted_ptr<MonitorDBStore::WholeStoreIteratorImpl*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x61) [0x560b21ff62c1]",
        "(std::_Rb_tree<unsigned long, std::pair<unsigned long const, Monitor::SyncProvider>, std::_Select1st<std::pair<unsigned long const, Monitor::SyncProvider> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, Monitor::SyncProvider> > >::_M_erase(std::_Rb_tree_node<std::pair<unsigned long const, Monitor::SyncProvider> >*)+0xc8) [0x560b21ffb7f8]",
        "(Monitor::~Monitor()+0x49c) [0x560b21fdcfcc]",
        "(Monitor::~Monitor()+0xd) [0x560b21fdd3cd]",
        "(main()+0x5378) [0x560b21f6e128]",
        "(__libc_start_main()+0xf3) [0x7fe7ce632493]",
        "(_start()+0x2e) [0x560b21f97a8e]"
    ]
}
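
When comparing this report with other crash reports, it can help to reduce the backtrace to its named frames. The following is a minimal Python sketch, assuming every frame has the `(symbol+0xoffset) [0xaddress]` shape seen above. `FRAME_RE`, `named_frames`, and `passes_through` are hypothetical helper names for illustration, not part of Ceph.

```python
import re

# Matches frames such as "(gsignal()+0x10f) [0x7fe7ce64637f]".
# Assumption: all frames follow this shape, as in the report above.
FRAME_RE = re.compile(
    r"^\((?P<symbol>.*)\+0x(?P<offset>[0-9a-f]+)\) \[0x(?P<addr>[0-9a-f]+)\]$"
)


def named_frames(backtrace):
    """Return the symbol names of frames that carry one, innermost first."""
    names = []
    for frame in backtrace:
        match = FRAME_RE.match(frame)
        if match is None:
            continue  # unexpected frame format: skip rather than guess
        symbol = match.group("symbol")
        if symbol != "()":  # "()" marks a frame with no resolved symbol
            names.append(symbol)
    return names


def passes_through(backtrace, symbol):
    """True if any named frame in the backtrace equals `symbol`."""
    return symbol in named_frames(backtrace)


# A few frames copied from the crash report above.
sample = [
    "(()+0x12b20) [0x7fe7cf9e5b20]",
    "(gsignal()+0x10f) [0x7fe7ce64637f]",
    "(abort()+0x127) [0x7fe7ce630db5]",
    "(rocksdb::Cleanable::~Cleanable()+0x20) [0x560b223a1b20]",
    "(Monitor::~Monitor()+0x49c) [0x560b21fdcfcc]",
    "(main()+0x5378) [0x560b21f6e128]",
]

if __name__ == "__main__":
    for name in named_frames(sample):
        print(name)
```

Applied to the full backtrace, the named frames show `abort()` being reached through the rocksdb iterator destructors, which are in turn reached from `Monitor::~Monitor()` called from `main()`.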


We found a similar upstream Ceph bug: https://tracker.ceph.com/issues/52151


Version of all relevant components (if applicable):

NAME     VERSION  AVAILABLE  PROGRESSING  SINCE  STATUS
version  4.7.51   True       False        13h    Cluster version is 4.7.51

ocs-operator.v4.7.7



Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

This impacts the cluster: some pipelines check the OCS status and fail because the cluster health is HEALTH_WARN.
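
As an illustration of how a pipeline check could tell a crash-report warning apart from other health problems, here is a minimal Python sketch. It makes two assumptions that this report does not confirm: that the pipeline has the parsed output of `ceph health --format json`, with a top-level `status` and a `checks` mapping keyed by health check code, and that the warning seen here is the `RECENT_CRASH` check. `classify_health` is a hypothetical name.

```python
def classify_health(health):
    """Classify a parsed Ceph health document.

    Assumed input shape (not taken from this report):
        {"status": "HEALTH_WARN", "checks": {"RECENT_CRASH": {...}}}

    Returns "ok", "recent-crash-only", or "degraded".
    """
    status = health.get("status")
    if status == "HEALTH_OK":
        return "ok"
    check_codes = set(health.get("checks", {}))
    # Only a warning whose sole cause is unarchived crash reports is
    # treated as a separate case; anything else counts as degraded.
    if status == "HEALTH_WARN" and check_codes == {"RECENT_CRASH"}:
        return "recent-crash-only"
    return "degraded"


if __name__ == "__main__":
    example = {"status": "HEALTH_WARN", "checks": {"RECENT_CRASH": {}}}
    print(classify_health(example))
```

Whether a pipeline should proceed in the "recent-crash-only" case is a policy decision for the cluster owner. The sketch only separates the cases and does not address the crash itself.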


Is there any workaround available to the best of your knowledge?

For this environment, clearing the recorded crashes resolves the issue, but the crashes are becoming more frequent and a permanent solution is needed.


Is this issue reproducible?

N/A

Can this issue be reproduced from the UI?

N/A


Actual results:

rook-ceph-mon crashes, which causes deployed pipelines that check the OCS status to fail and delays application deployments.

Expected results:

rook-ceph-mon does not crash.

Additional info:

