Description of problem (please be as detailed as possible and provide log snippets):
I have an OCP 4.13.0-rc.8 system with 3 masters and 6 nodes, running 1500 small cirros VMs. A rook-ceph-mon pod crashes and recovers every day.

Version of all relevant components (if applicable):
sh-4.4$ ceph version
ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
The current user impact is that ODF is in a degraded state because of the crashes.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install the system
2. Start 1500 VMs
3. Monitor the rook-ceph-mon pods

Actual results:
rook-ceph-mon crashes on a daily basis

Expected results:
rook-ceph-mon does not crash

Additional info:
Logs will be added
The mon-b log shows that a scrub was in progress when the assert was hit. Could the RADOS team take a look?

2023-05-27T16:44:26.576222686Z debug -1> 2023-05-27T16:44:26.488+0000 7f821f368700 -1 /builddir/build/BUILD/ceph-16.2.10/src/mon/Monitor.cc: In function 'bool Monitor::_scrub(ScrubResult*, std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> >*, int*)' thread 7f821f368700 time 2023-05-27T16:44:26.487590+0000
2023-05-27T16:44:26.576222686Z /builddir/build/BUILD/ceph-16.2.10/src/mon/Monitor.cc: 5644: FAILED ceph_assert(err == 0)
2023-05-27T16:44:26.576222686Z
2023-05-27T16:44:26.576222686Z ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)
2023-05-27T16:44:26.576222686Z 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f822cec97b8]
2023-05-27T16:44:26.576222686Z 2: /usr/lib64/ceph/libceph-common.so.2(+0x2799d2) [0x7f822cec99d2]
2023-05-27T16:44:26.576222686Z 3: (Monitor::_scrub(ScrubResult*, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*, int*)+0x14bf) [0x55ee90ac84df]
2023-05-27T16:44:26.576222686Z 4: (Monitor::handle_scrub(boost::intrusive_ptr<MonOpRequest>)+0x288) [0x55ee90acc708]
2023-05-27T16:44:26.576222686Z 5: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xdc0) [0x55ee90aecd50]
2023-05-27T16:44:26.576222686Z 6: (Monitor::_ms_dispatch(Message*)+0x670) [0x55ee90aedd40]
2023-05-27T16:44:26.576222686Z 7: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x55ee90b1d0dc]
2023-05-27T16:44:26.576222686Z 8: (DispatchQueue::entry()+0x126a) [0x7f822d111e5a]
2023-05-27T16:44:26.576222686Z 9: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f822d1c52b1]
2023-05-27T16:44:26.576222686Z 10: /lib64/libpthread.so.0(+0x81ca) [0x7f822abf51ca]
2023-05-27T16:44:26.576222686Z 11: clone()
2023-05-27T16:44:26.576222686Z
2023-05-27T16:44:26.576222686Z debug 0> 2023-05-27T16:44:26.490+0000 7f821f368700 -1 *** Caught signal (Aborted) **
2023-05-27T16:44:26.576231789Z in thread 7f821f368700 thread_name:ms_dispatch
2023-05-27T16:44:26.576231789Z
2023-05-27T16:44:26.576231789Z ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)
2023-05-27T16:44:26.576231789Z 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f822abffcf0]
2023-05-27T16:44:26.576231789Z 2: gsignal()
2023-05-27T16:44:26.576231789Z 3: abort()
2023-05-27T16:44:26.576231789Z 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f822cec9809]
2023-05-27T16:44:26.576231789Z 5: /usr/lib64/ceph/libceph-common.so.2(+0x2799d2) [0x7f822cec99d2]
2023-05-27T16:44:26.576231789Z 6: (Monitor::_scrub(ScrubResult*, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*, int*)+0x14bf) [0x55ee90ac84df]
2023-05-27T16:44:26.576231789Z 7: (Monitor::handle_scrub(boost::intrusive_ptr<MonOpRequest>)+0x288) [0x55ee90acc708]
2023-05-27T16:44:26.576231789Z 8: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xdc0) [0x55ee90aecd50]
2023-05-27T16:44:26.576231789Z 9: (Monitor::_ms_dispatch(Message*)+0x670) [0x55ee90aedd40]
2023-05-27T16:44:26.576231789Z 10: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x55ee90b1d0dc]
2023-05-27T16:44:26.576231789Z 11: (DispatchQueue::entry()+0x126a) [0x7f822d111e5a]
2023-05-27T16:44:26.576231789Z 12: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f822d1c52b1]
2023-05-27T16:44:26.576231789Z 13: /lib64/libpthread.so.0(+0x81ca) [0x7f822abf51ca]
2023-05-27T16:44:26.576231789Z 14: clone()
2023-05-27T16:44:26.576231789Z NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2023-05-27T16:44:26.576231789Z
2023-05-27T16:44:26.576330604Z --- logging levels ---
2023-05-27T16:44:26.576335443Z 0/ 5 none
2023-05-27T16:44:26.576335443Z 0/ 1 lockdep
2023-05-27T16:44:26.576340240Z 0/ 1 context
2023-05-27T16:44:26.576340240Z 1/ 1 crush
2023-05-27T16:44:26.576344865Z 1/ 5 mds
2023-05-27T16:44:26.576344865Z 1/ 5 mds_balancer
2023-05-27T16:44:26.576349484Z 1/ 5 mds_locker
2023-05-27T16:44:26.576354121Z 1/ 5 mds_log
2023-05-27T16:44:26.576354121Z 1/ 5 mds_log_expire
2023-05-27T16:44:26.576358938Z 1/ 5 mds_migrator
2023-05-27T16:44:26.576358938Z 0/ 1 buffer
2023-05-27T16:44:26.576363710Z 0/ 1 timer
2023-05-27T16:44:26.576363710Z 0/ 1 filer
2023-05-27T16:44:26.576368519Z 0/ 1 striper
2023-05-27T16:44:26.576373104Z 0/ 1 objecter
2023-05-27T16:44:26.576373104Z 0/ 5 rados
2023-05-27T16:44:26.576377902Z 0/ 5 rbd
2023-05-27T16:44:26.576377902Z 0/ 5 rbd_mirror
2023-05-27T16:44:26.576382597Z 0/ 5 rbd_replay
2023-05-27T16:44:26.576382597Z 0/ 5 rbd_pwl
2023-05-27T16:44:26.576387306Z 0/ 5 journaler
2023-05-27T16:44:26.576391909Z 0/ 5 objectcacher
2023-05-27T16:44:26.576391909Z 0/ 5 immutable_obj_cache
2023-05-27T16:44:26.576396717Z 0/ 5 client
2023-05-27T16:44:26.576396717Z 1/ 5 osd
2023-05-27T16:44:26.576401553Z 0/ 5 optracker
2023-05-27T16:44:26.576401553Z 0/ 5 objclass
2023-05-27T16:44:26.576406371Z 1/ 3 filestore
2023-05-27T16:44:26.576410935Z 1/ 3 journal
2023-05-27T16:44:26.576410935Z 0/ 0 ms
2023-05-27T16:44:26.576421317Z 1/ 5 mon
2023-05-27T16:44:26.576427225Z 0/10 monc
2023-05-27T16:44:26.576427225Z 1/ 5 paxos
2023-05-27T16:44:26.576431824Z 0/ 5 tp
2023-05-27T16:44:26.576431824Z 1/ 5 auth
2023-05-27T16:44:26.576436586Z 1/ 5 crypto
2023-05-27T16:44:26.576436586Z 1/ 1 finisher
2023-05-27T16:44:26.576441499Z 1/ 1 reserver
2023-05-27T16:44:26.576446183Z 1/ 5 heartbeatmap
2023-05-27T16:44:26.576446183Z 1/ 5 perfcounter
2023-05-27T16:44:26.576450728Z 1/ 5 rgw
2023-05-27T16:44:26.576450728Z 1/ 5 rgw_sync
2023-05-27T16:44:26.576455385Z 1/10 civetweb
2023-05-27T16:44:26.576455385Z 1/ 5 rgw_access
2023-05-27T16:44:26.576460100Z 1/ 5 javaclient
2023-05-27T16:44:26.576464761Z 1/ 5 asok
2023-05-27T16:44:26.576464761Z 1/ 1 throttle
2023-05-27T16:44:26.576469194Z 0/ 0 refs
2023-05-27T16:44:26.576469194Z 1/ 5 compressor
2023-05-27T16:44:26.576474092Z 1/ 5 bluestore
2023-05-27T16:44:26.576474092Z 1/ 5 bluefs
2023-05-27T16:44:26.576478860Z 1/ 3 bdev
2023-05-27T16:44:26.576483568Z 1/ 5 kstore
2023-05-27T16:44:26.576483568Z 4/ 5 rocksdb
2023-05-27T16:44:26.576488136Z 4/ 5 leveldb
2023-05-27T16:44:26.576488136Z 4/ 5 memdb
2023-05-27T16:44:26.576495229Z 1/ 5 fuse
2023-05-27T16:44:26.576495229Z 2/ 5 mgr
2023-05-27T16:44:26.576495229Z 1/ 5 mgrc
2023-05-27T16:44:26.576500409Z 1/ 5 dpdk
2023-05-27T16:44:26.576500409Z 1/ 5 eventtrace
2023-05-27T16:44:26.576505441Z 1/ 5 prioritycache
2023-05-27T16:44:26.576505441Z 0/ 5 test
2023-05-27T16:44:26.576510325Z 0/ 5 cephfs_mirror
2023-05-27T16:44:26.576515031Z 0/ 5 cephsqlite
2023-05-27T16:44:26.576515031Z -2/-2 (syslog threshold)
2023-05-27T16:44:26.576519739Z 99/99 (stderr threshold)
2023-05-27T16:44:26.576519739Z --- pthread ID / name mapping for recent threads ---
2023-05-27T16:44:26.576562376Z 140196770608896 / rstore_compact
2023-05-27T16:44:26.576575250Z 140196787394304 / ms_dispatch
2023-05-27T16:44:26.576588150Z 140196804179712 / rocksdb:dump_st
2023-05-27T16:44:26.576588150Z 140196846143232 / ms_dispatch
2023-05-27T16:44:26.576604276Z 140196888106752 / safe_timer
2023-05-27T16:44:26.576617460Z 140196938462976 / rocksdb:high0
2023-05-27T16:44:26.576630158Z 140196946855680 / rocksdb:low0
2023-05-27T16:44:26.576641950Z 140196972541696 / admin_socket
2023-05-27T16:44:26.576641950Z max_recent 10000
2023-05-27T16:44:26.576647137Z max_new 10000
2023-05-27T16:44:26.576651869Z log_file /var/lib/ceph/crash/2023-05-27T16:44:26.490533Z_050d0644-702c-4c84-b5f1-799e7d4f7013/log
2023-05-27T16:44:26.576651869Z --- end dump of recent events ---
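
Since the crash recurs daily, it may help to check whether each occurrence hits the same assert. The following is only a rough sketch for pulling the failed assertion and the crash ID out of a saved mon log. The function name `summarize_mon_crash` and the two regular expressions are my own assumptions, based solely on the log lines pasted above, not on any Ceph tooling.

```python
import re

# Patterns are assumptions derived from the log excerpt in this report.
ASSERT_RE = re.compile(
    r"(?P<file>/\S+): (?P<line>\d+): FAILED (?P<expr>ceph_assert\(.*?\))"
)
CRASH_RE = re.compile(r"log_file /var/lib/ceph/crash/(?P<crash_id>\S+)/log")


def summarize_mon_crash(log_text):
    """Return the first failed assertion and crash ID found in log_text."""
    summary = {}
    match = ASSERT_RE.search(log_text)
    if match:
        summary["file"] = match.group("file")
        summary["line"] = int(match.group("line"))
        summary["assert"] = match.group("expr")
    match = CRASH_RE.search(log_text)
    if match:
        summary["crash_id"] = match.group("crash_id")
    return summary


# Two lines copied from the mon-b log above.
sample = (
    "2023-05-27T16:44:26.576222686Z /builddir/build/BUILD/ceph-16.2.10/"
    "src/mon/Monitor.cc: 5644: FAILED ceph_assert(err == 0)\n"
    "2023-05-27T16:44:26.576651869Z log_file /var/lib/ceph/crash/"
    "2023-05-27T16:44:26.490533Z_050d0644-702c-4c84-b5f1-799e7d4f7013/log\n"
)

if __name__ == "__main__":
    print(summarize_mon_crash(sample))
```

Running this over each day's mon log would show whether Monitor.cc line 5644 is hit every time. The extracted crash ID should correspond to the entry that `ceph crash info <id>` reports, assuming the crash module collected it.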