Bug 2210548

Summary: rook-ceph-mon pod crashes from time to time
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: guy chen <guchen>
Component: ceph
Assignee: Radoslaw Zarzynski <rzarzyns>
ceph sub component: RADOS
QA Contact: Elad <ebenahar>
Status: NEW ---
Docs Contact:
Severity: unspecified
Priority: unspecified
CC: bniver, muagarwa, nojha, odf-bz-bot, rzarzyns, sostapov
Version: 4.13
Keywords: Performance
Target Milestone: ---
Flags: muagarwa: needinfo? (rzarzyns)
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description guy chen 2023-05-28 11:27:40 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

I have an OCP 4.13.0-rc.8 system with 3 masters and 6 nodes, running 1500 small cirros VMs. The rook-ceph-mon pod crashes and recovers every day.

Version of all relevant components (if applicable):
sh-4.4$ ceph version
ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
The current user impact is that ODF is in a degraded state because of the crashes.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install the system
2. Start 1500 VMs
3. Monitor the rook-ceph-mon pods


Actual results:
rook-ceph-mon crashes on a daily basis

Expected results:
rook-ceph-mon does not crash

Additional info:
Logs will be added

Comment 6 Travis Nielsen 2023-05-30 19:44:59 UTC
The mon-b log shows that scrubbing was in progress when the crash was hit: the failed assert is in Monitor::_scrub. Could the RADOS team take a look?


2023-05-27T16:44:26.576222686Z debug     -1> 2023-05-27T16:44:26.488+0000 7f821f368700 -1 /builddir/build/BUILD/ceph-16.2.10/src/mon/Monitor.cc: In function 'bool Monitor::_scrub(ScrubResult*, std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> >*, int*)' thread 7f821f368700 time 2023-05-27T16:44:26.487590+0000
2023-05-27T16:44:26.576222686Z /builddir/build/BUILD/ceph-16.2.10/src/mon/Monitor.cc: 5644: FAILED ceph_assert(err == 0)
2023-05-27T16:44:26.576222686Z 
2023-05-27T16:44:26.576222686Z  ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)
2023-05-27T16:44:26.576222686Z  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f822cec97b8]
2023-05-27T16:44:26.576222686Z  2: /usr/lib64/ceph/libceph-common.so.2(+0x2799d2) [0x7f822cec99d2]
2023-05-27T16:44:26.576222686Z  3: (Monitor::_scrub(ScrubResult*, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*, int*)+0x14bf) [0x55ee90ac84df]
2023-05-27T16:44:26.576222686Z  4: (Monitor::handle_scrub(boost::intrusive_ptr<MonOpRequest>)+0x288) [0x55ee90acc708]
2023-05-27T16:44:26.576222686Z  5: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xdc0) [0x55ee90aecd50]
2023-05-27T16:44:26.576222686Z  6: (Monitor::_ms_dispatch(Message*)+0x670) [0x55ee90aedd40]
2023-05-27T16:44:26.576222686Z  7: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x55ee90b1d0dc]
2023-05-27T16:44:26.576222686Z  8: (DispatchQueue::entry()+0x126a) [0x7f822d111e5a]
2023-05-27T16:44:26.576222686Z  9: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f822d1c52b1]
2023-05-27T16:44:26.576222686Z  10: /lib64/libpthread.so.0(+0x81ca) [0x7f822abf51ca]
2023-05-27T16:44:26.576222686Z  11: clone()
2023-05-27T16:44:26.576222686Z 
2023-05-27T16:44:26.576222686Z debug      0> 2023-05-27T16:44:26.490+0000 7f821f368700 -1 *** Caught signal (Aborted) **
2023-05-27T16:44:26.576231789Z  in thread 7f821f368700 thread_name:ms_dispatch
2023-05-27T16:44:26.576231789Z 
2023-05-27T16:44:26.576231789Z  ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)
2023-05-27T16:44:26.576231789Z  1: /lib64/libpthread.so.0(+0x12cf0) [0x7f822abffcf0]
2023-05-27T16:44:26.576231789Z  2: gsignal()
2023-05-27T16:44:26.576231789Z  3: abort()
2023-05-27T16:44:26.576231789Z  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f822cec9809]
2023-05-27T16:44:26.576231789Z  5: /usr/lib64/ceph/libceph-common.so.2(+0x2799d2) [0x7f822cec99d2]
2023-05-27T16:44:26.576231789Z  6: (Monitor::_scrub(ScrubResult*, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*, int*)+0x14bf) [0x55ee90ac84df]
2023-05-27T16:44:26.576231789Z  7: (Monitor::handle_scrub(boost::intrusive_ptr<MonOpRequest>)+0x288) [0x55ee90acc708]
2023-05-27T16:44:26.576231789Z  8: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xdc0) [0x55ee90aecd50]
2023-05-27T16:44:26.576231789Z  9: (Monitor::_ms_dispatch(Message*)+0x670) [0x55ee90aedd40]
2023-05-27T16:44:26.576231789Z  10: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x55ee90b1d0dc]
2023-05-27T16:44:26.576231789Z  11: (DispatchQueue::entry()+0x126a) [0x7f822d111e5a]
2023-05-27T16:44:26.576231789Z  12: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f822d1c52b1]
2023-05-27T16:44:26.576231789Z  13: /lib64/libpthread.so.0(+0x81ca) [0x7f822abf51ca]
2023-05-27T16:44:26.576231789Z  14: clone()
2023-05-27T16:44:26.576231789Z  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2023-05-27T16:44:26.576231789Z 
2023-05-27T16:44:26.576330604Z --- logging levels ---
2023-05-27T16:44:26.576335443Z    0/ 5 none
2023-05-27T16:44:26.576335443Z    0/ 1 lockdep
2023-05-27T16:44:26.576340240Z    0/ 1 context
2023-05-27T16:44:26.576340240Z    1/ 1 crush
2023-05-27T16:44:26.576344865Z    1/ 5 mds
2023-05-27T16:44:26.576344865Z    1/ 5 mds_balancer
2023-05-27T16:44:26.576349484Z    1/ 5 mds_locker
2023-05-27T16:44:26.576354121Z    1/ 5 mds_log
2023-05-27T16:44:26.576354121Z    1/ 5 mds_log_expire
2023-05-27T16:44:26.576358938Z    1/ 5 mds_migrator
2023-05-27T16:44:26.576358938Z    0/ 1 buffer
2023-05-27T16:44:26.576363710Z    0/ 1 timer
2023-05-27T16:44:26.576363710Z    0/ 1 filer
2023-05-27T16:44:26.576368519Z    0/ 1 striper
2023-05-27T16:44:26.576373104Z    0/ 1 objecter
2023-05-27T16:44:26.576373104Z    0/ 5 rados
2023-05-27T16:44:26.576377902Z    0/ 5 rbd
2023-05-27T16:44:26.576377902Z    0/ 5 rbd_mirror
2023-05-27T16:44:26.576382597Z    0/ 5 rbd_replay
2023-05-27T16:44:26.576382597Z    0/ 5 rbd_pwl
2023-05-27T16:44:26.576387306Z    0/ 5 journaler
2023-05-27T16:44:26.576391909Z    0/ 5 objectcacher
2023-05-27T16:44:26.576391909Z    0/ 5 immutable_obj_cache
2023-05-27T16:44:26.576396717Z    0/ 5 client
2023-05-27T16:44:26.576396717Z    1/ 5 osd
2023-05-27T16:44:26.576401553Z    0/ 5 optracker
2023-05-27T16:44:26.576401553Z    0/ 5 objclass
2023-05-27T16:44:26.576406371Z    1/ 3 filestore
2023-05-27T16:44:26.576410935Z    1/ 3 journal
2023-05-27T16:44:26.576410935Z    0/ 0 ms
2023-05-27T16:44:26.576421317Z    1/ 5 mon
2023-05-27T16:44:26.576427225Z    0/10 monc
2023-05-27T16:44:26.576427225Z    1/ 5 paxos
2023-05-27T16:44:26.576431824Z    0/ 5 tp
2023-05-27T16:44:26.576431824Z    1/ 5 auth
2023-05-27T16:44:26.576436586Z    1/ 5 crypto
2023-05-27T16:44:26.576436586Z    1/ 1 finisher
2023-05-27T16:44:26.576441499Z    1/ 1 reserver
2023-05-27T16:44:26.576446183Z    1/ 5 heartbeatmap
2023-05-27T16:44:26.576446183Z    1/ 5 perfcounter
2023-05-27T16:44:26.576450728Z    1/ 5 rgw
2023-05-27T16:44:26.576450728Z    1/ 5 rgw_sync
2023-05-27T16:44:26.576455385Z    1/10 civetweb
2023-05-27T16:44:26.576455385Z    1/ 5 rgw_access
2023-05-27T16:44:26.576460100Z    1/ 5 javaclient
2023-05-27T16:44:26.576464761Z    1/ 5 asok
2023-05-27T16:44:26.576464761Z    1/ 1 throttle
2023-05-27T16:44:26.576469194Z    0/ 0 refs
2023-05-27T16:44:26.576469194Z    1/ 5 compressor
2023-05-27T16:44:26.576474092Z    1/ 5 bluestore
2023-05-27T16:44:26.576474092Z    1/ 5 bluefs
2023-05-27T16:44:26.576478860Z    1/ 3 bdev
2023-05-27T16:44:26.576483568Z    1/ 5 kstore
2023-05-27T16:44:26.576483568Z    4/ 5 rocksdb
2023-05-27T16:44:26.576488136Z    4/ 5 leveldb
2023-05-27T16:44:26.576488136Z    4/ 5 memdb
2023-05-27T16:44:26.576495229Z    1/ 5 fuse
2023-05-27T16:44:26.576495229Z    2/ 5 mgr
2023-05-27T16:44:26.576495229Z    1/ 5 mgrc
2023-05-27T16:44:26.576500409Z    1/ 5 dpdk
2023-05-27T16:44:26.576500409Z    1/ 5 eventtrace
2023-05-27T16:44:26.576505441Z    1/ 5 prioritycache
2023-05-27T16:44:26.576505441Z    0/ 5 test
2023-05-27T16:44:26.576510325Z    0/ 5 cephfs_mirror
2023-05-27T16:44:26.576515031Z    0/ 5 cephsqlite
2023-05-27T16:44:26.576515031Z   -2/-2 (syslog threshold)
2023-05-27T16:44:26.576519739Z   99/99 (stderr threshold)
2023-05-27T16:44:26.576519739Z --- pthread ID / name mapping for recent threads ---
2023-05-27T16:44:26.576562376Z   140196770608896 / rstore_compact
2023-05-27T16:44:26.576575250Z   140196787394304 / ms_dispatch
2023-05-27T16:44:26.576588150Z   140196804179712 / rocksdb:dump_st
2023-05-27T16:44:26.576588150Z   140196846143232 / ms_dispatch
2023-05-27T16:44:26.576604276Z   140196888106752 / safe_timer
2023-05-27T16:44:26.576617460Z   140196938462976 / rocksdb:high0
2023-05-27T16:44:26.576630158Z   140196946855680 / rocksdb:low0
2023-05-27T16:44:26.576641950Z   140196972541696 / admin_socket
2023-05-27T16:44:26.576641950Z   max_recent     10000
2023-05-27T16:44:26.576647137Z   max_new        10000
2023-05-27T16:44:26.576651869Z   log_file /var/lib/ceph/crash/2023-05-27T16:44:26.490533Z_050d0644-702c-4c84-b5f1-799e7d4f7013/log
2023-05-27T16:44:26.576651869Z --- end dump of recent events ---
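
For triage across several mon logs, the failed assert can be pulled out of a dump like the one above with a short script. This is only a rough sketch: the function name find_failed_asserts and both regular expressions are illustrative assumptions based on the line format visible in this log (a container-runtime timestamp prefix followed by "<file>: <line>: FAILED ceph_assert(<condition>)"), not part of any Ceph or Rook tooling.

```python
import re

# Assumed format, taken from the log above:
#   "<runtime timestamp>Z <source file>: <line>: FAILED ceph_assert(<condition>)"
PREFIX_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z ")
ASSERT_RE = re.compile(
    r"(?P<file>\S+): (?P<line>\d+): FAILED ceph_assert\((?P<cond>.*)\)\s*$"
)


def find_failed_asserts(log_text):
    """Return (source_file, line_number, condition) for each failed assert."""
    results = []
    for raw in log_text.splitlines():
        # Drop the container-runtime timestamp that prefixes every line.
        line = PREFIX_RE.sub("", raw, count=1)
        match = ASSERT_RE.search(line)
        if match:
            results.append(
                (match.group("file"), int(match.group("line")), match.group("cond"))
            )
    return results


# Two lines copied from the mon-b log in this comment.
sample = (
    "2023-05-27T16:44:26.576222686Z /builddir/build/BUILD/ceph-16.2.10/src/mon/Monitor.cc: 5644: FAILED ceph_assert(err == 0)\n"
    "2023-05-27T16:44:26.576222686Z  ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)\n"
)

for source_file, line_number, condition in find_failed_asserts(sample):
    print(f"{source_file}:{line_number} assert failed: {condition}")
```

Running it over each mon pod's log would show whether every daily crash hits the same assert in Monitor::_scrub or whether different failures are involved.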