Bug 2210548 - rook-ceph-mon pod crashes from time to time [NEEDINFO]
Summary: rook-ceph-mon pod crashes from time to time
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.13
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Radoslaw Zarzynski
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-05-28 11:27 UTC by guy chen
Modified: 2023-08-09 16:37 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
muagarwa: needinfo? (rzarzyns)


Attachments:

Description guy chen 2023-05-28 11:27:40 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

I have an OCP 4.13.0-rc.8 system with 3 masters and 6 nodes, running 1500 small CirrOS VMs. A rook-ceph-mon pod crashes and recovers every day.

Version of all relevant components (if applicable):
sh-4.4$ ceph version
ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
The current user impact is that ODF is in a degraded state due to the crashes.
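
For reference, a minimal sketch of how to confirm the degraded state from the Ceph side, assuming the toolbox is enabled and its deployment is named rook-ceph-tools in the openshift-storage namespace (both names are assumptions, adjust to the cluster):

# Assumed toolbox deployment name and namespace.
oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph status
oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph health detail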

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install the system
2. Start 1500 VMs
3. Monitor the rook-ceph-mon pods (see the example below)

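A minimal sketch of how step 3 can be done, assuming the default openshift-storage namespace and the standard Rook label app=rook-ceph-mon (both assumptions):

# Watch the mon pods; the RESTARTS column increments on each crash.
oc -n openshift-storage get pods -l app=rook-ceph-mon -w

# After a restart, capture the log of the crashed container to attach to the bug.
oc -n openshift-storage logs <rook-ceph-mon-pod-name> --previous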

Actual results:
rook-ceph-mon crashes on a daily basis

Expected results:
rook-ceph-mon does not crash

Additional info:
Logs will be added

Comment 6 Travis Nielsen 2023-05-30 19:44:59 UTC
The mon-b log shows that scrubbing was happening when this was hit; could the RADOS team take a look?


2023-05-27T16:44:26.576222686Z debug     -1> 2023-05-27T16:44:26.488+0000 7f821f368700 -1 /builddir/build/BUILD/ceph-16.2.10/src/mon/Monitor.cc: In function 'bool Monitor::_scrub(ScrubResult*, std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> >*, int*)' thread 7f821f368700 time 2023-05-27T16:44:26.487590+0000
2023-05-27T16:44:26.576222686Z /builddir/build/BUILD/ceph-16.2.10/src/mon/Monitor.cc: 5644: FAILED ceph_assert(err == 0)
2023-05-27T16:44:26.576222686Z 
2023-05-27T16:44:26.576222686Z  ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)
2023-05-27T16:44:26.576222686Z  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f822cec97b8]
2023-05-27T16:44:26.576222686Z  2: /usr/lib64/ceph/libceph-common.so.2(+0x2799d2) [0x7f822cec99d2]
2023-05-27T16:44:26.576222686Z  3: (Monitor::_scrub(ScrubResult*, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*, int*)+0x14bf) [0x55ee90ac84df]
2023-05-27T16:44:26.576222686Z  4: (Monitor::handle_scrub(boost::intrusive_ptr<MonOpRequest>)+0x288) [0x55ee90acc708]
2023-05-27T16:44:26.576222686Z  5: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xdc0) [0x55ee90aecd50]
2023-05-27T16:44:26.576222686Z  6: (Monitor::_ms_dispatch(Message*)+0x670) [0x55ee90aedd40]
2023-05-27T16:44:26.576222686Z  7: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x55ee90b1d0dc]
2023-05-27T16:44:26.576222686Z  8: (DispatchQueue::entry()+0x126a) [0x7f822d111e5a]
2023-05-27T16:44:26.576222686Z  9: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f822d1c52b1]
2023-05-27T16:44:26.576222686Z  10: /lib64/libpthread.so.0(+0x81ca) [0x7f822abf51ca]
2023-05-27T16:44:26.576222686Z  11: clone()
2023-05-27T16:44:26.576222686Z 
2023-05-27T16:44:26.576222686Z debug      0> 2023-05-27T16:44:26.490+0000 7f821f368700 -1 *** Caught signal (Aborted) **
2023-05-27T16:44:26.576231789Z  in thread 7f821f368700 thread_name:ms_dispatch
2023-05-27T16:44:26.576231789Z 
2023-05-27T16:44:26.576231789Z  ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)
2023-05-27T16:44:26.576231789Z  1: /lib64/libpthread.so.0(+0x12cf0) [0x7f822abffcf0]
2023-05-27T16:44:26.576231789Z  2: gsignal()
2023-05-27T16:44:26.576231789Z  3: abort()
2023-05-27T16:44:26.576231789Z  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f822cec9809]
2023-05-27T16:44:26.576231789Z  5: /usr/lib64/ceph/libceph-common.so.2(+0x2799d2) [0x7f822cec99d2]
2023-05-27T16:44:26.576231789Z  6: (Monitor::_scrub(ScrubResult*, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*, int*)+0x14bf) [0x55ee90ac84df]
2023-05-27T16:44:26.576231789Z  7: (Monitor::handle_scrub(boost::intrusive_ptr<MonOpRequest>)+0x288) [0x55ee90acc708]
2023-05-27T16:44:26.576231789Z  8: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xdc0) [0x55ee90aecd50]
2023-05-27T16:44:26.576231789Z  9: (Monitor::_ms_dispatch(Message*)+0x670) [0x55ee90aedd40]
2023-05-27T16:44:26.576231789Z  10: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x55ee90b1d0dc]
2023-05-27T16:44:26.576231789Z  11: (DispatchQueue::entry()+0x126a) [0x7f822d111e5a]
2023-05-27T16:44:26.576231789Z  12: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f822d1c52b1]
2023-05-27T16:44:26.576231789Z  13: /lib64/libpthread.so.0(+0x81ca) [0x7f822abf51ca]
2023-05-27T16:44:26.576231789Z  14: clone()
2023-05-27T16:44:26.576231789Z  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2023-05-27T16:44:26.576231789Z 
2023-05-27T16:44:26.576330604Z --- logging levels ---
2023-05-27T16:44:26.576335443Z    0/ 5 none
2023-05-27T16:44:26.576335443Z    0/ 1 lockdep
2023-05-27T16:44:26.576340240Z    0/ 1 context
2023-05-27T16:44:26.576340240Z    1/ 1 crush
2023-05-27T16:44:26.576344865Z    1/ 5 mds
2023-05-27T16:44:26.576344865Z    1/ 5 mds_balancer
2023-05-27T16:44:26.576349484Z    1/ 5 mds_locker
2023-05-27T16:44:26.576354121Z    1/ 5 mds_log
2023-05-27T16:44:26.576354121Z    1/ 5 mds_log_expire
2023-05-27T16:44:26.576358938Z    1/ 5 mds_migrator
2023-05-27T16:44:26.576358938Z    0/ 1 buffer
2023-05-27T16:44:26.576363710Z    0/ 1 timer
2023-05-27T16:44:26.576363710Z    0/ 1 filer
2023-05-27T16:44:26.576368519Z    0/ 1 striper
2023-05-27T16:44:26.576373104Z    0/ 1 objecter
2023-05-27T16:44:26.576373104Z    0/ 5 rados
2023-05-27T16:44:26.576377902Z    0/ 5 rbd
2023-05-27T16:44:26.576377902Z    0/ 5 rbd_mirror
2023-05-27T16:44:26.576382597Z    0/ 5 rbd_replay
2023-05-27T16:44:26.576382597Z    0/ 5 rbd_pwl
2023-05-27T16:44:26.576387306Z    0/ 5 journaler
2023-05-27T16:44:26.576391909Z    0/ 5 objectcacher
2023-05-27T16:44:26.576391909Z    0/ 5 immutable_obj_cache
2023-05-27T16:44:26.576396717Z    0/ 5 client
2023-05-27T16:44:26.576396717Z    1/ 5 osd
2023-05-27T16:44:26.576401553Z    0/ 5 optracker
2023-05-27T16:44:26.576401553Z    0/ 5 objclass
2023-05-27T16:44:26.576406371Z    1/ 3 filestore
2023-05-27T16:44:26.576410935Z    1/ 3 journal
2023-05-27T16:44:26.576410935Z    0/ 0 ms
2023-05-27T16:44:26.576421317Z    1/ 5 mon
2023-05-27T16:44:26.576427225Z    0/10 monc
2023-05-27T16:44:26.576427225Z    1/ 5 paxos
2023-05-27T16:44:26.576431824Z    0/ 5 tp
2023-05-27T16:44:26.576431824Z    1/ 5 auth
2023-05-27T16:44:26.576436586Z    1/ 5 crypto
2023-05-27T16:44:26.576436586Z    1/ 1 finisher
2023-05-27T16:44:26.576441499Z    1/ 1 reserver
2023-05-27T16:44:26.576446183Z    1/ 5 heartbeatmap
2023-05-27T16:44:26.576446183Z    1/ 5 perfcounter
2023-05-27T16:44:26.576450728Z    1/ 5 rgw
2023-05-27T16:44:26.576450728Z    1/ 5 rgw_sync
2023-05-27T16:44:26.576455385Z    1/10 civetweb
2023-05-27T16:44:26.576455385Z    1/ 5 rgw_access
2023-05-27T16:44:26.576460100Z    1/ 5 javaclient
2023-05-27T16:44:26.576464761Z    1/ 5 asok
2023-05-27T16:44:26.576464761Z    1/ 1 throttle
2023-05-27T16:44:26.576469194Z    0/ 0 refs
2023-05-27T16:44:26.576469194Z    1/ 5 compressor
2023-05-27T16:44:26.576474092Z    1/ 5 bluestore
2023-05-27T16:44:26.576474092Z    1/ 5 bluefs
2023-05-27T16:44:26.576478860Z    1/ 3 bdev
2023-05-27T16:44:26.576483568Z    1/ 5 kstore
2023-05-27T16:44:26.576483568Z    4/ 5 rocksdb
2023-05-27T16:44:26.576488136Z    4/ 5 leveldb
2023-05-27T16:44:26.576488136Z    4/ 5 memdb
2023-05-27T16:44:26.576495229Z    1/ 5 fuse
2023-05-27T16:44:26.576495229Z    2/ 5 mgr
2023-05-27T16:44:26.576495229Z    1/ 5 mgrc
2023-05-27T16:44:26.576500409Z    1/ 5 dpdk
2023-05-27T16:44:26.576500409Z    1/ 5 eventtrace
2023-05-27T16:44:26.576505441Z    1/ 5 prioritycache
2023-05-27T16:44:26.576505441Z    0/ 5 test
2023-05-27T16:44:26.576510325Z    0/ 5 cephfs_mirror
2023-05-27T16:44:26.576515031Z    0/ 5 cephsqlite
2023-05-27T16:44:26.576515031Z   -2/-2 (syslog threshold)
2023-05-27T16:44:26.576519739Z   99/99 (stderr threshold)
2023-05-27T16:44:26.576519739Z --- pthread ID / name mapping for recent threads ---
2023-05-27T16:44:26.576562376Z   140196770608896 / rstore_compact
2023-05-27T16:44:26.576575250Z   140196787394304 / ms_dispatch
2023-05-27T16:44:26.576588150Z   140196804179712 / rocksdb:dump_st
2023-05-27T16:44:26.576588150Z   140196846143232 / ms_dispatch
2023-05-27T16:44:26.576604276Z   140196888106752 / safe_timer
2023-05-27T16:44:26.576617460Z   140196938462976 / rocksdb:high0
2023-05-27T16:44:26.576630158Z   140196946855680 / rocksdb:low0
2023-05-27T16:44:26.576641950Z   140196972541696 / admin_socket
2023-05-27T16:44:26.576641950Z   max_recent     10000
2023-05-27T16:44:26.576647137Z   max_new        10000
2023-05-27T16:44:26.576651869Z   log_file /var/lib/ceph/crash/2023-05-27T16:44:26.490533Z_050d0644-702c-4c84-b5f1-799e7d4f7013/log
2023-05-27T16:44:26.576651869Z --- end dump of recent events ---
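
A possible way to pull the full crash report referenced by the log_file path above, assuming the crash was picked up by the crash module (run from the toolbox pod):

# List recorded crashes and dump the one matching the crash directory name above.
ceph crash ls
ceph crash info 2023-05-27T16:44:26.490533Z_050d0644-702c-4c84-b5f1-799e7d4f7013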

