Bug 2297267 - Ceph mon aborted in thread_name:msgr-worker-1 [NEEDINFO]
Summary: Ceph mon aborted in thread_name:msgr-worker-1
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Radoslaw Zarzynski
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-07-11 07:17 UTC by Prasad Desala
Modified: 2024-09-12 13:53 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
muagarwa: needinfo? (tdesala)




Links:
  System ID:    Red Hat Issue Tracker OCSBZM-8672
  Private:      0
  Priority:     None
  Status:       None
  Summary:      None
  Last Updated: 2024-09-12 13:53:49 UTC

Description Prasad Desala 2024-07-11 07:17:18 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
==================================================================================
Ceph mon (mon.b) aborted in thread_name:msgr-worker-1 while performing repeated OCP worker machine config pool reboots.

sh-5.1$ ceph crash ls
ID                                                                 ENTITY                NEW
2024-07-11T05:01:39.337305Z_bc04869e-9361-4a42-886b-0206f18e1d23  mon.b                  *   
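If needed, the full crash report (backtrace, ceph version, entity) can also be dumped from the crash module in the toolbox pod, e.g.:

sh-5.1$ ceph crash info 2024-07-11T05:01:39.337305Z_bc04869e-9361-4a42-886b-0206f18e1d23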
Log snip:
==========

    -8> 2024-07-11T05:01:39.288+0000 7f77c3992900  5 AuthRegistry(0x55b46d9cee20) adding con mode: secure
    -7> 2024-07-11T05:01:39.288+0000 7f77c3992900  5 AuthRegistry(0x55b46d9cee20) adding con mode: secure
    -6> 2024-07-11T05:01:39.288+0000 7f77c3992900  2 auth: KeyRing::load: loaded key file /etc/ceph/keyring-store/keyring
    -5> 2024-07-11T05:01:39.288+0000 7f77c3992900  2 mon.b@-1(???) e3 init
    -4> 2024-07-11T05:01:39.290+0000 7f77c3992900  4 mgrc handle_mgr_map Got map version 928
    -3> 2024-07-11T05:01:39.291+0000 7f77c3992900  4 mgrc handle_mgr_map Active mgr is now [v2:10.131.0.21:6800/3571147502,v1:10.131.0.21:6801/3571147502]
    -2> 2024-07-11T05:01:39.291+0000 7f77c3992900  4 mgrc reconnect Starting new session with [v2:10.131.0.21:6800/3571147502,v1:10.131.0.21:6801/3571147502]
    -1> 2024-07-11T05:01:39.315+0000 7f77c3992900  0 mon.b@-1(probing) e3  my rank is now 1 (was -1)
     0> 2024-07-11T05:01:39.339+0000 7f77baf63640 -1 *** Caught signal (Aborted) **
 in thread 7f77baf63640 thread_name:msgr-worker-1

 ceph version 17.2.6-216.0.hotfix.bz2266538.el9cp (e3968f91dc6b6b52eea5a64d169887c551d0d99c) quincy (stable)
 1: /lib64/libc.so.6(+0x3e6f0) [0x7f77c3f336f0]
 2: /lib64/libc.so.6(+0x8b94c) [0x7f77c3f8094c]
 3: raise()
 4: abort()
 5: /lib64/libstdc++.so.6(+0xa1b21) [0x7f77c4295b21]
 6: /lib64/libstdc++.so.6(+0xad52c) [0x7f77c42a152c]
 7: /lib64/libstdc++.so.6(+0xad597) [0x7f77c42a1597]
 8: /lib64/libstdc++.so.6(+0xad7f9) [0x7f77c42a17f9]
 9: /usr/lib64/ceph/libceph-common.so.2(+0x137e4b) [0x7f77c4813e4b]
 10: (ProtocolV2::handle_auth_done(ceph::buffer::v15_2_0::list&)+0x613) [0x7f77c4acdb83]
 11: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x39) [0x7f77c4ab97a9]
 12: (AsyncConnection::process()+0x42b) [0x7f77c4a99e7b]
 13: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1c1) [0x7f77c4ae1401]
 14: /usr/lib64/ceph/libceph-common.so.2(+0x405eb6) [0x7f77c4ae1eb6]
 15: /lib64/libstdc++.so.6(+0xdbad4) [0x7f77c42cfad4]
 16: /lib64/libc.so.6(+0x89c02) [0x7f77c3f7ec02]
 17: /lib64/libc.so.6(+0x10ec40) [0x7f77c4003c40]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_pwl
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/ 5 rgw_datacache
   1/10 civetweb
   1/ 5 rgw_access
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 fuse
   2/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
   0/ 5 cephfs_mirror
   0/ 5 cephsqlite
   0/ 5 seastore
   0/ 5 seastore_onode
   0/ 5 seastore_odata
   0/ 5 seastore_omap
   0/ 5 seastore_tm
   0/ 5 seastore_cleaner
   0/ 5 seastore_lba
   0/ 5 seastore_cache
   0/ 5 seastore_journal
   0/ 5 seastore_device
   0/ 5 alienstore
   1/ 5 mclock
   1/ 5 ceph_exporter
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  7f77b9760640 / ceph-mon
  7f77baf63640 / msgr-worker-1
  7f77c2930640 / admin_socket
  7f77c3992900 / ceph-mon
  max_recent     10000
  max_new        10000

Version of all relevant components (if applicable):
OCP: 4.14.31
ODF: 4.14.9
Ceph: 17.2.6_216.0.hotfix.bz2266538


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Ceph health went into a warning state due to this mon crash; it can be brought back to healthy by archiving the crash.
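For reference, the warning can be cleared from the toolbox pod with the standard crash-module commands (a sketch; the crash ID is the one listed above):

sh-5.1$ ceph crash ls            # list new (unarchived) crashes
sh-5.1$ ceph crash archive 2024-07-11T05:01:39.337305Z_bc04869e-9361-4a42-886b-0206f18e1d23
sh-5.1$ ceph crash archive-all   # or archive all new crashes at once
sh-5.1$ ceph health              # should report HEALTH_OK again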


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3


Is this issue reproducible?
Reporting upon the first occurrence. 


Can this issue be reproduced from the UI?
N/A


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
===================
1) Deploy an OCP + ODF cluster
2) Reboot the OCP worker machine config pool
3) Wait for the worker machine config pool to start updating
4) Wait for the worker machine config pool to stop updating

Repeat the above steps many times (a scripted sketch follows below).
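A rough sketch of how the reboot loop can be scripted (assumes the `oc adm reboot-machine-config-pool` subcommand available in recent oc clients, and the rook-ceph-tools toolbox deployment in openshift-storage; the iteration count is illustrative):

for i in $(seq 1 100); do
    echo "=== iteration $i ==="
    # trigger a rolling reboot of the worker machine config pool
    oc adm reboot-machine-config-pool mcp/worker
    # wait for the worker pool to start updating, then to settle again
    oc wait mcp/worker --for=condition=Updating --timeout=10m
    oc wait mcp/worker --for=condition=Updated --timeout=60m
    # sanity-check ceph before the next iteration (toolbox deployment name assumed)
    oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph health
done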

Actual results:
===============
mon.b aborted during the 71st iteration.

Expected results:
=================
No crashes should be observed.

