Description of problem (please be as detailed as possible and provide log snippets):
==================================================================================
Ceph mon (mon.b) aborted in thread_name:msgr-worker-1 while performing repeated OCP worker machine config pool reboots.

sh-5.1$ ceph crash ls
ID                                                                 ENTITY  NEW
2024-07-11T05:01:39.337305Z_bc04869e-9361-4a42-886b-0206f18e1d23   mon.b    *

Log snip:
==========
    -8> 2024-07-11T05:01:39.288+0000 7f77c3992900  5 AuthRegistry(0x55b46d9cee20) adding con mode: secure
    -7> 2024-07-11T05:01:39.288+0000 7f77c3992900  5 AuthRegistry(0x55b46d9cee20) adding con mode: secure
    -6> 2024-07-11T05:01:39.288+0000 7f77c3992900  2 auth: KeyRing::load: loaded key file /etc/ceph/keyring-store/keyring
    -5> 2024-07-11T05:01:39.288+0000 7f77c3992900  2 mon.b@-1(???) e3 init
    -4> 2024-07-11T05:01:39.290+0000 7f77c3992900  4 mgrc handle_mgr_map Got map version 928
    -3> 2024-07-11T05:01:39.291+0000 7f77c3992900  4 mgrc handle_mgr_map Active mgr is now [v2:10.131.0.21:6800/3571147502,v1:10.131.0.21:6801/3571147502]
    -2> 2024-07-11T05:01:39.291+0000 7f77c3992900  4 mgrc reconnect Starting new session with [v2:10.131.0.21:6800/3571147502,v1:10.131.0.21:6801/3571147502]
    -1> 2024-07-11T05:01:39.315+0000 7f77c3992900  0 mon.b@-1(probing) e3  my rank is now 1 (was -1)
     0> 2024-07-11T05:01:39.339+0000 7f77baf63640 -1 *** Caught signal (Aborted) **
 in thread 7f77baf63640 thread_name:msgr-worker-1

 ceph version 17.2.6-216.0.hotfix.bz2266538.el9cp (e3968f91dc6b6b52eea5a64d169887c551d0d99c) quincy (stable)
 1: /lib64/libc.so.6(+0x3e6f0) [0x7f77c3f336f0]
 2: /lib64/libc.so.6(+0x8b94c) [0x7f77c3f8094c]
 3: raise()
 4: abort()
 5: /lib64/libstdc++.so.6(+0xa1b21) [0x7f77c4295b21]
 6: /lib64/libstdc++.so.6(+0xad52c) [0x7f77c42a152c]
 7: /lib64/libstdc++.so.6(+0xad597) [0x7f77c42a1597]
 8: /lib64/libstdc++.so.6(+0xad7f9) [0x7f77c42a17f9]
 9: /usr/lib64/ceph/libceph-common.so.2(+0x137e4b) [0x7f77c4813e4b]
 10: (ProtocolV2::handle_auth_done(ceph::buffer::v15_2_0::list&)+0x613) [0x7f77c4acdb83]
 11: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x39) [0x7f77c4ab97a9]
 12: (AsyncConnection::process()+0x42b) [0x7f77c4a99e7b]
 13: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1c1) [0x7f77c4ae1401]
 14: /usr/lib64/ceph/libceph-common.so.2(+0x405eb6) [0x7f77c4ae1eb6]
 15: /lib64/libstdc++.so.6(+0xdbad4) [0x7f77c42cfad4]
 16: /lib64/libc.so.6(+0x89c02) [0x7f77c3f7ec02]
 17: /lib64/libc.so.6(+0x10ec40) [0x7f77c4003c40]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_pwl
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/ 5 rgw_datacache
   1/10 civetweb
   1/ 5 rgw_access
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 fuse
   2/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
   0/ 5 cephfs_mirror
   0/ 5 cephsqlite
   0/ 5 seastore
   0/ 5 seastore_onode
   0/ 5 seastore_odata
   0/ 5 seastore_omap
   0/ 5 seastore_tm
   0/ 5 seastore_cleaner
   0/ 5 seastore_lba
   0/ 5 seastore_cache
   0/ 5 seastore_journal
   0/ 5 seastore_device
   0/ 5 alienstore
   1/ 5 mclock
   1/ 5 ceph_exporter
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  7f77b9760640 / ceph-mon
  7f77baf63640 / msgr-worker-1
  7f77c2930640 / admin_socket
  7f77c3992900 / ceph-mon
  max_recent     10000
  max_new        10000

Version of all relevant components (if applicable):
OCP:  4.14.31
ODF:  4.14.9
Ceph: 17.2.6_216.0.hotfix.bz2266538

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Ceph health went into a warning state due to this mon crash; health returns to OK once the crash is archived (see the crash-archive sketch at the end of this report).

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Reporting upon the first occurrence.

Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
===================
1) Deploy an OCP + ODF cluster
2) Reboot the OCP worker machine config pool
3) Wait for the worker machine config pool to start updating
4) Wait for the worker machine config pool to stop updating
Repeat the above steps many times (see the reboot-loop sketch after "Expected results" below).

Actual results:
===============
mon.b aborted during the 71st iteration.

Expected results:
=================
No crashes should be observed.
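
Reboot-loop sketch (for reference only):
========================================
A minimal outline of one iteration of the reboot loop described in "Steps to Reproduce". This is an illustrative sketch, not the exact tooling used for the test run: it assumes cluster-admin access, that the `oc adm reboot-machine-config-pool` subcommand is available in the installed `oc` client (otherwise a no-op MachineConfig change can be applied to force the rolling reboot), and the iteration count and timeouts are arbitrary.

#!/usr/bin/env bash
# Sketch: repeatedly reboot the worker MachineConfigPool and wait for it to settle.
set -euo pipefail

for i in $(seq 1 100); do
    echo "=== iteration ${i} ==="

    # Trigger a rolling reboot of all nodes in the worker pool
    # (assumption: this subcommand exists in the installed oc client).
    oc adm reboot-machine-config-pool mcp/worker

    # Wait for the pool to start updating, then to finish updating.
    oc wait mcp/worker --for=condition=Updating=True --timeout=15m
    oc wait mcp/worker --for=condition=Updated=True  --timeout=60m

    # Check for new Ceph crashes after each iteration
    # (assumption: the rook-ceph toolbox pod is deployed in openshift-storage).
    oc -n openshift-storage rsh deploy/rook-ceph-tools ceph crash ls || true
done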
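
Crash-archive sketch (for reference only):
==========================================
A minimal sketch of how the reported crash can be inspected and archived so the cluster returns to HEALTH_OK, as mentioned under the user-impact question above. It assumes the Ceph CLI is reached through the rook-ceph toolbox pod in the openshift-storage namespace (namespace and deployment name may differ per install); the crash ID is the one listed in the description.

# Open a shell in the toolbox pod (assumption: toolbox is enabled and named rook-ceph-tools).
oc -n openshift-storage rsh deploy/rook-ceph-tools

# Inside the toolbox: list crashes and show the full metadata/backtrace for the reported one.
ceph crash ls
ceph crash info 2024-07-11T05:01:39.337305Z_bc04869e-9361-4a42-886b-0206f18e1d23

# Archive the crash (or all crashes) so it no longer raises the RECENT_CRASH health warning,
# then confirm cluster health.
ceph crash archive 2024-07-11T05:01:39.337305Z_bc04869e-9361-4a42-886b-0206f18e1d23
# ceph crash archive-all
ceph health detail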