Description of problem (please be as detailed as possible and provide log snippets):

- Several OSD pods are going into the CrashLoopBackOff (CLBO) state.
- The issue started while we were trying to recover incomplete PGs using the article: https://access.redhat.com/solutions/3740631
- We were recovering one of the incomplete PGs when the OSD pods went into the CLBO state:

rook-ceph-osd-3-f594b77db-tcnhh    0/1   CrashLoopBackOff
rook-ceph-osd-5-6df8cb58cf-jx52g   0/1   CrashLoopBackOff
rook-ceph-osd-6-69666f6975-wfffn   0/1   CrashLoopBackOff
rook-ceph-osd-7-69df499d47-9mpx6   0/1   CrashLoopBackOff

- The OSDs are failing with the below assert:

--------------------------------------------------------------------------------
debug -1> 2022-07-08 08:53:29.062 7fa9a2e8a700 -1 /builddir/build/BUILD/ceph-14.2.11/src/osd/PGLog.cc: In function 'void PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7fa9a2e8a700 time 2022-07-08 08:53:29.060517
/builddir/build/BUILD/ceph-14.2.11/src/osd/PGLog.cc: 369: FAILED ceph_assert(log.head >= olog.tail && olog.head >= log.tail)

 ceph version 14.2.11-208.el8cp (6738ba96f296a41c24357c12e8d594fbde457abc) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x156) [0x55648bba8790]
 2: (()+0x50a9aa) [0x55648bba89aa]
 3: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1cc1) [0x55648bdd8a61]
 4: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x68) [0x55648bd292a8]
 5: (PG::RecoveryState::Stray::react(MLogRec const&)+0x242) [0x55648bd6c0f2]
 6: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xa5) [0x55648bdca945]
 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x5a) [0x55648bd967aa]
 8: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x2c2) [0x55648bd872d2]
 9: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x2bc) [0x55648bcc3ccc]
 10: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x55) [0x55648bf4d365]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1366) [0x55648bcc05a6]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55648c2bd554]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55648c2c0124]
 14: (()+0x817a) [0x7fa9c98d517a]
 15: (clone()+0x43) [0x7fa9c8603dc3]
--------------------------------------------------------------------------------
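For context on the assert: PGLog.cc:369 requires that the version range covered by the local PG log and the range covered by the peer's log (olog) overlap before PGLog::merge_log() stitches them together. The snippet below is not the Ceph code, only a minimal C++ sketch of that invariant; the LogRange struct and the version numbers are made up for illustration.

--------------------------------------------------------------------------------
#include <cstdint>
#include <iostream>

// Hypothetical simplification: a PG log covers a contiguous range of
// versions from `tail` (oldest retained entry) to `head` (newest entry).
struct LogRange {
    uint64_t tail;
    uint64_t head;
};

// Mirrors the condition in the failed ceph_assert: the local log and the
// peer's log (olog) must share at least part of their ranges, otherwise
// there is no common history to merge from.
bool logs_overlap(const LogRange& log, const LogRange& olog) {
    return log.head >= olog.tail && olog.head >= log.tail;
}

int main() {
    LogRange local{100, 150};
    LogRange peer{200, 250};   // disjoint range, no common entries

    // Prints "false": with disjoint logs the real OSD aborts in merge_log()
    // rather than merging divergent history, which is what keeps restarting
    // the pods and produces the CLBO state seen above.
    std::cout << std::boolalpha << logs_overlap(local, peer) << std::endl;
    return 0;
}
--------------------------------------------------------------------------------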
Version of all relevant components (if applicable):
v4.6.13
Ceph: 14.2.11-208.el8cp

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
- The cluster is heavily impacted; users are not able to access any data.

Is there any workaround available to the best of your knowledge?
N/A

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
N/A

Is this issue reproducible?
N/A

Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:
N/A

Steps to Reproduce:
N/A

Actual results:
- The OSD pods are hitting the assert and going into the CLBO state.

Expected results:
- The OSD pods should be running fine.

Additional info:
In the next comments.