Description of problem:
Can't get all PGs to come Active+Clean. OSD now in CLBO.

This is where we are now, with 2 OSDs in CLBO due to this:
~~~
debug 2023-07-22T15:57:23.340+0000 7f9ac3d3b700 0 log_channel(cluster) log [INF] : 5.9 continuing backfill to osd.5 from (4206'22920,8567'23631] MIN to 8567'23631
debug 2023-07-22T15:57:23.340+0000 7f9ac453c700 0 log_channel(cluster) log [INF] : 2.44 continuing backfill to osd.5 from (7536'11835902,8311'11836991] MIN to 8311'11836991
debug 2023-07-22T15:57:23.340+0000 7f9ac553e700 0 log_channel(cluster) log [INF] : 5.1f continuing backfill to osd.5 from (4272'28565,8269'29255] MIN to 8269'29255
debug 2023-07-22T15:57:23.340+0000 7f9ac3d3b700 0 log_channel(cluster) log [INF] : 2.2c continuing backfill to osd.5 from (7570'14772680,8567'14773443] MIN to 8567'14773443
debug 2023-07-22T15:57:23.340+0000 7f9ac553e700 0 log_channel(cluster) log [INF] : 2.51 continuing backfill to osd.5 from (7768'10899760,8269'10900514] MIN to 8269'10900514
debug 2023-07-22T15:57:23.341+0000 7f9ac4d3d700 0 log_channel(cluster) log [INF] : 2.c continuing backfill to osd.5 from (8091'14779974,8267'14780696] MIN to 8267'14780696
debug 2023-07-22T15:57:23.341+0000 7f9ac553e700 0 log_channel(cluster) log [INF] : 2.7e continuing backfill to osd.5 from (7537'13449236,8557'13449959] MIN to 8557'13449959
debug 2023-07-22T15:57:23.341+0000 7f9ac3d3b700 0 log_channel(cluster) log [INF] : 2.7c continuing backfill to osd.5 from (8189'14309291,8567'14310090] MIN to 8567'14310090
/builddir/build/BUILD/ceph-16.2.10/src/osd/PGLog.cc: In function 'void PGLog::merge_log(pg_info_t&, pg_log_t&&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f9ac453c700 time 2023-07-22T15:57:23.342163+0000
/builddir/build/BUILD/ceph-16.2.10/src/osd/PGLog.cc: 384: FAILED ceph_assert(log.head >= olog.tail && olog.head >= log.tail)
debug 2023-07-22T15:57:23.341+0000 7f9ac4d3d700 0 log_channel(cluster) log [INF] : 5.7 continuing backfill to osd.5 from (4075'16982,8269'17704] MIN to 8269'17704
debug 2023-07-22T15:57:23.341+0000 7f9ac4d3d700 0 log_channel(cluster) log [INF] : 5.c continuing backfill to osd.5 from (4298'34117,7649'34841] MIN to 7649'34841
debug 2023-07-22T15:57:23.342+0000 7f9ac4d3d700 0 log_channel(cluster) log [INF] : 2.66 continuing backfill to osd.5 from (7904'12779138,8566'12779918] MIN to 8566'12779918
debug 2023-07-22T15:57:23.342+0000 7f9ac4d3d700 0 log_channel(cluster) log [INF] : 2.2a continuing backfill to osd.4 from (7712'12023012,8292'12024120] MIN to 8292'12024120
debug 2023-07-22T15:57:23.342+0000 7f9ac4d3d700 0 log_channel(cluster) log [INF] : 2.2a continuing backfill to osd.5 from (7712'12023012,8292'12024120] MIN to 8292'12024120
debug 2023-07-22T15:57:23.342+0000 7f9ac4d3d700 0 log_channel(cluster) log [INF] : 2.7a continuing backfill to osd.5 from (7593'12966654,8566'12967372] MIN to 8566'12967372
debug 2023-07-22T15:57:23.342+0000 7f9ac4d3d700 0 log_channel(cluster) log [INF] : 2.7f continuing backfill to osd.5 from (7540'12367622,8566'12368417] MIN to 8566'12368417
/builddir/build/BUILD/ceph-16.2.10/src/osd/PGLog.cc: In function 'void PGLog::merge_log(pg_info_t&, pg_log_t&&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f9ac4d3d700 time 2023-07-22T15:57:23.344780+0000
/builddir/build/BUILD/ceph-16.2.10/src/osd/PGLog.cc: 384: FAILED ceph_assert(log.head >= olog.tail && olog.head >= log.tail)

 ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x560c6ab82a58]
 2: ceph-osd(+0x582c72) [0x560c6ab82c72]
 3: (PGLog::merge_log(pg_info_t&, pg_log_t&&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1cb1) [0x560c6ad6b7e1]
 4: (PeeringState::merge_log(ceph::os::Transaction&, pg_info_t&, pg_log_t&&, pg_shard_t)+0x75) [0x560c6aefe435]
 5: (PeeringState::Stray::react(MLogRec const&)+0xcc) [0x560c6af3d50c]
 6: (boost::statechart::simple_state<PeeringState::Stray, PeeringState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xd5) [0x560c6af695d5]
 7: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x5b) [0x560c6ad4dc2b]
 8: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d1) [0x560c6ad42761]
 9: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x29c) [0x560c6acb87ac]
 10: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x560c6aeef446]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x560c6acaa558]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x560c6b32d2d4]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x560c6b3301b4]
 14: /lib64/libpthread.so.0(+0x81ca) [0x7f9ae68471ca]
 15: clone()
~~~

We had 4 OSDs up yesterday and everything was rebuilding just fine, but slow. But then OCP ran some `udev` script and broke our access to osd.1 and osd.2, and we had 43 PGs inactive.
We set the min_size to 1 on all the pools; that did not help.
We marked osd.1 and osd.2 as lost and the PGs switched to incomplete.
We then set `injectargs '--osd_find_best_info_ignore_history_les=false'`, but then we fell into the assert listed above.

We spoke with Michael and asked that we do whatever in Shift to get osd.1 and osd.2 up in some way so we can rescue data off of them. There is something in OCP which is disruptive to the OSDs and an OCP upgrade will resolve that issue.

I will attach the requested artifacts to this BZ as soon as it settles. (Mid-air collisions, so 1990's)

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
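
For anyone reading the assert: as far as we can tell, `ceph_assert(log.head >= olog.tail && olog.head >= log.tail)` in `PGLog::merge_log` requires the local PG log and the incoming log (`olog`) to cover overlapping version ranges. Below is a minimal Python sketch of that check. It is not Ceph code: `Eversion`, `parse_eversion` and `logs_overlap` are names made up for illustration, and it assumes versions written as epoch'version compare by epoch first, then by version. The local range is borrowed from the 5.9 backfill line purely as sample input (we do not know from this log which PG tripped the assert), and both incoming ranges are hypothetical.

```python
from typing import NamedTuple


class Eversion(NamedTuple):
    """Illustrative stand-in for a version printed as epoch'version."""
    epoch: int
    version: int


def parse_eversion(text: str) -> Eversion:
    """Parse a string such as "8567'23631" into (epoch, version)."""
    epoch, version = text.split("'")
    return Eversion(int(epoch), int(version))


def logs_overlap(log_tail: Eversion, log_head: Eversion,
                 olog_tail: Eversion, olog_head: Eversion) -> bool:
    """Mirror the asserted condition: the two (tail, head] ranges must meet."""
    return log_head >= olog_tail and olog_head >= log_tail


# Sample local range taken from the 5.9 line in the log excerpt.
local_tail = parse_eversion("4206'22920")
local_head = parse_eversion("8567'23631")

# Hypothetical incoming log that overlaps the local one: check passes.
ok = logs_overlap(local_tail, local_head,
                  parse_eversion("8000'23000"), parse_eversion("8600'23700"))

# Hypothetical incoming log that starts after the local head: check fails,
# which is the situation the assert refuses to continue from.
bad = logs_overlap(local_tail, local_head,
                   parse_eversion("8600'23700"), parse_eversion("8700'23800"))

print(ok, bad)
```

If the two ranges are disjoint there are no shared entries to merge from, which would be consistent with the OSD aborting instead of continuing peering.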