Bug 2224775

Summary: Can't get all PGs to come Active+Clean. OSD now in CLBO
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Manny <mcaldeir>
Component: RADOS Assignee: Michael J. Kidd <linuxkidd>
Status: CLOSED NOTABUG QA Contact: Pawan <pdhiran>
Severity: high Docs Contact:
Priority: unspecified    
Version: 5.1 CC: akupczyk, bhubbard, bhull, ceph-eng-bugs, cephqe-warriors, khover, linuxkidd, nojha, rzarzyns, vumrao
Target Milestone: --- Flags: mcaldeir: needinfo-
mcaldeir: needinfo-
Target Release: 7.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-07-22 23:35:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Manny 2023-07-22 17:52:11 UTC
Description of problem: Can't get all PGs to come active+clean; two OSDs are now in CLBO (CrashLoopBackOff).

This is where we are now, with 2 OSDs in CLBO due to the following assert:
~~~
debug 2023-07-22T15:57:23.340+0000 7f9ac3d3b700  0 log_channel(cluster) log [INF] : 5.9 continuing backfill to osd.5 from (4206'22920,8567'23631] MIN to 8567'23631
debug 2023-07-22T15:57:23.340+0000 7f9ac453c700  0 log_channel(cluster) log [INF] : 2.44 continuing backfill to osd.5 from (7536'11835902,8311'11836991] MIN to 8311'11836991
debug 2023-07-22T15:57:23.340+0000 7f9ac553e700  0 log_channel(cluster) log [INF] : 5.1f continuing backfill to osd.5 from (4272'28565,8269'29255] MIN to 8269'29255
debug 2023-07-22T15:57:23.340+0000 7f9ac3d3b700  0 log_channel(cluster) log [INF] : 2.2c continuing backfill to osd.5 from (7570'14772680,8567'14773443] MIN to 8567'14773443
debug 2023-07-22T15:57:23.340+0000 7f9ac553e700  0 log_channel(cluster) log [INF] : 2.51 continuing backfill to osd.5 from (7768'10899760,8269'10900514] MIN to 8269'10900514
debug 2023-07-22T15:57:23.341+0000 7f9ac4d3d700  0 log_channel(cluster) log [INF] : 2.c continuing backfill to osd.5 from (8091'14779974,8267'14780696] MIN to 8267'14780696
debug 2023-07-22T15:57:23.341+0000 7f9ac553e700  0 log_channel(cluster) log [INF] : 2.7e continuing backfill to osd.5 from (7537'13449236,8557'13449959] MIN to 8557'13449959
debug 2023-07-22T15:57:23.341+0000 7f9ac3d3b700  0 log_channel(cluster) log [INF] : 2.7c continuing backfill to osd.5 from (8189'14309291,8567'14310090] MIN to 8567'14310090
/builddir/build/BUILD/ceph-16.2.10/src/osd/PGLog.cc: In function 'void PGLog::merge_log(pg_info_t&, pg_log_t&&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f9ac453c700 time 2023-07-22T15:57:23.342163+0000
/builddir/build/BUILD/ceph-16.2.10/src/osd/PGLog.cc: 384: FAILED ceph_assert(log.head >= olog.tail && olog.head >= log.tail)
debug 2023-07-22T15:57:23.341+0000 7f9ac4d3d700  0 log_channel(cluster) log [INF] : 5.7 continuing backfill to osd.5 from (4075'16982,8269'17704] MIN to 8269'17704
debug 2023-07-22T15:57:23.341+0000 7f9ac4d3d700  0 log_channel(cluster) log [INF] : 5.c continuing backfill to osd.5 from (4298'34117,7649'34841] MIN to 7649'34841
debug 2023-07-22T15:57:23.342+0000 7f9ac4d3d700  0 log_channel(cluster) log [INF] : 2.66 continuing backfill to osd.5 from (7904'12779138,8566'12779918] MIN to 8566'12779918
debug 2023-07-22T15:57:23.342+0000 7f9ac4d3d700  0 log_channel(cluster) log [INF] : 2.2a continuing backfill to osd.4 from (7712'12023012,8292'12024120] MIN to 8292'12024120
debug 2023-07-22T15:57:23.342+0000 7f9ac4d3d700  0 log_channel(cluster) log [INF] : 2.2a continuing backfill to osd.5 from (7712'12023012,8292'12024120] MIN to 8292'12024120
debug 2023-07-22T15:57:23.342+0000 7f9ac4d3d700  0 log_channel(cluster) log [INF] : 2.7a continuing backfill to osd.5 from (7593'12966654,8566'12967372] MIN to 8566'12967372
debug 2023-07-22T15:57:23.342+0000 7f9ac4d3d700  0 log_channel(cluster) log [INF] : 2.7f continuing backfill to osd.5 from (7540'12367622,8566'12368417] MIN to 8566'12368417
/builddir/build/BUILD/ceph-16.2.10/src/osd/PGLog.cc: In function 'void PGLog::merge_log(pg_info_t&, pg_log_t&&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f9ac4d3d700 time 2023-07-22T15:57:23.344780+0000
/builddir/build/BUILD/ceph-16.2.10/src/osd/PGLog.cc: 384: FAILED ceph_assert(log.head >= olog.tail && olog.head >= log.tail)
 ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x560c6ab82a58]
 2: ceph-osd(+0x582c72) [0x560c6ab82c72]
 3: (PGLog::merge_log(pg_info_t&, pg_log_t&&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1cb1) [0x560c6ad6b7e1]
 4: (PeeringState::merge_log(ceph::os::Transaction&, pg_info_t&, pg_log_t&&, pg_shard_t)+0x75) [0x560c6aefe435]
 5: (PeeringState::Stray::react(MLogRec const&)+0xcc) [0x560c6af3d50c]
 6: (boost::statechart::simple_state<PeeringState::Stray, PeeringState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xd5) [0x560c6af695d5]
 7: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x5b) [0x560c6ad4dc2b]
 8: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d1) [0x560c6ad42761]
 9: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x29c) [0x560c6acb87ac]
 10: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x560c6aeef446]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x560c6acaa558]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x560c6b32d2d4]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x560c6b3301b4]
 14: /lib64/libpthread.so.0(+0x81ca) [0x7f9ae68471ca]
 15: clone()
~~~
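
The `FAILED ceph_assert(log.head >= olog.tail && olog.head >= log.tail)` in `PGLog::merge_log()` fires when the OSD's local PG log and the log received from a peer no longer overlap, so the logs cannot be merged during peering and the OSD aborts, which is what keeps these two OSDs in CrashLoopBackOff. For reference, a typical way to inspect this state from the ODF toolbox is roughly the following; the namespace and toolbox deployment name are the usual ODF defaults and are assumptions here, not taken from this environment:
~~~
# Enter the toolbox pod (default ODF namespace/deployment assumed)
oc -n openshift-storage rsh deploy/rook-ceph-tools

# Overall state and the PGs that are not active+clean
ceph -s
ceph health detail
ceph pg dump_stuck inactive

# Recent OSD crashes and their backtraces
ceph crash ls-new
~~~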

We had 4 OSDs up yesterday and everything was rebuilding fine, although slowly.
But then OCP ran some `udev` script that broke our access to osd.1 and osd.2, leaving 43 PGs inactive.
We set min_size to 1 on all the pools; that did not help.
We marked osd.1 and osd.2 as lost, and the PGs switched to incomplete.
We then set "injectargs '--osd_find_best_info_ignore_history_les=false'".
After that we hit the assert listed above.
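
For reference, the steps above correspond roughly to the following commands; the pool name is a placeholder and this is an illustrative sketch, not a transcript of exactly what was run:
~~~
# Lower min_size on the affected pools (pool name is a placeholder)
ceph osd pool set <pool-name> min_size 1

# Mark the two unreachable OSDs as lost
ceph osd lost 1 --yes-i-really-mean-it
ceph osd lost 2 --yes-i-really-mean-it

# Inject the peering option mentioned above into the running OSDs
ceph tell osd.* injectargs '--osd_find_best_info_ignore_history_les=false'
~~~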

We spoke with Michael, who asked that we do whatever we can in OpenShift to get osd.1 and osd.2 up in some way so that we can rescue data off of them.
There is something in OCP that is disruptive to the OSDs, and an OCP upgrade should resolve that issue.
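
If osd.1 and osd.2 can be brought to a state where their data is readable, one common way to salvage PG data from a stopped OSD is `ceph-objectstore-tool`. A minimal sketch, assuming the default bare-metal data path (the containerized OCP/ODF layout will differ) and using a PG ID from the log above purely as an example:
~~~
# List the PGs present on the stopped OSD (default data path assumed)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --op list-pgs

# Export one PG so it can later be imported into a healthy OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
    --pgid 2.2a --op export --file /tmp/2.2a.export
~~~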

I will attach the requested artifacts to this BZ as soon as it settles.
(Mid-air collisions, so 1990s.)


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info: