Bug 2224775 - Can't get all PGs to come Active+Clean. OSD now in CLBO
Summary: Can't get all PGs to come Active+Clean. OSD now in CLBO
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 5.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 7.1
Assignee: Michael J. Kidd
QA Contact: Pawan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-07-22 17:52 UTC by Manny
Modified: 2023-07-26 21:29 UTC (History)
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-22 23:35:34 UTC
Embargoed:
mcaldeir: needinfo-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OCPBUGS-16670 0 None None None 2023-07-22 18:44:50 UTC
Red Hat Issue Tracker RHCEPH-7062 0 None None None 2023-07-22 17:53:25 UTC

Description Manny 2023-07-22 17:52:11 UTC
Description of problem:  Can't get all PGs to come Active+Clean. OSD now in CLBO

This is where we are now, with 2 OSDs in CLBO (CrashLoopBackOff) due to this:
~~~
debug 2023-07-22T15:57:23.340+0000 7f9ac3d3b700  0 log_channel(cluster) log [INF] : 5.9 continuing backfill to osd.5 from (4206'22920,8567'23631] MIN to 8567'23631
debug 2023-07-22T15:57:23.340+0000 7f9ac453c700  0 log_channel(cluster) log [INF] : 2.44 continuing backfill to osd.5 from (7536'11835902,8311'11836991] MIN to 8311'11836991
debug 2023-07-22T15:57:23.340+0000 7f9ac553e700  0 log_channel(cluster) log [INF] : 5.1f continuing backfill to osd.5 from (4272'28565,8269'29255] MIN to 8269'29255
debug 2023-07-22T15:57:23.340+0000 7f9ac3d3b700  0 log_channel(cluster) log [INF] : 2.2c continuing backfill to osd.5 from (7570'14772680,8567'14773443] MIN to 8567'14773443
debug 2023-07-22T15:57:23.340+0000 7f9ac553e700  0 log_channel(cluster) log [INF] : 2.51 continuing backfill to osd.5 from (7768'10899760,8269'10900514] MIN to 8269'10900514
debug 2023-07-22T15:57:23.341+0000 7f9ac4d3d700  0 log_channel(cluster) log [INF] : 2.c continuing backfill to osd.5 from (8091'14779974,8267'14780696] MIN to 8267'14780696
debug 2023-07-22T15:57:23.341+0000 7f9ac553e700  0 log_channel(cluster) log [INF] : 2.7e continuing backfill to osd.5 from (7537'13449236,8557'13449959] MIN to 8557'13449959
debug 2023-07-22T15:57:23.341+0000 7f9ac3d3b700  0 log_channel(cluster) log [INF] : 2.7c continuing backfill to osd.5 from (8189'14309291,8567'14310090] MIN to 8567'14310090
/builddir/build/BUILD/ceph-16.2.10/src/osd/PGLog.cc: In function 'void PGLog::merge_log(pg_info_t&, pg_log_t&&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f9ac453c700 time 2023-07-22T15:57:23.342163+0000
/builddir/build/BUILD/ceph-16.2.10/src/osd/PGLog.cc: 384: FAILED ceph_assert(log.head >= olog.tail && olog.head >= log.tail)
debug 2023-07-22T15:57:23.341+0000 7f9ac4d3d700  0 log_channel(cluster) log [INF] : 5.7 continuing backfill to osd.5 from (4075'16982,8269'17704] MIN to 8269'17704
debug 2023-07-22T15:57:23.341+0000 7f9ac4d3d700  0 log_channel(cluster) log [INF] : 5.c continuing backfill to osd.5 from (4298'34117,7649'34841] MIN to 7649'34841
debug 2023-07-22T15:57:23.342+0000 7f9ac4d3d700  0 log_channel(cluster) log [INF] : 2.66 continuing backfill to osd.5 from (7904'12779138,8566'12779918] MIN to 8566'12779918
debug 2023-07-22T15:57:23.342+0000 7f9ac4d3d700  0 log_channel(cluster) log [INF] : 2.2a continuing backfill to osd.4 from (7712'12023012,8292'12024120] MIN to 8292'12024120
debug 2023-07-22T15:57:23.342+0000 7f9ac4d3d700  0 log_channel(cluster) log [INF] : 2.2a continuing backfill to osd.5 from (7712'12023012,8292'12024120] MIN to 8292'12024120
debug 2023-07-22T15:57:23.342+0000 7f9ac4d3d700  0 log_channel(cluster) log [INF] : 2.7a continuing backfill to osd.5 from (7593'12966654,8566'12967372] MIN to 8566'12967372
debug 2023-07-22T15:57:23.342+0000 7f9ac4d3d700  0 log_channel(cluster) log [INF] : 2.7f continuing backfill to osd.5 from (7540'12367622,8566'12368417] MIN to 8566'12368417
/builddir/build/BUILD/ceph-16.2.10/src/osd/PGLog.cc: In function 'void PGLog::merge_log(pg_info_t&, pg_log_t&&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f9ac4d3d700 time 2023-07-22T15:57:23.344780+0000
/builddir/build/BUILD/ceph-16.2.10/src/osd/PGLog.cc: 384: FAILED ceph_assert(log.head >= olog.tail && olog.head >= log.tail)
 ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x560c6ab82a58]
 2: ceph-osd(+0x582c72) [0x560c6ab82c72]
 3: (PGLog::merge_log(pg_info_t&, pg_log_t&&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1cb1) [0x560c6ad6b7e1]
 4: (PeeringState::merge_log(ceph::os::Transaction&, pg_info_t&, pg_log_t&&, pg_shard_t)+0x75) [0x560c6aefe435]
 5: (PeeringState::Stray::react(MLogRec const&)+0xcc) [0x560c6af3d50c]
 6: (boost::statechart::simple_state<PeeringState::Stray, PeeringState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xd5) [0x560c6af695d5]
 7: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x5b) [0x560c6ad4dc2b]
 8: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d1) [0x560c6ad42761]
 9: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x29c) [0x560c6acb87ac]
 10: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x560c6aeef446]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x560c6acaa558]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x560c6b32d2d4]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x560c6b3301b4]
 14: /lib64/libpthread.so.0(+0x81ca) [0x7f9ae68471ca]
 15: clone()
~~~
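
Reading the assert condition itself, `log.head >= olog.tail && olog.head >= log.tail` fails when the local PG log and the log received from the peer do not overlap at all, so `PGLog::merge_log()` has no common range to reconcile; that would be consistent with the divergent history left behind once osd.1 and osd.2 were marked lost. If we do manage to get at the data on the down OSDs, their copy of a PG's log can be dumped offline with `ceph-objectstore-tool` for comparison. A rough sketch (the data path and pgid are placeholders, the OSD must be stopped, and on OCP/ODF this has to run inside the OSD's debug pod):
~~~
# Dump the on-disk PG log of a stopped OSD for one PG (placeholder path/pgid);
# the JSON output can be compared against the acting set's log range.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 \
    --pgid 2.44 --op log > /tmp/pg_2.44_log.json
~~~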

We had 4 OSDs up yesterday and everything was rebuilding just fine, though slowly.
But then OCP ran some `udev` script that broke our access to osd.1 and osd.2, and we were left with 43 PGs inactive.
We set min_size to 1 on all the pools; that did not help.
We marked osd.1 and osd.2 as lost and the affected PGs switched to incomplete.
We then set `injectargs '--osd_find_best_info_ignore_history_les=false'`.
But then we fell into the assert listed above.
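
For reference, those steps were roughly the following commands (pool name is a placeholder; reconstructed here rather than copied from the terminal):
~~~
# Drop min_size to 1 on each affected pool (pool name is a placeholder)
ceph osd pool set <pool> min_size 1

# Mark the unreachable OSDs as lost
ceph osd lost 1 --yes-i-really-mean-it
ceph osd lost 2 --yes-i-really-mean-it

# Runtime override mentioned above
ceph tell osd.* injectargs '--osd_find_best_info_ignore_history_les=false'
~~~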

We spoke with Michael, and the plan is to do whatever is needed on the OpenShift ("Shift") side to get osd.1 and osd.2 up in some way so we can rescue data off of them.
There is something in OCP that is disruptive to the OSDs, and an OCP upgrade should resolve that issue.

I will attach the requested artifacts to this BZ as soon as things settle down.
(Mid-air collisions, so 1990s)


Version-Release number of selected component (if applicable):
ceph-16.2.10-172.el8cp (pacific), per the OSD crash output above

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
Two OSDs crash-loop (CLBO) on the PGLog::merge_log ceph_assert shown above, and the affected PGs never reach active+clean.

Expected results:
All PGs return to active+clean.

Additional info:

