Description of problem (please be as detailed as possible and provide log
snippets):
- Several OSD pods are going into CrashLoopBackOff (CLBO) state.
- The issue started while we were attempting to recover incomplete PGs by following the article: https://access.redhat.com/solutions/3740631
- We were recovering one of the incomplete PGs when the OSD pods went into CLBO state (a sketch of the typical recovery commands is included after the stack trace below):
rook-ceph-osd-3-f594b77db-tcnhh 0/1 CrashLoopBackOff
rook-ceph-osd-5-6df8cb58cf-jx52g 0/1 CrashLoopBackOff
rook-ceph-osd-6-69666f6975-wfffn 0/1 CrashLoopBackOff
rook-ceph-osd-7-69df499d47-9mpx6 0/1 CrashLoopBackOff
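For reference, the assert and backtrace below can be pulled from the previous (crashed) container of any of the affected OSD pods; a minimal sketch, assuming the default openshift-storage namespace:
--------------------------------------------------------------------------------
# List the OSD pods and their restart counts (default OCS namespace assumed).
oc get pods -n openshift-storage | grep rook-ceph-osd

# Dump the log of the previous (crashed) container, which contains the
# ceph_assert and backtrace shown below.
oc logs rook-ceph-osd-3-f594b77db-tcnhh -n openshift-storage --previous
--------------------------------------------------------------------------------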
- The OSDs are failing with the following assert:
--------------------------------------------------------------------------------
debug -1> 2022-07-08 08:53:29.062 7fa9a2e8a700 -1 /builddir/build/BUILD/ceph-14.2.11/src/osd/PGLog.cc: In function 'void PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7fa9a2e8a700 time 2022-07-08 08:53:29.060517
/builddir/build/BUILD/ceph-14.2.11/src/osd/PGLog.cc: 369: FAILED ceph_assert(log.head >= olog.tail && olog.head >= log.tail)
ceph version 14.2.11-208.el8cp (6738ba96f296a41c24357c12e8d594fbde457abc) nautilus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x156) [0x55648bba8790]
2: (()+0x50a9aa) [0x55648bba89aa]
3: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1cc1) [0x55648bdd8a61]
4: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x68) [0x55648bd292a8]
5: (PG::RecoveryState::Stray::react(MLogRec const&)+0x242) [0x55648bd6c0f2]
6: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xa5) [0x55648bdca945]
7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x5a) [0x55648bd967aa]
8: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x2c2) [0x55648bd872d2]
9: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x2bc) [0x55648bcc3ccc]
10: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x55) [0x55648bf4d365]
11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1366) [0x55648bcc05a6]
12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55648c2bd554]
13: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55648c2c0124]
14: (()+0x817a) [0x7fa9c98d517a]
15: (clone()+0x43) [0x7fa9c8603dc3]
--------------------------------------------------------------------------------
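For context on the recovery attempt mentioned above (which followed the linked article), this is a minimal sketch of the kind of ceph-objectstore-tool workflow involved; the deployment names, OSD id, data path, and <pgid> below are illustrative placeholders, not values from this cluster:
--------------------------------------------------------------------------------
# Stop Rook from reconciling and take the affected OSD offline
# (names assume a default OCS install; OSD 3 is used only as an example).
oc scale deployment rook-ceph-operator --replicas=0 -n openshift-storage
oc scale deployment rook-ceph-osd-3 --replicas=0 -n openshift-storage

# With the OSD's data path accessible (e.g. from a maintenance/debug pod),
# inspect the PG whose log merge is failing; --op info and --op log show the
# PG's log range (head/tail), which is what the ceph_assert above checks.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --op list-pgs
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --pgid <pgid> --op info
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --pgid <pgid> --op log

# Export the PG as a backup before any destructive step, then mark it complete
# on the OSD chosen as authoritative.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --pgid <pgid> --op export --file /tmp/<pgid>.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --pgid <pgid> --op mark-complete

# Bring the OSD and the operator back.
oc scale deployment rook-ceph-osd-3 --replicas=1 -n openshift-storage
oc scale deployment rook-ceph-operator --replicas=1 -n openshift-storage
--------------------------------------------------------------------------------
The assert condition (log.head >= olog.tail && olog.head >= log.tail) fails when the local and incoming pg_log ranges no longer overlap, so the --op info / --op log output for the affected PG from each crashing OSD is likely the most useful data to attach.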
Version of all relevant components (if applicable):
v4.6.13
Ceph: 14.2.11-208.el8cp
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
- The cluster is severely impacted; users are unable to access any data.
Is there any workaround available to the best of your knowledge?
N/A
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
N/A
Is this issue reproducible?
N/A
Can this issue be reproduced from the UI?
N/A
If this is a regression, please provide more details to justify this:
N/A
Steps to Reproduce:
N/A
Actual results:
- The OSD pods are hitting the assert above and going into CLBO state.
Expected results:
- The OSD pods should start and remain running; PG recovery should not cause the OSDs to assert.
Additional info:
In the next comments.