Bug 2105288 - [GSS] OSD pods crashed with failed assertion in PGLog::merge_log as logs do not overlap
Summary: [GSS] OSD pods crashed with failed assertion in PGLog::merge_log as logs do not overlap
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Neha Ojha
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-07-08 12:34 UTC by Priya Pandey
Modified: 2023-08-09 16:37 UTC
CC List: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-07-26 03:02:35 UTC
Embargoed:



Description Priya Pandey 2022-07-08 12:34:07 UTC
Description of problem (please be detailed as possible and provide log
snippets):

- Several OSD pods are going into the CrashLoopBackOff (CLBO) state.

- The issue started while we were trying to recover incomplete PGs by following the article: https://access.redhat.com/solutions/3740631

- We were recovering one of the incomplete PGs when the OSD pods went into the CLBO state.

rook-ceph-osd-3-f594b77db-tcnhh                                   0/1     CrashLoopBackOff
rook-ceph-osd-5-6df8cb58cf-jx52g                                  0/1     CrashLoopBackOff
rook-ceph-osd-6-69666f6975-wfffn                                  0/1     CrashLoopBackOff 
rook-ceph-osd-7-69df499d47-9mpx6                                  0/1     CrashLoopBackOff

- The OSDs are failing with the following assert message:

--------------------------------------------------------------------------------

debug     -1> 2022-07-08 08:53:29.062 7fa9a2e8a700 -1 /builddir/build/BUILD/ceph-14.2.11/src/osd/PGLog.cc: In function 'void PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7fa9a2e8a700 time 2022-07-08 08:53:29.060517
/builddir/build/BUILD/ceph-14.2.11/src/osd/PGLog.cc: 369: FAILED ceph_assert(log.head >= olog.tail && olog.head >= log.tail)

 ceph version 14.2.11-208.el8cp (6738ba96f296a41c24357c12e8d594fbde457abc) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x156) [0x55648bba8790]
 2: (()+0x50a9aa) [0x55648bba89aa]
 3: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1cc1) [0x55648bdd8a61]
 4: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x68) [0x55648bd292a8]
 5: (PG::RecoveryState::Stray::react(MLogRec const&)+0x242) [0x55648bd6c0f2]
 6: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xa5) [0x55648bdca945]
 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x5a) [0x55648bd967aa]
 8: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x2c2) [0x55648bd872d2]
 9: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x2bc) [0x55648bcc3ccc]
 10: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x55) [0x55648bf4d365]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1366) [0x55648bcc05a6]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55648c2bd554]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55648c2c0124]
 14: (()+0x817a) [0x7fa9c98d517a]
 15: (clone()+0x43) [0x7fa9c8603dc3]
--------------------------------------------------------------------------------
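
For context on the assert itself: ceph_assert(log.head >= olog.tail && olog.head >= log.tail) in PGLog::merge_log fires when the local PG log and the log received from a peer share no overlapping range of versions, so there is a gap in history that cannot be stitched together. The following is a minimal, self-contained C++ sketch of that overlap condition only (not Ceph source; eversion_t and pg_log_t here are simplified stand-ins for the real types):

--------------------------------------------------------------------------------
#include <cstdint>
#include <iostream>

// Simplified stand-in for Ceph's eversion_t (epoch + version).
struct eversion_t {
    uint64_t epoch;
    uint64_t version;
    bool operator>=(const eversion_t& o) const {
        return epoch != o.epoch ? epoch > o.epoch : version >= o.version;
    }
};

// Simplified stand-in for pg_log_t: only the log bounds matter here.
struct pg_log_t {
    eversion_t tail;  // oldest entry still kept in the log
    eversion_t head;  // newest entry in the log
};

// merge_log can only stitch two logs together if their [tail, head] ranges
// overlap; otherwise neither side has the missing history, and the real code
// hits: FAILED ceph_assert(log.head >= olog.tail && olog.head >= log.tail)
bool logs_overlap(const pg_log_t& log, const pg_log_t& olog) {
    return log.head >= olog.tail && olog.head >= log.tail;
}

int main() {
    pg_log_t local    = {{100, 50}, {100, 80}};   // hypothetical local OSD log
    pg_log_t incoming = {{100, 90}, {100, 120}};  // peer log starting after local head
    std::cout << (logs_overlap(local, incoming)
                      ? "mergeable"
                      : "gap: assert would fire")
              << std::endl;
}
--------------------------------------------------------------------------------

In this cluster, the non-overlapping logs are presumably a consequence of the manual recovery of the incomplete PGs described above, which can leave the local copy and its peers with divergent histories.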


Version of all relevant components (if applicable):

v4.6.13

Ceph: 14.2.11-208.el8cp

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

- The cluster is highly impacted; the customer is not able to access any data.


Is there any workaround available to the best of your knowledge?

N/A

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

N/A

Can this issue be reproduced?
N/A


Can this issue be reproduced from the UI?
N/A


If this is a regression, please provide more details to justify this:

N/A

Steps to Reproduce:
N/A


Actual results:

- The OSD pods are hitting an assert and going into the CLBO state.

Expected results:

- The OSD pods should be running normally.


Additional info:

In the next comments.

