Bug 2215698 - [ceph-osd] OSD failing to start and in CLBO due to high pg dup log issue
Summary: [ceph-osd] OSD failing to start and in CLBO due to high pg dup log issue
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.9
Hardware: All
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Radoslaw Zarzynski
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-06-17 14:59 UTC by Steve Baldwin
Modified: 2023-08-09 16:37 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-12 23:08:10 UTC
Embargoed:


Attachments: none

Description Steve Baldwin 2023-06-17 14:59:15 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
OSDs are failing to start and transitioning into a CLBO (CrashLoopBackOff) state. Memory errors are being reported, and on further investigation it was identified that the PG log duplicate counts are very high for many of the PGs.
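
For reference, a minimal sketch of the kind of check used to confirm the dup counts, assuming the OSD is stopped and ceph-objectstore-tool is run against its data path; the data path and PG ID are placeholders, and the exact JSON layout of the --op log output (pg_log_t.dups) may vary between Ceph releases:

  # List the PGs held by the stopped OSD.
  ceph-objectstore-tool --data-path <osd-data-path> --op list-pgs

  # Dump the PG log for one PG and count the duplicate entries
  # (assumes the dups array sits under pg_log_t in the JSON output).
  ceph-objectstore-tool --data-path <osd-data-path> --op log --pgid <pgid> | jq '.pg_log_t.dups | length'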

Version of all relevant components (if applicable):
odf  4.9.14
rhcs 16.2.0-152

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
- Yes, all of the OSDs are down.  This is a 3 node/OSD cluster.

Is there any workaround available to the best of your knowledge?
- Yes, there is a workaround [1] that uses a patched ceph-objectstore-tool binary. The trim-pg-log-dups operation is included in the ceph-objectstore-tool shipped with RHCS 5.1z2 (ceph version 16.2.7-126.el8cp) and newer, so no hotfix image is required at RHCS 5.1z2 and higher. This cluster is on an older release (rhcs 16.2.0-152), so the trim-pg-log-dups option is not available in its ceph-objectstore-tool. A command sketch follows the reference link below.

[1]: https://access.redhat.com/solutions/6987599
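
A rough sketch of the trim operation described in the KCS article, assuming a ceph-objectstore-tool new enough to carry it (RHCS 5.1z2 / ceph 16.2.7-126.el8cp or later) and a stopped OSD; placeholders are in angle brackets, and additional options from the article may be needed:

  # Trim the duplicate PG log entries for one PG on a stopped OSD; repeat per affected PG.
  ceph-objectstore-tool --data-path <osd-data-path> --op trim-pg-log-dups --pgid <pgid>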

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Only on the customer site.

Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:
N/A

Steps to Reproduce:
1. Restart the OSD.
2. OOM messages are reported in journalctl for the OSD in question (see the sketch after these steps).
3. Observe the OSD transition into CLBO in 'oc get pods -n openshift-storage'.
4. Using the diagnostic steps in KCS solution 6987599, observe the high number of duplicate entries for many of the PGs, matching the symptoms described in the article.
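
For reference, a sketch of how steps 2 and 3 were observed; the namespace and the rook-ceph-osd label are the usual ODF defaults and are assumptions here:

  # On the affected node, look for OOM kills of the OSD process in the kernel log.
  journalctl -k | grep -i -e oom -e "out of memory"

  # From the OpenShift client, watch the OSD pods cycle into CrashLoopBackOff.
  oc get pods -n openshift-storage -l app=rook-ceph-osd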


Additional info:
-- This is a 3 OSD / 3 storage node cluster; all OSDs are encountering the same issue, so all PGs are unavailable.
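
If the rook-ceph toolbox is enabled, the cluster-wide impact can be confirmed with something like the following (the deployment name rook-ceph-tools is the usual default and is an assumption here):

  # Expect 0 of 3 OSDs up and all PGs inactive/unknown.
  oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph status
  oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph osd tree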

