Bug 2215698

Summary: [ceph-osd] OSD failing to start and in CLBO due to high pg dup log issue
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: ceph
Sub component: RADOS
Version: 4.9
Status: CLOSED NOTABUG
Severity: urgent
Priority: unspecified
Hardware: All
OS: Linux
Reporter: Steve Baldwin <sbaldwin>
Assignee: Radoslaw Zarzynski <rzarzyns>
QA Contact: Elad <ebenahar>
CC: bniver, linuxkidd, muagarwa, nojha, ocs-bugs, odf-bz-bot, pdhange, sostapov
Target Milestone: ---
Target Release: ---
Type: Bug
Regression: ---
Last Closed: 2023-07-12 23:08:10 UTC

Description Steve Baldwin 2023-06-17 14:59:15 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
OSDs are failing to start and transitioning into CLBO state. Memory errors are being reported, and upon further investigation it was identified that the PG log duplicate counts are very high for many of the PGs.
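As a hedged aside, one way the memory growth from an oversized dup log can be observed (assuming the OSD stays up long enough to answer on its admin socket, and that jq is available in the toolbox/OSD pod; the mempool name and JSON layout are standard Ceph output, not taken from this case):

  # Sketch only: inspect the OSD's osd_pglog mempool, which accounts for PG log and dup entries.
  ceph daemon osd.0 dump_mempools | jq '.mempool.by_pool.osd_pglog'

An unusually large byte count in that pool is consistent with the high-dup symptom described above.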

Version of all relevant components (if applicable):
odf  4.9.14
rhcs 16.2.0-152

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
- Yes, all of the OSDs are down.  This is a 3 node/OSD cluster.

Is there any workaround available to the best of your knowledge?
- Yes, there is a workaround [1] that uses a patched ceph-objectstore-tool binary. The trim-pg-log-dups operation is included in the ceph-objectstore-tool shipped with RHCS 5.1z2 (ceph version 16.2.7-126.el8cp) and newer, so no hotfix image is required on RHCS 5.1z2 and higher. This cluster is running an older release (rhcs 16.2.0-152), so the trim-pg-log-dups operation is not available in its ceph-objectstore-tool.

[1]: https://access.redhat.com/solutions/6987599
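For reference, a minimal sketch of the trim operation described in [1], as it would look on a release that ships it. The data path and PG ID below are illustrative placeholders, and in ODF the OSD must be stopped (e.g. its deployment scaled to 0) and the tool run against the OSD's data path from a maintenance pod, as per the KCS article:

  # Sketch only (requires the RHCS 5.1z2+ ceph-objectstore-tool and a stopped OSD).
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 1.0 --op trim-pg-log-dups

The operation works on a single PG, so in practice it is repeated (or looped) over every PG reported with a high dup count.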

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Only on the customer site

Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:
N/A

Steps to Reproduce:
1. Restart the OSD.
2. OOM messages are reported in journalctl for the OSD in question.
3. Observe the OSD transition into CLBO (CrashLoopBackOff) in 'oc get pods -n openshift-storage'.
4. Using the diagnostic steps in KCS solution 6987599 [1], observe the high number of duplicate entries for many of the PGs, which matches the symptoms in the article (see the sketch after these steps).
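A hedged illustration of the dup-count check in step 4 (the data path, PG ID, and JSON field names are assumptions based on typical ceph-objectstore-tool output and may vary by release; jq is assumed to be available):

  # Sketch only: with the OSD stopped, dump one PG's log and count its dup entries.
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 1.0 --op log | jq '.pg_log_t.dups | length'

An unusually high count across many PGs, combined with the OOM kills from step 2, matches the symptoms described in the KCS article.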


Additional info:
-- This is a 3 OSD / 3 storage node cluster; all OSDs are encountering the same issue, so all PGs are unavailable.