Bug 2021079

Summary: ceph HEALTH_WARN snap trim queue for 10 pg(s)
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: ceph
Version: 4.9
Reporter: Elvir Kuric <ekuric>
Assignee: Josh Durgin <jdurgin>
QA Contact: Elad <ebenahar>
Status: CLOSED DUPLICATE
Severity: low
Priority: unspecified
CC: bhubbard, bniver, dupadhya, jespy, madam, mmuench, muagarwa, ocs-bugs, odf-bz-bot, vumrao
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Story Points: ---
Type: Bug
Last Closed: 2022-02-15 17:52:35 UTC

Comment 5 Scott Ostapovicz 2021-11-09 15:03:01 UTC
So this is a cleanup issue that eventually fixes itself as the PG count reaches zero, and that does not degrade the system? In any case, have you tracked how long it takes for this state to clear up?

Comment 6 Elvir Kuric 2021-11-09 15:53:20 UTC
(In reply to Scott Ostapovicz from comment #5)
> So this is a cleanup issue that eventually fixes itself as the PG count
> reaches zero, and that does not degrade the system? In any case, have you
> tracked how long it takes for this state to clear up?

Once the cluster ends up in this state, actively using it (creating Pods/PVCs, doing writes/reads) means it takes a very long time to return to a healthy state (hours to days). Leaving the cluster idle, without issuing reads/writes against it, will eventually move it back to a HEALTHY state. Other than this, the cluster is operational: it is possible to create images and issue r/w operations against it. I am running a lot of tests with the same scenario (create pod(s), attach PVC(s), execute load against the pods, then delete the pods/PVCs once the test is done), and I see this problem only when RBD replication/mirroring is involved.
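
Regarding the question in comment #5 about tracking how long the state takes to clear: the rough sketch below polls the cluster and logs when no PGs are left in a snaptrim state. It is only an illustration of how one could measure this, not something used in these tests; the polling interval is arbitrary, and it assumes the `ceph` CLI is reachable (e.g. from the rook-ceph toolbox pod).

    #!/usr/bin/env python3
    # Rough helper to measure how long the snap trim backlog takes to clear.
    # Assumptions (not from this bug): the `ceph` CLI is on PATH and a 60s
    # polling interval is acceptable.
    import subprocess
    import time

    POLL_SECONDS = 60  # assumed polling interval


    def snaptrim_pg_count() -> int:
        # Count PGs whose state column mentions snaptrim (covers snaptrim
        # and snaptrim_wait) in the plain `ceph pg dump pgs_brief` output.
        out = subprocess.run(
            ["ceph", "pg", "dump", "pgs_brief"],
            capture_output=True, text=True, check=True,
        ).stdout
        return sum(1 for line in out.splitlines() if "snaptrim" in line)


    def main() -> None:
        start = time.time()
        while True:
            count = snaptrim_pg_count()
            elapsed = int(time.time() - start)
            print(f"[{elapsed:>7}s] PGs in a snaptrim state: {count}")
            if count == 0:
                print(f"snap trim queue appears clear after ~{elapsed}s")
                break
            time.sleep(POLL_SECONDS)


    if __name__ == "__main__":
        main()

Watching the corresponding warning disappear from `ceph health detail` output would give the same signal at a coarser level.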

Comment 18 Mudit Agarwal 2021-11-15 08:13:00 UTC
Not a 4.9 blocker, moving it out while we continue the discussion.

Comment 20 Josh Durgin 2022-02-15 17:52:35 UTC
This is a symptom of an overloaded cluster - not a bug. We need to test to determine what configuration / workload we can support on given hardware, as described here: https://docs.google.com/document/d/1lLSf2GzdBIt9EATcqMx9jYcX5ylgIu6rzPISVmOYtIA/edit?usp=sharing

Comment 21 Josh Durgin 2022-03-24 17:00:42 UTC
This turned out to be a bug in the scrub/snap trim interaction - marking as a duplicate instead.

*** This bug has been marked as a duplicate of bug 2067056 ***