So this is a cleanup issue that eventually fixes itself as the PG count reaches zero, and that does not degrade the system? In any case, have you tracked how long it takes for this state to clear up?
(In reply to Scott Ostapovicz from comment #5)
> So this is a cleanup issue that eventually fixes itself as the PG count
> reaches zero, and that does not degrade the system? In any case, have you
> tracked how long it takes for this state to clear up?

Once the cluster ends up in this state, actively using it (creating Pods/PVCs, doing reads/writes) means it takes a very long time to return to a healthy state (hours or days). Leaving the cluster idle, without issuing reads/writes against it, will eventually move it back to a HEALTHY state. Other than this, the cluster is operational: it is possible to create images and issue r/w operations against them. I am running a lot of tests with the same scenario (create pod(s), attach PVC(s), execute load against the pods, and once the test is done, delete the pods/PVCs), and I see this problem only when rbd replication / mirroring is involved.
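For reference, a minimal sketch of that repro loop, assuming a kubectl-driven environment with a rook-ceph toolbox deployment available and fio installed in the test pod; the manifest files, pod/namespace names, and fio arguments below are illustrative placeholders, not taken from the actual test suite:

    #!/usr/bin/env python3
    """Hypothetical sketch of the repro cycle described above: create a PVC-backed
    pod, run a write load against it, tear everything down, then watch the PG
    state summary while the cluster works through the resulting backlog."""
    import subprocess
    import time

    MANIFESTS = ["pvc.yaml", "pod.yaml"]     # placeholder manifest files
    POD, NAMESPACE = "load-pod", "test-rbd"  # placeholder names

    def kubectl(*args):
        # Thin wrapper around kubectl; raises if the command fails.
        return subprocess.run(["kubectl", "-n", NAMESPACE, *args],
                              check=True, capture_output=True, text=True).stdout

    def run_cycle():
        for m in MANIFESTS:                  # create the PVC, then the pod using it
            kubectl("apply", "-f", m)
        kubectl("wait", "--for=condition=Ready", f"pod/{POD}", "--timeout=300s")
        # Drive I/O inside the pod; the fio invocation is illustrative only.
        kubectl("exec", POD, "--", "fio", "--name=w", "--rw=randwrite",
                "--size=1G", "--filename=/data/testfile")
        for m in reversed(MANIFESTS):        # delete the pod, then the PVC
            kubectl("delete", "-f", m, "--wait=true")

    def pg_summary():
        # Query the PG state summary via the toolbox deployment (name assumed).
        return kubectl("exec", "deploy/rook-ceph-tools", "--", "ceph", "pg", "stat")

    if __name__ == "__main__":
        run_cycle()
        # Poll for up to an hour to see how long the PG backlog takes to drain.
        for _ in range(60):
            print(pg_summary())
            time.sleep(60)

The idea is simply to repeat run_cycle() and observe whether the reported PG states drain between iterations with and without rbd mirroring enabled.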
Not a 4.9 blocker, moving it out while we continue the discussion.
This is a symptom of an overloaded cluster, not a bug. We need to test to determine what configuration and workload we can support on the given hardware, as described here: https://docs.google.com/document/d/1lLSf2GzdBIt9EATcqMx9jYcX5ylgIu6rzPISVmOYtIA/edit?usp=sharing
This turned out to be a bug in the scrub/snap trim interaction; marking this as a duplicate instead.

*** This bug has been marked as a duplicate of bug 2067056 ***