Bug 2062339 - RFE: Provide an alert when rbd snapshots are being flattened
Summary: RFE: Provide an alert when rbd snapshots are being flattened
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Pranshu Srivastava
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks: 1972264
 
Reported: 2022-03-09 15:12 UTC by Adam Litke
Modified: 2023-12-08 04:27 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-06-20 14:11:41 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1976936 1 unspecified CLOSED Failed to create snapshot 2023-08-09 16:37:41 UTC
Red Hat Issue Tracker OCSBZM-3408 0 None None None 2022-03-16 14:17:54 UTC
Red Hat Issue Tracker RHSTOR-3276 0 None None None 2022-06-20 14:11:40 UTC

Description Adam Litke 2022-03-09 15:12:51 UTC
Description of problem (please be as detailed as possible and provide log snippets):

As documented extensively in Bug 1976936, when the limit of RBD snapshots is reached for a volume, a flattening task runs in the background to mitigate the issue. During this time no more snapshots can be created. This causes a lengthy delay when provisioning new PVCs that clone an affected volume. The reason for this delay is not apparent to the user or cluster admin and will lead to frustration and support tickets. I would like to request that an alert fire while a volume is being flattened so we can let the user know what to expect in this situation.
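To illustrate one possible shape of the data behind such an alert (this is only a rough sketch, not the proposed implementation), something could poll the Ceph mgr background task list and expose the number of in-flight flatten tasks as a Prometheus gauge; an alert rule could then fire on that gauge being non-zero for some duration. The metric name, port, and message matching below are illustrative assumptions and would need to be checked against the real `ceph rbd task list` output.

#!/usr/bin/env python3
# Rough sketch only: poll the Ceph mgr background task list and expose the
# number of in-flight RBD flatten tasks as a Prometheus gauge. Metric name,
# port, and the "flatten" message match are illustrative assumptions.
import json
import subprocess
import time

from prometheus_client import Gauge, start_http_server

FLATTEN_TASKS = Gauge(
    "rbd_flatten_tasks_in_progress",  # hypothetical metric name
    "Number of background RBD flatten tasks currently queued or running",
)


def count_flatten_tasks() -> int:
    """Count tasks from `ceph rbd task list` whose message mentions flattening."""
    out = subprocess.run(
        ["ceph", "rbd", "task", "list", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    tasks = json.loads(out) or []
    return sum(1 for t in tasks if "flatten" in t.get("message", "").lower())


if __name__ == "__main__":
    start_http_server(9285)  # arbitrary port for the example
    while True:
        FLATTEN_TASKS.set(count_flatten_tasks())
        time.sleep(30)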


Version of all relevant components (if applicable): 4.8+


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
It creates a bad customer experience in high-scale workloads.


Is there any workaround available to the best of your knowledge?
The situation resolves itself eventually (possibly hours later), but this causes confusion since there is no way to understand what is happening unless you analyze logs and have a deep understanding of Ceph internals.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes.

Can this issue reproduce from the UI?
In theory yes, but it's more likely to happen when creating many volume clones from the CLI.
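
For reference, the CLI pattern I have in mind is simply creating many PVC clones of the same RBD-backed parent; below is a minimal sketch using the Kubernetes Python client (namespace, PVC names, storage class, size, and clone count are placeholders; the actual steps are in Bug 1976936).

#!/usr/bin/env python3
# Illustrative only: create many PVC clones of one RBD-backed parent PVC so
# that the snapshot/clone depth limit is eventually hit and background
# flattening kicks in. Namespace, names, storage class, size and count are
# placeholders.
from kubernetes import client, config

NAMESPACE = "default"
PARENT_PVC = "source-pvc"
STORAGE_CLASS = "ocs-storagecluster-ceph-rbd"  # placeholder RBD storage class
CLONE_COUNT = 100

config.load_kube_config()
core = client.CoreV1Api()

for i in range(CLONE_COUNT):
    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name=f"{PARENT_PVC}-clone-{i}"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteOnce"],
            storage_class_name=STORAGE_CLASS,
            resources=client.V1ResourceRequirements(requests={"storage": "1Gi"}),
            # CSI volume cloning: point dataSource at the parent PVC.
            data_source=client.V1TypedLocalObjectReference(
                kind="PersistentVolumeClaim", name=PARENT_PVC
            ),
        ),
    )
    core.create_namespaced_persistent_volume_claim(NAMESPACE, pvc)
    print(f"requested clone {i}")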

If this is a regression, please provide more details to justify this:
Not a regression


Steps to Reproduce:
See Bug 1976936 for full details.



Actual results:
Flattening happens in the background with no user-visible alerts or explanation of why further snapshot and clone operations cannot proceed.


Expected results:
An alert fires to indicate this condition, and appropriate documentation (a runbook) explains the situation to the user.


Additional info:

Comment 3 Travis Nielsen 2022-03-09 17:44:18 UTC
CSI team should review this, but ultimately it goes to the monitoring component

Comment 4 Niels de Vos 2022-03-16 14:17:55 UTC
(In reply to Travis Nielsen from comment #3)
> CSI team should review this, but ultimately it goes to the monitoring
> component

Indeed, Ceph-CSI cannot create alerts in OCP (or Kubernetes). Because CSI drivers are independent of the container platform, they should not use container platform APIs directly.

Monitoring components can check the events for a PV(C), and should be able to create an alert when flattening is in progress (and remove the alert when the PV(C) is created).
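
As a rough illustration of that idea (the exact event reason/message strings emitted while flattening is in progress would need to be confirmed), something along these lines could inspect a PVC's events:

#!/usr/bin/env python3
# Rough illustration of the approach above: list the Kubernetes events
# attached to a PVC and report any that appear to describe an in-progress
# flatten. The "flatten" substring match is an assumption; the exact
# reason/message strings may differ.
from kubernetes import client, config


def pvc_flatten_events(namespace: str, pvc_name: str):
    config.load_kube_config()  # or load_incluster_config() inside a pod
    core = client.CoreV1Api()
    selector = (
        "involvedObject.kind=PersistentVolumeClaim,"
        f"involvedObject.name={pvc_name}"
    )
    events = core.list_namespaced_event(namespace, field_selector=selector)
    return [e for e in events.items if "flatten" in (e.message or "").lower()]


if __name__ == "__main__":
    for ev in pvc_flatten_events("default", "my-cloned-pvc"):
        print(ev.last_timestamp, ev.reason, ev.message)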

Comment 5 Mudit Agarwal 2022-05-27 13:16:21 UTC
Created a Jira for this RFE

Comment 6 Adam Litke 2022-06-03 17:42:46 UTC
Reopening because the follow-up Jira is not linked anywhere. Where can I check to find this Jira?

Comment 7 Nishanth Thomas 2022-06-06 22:12:57 UTC
(In reply to Adam Litke from comment #6)
> Reopening because the follow-up Jira is not linked anywhere. Where can I
> check to find this Jira?

@alitke, it's linked in the 'Links' section of this BZ (https://issues.redhat.com/browse/OCSBZM-3408). Is that what you are looking for?

Comment 8 Mudit Agarwal 2022-06-20 14:11:41 UTC
My bad, I should have added the epic link.
Here you go https://issues.redhat.com/browse/RHSTOR-3276

Comment 10 Red Hat Bugzilla 2023-12-08 04:27:57 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

