2257949 – [ODF Hackathon]: Quota Alerts overlapping (quotaobjects and quotaobjectsexhausted) and flapping (RGW)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	ceph-monitoring
Sub Component:
Version:	4.14
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	ODF 4.16.0
Assignee:	arun kumar mohan
QA Contact:	Mahesh Shetty
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2260844

Reported:	2024-01-11 18:02 UTC by Ramon Gordillo
Modified:	2024-11-15 04:25 UTC
CC List:	11 users
Fixed In Version:	4.16.0-86
Doc Type:	Bug Fix
Doc Text:	.Quota alerts overlapping Previously, redundant alerts were fired when the object bucket claim (OBC) quota limit was reached. This was because, when the OBC quota reached 100%, both the `ObcQuotaObjectsAlert` (fired when the OBC object quota crosses 80% of its limit) and the `ObcQuotaObjectsExhausedAlert` (fired when the quota reaches 100%) alerts were fired. With this fix, the alert queries were changed so that only one alert is triggered at a time. As a result, `ObcQuotaObjectsAlert` is triggered when the quota crosses 80%, and `ObcQuotaObjectsExhausedAlert` is triggered when the quota reaches 100%.
Clone Of:
Environment:
Last Closed:	2024-07-17 13:12:01 UTC
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	red-hat-storage ocs-operator pull 2472	None	Merged	Prevent ObcQuotaObjectsAlert overlapping with ObcQuotaObjectsExhausedAlert	2024-04-23 12:29:36 UTC
Red Hat Bugzilla	2256771	unspecified	NEW	[ODF Hackathon] 4.14 parent case	2024-09-03 12:53:00 UTC
Red Hat Bugzilla	2258479	unspecified	CLOSED	[ODF Hackathon]: Ceph metrics timeout when looking for RBD mirroring when it is not configured (internal)	2024-05-02 11:58:44 UTC
Red Hat Product Errata	RHSA-2024:4591	None	None	None	2024-07-17 13:12:09 UTC

Description Ramon Gordillo 2024-01-11 18:02:19 UTC

Description of problem (please be as detailed as possible and provide log
snippets):

When the object quota reaches 100%, ObcQuotaObjectsAlert (>80%) and ObcQuotaObjectsExhausedAlert (at 100%) are raised at the same time. I suggest that the two should not overlap: the first one should fire only in the 80-100% range.
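
The overlap can be illustrated with a small Python sketch. It is a simplified model, not the operator's actual rule: the function names are hypothetical, the thresholds (0.8 and 1.0) are assumed from the percentages mentioned above, and `ratio` stands for used objects divided by the object quota.

```python
# Illustrative model only; the real alerts are Prometheus rules, not Python.
WARN = 0.8  # assumed from the ">80%" threshold described above
FULL = 1.0  # assumed from the 100% threshold described above


def firing_current(ratio: float) -> list[str]:
    """Alerts that fire when the first rule has no upper bound."""
    alerts = []
    if ratio > WARN:
        alerts.append("ObcQuotaObjectsAlert")
    if ratio >= FULL:
        alerts.append("ObcQuotaObjectsExhausedAlert")
    return alerts


def firing_proposed(ratio: float) -> list[str]:
    """Alerts that fire when the first rule is limited to the 80-100% band."""
    alerts = []
    if WARN < ratio < FULL:
        alerts.append("ObcQuotaObjectsAlert")
    if ratio >= FULL:
        alerts.append("ObcQuotaObjectsExhausedAlert")
    return alerts


for ratio in (0.5, 0.9, 1.0):
    print(ratio, firing_current(ratio), firing_proposed(ratio))
```

At a ratio of 1.0 the first function returns both alerts, while the second returns only the exhausted alert.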

Additionally, the metrics for the alerts are not scraped continuously (screenshot attached). This causes the alert to flap (see the additional screenshot of the Slack channel that receives the alert).
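
One possible way in which scrape gaps could produce flapping is sketched below in Python. This is a deliberately simplified model and an assumption about the mechanism, not a confirmed root cause: it treats a missed scrape as "no data" at evaluation time and ignores Prometheus staleness handling and any `for` duration on the rule. The sample values and the helper name are hypothetical.

```python
from typing import Optional

FULL = 1.0  # assumed "quota exhausted" threshold


def count_state_changes(samples: list[Optional[float]]) -> int:
    """Count firing/resolved transitions; None models a missed scrape."""
    changes = 0
    firing = False
    for value in samples:
        now_firing = value is not None and value >= FULL
        if now_firing != firing:
            changes += 1
            firing = now_firing
    return changes


# The quota stays at 100% throughout, but every other scrape is missing.
continuous = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
with_gaps = [1.0, None, 1.0, None, 1.0, None]

print(count_state_changes(continuous))  # 1: fires once and stays firing
print(count_state_changes(with_gaps))   # 6: fires and resolves repeatedly
```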


Version of all relevant components (if applicable):

OCP 4.14.7, ODF 4.14.3

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

It creates a bad experience and can lead users to silence the alerts.

Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1

Is this issue reproducible?

Yes

Can this issue be reproduced from the UI?

Yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy an OBC with quota
2. Upload objects until 100% of the quota is reached, in number of objects and/or in size
3. See the metrics and the alerts


Actual results:

Flapping and overlapping alerts

Expected results:

Non-flapping alerts

Additional info:

Comment 6 Ramon Gordillo 2024-01-15 18:51:11 UTC

If you look at other Ceph alerting rules, it is as you say (CephOSDNearFull and CephOSDCriticallyFull).

However, other standard rules from Kubernetes do not overlap. See KubeQuotaAlmostFull and KubeQuotaFullyUsed.

Additionally, those latter alerts have severity "info", since this is not an issue in the cluster but something to be aware of in case a user needs more space.

Comment 7 Ramon Gordillo 2024-01-16 13:01:50 UTC

Could the flapping issue be due to https://bugzilla.redhat.com/show_bug.cgi?id=2258479 ?

Comment 11 arun kumar mohan 2024-02-21 07:22:11 UTC

Added PR https://github.com/red-hat-storage/ocs-operator/pull/2472 to stop the overlapping.
Regarding the flapping, we may need a cluster setup to debug the issue.
I will check with Divyansh (as per comment#7, regarding BZ#2258479) to confirm the theory.

Comment 12 Ramon Gordillo 2024-02-21 08:10:45 UTC

Hi, @amohan. The PR only covers the alert for the quota in objects, but there is another one for the quota in bytes.

BTW, for readability I recommend using the notation > 0.8 < 1 instead of < 1 > 0.8.
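
To see why the two notations select the same series and differ only in readability, here is a minimal Python model of chained comparisons acting as filters on a set of samples (as PromQL comparisons do when used without `bool`). The bucket names and ratios are made up for illustration, and `keep` is a hypothetical helper, not a Prometheus API.

```python
import operator
from typing import Callable

# Hypothetical usage ratios (used objects / object quota) per bucket.
ratios = {"bucket-a": 0.50, "bucket-b": 0.85, "bucket-c": 1.00}


def keep(samples: dict[str, float],
         op: Callable[[float, float], bool],
         bound: float) -> dict[str, float]:
    """Keep only the samples whose value satisfies `value op bound`."""
    return {name: v for name, v in samples.items() if op(v, bound)}


# "> 0.8 < 1": filter on the lower bound first, then on the upper bound.
lower_first = keep(keep(ratios, operator.gt, 0.8), operator.lt, 1)
# "< 1 > 0.8": the same filters applied in the opposite order.
upper_first = keep(keep(ratios, operator.lt, 1), operator.gt, 0.8)

print(lower_first)                 # {'bucket-b': 0.85}
print(lower_first == upper_first)  # True
```

Both orders keep only the bucket inside the 80-100% band; writing the lower bound first simply reads in the same order as the range itself.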

Comment 13 arun kumar mohan 2024-03-08 07:25:07 UTC

Hi Ramon, I made both changes: reordered the condition for better readability and applied the same fix to ObcQuotaBytesAlert.

Comment 19 arun kumar mohan 2024-05-29 14:22:36 UTC

Adding the RDT details, please take a look.

Comment 21 errata-xmlrpc 2024-07-17 13:12:01 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

Comment 22 Red Hat Bugzilla 2024-11-15 04:25:14 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.