Bug 2257949 - [ODF Hackathon]: Quota Alerts overlapping (quotaobjects and quotaobjectsexhausted) and flapping (RGW)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: arun kumar mohan
QA Contact: Mahesh Shetty
URL:
Whiteboard:
Depends On:
Blocks: 2260844
Reported: 2024-01-11 18:02 UTC by Ramon Gordillo
Modified: 2024-11-15 04:25 UTC
CC List: 11 users

Fixed In Version: 4.16.0-86
Doc Type: Bug Fix
Doc Text:
.Quota alerts overlapping
Previously, redundant alerts were fired when the object bucket claim (OBC) quota limit was reached. When the OBC quota reached 100%, both `ObcQuotaObjectsAlert` (fired when the OBC object quota crosses 80% of its limit) and `ObcQuotaObjectsExhausedAlert` (fired when the quota reaches 100%) were raised. With this fix, the alert queries were changed so that only one alert is triggered at a time. As a result, `ObcQuotaObjectsAlert` is triggered when the quota is above 80% but below 100%, and `ObcQuotaObjectsExhausedAlert` is triggered when the quota is at 100%.
Clone Of:
Environment:
Last Closed: 2024-07-17 13:12:01 UTC
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | red-hat-storage ocs-operator pull 2472 | 0 | None | Merged | Prevent ObcQuotaObjectsAlert overlapping with ObcQuotaObjectsExhausedAlert | 2024-04-23 12:29:36 UTC
Red Hat Bugzilla | 2256771 | 0 | unspecified | NEW | [ODF Hackathon] 4.14 parent case | 2024-09-03 12:53:00 UTC
Red Hat Bugzilla | 2258479 | 0 | unspecified | CLOSED | [ODF Hackathon]: Ceph metrics timeout when looking for RBD mirroring when it is not configured (internal) | 2024-05-02 11:58:44 UTC
Red Hat Product Errata | RHSA-2024:4591 | 0 | None | None | None | 2024-07-17 13:12:09 UTC

Description Ramon Gordillo 2024-01-11 18:02:19 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

When the object quota reaches 100%, ObcQuotaObjectsAlert (>80%) and ObcQuotaObjectsExhausedAlert (>100%) are raised at the same time. I suggest removing the overlap by changing the first alert to cover the 80-100% range only.
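The overlap and the suggested fix can be sketched in a few lines of Python. This is only an illustration of the threshold logic, not the actual alert rules: the helper names are hypothetical, and it assumes the exhausted alert fires once usage reaches 100% of the quota.

```python
WARN_RATIO = 0.8  # 80% threshold quoted in the report
FULL_RATIO = 1.0  # assumption: "exhausted" fires once usage reaches 100%


def overlapping_alerts(ratio):
    """Alerts firing under the behaviour described above (hypothetical helper)."""
    firing = []
    if ratio > WARN_RATIO:
        firing.append("ObcQuotaObjectsAlert")
    if ratio >= FULL_RATIO:
        firing.append("ObcQuotaObjectsExhausedAlert")
    return firing


def banded_alerts(ratio):
    """Alerts firing when the warning is limited to the 80-100% band."""
    firing = []
    if WARN_RATIO < ratio < FULL_RATIO:
        firing.append("ObcQuotaObjectsAlert")
    if ratio >= FULL_RATIO:
        firing.append("ObcQuotaObjectsExhausedAlert")
    return firing


for ratio in (0.5, 0.9, 1.0):
    print(ratio, overlapping_alerts(ratio), banded_alerts(ratio))
```

With the banded version, a bucket at 100% of its quota raises only the exhausted alert, while a bucket at 90% raises only the warning.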

Additionally, the metrics for the alerts are not scraped continuously (screenshot attached). This causes the alert to flap (see the additional screenshot on the Slack channel that receives the alert).
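As a rough illustration of why gaps in the metric could produce flapping, here is a toy model in Python. It is not how Prometheus evaluates rules (real rule evaluation has staleness handling and optional `for` durations); it simply assumes that an evaluation with no sample available counts as "not firing". The function names and sample values are made up for the example.

```python
def alert_states(samples, threshold=1.0):
    """One firing/not-firing flag per evaluation; None models a missing sample."""
    return [s is not None and s >= threshold for s in samples]


def count_transitions(states):
    """Number of times the alert flips between firing and not firing."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# Usage ratio stays at 100% throughout; only the availability of samples differs.
steady = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
gappy = [1.0, None, 1.0, None, None, 1.0]

print(count_transitions(alert_states(steady)))
print(count_transitions(alert_states(gappy)))
```

In this model the underlying quota usage never changes, yet the series with gaps flips state several times, which would show up as repeated firing and resolved notifications.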


Version of all relevant components (if applicable):

OCP 4.14.7, ODF 4.14.3

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

It creates a bad experience and can lead users to silence the alerts.

Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1

Is this issue reproducible?

Yes

Can this issue reproduce from the UI?

Yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy an OBC with quota
2. Upload objects to reach 100% of the quota in objects and/or size
3. See the metrics and the alerts


Actual results:

Flapping and overlapping alerts

Expected results:

Non-flapping alerts

Additional info:

Comment 6 Ramon Gordillo 2024-01-15 18:51:11 UTC
If you look at other Ceph alerting rules, it is as you say (CephOSDNearFull and CephOSDCriticallyFull).

However, other standard rules from kubernetes are not overlapping. See KubeQuotaAlmostFull and KubeQuotaFullyUsed.

Additionally, those latter alerts have "info" severity, since this is not an issue in the cluster but something to be aware of in case a user needs more space.

Comment 7 Ramon Gordillo 2024-01-16 13:01:50 UTC
Could the flapping issue be due to https://bugzilla.redhat.com/show_bug.cgi?id=2258479?

Comment 11 arun kumar mohan 2024-02-21 07:22:11 UTC
Added PR: https://github.com/red-hat-storage/ocs-operator/pull/2472 to stop the overlapping.
Regarding the flapping issue, we may need a cluster setup to debug it.
Will check with Divyansh (as per comment#7, regarding BZ#2258479) to confirm the theory.

Comment 12 Ramon Gordillo 2024-02-21 08:10:45 UTC
Hi, @amohan. The PR only covers the object quota alert, but there is another one for bytes.

BTW, for readability I recommend using the notation > 0.8 < 1 instead of < 1 > 0.8.
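In PromQL, a comparison without the `bool` modifier acts as a filter on the series, so chaining the two comparisons in either order selects the same series, and the ordering is purely a readability choice. A small Python sketch of that filtering behaviour (the bucket names, values, and the `keep` helper are made up for illustration):

```python
# Hypothetical usage ratios per bucket (used objects / object quota).
usage = {"bucket-a": 0.5, "bucket-b": 0.85, "bucket-c": 1.0, "bucket-d": 1.2}


def keep(series, predicate):
    """Mimic a filtering comparison: drop entries that fail the predicate."""
    return {name: value for name, value in series.items() if predicate(value)}


# "> 0.8 < 1": lower bound first, then upper bound.
lower_first = keep(keep(usage, lambda v: v > 0.8), lambda v: v < 1)

# "< 1 > 0.8": upper bound first, then lower bound.
upper_first = keep(keep(usage, lambda v: v < 1), lambda v: v > 0.8)

print(lower_first)
print(upper_first)
```

Both orderings keep only the bucket that sits inside the 80-100% band; the bucket at exactly 100% is excluded and left to the exhausted alert.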

Comment 13 arun kumar mohan 2024-03-08 07:25:07 UTC
Hi Ramon, I made both changes (reordered the condition for readability and applied the same fix to ObcQuotaBytesAlert as well).

Comment 19 arun kumar mohan 2024-05-29 14:22:36 UTC
Adding the RDT details, please take a look.

Comment 21 errata-xmlrpc 2024-07-17 13:12:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

Comment 22 Red Hat Bugzilla 2024-11-15 04:25:14 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

