Bug 2084014 - Capacity utilization alerts on provider are not raised for large clusters [NEEDINFO]
Summary: Capacity utilization alerts on provider are not raised for large clusters
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-managed-service
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Pranshu Srivastava
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks: 2084534 2084541
 
Reported: 2022-05-11 08:20 UTC by Filip Balák
Modified: 2023-08-09 17:00 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-12 10:11:35 UTC
Embargoed:
prasriva: needinfo? (fbalak)




Links:
Red Hat Bugzilla 1809248 (Last Updated: 2022-05-11 08:20:34 UTC)

Description Filip Balák 2022-05-11 08:20:34 UTC
Description of problem:
Due to the issue described in https://bugzilla.redhat.com/show_bug.cgi?id=1809248 (the CephClusterNearFull and CephClusterCriticallyFull alerts use RAW storage instead of user storage), which was never fully resolved, users do not receive the CephClusterNearFull and CephClusterCriticallyFull alerts: the available cluster storage fills up completely before those alerts are raised.
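For illustration, a minimal sketch of the mismatch, run against the toolbox pod. It assumes the JSON field names emitted by "ceph df --format json" on this Ceph release (total_used_raw_ratio, percent_used, max_avail) and that jq is installed on the workstation:

# Compare the cluster-wide RAW utilization that CephClusterNearFull /
# CephClusterCriticallyFull evaluate with the per-pool user-visible fullness.
TOOLBOX=$(oc get pods -n openshift-storage -o name | grep tool | head -n1)

# Cluster-wide raw utilization, as a percentage (the alerts key off this value).
oc rsh -n openshift-storage "$TOOLBOX" ceph df --format json \
  | jq '.stats.total_used_raw_ratio * 100'

# Per-pool user-visible utilization and remaining space.
oc rsh -n openshift-storage "$TOOLBOX" ceph df --format json \
  | jq '.pools[] | {name, percent_used: (.stats.percent_used * 100), max_avail_bytes: .stats.max_avail}'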

Version-Release number of selected component (if applicable):
ocs-operator.v4.10.0
OCP 4.10.8

How reproducible:
1/1

Steps to Reproduce:
1. Deploy provider with 3 availability zones and size 20 TiB.
2. Deploy consumer with 3 availability zones and size 20 TiB that uses previously created provider.
3. Create a large PVC on consumer that uses all available space.
4. Fill the PVC with data (a minimal sketch of steps 3-4 follows below).
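A minimal sketch of steps 3-4, assuming the default ocs-storagecluster-ceph-rbd StorageClass on the consumer; the resource names (full-cluster-pvc, pvc-filler), the namespace and the image are illustrative only, and the requested size has to be adjusted to the usable capacity of the cluster under test:

# Create a PVC sized close to the usable capacity, then fill it with zeroes.
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: full-cluster-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ocs-storagecluster-ceph-rbd
  resources:
    requests:
      storage: 19Ti
---
apiVersion: v1
kind: Pod
metadata:
  name: pvc-filler
  namespace: default
spec:
  restartPolicy: Never
  containers:
    - name: filler
      image: registry.access.redhat.com/ubi8/ubi
      # dd exits with "No space left on device" once the volume is full.
      command: ["sh", "-c", "dd if=/dev/zero of=/data/fill.bin bs=4M || true; sync; sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: full-cluster-pvc
EOF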

Actual results:
Cluster gets full without any alert:

$ oc rsh -n openshift-storage $(oc get pods -o wide -n openshift-storage|grep tool|awk '{print$1}') ceph df
--- RAW STORAGE ---
CLASS    SIZE   AVAIL    USED  RAW USED  %RAW USED
ssd    60 TiB  22 TiB  38 TiB    38 TiB      63.98
TOTAL  60 TiB  22 TiB  38 TiB    38 TiB      63.98
 
--- POOLS ---
POOL                                                                ID  PGS  STORED  OBJECTS     USED   %USED  MAX AVAIL
device_health_metrics                                                1    1     0 B       18      0 B       0        0 B
ocs-storagecluster-cephblockpool                                     2  128    19 B        1   12 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-metadata                           3   32  35 KiB       22  191 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-data0                              4  128     0 B        0      0 B       0        0 B
cephblockpool-storageconsumer-d3bfd22c-4e28-4b5e-b9e7-18fc13790a3a   5  128  13 TiB    3.36M   38 TiB  100.00        0 B


RAW capacity is only 63.98% used, yet all user-available space is already exhausted. The first capacity alert is not triggered until 75% RAW capacity.
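As a rough check, the value the alert evaluates can be queried directly against the in-cluster Prometheus. This is only a sketch: the pod and container names and the availability of promtool inside the image are assumptions, and the expression mirrors the upstream ceph-mixins definition of CephClusterNearFull (threshold 0.75):

# Evaluate the ratio CephClusterNearFull keys off against the in-cluster Prometheus.
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
  promtool query instant http://localhost:9090 \
  'ceph_cluster_total_used_raw_bytes / ceph_cluster_total_bytes'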


Expected results:
Users of the provider cluster should get a notification that the cluster is getting full.

Additional info:

Comment 1 Filip Balák 2022-05-12 07:43:58 UTC
I also tested the scenario with a cluster of 4 TiB size and 3 availability zones, and I am still unable to get any cluster-level SendGrid notification.

Comment 2 Filip Balák 2022-05-12 11:55:04 UTC
After fully utilizing a cluster with 4 TiB size and 3 availability zones, I got the SendGrid notification:

Your storage cluster utilization has crossed 80% and will become read-only at 85% utilized! Please free up some space or if possible expand the storage cluster immediately to prevent any service access issues. It is common to also be alerted to OSD devices entering near-full or full states prior to this alert. 

After full utilization of the cluster, its capacity looks like this:

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph df
--- RAW STORAGE ---
CLASS    SIZE    AVAIL    USED  RAW USED  %RAW USED
ssd    12 TiB  1.8 TiB  10 TiB    10 TiB      85.01
TOTAL  12 TiB  1.8 TiB  10 TiB    10 TiB      85.01
 
--- POOLS ---
POOL                                                                ID  PGS   STORED  OBJECTS     USED   %USED  MAX AVAIL
device_health_metrics                                                1    1   15 KiB        6   46 KiB  100.00        0 B
ocs-storagecluster-cephblockpool                                     2   32     19 B        1   12 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-metadata                           3   32   18 KiB       22  138 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-data0                              4   32      0 B        0      0 B       0        0 B
cephblockpool-storageconsumer-1318e613-2b6e-45e1-81e2-b25f67221e47   5   32  3.4 TiB  892.69k   10 TiB  100.00        0 B

This is not achievable with larger clusters, as mentioned in the description of the bug.

Comment 3 Pranshu Srivastava 2022-05-13 04:37:05 UTC
@fbalak Could you confirm whether the ask here is to include the existing pool alerts, namely CephPoolQuotaBytesNearExhaustion and CephPoolQuotaBytesCriticallyExhausted, as defined in [1], which would let the user know when the pools exceed the threshold limit?

- [1]: https://github.com/ceph/ceph-mixins/blob/master/alerts/pool-quota.libsonnet#L7-L38
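For reference, a quick way to check whether these rules are already deployed in a given cluster (a sketch only; the PrometheusRule object names are not assumed, so this simply greps all rules in the namespace):

# Look for the pool-quota alert definitions among the deployed PrometheusRule objects.
oc -n openshift-storage get prometheusrules -o yaml | grep -B2 -A8 'CephPoolQuotaBytes'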

Comment 4 Filip Balák 2022-05-13 06:36:32 UTC
I don't think that would solve the issue. AFAIK there is no pool quota set on the pools used by the default Ceph storage classes.
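That can be verified from the toolbox pod (a sketch; the pool name is taken from the ceph df output above, and "max bytes: N/A" in the output means no quota is set, so quota-based alerts would never fire):

TOOLBOX=$(oc get pods -n openshift-storage -o name | grep tool | head -n1)
oc rsh -n openshift-storage "$TOOLBOX" ceph osd pool get-quota ocs-storagecluster-cephblockpool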

New pool capacity alerts (not quota alerts) that would be clearly communicated to users could help here, but the RFE that was created for this was closed as not needed and confusing: https://bugzilla.redhat.com/show_bug.cgi?id=1870083

