Bug 2084014

Summary: Capacity utilization alerts on provider are not raised for large clusters
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Filip Balák <fbalak>
Component: odf-managed-service Assignee: Pranshu Srivastava <prasriva>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Neha Berry <nberry>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.10 CC: aeyal, mbukatov, nthomas, ocs-bugs, odf-bz-bot, pcuzner, prasriva, sgatfane, tnielsen
Target Milestone: --- Keywords: AutomationBlocker
Target Release: --- Flags: prasriva: needinfo? (fbalak)
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-10-12 10:11:35 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2084534, 2084541, 2236143    

Description Filip Balák 2022-05-11 08:20:34 UTC
Description of problem:
Due to the issue described in https://bugzilla.redhat.com/show_bug.cgi?id=1809248 (the alerts CephClusterNearFull and CephClusterCriticallyFull use RAW storage instead of user storage), which was never fully resolved, users do not receive the CephClusterNearFull and CephClusterCriticallyFull alerts: the available cluster storage fills up before those alerts are raised.

Version-Release number of selected component (if applicable):
ocs-operator.v4.10.0
OCP 4.10.8

How reproducible:
1/1

Steps to Reproduce:
1. Deploy a provider cluster with 3 availability zones and a size of 20 TiB.
2. Deploy a consumer cluster with 3 availability zones and a size of 20 TiB that uses the previously created provider.
3. Create a large PVC on the consumer that uses all of the available space.
4. Fill the PVC with data (a sketch of steps 3 and 4 follows below).
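
A minimal sketch of steps 3 and 4, assuming a consumer StorageClass named ocs-storagecluster-ceph-rbd (the actual StorageClass name on the consumer may differ) and an RBD-backed PVC sized to the full 20 TiB:

# Create a PVC that requests the whole usable capacity (names are illustrative).
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fill-test
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ocs-storagecluster-ceph-rbd
  resources:
    requests:
      storage: 20Ti
EOF
# Attach the PVC to a pod and write data until writes stall, e.g. with
# dd if=/dev/zero of=/mnt/fill-test/data bs=1M or fio against the mount point.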

Actual results:
Cluster gets full without any alert:

$ oc rsh -n openshift-storage $(oc get pods -o wide -n openshift-storage|grep tool|awk '{print$1}') ceph df
--- RAW STORAGE ---
CLASS    SIZE   AVAIL    USED  RAW USED  %RAW USED
ssd    60 TiB  22 TiB  38 TiB    38 TiB      63.98
TOTAL  60 TiB  22 TiB  38 TiB    38 TiB      63.98
 
--- POOLS ---
POOL                                                                ID  PGS  STORED  OBJECTS     USED   %USED  MAX AVAIL
device_health_metrics                                                1    1     0 B       18      0 B       0        0 B
ocs-storagecluster-cephblockpool                                     2  128    19 B        1   12 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-metadata                           3   32  35 KiB       22  191 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-data0                              4  128     0 B        0      0 B       0        0 B
cephblockpool-storageconsumer-d3bfd22c-4e28-4b5e-b9e7-18fc13790a3a   5  128  13 TiB    3.36M   38 TiB  100.00        0 B


RAW capacity is only 63.98% used, yet all user-available space is already exhausted. The first capacity alert is not triggered until 75% RAW capacity.
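
One way to confirm that the shipped alerts are still based on RAW capacity is to inspect the PrometheusRule objects in the openshift-storage namespace (a sketch, assuming the default ODF monitoring stack; rule names and expressions may differ between versions). The expressions are expected to compare ceph_cluster_total_used_raw_bytes against ceph_cluster_total_bytes:

# Show the expressions behind the cluster utilization alerts.
oc get prometheusrules -n openshift-storage -o yaml | grep -B2 -A6 'CephClusterNearFull\|CephClusterCriticallyFull'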


Expected results:
Users of the provider cluster should get a notification that the cluster is getting full.

Additional info:

Comment 1 Filip Balák 2022-05-12 07:43:58 UTC
I also tested the scenario with a 4 TiB cluster with 3 availability zones, and I am still unable to get any cluster-level SendGrid notification.

Comment 2 Filip Balák 2022-05-12 11:55:04 UTC
After fully utilizing a 4 TiB cluster with 3 availability zones, I got a SendGrid notification:

Your storage cluster utilization has crossed 80% and will become read-only at 85% utilized! Please free up some space or if possible expand the storage cluster immediately to prevent any service access issues. It is common to also be alerted to OSD devices entering near-full or full states prior to this alert. 

After full utilization of the cluster, its capacity looks like this:

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph df
--- RAW STORAGE ---
CLASS    SIZE    AVAIL    USED  RAW USED  %RAW USED
ssd    12 TiB  1.8 TiB  10 TiB    10 TiB      85.01
TOTAL  12 TiB  1.8 TiB  10 TiB    10 TiB      85.01
 
--- POOLS ---
POOL                                                                ID  PGS   STORED  OBJECTS     USED   %USED  MAX AVAIL
device_health_metrics                                                1    1   15 KiB        6   46 KiB  100.00        0 B
ocs-storagecluster-cephblockpool                                     2   32     19 B        1   12 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-metadata                           3   32   18 KiB       22  138 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-data0                              4   32      0 B        0      0 B       0        0 B
cephblockpool-storageconsumer-1318e613-2b6e-45e1-81e2-b25f67221e47   5   32  3.4 TiB  892.69k   10 TiB  100.00        0 B

This is not achievable with larger clusters, as mentioned in the description of the bug.

Comment 3 Pranshu Srivastava 2022-05-13 04:37:05 UTC
@fbalak Could you confirm whether the ask here is to include the existing pool alerts, namely CephPoolQuotaBytesNearExhaustion and CephPoolQuotaBytesCriticallyExhausted, as defined in [1], which would let the user know when the pools exceed the threshold limit?

- [1]: https://github.com/ceph/ceph-mixins/blob/master/alerts/pool-quota.libsonnet#L7-L38

Comment 4 Filip Balák 2022-05-13 06:36:32 UTC
I don't think that would solve the issue. AFAIK there is no pool quota set for the pools used in the default Ceph storage classes.
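
This can be verified from the toolbox pod (a sketch, assuming the toolbox pod carries the app=rook-ceph-tools label); the output reports N/A for max objects and max bytes when no quota is set:

# Check the quotas on the default pools; both values should be unset.
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name | head -n1)
oc rsh -n openshift-storage "$TOOLS_POD" ceph osd pool get-quota ocs-storagecluster-cephblockpool
oc rsh -n openshift-storage "$TOOLS_POD" ceph osd pool get-quota ocs-storagecluster-cephfilesystem-data0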

New pool capacity alerts (not quota alerts) that would be clearly communicated to users could help here, but the RFE that was created for this was closed as not needed and confusing: https://bugzilla.redhat.com/show_bug.cgi?id=1870083
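
Until such alerts exist, per-pool (user-visible) utilization can still be checked manually from the toolbox pod (a sketch; JSON field names may vary slightly between Ceph versions):

# percent_used below is the pool-level utilization that a pool capacity alert
# would need to watch instead of RAW utilization.
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name | head -n1)
oc rsh -n openshift-storage "$TOOLS_POD" ceph df --format json | \
  jq -r '.pools[] | "\(.name) \(.stats.percent_used)"'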