Description of problem:

Due to the issue described in https://bugzilla.redhat.com/show_bug.cgi?id=1809248 (Alerts CephClusterNearFull and CephClusterCriticallyFull use RAW storage instead of user storage), which was never fully resolved, users do not receive the CephClusterNearFull and CephClusterCriticallyFull alerts: the available cluster storage fills up before those alerts are raised.

Version-Release number of selected component (if applicable):
ocs-operator.v4.10.0
OCP 4.10.8

How reproducible:
1/1

Steps to Reproduce:
1. Deploy a provider with 3 availability zones and size 20 TiB.
2. Deploy a consumer with 3 availability zones and size 20 TiB that uses the previously created provider.
3. Create a large PVC on the consumer that uses all available space.
4. Fill the PVC with data.

Actual results:

The cluster gets full without any alert:

$ oc rsh -n openshift-storage $(oc get pods -o wide -n openshift-storage | grep tool | awk '{print$1}') ceph df
--- RAW STORAGE ---
CLASS    SIZE   AVAIL    USED  RAW USED  %RAW USED
ssd    60 TiB  22 TiB  38 TiB    38 TiB      63.98
TOTAL  60 TiB  22 TiB  38 TiB    38 TiB      63.98

--- POOLS ---
POOL                                                                ID  PGS  STORED  OBJECTS     USED   %USED  MAX AVAIL
device_health_metrics                                                1    1     0 B       18      0 B       0        0 B
ocs-storagecluster-cephblockpool                                     2  128    19 B        1   12 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-metadata                           3   32  35 KiB       22  191 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-data0                              4  128     0 B        0      0 B       0        0 B
cephblockpool-storageconsumer-d3bfd22c-4e28-4b5e-b9e7-18fc13790a3a   5  128  13 TiB    3.36M   38 TiB  100.00        0 B

RAW capacity is only 63.98% used, yet all user-available space is exhausted (MAX AVAIL is 0 B for every pool). The first capacity alert is triggered only at 75% RAW capacity, so it never fires.

Expected results:

Users of the Provider cluster should get a notification that the cluster is getting full.

Additional info:
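For context, the alert expressions that ship with OCS can be inspected directly; per the linked BZ they compare raw bytes (something along the lines of ceph_cluster_total_used_raw_bytes / ceph_cluster_total_bytes > 0.75), which is why raw utilization can sit well below the 75% threshold while every pool already reports MAX AVAIL of 0 B. A minimal check from the CLI; the rule object name prometheus-ceph-rules is an assumption and may differ per release:

# List the shipped rule objects, then look at the near-full expression.
# ("prometheus-ceph-rules" is an assumed name; adjust from the list output.)
oc get prometheusrule -n openshift-storage
oc get prometheusrule prometheus-ceph-rules -n openshift-storage -o yaml \
  | grep -A4 'alert: CephClusterNearFull'

# Compare raw utilization (what the alert sees) with per-pool MAX AVAIL
# (what users actually run out of):
TOOLBOX=$(oc get pods -n openshift-storage -o name | grep tool)
oc rsh -n openshift-storage "$TOOLBOX" ceph df detail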
I also tested the scenario with a 4 TiB cluster and 3 availability zones, and I am still unable to get any cluster-level SendGrid notification.
After fully utilizing a cluster with 4 TiB size and 3 availability zones, I got a SendGrid notification:

"Your storage cluster utilization has crossed 80% and will become read-only at 85% utilized! Please free up some space or if possible expand the storage cluster immediately to prevent any service access issues. It is common to also be alerted to OSD devices entering near-full or full states prior to this alert."

After full utilization of the cluster, its capacity looks like:

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage | grep tool | awk '{print$1}') ceph df
--- RAW STORAGE ---
CLASS    SIZE    AVAIL    USED  RAW USED  %RAW USED
ssd    12 TiB  1.8 TiB  10 TiB    10 TiB      85.01
TOTAL  12 TiB  1.8 TiB  10 TiB    10 TiB      85.01

--- POOLS ---
POOL                                                                ID  PGS   STORED  OBJECTS     USED   %USED  MAX AVAIL
device_health_metrics                                                1    1   15 KiB        6   46 KiB  100.00        0 B
ocs-storagecluster-cephblockpool                                     2   32     19 B        1   12 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-metadata                           3   32   18 KiB       22  138 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-data0                              4   32      0 B        0      0 B       0        0 B
cephblockpool-storageconsumer-1318e613-2b6e-45e1-81e2-b25f67221e47   5   32  3.4 TiB  892.69k   10 TiB  100.00        0 B

This is not achievable with larger clusters, as mentioned in the description of the bug.
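The same comparison can be pulled programmatically instead of eyeballing the table; a sketch using the JSON output (the jq paths .stats.percent_used and .stats.max_avail reflect recent Ceph releases and are an assumption for this exact build):

# Print per-pool user-visible utilization and remaining space from ceph df JSON.
TOOLBOX=$(oc get pods -n openshift-storage -o name | grep tool)
oc rsh -n openshift-storage "$TOOLBOX" ceph df -f json \
  | jq -r '.pools[] | [.name, (.stats.percent_used * 100 | round), .stats.max_avail] | @tsv'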
@fbalak Could you confirm whether the ask here is to include the existing pool alerts, namely CephPoolQuotaBytesNearExhaustion and CephPoolQuotaBytesCriticallyExhausted, as defined here [1], which would let the user know when the pools exceed the threshold limit?

[1]: https://github.com/ceph/ceph-mixins/blob/master/alerts/pool-quota.libsonnet#L7-L38
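For reference, the expressions in [1] only evaluate pools that have a non-zero byte quota set (they divide by ceph_pool_quota_bytes filtered to > 0), so whether they could fire at all can be checked from the toolbox pod; a quick sketch (pod lookup is illustrative):

# On a default deployment this typically prints "max bytes  : N/A" for each pool,
# in which case the quota alerts never evaluate.
TOOLBOX=$(oc get pods -n openshift-storage -o name | grep tool)
for pool in $(oc rsh -n openshift-storage "$TOOLBOX" ceph osd pool ls); do
  oc rsh -n openshift-storage "$TOOLBOX" ceph osd pool get-quota "$pool"
done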
I don't think that would solve the issue. AFAIK there is no pool quota set on the pools used by the default Ceph storage classes, so those quota alerts would never fire. New pool capacity alerts (not quota alerts) that are clearly communicated to users could help here, but the RFE created for that was closed as not needed and confusing: https://bugzilla.redhat.com/show_bug.cgi?id=1870083
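For illustration of what that closed RFE was asking for, a capacity-based (non-quota) pool alert could be sketched roughly as below. This is not something that ships: the rule and alert names, the 85% threshold, and the exact expression are all assumptions, built on the mgr exporter metrics ceph_pool_stored, ceph_pool_max_avail, and ceph_pool_metadata:

# Hypothetical capacity-based pool alert (sketch only; names and threshold assumed).
cat <<'EOF' | oc apply -n openshift-storage -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ceph-pool-capacity-alerts
spec:
  groups:
  - name: pool-capacity
    rules:
    - alert: CephPoolNearFull
      expr: |
        (ceph_pool_stored / (ceph_pool_stored + ceph_pool_max_avail))
          * on (pool_id) group_left(name) ceph_pool_metadata > 0.85
      for: 5m
      labels:
        severity: warning
      annotations:
        description: Pool {{ $labels.name }} is over 85% of its usable capacity.
EOF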