Description of problem:
No emails are sent to the user when the cluster is fully utilized.

Version-Release number of selected component (if applicable):
ocs-osd-deployer.v2.0.8
Dev addons with a new topology change

How reproducible:
1/1

Steps to Reproduce:
1. Deploy a provider with 3 availability zones and size 4 TiB. Set notification emails during installation.
2. Deploy a consumer for the previously created provider. Set notification emails during installation.
3. Create a large PVC on the consumer that uses all available space.
4. Fill the PVC with data (see the sketch under Additional info).

Actual results:
The cluster gets full without any alert:

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph df
--- RAW STORAGE ---
CLASS  SIZE    AVAIL    USED    RAW USED  %RAW USED
ssd    12 TiB  1.8 TiB  10 TiB  10 TiB    85.01
TOTAL  12 TiB  1.8 TiB  10 TiB  10 TiB    85.01

--- POOLS ---
POOL                                                                ID  PGS  STORED   OBJECTS  USED     %USED   MAX AVAIL
device_health_metrics                                                1    1  0 B            0  0 B           0  0 B
ocs-storagecluster-cephfilesystem-metadata                           2   32  30 KiB        22  171 KiB  100.00  0 B
ocs-storagecluster-cephfilesystem-data0                              3  256  0 B            0  0 B           0  0 B
cephblockpool-storageconsumer-9e2ffdfd-72f8-4678-98a4-a9299908a383   4   64  3.4 TiB  893.47k  10 TiB   100.00  0 B
cephblockpool-storageconsumer-6e28afeb-7665-4c91-9900-edaf6b35842b   5   64  19 B           1  12 KiB   100.00  0 B

Expected results:
Capacity utilization alerts should be sent to the emails that were set during addon installation.

Additional info:
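For illustration, a minimal sketch of steps 3-4 (the PVC name, size, storage class, pod name, and mount path below are assumptions for this example, not the exact objects used in the reproduction):

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: capacity-test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ocs-storagecluster-ceph-rbd
  resources:
    requests:
      storage: 4Ti
EOF

# From a pod that mounts the PVC at /mnt/data, write until the volume is full:
$ oc rsh capacity-test-pod dd if=/dev/zero of=/mnt/data/fill bs=1M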
@fbalak
- Would ideally need a must-gather for this.
- Are we sure that it's working without pricing changes?
- Below is my reasoning:
  1. When the pool is full, a Ceph alert should be surfaced to our monitoring stack.
  2. Then Alertmanager will act on this alert.

Based on the description, the unknown here is whether Ceph never triggered an alert or whether our monitoring stack didn't pick up a raised alert.

@dbindra could you also please take a look?
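One way to narrow this down (a sketch; the alertmanager-operated service name is an assumption based on the standard prometheus-operator setup in openshift-storage):

# Confirm Ceph itself reports the full condition:
$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph health detail

# Check whether Alertmanager actually received a corresponding alert:
$ oc -n openshift-storage port-forward svc/alertmanager-operated 9093:9093 &
$ curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname'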
This needs to be retested because it could be affected by https://bugzilla.redhat.com/show_bug.cgi?id=2136854.
Ack, @fbalak can you please re-test and let us know?
Moving to 4.12.z, as the verification would be done against the ODF MS rollout that will be based on ODF 4.12.
Moving to VERIFIED based on regression testing. We will clone this bug to verify the scenario as part of ODF MS testing on ODF 4.12 or with the provider-consumer layout.
Notifications for cluster utilization are not working:

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph df
--- RAW STORAGE ---
CLASS  SIZE    AVAIL    USED    RAW USED  %RAW USED
ssd    12 TiB  1.8 TiB  10 TiB  10 TiB    85.02
TOTAL  12 TiB  1.8 TiB  10 TiB  10 TiB    85.02

--- POOLS ---
POOL                                                                ID  PGS  STORED   OBJECTS  USED     %USED   MAX AVAIL
device_health_metrics                                                1    1  0 B            0  0 B           0  0 B
ocs-storagecluster-cephfilesystem-metadata                           2   32  16 KiB        22  131 KiB  100.00  0 B
ocs-storagecluster-cephfilesystem-data0                              3  256  0 B            0  0 B           0  0 B
cephblockpool-storageconsumer-fddd8f1a-09e4-42fc-be0d-7d70e5f02f79   4   64  3.4 TiB  893.22k  10 TiB   100.00  0 B

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph -s
  cluster:
    id:     9e2ee3a5-53ef-45f3-bbd7-2dc83b07993f
    health: HEALTH_ERR
            3 full osd(s)
            4 pool(s) full

  services:
    mon: 3 daemons, quorum a,b,c (age 5h)
    mgr: a(active, since 5h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 5h), 3 in (since 5h)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 353 pgs
    objects: 893.24k objects, 3.4 TiB
    usage:   10 TiB used, 1.8 TiB / 12 TiB avail
    pgs:     353 active+clean

  io:
    client: 1.2 KiB/s rd, 2 op/s rd, 0 op/s wr

$ rosa describe addon-installation --cluster fbalak03-1-pr --addon ocs-provider-qe
Id:           ocs-provider-qe
Href:         /api/clusters_mgmt/v1/clusters/226dcb9q8ric7euo2o73oo9k3jg73rjq/addons/ocs-provider-qe
Addon state:  ready
Parameters:   "size" : "4"
              "onboarding-validation-key" : (...)
              "notification-email-1" : "fbalak"
              "notification-email-2" : "odf-ms-qe"

Tested with: ocs-osd-deployer.v2.0.11

must-gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/fbalak03-1-pr/fbalak03-1-pr_20230301T100351/logs/testcases_1677687913/
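A related check worth running on this cluster (a sketch; the alert names below are taken from the standard ODF rules and the exact names on the managed service may differ):

# Verify that the capacity alerts are actually deployed as Prometheus rules:
$ oc get prometheusrules -n openshift-storage -o yaml | grep -E 'alert: CephCluster(NearFull|CriticallyFull|ReadOnly)'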
@lgangava Could you provide the latest update on this bug?
@fbalak - would you be able to provide a live cluster to debug this? The reason being I don't have any tooling set up that can saturate the storage. Thanks.
Providing the gist from https://chat.google.com/room/AAAASHA9vWs/G3gxiWeBWV4:
- SendGrid alerts were being sent based on the Raw Storage used reported by the `ceph df` command.
- That raw storage percentage doesn't coincide with the percentage used for a specific pool.
- In this scenario, even if the pool is at 100% (assuming this pool alone uses all the storage), the raw storage percentage doesn't match it.
- Investigation should be done from the ODF/Ceph side.

Thanks.
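For reference, the numbers above are consistent with that gap, assuming 3x replication and the default OSD full ratio of 0.85 (both assumptions, not confirmed from the must-gather): the pool stores 3.4 TiB, which becomes roughly 10 TiB of raw usage after replication, i.e. 10 / 12 ≈ 85% RAW USED. The pool shows %USED 100.00 and MAX AVAIL 0 B because the OSDs have reached the full ratio, not because the raw capacity is exhausted. So any notification keyed to a raw-usage threshold above ~85% would never fire in this layout, even though the cluster is effectively full.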