+++ This bug was initially created as a clone of Bug #2136039 +++

Description of problem:
No emails are sent to the user when the cluster is fully utilized.

Version-Release number of selected component (if applicable):
ocs-osd-deployer.v2.0.8
Dev addons with a new topology change

How reproducible:
1/1

Steps to Reproduce:
1. Deploy a provider with 3 availability zones and size 4 TiB. Set notification emails during installation.
2. Deploy a consumer for the previously created provider. Set notification emails during installation.
3. Create a large PVC on the consumer that uses all available space.
4. Fill the PVC with data.

Actual results:
The cluster gets full without any alert:

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage | grep tool | awk '{print $1}') ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL    USED  RAW USED  %RAW USED
ssd     12 TiB  1.8 TiB  10 TiB    10 TiB      85.01
TOTAL   12 TiB  1.8 TiB  10 TiB    10 TiB      85.01

--- POOLS ---
POOL                                                                ID  PGS   STORED  OBJECTS     USED   %USED  MAX AVAIL
device_health_metrics                                                1    1      0 B        0      0 B       0        0 B
ocs-storagecluster-cephfilesystem-metadata                           2   32   30 KiB       22  171 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-data0                              3  256      0 B        0      0 B       0        0 B
cephblockpool-storageconsumer-9e2ffdfd-72f8-4678-98a4-a9299908a383   4   64  3.4 TiB  893.47k   10 TiB  100.00        0 B
cephblockpool-storageconsumer-6e28afeb-7665-4c91-9900-edaf6b35842b   5   64     19 B        1   12 KiB  100.00        0 B

Expected results:
Capacity utilization alerts should be sent to the email addresses that were set during addon installation.

Additional info:

--- Additional comment from Leela Venkaiah Gangavarapu on 2022-10-27 13:33:38 UTC ---

@fbalak
- We would ideally need a must-gather for this.
- Are we sure that it works without the pricing changes?
- Below is my reasoning:
  1. When the pool is full, a Ceph alert should be surfaced to our monitoring stack.
  2. The alertmanager will then act on it.

Based on the description, the unknown here is whether Ceph triggered an alert or our monitoring stack didn't pick up a raised alert.
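[Editor's note: the two-stage reasoning above (Ceph raises the alert, the monitoring stack picks it up, the alertmanager acts on it) can be checked stage by stage. A minimal sketch, assuming tools-pod and Prometheus access in the openshift-storage namespace; the pod names are placeholders and the alert names in the example are illustrative, not confirmed ODF rule names:]

```shell
# Stage 1: did Ceph itself report the full condition?
#   oc rsh -n openshift-storage <tools-pod> ceph health detail
#
# Stage 2: is a capacity alert firing in the in-cluster Prometheus?
#   oc exec -n openshift-storage <prometheus-pod> -- \
#     curl -s 'http://localhost:9090/api/v1/query?query=ALERTS{alertstate="firing"}'

# Helper: does a given alert name appear in a newline-separated list of
# firing alert names extracted from the stage-2 query result?
alert_firing() {
  printf '%s\n' "$1" | grep -qx "$2"
}

# Example with a hypothetical firing list:
firing="CephClusterErrorState
CephOSDCriticallyFull"
alert_firing "$firing" "CephOSDCriticallyFull" && echo "alert reached Prometheus"
```

[If stage 1 shows HEALTH_ERR but stage 2 lists no capacity alert, the gap is between Ceph and Prometheus; if the alert fires but no email arrives, the gap is in the alertmanager/notification path.]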
@dbindra, could you also please take a look?

--- Additional comment from Filip Balák on 2022-10-27 13:45:52 UTC ---

This needs to be retested because it could be affected by https://bugzilla.redhat.com/show_bug.cgi?id=2136854.

--- Additional comment from Dhruv Bindra on 2022-10-27 14:07:34 UTC ---

Ack. @fbalak, can you please retest and let us know?

--- Additional comment from Red Hat Bugzilla on 2022-12-31 19:29:35 UTC ---

remove performed by PnT Account Manager <pnt-expunge>

--- Additional comment from Red Hat Bugzilla on 2022-12-31 22:31:45 UTC ---

remove performed by PnT Account Manager <pnt-expunge>

--- Additional comment from Red Hat Bugzilla on 2022-12-31 22:37:25 UTC ---

remove performed by PnT Account Manager <pnt-expunge>

--- Additional comment from Red Hat Bugzilla on 2022-12-31 23:27:25 UTC ---

remove performed by PnT Account Manager <pnt-expunge>

--- Additional comment from Red Hat Bugzilla on 2023-01-01 08:33:12 UTC ---

Account disabled by LDAP Audit

--- Additional comment from Elad on 2023-01-17 12:41:35 UTC ---

Moving to 4.12.z, as the verification will be done against the ODF MS rollout that is based on ODF 4.12.

--- Additional comment from Elad on 2023-01-17 13:18:01 UTC ---

Moving to VERIFIED based on regression testing. We will clone this bug for the sake of verifying the scenario as part of ODF MS testing over ODF 4.12 or with the provider-consumer layout.

--- Additional comment from Neha Berry on 2023-01-17 13:26:45 UTC ---

(In reply to Elad from comment #10)
> Moving to VERIFIED based on regression testing.
> We will clone this bug for the sake of verifying the scenario as part of ODF
> MS testing over ODF 4.12 or with the provider-consumer layout

After discussing with Filip and Elad, we are instead moving this out of 4.12, as the fix is not in the product but in the deployer build, and it is not tied to any ODF version.

--- Additional comment from Filip Balák on 2023-03-01 16:44:09 UTC ---

Notifications for cluster utilization are not working:

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage | grep tool | awk '{print $1}') ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL    USED  RAW USED  %RAW USED
ssd     12 TiB  1.8 TiB  10 TiB    10 TiB      85.02
TOTAL   12 TiB  1.8 TiB  10 TiB    10 TiB      85.02

--- POOLS ---
POOL                                                                ID  PGS   STORED  OBJECTS     USED   %USED  MAX AVAIL
device_health_metrics                                                1    1      0 B        0      0 B       0        0 B
ocs-storagecluster-cephfilesystem-metadata                           2   32   16 KiB       22  131 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-data0                              3  256      0 B        0      0 B       0        0 B
cephblockpool-storageconsumer-fddd8f1a-09e4-42fc-be0d-7d70e5f02f79   4   64  3.4 TiB  893.22k   10 TiB  100.00        0 B

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage | grep tool | awk '{print $1}') ceph -s
  cluster:
    id:     9e2ee3a5-53ef-45f3-bbd7-2dc83b07993f
    health: HEALTH_ERR
            3 full osd(s)
            4 pool(s) full

  services:
    mon: 3 daemons, quorum a,b,c (age 5h)
    mgr: a (active, since 5h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 5h), 3 in (since 5h)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 353 pgs
    objects: 893.24k objects, 3.4 TiB
    usage:   10 TiB used, 1.8 TiB / 12 TiB avail
    pgs:     353 active+clean

  io:
    client: 1.2 KiB/s rd, 2 op/s rd, 0 op/s wr

$ rosa describe addon-installation --cluster fbalak03-1-pr --addon ocs-provider-qe
Id:           ocs-provider-qe
Href:         /api/clusters_mgmt/v1/clusters/226dcb9q8ric7euo2o73oo9k3jg73rjq/addons/ocs-provider-qe
Addon state:  ready
Parameters:
  "size" : "4"
  "onboarding-validation-key" : (...)
"notification-email-1" : "fbalak" "notification-email-2" : "odf-ms-qe" Tested with: ocs-osd-deployer.v2.0.11 must-gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/fbalak03-1-pr/fbalak03-1-pr_20230301T100351/logs/testcases_1677687913/ --- Additional comment from Rewant on 2023-07-03 11:59:36 UTC --- @lgangava Could you provide the latest update on this bug? --- Additional comment from Leela Venkaiah Gangavarapu on 2023-07-04 03:10:32 UTC --- @fbalak - would you be able to provide a live cluster to debug this? - the reason being I don't have any tools setup which can saturate the storage thanks. --- Additional comment from Leela Venkaiah Gangavarapu on 2023-07-06 12:14:08 UTC --- Providing gist from https://chat.google.com/room/AAAASHA9vWs/G3gxiWeBWV4 - Send Grid alerts were being sent based on Raw Storage used from `ceph df` command - This raw storage %age isn't coinciding with the %age used for a specific pool - In this scenario, even if pool is at 100% (assuming this pool solely uses all storage), raw storage isn't matching with it - Investigation should be done from ODF/Ceph side thanks. --- Additional comment from Red Hat Bugzilla on 2023-08-03 08:28:08 UTC --- Account disabled by LDAP Audit
*** This bug has been marked as a duplicate of bug 2136039 ***