Bug 2236148 - No SendGrid emails are sent when the cluster is fully utilized
Summary: No SendGrid emails are sent when the cluster is fully utilized
Keywords:
Status: CLOSED DUPLICATE of bug 2136039
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Assignee: Juan Miguel Olmo
QA Contact: Harish NV Rao
URL:
Whiteboard:
Depends On: 2136039
Blocks:
 
Reported: 2023-08-30 12:45 UTC by Filip Balák
Modified: 2024-02-28 11:46 UTC (History)
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2136039
Environment:
Last Closed: 2024-02-28 11:46:36 UTC
Embargoed:



Description Filip Balák 2023-08-30 12:45:06 UTC
+++ This bug was initially created as a clone of Bug #2136039 +++

Description of problem:
No emails are sent to the user when the cluster is fully utilized.

Version-Release number of selected component (if applicable):
ocs-osd-deployer.v2.0.8
Dev addons with a new topology change

How reproducible:
1/1

Steps to Reproduce:
1. Deploy a provider with 3 availability zones and size 4 TiB. Set notification emails during installation.
2. Deploy a consumer for the previously created provider. Set notification emails during installation.
3. Create a large PVC on the consumer that uses all available space.
4. Fill the PVC with data (a sketch of steps 3-4 follows below).
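
For illustration, a minimal sketch of steps 3-4 (the namespace, PVC name, storage class, image, and requested size below are assumptions; adjust them to the consumer cluster being tested):

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fill-cluster
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ocs-storagecluster-ceph-rbd
  resources:
    requests:
      storage: 3500Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: fill-writer
  namespace: default
spec:
  restartPolicy: Never
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: fill-cluster
  containers:
    - name: writer
      image: registry.access.redhat.com/ubi8/ubi
      # Write zeros until the volume (and the backing pool) runs out of space.
      command: ["sh", "-c", "dd if=/dev/zero of=/data/fill bs=1M; sleep infinity"]
      volumeMounts:
        - name: data
          mountPath: /data
EOF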

Actual results:
The cluster gets full without any alert being sent:
$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph df
--- RAW STORAGE ---
CLASS    SIZE    AVAIL    USED  RAW USED  %RAW USED
ssd    12 TiB  1.8 TiB  10 TiB    10 TiB      85.01
TOTAL  12 TiB  1.8 TiB  10 TiB    10 TiB      85.01
 
--- POOLS ---
POOL                                                                ID  PGS   STORED  OBJECTS     USED   %USED  MAX AVAIL
device_health_metrics                                                1    1      0 B        0      0 B       0        0 B
ocs-storagecluster-cephfilesystem-metadata                           2   32   30 KiB       22  171 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-data0                              3  256      0 B        0      0 B       0        0 B
cephblockpool-storageconsumer-9e2ffdfd-72f8-4678-98a4-a9299908a383   4   64  3.4 TiB  893.47k   10 TiB  100.00        0 B
cephblockpool-storageconsumer-6e28afeb-7665-4c91-9900-edaf6b35842b   5   64     19 B        1   12 KiB  100.00        0 B

Expected results:
Capacity utilization alerts should be sent to the email addresses that were set during addon installation.

Additional info:

--- Additional comment from Leela Venkaiah Gangavarapu on 2022-10-27 13:33:38 UTC ---

@fbalak

- We would ideally need a must-gather for this.
- Are we sure that it's working without the pricing changes?
- Below is my reasoning:
1. When the pool is full, a Ceph alert should be surfaced to our monitoring stack.
2. Alertmanager should then act on it.

Based on the description, the unknown here is whether Ceph never triggered an alert or whether our monitoring stack didn't pick up a raised alert.
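
One way to narrow that down (a sketch; the exact alert names depend on the ODF version) is to check whether Ceph itself reports the full condition and whether the corresponding capacity alert rules exist in the monitoring stack:

# Does Ceph itself report the full/near-full condition? (run against the toolbox pod)
$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph health detail

# Are capacity alert rules defined in openshift-storage? (alert names shown are illustrative)
$ oc get prometheusrules -n openshift-storage -o yaml | grep -iE 'NearFull|CriticallyFull'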

@dbindra, could you also please take a look?

--- Additional comment from Filip Balák on 2022-10-27 13:45:52 UTC ---

This needs to be retested because it could be affected by https://bugzilla.redhat.com/show_bug.cgi?id=2136854.

--- Additional comment from Dhruv Bindra on 2022-10-27 14:07:34 UTC ---

Ack, @fbalak can you please re-test and let us know?

--- Additional comment from Red Hat Bugzilla on 2022-12-31 19:29:35 UTC ---

remove performed by PnT Account Manager <pnt-expunge>

--- Additional comment from Red Hat Bugzilla on 2022-12-31 22:31:45 UTC ---

remove performed by PnT Account Manager <pnt-expunge>

--- Additional comment from Red Hat Bugzilla on 2022-12-31 22:37:25 UTC ---

remove performed by PnT Account Manager <pnt-expunge>

--- Additional comment from Red Hat Bugzilla on 2022-12-31 23:27:25 UTC ---

remove performed by PnT Account Manager <pnt-expunge>

--- Additional comment from Red Hat Bugzilla on 2023-01-01 08:33:12 UTC ---

Account disabled by LDAP Audit

--- Additional comment from Elad on 2023-01-17 12:41:35 UTC ---

Moving to 4.12.z, as the verification would be done against the ODF MS rollout that will be based on ODF 4.12.

--- Additional comment from Elad on 2023-01-17 13:18:01 UTC ---

Moving to VERIFIED based on regression testing.
We will clone this bug for the sake of verifying the scenario as part of ODF MS testing over ODF 4.12 or with the provider-consumer layout

--- Additional comment from Neha Berry on 2023-01-17 13:26:45 UTC ---

(In reply to Elad from comment #10)
> Moving to VERIFIED based on regression testing.
> We will clone this bug for the sake of verifying the scenario as part of ODF
> MS testing over ODF 4.12 or with the provider-consumer layout

After discussing with Filip and Elad, we are instead moving this out of 4.12, as the fix is not in the product but in the deployer build, and it is not tied to any ODF version.

--- Additional comment from Filip Balák on 2023-03-01 16:44:09 UTC ---

Notifications for cluster utilization are not working:

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph df
--- RAW STORAGE ---
CLASS    SIZE    AVAIL    USED  RAW USED  %RAW USED
ssd    12 TiB  1.8 TiB  10 TiB    10 TiB      85.02
TOTAL  12 TiB  1.8 TiB  10 TiB    10 TiB      85.02
 
--- POOLS ---
POOL                                                                ID  PGS   STORED  OBJECTS     USED   %USED  MAX AVAIL
device_health_metrics                                                1    1      0 B        0      0 B       0        0 B
ocs-storagecluster-cephfilesystem-metadata                           2   32   16 KiB       22  131 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-data0                              3  256      0 B        0      0 B       0        0 B
cephblockpool-storageconsumer-fddd8f1a-09e4-42fc-be0d-7d70e5f02f79   4   64  3.4 TiB  893.22k   10 TiB  100.00        0 B

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph -s
  cluster:
    id:     9e2ee3a5-53ef-45f3-bbd7-2dc83b07993f
    health: HEALTH_ERR
            3 full osd(s)
            4 pool(s) full
 
  services:
    mon: 3 daemons, quorum a,b,c (age 5h)
    mgr: a(active, since 5h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 5h), 3 in (since 5h)
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 353 pgs
    objects: 893.24k objects, 3.4 TiB
    usage:   10 TiB used, 1.8 TiB / 12 TiB avail
    pgs:     353 active+clean
 
  io:
    client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr

$ rosa describe addon-installation --cluster fbalak03-1-pr --addon ocs-provider-qe
Id:                          ocs-provider-qe
Href:                        /api/clusters_mgmt/v1/clusters/226dcb9q8ric7euo2o73oo9k3jg73rjq/addons/ocs-provider-qe
Addon state:                 ready
Parameters:
	"size"                      : "4"
	"onboarding-validation-key" : (...)
	"notification-email-1"      : "fbalak"
	"notification-email-2"      : "odf-ms-qe"


Tested with:
ocs-osd-deployer.v2.0.11

must-gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/fbalak03-1-pr/fbalak03-1-pr_20230301T100351/logs/testcases_1677687913/

--- Additional comment from Rewant on 2023-07-03 11:59:36 UTC ---

@lgangava Could you provide the latest update on this bug?

--- Additional comment from Leela Venkaiah Gangavarapu on 2023-07-04 03:10:32 UTC ---

@fbalak 
- Would you be able to provide a live cluster to debug this?
- The reason being I don't have any tooling set up that can saturate the storage.

thanks.

--- Additional comment from Leela Venkaiah Gangavarapu on 2023-07-06 12:14:08 UTC ---

Providing the gist from https://chat.google.com/room/AAAASHA9vWs/G3gxiWeBWV4:
- SendGrid alerts were being sent based on the raw storage used from the `ceph df` command.
- This raw storage percentage does not coincide with the percentage used for a specific pool (see the comparison sketch below).
- In this scenario, even though a pool is at 100% (assuming this pool alone uses all the storage), the raw storage percentage does not match it.
- Investigation should be done from the ODF/Ceph side.

thanks.
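
For illustration, a minimal sketch of how the raw-storage percentage and the per-pool percentages could be compared from the JSON output of ceph df (the jq paths are assumptions based on recent Ceph releases and may differ):

# Dump the JSON from the toolbox pod, then compare the ratios locally with jq.
$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph df --format json > ceph_df.json
$ jq '{raw_used_ratio: .stats.total_used_raw_ratio, pools: [.pools[] | {name, percent_used: .stats.percent_used}]}' ceph_df.json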

--- Additional comment from Red Hat Bugzilla on 2023-08-03 08:28:08 UTC ---

Account disabled by LDAP Audit

Comment 3 Juan Miguel Olmo 2024-02-28 11:46:36 UTC

*** This bug has been marked as a duplicate of bug 2136039 ***

