Bug 2136039 - No sendgrid emails are sent when cluster is fully utilized [NEEDINFO]
Summary: No sendgrid emails are sent when cluster is fully utilized
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-managed-service
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Leela Venkaiah Gangavarapu
QA Contact: Filip Balák
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-10-19 07:34 UTC by Filip Balák
Modified: 2023-08-09 17:00 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
lgangava: needinfo? (fbalak)



Description Filip Balák 2022-10-19 07:34:11 UTC
Description of problem:
No emails are sent to the user when the cluster is fully utilized.

Version-Release number of selected component (if applicable):
ocs-osd-deployer.v2.0.8
Dev addons with a new topology change

How reproducible:
1/1

Steps to Reproduce:
1. Deploy a provider with 3 availability zones and size 4 TiB. Set notification emails during installation.
2. Deploy a consumer for the previously created provider. Set notification emails during installation.
3. Create a large PVC on the consumer that uses all available space.
4. Fill the PVC with data (a sketch of steps 3-4 follows below).
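
A minimal sketch for steps 3-4 (PVC name, pod name, size, and storage class are illustrative assumptions, not taken from the actual run):

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fill-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 4Ti
  storageClassName: ocs-storagecluster-ceph-rbd
---
apiVersion: v1
kind: Pod
metadata:
  name: fill-pod
spec:
  containers:
  - name: filler
    image: registry.access.redhat.com/ubi8/ubi-minimal
    # dd writes zeros until the volume returns ENOSPC, i.e. the PVC is full
    command: ["sh", "-c", "dd if=/dev/zero of=/data/fill bs=4M; sleep infinity"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: fill-pvc
EOF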

Actual results:
The cluster gets full without any alert:
$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph df
--- RAW STORAGE ---
CLASS    SIZE    AVAIL    USED  RAW USED  %RAW USED
ssd    12 TiB  1.8 TiB  10 TiB    10 TiB      85.01
TOTAL  12 TiB  1.8 TiB  10 TiB    10 TiB      85.01
 
--- POOLS ---
POOL                                                                ID  PGS   STORED  OBJECTS     USED   %USED  MAX AVAIL
device_health_metrics                                                1    1      0 B        0      0 B       0        0 B
ocs-storagecluster-cephfilesystem-metadata                           2   32   30 KiB       22  171 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-data0                              3  256      0 B        0      0 B       0        0 B
cephblockpool-storageconsumer-9e2ffdfd-72f8-4678-98a4-a9299908a383   4   64  3.4 TiB  893.47k   10 TiB  100.00        0 B
cephblockpool-storageconsumer-6e28afeb-7665-4c91-9900-edaf6b35842b   5   64     19 B        1   12 KiB  100.00        0 B

Expected results:
Capacity-utilization alerts should be sent to the email addresses that were set during addon installation.

Additional info:

Comment 1 Leela Venkaiah Gangavarapu 2022-10-27 13:33:38 UTC
@fbalak

- We would ideally need a must-gather for this.
- Are we sure that it's working without the pricing changes?
- Below is my reasoning:
1. When the pool is full, a Ceph alert should be surfaced to our monitoring stack.
2. The Alertmanager will then act on it.

Based on the description, the unknown here is whether Ceph didn't trigger an alert or whether our monitoring stack didn't pick up a raised alert.
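
A way to narrow that down (a sketch; the Alertmanager route host is a placeholder, and the exact auth setup may differ in the managed service):

# 1. Check whether Ceph itself raised the alert:
$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph health detail
# 2. Check whether the alert is active in the monitoring stack:
$ oc -n openshift-storage get route
$ curl -sk -H "Authorization: Bearer $(oc whoami -t)" \
    https://<alertmanager-route-host>/api/v2/alerts | jq '.[].labels.alertname'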

@dbindra, could you also please take a look?

Comment 2 Filip Balák 2022-10-27 13:45:52 UTC
This needs to be retested because it could be affected by https://bugzilla.redhat.com/show_bug.cgi?id=2136854.

Comment 3 Dhruv Bindra 2022-10-27 14:07:34 UTC
Ack. @fbalak, can you please re-test and let us know?

Comment 9 Elad 2023-01-17 12:41:35 UTC
Moving to 4.12.z, as verification will be done against the ODF MS rollout based on ODF 4.12.

Comment 10 Elad 2023-01-17 13:18:01 UTC
Moving to VERIFIED based on regression testing.
We will clone this bug to verify the scenario as part of ODF MS testing over ODF 4.12 or with the provider-consumer layout.

Comment 12 Filip Balák 2023-03-01 16:44:09 UTC
Notifications for cluster utilization are not working:

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph df
--- RAW STORAGE ---
CLASS    SIZE    AVAIL    USED  RAW USED  %RAW USED
ssd    12 TiB  1.8 TiB  10 TiB    10 TiB      85.02
TOTAL  12 TiB  1.8 TiB  10 TiB    10 TiB      85.02
 
--- POOLS ---
POOL                                                                ID  PGS   STORED  OBJECTS     USED   %USED  MAX AVAIL
device_health_metrics                                                1    1      0 B        0      0 B       0        0 B
ocs-storagecluster-cephfilesystem-metadata                           2   32   16 KiB       22  131 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-data0                              3  256      0 B        0      0 B       0        0 B
cephblockpool-storageconsumer-fddd8f1a-09e4-42fc-be0d-7d70e5f02f79   4   64  3.4 TiB  893.22k   10 TiB  100.00        0 B

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph -s
  cluster:
    id:     9e2ee3a5-53ef-45f3-bbd7-2dc83b07993f
    health: HEALTH_ERR
            3 full osd(s)
            4 pool(s) full
 
  services:
    mon: 3 daemons, quorum a,b,c (age 5h)
    mgr: a(active, since 5h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 5h), 3 in (since 5h)
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 353 pgs
    objects: 893.24k objects, 3.4 TiB
    usage:   10 TiB used, 1.8 TiB / 12 TiB avail
    pgs:     353 active+clean
 
  io:
    client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr

$ rosa describe addon-installation --cluster fbalak03-1-pr --addon ocs-provider-qe
Id:                          ocs-provider-qe
Href:                        /api/clusters_mgmt/v1/clusters/226dcb9q8ric7euo2o73oo9k3jg73rjq/addons/ocs-provider-qe
Addon state:                 ready
Parameters:
	"size"                      : "4"
	"onboarding-validation-key" : (...)
	"notification-email-1"      : "fbalak"
	"notification-email-2"      : "odf-ms-qe"


Tested with:
ocs-osd-deployer.v2.0.11

must-gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/fbalak03-1-pr/fbalak03-1-pr_20230301T100351/logs/testcases_1677687913/

Comment 13 Rewant 2023-07-03 11:59:36 UTC
@lgangava Could you provide the latest update on this bug?

Comment 14 Leela Venkaiah Gangavarapu 2023-07-04 03:10:32 UTC
@fbalak 
- Would you be able to provide a live cluster to debug this?
- The reason being that I don't have any tooling set up that can saturate the storage.

thanks.

Comment 15 Leela Venkaiah Gangavarapu 2023-07-06 12:14:08 UTC
Providing the gist from https://chat.google.com/room/AAAASHA9vWs/G3gxiWeBWV4:
- SendGrid alerts were being sent based on the raw storage used from the `ceph df` command.
- This raw-storage percentage doesn't coincide with the percentage used by a specific pool.
- In this scenario, even when a pool is at 100% (and this pool effectively uses all the storage), the raw-storage percentage doesn't match it.
- Investigation should be done from the ODF/Ceph side (see the check below).
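
One observation worth checking (an assumption to verify, not a confirmed root cause): in both `ceph df` outputs above, %RAW USED stops at ~85% while the pool already reports 100% used, which is consistent with client writes being blocked once the OSD full ratio is reached. If the SendGrid threshold is evaluated against raw usage above that value, it can never fire. The configured ratios can be read from the toolbox pod:

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph osd dump | grep -i ratio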

thanks.

