2236143 – CephClusterReadOnly alert is not getting triggered

Bug 2236143 - CephClusterReadOnly alert is not getting triggered

Summary: CephClusterReadOnly alert is not getting triggered

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	ceph-monitoring
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Divyansh Kamboj
QA Contact:	Harish NV Rao
Docs Contact:
URL:
Whiteboard:
Depends On:	2084014 2084541
Blocks:
TreeView+	depends on / blocked

Reported:	2023-08-30 12:44 UTC by Filip Balák
Modified:	2024-02-05 02:22 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	2084541
Environment:
Last Closed:	2024-01-11 10:38:59 UTC
Embargoed:

Attachments	(Terms of Use)

Description Filip Balák 2023-08-30 12:44:02 UTC

+++ This bug was initially created as a clone of Bug #2084541 +++

Description of problem:
There is no CephClusterReadOnly alert triggered when cluster reaches 85% of its utilization

Version-Release number of selected component (if applicable):
ocs-operator.v4.10.0
OCP 4.10.8

How reproducible:
1/1

Steps to Reproduce:
1. Deploy provider and consumer with 4 TiB cluster on ROSA (this is not reproducible on larger clusters: https://bugzilla.redhat.com/show_bug.cgi?id=2084014)
2. Set notification emails during deployment.
3. Fully utilize cluster capacity.
4. Check email.

Actual results:
There is an email:

Your storage cluster utilization has crossed 80% and will become read-only at 85% utilized! Please free up some space or if possible expand the storage cluster immediately to prevent any service access issues. It is common to also be alerted to OSD devices entering near-full or full states prior to this alert.

That was received on reaching 80% of utilized capacity. There is no notification that cluster is read only after reaching 85%.

Expected results:
User should be notified that the cluster is read only.

Additional info:
Output of ceph df command on fully utilized cluster:

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph df
--- RAW STORAGE ---
CLASS    SIZE    AVAIL    USED  RAW USED  %RAW USED
ssd    12 TiB  1.8 TiB  10 TiB    10 TiB      85.01
TOTAL  12 TiB  1.8 TiB  10 TiB    10 TiB      85.01
 
--- POOLS ---
POOL                                                                ID  PGS   STORED  OBJECTS     USED   %USED  MAX AVAIL
device_health_metrics                                                1    1   15 KiB        6   46 KiB  100.00        0 B
ocs-storagecluster-cephblockpool                                     2   32     19 B        1   12 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-metadata                           3   32   18 KiB       22  138 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-data0                              4   32      0 B        0      0 B       0        0 B
cephblockpool-storageconsumer-1318e613-2b6e-45e1-81e2-b25f67221e47   5   32  3.4 TiB  892.69k   10 TiB  100.00        0 B

--- Additional comment from Red Hat Bugzilla on 2022-08-05 19:09:06 UTC ---

remove performed by PnT Account Manager <pnt-expunge>

--- Additional comment from Red Hat Bugzilla on 2022-08-05 19:09:34 UTC ---

remove performed by PnT Account Manager <pnt-expunge>

--- Additional comment from Red Hat Bugzilla on 2022-12-31 19:29:35 UTC ---

remove performed by PnT Account Manager <pnt-expunge>

--- Additional comment from Red Hat Bugzilla on 2022-12-31 19:49:58 UTC ---

remove performed by PnT Account Manager <pnt-expunge>

--- Additional comment from Red Hat Bugzilla on 2022-12-31 19:50:01 UTC ---

remove performed by PnT Account Manager <pnt-expunge>

--- Additional comment from Red Hat Bugzilla on 2022-12-31 22:31:45 UTC ---

remove performed by PnT Account Manager <pnt-expunge>

--- Additional comment from Red Hat Bugzilla on 2022-12-31 23:27:25 UTC ---

remove performed by PnT Account Manager <pnt-expunge>

--- Additional comment from Dhruv Bindra on 2023-01-20 09:49:53 UTC ---

Try it on the latest build

--- Additional comment from Filip Balák on 2023-03-02 14:16:15 UTC ---

Notifications for cluster utilization including CephClusterReadOnly are not working:

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph df
--- RAW STORAGE ---
CLASS    SIZE    AVAIL    USED  RAW USED  %RAW USED
ssd    12 TiB  1.8 TiB  10 TiB    10 TiB      85.02
TOTAL  12 TiB  1.8 TiB  10 TiB    10 TiB      85.02
 
--- POOLS ---
POOL                                                                ID  PGS   STORED  OBJECTS     USED   %USED  MAX AVAIL
device_health_metrics                                                1    1      0 B        0      0 B       0        0 B
ocs-storagecluster-cephfilesystem-metadata                           2   32   16 KiB       22  131 KiB  100.00        0 B
ocs-storagecluster-cephfilesystem-data0                              3  256      0 B        0      0 B       0        0 B
cephblockpool-storageconsumer-fddd8f1a-09e4-42fc-be0d-7d70e5f02f79   4   64  3.4 TiB  893.22k   10 TiB  100.00        0 B

$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph -s
  cluster:
    id:     9e2ee3a5-53ef-45f3-bbd7-2dc83b07993f
    health: HEALTH_ERR
            3 full osd(s)
            4 pool(s) full
 
  services:
    mon: 3 daemons, quorum a,b,c (age 5h)
    mgr: a(active, since 5h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 5h), 3 in (since 5h)
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 353 pgs
    objects: 893.24k objects, 3.4 TiB
    usage:   10 TiB used, 1.8 TiB / 12 TiB avail
    pgs:     353 active+clean
 
  io:
    client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr

$ rosa describe addon-installation --cluster fbalak03-1-pr --addon ocs-provider-qe
Id:                          ocs-provider-qe
Href:                        /api/clusters_mgmt/v1/clusters/226dcb9q8ric7euo2o73oo9k3jg73rjq/addons/ocs-provider-qe
Addon state:                 ready
Parameters:
	"size"                      : "4"
	"onboarding-validation-key" : (...)
	"notification-email-1"      : "fbalak"
	"notification-email-2"      : "odf-ms-qe"


Tested with:
ocs-osd-deployer.v2.0.11

must-gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/fbalak03-1-pr/fbalak03-1-pr_20230301T100351/logs/testcases_1677687913/

--- Additional comment from Rewant on 2023-07-03 11:04:29 UTC ---

@kmajumde can you please provide the latest update?

--- Additional comment from Red Hat Bugzilla on 2023-08-03 08:28:29 UTC ---

Account disabled by LDAP Audit

Comment 4 Divyansh Kamboj 2023-09-19 07:29:17 UTC

@fbalak I tried this out on a 4.10.14 ODF, filled up the cluster (1.5TB) up to 85% and I'm getting all the alerts required, `CephClusterCriticallyFull`, `CephClusterNearFull`, and `CephClusterReadOnly`

Can you provide a cluster where this issue is reproducible?

Comment 5 Divyansh Kamboj 2023-11-13 08:23:17 UTC

as it's not reproducible, moving it down to medium

Comment 6 Divyansh Kamboj 2024-01-11 10:38:59 UTC

@fbalak any updates? closing for now as it's not reproducible on clusters I have tested

Note You need to log in before you can comment on or make changes to this bug.