Bug 2215239 - The CephOSDCriticallyFull and CephOSDNearFull alerts are not firing when reaching the ceph OSD full ratios
Summary: The CephOSDCriticallyFull and CephOSDNearFull alerts are not firing when reac...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.14.0
Assignee: arun kumar mohan
QA Contact: Vishakha Kathole
URL:
Whiteboard:
Depends On: 2217817
Blocks: 2154341 2244409
 
Reported: 2023-06-15 06:55 UTC by Prasad Desala
Modified: 2024-03-08 04:25 UTC (History)
CC List: 8 users

Fixed In Version: 4.14.0-126
Doc Type: Known Issue
Doc Text:
The `CephOSDCriticallyFull` and `CephOSDNearFull` alerts do not fire as expected because the `ceph_daemon` value format has changed in the Ceph-provided metrics, while these alerts still rely on the old value format.
Clone Of:
Environment:
Last Closed: 2023-11-08 18:51:26 UTC
Embargoed:
kbg: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph pull 52084 0 None open exporter: ceph-exporter scrapes failing on multi-homed server 2023-06-27 06:18:35 UTC
Github red-hat-storage ocs-operator pull 2081 0 None open Fix alerts not firing issue 2023-06-20 18:51:48 UTC
Github red-hat-storage ocs-operator pull 2172 0 None open Bug 2215239: [release-4.14] Fix alerts not firing issue 2023-09-04 02:35:24 UTC

Description Prasad Desala 2023-06-15 06:55:30 UTC
Description of problem (please be as detailed as possible and provide log snippets):
=======================================================================
On an ODF 4.13 cluster with the following cluster-level parameters enabled:
FIPS
Hugepages
KMS - vault
Cluster-wide encryption
Encryption in transit

The CephOSDCriticallyFull and CephOSDNearFull alerts are not triggered even though the OSDs have reached 85.05% utilization.
Please note that the CephClusterNearFull and CephClusterCriticallyFull alerts are firing.

11:08:06 - MainThread - tests.e2e.system_test.test_cluster_full_and_recovery - INFO  - osd utilization: {'osd.1': 85.05687686796765, 'osd.2': 85.05248164996483, 'osd.0': 85.05290896282622}

prasad:alerts$ oc rsh -n openshift-storage rook-ceph-tools-75bc769bdd-677cv
sh-5.1$ ceph -s
  cluster:
    id:     a8434717-c354-401c-bc64-7dc6c9b15e28
    health: HEALTH_ERR
            3 full osd(s)
            12 pool(s) full
 
  services:
    mon: 3 daemons, quorum a,b,c (age 37h)
    mgr: a(active, since 37h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 37h), 3 in (since 37h)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 23.11k objects, 87 GiB
    usage:   255 GiB used, 45 GiB / 300 GiB avail
    pgs:     169 active+clean
 
  io:
    client:   852 B/s rd, 1 op/s rd, 0 op/s wr
 
sh-5.1$ ceph osd status
ID  HOST        USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE           
 0  compute-1  85.0G  14.9G      0        0       2        0   exists,full,up  
 1  compute-2  85.0G  14.9G      0        0       2        0   exists,full,up  
 2  compute-0  85.0G  14.9G      0        0       4      106   exists,full,up  
sh-5.1$ ceph osd df    
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL   %USE   VAR   PGS  STATUS
 1    hdd  0.09760   1.00000  100 GiB   85 GiB   85 GiB   84 KiB  528 MiB  15 GiB  85.06  1.00  169      up
 2    hdd  0.09760   1.00000  100 GiB   85 GiB   85 GiB   84 KiB  524 MiB  15 GiB  85.05  1.00  169      up
 0    hdd  0.09760   1.00000  100 GiB   85 GiB   85 GiB   84 KiB  524 MiB  15 GiB  85.05  1.00  169      up
                       TOTAL  300 GiB  255 GiB  254 GiB  255 KiB  1.5 GiB  45 GiB  85.05                   
MIN/MAX VAR: 1.00/1.00  STDDEV: 0.00

Alerts currently firing (from the console):

CephClusterCriticallyFull (Critical, Firing since Jun 15, 2023, 11:04 AM):
Storage cluster utilization has crossed 80% and will become read-only at 85%. Free up some space or expand the storage cluster immediately.

CephClusterNearFull (Warning, Firing since Jun 15, 2023, 11:01 AM):
Storage cluster utilization has crossed 75% and will become read-only at 85%. Free up some space or expand the storage cluster.


Version of all relevant components (if applicable):
OCP version - 4.13.0-0.nightly-2023-06-13-070743
ODF version - 4.13.0-218

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Always

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
On 4.12, the CephOSDCriticallyFull and CephOSDNearFull alerts fire when the Ceph OSDs reach the full ratios.

Steps to Reproduce:
===================
Manual steps
1) Create an ODF cluster
2) Fill the OSDs to 85% and check for the CephOSDCriticallyFull and CephOSDNearFull alerts (a query sketch for watching per-OSD utilization follows the automation note below)

automation:
Run system test - https://github.com/red-hat-storage/ocs-ci/blob/master/tests/e2e/system_test/test_cluster_full_and_recovery.py
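
While filling, the per-OSD utilization (the condition these alerts evaluate) can be watched from the OpenShift console under Observe > Metrics, or any Prometheus UI, with a query along these lines. This is an illustrative sketch only, not the exact expression used by the shipped alert rules:

# Per-OSD used fraction; CephOSDNearFull/CephOSDCriticallyFull should fire
# once this crosses their respective thresholds.
ceph_osd_stat_bytes_used / ceph_osd_stat_bytes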


Actual results:
===============
The CephOSDCriticallyFull and CephOSDNearFull alerts are not firing when the Ceph OSD full ratios are reached.


Expected results:
=================
The CephOSDCriticallyFull and CephOSDNearFull alerts should fire when the Ceph OSD full ratios are reached.

Comment 5 arun kumar mohan 2023-06-16 09:32:26 UTC
Did an initial investigation on the QE cluster (thanks to Prasad).

Both alerts (CephOSDCriticallyFull and CephOSDNearFull) use the following metrics

ceph_osd_metadata
ceph_osd_stat_bytes_used
ceph_osd_stat_bytes

Running each of the above metric queries individually returns 'null' (value: 'None') results, so the alert query does not produce a definite value.
Yet to figure out why the metrics return null values. Triaging...
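
For reference, the general shape of such an expression looks roughly like the sketch below (illustrative only, not the exact rule shipped in ocs-operator). The join on the ceph_daemon label is the part that breaks when the label format differs between the metadata and stat series:

# Hypothetical sketch of a per-OSD fullness expression: used fraction joined
# with ceph_osd_metadata on the ceph_daemon label. If the label values do not
# line up between the two sides, the match is empty and the alert never fires.
(
  ceph_osd_metadata
  * on (ceph_daemon) group_right ()
  (ceph_osd_stat_bytes_used / ceph_osd_stat_bytes)
) >= 0.80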

Comment 6 Harish NV Rao 2023-06-19 06:40:38 UTC
(In reply to arun kumar mohan from comment #5)
> Had an initial investigation with the QE cluster (thanks to Prasad),
> 
> Both alerts (CephOSDCriticallyFull and CephOSDNearFull) use the following
> metrics
> 
> ceph_osd_metadata
> ceph_osd_stat_bytes_used
> ceph_osd_stat_bytes
> 
> We are getting 'null' (with value: 'None') values while individually running
> each (above) metric commands. Thus not getting a definite value for the
> alert query.
> Yet to figure out why we are having the null values for the metrics.
> Triaging...

Hi Arun,

Any update on the RCA?

Comment 7 arun kumar mohan 2023-06-20 12:59:47 UTC
Hi Harish, we are trying to put up a PR with a changed query that returns only non-null results (which should fix the issue).
Currently we are hitting an issue where the new/changed query also drags in these 'None' values.
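
One quick way to see where the values get lost (a sketch, runnable from the OpenShift console under Observe > Metrics or any Prometheus UI) is to compare the ceph_daemon label values on each side of the join:

# If the two label sets differ (for example "osd.0" on one side and a
# different format on the other), the joined alert expression returns
# no samples at all.
count by (ceph_daemon) (ceph_osd_metadata)
count by (ceph_daemon) (ceph_osd_stat_bytes_used)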

Comment 9 arun kumar mohan 2023-06-20 16:30:32 UTC
CC-ing Avan (who worked in the ceph-exporter area) for any insight (or any other changes we might have missed, given the limited samples)

Comment 10 arun kumar mohan 2023-06-20 18:51:49 UTC
Added a PR: https://github.com/red-hat-storage/ocs-operator/pull/2081

Comment 11 Harish NV Rao 2023-06-21 10:13:52 UTC
@Kusuma, requesting you to add this as a known issue in the 4.13.0 Release notes.
@Arun, could you please provide the doc text?

Comment 12 arun kumar mohan 2023-06-21 10:40:35 UTC
Since this is a regression (and not a new issue), how will we categorize this?
Mudit, can you take a look (on how to proceed)?
Provided the doc text as requested.

PS: After a quick chat with Avan, moved the above-mentioned PR#2081 to draft, as Avan is working on PR https://github.com/ceph/ceph/pull/52084, which fixes the 'ceph_daemon' format issue.

Comment 13 Mudit Agarwal 2023-06-27 06:18:02 UTC
Already tagged as a known issue.

Avan, you should create a ceph bug (clone of this bug) so that the downstream backport can be tracked there.

Comment 18 arun kumar mohan 2023-07-13 12:25:37 UTC
Will take it up once the dependent BZ is completed...

Comment 19 arun kumar mohan 2023-08-02 16:42:43 UTC
As per comment #8, this is a two-part issue, of which the first part is resolved by Avan's fix.
The minor second part is fixed through this PR: https://github.com/red-hat-storage/ocs-operator/pull/2081

Comment 25 errata-xmlrpc 2023-11-08 18:51:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832

Comment 26 Red Hat Bugzilla 2024-03-08 04:25:50 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

