Description of problem (please be as detailed as possible and provide log snippets):
=======================================================================
On an ODF 4.13 cluster with the following cluster-level parameters enabled:
  - FIPS
  - Hugepages
  - KMS - vault
  - Cluster-wide encryption
  - Encryption in transit

the CephOSDCriticallyFull and CephOSDNearFull alerts are not triggered even though the OSDs have reached 85.05% utilization. Please note that the CephClusterNearFull and CephClusterCriticallyFull alerts are firing.

11:08:06 - MainThread - tests.e2e.system_test.test_cluster_full_and_recovery - INFO - osd utilization: {'osd.1': 85.05687686796765, 'osd.2': 85.05248164996483, 'osd.0': 85.05290896282622}

prasad:alerts$ oc rsh -n openshift-storage rook-ceph-tools-75bc769bdd-677cv
sh-5.1$ ceph -s
  cluster:
    id:     a8434717-c354-401c-bc64-7dc6c9b15e28
    health: HEALTH_ERR
            3 full osd(s)
            12 pool(s) full

  services:
    mon: 3 daemons, quorum a,b,c (age 37h)
    mgr: a(active, since 37h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 37h), 3 in (since 37h)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 23.11k objects, 87 GiB
    usage:   255 GiB used, 45 GiB / 300 GiB avail
    pgs:     169 active+clean

  io:
    client: 852 B/s rd, 1 op/s rd, 0 op/s wr

sh-5.1$ ceph osd status
ID  HOST       USED   AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  compute-1  85.0G  14.9G       0        0       2        0  exists,full,up
 1  compute-2  85.0G  14.9G       0        0       2        0  exists,full,up
 2  compute-0  85.0G  14.9G       0        0       4      106  exists,full,up

sh-5.1$ ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL   %USE   VAR   PGS  STATUS
 1  hdd    0.09760   1.00000  100 GiB   85 GiB   85 GiB   84 KiB  528 MiB  15 GiB  85.06  1.00  169      up
 2  hdd    0.09760   1.00000  100 GiB   85 GiB   85 GiB   84 KiB  524 MiB  15 GiB  85.05  1.00  169      up
 0  hdd    0.09760   1.00000  100 GiB   85 GiB   85 GiB   84 KiB  524 MiB  15 GiB  85.05  1.00  169      up
                       TOTAL  300 GiB  255 GiB  254 GiB  255 KiB  1.5 GiB  45 GiB  85.05
MIN/MAX VAR: 1.00/1.00  STDDEV: 0.00

Alerts shown in the console:

CephClusterCriticallyFull
  Storage cluster utilization has crossed 80% and will become read-only at 85%. Free up some space or expand the storage cluster immediately.
  Critical - Firing - Since Jun 15, 2023, 11:04 AM - Platform

CephClusterNearFull
  Storage cluster utilization has crossed 75% and will become read-only at 85%. Free up some space or expand the storage cluster.
  Warning - Firing - Since Jun 15, 2023, 11:01 AM - Platform

Version of all relevant components (if applicable):
OCP version - 4.13.0-0.nightly-2023-06-13-070743
ODF version - 4.13.0-218

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Always

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:
On 4.12 the CephOSDCriticallyFull and CephOSDNearFull alerts fire when the Ceph OSDs reach the full ratios.

Steps to Reproduce:
===================
Manual steps:
1) Create an ODF cluster
2) Fill the OSDs to 85% and check for the CephOSDCriticallyFull and CephOSDNearFull alerts

Automation:
Run the system test - https://github.com/red-hat-storage/ocs-ci/blob/master/tests/e2e/system_test/test_cluster_full_and_recovery.py

Actual results:
===============
The CephOSDCriticallyFull and CephOSDNearFull alerts do not fire when the Ceph OSD full ratios are reached.

Expected results:
=================
The CephOSDCriticallyFull and CephOSDNearFull alerts should fire when the Ceph OSD full ratios are reached.
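For reference, the expected behaviour can be sanity-checked against the per-OSD utilization values logged by the test. This is a minimal sketch, assuming OSD-level thresholds analogous to the 75%/80% cluster-level thresholds quoted in the alert texts above; the function and variable names are illustrative, not taken from the product:

```python
# Hypothetical sketch: which per-OSD alerts *should* fire for the
# utilization values logged by test_cluster_full_and_recovery.
# Thresholds are an assumption modelled on the cluster-level alerts.
NEAR_FULL_PCT = 75.0        # assumed CephOSDNearFull threshold
CRITICALLY_FULL_PCT = 80.0  # assumed CephOSDCriticallyFull threshold

osd_utilization = {
    'osd.1': 85.05687686796765,
    'osd.2': 85.05248164996483,
    'osd.0': 85.05290896282622,
}

def expected_alerts(used_pct):
    """Return the alert names expected to fire for one OSD's %USE value."""
    alerts = []
    if used_pct >= NEAR_FULL_PCT:
        alerts.append('CephOSDNearFull')
    if used_pct >= CRITICALLY_FULL_PCT:
        alerts.append('CephOSDCriticallyFull')
    return alerts

for osd, used_pct in sorted(osd_utilization.items()):
    print(osd, expected_alerts(used_pct))
```

With every OSD above 85%, both per-OSD alerts should be firing for all three OSDs, which is what makes their absence (while the cluster-level alerts do fire) a bug.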
Had an initial look at the QE cluster (thanks to Prasad).

Both alerts (CephOSDCriticallyFull and CephOSDNearFull) use the following metrics:

ceph_osd_metadata
ceph_osd_stat_bytes_used
ceph_osd_stat_bytes

We are getting 'null' (with value: 'None') results when running each of the above metric queries individually, so the alert query never evaluates to a definite value. Yet to figure out why the metrics are returning null values. Triaging...
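For context, the per-OSD alert expressions combine the three metrics above, joined on the ceph_daemon label. The sketch below is only an approximation of the shape of such a rule, not the exact downstream expression:

```promql
# Approximate shape of a per-OSD near-full rule (illustrative only):
# ceph_osd_metadata carries the value 1, so multiplying by it acts as
# a label join; if any of the three series is absent/null, the whole
# expression returns no result and the alert cannot fire.
(ceph_osd_stat_bytes_used / ceph_osd_stat_bytes)
  * on (ceph_daemon) group_left () ceph_osd_metadata
  >= 0.75
```

This structure is why a null result from any one of the three metrics is enough to silence the alert entirely.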
(In reply to arun kumar mohan from comment #5)
> Had an initial investigation with the QE cluster (thanks to Prasad),
>
> Both alerts (CephOSDCriticallyFull and CephOSDNearFull) use the following
> metrics
>
> ceph_osd_metadata
> ceph_osd_stat_bytes_used
> ceph_osd_stat_bytes
>
> We are getting 'null' (with value: 'None') values while individually running
> each (above) metric commands. Thus not getting a definite value for the
> alert query.
> Yet to figure out why we are having the null values for the metrics.
> Triaging...

Hi Arun,
Any update on the RCA?
Hi Harish, we are trying to put up a PR with a changed query that returns only the non-null results (which should fix the issue). Currently we are hitting an issue where the new/changed query also drags in these 'None' values.
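The "non-null results only" change being attempted could take a shape like the following. This is illustrative PromQL only, not the actual contents of the PR:

```promql
# Illustrative only: restrict the denominator to series with a positive
# value, so absent/empty series can no longer produce a null result in
# the division.
(ceph_osd_stat_bytes_used / (ceph_osd_stat_bytes > 0)) >= 0.75
```

As noted above, in practice the reworked query was still pulling in 'None' values, which pointed to the problem being in the exported series themselves rather than in the rule.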
CC-ing Avan (who worked in the ceph-exporter area) for any insight, or for any further changes we might have missed given the limited samples.
Added a PR: https://github.com/red-hat-storage/ocs-operator/pull/2081
@Kusuma, requesting you to add this as a known issue in the 4.13.0 Release notes. @Arun, could you please provide the doc text?
Since this is a regression (and not a new issue), how will we categorize this? Mudit, can you take a look (on how to proceed)?

Provided the doc text as requested.

PS: After a quick chat with Avan, moved the above-mentioned PR #2081 to draft, as Avan is working on PR: https://github.com/ceph/ceph/pull/52084, which will carry a 'ceph_daemon' format issue fix.
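The exact nature of the 'ceph_daemon' format issue is not spelled out in this thread; if the mismatch is in the label text between the metadata and stat series (an assumption on my part), the broken join could in principle be patched over with something like the sketch below. The label formats shown are hypothetical; the real fix lives in the linked ceph PR, not in a query workaround:

```promql
# Hypothetical sketch: normalize a ceph_daemon label of the form "osd0"
# to "osd.0" so it joins with series labeled "osd.0". Series whose label
# does not match the regex are left unchanged.
label_replace(ceph_osd_metadata, "ceph_daemon", "osd.$1",
              "ceph_daemon", "osd([0-9]+)")
```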
Already tagged as a known issue. Avan, you should create a ceph bug (clone of this bug) so that the downstream backport can be tracked there.
Will take it once the dependent BZ gets completed...
As per comment #8, this is a two-part issue, of which the first part is resolved by Avan's fix. A minor second part is fixed through this PR: https://github.com/red-hat-storage/ocs-operator/pull/2081
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:6832
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days