Bug 2215239
| Summary: | The CephOSDCriticallyFull and CephOSDNearFull alerts are not firing when reaching the ceph OSD full ratios | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Prasad Desala <tdesala> |
| Component: | ceph-monitoring | Assignee: | arun kumar mohan <amohan> |
| Status: | CLOSED ERRATA | QA Contact: | Vishakha Kathole <vkathole> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.13 | CC: | amohan, athakkar, ebenahar, hnallurv, kbg, muagarwa, nthomas, odf-bz-bot |
| Target Milestone: | --- | Keywords: | Automation, AutomationBackLog, Regression |
| Target Release: | ODF 4.14.0 | Flags: | kbg: needinfo- |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.14.0-126 | Doc Type: | Known Issue |
| Doc Text: | The alerts `CephOSDCriticallyFull` and `CephOSDNearFull` do not fire as expected because the `ceph_daemon` value format has changed in the Ceph-provided metrics, while these alerts still rely on the old value format (see the PromQL sketch after this table). | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-11-08 18:51:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2217817 | | |
| Bug Blocks: | 2154341, 2244409 | | |
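For context on the Doc Text above: both alerts are built from a PromQL expression that joins the per-OSD usage ratio with `ceph_osd_metadata` on the `ceph_daemon` label. The sketch below is illustrative only; the thresholds (0.75/0.80) and the exact shape of the expression are assumptions, not the rule actually shipped with ODF. It shows why a change in the `ceph_daemon` value format is enough to keep the alerts silent: if the label values no longer match across the metrics, the join returns no series and the threshold comparison never evaluates.

```promql
# Illustrative sketch only; thresholds and expression shape are assumptions,
# not the exact alert rules shipped with ODF.

# CephOSDNearFull-style expression: per-OSD usage ratio, limited to OSDs that
# also appear in ceph_osd_metadata, compared against a warning threshold.
((ceph_osd_stat_bytes_used / ceph_osd_stat_bytes)
  and on (ceph_daemon) ceph_osd_metadata) > 0.75

# CephOSDCriticallyFull-style expression: same join, higher threshold.
((ceph_osd_stat_bytes_used / ceph_osd_stat_bytes)
  and on (ceph_daemon) ceph_osd_metadata) > 0.80
```

If `ceph_osd_metadata` reports `ceph_daemon` in one format while the `ceph_osd_stat_bytes*` metrics report it in another, the `and on (ceph_daemon)` match is empty, which is consistent with the alerts silently never firing rather than producing an error.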
Description: Prasad Desala, 2023-06-15 06:55:30 UTC
Had an initial investigation on the QE cluster (thanks to Prasad). Both alerts (CephOSDCriticallyFull and CephOSDNearFull) use the following metrics:

- ceph_osd_metadata
- ceph_osd_stat_bytes_used
- ceph_osd_stat_bytes

We are getting 'null' (value: 'None') results when running each of the above metric queries individually, so the alert query never gets a definite value. Yet to figure out why the metrics return null values. Triaging...

(In reply to arun kumar mohan from comment #5)
Hi Arun, any update on the RCA?

Hi Harish, we are trying to put up a changed-query PR that returns only the non-null results, which should fix the issue. We are currently hitting an issue where the new/changed query also pulls in these 'None' values. CC-ing Avan (who worked in the ceph-exporter area) for any insight, or for any other changes we might have missed given the limited samples.

@Kusuma, requesting you to add this as a known issue in the 4.13.0 Release Notes. @Arun, could you please provide the doc text?

Since this is a regression (and not a new issue), how should we categorize it? Mudit, can you take a look at how to proceed?

Provided the doc text as requested. PS: After a quick chat with Avan, moved the above-mentioned PR#2081 to a draft, as Avan is working on PR https://github.com/ceph/ceph/pull/52084, which will fix the 'ceph_daemon' format issue.

Already tagged as a known issue.

Avan, you should create a ceph bug (clone of this bug) so that the downstream backport can be tracked there.

Will take it up once the dependent BZ is completed...

As per comment#8, this is a two-part issue, of which the first part is resolved by Avan's fix. The minor second part is fixed through this PR: https://github.com/red-hat-storage/ocs-operator/pull/2081

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:6832

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
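The triage described in the thread above (running each metric on its own, then the combined alert query) can be reproduced from the Prometheus/Thanos query UI. The queries below are a sketch of that check; the `osd.0` value is a hypothetical example of the older `ceph_daemon` format, since the exact new format introduced by the exporter change is not quoted in this bug.

```promql
# Step 1: query each metric individually and note whether it returns series
# at all, and what the ceph_daemon label values look like.
ceph_osd_metadata
ceph_osd_stat_bytes_used
ceph_osd_stat_bytes

# Step 2: compare the ceph_daemon label values reported by each side of the
# join; a mismatch (e.g. "osd.0" on one side, hypothetically, and a different
# string on the other) means the alert expression can never match.
count by (ceph_daemon) (ceph_osd_metadata)
count by (ceph_daemon) (ceph_osd_stat_bytes)

# Step 3: run the join used by the alert-style expression; an empty result
# here, despite both metrics existing, points at the label-format mismatch.
(ceph_osd_stat_bytes_used / ceph_osd_stat_bytes)
  and on (ceph_daemon) ceph_osd_metadata
```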