Bug 2279979 - [IBM Z] Wrong ceph_osd_stat_bytes_used and ceph_osd_stat_bytes in ODF
Summary: [IBM Z] Wrong ceph_osd_stat_bytes_used and ceph_osd_stat_bytes in ODF
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Metrics
Version: 7.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 6.1z9
Assignee: Juan Miguel Olmo
QA Contact: Sayalee
URL:
Whiteboard:
Depends On:
Blocks: 2274525
 
Reported: 2024-05-10 10:46 UTC by umanga
Modified: 2024-11-05 05:39 UTC (History)
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2274525
Environment:
Last Closed: 2024-11-05 05:39:12 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-8994 0 None None None 2024-05-10 10:50:53 UTC

Description umanga 2024-05-10 10:46:09 UTC
+++ This bug was initially created as a clone of Bug #2274525 +++

Description of problem (please be as detailed as possible and provide log
snippets):

When checking the OSD disk usage, we find two conflicting sets of values for two metrics:

- ceph_osd_stat_bytes
- ceph_osd_stat_bytes_used

Obviously, the correct set of values is the one where ceph_osd_stat_bytes_used (bytes actually used) is smaller than ceph_osd_stat_bytes (total bytes).


Set 1:
The Ceph Admin GUI and `ceph osd df` (via the rook-ceph-tools-xx container) show a similar set of values for those two metrics. These appear correct, with ceph_osd_stat_bytes_used < ceph_osd_stat_bytes.

Set 2:
The Prometheus GUI and the internal /metrics endpoint show wrong values, with ceph_osd_stat_bytes_used > ceph_osd_stat_bytes.

The internal /metrics values were obtained as follows:
# oc describe pod rook-ceph-exporter-worker-xxx | grep IP
# curl -s http://10.129.2.32:9926/metrics | grep 'ceph_osd_stat_bytes'
ceph_osd_stat_bytes{ceph_daemon="osd.2"} 6978919051352
ceph_osd_stat_bytes_used{ceph_daemon="osd.2"} 8441947082770
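The invariant described above (used < total per OSD) can be checked mechanically against the raw exporter output. A minimal sketch, assuming the exporter endpoint shown in the curl above; `check_osd_metrics` is a hypothetical helper name, not part of any Ceph tooling:

```shell
# Sketch (assumed helper name): read Prometheus text-format metrics on stdin
# and print any OSD whose used bytes exceed its total bytes, i.e. the
# inconsistency described in this bug.
check_osd_metrics() {
  awk '
    # ceph_osd_stat_bytes{ceph_daemon="osd.N"} <value>
    /^ceph_osd_stat_bytes[{]/      { split($0, a, "\""); total[a[2]] = $2 }
    /^ceph_osd_stat_bytes_used[{]/ { split($0, a, "\""); used[a[2]]  = $2 }
    END {
      for (d in used)
        if (used[d] + 0 > total[d] + 0)
          printf "INCONSISTENT %s used=%s total=%s\n", d, used[d], total[d]
    }'
}

# Usage against the exporter pod IP found via `oc describe pod` (example IP):
# curl -s http://10.129.2.32:9926/metrics | check_osd_metrics
```

Any line printed by the helper corresponds to an OSD exhibiting the Set 2 symptom.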

When we tested Ceph Reef standalone, the numbers appear correct. I suspect the problem lies in the ceph-exporter code; some commits on the reef branch have not been backported to quincy:
https://github.com/ceph/ceph/commits/quincy/src/exporter


Version of all relevant components (if applicable):
ODF 4.15
ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Can this issue reproducible?
Yes


Can this issue reproduce from the UI?
Yes


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

--- Additional comment from RHEL Program Management on 2024-04-11 18:46:13 IST ---

This bug previously had no release flag set; the release flag 'odf-4.16.0' has now been set to '?', proposing that it be fixed in the ODF 4.16.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were set while the release flag was missing, have now been reset, since Acks must be set against a release flag.

--- Additional comment from RHEL Program Management on 2024-04-11 18:46:13 IST ---

The 'Target Release' is not to be set manually at the Red Hat OpenShift Data Foundation product.

The 'Target Release' will be auto set appropriately, after the 3 Acks (pm,devel,qa) are set to "+" for a specific release flag and that release flag gets auto set to "+".

--- Additional comment from umanga on 2024-05-10 15:03:04 IST ---

This needs to be looked at by ceph and ceph-exporter devs. Nothing we can do here.
Moving this out to 4.17 and finding the right assignee for it.

Comment 2 Paulo 2024-09-04 13:51:55 UTC
We have a customer on the path to migrating to 4.15 and evaluating 4.16.
On top of that, they are evaluating external Ceph 5 and 7.1. Is this fixed anywhere?

