Bug 2242324
| Summary: | ODF Monitoring is missing some of the metric values 4.13 | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Daniel Osypenko <dosypenk> |
| Component: | rook | Assignee: | avan <athakkar> |
| Status: | CLOSED ERRATA | QA Contact: | Neha Berry <nberry> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.13 | CC: | athakkar, athar.lh, branto, ebenahar, fbalak, hnallurv, kdreyer, kramdoss, muagarwa, murtaza.8060, nthomas, odf-bz-bot, rcyriac, sheggodu, tnielsen |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | ODF 4.13.8 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.13.8-1 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 2221488 | Environment: | |
| Last Closed: | 2024-04-03 07:03:00 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2221488, 2253428, 2253429 | | |
| Bug Blocks: | | | |
Description
Daniel Osypenko
2023-10-05 15:12:20 UTC
So, is this applicable to 4.13 only because bug #2221488 is already fixed in 4.14?

@muagarwa yes. 4.14 is stable and passing tests now. 4.13 is constantly failing, currently missing 142 metrics. 4.10, 4.11, and 4.12 have a 100% pass ratio. Full list of missing metrics: http://pastebin.test.redhat.com/1110338

Regarding the summary, sorry for the confusion; it was taken from the original BZ and should be "ODF Monitoring is missing some of the metric values 4.13".

After running `ceph config set mgr mgr/prometheus/exclude_perf_counters false`, this is the list of the still-missing metrics (test test_ceph_metrics_available): ceph_rgw_put, ceph_rgw_put_initial_lat_sum, ceph_rgw_put_initial_lat_count, ceph_rgw_keystone_token_cache_hit, ceph_rgw_metadata, ceph_rgw_qactive, ceph_rgw_get_initial_lat_sum, ceph_rgw_get_initial_lat_count, ceph_rgw_get_b, ceph_rgw_failed_req, ceph_rgw_keystone_token_cache_miss, ceph_rgw_get, ceph_rgw_cache_hit, ceph_rgw_put_b. The ceph_rgw metrics are not showing up because the RGW service is unavailable on the cluster. Now the test passes. I can move this to VERIFIED once it is ON_QA. Regards

Moving to ON_QA as discussed.

Can you please specify the version I need to verify? Regards

OC version:
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.13.0-0.nightly-2023-11-14-104446
Kubernetes Version: v1.26.9+636f2be

OCS version:
ocs-operator.v4.13.5-rhodf OpenShift Container Storage 4.13.5-rhodf ocs-operator.v4.13.4-rhodf Succeeded

Cluster version:
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.13.0-0.nightly-2023-11-14-104446 True False 46m Cluster version is 4.13.0-0.nightly-2023-11-14-104446

Rook version:
rook: v4.13.5-0.42f43768ad57d91be47327f83653c05eeb721977
go: go1.19.13

Ceph version:
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)

Missing 166 metrics: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/31288/console

Regarding the recommendation to unset mgr/prometheus/exclude_perf_counters: did we have this flag before, and does it mean that by default a user would not see these metrics? Is that acceptable / documented? Thanks @athar.lh (a sketch of toggling the flag follows below)

Moving the bug to 4.13.7 as we are doing a quick 4.13.6 to include a critical RGW fix (bug 2254303) before the shutdown.
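A minimal sketch of the workaround discussed above: toggling `mgr/prometheus/exclude_perf_counters` from the Rook toolbox and reading the value back. Only the two `ceph config` commands come from this bug; the `openshift-storage` namespace and the `app=rook-ceph-tools` pod label are assumptions about a typical ODF deployment, so adjust them for your cluster.

```python
"""Sketch: flip mgr/prometheus/exclude_perf_counters via the Rook
toolbox pod and confirm the new value.

Assumptions: the toolbox pod carries the label app=rook-ceph-tools and
lives in the openshift-storage namespace; `oc` is already logged in.
"""
import subprocess

NAMESPACE = "openshift-storage"  # assumed ODF namespace
FLAG = "mgr/prometheus/exclude_perf_counters"

def toolbox_ceph(*args: str) -> str:
    """Run a `ceph` command inside the rook-ceph toolbox pod."""
    pod = subprocess.check_output(
        ["oc", "-n", NAMESPACE, "get", "pod",
         "-l", "app=rook-ceph-tools",  # assumed toolbox label
         "-o", "jsonpath={.items[0].metadata.name}"],
        text=True,
    ).strip()
    return subprocess.check_output(
        ["oc", "-n", NAMESPACE, "rsh", pod, "ceph", *args],
        text=True,
    ).strip()

# Re-enable perf counter export from the mgr Prometheus module, then
# verify the setting (the bug reports metrics become visible afterwards).
toolbox_ceph("config", "set", "mgr", FLAG, "false")
print(toolbox_ceph("config", "get", "mgr", FLAG))  # expect "false"
```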
test_ceph_metrics_available fails; 166 metrics are missing, including:

ceph_bluestore_state_aio_wait_lat_sum, ceph_paxos_store_state_latency_sum, ceph_osd_op_out_bytes, ceph_bluestore_txc_submit_lat_sum, ceph_paxos_commit, ceph_paxos_new_pn_latency_count, ceph_osd_op_r_process_latency_count, ceph_bluestore_txc_submit_lat_count, ceph_bluestore_kv_final_lat_sum, ceph_paxos_collect_keys_sum, ceph_paxos_accept_timeout, ceph_paxos_begin_latency_count, ceph_bluefs_wal_total_bytes, ceph_paxos_refresh, ceph_bluestore_read_lat_count, ceph_mon_num_sessions, ceph_objecter_op_rmw, ceph_bluefs_bytes_written_wal, ceph_mon_num_elections, ceph_rocksdb_compact, ceph_bluestore_kv_sync_lat_sum, ceph_osd_op_process_latency_count, ceph_osd_op_w_prepare_latency_count, ceph_objecter_op_active, ceph_paxos_begin_latency_sum, ceph_osd_op_r, ceph_osd_op_rw_prepare_latency_sum, ceph_paxos_new_pn, ceph_rgw_qlen, ceph_rgw_req, ceph_rocksdb_get_latency_count, ceph_rgw_cache_miss, ceph_paxos_commit_latency_count, ceph_bluestore_txc_throttle_lat_count, ceph_paxos_lease_ack_timeout, ceph_bluestore_txc_commit_lat_sum, ceph_paxos_collect_bytes_sum, ceph_osd_op_rw_latency_count, ceph_paxos_collect_uncommitted, ceph_osd_op_rw_latency_sum, ceph_paxos_share_state, ceph_osd_op_r_prepare_latency_sum, ceph_bluestore_kv_flush_lat_sum, ceph_osd_op_rw_process_latency_sum, ceph_rocksdb_rocksdb_write_memtable_time_count, ceph_paxos_collect_latency_count, ceph_osd_op_rw_prepare_latency_count, ceph_paxos_collect_latency_sum, ceph_rocksdb_rocksdb_write_delay_time_count, ceph_objecter_op_rmw, ceph_paxos_begin_bytes_sum, ceph_osd_numpg, ceph_osd_stat_bytes, ceph_rocksdb_submit_sync_latency_sum

ODF 4.13.7-rhodf, vSphere UPI deployment, OCP 4.13.0-0.nightly-2024-01-17-100523

Elaborating on what "missing metrics" means: when we run an instant query, the Prometheus API returns HTTP 200 with `status: success`, but the result vector is empty. That means Prometheus holds no data for the metric at all. (The "304" in the log excerpt is the prometheus.py source line number of validate_status, not an HTTP status.) A sketch of this check follows; the test log excerpt comes after it.
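A minimal sketch of that emptiness check, mirroring what the ocs-ci log excerpt below shows. The route URL and bearer token here are hypothetical placeholders; the real values appear in the log.

```python
"""Sketch: flag a Ceph metric as "missing" when the Prometheus instant
query succeeds but returns an empty result vector.

PROM_ROUTE and TOKEN are placeholders, not values from this bug.
"""
import requests
import urllib3

urllib3.disable_warnings()  # the test queries the route with verify=False

PROM_ROUTE = "https://prometheus-k8s-openshift-monitoring.apps.example.com"
TOKEN = "sha256~REPLACE_ME"

def is_missing(metric: str) -> bool:
    """Return True when Prometheus holds no series for `metric`."""
    resp = requests.get(
        f"{PROM_ROUTE}/api/v1/query",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"query": metric},
        verify=False,
    )
    resp.raise_for_status()  # the failing queries still return HTTP 200
    body = resp.json()
    # A "missing" metric looks like:
    # {'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
    return body["status"] != "success" or not body["data"]["result"]

missing = [m for m in ("ceph_bluestore_state_aio_wait_lat_sum",
                       "ceph_rgw_put") if is_missing(m)]
print(f"{len(missing)} missing metrics: {missing}")
```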
Logs:

```
2024-01-17 18:08:34,071 - MainThread - INFO - /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/utility/prometheus.py.query.504 - Performing prometheus instant query 'ceph_bluestore_state_aio_wait_lat_sum'
2024-01-17 18:08:34,071 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/utility/prometheus.py.get.422 - GET https://prometheus-k8s-openshift-monitoring.apps.dosypenk-171.qe.rh-ocs.com/api/v1/query
2024-01-17 18:08:34,071 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/utility/prometheus.py.get.423 - headers={'Authorization': 'Bearer sha256~f9UfvhOsP02LNP5oLPx9uQhph3oSHYJpL6qaPBH7wlk'}
2024-01-17 18:08:34,071 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/utility/prometheus.py.get.424 - verify=False
2024-01-17 18:08:34,072 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/utility/prometheus.py.get.425 - params={'query': 'ceph_bluestore_state_aio_wait_lat_sum'}
2024-01-17 18:08:34,073 - MainThread - DEBUG - urllib3.connectionpool._new_conn.1019 - Starting new HTTPS connection (1): prometheus-k8s-openshift-monitoring.apps.dosypenk-171.qe.rh-ocs.com:443
2024-01-17 18:08:34,107 - MainThread - DEBUG - urllib3.connectionpool._make_request.474 - https://prometheus-k8s-openshift-monitoring.apps.dosypenk-171.qe.rh-ocs.com:443 "GET /api/v1/query?query=ceph_bluestore_state_aio_wait_lat_sum HTTP/1.1" 200 87
2024-01-17 18:08:34,108 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/utility/prometheus.py.validate_status.304 - content value: {'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
2024-01-17 18:08:34,109 - MainThread - ERROR - ocs_ci.ocs.metrics.get_missing_metrics.352 - failed to get results for ceph_bluestore_state_aio_wait_lat_sum
```

@athakkar confirming: with `ceph config get mgr mgr/prometheus/exclude_perf_counters` now returning false, the metrics become visible on ODF 4.13 / OCP 4.13.

This got assigned to me when we moved it to Rook. Assigning it back to Avan.

Verified on an IBM Cloud deployment:
* ODF 4.13.8-1
* OCP 4.13.0-0.nightly-2024-03-08-182318

test_ceph_rbd_metrics_available - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/34912/console

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.8 Bug Fix Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:1657