Bug 2221488 - ODF Monitoring is missing some of the metric values 4.14
Summary: ODF Monitoring is missing some of the metric values 4.14
Keywords:
Status: ON_QA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.14.0
Assignee: avan
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-07-09 12:02 UTC by Daniel Osypenko
Modified: 2023-08-10 03:50 UTC
CC List: 8 users

Fixed In Version: 4.14.0-105
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:




Links:
GitHub red-hat-storage/rook pull 504 (open): Bug 2221488: monitoring: enable exporter for downstream 4.14 (last updated 2023-08-08 17:56:38 UTC)

Description Daniel Osypenko 2023-07-09 12:02:28 UTC
This bug was initially created as a copy of Bug #2203795

I am copying this bug because: 

Even though the missing metric names differ from the ones missing in 4.13, the description of the problem and the parties involved in the discussion are the same.

Description of problem (please be as detailed as possible and provide log snippets):
ODF Monitoring is missing some of the ceph_* metric values. No related epic introducing a change or rename of these metrics was found.

List of missing metric values:
'ceph_bluestore_state_aio_wait_lat_sum',

'ceph_paxos_store_state_latency_sum',

'ceph_osd_op_out_bytes',

'ceph_bluestore_txc_submit_lat_sum',

'ceph_paxos_commit',

'ceph_paxos_new_pn_latency_count',

'ceph_osd_op_r_process_latency_count',

'ceph_bluestore_txc_submit_lat_count',

'ceph_bluestore_kv_final_lat_sum',

'ceph_paxos_collect_keys_sum',

'ceph_paxos_accept_timeout',

'ceph_paxos_begin_latency_count',

'ceph_bluefs_wal_total_bytes',

'ceph_paxos_refresh',

'ceph_bluestore_read_lat_count',

'ceph_mon_num_sessions',

'ceph_bluefs_bytes_written_wal',

'ceph_mon_num_elections',

'ceph_rocksdb_compact',

'ceph_bluestore_kv_sync_lat_sum',

'ceph_osd_op_process_latency_count',

'ceph_osd_op_w_prepare_latency_count',

'ceph_paxos_begin_latency_sum',

'ceph_osd_op_r',

'ceph_osd_op_rw_prepare_latency_sum',

'ceph_paxos_new_pn',

'ceph_rocksdb_get_latency_count',

'ceph_paxos_commit_latency_count',

'ceph_bluestore_txc_throttle_lat_count',

'ceph_paxos_lease_ack_timeout',

'ceph_bluestore_txc_commit_lat_sum',

'ceph_paxos_collect_bytes_sum',

'ceph_osd_op_rw_latency_count',

'ceph_paxos_collect_uncommitted',

'ceph_osd_op_rw_latency_sum',

'ceph_paxos_share_state',

'ceph_osd_op_r_prepare_latency_sum',

'ceph_bluestore_kv_flush_lat_sum',

'ceph_osd_op_rw_process_latency_sum',

'ceph_rocksdb_rocksdb_write_memtable_time_count',

'ceph_paxos_collect_latency_count',

'ceph_osd_op_rw_prepare_latency_count',

'ceph_paxos_collect_latency_sum',

'ceph_rocksdb_rocksdb_write_delay_time_count',

'ceph_paxos_begin_bytes_sum',

'ceph_osd_numpg',

'ceph_osd_stat_bytes',

'ceph_rocksdb_submit_sync_latency_sum',

'ceph_rocksdb_compact_queue_merge',

'ceph_paxos_collect_bytes_count',

'ceph_osd_op',

'ceph_paxos_commit_keys_sum',

'ceph_osd_op_rw_in_bytes',

'ceph_osd_op_rw_out_bytes',

'ceph_bluefs_bytes_written_sst',

'ceph_osd_op_rw_process_latency_count',

'ceph_rocksdb_compact_queue_len',

'ceph_bluestore_txc_throttle_lat_sum',

'ceph_bluefs_slow_used_bytes',

'ceph_osd_op_r_latency_sum',

'ceph_bluestore_kv_flush_lat_count',

'ceph_rocksdb_compact_range',

'ceph_osd_op_latency_sum',

'ceph_mon_session_add',

'ceph_paxos_share_state_keys_count',

'ceph_paxos_collect',

'ceph_osd_op_w_in_bytes',

'ceph_osd_op_r_process_latency_sum',

'ceph_paxos_start_peon',

'ceph_mon_session_trim',

'ceph_rocksdb_get_latency_sum',

'ceph_osd_op_rw',

'ceph_paxos_store_state_keys_count',

'ceph_rocksdb_rocksdb_write_delay_time_sum',

'ceph_osd_recovery_ops',

'ceph_bluefs_logged_bytes',

'ceph_bluefs_db_total_bytes',

'ceph_osd_op_w_latency_count',

'ceph_bluestore_txc_commit_lat_count',

'ceph_bluestore_state_aio_wait_lat_count',

'ceph_paxos_begin_bytes_count',

'ceph_paxos_start_leader',

'ceph_mon_election_call',

'ceph_rocksdb_rocksdb_write_pre_and_post_time_count',

'ceph_mon_session_rm',

'ceph_paxos_store_state',

'ceph_paxos_store_state_bytes_count',

'ceph_osd_op_w_latency_sum',

'ceph_rocksdb_submit_latency_count',

'ceph_paxos_commit_latency_sum',

'ceph_rocksdb_rocksdb_write_memtable_time_sum',

'ceph_paxos_share_state_bytes_sum',

'ceph_osd_op_process_latency_sum',

'ceph_paxos_begin_keys_sum',

'ceph_rocksdb_rocksdb_write_pre_and_post_time_sum',

'ceph_bluefs_wal_used_bytes',

'ceph_rocksdb_rocksdb_write_wal_time_sum',

'ceph_osd_op_wip',

'ceph_paxos_lease_timeout',

'ceph_osd_op_r_out_bytes',

'ceph_paxos_begin_keys_count',

'ceph_bluestore_kv_sync_lat_count',

'ceph_osd_op_prepare_latency_count',

'ceph_bluefs_bytes_written_slow',

'ceph_rocksdb_submit_latency_sum',

'ceph_osd_op_r_latency_count',

'ceph_paxos_share_state_keys_sum',

'ceph_paxos_store_state_bytes_sum',

'ceph_osd_op_latency_count',

'ceph_paxos_commit_bytes_count',

'ceph_paxos_restart',

'ceph_bluefs_slow_total_bytes',

'ceph_paxos_collect_timeout',

'ceph_osd_op_w_process_latency_sum',

'ceph_paxos_collect_keys_count',

'ceph_paxos_share_state_bytes_count',

'ceph_osd_op_w_prepare_latency_sum',

'ceph_bluestore_read_lat_sum',

'ceph_osd_stat_bytes_used',

'ceph_paxos_begin',

'ceph_mon_election_win',

'ceph_osd_op_w_process_latency_count',

'ceph_rocksdb_rocksdb_write_wal_time_count',

'ceph_paxos_store_state_keys_sum',

'ceph_osd_numpg_removing',

'ceph_paxos_commit_keys_count',

'ceph_paxos_new_pn_latency_sum',

'ceph_osd_op_in_bytes',

'ceph_paxos_store_state_latency_count',

'ceph_paxos_refresh_latency_count',

'ceph_osd_op_r_prepare_latency_count',

'ceph_bluefs_num_files',

'ceph_mon_election_lose',

'ceph_osd_op_prepare_latency_sum',

'ceph_bluefs_db_used_bytes',

'ceph_bluestore_kv_final_lat_count',

'ceph_paxos_refresh_latency_sum',

'ceph_osd_recovery_bytes',

'ceph_osd_op_w',

'ceph_paxos_commit_bytes_sum',

'ceph_bluefs_log_bytes',

'ceph_rocksdb_submit_sync_latency_count',

Ceph metrics which should be present on a healthy cluster:
https://github.com/red-hat-storage/ocs-ci/blob/81ca20aed067a30dd109e0f29e026f2a18c752ee/ocs_ci/ocs/metrics.py#L70

Polarion documentation: https://polarion.engineering.redhat.com/polarion/#/project/OpenShiftContainerStorage/workitem?id=OCS-958
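For illustration only (this is an editorial addition, not the actual ocs-ci implementation), a minimal Python sketch of this kind of check: it queries the cluster monitoring stack for each expected metric and reports those that return no samples. The thanos-querier route in openshift-monitoring and the token obtained from "oc whoami -t" are assumptions about the test environment.

import subprocess
import requests

EXPECTED_METRICS = [
    "ceph_osd_op_out_bytes",
    "ceph_mon_num_sessions",
    "ceph_osd_stat_bytes",
    # ... and the rest of the names from the list above
]

def oc(*args):
    # Small helper around the oc CLI; assumes oc is already logged in to the cluster.
    return subprocess.check_output(["oc", *args], text=True).strip()

token = oc("whoami", "-t")
host = oc("-n", "openshift-monitoring", "get", "route", "thanos-querier",
          "-o", "jsonpath={.spec.host}")

missing = []
for metric in EXPECTED_METRICS:
    resp = requests.get(
        f"https://{host}/api/v1/query",
        params={"query": metric},
        headers={"Authorization": f"Bearer {token}"},
        verify=False,  # lab clusters often use self-signed certificates
    )
    resp.raise_for_status()
    if not resp.json()["data"]["result"]:
        missing.append(metric)

print(f"{len(missing)} of {len(EXPECTED_METRICS)} metrics returned no samples")
for name in missing:
    print(f"  {name}")

On an affected 4.14 cluster this reports all of the metrics listed above; on a healthy cluster it should report none.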

Version of all relevant components (if applicable):

OC version:
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.14.0-0.nightly-2023-06-30-131338
Kubernetes Version: v1.27.3+ab0b8ee

OCS version:
ocs-operator.v4.14.0-36.stable              OpenShift Container Storage   4.14.0-36.stable              Succeeded

Cluster version
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-06-30-131338   True        False         4d1h    Cluster version is 4.14.0-0.nightly-2023-06-30-131338

Rook version:
rook: v4.14.0-0.d8ce011027a26218154bcedf63a54e97f020df40
go: go1.20.4

Ceph version:
ceph version 17.2.6-70.el9cp (fe62dcdbb2c6e05782a3e2b67d025b84ff5047cc) quincy (stable)


Is this issue reproducible?
Yes, reproducible both in CI runs and in local runs.


Steps to Reproduce:
1. Install OCP/ODF cluster
2. After installation, check whether Prometheus provides values for the
   metrics listed above.


Actual results:
OCP Prometheus provides no values for any of the metrics listed above.

Expected results:
OCP Prometheus provides values for all metrics listed above.

Logs of the test run:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-034aikt1c33-t1/j-034aikt1c33-t1_20230704T064403/logs/ocs-ci-logs-1688456400/by_outcome/failed/tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py/test_ceph_metrics_available/logs

Must-gather logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-034aikt1c33-t1/j-034aikt1c33-t1_20230704T064403/logs/testcases_1688456400/

Comment 3 Travis Nielsen 2023-07-10 16:47:34 UTC
Avan PTAL

Comment 4 avan 2023-07-25 17:26:37 UTC
@Daniel,
Is this still reproducible?

Comment 16 Boris Ranto 2023-08-09 11:53:41 UTC
Done, I updated the defaults to use the new RHCS 6.1z2 first build (6-200). We should have it in our builds starting from tomorrow.
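A hedged verification sketch (editorial addition): once a build containing the fix (4.14.0-105 or later) is installed, the downstream ceph-exporter pods that the linked rook pull 504 enables should be present, and the query sketch above should then report no missing metrics. The openshift-storage namespace and the app=rook-ceph-exporter label are assumptions about the ODF deployment.

# Hedged check, not from the report: list the ceph-exporter pods expected
# after the fix. Namespace and label are assumptions about the deployment.
import subprocess

out = subprocess.check_output(
    ["oc", "-n", "openshift-storage", "get", "pods",
     "-l", "app=rook-ceph-exporter", "--no-headers"],
    text=True,
).strip()
print(out if out else "No ceph-exporter pods found; the fix may not be applied.")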

