Bug 2221488

Summary: ODF Monitoring is missing some of the metric values 4.14
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Daniel Osypenko <dosypenk>
Component: rook
Assignee: avan <athakkar>
Status: ON_QA
QA Contact: Neha Berry <nberry>
Severity: high
Priority: unspecified
Version: 4.14
CC: athakkar, branto, ebenahar, fbalak, kdreyer, muagarwa, odf-bz-bot, tnielsen
Target Milestone: ---
Target Release: ODF 4.14.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.14.0-105

Description Daniel Osypenko 2023-07-09 12:02:28 UTC
This bug was initially created as a copy of Bug #2203795

I am copying this bug because: 

Although the missing metric names differ from those missing in 4.13, the description of the problem and the parties involved in the discussion are the same.

Description of problem (please be as detailed as possible and provide log
snippets):
ODF Monitoring is missing some of the ceph_* metric values. No related epic documenting a change or rename of these metrics was found.

List of missing metric values:
'ceph_bluestore_state_aio_wait_lat_sum',
'ceph_paxos_store_state_latency_sum',
'ceph_osd_op_out_bytes',
'ceph_bluestore_txc_submit_lat_sum',
'ceph_paxos_commit',
'ceph_paxos_new_pn_latency_count',
'ceph_osd_op_r_process_latency_count',
'ceph_bluestore_txc_submit_lat_count',
'ceph_bluestore_kv_final_lat_sum',
'ceph_paxos_collect_keys_sum',
'ceph_paxos_accept_timeout',
'ceph_paxos_begin_latency_count',
'ceph_bluefs_wal_total_bytes',
'ceph_paxos_refresh',
'ceph_bluestore_read_lat_count',
'ceph_mon_num_sessions',
'ceph_bluefs_bytes_written_wal',
'ceph_mon_num_elections',
'ceph_rocksdb_compact',
'ceph_bluestore_kv_sync_lat_sum',
'ceph_osd_op_process_latency_count',
'ceph_osd_op_w_prepare_latency_count',
'ceph_paxos_begin_latency_sum',
'ceph_osd_op_r',
'ceph_osd_op_rw_prepare_latency_sum',
'ceph_paxos_new_pn',
'ceph_rocksdb_get_latency_count',
'ceph_paxos_commit_latency_count',
'ceph_bluestore_txc_throttle_lat_count',
'ceph_paxos_lease_ack_timeout',
'ceph_bluestore_txc_commit_lat_sum',
'ceph_paxos_collect_bytes_sum',
'ceph_osd_op_rw_latency_count',
'ceph_paxos_collect_uncommitted',
'ceph_osd_op_rw_latency_sum',
'ceph_paxos_share_state',
'ceph_osd_op_r_prepare_latency_sum',
'ceph_bluestore_kv_flush_lat_sum',
'ceph_osd_op_rw_process_latency_sum',
'ceph_rocksdb_rocksdb_write_memtable_time_count',
'ceph_paxos_collect_latency_count',
'ceph_osd_op_rw_prepare_latency_count',
'ceph_paxos_collect_latency_sum',
'ceph_rocksdb_rocksdb_write_delay_time_count',
'ceph_paxos_begin_bytes_sum',
'ceph_osd_numpg',
'ceph_osd_stat_bytes',
'ceph_rocksdb_submit_sync_latency_sum',
'ceph_rocksdb_compact_queue_merge',
'ceph_paxos_collect_bytes_count',
'ceph_osd_op',
'ceph_paxos_commit_keys_sum',
'ceph_osd_op_rw_in_bytes',
'ceph_osd_op_rw_out_bytes',
'ceph_bluefs_bytes_written_sst',
'ceph_osd_op_rw_process_latency_count',
'ceph_rocksdb_compact_queue_len',
'ceph_bluestore_txc_throttle_lat_sum',
'ceph_bluefs_slow_used_bytes',
'ceph_osd_op_r_latency_sum',
'ceph_bluestore_kv_flush_lat_count',
'ceph_rocksdb_compact_range',
'ceph_osd_op_latency_sum',
'ceph_mon_session_add',
'ceph_paxos_share_state_keys_count',
'ceph_paxos_collect',
'ceph_osd_op_w_in_bytes',
'ceph_osd_op_r_process_latency_sum',
'ceph_paxos_start_peon',
'ceph_mon_session_trim',
'ceph_rocksdb_get_latency_sum',
'ceph_osd_op_rw',
'ceph_paxos_store_state_keys_count',
'ceph_rocksdb_rocksdb_write_delay_time_sum',
'ceph_osd_recovery_ops',
'ceph_bluefs_logged_bytes',
'ceph_bluefs_db_total_bytes',
'ceph_osd_op_w_latency_count',
'ceph_bluestore_txc_commit_lat_count',
'ceph_bluestore_state_aio_wait_lat_count',
'ceph_paxos_begin_bytes_count',
'ceph_paxos_start_leader',
'ceph_mon_election_call',
'ceph_rocksdb_rocksdb_write_pre_and_post_time_count',
'ceph_mon_session_rm',
'ceph_paxos_store_state',
'ceph_paxos_store_state_bytes_count',
'ceph_osd_op_w_latency_sum',
'ceph_rocksdb_submit_latency_count',
'ceph_paxos_commit_latency_sum',
'ceph_rocksdb_rocksdb_write_memtable_time_sum',
'ceph_paxos_share_state_bytes_sum',
'ceph_osd_op_process_latency_sum',
'ceph_paxos_begin_keys_sum',
'ceph_rocksdb_rocksdb_write_pre_and_post_time_sum',
'ceph_bluefs_wal_used_bytes',
'ceph_rocksdb_rocksdb_write_wal_time_sum',
'ceph_osd_op_wip',
'ceph_paxos_lease_timeout',
'ceph_osd_op_r_out_bytes',
'ceph_paxos_begin_keys_count',
'ceph_bluestore_kv_sync_lat_count',
'ceph_osd_op_prepare_latency_count',
'ceph_bluefs_bytes_written_slow',
'ceph_rocksdb_submit_latency_sum',
'ceph_osd_op_r_latency_count',
'ceph_paxos_share_state_keys_sum',
'ceph_paxos_store_state_bytes_sum',
'ceph_osd_op_latency_count',
'ceph_paxos_commit_bytes_count',
'ceph_paxos_restart',
'ceph_bluefs_slow_total_bytes',
'ceph_paxos_collect_timeout',
'ceph_osd_op_w_process_latency_sum',
'ceph_paxos_collect_keys_count',
'ceph_paxos_share_state_bytes_count',
'ceph_osd_op_w_prepare_latency_sum',
'ceph_bluestore_read_lat_sum',
'ceph_osd_stat_bytes_used',
'ceph_paxos_begin',
'ceph_mon_election_win',
'ceph_osd_op_w_process_latency_count',
'ceph_rocksdb_rocksdb_write_wal_time_count',
'ceph_paxos_store_state_keys_sum',
'ceph_osd_numpg_removing',
'ceph_paxos_commit_keys_count',
'ceph_paxos_new_pn_latency_sum',
'ceph_osd_op_in_bytes',
'ceph_paxos_store_state_latency_count',
'ceph_paxos_refresh_latency_count',
'ceph_osd_op_r_prepare_latency_count',
'ceph_bluefs_num_files',
'ceph_mon_election_lose',
'ceph_osd_op_prepare_latency_sum',
'ceph_bluefs_db_used_bytes',
'ceph_bluestore_kv_final_lat_count',
'ceph_paxos_refresh_latency_sum',
'ceph_osd_recovery_bytes',
'ceph_osd_op_w',
'ceph_paxos_commit_bytes_sum',
'ceph_bluefs_log_bytes',
'ceph_rocksdb_submit_sync_latency_count',

Ceph metrics that should be present on a healthy cluster:
https://github.com/red-hat-storage/ocs-ci/blob/81ca20aed067a30dd109e0f29e026f2a18c752ee/ocs_ci/ocs/metrics.py#L70

Polarion documentation: https://polarion.engineering.redhat.com/polarion/#/project/OpenShiftContainerStorage/workitem?id=OCS-958

Version of all relevant components (if applicable):

OC version:
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.14.0-0.nightly-2023-06-30-131338
Kubernetes Version: v1.27.3+ab0b8ee

OCS version:
ocs-operator.v4.14.0-36.stable              OpenShift Container Storage   4.14.0-36.stable              Succeeded

Cluster version
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-06-30-131338   True        False         4d1h    Cluster version is 4.14.0-0.nightly-2023-06-30-131338

Rook version:
rook: v4.14.0-0.d8ce011027a26218154bcedf63a54e97f020df40
go: go1.20.4

Ceph version:
ceph version 17.2.6-70.el9cp (fe62dcdbb2c6e05782a3e2b67d025b84ff5047cc) quincy (stable)


Is this issue reproducible?
Yes, reproducible both in CI runs and in local runs.


Steps to Reproduce:
1. Install OCP/ODF cluster
2. After installation, check whether Prometheus provides values for the
   metrics listed above (a minimal query sketch follows below).
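
A minimal sketch of that check, assuming (not taken from this report) that the thanos-querier route in openshift-monitoring is used as the endpoint, PROM_URL holds its https://<host> URL, and PROM_TOKEN holds a bearer token (e.g. from `oc whoami -t`); the metric names are a small sample from the list above:

# Probe OCP Prometheus for a sample of the affected ceph_* metrics and
# report which of them return no samples at all.
import os

import requests

PROM_URL = os.environ["PROM_URL"]      # e.g. https://thanos-querier-openshift-monitoring.apps.<cluster>
PROM_TOKEN = os.environ["PROM_TOKEN"]  # e.g. output of `oc whoami -t`

# A few names taken from the list above; extend as needed.
METRICS = [
    "ceph_osd_op_r",
    "ceph_mon_num_sessions",
    "ceph_paxos_commit",
    "ceph_bluefs_db_used_bytes",
]

missing = []
for metric in METRICS:
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": metric},
        headers={"Authorization": f"Bearer {PROM_TOKEN}"},
        verify=False,  # test clusters typically use self-signed certs
    )
    resp.raise_for_status()
    # An empty result list means Prometheus has no values for this metric.
    if not resp.json()["data"]["result"]:
        missing.append(metric)

print("metrics with no values:", missing or "none")

On a healthy cluster the script should report no missing metrics; on the affected builds it lists every queried metric.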


Actual results:
OCP Prometheus provides no values for any of the metrics listed above.

Expected results:
OCP Prometheus provides values for all metrics listed above.

Logs of the test run:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-034aikt1c33-t1/j-034aikt1c33-t1_20230704T064403/logs/ocs-ci-logs-1688456400/by_outcome/failed/tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py/test_ceph_metrics_available/logs

Must-gather logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-034aikt1c33-t1/j-034aikt1c33-t1_20230704T064403/logs/testcases_1688456400/

Comment 3 Travis Nielsen 2023-07-10 16:47:34 UTC
Avan PTAL

Comment 4 avan 2023-07-25 17:26:37 UTC
@Daniel,
Is this still reproducible?

Comment 16 Boris Ranto 2023-08-09 11:53:41 UTC
Done, I updated the defaults to use the first build of the new RHCS 6.1z2 (6-200). It should land in our builds starting tomorrow.
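
One way to double-check a build that includes the change is to read back which Ceph image the CephCluster is configured with and running; a minimal sketch, assuming (not taken from this report) that `oc` is logged in, ODF uses the default openshift-storage namespace, and the standard Rook fields spec.cephVersion.image / status.version are populated:

# Print the Ceph image and running version reported by the CephCluster CR
# so QA can confirm the cluster picked up the updated default build.
import json
import subprocess

out = subprocess.run(
    ["oc", "get", "cephcluster", "-n", "openshift-storage", "-o", "json"],
    check=True, capture_output=True, text=True,
).stdout

for item in json.loads(out)["items"]:
    name = item["metadata"]["name"]
    image = item.get("spec", {}).get("cephVersion", {}).get("image", "<unset>")
    running = item.get("status", {}).get("version", {}).get("version", "<unknown>")
    print(f"{name}: image={image} running={running}")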