Description of problem (please be as detailed as possible and provide log snippets):
ODF Monitoring is missing some of the ceph_* metric values.

List of missing metric values:
'ceph_bluefs_bytes_written_slow', 'ceph_bluefs_bytes_written_sst', 'ceph_bluefs_bytes_written_wal', 'ceph_bluefs_db_total_bytes', 'ceph_bluefs_db_used_bytes', 'ceph_bluefs_log_bytes', 'ceph_bluefs_logged_bytes', 'ceph_bluefs_num_files', 'ceph_bluefs_slow_total_bytes', 'ceph_bluefs_slow_used_bytes', 'ceph_bluefs_wal_total_bytes', 'ceph_bluefs_wal_used_bytes',
'ceph_bluestore_commit_lat_count', 'ceph_bluestore_commit_lat_sum', 'ceph_bluestore_kv_final_lat_count', 'ceph_bluestore_kv_final_lat_sum', 'ceph_bluestore_kv_flush_lat_count', 'ceph_bluestore_kv_flush_lat_sum', 'ceph_bluestore_kv_sync_lat_count', 'ceph_bluestore_kv_sync_lat_sum', 'ceph_bluestore_read_lat_count', 'ceph_bluestore_read_lat_sum', 'ceph_bluestore_state_aio_wait_lat_count', 'ceph_bluestore_state_aio_wait_lat_sum', 'ceph_bluestore_submit_lat_count', 'ceph_bluestore_submit_lat_sum', 'ceph_bluestore_throttle_lat_count', 'ceph_bluestore_throttle_lat_sum',
'ceph_mon_election_call', 'ceph_mon_election_lose', 'ceph_mon_election_win', 'ceph_mon_num_elections', 'ceph_mon_num_sessions', 'ceph_mon_session_add', 'ceph_mon_session_rm', 'ceph_mon_session_trim',
'ceph_objecter_op_active', 'ceph_objecter_op_active', 'ceph_objecter_op_r', 'ceph_objecter_op_r', 'ceph_objecter_op_rmw', 'ceph_objecter_op_rmw', 'ceph_objecter_op_w', 'ceph_objecter_op_w',
'ceph_osd_numpg', 'ceph_osd_numpg_removing', 'ceph_osd_op', 'ceph_osd_op_in_bytes', 'ceph_osd_op_latency_count', 'ceph_osd_op_latency_sum', 'ceph_osd_op_out_bytes', 'ceph_osd_op_prepare_latency_count', 'ceph_osd_op_prepare_latency_sum', 'ceph_osd_op_process_latency_count', 'ceph_osd_op_process_latency_sum', 'ceph_osd_op_r', 'ceph_osd_op_r_latency_count', 'ceph_osd_op_r_latency_sum', 'ceph_osd_op_r_out_bytes', 'ceph_osd_op_r_prepare_latency_count', 'ceph_osd_op_r_prepare_latency_sum', 'ceph_osd_op_r_process_latency_count', 'ceph_osd_op_r_process_latency_sum', 'ceph_osd_op_rw', 'ceph_osd_op_rw_in_bytes', 'ceph_osd_op_rw_latency_count', 'ceph_osd_op_rw_latency_sum', 'ceph_osd_op_rw_out_bytes', 'ceph_osd_op_rw_prepare_latency_count', 'ceph_osd_op_rw_prepare_latency_sum', 'ceph_osd_op_rw_process_latency_count', 'ceph_osd_op_rw_process_latency_sum', 'ceph_osd_op_w', 'ceph_osd_op_w_in_bytes', 'ceph_osd_op_w_latency_count', 'ceph_osd_op_w_latency_sum', 'ceph_osd_op_w_prepare_latency_count', 'ceph_osd_op_w_prepare_latency_sum', 'ceph_osd_op_w_process_latency_count', 'ceph_osd_op_w_process_latency_sum', 'ceph_osd_op_wip', 'ceph_osd_recovery_bytes', 'ceph_osd_recovery_ops', 'ceph_osd_stat_bytes', 'ceph_osd_stat_bytes_used',
'ceph_paxos_accept_timeout', 'ceph_paxos_begin', 'ceph_paxos_begin_bytes_count', 'ceph_paxos_begin_bytes_sum', 'ceph_paxos_begin_keys_count', 'ceph_paxos_begin_keys_sum', 'ceph_paxos_begin_latency_count', 'ceph_paxos_begin_latency_sum', 'ceph_paxos_collect', 'ceph_paxos_collect_bytes_count', 'ceph_paxos_collect_bytes_sum', 'ceph_paxos_collect_keys_count', 'ceph_paxos_collect_keys_sum', 'ceph_paxos_collect_latency_count', 'ceph_paxos_collect_latency_sum', 'ceph_paxos_collect_timeout', 'ceph_paxos_collect_uncommitted', 'ceph_paxos_commit', 'ceph_paxos_commit_bytes_count', 'ceph_paxos_commit_bytes_sum', 'ceph_paxos_commit_keys_count', 'ceph_paxos_commit_keys_sum', 'ceph_paxos_commit_latency_count', 'ceph_paxos_commit_latency_sum', 'ceph_paxos_lease_ack_timeout', 'ceph_paxos_lease_timeout', 'ceph_paxos_new_pn', 'ceph_paxos_new_pn_latency_count', 'ceph_paxos_new_pn_latency_sum', 'ceph_paxos_refresh', 'ceph_paxos_refresh_latency_count', 'ceph_paxos_refresh_latency_sum', 'ceph_paxos_restart', 'ceph_paxos_share_state', 'ceph_paxos_share_state_bytes_count', 'ceph_paxos_share_state_bytes_sum', 'ceph_paxos_share_state_keys_count', 'ceph_paxos_share_state_keys_sum', 'ceph_paxos_start_leader', 'ceph_paxos_start_peon', 'ceph_paxos_store_state', 'ceph_paxos_store_state_bytes_count', 'ceph_paxos_store_state_bytes_sum', 'ceph_paxos_store_state_keys_count', 'ceph_paxos_store_state_keys_sum', 'ceph_paxos_store_state_latency_count', 'ceph_paxos_store_state_latency_sum',
'ceph_rgw_cache_hit', 'ceph_rgw_cache_miss', 'ceph_rgw_failed_req', 'ceph_rgw_get', 'ceph_rgw_get_b', 'ceph_rgw_get_initial_lat_count', 'ceph_rgw_get_initial_lat_sum', 'ceph_rgw_keystone_token_cache_hit', 'ceph_rgw_keystone_token_cache_miss', 'ceph_rgw_put', 'ceph_rgw_put_b', 'ceph_rgw_put_initial_lat_count', 'ceph_rgw_put_initial_lat_sum', 'ceph_rgw_qactive', 'ceph_rgw_qlen', 'ceph_rgw_req',
'ceph_rocksdb_compact', 'ceph_rocksdb_compact_queue_len', 'ceph_rocksdb_compact_queue_merge', 'ceph_rocksdb_compact_range', 'ceph_rocksdb_get', 'ceph_rocksdb_get_latency_count', 'ceph_rocksdb_get_latency_sum', 'ceph_rocksdb_rocksdb_write_delay_time_count', 'ceph_rocksdb_rocksdb_write_delay_time_sum', 'ceph_rocksdb_rocksdb_write_memtable_time_count', 'ceph_rocksdb_rocksdb_write_memtable_time_sum', 'ceph_rocksdb_rocksdb_write_pre_and_post_time_count', 'ceph_rocksdb_rocksdb_write_pre_and_post_time_sum', 'ceph_rocksdb_rocksdb_write_wal_time_count', 'ceph_rocksdb_rocksdb_write_wal_time_sum', 'ceph_rocksdb_submit_latency_count', 'ceph_rocksdb_submit_latency_sum', 'ceph_rocksdb_submit_sync_latency_count', 'ceph_rocksdb_submit_sync_latency_sum'

Ceph metrics which should be present on a healthy cluster:
https://github.com/red-hat-storage/ocs-ci/blob/81ca20aed067a30dd109e0f29e026f2a18c752ee/ocs_ci/ocs/metrics.py#L70

Version of all relevant components (if applicable):
ODF-4.13.0-186.stable
OCP-4.13.0-0.nightly-2023-05-10-062807

Is this issue reproducible?
Yes (seen in about 8 CI runs)

Steps to Reproduce:
1. Install OCP/ODF cluster
2. After installation, check whether Prometheus provides values for the metrics listed above.

Actual results:
OCP Prometheus provides no values for any of the metrics listed above.

Expected results:
OCP Prometheus provides values for all metrics listed above.
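The check in step 2 could be scripted roughly as follows. This is only a sketch, not the ocs-ci implementation: it assumes the standard Prometheus HTTP API instant-query endpoint (`/api/v1/query`), and the helper names (`build_query_url`, `http_fetch`, `find_missing_metrics`), the base URL, and the bearer-token handling are hypothetical. TLS/CA setup for the OCP route is left out.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterable, List


def build_query_url(base_url: str, metric: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": metric})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def http_fetch(url: str, token: str = "") -> dict:
    """Fetch and decode a JSON response (hypothetical helper).

    The bearer token is an assumption about how the cluster's Prometheus
    route is secured; pass an empty string if none is needed.
    """
    request = urllib.request.Request(url)
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def find_missing_metrics(
    base_url: str,
    metrics: Iterable[str],
    fetch: Callable[[str], dict] = http_fetch,
) -> List[str]:
    """Return the metrics for which Prometheus reports no samples."""
    missing = []
    for metric in metrics:
        payload = fetch(build_query_url(base_url, metric))
        samples = payload.get("data", {}).get("result", [])
        # A metric counts as missing if the query failed or returned no series.
        if payload.get("status") != "success" or not samples:
            missing.append(metric)
    return missing
```

Passing the `fetch` callable in makes it possible to run the check against canned responses without a live cluster.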
Not a 4.13 blocker
Nishanth, please assign it to someone.
Do we have an ODF cluster or must-gather logs for this?
After the fix, the following metrics are still missing: 'ceph_bluestore_submit_lat_sum', 'ceph_bluestore_submit_lat_count', 'ceph_bluestore_throttle_lat_count', 'ceph_bluestore_commit_lat_sum', 'ceph_bluestore_throttle_lat_sum', 'ceph_rocksdb_get', 'ceph_bluestore_commit_lat_count'. Is this expected? Tested with ODF 4.13.0-214.
Hi @Filip, I investigated the missing metrics you reported. There seems to be a discrepancy in metric names on the ocs-ci side. The metrics exported by Ceph are actually named, for example, ceph_bluestore_txc_submit_lat_count, and similarly for the other metrics: https://github.com/ceph/ceph/blame/v17.2.6/src/os/bluestore/BlueStore.cc#L5076 I see that this was updated ~2 years ago in Ceph, while metrics.py in ocs-ci was last updated ~3 years ago, so ocs-ci needs to adopt the metric names coming from Ceph: https://github.com/red-hat-storage/ocs-ci/blob/master/ocs_ci/ocs/metrics.py#L116 The same goes for `ceph_rocksdb_get`: I don't see any metric exported by Ceph with that name, so it should be removed from the file. Hope this helps. Thanks
Also observing this bug on IBM Z as part of the following ocs-ci test execution: tests/manage/monitoring/prometheusmetrics/test_monitoring_negative.py::test_ceph_metrics_presence_when_osd_down
This is verified based on the discussion in thread https://chat.google.com/room/AAAAREGEba8/KoCb6Izr65o. A note will be needed in the release notes.

New metric names:
ceph_bluestore_submit_lat_sum -> ceph_bluestore_txc_submit_lat_sum
ceph_bluestore_submit_lat_count -> ceph_bluestore_txc_submit_lat_count
ceph_bluestore_throttle_lat_count -> ceph_bluestore_txc_throttle_lat_count
ceph_bluestore_commit_lat_sum -> ceph_bluestore_txc_commit_lat_sum
ceph_bluestore_throttle_lat_sum -> ceph_bluestore_txc_throttle_lat_sum
ceph_bluestore_commit_lat_count -> ceph_bluestore_txc_commit_lat_count

Metric ceph_rocksdb_get was removed because it was redundant; its data can be accessed from the metrics ceph_rocksdb_get_latency_sum and ceph_rocksdb_get_latency_count.

--> VERIFIED
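For anyone maintaining their own list of expected metrics, the renames above could be applied with something like the sketch below. The rename and removal data come from this comment; the names `RENAMED_METRICS`, `REMOVED_METRICS`, and `update_metric_names` are hypothetical and are not taken from ocs-ci.

```python
from typing import Iterable, List

# Old name -> new name, as listed in this bug.
RENAMED_METRICS = {
    "ceph_bluestore_submit_lat_sum": "ceph_bluestore_txc_submit_lat_sum",
    "ceph_bluestore_submit_lat_count": "ceph_bluestore_txc_submit_lat_count",
    "ceph_bluestore_throttle_lat_count": "ceph_bluestore_txc_throttle_lat_count",
    "ceph_bluestore_commit_lat_sum": "ceph_bluestore_txc_commit_lat_sum",
    "ceph_bluestore_throttle_lat_sum": "ceph_bluestore_txc_throttle_lat_sum",
    "ceph_bluestore_commit_lat_count": "ceph_bluestore_txc_commit_lat_count",
}

# Metrics that are no longer expected at all.
REMOVED_METRICS = {"ceph_rocksdb_get"}


def update_metric_names(names: Iterable[str]) -> List[str]:
    """Drop removed metrics and translate renamed ones, preserving order."""
    updated = []
    for name in names:
        if name in REMOVED_METRICS:
            continue
        updated.append(RENAMED_METRICS.get(name, name))
    return updated
```

Names that are neither renamed nor removed pass through unchanged, so the function can be run over a full expected-metrics list.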
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742