This bug was initially created as a copy of Bug #2203795

I am copying this bug because: Even though the missing metric names differ from the metrics missing in 4.13, the description of the problem and the parties to the discussion should be the same.

Description of problem (please be detailed as possible and provide log snippets):
ODF Monitoring is missing some of the ceph_* metric values. No related epic describing a change or rename was found.

List of missing metric values:
'ceph_bluestore_state_aio_wait_lat_sum', 'ceph_paxos_store_state_latency_sum', 'ceph_osd_op_out_bytes', 'ceph_bluestore_txc_submit_lat_sum', 'ceph_paxos_commit', 'ceph_paxos_new_pn_latency_count', 'ceph_osd_op_r_process_latency_count', 'ceph_bluestore_txc_submit_lat_count', 'ceph_bluestore_kv_final_lat_sum', 'ceph_paxos_collect_keys_sum', 'ceph_paxos_accept_timeout', 'ceph_paxos_begin_latency_count', 'ceph_bluefs_wal_total_bytes', 'ceph_paxos_refresh', 'ceph_bluestore_read_lat_count', 'ceph_mon_num_sessions', 'ceph_bluefs_bytes_written_wal', 'ceph_mon_num_elections', 'ceph_rocksdb_compact', 'ceph_bluestore_kv_sync_lat_sum', 'ceph_osd_op_process_latency_count', 'ceph_osd_op_w_prepare_latency_count', 'ceph_paxos_begin_latency_sum', 'ceph_osd_op_r', 'ceph_osd_op_rw_prepare_latency_sum', 'ceph_paxos_new_pn', 'ceph_rocksdb_get_latency_count', 'ceph_paxos_commit_latency_count', 'ceph_bluestore_txc_throttle_lat_count', 'ceph_paxos_lease_ack_timeout', 'ceph_bluestore_txc_commit_lat_sum', 'ceph_paxos_collect_bytes_sum', 'ceph_osd_op_rw_latency_count', 'ceph_paxos_collect_uncommitted', 'ceph_osd_op_rw_latency_sum', 'ceph_paxos_share_state', 'ceph_osd_op_r_prepare_latency_sum', 'ceph_bluestore_kv_flush_lat_sum', 'ceph_osd_op_rw_process_latency_sum', 'ceph_rocksdb_rocksdb_write_memtable_time_count', 'ceph_paxos_collect_latency_count', 'ceph_osd_op_rw_prepare_latency_count', 'ceph_paxos_collect_latency_sum', 'ceph_rocksdb_rocksdb_write_delay_time_count', 'ceph_paxos_begin_bytes_sum', 'ceph_osd_numpg', 
'ceph_osd_stat_bytes', 'ceph_rocksdb_submit_sync_latency_sum', 'ceph_rocksdb_compact_queue_merge', 'ceph_paxos_collect_bytes_count', 'ceph_osd_op', 'ceph_paxos_commit_keys_sum', 'ceph_osd_op_rw_in_bytes', 'ceph_osd_op_rw_out_bytes', 'ceph_bluefs_bytes_written_sst', 'ceph_osd_op_rw_process_latency_count', 'ceph_rocksdb_compact_queue_len', 'ceph_bluestore_txc_throttle_lat_sum', 'ceph_bluefs_slow_used_bytes', 'ceph_osd_op_r_latency_sum', 'ceph_bluestore_kv_flush_lat_count', 'ceph_rocksdb_compact_range', 'ceph_osd_op_latency_sum', 'ceph_mon_session_add', 'ceph_paxos_share_state_keys_count', 'ceph_paxos_collect', 'ceph_osd_op_w_in_bytes', 'ceph_osd_op_r_process_latency_sum', 'ceph_paxos_start_peon', 'ceph_mon_session_trim', 'ceph_rocksdb_get_latency_sum', 'ceph_osd_op_rw', 'ceph_paxos_store_state_keys_count', 'ceph_rocksdb_rocksdb_write_delay_time_sum', 'ceph_osd_recovery_ops', 'ceph_bluefs_logged_bytes', 'ceph_bluefs_db_total_bytes', 'ceph_osd_op_w_latency_count', 'ceph_bluestore_txc_commit_lat_count', 'ceph_bluestore_state_aio_wait_lat_count', 'ceph_paxos_begin_bytes_count', 'ceph_paxos_start_leader', 'ceph_mon_election_call', 'ceph_rocksdb_rocksdb_write_pre_and_post_time_count', 'ceph_mon_session_rm', 'ceph_paxos_store_state', 'ceph_paxos_store_state_bytes_count', 'ceph_osd_op_w_latency_sum', 'ceph_rocksdb_submit_latency_count', 'ceph_paxos_commit_latency_sum', 'ceph_rocksdb_rocksdb_write_memtable_time_sum', 'ceph_paxos_share_state_bytes_sum', 'ceph_osd_op_process_latency_sum', 'ceph_paxos_begin_keys_sum', 'ceph_rocksdb_rocksdb_write_pre_and_post_time_sum', 'ceph_bluefs_wal_used_bytes', 'ceph_rocksdb_rocksdb_write_wal_time_sum', 'ceph_osd_op_wip', 'ceph_paxos_lease_timeout', 'ceph_osd_op_r_out_bytes', 'ceph_paxos_begin_keys_count', 'ceph_bluestore_kv_sync_lat_count', 'ceph_osd_op_prepare_latency_count', 'ceph_bluefs_bytes_written_slow', 'ceph_rocksdb_submit_latency_sum', 'ceph_osd_op_r_latency_count', 'ceph_paxos_share_state_keys_sum', 
'ceph_paxos_store_state_bytes_sum', 'ceph_osd_op_latency_count', 'ceph_paxos_commit_bytes_count', 'ceph_paxos_restart', 'ceph_bluefs_slow_total_bytes', 'ceph_paxos_collect_timeout', 'ceph_osd_op_w_process_latency_sum', 'ceph_paxos_collect_keys_count', 'ceph_paxos_share_state_bytes_count', 'ceph_osd_op_w_prepare_latency_sum', 'ceph_bluestore_read_lat_sum', 'ceph_osd_stat_bytes_used', 'ceph_paxos_begin', 'ceph_mon_election_win', 'ceph_osd_op_w_process_latency_count', 'ceph_rocksdb_rocksdb_write_wal_time_count', 'ceph_paxos_store_state_keys_sum', 'ceph_osd_numpg_removing', 'ceph_paxos_commit_keys_count', 'ceph_paxos_new_pn_latency_sum', 'ceph_osd_op_in_bytes', 'ceph_paxos_store_state_latency_count', 'ceph_paxos_refresh_latency_count', 'ceph_osd_op_r_prepare_latency_count', 'ceph_bluefs_num_files', 'ceph_mon_election_lose', 'ceph_osd_op_prepare_latency_sum', 'ceph_bluefs_db_used_bytes', 'ceph_bluestore_kv_final_lat_count', 'ceph_paxos_refresh_latency_sum', 'ceph_osd_recovery_bytes', 'ceph_osd_op_w', 'ceph_paxos_commit_bytes_sum', 'ceph_bluefs_log_bytes', 'ceph_rocksdb_submit_sync_latency_count'

ceph metrics which should be present on a healthy cluster:
https://github.com/red-hat-storage/ocs-ci/blob/81ca20aed067a30dd109e0f29e026f2a18c752ee/ocs_ci/ocs/metrics.py#L70

Polarion documentation:
https://polarion.engineering.redhat.com/polarion/#/project/OpenShiftContainerStorage/workitem?id=OCS-958

Version of all relevant components (if applicable):

OC version:
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.14.0-0.nightly-2023-06-30-131338
Kubernetes Version: v1.27.3+ab0b8ee

OCS version:
ocs-operator.v4.14.0-36.stable OpenShift Container Storage 4.14.0-36.stable Succeeded

Cluster version:
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-06-30-131338   True        False         4d1h    Cluster version is 4.14.0-0.nightly-2023-06-30-131338

Rook version:
rook: v4.14.0-0.d8ce011027a26218154bcedf63a54e97f020df40
go: go1.20.4

Ceph version:
ceph version 17.2.6-70.el9cp (fe62dcdbb2c6e05782a3e2b67d025b84ff5047cc) quincy (stable)

Is this issue reproducible?
Yes, repeatable in CI runs and in local runs.

Steps to Reproduce:
1. Install OCP/ODF cluster
2. After installation, check whether Prometheus provides values for the metrics listed above.

Actual results:
OCP Prometheus provides no values for any of the metrics listed above.

Expected results:
OCP Prometheus provides values for all metrics listed above.

Logs of the test run:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-034aikt1c33-t1/j-034aikt1c33-t1_20230704T064403/logs/ocs-ci-logs-1688456400/by_outcome/failed/tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py/test_ceph_metrics_available/logs

must-gather logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-034aikt1c33-t1/j-034aikt1c33-t1_20230704T064403/logs/testcases_1688456400/
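For anyone reproducing step 2 by hand, the check can be sketched roughly as below. This is not the ocs-ci implementation; it is a minimal sketch that assumes the standard Prometheus HTTP API instant-query endpoint (`/api/v1/query`) and treats a metric as missing when the query returns no samples. The names `query_prometheus`, `find_missing_metrics`, `base_url` and `token` are hypothetical and introduced only for illustration.

```python
import json
import urllib.parse
import urllib.request


def query_prometheus(base_url, token, metric):
    """Run an instant query for one metric name and return the decoded JSON.

    base_url and token are placeholders for the cluster's Prometheus route
    and a bearer token with permission to query it (assumptions).
    """
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": metric})
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def find_missing_metrics(metric_names, query_fn):
    """Return the metric names for which the query yields no samples.

    query_fn takes a metric name and returns a Prometheus API response dict.
    Assumption: "missing" means a failed query or an empty result list.
    """
    missing = []
    for name in metric_names:
        body = query_fn(name)
        result = body.get("data", {}).get("result", [])
        if body.get("status") != "success" or not result:
            missing.append(name)
    return missing
```

Usage would look something like `find_missing_metrics(names, lambda m: query_prometheus(base_url, token, m))`, where `names` is the list of expected ceph_* metrics. Passing the query function in as a parameter keeps the check itself testable without a live cluster.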
Avan PTAL
@Daniel, is this still reproducible?
Done, I updated the defaults to use the new RHCS 6.1z2 first build (6-200). We should have it in our builds starting from tomorrow.
Fixed. The same automation test that previously failed (test_monitoring_reporting_ok_when_idle) now passes:

13:42:54 - MainThread - /Users/danielosypenko/Work/automation_4/ocs-ci/ocs_ci/utility/prometheus.py - INFO - No bad values detected
13:42:54 - MainThread - /Users/danielosypenko/Work/automation_4/ocs-ci/ocs_ci/utility/prometheus.py - INFO - No invalid values detected
13:42:54 - MainThread - test_monitoring_defaults - INFO - ceph_osd_in metric does indicate no problems with OSDs
PASSED
The BZ was moved to Verified by mistake.

List of missing metrics on OCP 4.14.0-0.nightly-2023-09-02-132842, ODF 4.14.0-125.stable:
['ceph_bluestore_state_aio_wait_lat_sum', 'ceph_paxos_store_state_latency_sum', 'ceph_osd_op_out_bytes', 'ceph_bluestore_txc_submit_lat_sum', 'ceph_paxos_commit', 'ceph_paxos_new_pn_latency_count', 'ceph_osd_op_r_process_latency_count', 'ceph_bluestore_txc_submit_lat_count', 'ceph_bluestore_kv_final_lat_sum', 'ceph_paxos_collect_keys_sum', 'ceph_paxos_accept_timeout', 'ceph_paxos_begin_latency_count', 'ceph_bluefs_wal_total_bytes', 'ceph_paxos_refresh', 'ceph_bluestore_read_lat_count', 'ceph_mon_num_sessions', 'ceph_objecter_op_rmw', 'ceph_bluefs_bytes_written_wal', 'ceph_mon_num_elections', 'ceph_rocksdb_compact', 'ceph_bluestore_kv_sync_lat_sum', 'ceph_osd_op_process_latency_count', 'ceph_osd_op_w_prepare_latency_count', 'ceph_objecter_op_active', 'ceph_paxos_begin_latency_sum', 'ceph_osd_op_r', 'ceph_osd_op_rw_prepare_latency_sum', 'ceph_paxos_new_pn', 'ceph_rgw_qlen', 'ceph_rgw_req', 'ceph_rocksdb_get_latency_count', 'ceph_rgw_cache_miss', 'ceph_paxos_commit_latency_count', 'ceph_bluestore_txc_throttle_lat_count', 'ceph_paxos_lease_ack_timeout', 'ceph_bluestore_txc_commit_lat_sum', 'ceph_paxos_collect_bytes_sum', 'ceph_osd_op_rw_latency_count', 'ceph_paxos_collect_uncommitted', 'ceph_osd_op_rw_latency_sum', 'ceph_paxos_share_state', 'ceph_osd_op_r_prepare_latency_sum', 'ceph_bluestore_kv_flush_lat_sum', 'ceph_osd_op_rw_process_latency_sum', 'ceph_rocksdb_rocksdb_write_memtable_time_count', 'ceph_paxos_collect_latency_count', 'ceph_osd_op_rw_prepare_latency_count', 'ceph_paxos_collect_latency_sum', 'ceph_rocksdb_rocksdb_write_delay_time_count', 'ceph_objecter_op_rmw', 'ceph_paxos_begin_bytes_sum', 'ceph_osd_numpg', 'ceph_osd_stat_bytes', 'ceph_rocksdb_submit_sync_latency_sum', 'ceph_rocksdb_compact_queue_merge', 'ceph_paxos_collect_bytes_count', 'ceph_osd_op', 'ceph_paxos_commit_keys_sum', 'ceph_osd_op_rw_in_bytes', 
'ceph_osd_op_rw_out_bytes', 'ceph_bluefs_bytes_written_sst', 'ceph_rgw_put', 'ceph_osd_op_rw_process_latency_count', 'ceph_rocksdb_compact_queue_len', 'ceph_bluestore_txc_throttle_lat_sum', 'ceph_bluefs_slow_used_bytes', 'ceph_osd_op_r_latency_sum', 'ceph_bluestore_kv_flush_lat_count', 'ceph_rocksdb_compact_range', 'ceph_osd_op_latency_sum', 'ceph_mon_session_add', 'ceph_paxos_share_state_keys_count', 'ceph_paxos_collect', 'ceph_osd_op_w_in_bytes', 'ceph_osd_op_r_process_latency_sum', 'ceph_paxos_start_peon', 'ceph_mon_session_trim', 'ceph_rocksdb_get_latency_sum', 'ceph_osd_op_rw', 'ceph_paxos_store_state_keys_count', 'ceph_rocksdb_rocksdb_write_delay_time_sum', 'ceph_objecter_op_r', 'ceph_objecter_op_active', 'ceph_objecter_op_w', 'ceph_osd_recovery_ops', 'ceph_bluefs_logged_bytes', 'ceph_bluefs_db_total_bytes', 'ceph_rgw_put_initial_lat_sum', 'ceph_osd_op_w_latency_count', 'ceph_rgw_put_initial_lat_count', 'ceph_bluestore_txc_commit_lat_count', 'ceph_bluestore_state_aio_wait_lat_count', 'ceph_paxos_begin_bytes_count', 'ceph_paxos_start_leader', 'ceph_mon_election_call', 'ceph_rocksdb_rocksdb_write_pre_and_post_time_count', 'ceph_mon_session_rm', 'ceph_paxos_store_state', 'ceph_paxos_store_state_bytes_count', 'ceph_osd_op_w_latency_sum', 'ceph_rgw_keystone_token_cache_hit', 'ceph_rocksdb_submit_latency_count', 'ceph_paxos_commit_latency_sum', 'ceph_rocksdb_rocksdb_write_memtable_time_sum', 'ceph_paxos_share_state_bytes_sum', 'ceph_osd_op_process_latency_sum', 'ceph_paxos_begin_keys_sum', 'ceph_rgw_qactive', 'ceph_rocksdb_rocksdb_write_pre_and_post_time_sum', 'ceph_bluefs_wal_used_bytes', 'ceph_rocksdb_rocksdb_write_wal_time_sum', 'ceph_osd_op_wip', 'ceph_rgw_get_initial_lat_sum', 'ceph_paxos_lease_timeout', 'ceph_osd_op_r_out_bytes', 'ceph_paxos_begin_keys_count', 'ceph_bluestore_kv_sync_lat_count', 'ceph_osd_op_prepare_latency_count', 'ceph_bluefs_bytes_written_slow', 'ceph_rocksdb_submit_latency_sum', 'ceph_osd_op_r_latency_count', 
'ceph_paxos_share_state_keys_sum', 'ceph_paxos_store_state_bytes_sum', 'ceph_osd_op_latency_count', 'ceph_paxos_commit_bytes_count', 'ceph_paxos_restart', 'ceph_rgw_get_initial_lat_count', 'ceph_bluefs_slow_total_bytes', 'ceph_paxos_collect_timeout', 'ceph_osd_op_w_process_latency_sum', 'ceph_paxos_collect_keys_count', 'ceph_paxos_share_state_bytes_count', 'ceph_osd_op_w_prepare_latency_sum', 'ceph_bluestore_read_lat_sum', 'ceph_osd_stat_bytes_used', 'ceph_paxos_begin', 'ceph_mon_election_win', 'ceph_osd_op_w_process_latency_count', 'ceph_rgw_get_b', 'ceph_rgw_failed_req', 'ceph_rocksdb_rocksdb_write_wal_time_count', 'ceph_rgw_keystone_token_cache_miss', 'ceph_paxos_store_state_keys_sum', 'ceph_osd_numpg_removing', 'ceph_paxos_commit_keys_count', 'ceph_paxos_new_pn_latency_sum', 'ceph_osd_op_in_bytes', 'ceph_paxos_store_state_latency_count', 'ceph_paxos_refresh_latency_count', 'ceph_rgw_get', 'ceph_osd_op_r_prepare_latency_count', 'ceph_rgw_cache_hit', 'ceph_objecter_op_w', 'ceph_objecter_op_r', 'ceph_bluefs_num_files', 'ceph_rgw_put_b', 'ceph_mon_election_lose', 'ceph_osd_op_prepare_latency_sum', 'ceph_bluefs_db_used_bytes', 'ceph_bluestore_kv_final_lat_count', 'ceph_paxos_refresh_latency_sum', 'ceph_osd_recovery_bytes', 'ceph_osd_op_w', 'ceph_paxos_commit_bytes_sum', 'ceph_bluefs_log_bytes', 'ceph_rocksdb_submit_sync_latency_count']
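
The re-test list above contains ceph_rgw_* and ceph_objecter_* names that do not appear in the original description, and a few names (for example 'ceph_objecter_op_rmw') are listed twice. A quick way to compare two such lists is sketched below. The helper names `newly_missing` and `count_by_component` are hypothetical, and the sample lists are short excerpts from this report rather than the full data; the sketch also assumes every name has the form ceph_<component>_....

```python
from collections import Counter


def newly_missing(earlier, later):
    """Return names present in the later list but not in the earlier one."""
    return sorted(set(later) - set(earlier))


def count_by_component(metric_names):
    """Count unique metric names per component, e.g. 'ceph_osd_op_r' -> 'osd'.

    Duplicates are collapsed first, since the reported list repeats some names.
    Assumes each name starts with 'ceph_' followed by the component.
    """
    return Counter(name.split("_")[1] for name in set(metric_names))


# Short excerpts from the two lists in this report (not the full lists).
earlier = ["ceph_osd_op", "ceph_paxos_commit"]
later = [
    "ceph_osd_op",
    "ceph_paxos_commit",
    "ceph_rgw_qlen",
    "ceph_objecter_op_rmw",
    "ceph_objecter_op_rmw",
]

print(newly_missing(earlier, later))
print(count_by_component(later))
```

Grouping by component makes it easier to see which daemons' metrics are affected when the list runs to well over a hundred entries.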
Verified, PASSED:
test_ceph_metrics_available http://pastebin.test.redhat.com/1108991
test_ceph_rbd_metrics_available http://pastebin.test.redhat.com/1108993
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:6832