+++ This bug was initially created as a clone of Bug #2221488 +++

This bug was initially created as a copy of Bug #2203795

I am copying this bug because: similarly to the original bug found on ODF 4.14, we are missing 166 metrics. The test test_ceph_metrics_available has failed constantly in the last 9 test runs (previously it passed stably).

must-gather logs: https://url.corp.redhat.com/mg-logs

csv
NAME                                    DISPLAY                       VERSION        REPLACES                                PHASE
mcg-operator.v4.13.4-rhodf              NooBaa Operator               4.13.4-rhodf   mcg-operator.v4.12.8-rhodf              Succeeded
ocs-operator.v4.13.4-rhodf              OpenShift Container Storage   4.13.4-rhodf   ocs-operator.v4.12.8-rhodf              Succeeded
odf-csi-addons-operator.v4.13.4-rhodf   CSI Addons                    4.13.4-rhodf   odf-csi-addons-operator.v4.12.8-rhodf   Succeeded
odf-operator.v4.13.4-rhodf              OpenShift Data Foundation     4.13.4-rhodf   odf-operator.v4.12.8-rhodf              Succeeded

Even though the missing metric names are different compared to the 4.13 missing metrics, the description of the problem and the parties to the discussion should be the same.

Description of problem (please be as detailed as possible and provide log snippets):

ODF Monitoring is missing some of the ceph_* metric values. No related epic providing a change/rename was found.

List of missing metric values:
'ceph_bluestore_state_aio_wait_lat_sum', 'ceph_paxos_store_state_latency_sum', 'ceph_osd_op_out_bytes', 'ceph_bluestore_txc_submit_lat_sum', 'ceph_paxos_commit', 'ceph_paxos_new_pn_latency_count', 'ceph_osd_op_r_process_latency_count', 'ceph_bluestore_txc_submit_lat_count', 'ceph_bluestore_kv_final_lat_sum', 'ceph_paxos_collect_keys_sum', 'ceph_paxos_accept_timeout', 'ceph_paxos_begin_latency_count', 'ceph_bluefs_wal_total_bytes', 'ceph_paxos_refresh', 'ceph_bluestore_read_lat_count', 'ceph_mon_num_sessions', 'ceph_bluefs_bytes_written_wal', 'ceph_mon_num_elections', 'ceph_rocksdb_compact', 'ceph_bluestore_kv_sync_lat_sum', 'ceph_osd_op_process_latency_count', 'ceph_osd_op_w_prepare_latency_count', 'ceph_paxos_begin_latency_sum', 'ceph_osd_op_r', 'ceph_osd_op_rw_prepare_latency_sum', 'ceph_paxos_new_pn', 'ceph_rocksdb_get_latency_count', 'ceph_paxos_commit_latency_count', 'ceph_bluestore_txc_throttle_lat_count', 'ceph_paxos_lease_ack_timeout', 'ceph_bluestore_txc_commit_lat_sum', 'ceph_paxos_collect_bytes_sum', 'ceph_osd_op_rw_latency_count', 'ceph_paxos_collect_uncommitted', 'ceph_osd_op_rw_latency_sum', 'ceph_paxos_share_state', 'ceph_osd_op_r_prepare_latency_sum', 'ceph_bluestore_kv_flush_lat_sum', 'ceph_osd_op_rw_process_latency_sum', 'ceph_rocksdb_rocksdb_write_memtable_time_count', 'ceph_paxos_collect_latency_count', 'ceph_osd_op_rw_prepare_latency_count', 'ceph_paxos_collect_latency_sum', 'ceph_rocksdb_rocksdb_write_delay_time_count', 'ceph_paxos_begin_bytes_sum', 'ceph_osd_numpg', 'ceph_osd_stat_bytes', 'ceph_rocksdb_submit_sync_latency_sum', 'ceph_rocksdb_compact_queue_merge', 'ceph_paxos_collect_bytes_count', 'ceph_osd_op', 'ceph_paxos_commit_keys_sum', 'ceph_osd_op_rw_in_bytes', 'ceph_osd_op_rw_out_bytes', 'ceph_bluefs_bytes_written_sst', 'ceph_osd_op_rw_process_latency_count', 'ceph_rocksdb_compact_queue_len', 'ceph_bluestore_txc_throttle_lat_sum', 'ceph_bluefs_slow_used_bytes', 'ceph_osd_op_r_latency_sum', 'ceph_bluestore_kv_flush_lat_count', 'ceph_rocksdb_compact_range', 'ceph_osd_op_latency_sum', 'ceph_mon_session_add', 'ceph_paxos_share_state_keys_count', 'ceph_paxos_collect', 'ceph_osd_op_w_in_bytes', 'ceph_osd_op_r_process_latency_sum', 'ceph_paxos_start_peon', 'ceph_mon_session_trim', 'ceph_rocksdb_get_latency_sum', 'ceph_osd_op_rw', 'ceph_paxos_store_state_keys_count',
'ceph_rocksdb_rocksdb_write_delay_time_sum', 'ceph_osd_recovery_ops', 'ceph_bluefs_logged_bytes', 'ceph_bluefs_db_total_bytes', 'ceph_osd_op_w_latency_count', 'ceph_bluestore_txc_commit_lat_count', 'ceph_bluestore_state_aio_wait_lat_count', 'ceph_paxos_begin_bytes_count', 'ceph_paxos_start_leader', 'ceph_mon_election_call', 'ceph_rocksdb_rocksdb_write_pre_and_post_time_count', 'ceph_mon_session_rm', 'ceph_paxos_store_state', 'ceph_paxos_store_state_bytes_count', 'ceph_osd_op_w_latency_sum', 'ceph_rocksdb_submit_latency_count', 'ceph_paxos_commit_latency_sum', 'ceph_rocksdb_rocksdb_write_memtable_time_sum', 'ceph_paxos_share_state_bytes_sum', 'ceph_osd_op_process_latency_sum', 'ceph_paxos_begin_keys_sum', 'ceph_rocksdb_rocksdb_write_pre_and_post_time_sum', 'ceph_bluefs_wal_used_bytes', 'ceph_rocksdb_rocksdb_write_wal_time_sum', 'ceph_osd_op_wip', 'ceph_paxos_lease_timeout', 'ceph_osd_op_r_out_bytes', 'ceph_paxos_begin_keys_count', 'ceph_bluestore_kv_sync_lat_count', 'ceph_osd_op_prepare_latency_count', 'ceph_bluefs_bytes_written_slow', 'ceph_rocksdb_submit_latency_sum', 'ceph_osd_op_r_latency_count', 'ceph_paxos_share_state_keys_sum', 'ceph_paxos_store_state_bytes_sum', 'ceph_osd_op_latency_count', 'ceph_paxos_commit_bytes_count', 'ceph_paxos_restart', 'ceph_bluefs_slow_total_bytes', 'ceph_paxos_collect_timeout', 'ceph_osd_op_w_process_latency_sum', 'ceph_paxos_collect_keys_count', 'ceph_paxos_share_state_bytes_count', 'ceph_osd_op_w_prepare_latency_sum', 'ceph_bluestore_read_lat_sum', 'ceph_osd_stat_bytes_used', 'ceph_paxos_begin', 'ceph_mon_election_win', 'ceph_osd_op_w_process_latency_count', 'ceph_rocksdb_rocksdb_write_wal_time_count', 'ceph_paxos_store_state_keys_sum', 'ceph_osd_numpg_removing', 'ceph_paxos_commit_keys_count', 'ceph_paxos_new_pn_latency_sum', 'ceph_osd_op_in_bytes', 'ceph_paxos_store_state_latency_count', 'ceph_paxos_refresh_latency_count', 'ceph_osd_op_r_prepare_latency_count', 'ceph_bluefs_num_files', 'ceph_mon_election_lose', 'ceph_osd_op_prepare_latency_sum', 'ceph_bluefs_db_used_bytes', 'ceph_bluestore_kv_final_lat_count', 'ceph_paxos_refresh_latency_sum', 'ceph_osd_recovery_bytes', 'ceph_osd_op_w', 'ceph_paxos_commit_bytes_sum', 'ceph_bluefs_log_bytes', 'ceph_rocksdb_submit_sync_latency_count',

Ceph metrics which should be present on a healthy cluster: https://github.com/red-hat-storage/ocs-ci/blob/81ca20aed067a30dd109e0f29e026f2a18c752ee/ocs_ci/ocs/metrics.py#L70

Polarion documentation: https://polarion.engineering.redhat.com/polarion/#/project/OpenShiftContainerStorage/workitem?id=OCS-958

Version of all relevant components (if applicable):

OC version:
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.14.0-0.nightly-2023-06-30-131338
Kubernetes Version: v1.27.3+ab0b8ee

OCS version:
ocs-operator.v4.14.0-36.stable OpenShift Container Storage 4.14.0-36.stable Succeeded

Cluster version:
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.14.0-0.nightly-2023-06-30-131338 True False 4d1h Cluster version is 4.14.0-0.nightly-2023-06-30-131338

Rook version:
rook: v4.14.0-0.d8ce011027a26218154bcedf63a54e97f020df40
go: go1.20.4

Ceph version:
ceph version 17.2.6-70.el9cp (fe62dcdbb2c6e05782a3e2b67d025b84ff5047cc) quincy (stable)

Is this issue reproducible? Yes, repeatable in CI runs and with local runs.

Steps to Reproduce:
1. Install an OCP/ODF cluster.
2. After installation, check whether Prometheus provides values for the metrics listed above.

Actual results:
OCP Prometheus provides no values for any of the metrics listed above.
Expected results:
OCP Prometheus provides values for all metrics listed above.

Logs of the test run: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-034aikt1c33-t1/j-034aikt1c33-t1_20230704T064403/logs/ocs-ci-logs-1688456400/by_outcome/failed/tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py/test_ceph_metrics_available/logs

must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-034aikt1c33-t1/j-034aikt1c33-t1_20230704T064403/logs/testcases_1688456400/

--- Additional comment from RHEL Program Management on 2023-07-09 12:02:36 UTC ---

This bug, having no release flag set previously, is now set with release flag 'odf-4.14.0' to '?', and so is being proposed to be fixed at the ODF 4.14.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since the Acks are to be set against a release flag.

--- Additional comment from RHEL Program Management on 2023-07-09 12:02:36 UTC ---

The 'Target Release' is not to be set manually at the Red Hat OpenShift Data Foundation product. The 'Target Release' will be auto-set appropriately after the 3 Acks (pm, devel, qa) are set to "+" for a specific release flag and that release flag gets auto-set to "+".

--- Additional comment from Travis Nielsen on 2023-07-10 16:47:34 UTC ---

Avan PTAL

--- Additional comment from avan on 2023-07-25 17:26:37 UTC ---

@Daniel, is this still reproducible?

--- Additional comment from Daniel Osypenko on 2023-07-26 08:21:32 UTC ---

@athakkar OCS 4.14.0-77 still fails

--- Additional comment from avan on 2023-08-01 10:45:40 UTC ---

(In reply to Daniel Osypenko from comment #5)
> @athakkar OCS 4.14.0-77 still fails

Currently the ceph-exporter is disabled for the 4.14 build, as some issues were detected in upstream Ceph. The plan is to get the fix delivered in 6.1z2 and then enable the exporter in the 4.14 release branch of the rook repo this week.

--- Additional comment from Travis Nielsen on 2023-08-01 15:12:18 UTC ---

Avan, was the exporter disabled in Ceph? If so, we can move this BZ over to the ceph component.

--- Additional comment from avan on 2023-08-02 09:37:26 UTC ---

(In reply to Travis Nielsen from comment #7)
> Avan Was the exporter disabled in Ceph? If so, we can move this BZ over to
> the ceph component

No, I mean it was disabled on the rook end. By the way, the fixes for the exporter are merged upstream, so they will soon be backported to downstream 6.1z2.

--- Additional comment from Travis Nielsen on 2023-08-02 16:32:13 UTC ---

Oh right, upstream requires a minimum version of v18 for the exporter to be enabled, which means it's disabled in 4.14 until we change MinVersionForCephExporter again.

--- Additional comment from Red Hat Bugzilla on 2023-08-03 08:28:02 UTC ---

Account disabled by LDAP Audit

--- Additional comment from Mudit Agarwal on 2023-08-08 05:35:34 UTC ---

Avan, please add the link to the Ceph BZ/PR which has the exporter changes. Also, when are we planning to enable it from the rook side?

Elad, please provide qa ack.
--- Additional comment from RHEL Program Management on 2023-08-08 06:31:57 UTC ---

This BZ is being approved for the ODF 4.14.0 release, upon receipt of the 3 ACKs (PM, Devel, QA) for the release flag 'odf-4.14.0'.

--- Additional comment from RHEL Program Management on 2023-08-08 06:31:57 UTC ---

Since this bug has been approved for the ODF 4.14.0 release through release flag 'odf-4.14.0+', the Target Release is being set to 'ODF 4.14.0'.

--- Additional comment from avan on 2023-08-08 06:33:39 UTC ---

(In reply to Mudit Agarwal from comment #11)
> Avan, please add the link of Ceph BZ/PR which has the exporter changes?
> Also, when are we planning it to be enabled from rook side?
>
> Elad, please provide qa ack.

Ceph BZs:
https://bugzilla.redhat.com/show_bug.cgi?id=2217817
https://bugzilla.redhat.com/show_bug.cgi?id=2229267

Once these BZs are moved to ON_QA (once we have a new build), it can be enabled in rook for the 4.14 release branch.

--- Additional comment from avan on 2023-08-09 11:20:02 UTC ---

@kdreyer @branto Given that we have the new Ceph image ready with the required exporter changes (https://bugzilla.redhat.com/show_bug.cgi?id=2217817#c3), can you help make sure that ODF 4.14 uses this new image for testing?

--- Additional comment from Boris Ranto on 2023-08-09 11:53:41 UTC ---

Done, I updated the defaults to use the first RHCS 6.1z2 build (6-200). We should have it in our builds starting from tomorrow.

--- Additional comment from errata-xmlrpc on 2023-08-10 03:50:33 UTC ---

This bug has been added to advisory RHBA-2023:115514 by the ceph-build service account (ceph-build.COM).

--- Additional comment from Daniel Osypenko on 2023-08-31 11:29:03 UTC ---

Fixed. The same automation test that previously failed (test_monitoring_reporting_ok_when_idle) now passes:

13:42:54 - MainThread - /Users/danielosypenko/Work/automation_4/ocs-ci/ocs_ci/utility/prometheus.py - INFO - No bad values detected
13:42:54 - MainThread - /Users/danielosypenko/Work/automation_4/ocs-ci/ocs_ci/utility/prometheus.py - INFO - No invalid values detected
13:42:54 - MainThread - test_monitoring_defaults - INFO - ceph_osd_in metric does indicate no problems with OSDs
PASSED

--- Additional comment from Daniel Osypenko on 2023-09-04 10:13:06 UTC ---

The BZ was moved to Verified by mistake.
List of missing metrics on OCP 4.14.0-0.nightly-2023-09-02-132842 ODF 4.14.0-125.stable ['ceph_bluestore_state_aio_wait_lat_sum', 'ceph_paxos_store_state_latency_sum', 'ceph_osd_op_out_bytes', 'ceph_bluestore_txc_submit_lat_sum', 'ceph_paxos_commit', 'ceph_paxos_new_pn_latency_count', 'ceph_osd_op_r_process_latency_count', 'ceph_bluestore_txc_submit_lat_count', 'ceph_bluestore_kv_final_lat_sum', 'ceph_paxos_collect_keys_sum', 'ceph_paxos_accept_timeout', 'ceph_paxos_begin_latency_count', 'ceph_bluefs_wal_total_bytes', 'ceph_paxos_refresh', 'ceph_bluestore_read_lat_count', 'ceph_mon_num_sessions', 'ceph_objecter_op_rmw', 'ceph_bluefs_bytes_written_wal', 'ceph_mon_num_elections', 'ceph_rocksdb_compact', 'ceph_bluestore_kv_sync_lat_sum', 'ceph_osd_op_process_latency_count', 'ceph_osd_op_w_prepare_latency_count', 'ceph_objecter_op_active', 'ceph_paxos_begin_latency_sum', 'ceph_osd_op_r', 'ceph_osd_op_rw_prepare_latency_sum', 'ceph_paxos_new_pn', 'ceph_rgw_qlen', 'ceph_rgw_req', 'ceph_rocksdb_get_latency_count', 'ceph_rgw_cache_miss', 'ceph_paxos_commit_latency_count', 'ceph_bluestore_txc_throttle_lat_count', 'ceph_paxos_lease_ack_timeout', 'ceph_bluestore_txc_commit_lat_sum', 'ceph_paxos_collect_bytes_sum', 'ceph_osd_op_rw_latency_count', 'ceph_paxos_collect_uncommitted', 'ceph_osd_op_rw_latency_sum', 'ceph_paxos_share_state', 'ceph_osd_op_r_prepare_latency_sum', 'ceph_bluestore_kv_flush_lat_sum', 'ceph_osd_op_rw_process_latency_sum', 'ceph_rocksdb_rocksdb_write_memtable_time_count', 'ceph_paxos_collect_latency_count', 'ceph_osd_op_rw_prepare_latency_count', 'ceph_paxos_collect_latency_sum', 'ceph_rocksdb_rocksdb_write_delay_time_count', 'ceph_objecter_op_rmw', 'ceph_paxos_begin_bytes_sum', 'ceph_osd_numpg', 'ceph_osd_stat_bytes', 'ceph_rocksdb_submit_sync_latency_sum', 'ceph_rocksdb_compact_queue_merge', 'ceph_paxos_collect_bytes_count', 'ceph_osd_op', 'ceph_paxos_commit_keys_sum', 'ceph_osd_op_rw_in_bytes', 'ceph_osd_op_rw_out_bytes', 'ceph_bluefs_bytes_written_sst', 'ceph_rgw_put', 'ceph_osd_op_rw_process_latency_count', 'ceph_rocksdb_compact_queue_len', 'ceph_bluestore_txc_throttle_lat_sum', 'ceph_bluefs_slow_used_bytes', 'ceph_osd_op_r_latency_sum', 'ceph_bluestore_kv_flush_lat_count', 'ceph_rocksdb_compact_range', 'ceph_osd_op_latency_sum', 'ceph_mon_session_add', 'ceph_paxos_share_state_keys_count', 'ceph_paxos_collect', 'ceph_osd_op_w_in_bytes', 'ceph_osd_op_r_process_latency_sum', 'ceph_paxos_start_peon', 'ceph_mon_session_trim', 'ceph_rocksdb_get_latency_sum', 'ceph_osd_op_rw', 'ceph_paxos_store_state_keys_count', 'ceph_rocksdb_rocksdb_write_delay_time_sum', 'ceph_objecter_op_r', 'ceph_objecter_op_active', 'ceph_objecter_op_w', 'ceph_osd_recovery_ops', 'ceph_bluefs_logged_bytes', 'ceph_bluefs_db_total_bytes', 'ceph_rgw_put_initial_lat_sum', 'ceph_osd_op_w_latency_count', 'ceph_rgw_put_initial_lat_count', 'ceph_bluestore_txc_commit_lat_count', 'ceph_bluestore_state_aio_wait_lat_count', 'ceph_paxos_begin_bytes_count', 'ceph_paxos_start_leader', 'ceph_mon_election_call', 'ceph_rocksdb_rocksdb_write_pre_and_post_time_count', 'ceph_mon_session_rm', 'ceph_paxos_store_state', 'ceph_paxos_store_state_bytes_count', 'ceph_osd_op_w_latency_sum', 'ceph_rgw_keystone_token_cache_hit', 'ceph_rocksdb_submit_latency_count', 'ceph_paxos_commit_latency_sum', 'ceph_rocksdb_rocksdb_write_memtable_time_sum', 'ceph_paxos_share_state_bytes_sum', 'ceph_osd_op_process_latency_sum', 'ceph_paxos_begin_keys_sum', 'ceph_rgw_qactive', 'ceph_rocksdb_rocksdb_write_pre_and_post_time_sum', 
'ceph_bluefs_wal_used_bytes', 'ceph_rocksdb_rocksdb_write_wal_time_sum', 'ceph_osd_op_wip', 'ceph_rgw_get_initial_lat_sum', 'ceph_paxos_lease_timeout', 'ceph_osd_op_r_out_bytes', 'ceph_paxos_begin_keys_count', 'ceph_bluestore_kv_sync_lat_count', 'ceph_osd_op_prepare_latency_count', 'ceph_bluefs_bytes_written_slow', 'ceph_rocksdb_submit_latency_sum', 'ceph_osd_op_r_latency_count', 'ceph_paxos_share_state_keys_sum', 'ceph_paxos_store_state_bytes_sum', 'ceph_osd_op_latency_count', 'ceph_paxos_commit_bytes_count', 'ceph_paxos_restart', 'ceph_rgw_get_initial_lat_count', 'ceph_bluefs_slow_total_bytes', 'ceph_paxos_collect_timeout', 'ceph_osd_op_w_process_latency_sum', 'ceph_paxos_collect_keys_count', 'ceph_paxos_share_state_bytes_count', 'ceph_osd_op_w_prepare_latency_sum', 'ceph_bluestore_read_lat_sum', 'ceph_osd_stat_bytes_used', 'ceph_paxos_begin', 'ceph_mon_election_win', 'ceph_osd_op_w_process_latency_count', 'ceph_rgw_get_b', 'ceph_rgw_failed_req', 'ceph_rocksdb_rocksdb_write_wal_time_count', 'ceph_rgw_keystone_token_cache_miss', 'ceph_paxos_store_state_keys_sum', 'ceph_osd_numpg_removing', 'ceph_paxos_commit_keys_count', 'ceph_paxos_new_pn_latency_sum', 'ceph_osd_op_in_bytes', 'ceph_paxos_store_state_latency_count', 'ceph_paxos_refresh_latency_count', 'ceph_rgw_get', 'ceph_osd_op_r_prepare_latency_count', 'ceph_rgw_cache_hit', 'ceph_objecter_op_w', 'ceph_objecter_op_r', 'ceph_bluefs_num_files', 'ceph_rgw_put_b', 'ceph_mon_election_lose', 'ceph_osd_op_prepare_latency_sum', 'ceph_bluefs_db_used_bytes', 'ceph_bluestore_kv_final_lat_count', 'ceph_paxos_refresh_latency_sum', 'ceph_osd_recovery_bytes', 'ceph_osd_op_w', 'ceph_paxos_commit_bytes_sum', 'ceph_bluefs_log_bytes', 'ceph_rocksdb_submit_sync_latency_count']

--- Additional comment from avan on 2023-09-05 12:24:27 UTC ---

There is currently a fix under review: https://github.com/red-hat-storage/rook/pull/516

--- Additional comment from Travis Nielsen on 2023-09-05 15:29:15 UTC ---

PR 516 has now been merged.

--- Additional comment from Daniel Osypenko on 2023-09-07 10:12:21 UTC ---

Verified, PASSED:
test_ceph_metrics_available http://pastebin.test.redhat.com/1108991
test_ceph_rbd_metrics_available http://pastebin.test.redhat.com/1108993

--- Additional comment from Sunil Kumar Acharya on 2023-09-21 05:54:14 UTC ---

Please update the requires_doc_text (RDT) flag/text appropriately.
So, this is applicable for 4.13 only because bug #2221488 is already fixed for 4.14?
@muagarwa yes. 4.14 is now stable and passing the tests. 4.13 is failing constantly, now missing 142 metrics. 4.10, 4.11 and 4.12 have a 100% pass ratio. Full list of missing metrics: http://pastebin.test.redhat.com/1110338

Regarding the summary, sorry for the confusion; it was taken from the original BZ and should be "ODF Monitoring is missing some of the metric values 4.13".
After running `ceph config set mgr mgr/prometheus/exclude_perf_counters false`, here is the list of metrics that are still missing (test test_ceph_metrics_available):

ceph_rgw_put
ceph_rgw_put_initial_lat_sum
ceph_rgw_put_initial_lat_count
ceph_rgw_keystone_token_cache_hit
ceph_rgw_metadata
ceph_rgw_qactive
ceph_rgw_get_initial_lat_sum
ceph_rgw_get_initial_lat_count
ceph_rgw_get_b
ceph_rgw_failed_req
ceph_rgw_keystone_token_cache_miss
ceph_rgw_get
ceph_rgw_cache_hit
ceph_rgw_put_b
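For anyone reproducing the workaround above, a minimal sketch (not part of the original report) of checking and toggling that mgr option from a script. It assumes the ceph CLI is reachable through a rook-ceph toolbox deployment named rook-ceph-tools in the openshift-storage namespace; both names are assumptions, adjust them to the actual cluster.

import subprocess

# Hypothetical helper: run a ceph command through the rook-ceph toolbox.
# The deployment name "rook-ceph-tools" and namespace "openshift-storage"
# are assumptions and may need to be adjusted.
def ceph(*args: str) -> str:
    cmd = ["oc", "-n", "openshift-storage", "rsh", "deploy/rook-ceph-tools", "ceph", *args]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout.strip()

# Show the current value of the option discussed in this bug.
print(ceph("config", "get", "mgr", "mgr/prometheus/exclude_perf_counters"))

# Apply the workaround from the comment above: export perf counters again.
ceph("config", "set", "mgr", "mgr/prometheus/exclude_perf_counters", "false")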
The ceph_rgw metrics are not showing up because no RGW service is available on the cluster. The test now passes. I can move it to VERIFIED once it is ON_QA. Regards
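Since the remaining gap is explained by the missing RGW service, here is a hedged sketch of how one might confirm whether RGW pods exist before expecting ceph_rgw_* metrics; the openshift-storage namespace and the app=rook-ceph-rgw label are common Rook/ODF defaults and are assumptions here, not taken from this bug's logs.

import subprocess

# Hypothetical check: list RGW pods by the label Rook typically applies to them.
# Namespace and label selector are assumptions; adjust for the actual cluster.
result = subprocess.run(
    ["oc", "-n", "openshift-storage", "get", "pods", "-l", "app=rook-ceph-rgw", "--no-headers"],
    capture_output=True, text=True,
)
if result.stdout.strip():
    print("RGW pods found; ceph_rgw_* metrics should be available.")
else:
    print("No RGW pods; ceph_rgw_* metrics are expected to be absent.")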
Moving to ON_QA as discussed
Can you please specify the version I need to verify? Regards
OC version:
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.13.0-0.nightly-2023-11-14-104446
Kubernetes Version: v1.26.9+636f2be

OCS version:
ocs-operator.v4.13.5-rhodf OpenShift Container Storage 4.13.5-rhodf ocs-operator.v4.13.4-rhodf Succeeded

Cluster version:
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.13.0-0.nightly-2023-11-14-104446 True False 46m Cluster version is 4.13.0-0.nightly-2023-11-14-104446

Rook version:
rook: v4.13.5-0.42f43768ad57d91be47327f83653c05eeb721977
go: go1.19.13

Ceph version:
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)

Missing 166 metrics: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/31288/console
Regarding the recommendation to unset mgr/prometheus/exclude_perf_counters: did we have such a flag (mgr/prometheus/exclude_perf_counters) before, and does it mean that by default a user would not see these metrics, and that this is acceptable / documented? Thanks @athar.lh
Moving the bug to 4.13.7, as we are doing a quick 4.13.6 to include a critical fix in RGW (bug 2254303) before the shutdown.
test_ceph_metrics_available fails. 166 missing metrics:

ceph_bluestore_state_aio_wait_lat_sum,ceph_paxos_store_state_latency_sum,ceph_osd_op_out_bytes,ceph_bluestore_txc_submit_lat_sum,ceph_paxos_commit,ceph_paxos_new_pn_latency_count,ceph_osd_op_r_process_latency_count,ceph_bluestore_txc_submit_lat_count,ceph_bluestore_kv_final_lat_sum,ceph_paxos_collect_keys_sum,ceph_paxos_accept_timeout,ceph_paxos_begin_latency_count,ceph_bluefs_wal_total_bytes,ceph_paxos_refresh,ceph_bluestore_read_lat_count,ceph_mon_num_sessions,ceph_objecter_op_rmw,ceph_bluefs_bytes_written_wal,ceph_mon_num_elections,ceph_rocksdb_compact,ceph_bluestore_kv_sync_lat_sum,ceph_osd_op_process_latency_count,ceph_osd_op_w_prepare_latency_count,ceph_objecter_op_active,ceph_paxos_begin_latency_sum,ceph_osd_op_r,ceph_osd_op_rw_prepare_latency_sum,ceph_paxos_new_pn,ceph_rgw_qlen,ceph_rgw_req,ceph_rocksdb_get_latency_count,ceph_rgw_cache_miss,ceph_paxos_commit_latency_count,ceph_bluestore_txc_throttle_lat_count,ceph_paxos_lease_ack_timeout,ceph_bluestore_txc_commit_lat_sum,ceph_paxos_collect_bytes_sum,ceph_osd_op_rw_latency_count,ceph_paxos_collect_uncommitted,ceph_osd_op_rw_latency_sum,ceph_paxos_share_state,ceph_osd_op_r_prepare_latency_sum,ceph_bluestore_kv_flush_lat_sum,ceph_osd_op_rw_process_latency_sum,ceph_rocksdb_rocksdb_write_memtable_time_count,ceph_paxos_collect_latency_count,ceph_osd_op_rw_prepare_latency_count,ceph_paxos_collect_latency_sum,ceph_rocksdb_rocksdb_write_delay_time_count,ceph_objecter_op_rmw,ceph_paxos_begin_bytes_sum,ceph_osd_numpg,ceph_osd_stat_bytes,ceph_rocksdb_submit_sync_latency_sum

ODF 4.13.7-rhodf
vSphere UPI deployment
OCP 4.13.0-0.nightly-2024-01-17-100523

Elaborating on what "missing metrics" means: when we run a query for one of these metrics, Prometheus returns a successful response with an empty result set ({'status': 'success', 'data': {'resultType': 'vector', 'result': []}}), i.e. no data is available for that metric.
Logs:
2024-01-17 18:08:34,071 - MainThread - INFO - /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/utility/prometheus.py.query.504 - Performing prometheus instant query 'ceph_bluestore_state_aio_wait_lat_sum'
2024-01-17 18:08:34,071 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/utility/prometheus.py.get.422 - GET https://prometheus-k8s-openshift-monitoring.apps.dosypenk-171.qe.rh-ocs.com/api/v1/query
2024-01-17 18:08:34,071 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/utility/prometheus.py.get.423 - headers={'Authorization': 'Bearer sha256~f9UfvhOsP02LNP5oLPx9uQhph3oSHYJpL6qaPBH7wlk'}
2024-01-17 18:08:34,071 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/utility/prometheus.py.get.424 - verify=False
2024-01-17 18:08:34,072 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/utility/prometheus.py.get.425 - params={'query': 'ceph_bluestore_state_aio_wait_lat_sum'}
2024-01-17 18:08:34,073 - MainThread - DEBUG - urllib3.connectionpool._new_conn.1019 - Starting new HTTPS connection (1): prometheus-k8s-openshift-monitoring.apps.dosypenk-171.qe.rh-ocs.com:443
2024-01-17 18:08:34,107 - MainThread - DEBUG - urllib3.connectionpool._make_request.474 - https://prometheus-k8s-openshift-monitoring.apps.dosypenk-171.qe.rh-ocs.com:443 "GET /api/v1/query?query=ceph_bluestore_state_aio_wait_lat_sum HTTP/1.1" 200 87
2024-01-17 18:08:34,108 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/utility/prometheus.py.validate_status.304 - content value: {'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
2024-01-17 18:08:34,109 - MainThread - ERROR - ocs_ci.ocs.metrics.get_missing_metrics.352 - failed to get results for ceph_bluestore_state_aio_wait_lat_sum
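For illustration, a minimal sketch of the check the log above performs: an instant query against the OpenShift Prometheus route, where a 200 response with an empty result vector is what the test counts as a missing metric. The route host and token below are placeholders; in ocs-ci this logic lives in ocs_ci/utility/prometheus.py and ocs_ci/ocs/metrics.py.

import requests

# Placeholders -- substitute the real Prometheus route and a valid bearer token.
PROM_URL = "https://prometheus-k8s-openshift-monitoring.apps.example.com/api/v1/query"
TOKEN = "<bearer-token>"

def metric_is_missing(metric: str) -> bool:
    """Return True when Prometheus has no samples at all for the given metric."""
    resp = requests.get(
        PROM_URL,
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"query": metric},
        verify=False,  # matches the verify=False seen in the test log above
    )
    resp.raise_for_status()
    body = resp.json()
    # A present metric returns a non-empty result vector; a missing one returns
    # {'status': 'success', 'data': {'resultType': 'vector', 'result': []}}.
    return body["status"] == "success" and not body["data"]["result"]

print(metric_is_missing("ceph_bluestore_state_aio_wait_lat_sum"))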
@athakkar confirming: with `ceph config get mgr mgr/prometheus/exclude_perf_counters` returning false, the metrics become visible on ODF 4.13 / OCP 4.13.
This got assigned to me when we moved it to Rook. Assigning it back to Avan.
Verified on IBM Cloud deployment:
* ODF 4.13.8-1
* OCP 4.13.0-0.nightly-2024-03-08-182318

test_ceph_rbd_metrics_available - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/34912/console
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.8 Bug Fix Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2024:1657