Bug 2242324
| Summary: | ODF Monitoring is missing some of the metric values 4.13 | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Daniel Osypenko <dosypenk> |
| Component: | rook | Assignee: | avan <athakkar> |
| Status: | CLOSED ERRATA | QA Contact: | Neha Berry <nberry> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.13 | CC: | athakkar, athar.lh, branto, ebenahar, fbalak, hnallurv, kdreyer, kramdoss, muagarwa, murtaza.8060, nthomas, odf-bz-bot, rcyriac, sheggodu, tnielsen |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | ODF 4.13.8 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.13.8-1 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 2221488 | Environment: | |
| Last Closed: | 2024-04-03 07:03:00 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2221488, 2253428, 2253429 | | |
| Bug Blocks: | | | |
Description
Daniel Osypenko
2023-10-05 15:12:20 UTC
So, is this applicable to 4.13 only because bug #2221488 is already fixed in 4.14?

@muagarwa yes. 4.14 is stable and passing tests now. 4.13 is constantly failing, currently missing 142 metrics. 4.10, 4.11, and 4.12 have a 100% pass ratio. Full list of missing metrics: http://pastebin.test.redhat.com/1110338

Regarding the summary, sorry for the confusion; it was taken from the original BZ and should be "ODF Monitoring is missing some of the metric values 4.13".

After running `ceph config set mgr mgr/prometheus/exclude_perf_counters false`, this is the list of the still-missing metrics (test test_ceph_metrics_available): ceph_rgw_put, ceph_rgw_put_initial_lat_sum, ceph_rgw_put_initial_lat_count, ceph_rgw_keystone_token_cache_hit, ceph_rgw_metadata, ceph_rgw_qactive, ceph_rgw_get_initial_lat_sum, ceph_rgw_get_initial_lat_count, ceph_rgw_get_b, ceph_rgw_failed_req, ceph_rgw_keystone_token_cache_miss, ceph_rgw_get, ceph_rgw_cache_hit, ceph_rgw_put_b. The ceph_rgw metrics are not showing up because the RGW service is unavailable on the cluster. Now the test passes. I can move this to VERIFIED once it is ON_QA. Regards

Moving to ON_QA as discussed.

Can you please specify the version I need to verify? Regards

OC version:
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.13.0-0.nightly-2023-11-14-104446
Kubernetes Version: v1.26.9+636f2be

OCS version:
ocs-operator.v4.13.5-rhodf OpenShift Container Storage 4.13.5-rhodf ocs-operator.v4.13.4-rhodf Succeeded

Cluster version:
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.13.0-0.nightly-2023-11-14-104446 True False 46m Cluster version is 4.13.0-0.nightly-2023-11-14-104446

Rook version:
rook: v4.13.5-0.42f43768ad57d91be47327f83653c05eeb721977
go: go1.19.13

Ceph version:
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)

Missing 166 metrics: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/31288/console

Regarding the recommendation to unset mgr/prometheus/exclude_perf_counters: did we have this flag before, and does it mean that by default a user would not see these metrics? Is that acceptable / documented? Thanks @athar.lh (a sketch of toggling the flag follows below)

Moving the bug to 4.13.7 as we are doing a quick 4.13.6 to include a critical RGW fix (bug 2254303) before the shutdown.
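A minimal sketch of the workaround discussed above: toggling `mgr/prometheus/exclude_perf_counters` from the Rook toolbox and reading the value back. Only the two `ceph config` commands come from this bug; the `openshift-storage` namespace and the `app=rook-ceph-tools` pod label are assumptions about a typical ODF deployment, so adjust them for your cluster.

```python
"""Sketch: flip mgr/prometheus/exclude_perf_counters via the Rook
toolbox pod and confirm the new value.

Assumptions: the toolbox pod carries the label app=rook-ceph-tools and
lives in the openshift-storage namespace; `oc` is already logged in.
"""
import subprocess

NAMESPACE = "openshift-storage"  # assumed ODF namespace
FLAG = "mgr/prometheus/exclude_perf_counters"

def toolbox_ceph(*args: str) -> str:
    """Run a `ceph` command inside the rook-ceph toolbox pod."""
    pod = subprocess.check_output(
        ["oc", "-n", NAMESPACE, "get", "pod",
         "-l", "app=rook-ceph-tools",  # assumed toolbox label
         "-o", "jsonpath={.items[0].metadata.name}"],
        text=True,
    ).strip()
    return subprocess.check_output(
        ["oc", "-n", NAMESPACE, "rsh", pod, "ceph", *args],
        text=True,
    ).strip()

# Re-enable perf counter export from the mgr Prometheus module, then
# verify the setting (the bug reports metrics become visible afterwards).
toolbox_ceph("config", "set", "mgr", FLAG, "false")
print(toolbox_ceph("config", "get", "mgr", FLAG))  # expect "false"
```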
test_ceph_metrics_available fails; 166 metrics are missing, including:

ceph_bluestore_state_aio_wait_lat_sum, ceph_paxos_store_state_latency_sum, ceph_osd_op_out_bytes, ceph_bluestore_txc_submit_lat_sum, ceph_paxos_commit, ceph_paxos_new_pn_latency_count, ceph_osd_op_r_process_latency_count, ceph_bluestore_txc_submit_lat_count, ceph_bluestore_kv_final_lat_sum, ceph_paxos_collect_keys_sum, ceph_paxos_accept_timeout, ceph_paxos_begin_latency_count, ceph_bluefs_wal_total_bytes, ceph_paxos_refresh, ceph_bluestore_read_lat_count, ceph_mon_num_sessions, ceph_objecter_op_rmw, ceph_bluefs_bytes_written_wal, ceph_mon_num_elections, ceph_rocksdb_compact, ceph_bluestore_kv_sync_lat_sum, ceph_osd_op_process_latency_count, ceph_osd_op_w_prepare_latency_count, ceph_objecter_op_active, ceph_paxos_begin_latency_sum, ceph_osd_op_r, ceph_osd_op_rw_prepare_latency_sum, ceph_paxos_new_pn, ceph_rgw_qlen, ceph_rgw_req, ceph_rocksdb_get_latency_count, ceph_rgw_cache_miss, ceph_paxos_commit_latency_count, ceph_bluestore_txc_throttle_lat_count, ceph_paxos_lease_ack_timeout, ceph_bluestore_txc_commit_lat_sum, ceph_paxos_collect_bytes_sum, ceph_osd_op_rw_latency_count, ceph_paxos_collect_uncommitted, ceph_osd_op_rw_latency_sum, ceph_paxos_share_state, ceph_osd_op_r_prepare_latency_sum, ceph_bluestore_kv_flush_lat_sum, ceph_osd_op_rw_process_latency_sum, ceph_rocksdb_rocksdb_write_memtable_time_count, ceph_paxos_collect_latency_count, ceph_osd_op_rw_prepare_latency_count, ceph_paxos_collect_latency_sum, ceph_rocksdb_rocksdb_write_delay_time_count, ceph_objecter_op_rmw, ceph_paxos_begin_bytes_sum, ceph_osd_numpg, ceph_osd_stat_bytes, ceph_rocksdb_submit_sync_latency_sum

ODF 4.13.7-rhodf, vSphere UPI deployment, OCP 4.13.0-0.nightly-2024-01-17-100523

Elaborating on what "missing metrics" means: when we run an instant query, the Prometheus API returns HTTP 200 with `status: success`, but the result vector is empty. That means Prometheus holds no data for the metric at all. (The "304" in the log excerpt is the prometheus.py source line number of validate_status, not an HTTP status.) A sketch of this check follows; the test log excerpt comes after it.
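A minimal sketch of that emptiness check, mirroring what the ocs-ci log excerpt below shows. The route URL and bearer token here are hypothetical placeholders; the real values appear in the log.

```python
"""Sketch: flag a Ceph metric as "missing" when the Prometheus instant
query succeeds but returns an empty result vector.

PROM_ROUTE and TOKEN are placeholders, not values from this bug.
"""
import requests
import urllib3

urllib3.disable_warnings()  # the test queries the route with verify=False

PROM_ROUTE = "https://prometheus-k8s-openshift-monitoring.apps.example.com"
TOKEN = "sha256~REPLACE_ME"

def is_missing(metric: str) -> bool:
    """Return True when Prometheus holds no series for `metric`."""
    resp = requests.get(
        f"{PROM_ROUTE}/api/v1/query",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"query": metric},
        verify=False,
    )
    resp.raise_for_status()  # the failing queries still return HTTP 200
    body = resp.json()
    # A "missing" metric looks like:
    # {'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
    return body["status"] != "success" or not body["data"]["result"]

missing = [m for m in ("ceph_bluestore_state_aio_wait_lat_sum",
                       "ceph_rgw_put") if is_missing(m)]
print(f"{len(missing)} missing metrics: {missing}")
```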
Logs:

```
2024-01-17 18:08:34,071 - MainThread - INFO - /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/utility/prometheus.py.query.504 - Performing prometheus instant query 'ceph_bluestore_state_aio_wait_lat_sum'
2024-01-17 18:08:34,071 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/utility/prometheus.py.get.422 - GET https://prometheus-k8s-openshift-monitoring.apps.dosypenk-171.qe.rh-ocs.com/api/v1/query
2024-01-17 18:08:34,071 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/utility/prometheus.py.get.423 - headers={'Authorization': 'Bearer sha256~f9UfvhOsP02LNP5oLPx9uQhph3oSHYJpL6qaPBH7wlk'}
2024-01-17 18:08:34,071 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/utility/prometheus.py.get.424 - verify=False
2024-01-17 18:08:34,072 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/utility/prometheus.py.get.425 - params={'query': 'ceph_bluestore_state_aio_wait_lat_sum'}
2024-01-17 18:08:34,073 - MainThread - DEBUG - urllib3.connectionpool._new_conn.1019 - Starting new HTTPS connection (1): prometheus-k8s-openshift-monitoring.apps.dosypenk-171.qe.rh-ocs.com:443
2024-01-17 18:08:34,107 - MainThread - DEBUG - urllib3.connectionpool._make_request.474 - https://prometheus-k8s-openshift-monitoring.apps.dosypenk-171.qe.rh-ocs.com:443 "GET /api/v1/query?query=ceph_bluestore_state_aio_wait_lat_sum HTTP/1.1" 200 87
2024-01-17 18:08:34,108 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/utility/prometheus.py.validate_status.304 - content value: {'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
2024-01-17 18:08:34,109 - MainThread - ERROR - ocs_ci.ocs.metrics.get_missing_metrics.352 - failed to get results for ceph_bluestore_state_aio_wait_lat_sum
```

@athakkar confirming: with `ceph config get mgr mgr/prometheus/exclude_perf_counters` now returning false, the metrics become visible on ODF 4.13 / OCP 4.13.

This got assigned to me when we moved it to Rook. Assigning it back to Avan.

Verified on an IBM Cloud deployment:
* ODF 4.13.8-1
* OCP 4.13.0-0.nightly-2024-03-08-182318

test_ceph_rbd_metrics_available - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/34912/console

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.8 Bug Fix Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:1657