Bug 2221488 - ODF Monitoring is missing some of the metric values 4.14
Summary: ODF Monitoring is missing some of the metric values 4.14
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.14.0
Assignee: avan
QA Contact: Daniel Osypenko
URL:
Whiteboard:
Depends On:
Blocks: 2242324 2244409 2253428
Reported: 2023-07-09 12:02 UTC by Daniel Osypenko
Modified: 2024-02-01 11:17 UTC (History)
CC List: 9 users

Fixed In Version: 4.14.0-128
Doc Type: Bug Fix
Doc Text:
.ODF monitoring is no longer missing any metric values
Previously, the service monitor for ceph-exporter was missing a port, so the Ceph daemon performance metrics were missing. With this fix, the port has been added to the ceph-exporter service monitor, and the Ceph daemon performance metrics are visible in Prometheus.
Clone Of:
Clones: 2242324 2253428 2253429
Environment:
Last Closed: 2023-11-08 18:52:23 UTC
Embargoed:
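
The Doc Text above attributes the missing metrics to a ceph-exporter service monitor that had no port set. As a rough illustration only, a ServiceMonitor manifest loaded into a Python dict could be checked for endpoints that name neither `port` nor `targetPort` (both are endpoint fields of the Prometheus Operator ServiceMonitor API). The helper name `endpoints_without_port`, the resource name, and the port name in the sample are placeholders invented for this sketch; they are not the values used by Rook.

```python
def endpoints_without_port(service_monitor):
    """Return the indexes of ServiceMonitor endpoints that define no port.

    An endpoint counts as incomplete when it sets neither 'port'
    (a named Service port) nor 'targetPort'.
    """
    endpoints = service_monitor.get("spec", {}).get("endpoints", [])
    return [
        index
        for index, endpoint in enumerate(endpoints)
        if not endpoint.get("port") and not endpoint.get("targetPort")
    ]


# Hypothetical manifest: names are placeholders, not taken from Rook.
broken_monitor = {
    "apiVersion": "monitoring.coreos.com/v1",
    "kind": "ServiceMonitor",
    "metadata": {"name": "example-ceph-exporter"},
    "spec": {"endpoints": [{"interval": "5s", "path": "/metrics"}]},
}

fixed_monitor = {
    "apiVersion": "monitoring.coreos.com/v1",
    "kind": "ServiceMonitor",
    "metadata": {"name": "example-ceph-exporter"},
    "spec": {
        "endpoints": [
            {"interval": "5s", "path": "/metrics", "port": "example-metrics-port"}
        ]
    },
}

print(endpoints_without_port(broken_monitor))
print(endpoints_without_port(fixed_monitor))
```

A non-empty result points at endpoints that give Prometheus no port to scrape, which matches the symptom described in this bug.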


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | red-hat-storage rook pull 504 | 0 | None | open | Bug 2221488: monitoring: enable exporter for downstream 4.14 | 2023-08-08 17:56:38 UTC
Github | red-hat-storage rook pull 516 | 0 | None | open | Bug 2236444: monitoring: set port for servicemonitor for ceph-exporter | 2023-09-05 12:24:27 UTC
Red Hat Product Errata | RHSA-2023:6832 | 0 | None | None | None | 2023-11-08 18:54:09 UTC

Description Daniel Osypenko 2023-07-09 12:02:28 UTC
This bug was initially created as a copy of Bug #2203795

I am copying this bug because: 

Although the names of the missing metrics differ from those missing in 4.13, the problem description and the participants in the discussion should be the same.

Description of problem (please be as detailed as possible and provide log
snippets):
ODF Monitoring is missing some of the ceph_* metric values. No related epic describing a change or rename of these metrics was found.

List of missing metric values:
'ceph_bluestore_state_aio_wait_lat_sum',

'ceph_paxos_store_state_latency_sum',

'ceph_osd_op_out_bytes',

'ceph_bluestore_txc_submit_lat_sum',

'ceph_paxos_commit',

'ceph_paxos_new_pn_latency_count',

'ceph_osd_op_r_process_latency_count',

'ceph_bluestore_txc_submit_lat_count',

'ceph_bluestore_kv_final_lat_sum',

'ceph_paxos_collect_keys_sum',

'ceph_paxos_accept_timeout',

'ceph_paxos_begin_latency_count',

'ceph_bluefs_wal_total_bytes',

'ceph_paxos_refresh',

'ceph_bluestore_read_lat_count',

'ceph_mon_num_sessions',

'ceph_bluefs_bytes_written_wal',

'ceph_mon_num_elections',

'ceph_rocksdb_compact',

'ceph_bluestore_kv_sync_lat_sum',

'ceph_osd_op_process_latency_count',

'ceph_osd_op_w_prepare_latency_count',

'ceph_paxos_begin_latency_sum',

'ceph_osd_op_r',

'ceph_osd_op_rw_prepare_latency_sum',

'ceph_paxos_new_pn',

'ceph_rocksdb_get_latency_count',

'ceph_paxos_commit_latency_count',

'ceph_bluestore_txc_throttle_lat_count',

'ceph_paxos_lease_ack_timeout',

'ceph_bluestore_txc_commit_lat_sum',

'ceph_paxos_collect_bytes_sum',

'ceph_osd_op_rw_latency_count',

'ceph_paxos_collect_uncommitted',

'ceph_osd_op_rw_latency_sum',

'ceph_paxos_share_state',

'ceph_osd_op_r_prepare_latency_sum',

'ceph_bluestore_kv_flush_lat_sum',

'ceph_osd_op_rw_process_latency_sum',

'ceph_rocksdb_rocksdb_write_memtable_time_count',

'ceph_paxos_collect_latency_count',

'ceph_osd_op_rw_prepare_latency_count',

'ceph_paxos_collect_latency_sum',

'ceph_rocksdb_rocksdb_write_delay_time_count',

'ceph_paxos_begin_bytes_sum',

'ceph_osd_numpg',

'ceph_osd_stat_bytes',

'ceph_rocksdb_submit_sync_latency_sum',

'ceph_rocksdb_compact_queue_merge',

'ceph_paxos_collect_bytes_count',

'ceph_osd_op',

'ceph_paxos_commit_keys_sum',

'ceph_osd_op_rw_in_bytes',

'ceph_osd_op_rw_out_bytes',

'ceph_bluefs_bytes_written_sst',

'ceph_osd_op_rw_process_latency_count',

'ceph_rocksdb_compact_queue_len',

'ceph_bluestore_txc_throttle_lat_sum',

'ceph_bluefs_slow_used_bytes',

'ceph_osd_op_r_latency_sum',

'ceph_bluestore_kv_flush_lat_count',

'ceph_rocksdb_compact_range',

'ceph_osd_op_latency_sum',

'ceph_mon_session_add',

'ceph_paxos_share_state_keys_count',

'ceph_paxos_collect',

'ceph_osd_op_w_in_bytes',

'ceph_osd_op_r_process_latency_sum',

'ceph_paxos_start_peon',

'ceph_mon_session_trim',

'ceph_rocksdb_get_latency_sum',

'ceph_osd_op_rw',

'ceph_paxos_store_state_keys_count',

'ceph_rocksdb_rocksdb_write_delay_time_sum',

'ceph_osd_recovery_ops',

'ceph_bluefs_logged_bytes',

'ceph_bluefs_db_total_bytes',

'ceph_osd_op_w_latency_count',

'ceph_bluestore_txc_commit_lat_count',

'ceph_bluestore_state_aio_wait_lat_count',

'ceph_paxos_begin_bytes_count',

'ceph_paxos_start_leader',

'ceph_mon_election_call',

'ceph_rocksdb_rocksdb_write_pre_and_post_time_count',

'ceph_mon_session_rm',

'ceph_paxos_store_state',

'ceph_paxos_store_state_bytes_count',

'ceph_osd_op_w_latency_sum',

'ceph_rocksdb_submit_latency_count',

'ceph_paxos_commit_latency_sum',

'ceph_rocksdb_rocksdb_write_memtable_time_sum',

'ceph_paxos_share_state_bytes_sum',

'ceph_osd_op_process_latency_sum',

'ceph_paxos_begin_keys_sum',

'ceph_rocksdb_rocksdb_write_pre_and_post_time_sum',

'ceph_bluefs_wal_used_bytes',

'ceph_rocksdb_rocksdb_write_wal_time_sum',

'ceph_osd_op_wip',

'ceph_paxos_lease_timeout',

'ceph_osd_op_r_out_bytes',

'ceph_paxos_begin_keys_count',

'ceph_bluestore_kv_sync_lat_count',

'ceph_osd_op_prepare_latency_count',

'ceph_bluefs_bytes_written_slow',

'ceph_rocksdb_submit_latency_sum',

'ceph_osd_op_r_latency_count',

'ceph_paxos_share_state_keys_sum',

'ceph_paxos_store_state_bytes_sum',

'ceph_osd_op_latency_count',

'ceph_paxos_commit_bytes_count',

'ceph_paxos_restart',

'ceph_bluefs_slow_total_bytes',

'ceph_paxos_collect_timeout',

'ceph_osd_op_w_process_latency_sum',

'ceph_paxos_collect_keys_count',

'ceph_paxos_share_state_bytes_count',

'ceph_osd_op_w_prepare_latency_sum',

'ceph_bluestore_read_lat_sum',

'ceph_osd_stat_bytes_used',

'ceph_paxos_begin',

'ceph_mon_election_win',

'ceph_osd_op_w_process_latency_count',

'ceph_rocksdb_rocksdb_write_wal_time_count',

'ceph_paxos_store_state_keys_sum',

'ceph_osd_numpg_removing',

'ceph_paxos_commit_keys_count',

'ceph_paxos_new_pn_latency_sum',

'ceph_osd_op_in_bytes',

'ceph_paxos_store_state_latency_count',

'ceph_paxos_refresh_latency_count',

'ceph_osd_op_r_prepare_latency_count',

'ceph_bluefs_num_files',

'ceph_mon_election_lose',

'ceph_osd_op_prepare_latency_sum',

'ceph_bluefs_db_used_bytes',

'ceph_bluestore_kv_final_lat_count',

'ceph_paxos_refresh_latency_sum',

'ceph_osd_recovery_bytes',

'ceph_osd_op_w',

'ceph_paxos_commit_bytes_sum',

'ceph_bluefs_log_bytes',

'ceph_rocksdb_submit_sync_latency_count',
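
To see which Ceph subsystems the list above touches, the names can be grouped by the token that follows the `ceph_` prefix. The sketch below is only an illustration: `group_by_subsystem` is a hypothetical helper, not part of ocs-ci, and the sample holds a handful of names copied from the list.

```python
from collections import defaultdict


def group_by_subsystem(metric_names):
    """Group ceph_* metric names by the token after the 'ceph_' prefix."""
    prefix = "ceph_"
    groups = defaultdict(list)
    for name in metric_names:
        remainder = name[len(prefix):] if name.startswith(prefix) else ""
        if "_" in remainder:
            # 'ceph_osd_op_r' -> subsystem 'osd'
            subsystem = remainder.split("_", 1)[0]
        else:
            # Names that do not follow the ceph_<subsystem>_<rest> shape.
            subsystem = "other"
        groups[subsystem].append(name)
    return {key: sorted(names) for key, names in sorted(groups.items())}


sample = [
    "ceph_osd_op_r",
    "ceph_paxos_commit",
    "ceph_bluefs_log_bytes",
    "ceph_osd_numpg",
    "ceph_mon_num_sessions",
]

for subsystem, names in group_by_subsystem(sample).items():
    print(subsystem, len(names))
```

Applied to the full list, this kind of grouping shows that the gaps are spread over OSD, monitor, Paxos, BlueStore, BlueFS and RocksDB counters, that is, per-daemon performance counters rather than a single metric family.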

Ceph metrics that should be present on a healthy cluster:
https://github.com/red-hat-storage/ocs-ci/blob/81ca20aed067a30dd109e0f29e026f2a18c752ee/ocs_ci/ocs/metrics.py#L70

Polarion documentation: https://polarion.engineering.redhat.com/polarion/#/project/OpenShiftContainerStorage/workitem?id=OCS-958

Version of all relevant components (if applicable):

OC version:
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.14.0-0.nightly-2023-06-30-131338
Kubernetes Version: v1.27.3+ab0b8ee

OCS version:
ocs-operator.v4.14.0-36.stable              OpenShift Container Storage   4.14.0-36.stable              Succeeded

Cluster version
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-06-30-131338   True        False         4d1h    Cluster version is 4.14.0-0.nightly-2023-06-30-131338

Rook version:
rook: v4.14.0-0.d8ce011027a26218154bcedf63a54e97f020df40
go: go1.20.4

Ceph version:
ceph version 17.2.6-70.el9cp (fe62dcdbb2c6e05782a3e2b67d025b84ff5047cc) quincy (stable)


Is this issue reproducible?
Yes, it is reproducible both in CI runs and in local runs.


Steps to Reproduce:
1. Install OCP/ODF cluster
2. After installation, check whether Prometheus provides values for the
   metrics listed above.


Actual results:
OCP Prometheus provides no values for any of the metrics listed above.

Expected results:
OCP Prometheus provides values for all metrics listed above.
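
The check in step 2 could be scripted roughly as follows, assuming the standard Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The function names, the base URL, and the bearer-token handling are assumptions made for this sketch and are not taken from the ocs-ci test; on OCP the route also needs valid TLS trust, which is left out here.

```python
import json
import urllib.parse
import urllib.request


def query_prometheus(base_url, metric, token=None):
    """Run an instant query and return the list of result samples.

    base_url and token are placeholders supplied by the caller.
    """
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": metric})
    request = urllib.request.Request(url)
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["data"]["result"]


def find_missing_metrics(metric_names, query_fn):
    """Return the metrics for which query_fn yields no samples."""
    return [name for name in metric_names if not query_fn(name)]


# Offline demonstration with a stubbed query function instead of a live cluster.
stub_results = {
    "ceph_osd_in": [{"metric": {}, "value": [0, "1"]}],
    "ceph_osd_op_r": [],
}
missing = find_missing_metrics(
    ["ceph_osd_in", "ceph_osd_op_r"],
    lambda name: stub_results.get(name, []),
)
print(missing)
```

Against a real cluster, `query_fn` would be something like `lambda name: query_prometheus(base_url, name, token)`; an empty result for a metric corresponds to the "no values" outcome reported under Actual results.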

Logs of the test run:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-034aikt1c33-t1/j-034aikt1c33-t1_20230704T064403/logs/ocs-ci-logs-1688456400/by_outcome/failed/tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py/test_ceph_metrics_available/logs

Must-gather logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-034aikt1c33-t1/j-034aikt1c33-t1_20230704T064403/logs/testcases_1688456400/

Comment 3 Travis Nielsen 2023-07-10 16:47:34 UTC
Avan PTAL

Comment 4 avan 2023-07-25 17:26:37 UTC
@Daniel,
Is this still reproducible?

Comment 16 Boris Ranto 2023-08-09 11:53:41 UTC
Done, I updated the defaults to use the first RHCS 6.1z2 build (6-200). It should be in our builds starting tomorrow.

Comment 18 Daniel Osypenko 2023-08-31 11:29:03 UTC
Fixed. The same automation test that previously failed (test_monitoring_reporting_ok_when_idle) now passes:

13:42:54 - MainThread - /Users/danielosypenko/Work/automation_4/ocs-ci/ocs_ci/utility/prometheus.py - INFO  - No bad values detected
13:42:54 - MainThread - /Users/danielosypenko/Work/automation_4/ocs-ci/ocs_ci/utility/prometheus.py - INFO  - No invalid values detected
13:42:54 - MainThread - test_monitoring_defaults - INFO  - ceph_osd_in metric does indicate no problems with OSDs
PASSED

Comment 19 Daniel Osypenko 2023-09-04 10:13:06 UTC
The BZ was moved to Verified by mistake.

List of missing metrics on OCP 4.14.0-0.nightly-2023-09-02-132842, ODF 4.14.0-125.stable:
['ceph_bluestore_state_aio_wait_lat_sum', 'ceph_paxos_store_state_latency_sum', 'ceph_osd_op_out_bytes', 'ceph_bluestore_txc_submit_lat_sum', 'ceph_paxos_commit', 'ceph_paxos_new_pn_latency_count', 'ceph_osd_op_r_process_latency_count', 'ceph_bluestore_txc_submit_lat_count', 'ceph_bluestore_kv_final_lat_sum', 'ceph_paxos_collect_keys_sum', 'ceph_paxos_accept_timeout', 'ceph_paxos_begin_latency_count', 'ceph_bluefs_wal_total_bytes', 'ceph_paxos_refresh', 'ceph_bluestore_read_lat_count', 'ceph_mon_num_sessions', 'ceph_objecter_op_rmw', 'ceph_bluefs_bytes_written_wal', 'ceph_mon_num_elections', 'ceph_rocksdb_compact', 'ceph_bluestore_kv_sync_lat_sum', 'ceph_osd_op_process_latency_count', 'ceph_osd_op_w_prepare_latency_count', 'ceph_objecter_op_active', 'ceph_paxos_begin_latency_sum', 'ceph_osd_op_r', 'ceph_osd_op_rw_prepare_latency_sum', 'ceph_paxos_new_pn', 'ceph_rgw_qlen', 'ceph_rgw_req', 'ceph_rocksdb_get_latency_count', 'ceph_rgw_cache_miss', 'ceph_paxos_commit_latency_count', 'ceph_bluestore_txc_throttle_lat_count', 'ceph_paxos_lease_ack_timeout', 'ceph_bluestore_txc_commit_lat_sum', 'ceph_paxos_collect_bytes_sum', 'ceph_osd_op_rw_latency_count', 'ceph_paxos_collect_uncommitted', 'ceph_osd_op_rw_latency_sum', 'ceph_paxos_share_state', 'ceph_osd_op_r_prepare_latency_sum', 'ceph_bluestore_kv_flush_lat_sum', 'ceph_osd_op_rw_process_latency_sum', 'ceph_rocksdb_rocksdb_write_memtable_time_count', 'ceph_paxos_collect_latency_count', 'ceph_osd_op_rw_prepare_latency_count', 'ceph_paxos_collect_latency_sum', 'ceph_rocksdb_rocksdb_write_delay_time_count', 'ceph_objecter_op_rmw', 'ceph_paxos_begin_bytes_sum', 'ceph_osd_numpg', 'ceph_osd_stat_bytes', 'ceph_rocksdb_submit_sync_latency_sum', 'ceph_rocksdb_compact_queue_merge', 'ceph_paxos_collect_bytes_count', 'ceph_osd_op', 'ceph_paxos_commit_keys_sum', 'ceph_osd_op_rw_in_bytes', 'ceph_osd_op_rw_out_bytes', 'ceph_bluefs_bytes_written_sst', 'ceph_rgw_put', 'ceph_osd_op_rw_process_latency_count', 
'ceph_rocksdb_compact_queue_len', 'ceph_bluestore_txc_throttle_lat_sum', 'ceph_bluefs_slow_used_bytes', 'ceph_osd_op_r_latency_sum', 'ceph_bluestore_kv_flush_lat_count', 'ceph_rocksdb_compact_range', 'ceph_osd_op_latency_sum', 'ceph_mon_session_add', 'ceph_paxos_share_state_keys_count', 'ceph_paxos_collect', 'ceph_osd_op_w_in_bytes', 'ceph_osd_op_r_process_latency_sum', 'ceph_paxos_start_peon', 'ceph_mon_session_trim', 'ceph_rocksdb_get_latency_sum', 'ceph_osd_op_rw', 'ceph_paxos_store_state_keys_count', 'ceph_rocksdb_rocksdb_write_delay_time_sum', 'ceph_objecter_op_r', 'ceph_objecter_op_active', 'ceph_objecter_op_w', 'ceph_osd_recovery_ops', 'ceph_bluefs_logged_bytes', 'ceph_bluefs_db_total_bytes', 'ceph_rgw_put_initial_lat_sum', 'ceph_osd_op_w_latency_count', 'ceph_rgw_put_initial_lat_count', 'ceph_bluestore_txc_commit_lat_count', 'ceph_bluestore_state_aio_wait_lat_count', 'ceph_paxos_begin_bytes_count', 'ceph_paxos_start_leader', 'ceph_mon_election_call', 'ceph_rocksdb_rocksdb_write_pre_and_post_time_count', 'ceph_mon_session_rm', 'ceph_paxos_store_state', 'ceph_paxos_store_state_bytes_count', 'ceph_osd_op_w_latency_sum', 'ceph_rgw_keystone_token_cache_hit', 'ceph_rocksdb_submit_latency_count', 'ceph_paxos_commit_latency_sum', 'ceph_rocksdb_rocksdb_write_memtable_time_sum', 'ceph_paxos_share_state_bytes_sum', 'ceph_osd_op_process_latency_sum', 'ceph_paxos_begin_keys_sum', 'ceph_rgw_qactive', 'ceph_rocksdb_rocksdb_write_pre_and_post_time_sum', 'ceph_bluefs_wal_used_bytes', 'ceph_rocksdb_rocksdb_write_wal_time_sum', 'ceph_osd_op_wip', 'ceph_rgw_get_initial_lat_sum', 'ceph_paxos_lease_timeout', 'ceph_osd_op_r_out_bytes', 'ceph_paxos_begin_keys_count', 'ceph_bluestore_kv_sync_lat_count', 'ceph_osd_op_prepare_latency_count', 'ceph_bluefs_bytes_written_slow', 'ceph_rocksdb_submit_latency_sum', 'ceph_osd_op_r_latency_count', 'ceph_paxos_share_state_keys_sum', 'ceph_paxos_store_state_bytes_sum', 'ceph_osd_op_latency_count', 'ceph_paxos_commit_bytes_count', 
'ceph_paxos_restart', 'ceph_rgw_get_initial_lat_count', 'ceph_bluefs_slow_total_bytes', 'ceph_paxos_collect_timeout', 'ceph_osd_op_w_process_latency_sum', 'ceph_paxos_collect_keys_count', 'ceph_paxos_share_state_bytes_count', 'ceph_osd_op_w_prepare_latency_sum', 'ceph_bluestore_read_lat_sum', 'ceph_osd_stat_bytes_used', 'ceph_paxos_begin', 'ceph_mon_election_win', 'ceph_osd_op_w_process_latency_count', 'ceph_rgw_get_b', 'ceph_rgw_failed_req', 'ceph_rocksdb_rocksdb_write_wal_time_count', 'ceph_rgw_keystone_token_cache_miss', 'ceph_paxos_store_state_keys_sum', 'ceph_osd_numpg_removing', 'ceph_paxos_commit_keys_count', 'ceph_paxos_new_pn_latency_sum', 'ceph_osd_op_in_bytes', 'ceph_paxos_store_state_latency_count', 'ceph_paxos_refresh_latency_count', 'ceph_rgw_get', 'ceph_osd_op_r_prepare_latency_count', 'ceph_rgw_cache_hit', 'ceph_objecter_op_w', 'ceph_objecter_op_r', 'ceph_bluefs_num_files', 'ceph_rgw_put_b', 'ceph_mon_election_lose', 'ceph_osd_op_prepare_latency_sum', 'ceph_bluefs_db_used_bytes', 'ceph_bluestore_kv_final_lat_count', 'ceph_paxos_refresh_latency_sum', 'ceph_osd_recovery_bytes', 'ceph_osd_op_w', 'ceph_paxos_commit_bytes_sum', 'ceph_bluefs_log_bytes', 'ceph_rocksdb_submit_sync_latency_count']
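
The list in this comment repeats some names (for example ceph_objecter_op_rmw) and contains ceph_rgw_* and ceph_objecter_* entries that are absent from the list in the description. A small sketch of how two such runs could be compared is shown below; `compare_runs` and `find_duplicates` are hypothetical helpers written for this illustration, and the samples are short excerpts of the two lists rather than the full data.

```python
def find_duplicates(names):
    """Return the names that occur more than once, sorted."""
    seen, repeated = set(), set()
    for name in names:
        if name in seen:
            repeated.add(name)
        seen.add(name)
    return sorted(repeated)


def compare_runs(earlier, later):
    """Classify metrics by how their 'missing' status changed between runs."""
    earlier_set, later_set = set(earlier), set(later)
    return {
        "still_missing": sorted(earlier_set & later_set),
        "newly_missing": sorted(later_set - earlier_set),
        "recovered": sorted(earlier_set - later_set),
    }


# Excerpts only: a few names from the description and from this comment.
description_excerpt = ["ceph_osd_op_r", "ceph_paxos_commit"]
comment_excerpt = [
    "ceph_osd_op_r",
    "ceph_paxos_commit",
    "ceph_objecter_op_rmw",
    "ceph_rgw_qlen",
    "ceph_objecter_op_rmw",
]

print(find_duplicates(comment_excerpt))
print(compare_runs(description_excerpt, comment_excerpt))
```

Working on sets keeps the repeated entries from inflating the counts when the two runs are compared.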

Comment 22 Daniel Osypenko 2023-09-07 10:12:21 UTC
Verified, PASSED: 
test_ceph_metrics_available http://pastebin.test.redhat.com/1108991
test_ceph_rbd_metrics_available http://pastebin.test.redhat.com/1108993

Comment 25 errata-xmlrpc 2023-11-08 18:52:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832

