Bug 2203795 - ODF Monitoring is missing some of the ceph_* metric values
Summary: ODF Monitoring is missing some of the ceph_* metric values
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.13.0
Assignee: avan
QA Contact: Filip Balák
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-05-15 09:32 UTC by Vishakha Kathole
Modified: 2023-08-09 17:03 UTC
CC: 14 users

Fixed In Version: 4.13.0-214
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-21 15:25:37 UTC
Embargoed:


Links
Github red-hat-storage/rook pull 495 (open): Bug 2203795: core: empty ceph-daemons-sock-dir for osd onPVC (last updated 2023-06-01 17:18:57 UTC)
Github rook/rook pull 12299 (open): core: use ROOK_CEPH_MON_HOST for osd from config store (last updated 2023-05-31 07:12:24 UTC)
Red Hat Product Errata RHBA-2023:3742 (last updated 2023-06-21 15:25:48 UTC)

Internal Links: 2210027 2227770

Description Vishakha Kathole 2023-05-15 09:32:17 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
ODF Monitoring is missing some of the ceph_* metric values

List of missing metric values:
'ceph_bluefs_bytes_written_slow', 'ceph_bluefs_bytes_written_sst', 'ceph_bluefs_bytes_written_wal', 'ceph_bluefs_db_total_bytes', 'ceph_bluefs_db_used_bytes', 'ceph_bluefs_log_bytes', 'ceph_bluefs_logged_bytes', 'ceph_bluefs_num_files', 'ceph_bluefs_slow_total_bytes', 'ceph_bluefs_slow_used_bytes', 'ceph_bluefs_wal_total_bytes', 'ceph_bluefs_wal_used_bytes', 'ceph_bluestore_commit_lat_count',
'ceph_bluestore_commit_lat_sum', 'ceph_bluestore_kv_final_lat_count', 'ceph_bluestore_kv_final_lat_sum', 'ceph_bluestore_kv_flush_lat_count', 'ceph_bluestore_kv_flush_lat_sum', 'ceph_bluestore_kv_sync_lat_count', 'ceph_bluestore_kv_sync_lat_sum', 'ceph_bluestore_read_lat_count', 'ceph_bluestore_read_lat_sum', 'ceph_bluestore_state_aio_wait_lat_count', 'ceph_bluestore_state_aio_wait_lat_sum', 'ceph_bluestore_submit_lat_count', 'ceph_bluestore_submit_lat_sum', 'ceph_bluestore_throttle_lat_count', 'ceph_bluestore_throttle_lat_sum', 
'ceph_mon_election_call', 'ceph_mon_election_lose', 'ceph_mon_election_win', 'ceph_mon_num_elections', 'ceph_mon_num_sessions', 'ceph_mon_session_add', 'ceph_mon_session_rm', 'ceph_mon_session_trim', 
'ceph_objecter_op_active', 'ceph_objecter_op_r', 'ceph_objecter_op_rmw', 'ceph_objecter_op_w', 
'ceph_osd_numpg', 'ceph_osd_numpg_removing', 'ceph_osd_op', 'ceph_osd_op_in_bytes', 'ceph_osd_op_latency_count', 'ceph_osd_op_latency_sum', 'ceph_osd_op_out_bytes', 'ceph_osd_op_prepare_latency_count', 'ceph_osd_op_prepare_latency_sum', 'ceph_osd_op_process_latency_count', 'ceph_osd_op_process_latency_sum', 'ceph_osd_op_r', 'ceph_osd_op_r_latency_count', 'ceph_osd_op_r_latency_sum', 'ceph_osd_op_r_out_bytes', 'ceph_osd_op_r_prepare_latency_count', 'ceph_osd_op_r_prepare_latency_sum', 'ceph_osd_op_r_process_latency_count', 'ceph_osd_op_r_process_latency_sum', 'ceph_osd_op_rw', 'ceph_osd_op_rw_in_bytes', 'ceph_osd_op_rw_latency_count', 'ceph_osd_op_rw_latency_sum', 'ceph_osd_op_rw_out_bytes', 'ceph_osd_op_rw_prepare_latency_count', 'ceph_osd_op_rw_prepare_latency_sum', 'ceph_osd_op_rw_process_latency_count', 'ceph_osd_op_rw_process_latency_sum', 'ceph_osd_op_w', 'ceph_osd_op_w_in_bytes', 'ceph_osd_op_w_latency_count', 'ceph_osd_op_w_latency_sum', 'ceph_osd_op_w_prepare_latency_count', 'ceph_osd_op_w_prepare_latency_sum', 'ceph_osd_op_w_process_latency_count', 'ceph_osd_op_w_process_latency_sum', 'ceph_osd_op_wip', 'ceph_osd_recovery_bytes', 'ceph_osd_recovery_ops', 'ceph_osd_stat_bytes', 'ceph_osd_stat_bytes_used', 
'ceph_paxos_accept_timeout', 'ceph_paxos_begin', 'ceph_paxos_begin_bytes_count', 'ceph_paxos_begin_bytes_sum', 'ceph_paxos_begin_keys_count', 'ceph_paxos_begin_keys_sum', 'ceph_paxos_begin_latency_count', 'ceph_paxos_begin_latency_sum', 'ceph_paxos_collect', 'ceph_paxos_collect_bytes_count', 'ceph_paxos_collect_bytes_sum', 'ceph_paxos_collect_keys_count', 'ceph_paxos_collect_keys_sum', 'ceph_paxos_collect_latency_count', 'ceph_paxos_collect_latency_sum', 'ceph_paxos_collect_timeout', 'ceph_paxos_collect_uncommitted', 'ceph_paxos_commit', 'ceph_paxos_commit_bytes_count', 'ceph_paxos_commit_bytes_sum', 'ceph_paxos_commit_keys_count', 'ceph_paxos_commit_keys_sum', 'ceph_paxos_commit_latency_count', 'ceph_paxos_commit_latency_sum', 'ceph_paxos_lease_ack_timeout', 'ceph_paxos_lease_timeout', 'ceph_paxos_new_pn', 'ceph_paxos_new_pn_latency_count', 'ceph_paxos_new_pn_latency_sum', 'ceph_paxos_refresh', 'ceph_paxos_refresh_latency_count', 'ceph_paxos_refresh_latency_sum', 'ceph_paxos_restart', 'ceph_paxos_share_state', 'ceph_paxos_share_state_bytes_count', 'ceph_paxos_share_state_bytes_sum', 'ceph_paxos_share_state_keys_count', 'ceph_paxos_share_state_keys_sum', 'ceph_paxos_start_leader', 'ceph_paxos_start_peon', 'ceph_paxos_store_state', 'ceph_paxos_store_state_bytes_count', 'ceph_paxos_store_state_bytes_sum', 'ceph_paxos_store_state_keys_count', 'ceph_paxos_store_state_keys_sum', 'ceph_paxos_store_state_latency_count', 'ceph_paxos_store_state_latency_sum', 
'ceph_rgw_cache_hit', 'ceph_rgw_cache_miss', 'ceph_rgw_failed_req', 'ceph_rgw_get', 'ceph_rgw_get_b', 'ceph_rgw_get_initial_lat_count', 'ceph_rgw_get_initial_lat_sum', 'ceph_rgw_keystone_token_cache_hit', 'ceph_rgw_keystone_token_cache_miss', 'ceph_rgw_put', 'ceph_rgw_put_b', 'ceph_rgw_put_initial_lat_count', 'ceph_rgw_put_initial_lat_sum', 'ceph_rgw_qactive', 'ceph_rgw_qlen', 'ceph_rgw_req', 
'ceph_rocksdb_compact', 'ceph_rocksdb_compact_queue_len', 'ceph_rocksdb_compact_queue_merge', 'ceph_rocksdb_compact_range', 'ceph_rocksdb_get', 'ceph_rocksdb_get_latency_count', 'ceph_rocksdb_get_latency_sum', 'ceph_rocksdb_rocksdb_write_delay_time_count', 'ceph_rocksdb_rocksdb_write_delay_time_sum', 'ceph_rocksdb_rocksdb_write_memtable_time_count', 'ceph_rocksdb_rocksdb_write_memtable_time_sum', 'ceph_rocksdb_rocksdb_write_pre_and_post_time_count', 'ceph_rocksdb_rocksdb_write_pre_and_post_time_sum', 'ceph_rocksdb_rocksdb_write_wal_time_count', 'ceph_rocksdb_rocksdb_write_wal_time_sum', 'ceph_rocksdb_submit_latency_count', 'ceph_rocksdb_submit_latency_sum', 'ceph_rocksdb_submit_sync_latency_count', 'ceph_rocksdb_submit_sync_latency_sum'

Ceph metrics which should be present on a healthy cluster:
https://github.com/red-hat-storage/ocs-ci/blob/81ca20aed067a30dd109e0f29e026f2a18c752ee/ocs_ci/ocs/metrics.py#L70

Version of all relevant components (if applicable):
ODF-4.13.0-186.stable
OCP-4.13.0-0.nightly-2023-05-10-062807

Is this issue reproducible?
Yes (seen in about 8 CI runs).


Steps to Reproduce:
1. Install OCP/ODF cluster
2. After installation, check whether Prometheus provides values for the
   metrics listed above (see the sketch below).
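
A minimal sketch of step 2, assuming a reachable Prometheus route and a bearer token; the URL and token below are placeholders, not values from this cluster:

# Minimal sketch: ask OCP Prometheus whether a ceph_* metric has any values.
# PROM_URL and TOKEN are placeholders (e.g. the route from `oc get route -n
# openshift-monitoring` and a token from `oc whoami -t`), not from this bug.
import requests

PROM_URL = "https://prometheus-k8s-openshift-monitoring.apps.example.com"
TOKEN = "<bearer-token>"

def metric_has_values(metric: str) -> bool:
    """Return True if Prometheus has at least one sample for the metric."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": metric},
        headers={"Authorization": f"Bearer {TOKEN}"},
        verify=False,  # test clusters often use self-signed certificates
    )
    resp.raise_for_status()
    return len(resp.json()["data"]["result"]) > 0

for m in ("ceph_osd_op", "ceph_mon_num_sessions", "ceph_bluefs_num_files"):
    print(m, "present" if metric_has_values(m) else "MISSING")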


Actual results:
OCP Prometheus provides no values for any of the metrics listed above.

Expected results:
OCP Prometheus provides values for all metrics listed above.

Comment 3 Mudit Agarwal 2023-05-15 17:25:05 UTC
Not a 4.13 blocker

Comment 6 Mudit Agarwal 2023-05-23 16:07:57 UTC
Nishanth, please assign it to someone.

Comment 18 Parth Arora 2023-05-30 13:02:36 UTC
Do we have the ODF cluster or must-gather logs for this?

Comment 28 Filip Balák 2023-06-07 11:18:12 UTC
After the fix, the following metrics are still missing: 'ceph_bluestore_submit_lat_sum', 'ceph_bluestore_submit_lat_count', 'ceph_bluestore_throttle_lat_count', 'ceph_bluestore_commit_lat_sum', 'ceph_bluestore_throttle_lat_sum', 'ceph_rocksdb_get', 'ceph_bluestore_commit_lat_count'.
Is this expected?

Tested with ODF 4.13.0-214.

Comment 30 avan 2023-06-08 07:25:31 UTC
Hi @Filip,
I investigated the missing metrics you reported; there is a discrepancy in the metric names on the ocs-ci end. The metrics exported by Ceph are actually named, for example, ceph_bluestore_txc_submit_lat_count, and similarly for the other metrics. https://github.com/ceph/ceph/blame/v17.2.6/src/os/bluestore/BlueStore.cc#L5076

This was updated ~2 years ago in Ceph, while metrics.py in ocs-ci was last updated ~3 years ago, so ocs-ci must adopt the metric names now coming from Ceph: https://github.com/red-hat-storage/ocs-ci/blob/master/ocs_ci/ocs/metrics.py#L116. The same goes for `ceph_rocksdb_get`: I don't see any metric exported by Ceph under that name, so it should be removed from the file.
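
One way to separate renamed or removed metrics from genuinely missing values is to diff the expected names against every metric name Prometheus knows; a sketch, with the same placeholder URL and token as the reproduction example:

# Sketch: report expected ceph_* names that Prometheus does not export at all.
# Uses the standard /api/v1/label/__name__/values endpoint; PROM_URL and
# TOKEN are placeholders, as in the reproduction sketch above.
import requests

PROM_URL = "https://prometheus-k8s-openshift-monitoring.apps.example.com"
TOKEN = "<bearer-token>"

resp = requests.get(
    f"{PROM_URL}/api/v1/label/__name__/values",
    headers={"Authorization": f"Bearer {TOKEN}"},
    verify=False,
)
resp.raise_for_status()
exported = set(resp.json()["data"])

# Subset for illustration; in practice, load the full list from ocs-ci metrics.py.
expected = {"ceph_bluestore_submit_lat_count", "ceph_rocksdb_get"}
for name in sorted(expected - exported):
    print(f"{name}: not exported under this name (renamed or removed upstream?)")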
Hope this helps.

Thanks

Comment 34 Sravika 2023-06-12 08:53:53 UTC
Also observing this bug as part of the following ocs-ci test execution on IBM Z:

tests/manage/monitoring/prometheusmetrics/test_monitoring_negative.py::test_ceph_metrics_presence_when_osd_down

Comment 35 Filip Balák 2023-06-13 11:22:43 UTC
This is verified based on the discussion in thread https://chat.google.com/room/AAAAREGEba8/KoCb6Izr65o. A note will be needed in the release notes.

New metric names:
ceph_bluestore_submit_lat_sum -> ceph_bluestore_txc_submit_lat_sum
ceph_bluestore_submit_lat_count -> ceph_bluestore_txc_submit_lat_count
ceph_bluestore_throttle_lat_count -> ceph_bluestore_txc_throttle_lat_count
ceph_bluestore_commit_lat_sum -> ceph_bluestore_txc_commit_lat_sum
ceph_bluestore_throttle_lat_sum -> ceph_bluestore_txc_throttle_lat_sum
ceph_bluestore_commit_lat_count -> ceph_bluestore_txc_commit_lat_count

The metric ceph_rocksdb_get was removed because it was redundant; its data can be derived from the metrics ceph_rocksdb_get_latency_sum and ceph_rocksdb_get_latency_count.
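
For any consumer (tests, dashboards, recording rules) still querying the old names, a small rename map is enough to migrate; an illustrative sketch, with the mapping taken from the list above:

# Sketch: old -> new BlueStore metric names from this bug, applied to PromQL text.
BLUESTORE_RENAMES = {
    "ceph_bluestore_submit_lat_sum": "ceph_bluestore_txc_submit_lat_sum",
    "ceph_bluestore_submit_lat_count": "ceph_bluestore_txc_submit_lat_count",
    "ceph_bluestore_throttle_lat_count": "ceph_bluestore_txc_throttle_lat_count",
    "ceph_bluestore_commit_lat_sum": "ceph_bluestore_txc_commit_lat_sum",
    "ceph_bluestore_throttle_lat_sum": "ceph_bluestore_txc_throttle_lat_sum",
    "ceph_bluestore_commit_lat_count": "ceph_bluestore_txc_commit_lat_count",
}

def migrate_query(promql: str) -> str:
    """Rewrite a PromQL string to the renamed metrics (plain substitution;
    none of the old names here is a substring of another, so order is safe)."""
    for old, new in BLUESTORE_RENAMES.items():
        promql = promql.replace(old, new)
    return promql

print(migrate_query(
    "rate(ceph_bluestore_commit_lat_sum[5m]) / rate(ceph_bluestore_commit_lat_count[5m])"
))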

--> VERIFIED

Comment 37 errata-xmlrpc 2023-06-21 15:25:37 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742

