Bug 2203795

Summary: ODF Monitoring is missing some of the ceph_* metric values
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Vishakha Kathole <vkathole>
Component: rook
Assignee: avan <athakkar>
Status: CLOSED ERRATA
QA Contact: Filip Balák <fbalak>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.13
CC: akandath, athakkar, ebenahar, fbalak, hnallurv, jolmomar, muagarwa, nthomas, ocs-bugs, odf-bz-bot, paarora, sbalusu, tdesala, tnielsen
Target Milestone: ---
Keywords: Automation, Regression
Target Release: ODF 4.13.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.13.0-214
Doc Type: No Doc Update
Last Closed: 2023-06-21 15:25:37 UTC
Type: Bug

Description Vishakha Kathole 2023-05-15 09:32:17 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
ODF Monitoring is missing some of the ceph_* metric values

List of missing metric values:
'ceph_bluefs_bytes_written_slow', 'ceph_bluefs_bytes_written_sst', 'ceph_bluefs_bytes_written_wal', 'ceph_bluefs_db_total_bytes', 'ceph_bluefs_db_used_bytes', 'ceph_bluefs_log_bytes', 'ceph_bluefs_logged_bytes', 'ceph_bluefs_num_files', 'ceph_bluefs_slow_total_bytes', 'ceph_bluefs_slow_used_bytes', 'ceph_bluefs_wal_total_bytes', 'ceph_bluefs_wal_used_bytes', 'ceph_bluestore_commit_lat_count',
'ceph_bluestore_commit_lat_sum', 'ceph_bluestore_kv_final_lat_count', 'ceph_bluestore_kv_final_lat_sum', 'ceph_bluestore_kv_flush_lat_count', 'ceph_bluestore_kv_flush_lat_sum', 'ceph_bluestore_kv_sync_lat_count', 'ceph_bluestore_kv_sync_lat_sum', 'ceph_bluestore_read_lat_count', 'ceph_bluestore_read_lat_sum', 'ceph_bluestore_state_aio_wait_lat_count', 'ceph_bluestore_state_aio_wait_lat_sum', 'ceph_bluestore_submit_lat_count', 'ceph_bluestore_submit_lat_sum', 'ceph_bluestore_throttle_lat_count', 'ceph_bluestore_throttle_lat_sum', 
'ceph_mon_election_call', 'ceph_mon_election_lose', 'ceph_mon_election_win', 'ceph_mon_num_elections', 'ceph_mon_num_sessions', 'ceph_mon_session_add', 'ceph_mon_session_rm', 'ceph_mon_session_trim', 
'ceph_objecter_op_active', 'ceph_objecter_op_r', 'ceph_objecter_op_rmw', 'ceph_objecter_op_w', 
'ceph_osd_numpg', 'ceph_osd_numpg_removing', 'ceph_osd_op', 'ceph_osd_op_in_bytes', 'ceph_osd_op_latency_count', 'ceph_osd_op_latency_sum', 'ceph_osd_op_out_bytes', 'ceph_osd_op_prepare_latency_count', 'ceph_osd_op_prepare_latency_sum', 'ceph_osd_op_process_latency_count', 'ceph_osd_op_process_latency_sum', 'ceph_osd_op_r', 'ceph_osd_op_r_latency_count', 'ceph_osd_op_r_latency_sum', 'ceph_osd_op_r_out_bytes', 'ceph_osd_op_r_prepare_latency_count', 'ceph_osd_op_r_prepare_latency_sum', 'ceph_osd_op_r_process_latency_count', 'ceph_osd_op_r_process_latency_sum', 'ceph_osd_op_rw', 'ceph_osd_op_rw_in_bytes', 'ceph_osd_op_rw_latency_count', 'ceph_osd_op_rw_latency_sum', 'ceph_osd_op_rw_out_bytes', 'ceph_osd_op_rw_prepare_latency_count', 'ceph_osd_op_rw_prepare_latency_sum', 'ceph_osd_op_rw_process_latency_count', 'ceph_osd_op_rw_process_latency_sum', 'ceph_osd_op_w', 'ceph_osd_op_w_in_bytes', 'ceph_osd_op_w_latency_count', 'ceph_osd_op_w_latency_sum', 'ceph_osd_op_w_prepare_latency_count', 'ceph_osd_op_w_prepare_latency_sum', 'ceph_osd_op_w_process_latency_count', 'ceph_osd_op_w_process_latency_sum', 'ceph_osd_op_wip', 'ceph_osd_recovery_bytes', 'ceph_osd_recovery_ops', 'ceph_osd_stat_bytes', 'ceph_osd_stat_bytes_used', 
'ceph_paxos_accept_timeout', 'ceph_paxos_begin', 'ceph_paxos_begin_bytes_count', 'ceph_paxos_begin_bytes_sum', 'ceph_paxos_begin_keys_count', 'ceph_paxos_begin_keys_sum', 'ceph_paxos_begin_latency_count', 'ceph_paxos_begin_latency_sum', 'ceph_paxos_collect', 'ceph_paxos_collect_bytes_count', 'ceph_paxos_collect_bytes_sum', 'ceph_paxos_collect_keys_count', 'ceph_paxos_collect_keys_sum', 'ceph_paxos_collect_latency_count', 'ceph_paxos_collect_latency_sum', 'ceph_paxos_collect_timeout', 'ceph_paxos_collect_uncommitted', 'ceph_paxos_commit', 'ceph_paxos_commit_bytes_count', 'ceph_paxos_commit_bytes_sum', 'ceph_paxos_commit_keys_count', 'ceph_paxos_commit_keys_sum', 'ceph_paxos_commit_latency_count', 'ceph_paxos_commit_latency_sum', 'ceph_paxos_lease_ack_timeout', 'ceph_paxos_lease_timeout', 'ceph_paxos_new_pn', 'ceph_paxos_new_pn_latency_count', 'ceph_paxos_new_pn_latency_sum', 'ceph_paxos_refresh', 'ceph_paxos_refresh_latency_count', 'ceph_paxos_refresh_latency_sum', 'ceph_paxos_restart', 'ceph_paxos_share_state', 'ceph_paxos_share_state_bytes_count', 'ceph_paxos_share_state_bytes_sum', 'ceph_paxos_share_state_keys_count', 'ceph_paxos_share_state_keys_sum', 'ceph_paxos_start_leader', 'ceph_paxos_start_peon', 'ceph_paxos_store_state', 'ceph_paxos_store_state_bytes_count', 'ceph_paxos_store_state_bytes_sum', 'ceph_paxos_store_state_keys_count', 'ceph_paxos_store_state_keys_sum', 'ceph_paxos_store_state_latency_count', 'ceph_paxos_store_state_latency_sum', 
'ceph_rgw_cache_hit', 'ceph_rgw_cache_miss', 'ceph_rgw_failed_req', 'ceph_rgw_get', 'ceph_rgw_get_b', 'ceph_rgw_get_initial_lat_count', 'ceph_rgw_get_initial_lat_sum', 'ceph_rgw_keystone_token_cache_hit', 'ceph_rgw_keystone_token_cache_miss', 'ceph_rgw_put', 'ceph_rgw_put_b', 'ceph_rgw_put_initial_lat_count', 'ceph_rgw_put_initial_lat_sum', 'ceph_rgw_qactive', 'ceph_rgw_qlen', 'ceph_rgw_req', 
'ceph_rocksdb_compact', 'ceph_rocksdb_compact_queue_len', 'ceph_rocksdb_compact_queue_merge', 'ceph_rocksdb_compact_range', 'ceph_rocksdb_get', 'ceph_rocksdb_get_latency_count', 'ceph_rocksdb_get_latency_sum', 'ceph_rocksdb_rocksdb_write_delay_time_count', 'ceph_rocksdb_rocksdb_write_delay_time_sum', 'ceph_rocksdb_rocksdb_write_memtable_time_count', 'ceph_rocksdb_rocksdb_write_memtable_time_sum', 'ceph_rocksdb_rocksdb_write_pre_and_post_time_count', 'ceph_rocksdb_rocksdb_write_pre_and_post_time_sum', 'ceph_rocksdb_rocksdb_write_wal_time_count', 'ceph_rocksdb_rocksdb_write_wal_time_sum', 'ceph_rocksdb_submit_latency_count', 'ceph_rocksdb_submit_latency_sum', 'ceph_rocksdb_submit_sync_latency_count', 'ceph_rocksdb_submit_sync_latency_sum'

Ceph metrics which should be present on a healthy cluster:
https://github.com/red-hat-storage/ocs-ci/blob/81ca20aed067a30dd109e0f29e026f2a18c752ee/ocs_ci/ocs/metrics.py#L70

Version of all relevant components (if applicable):
ODF-4.13.0-186.stable
OCP-4.13.0-0.nightly-2023-05-10-062807

Is this issue reproducible?
Yes (seen in about 8 CI runs)


Steps to Reproduce:
1. Install OCP/ODF cluster
2. After installation, check whether Prometheus provides values for the
   metrics listed above.


Actual results:
OCP Prometheus provides no values for any of the metrics listed above.

Expected results:
OCP Prometheus provides values for all metrics listed above.
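
The check in step 2 can be sketched in Python as follows. This is a minimal illustration, not the ocs-ci implementation: the helper names (has_samples, query_prometheus, find_missing) are hypothetical, and base_url and token stand for a reachable Prometheus route and a bearer token, which this report does not specify. The only external interface assumed is the standard Prometheus instant-query endpoint, /api/v1/query.

```python
import json
import urllib.parse
import urllib.request


def has_samples(response: dict) -> bool:
    """Return True if an instant-query response carries at least one sample."""
    return (
        response.get("status") == "success"
        and len(response.get("data", {}).get("result", [])) > 0
    )


def query_prometheus(base_url: str, token: str, metric: str) -> dict:
    """Run an instant query for `metric` (base_url and token are assumptions)."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": metric})
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


def find_missing(metrics, run_query):
    """Return the metrics for which `run_query` yields no samples, in input order."""
    return [name for name in metrics if not has_samples(run_query(name))]


# Offline demonstration with a stubbed query function (no cluster needed).
empty = {"status": "success", "data": {"resultType": "vector", "result": []}}
canned = {
    "ceph_osd_op": {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [{"metric": {}, "value": [0, "1"]}],
        },
    },
}
missing = find_missing(
    ["ceph_osd_op", "ceph_mon_num_sessions"],
    lambda metric: canned.get(metric, empty),
)
print(missing)
```

Against a live cluster, the stub would be replaced with something like lambda m: query_prometheus(base_url, token, m). Certificate handling for the route is left out of this sketch.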

Comment 3 Mudit Agarwal 2023-05-15 17:25:05 UTC
Not a 4.13 blocker

Comment 6 Mudit Agarwal 2023-05-23 16:07:57 UTC
Nishanth, please assign it to someone.

Comment 18 Parth Arora 2023-05-30 13:02:36 UTC
Do we have an ODF cluster or must-gather logs for this?

Comment 28 Filip Balák 2023-06-07 11:18:12 UTC
After the fix, the following metrics are still missing: 'ceph_bluestore_submit_lat_sum', 'ceph_bluestore_submit_lat_count', 'ceph_bluestore_throttle_lat_count', 'ceph_bluestore_commit_lat_sum', 'ceph_bluestore_throttle_lat_sum', 'ceph_rocksdb_get', 'ceph_bluestore_commit_lat_count'.
Is this expected?

Tested with odf 4.13.0-214.

Comment 30 avan 2023-06-08 07:25:31 UTC
Hi @Filip,
I investigated the missing metrics you reported, and there appears to be a discrepancy in the metric names on the ocs-ci side. The metrics exported by Ceph are actually named, for example, ceph_bluestore_txc_submit_lat_count, and similarly for the other metrics: https://github.com/ceph/ceph/blame/v17.2.6/src/os/bluestore/BlueStore.cc#L5076

This was updated ~2 years ago in Ceph, whereas metrics.py in ocs-ci was last updated ~3 years ago, so ocs-ci must adopt the metric names coming from Ceph: https://github.com/red-hat-storage/ocs-ci/blob/master/ocs_ci/ocs/metrics.py#L116. The same goes for `ceph_rocksdb_get`: I don't see any metric exported by Ceph with that name, so it must be removed from the file.
Hope this helps.

Thanks
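
A naming discrepancy of this kind can also be spotted mechanically. The following is a minimal sketch, not the method used in this investigation (which relied on the Ceph source): it compares each missing name against the names actually exported and proposes the closest one, using difflib from the Python standard library. The function name suggest_renames, the cutoff of 0.8, and the short exported list are assumptions made for illustration.

```python
import difflib


def suggest_renames(missing, exported, cutoff=0.8):
    """Map each missing metric name to its closest exported name, or None."""
    suggestions = {}
    for name in missing:
        matches = difflib.get_close_matches(name, exported, n=1, cutoff=cutoff)
        suggestions[name] = matches[0] if matches else None
    return suggestions


# Illustrative subset of exported names; a real run would list them from
# Prometheus or from the Ceph exporter output.
exported = [
    "ceph_bluestore_txc_submit_lat_count",
    "ceph_bluestore_txc_submit_lat_sum",
    "ceph_rocksdb_get_latency_count",
]
result = suggest_renames(
    ["ceph_bluestore_submit_lat_count", "ceph_mon_election_call"],
    exported,
)
for old, new in result.items():
    print(old, "->", new)
```

A suggestion is only a hint that a rename may have happened. It still has to be confirmed against the Ceph source, as was done above.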

Comment 34 Sravika 2023-06-12 08:53:53 UTC
Also observing this bug as part of the following ocs-ci test execution on IBM Z

tests/manage/monitoring/prometheusmetrics/test_monitoring_negative.py::test_ceph_metrics_presence_when_osd_down

Comment 35 Filip Balák 2023-06-13 11:22:43 UTC
This is verified based on the discussion in thread https://chat.google.com/room/AAAAREGEba8/KoCb6Izr65o. A note will be needed in the release notes.

New metric names:
ceph_bluestore_submit_lat_sum -> ceph_bluestore_txc_submit_lat_sum
ceph_bluestore_submit_lat_count -> ceph_bluestore_txc_submit_lat_count
ceph_bluestore_throttle_lat_count -> ceph_bluestore_txc_throttle_lat_count
ceph_bluestore_commit_lat_sum -> ceph_bluestore_txc_commit_lat_sum
ceph_bluestore_throttle_lat_sum -> ceph_bluestore_txc_throttle_lat_sum
ceph_bluestore_commit_lat_count -> ceph_bluestore_txc_commit_lat_count

Metric ceph_rocksdb_get was removed because it was redundant and its data can be accessed from metrics ceph_rocksdb_get_latency_sum and ceph_rocksdb_get_latency_count.

--> VERIFIED
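
For anyone maintaining a list of expected metrics, the renames and the removal above can be applied as in the following sketch. It is an illustration only, not the change made in ocs-ci: the names RENAMED, REMOVED, and update_expected are hypothetical, and the mapping contains only the six renames and the one removal listed in this comment.

```python
# Old name -> new name, taken from the list in this comment.
RENAMED = {
    "ceph_bluestore_submit_lat_sum": "ceph_bluestore_txc_submit_lat_sum",
    "ceph_bluestore_submit_lat_count": "ceph_bluestore_txc_submit_lat_count",
    "ceph_bluestore_throttle_lat_count": "ceph_bluestore_txc_throttle_lat_count",
    "ceph_bluestore_commit_lat_sum": "ceph_bluestore_txc_commit_lat_sum",
    "ceph_bluestore_throttle_lat_sum": "ceph_bluestore_txc_throttle_lat_sum",
    "ceph_bluestore_commit_lat_count": "ceph_bluestore_txc_commit_lat_count",
}

# Metrics that are no longer expected at all.
REMOVED = {"ceph_rocksdb_get"}


def update_expected(metrics):
    """Apply the renames and drop removed metrics, preserving input order."""
    return [RENAMED.get(name, name) for name in metrics if name not in REMOVED]


updated = update_expected(
    ["ceph_bluestore_commit_lat_sum", "ceph_rocksdb_get", "ceph_osd_op"]
)
print(updated)
```

Names that are neither renamed nor removed pass through unchanged, so the helper can be run over a full expected-metrics list.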

Comment 37 errata-xmlrpc 2023-06-21 15:25:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3742