Bug 2210027 - OSD daemons not providing perf counters
Summary: OSD daemons not providing perf counters
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 6.1z1
Assignee: Radoslaw Zarzynski
QA Contact: Pawan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-05-25 14:01 UTC by Vishakha Kathole
Modified: 2023-05-31 15:32 UTC (History)
15 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-05-31 15:32:20 UTC
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 2203795 0 unspecified CLOSED ODF Monitoring is missing some of the ceph_* metric values 2023-08-09 17:03:01 UTC
Red Hat Issue Tracker RHCEPH-6745 0 None None None 2023-05-26 09:10:59 UTC

Description Vishakha Kathole 2023-05-25 14:01:03 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

ODF Monitoring is missing some of the ceph_* metric values

List of missing metric values:
'ceph_bluefs_bytes_written_slow', 'ceph_bluefs_bytes_written_sst', 'ceph_bluefs_bytes_written_wal', 'ceph_bluefs_db_total_bytes', 'ceph_bluefs_db_used_bytes', 'ceph_bluefs_log_bytes', 'ceph_bluefs_logged_bytes', 'ceph_bluefs_num_files', 'ceph_bluefs_slow_total_bytes', 'ceph_bluefs_slow_used_bytes', 'ceph_bluefs_wal_total_bytes', 'ceph_bluefs_wal_used_bytes', 'ceph_bluestore_commit_lat_count',
'ceph_bluestore_commit_lat_sum', 'ceph_bluestore_kv_final_lat_count', 'ceph_bluestore_kv_final_lat_sum', 'ceph_bluestore_kv_flush_lat_count', 'ceph_bluestore_kv_flush_lat_sum', 'ceph_bluestore_kv_sync_lat_count', 'ceph_bluestore_kv_sync_lat_sum', 'ceph_bluestore_read_lat_count', 'ceph_bluestore_read_lat_sum', 'ceph_bluestore_state_aio_wait_lat_count', 'ceph_bluestore_state_aio_wait_lat_sum', 'ceph_bluestore_submit_lat_count', 'ceph_bluestore_submit_lat_sum', 'ceph_bluestore_throttle_lat_count', 'ceph_bluestore_throttle_lat_sum', 
'ceph_mon_election_call', 'ceph_mon_election_lose', 'ceph_mon_election_win', 'ceph_mon_num_elections', 'ceph_mon_num_sessions', 'ceph_mon_session_add', 'ceph_mon_session_rm', 'ceph_mon_session_trim', 
'ceph_objecter_op_active', 'ceph_objecter_op_active', 'ceph_objecter_op_r', 'ceph_objecter_op_r', 'ceph_objecter_op_rmw', 'ceph_objecter_op_rmw', 'ceph_objecter_op_w', 'ceph_objecter_op_w', 
'ceph_osd_numpg', 'ceph_osd_numpg_removing', 'ceph_osd_op', 'ceph_osd_op_in_bytes', 'ceph_osd_op_latency_count', 'ceph_osd_op_latency_sum', 'ceph_osd_op_out_bytes', 'ceph_osd_op_prepare_latency_count', 'ceph_osd_op_prepare_latency_sum', 'ceph_osd_op_process_latency_count', 'ceph_osd_op_process_latency_sum', 'ceph_osd_op_r', 'ceph_osd_op_r_latency_count', 'ceph_osd_op_r_latency_sum', 'ceph_osd_op_r_out_bytes', 'ceph_osd_op_r_prepare_latency_count', 'ceph_osd_op_r_prepare_latency_sum', 'ceph_osd_op_r_process_latency_count', 'ceph_osd_op_r_process_latency_sum', 'ceph_osd_op_rw', 'ceph_osd_op_rw_in_bytes', 'ceph_osd_op_rw_latency_count', 'ceph_osd_op_rw_latency_sum', 'ceph_osd_op_rw_out_bytes', 'ceph_osd_op_rw_prepare_latency_count', 'ceph_osd_op_rw_prepare_latency_sum', 'ceph_osd_op_rw_process_latency_count', 'ceph_osd_op_rw_process_latency_sum', 'ceph_osd_op_w', 'ceph_osd_op_w_in_bytes', 'ceph_osd_op_w_latency_count', 'ceph_osd_op_w_latency_sum', 'ceph_osd_op_w_prepare_latency_count', 'ceph_osd_op_w_prepare_latency_sum', 'ceph_osd_op_w_process_latency_count', 'ceph_osd_op_w_process_latency_sum', 'ceph_osd_op_wip', 'ceph_osd_recovery_bytes', 'ceph_osd_recovery_ops', 'ceph_osd_stat_bytes', 'ceph_osd_stat_bytes_used', 
'ceph_paxos_accept_timeout', 'ceph_paxos_begin', 'ceph_paxos_begin_bytes_count', 'ceph_paxos_begin_bytes_sum', 'ceph_paxos_begin_keys_count', 'ceph_paxos_begin_keys_sum', 'ceph_paxos_begin_latency_count', 'ceph_paxos_begin_latency_sum', 'ceph_paxos_collect', 'ceph_paxos_collect_bytes_count', 'ceph_paxos_collect_bytes_sum', 'ceph_paxos_collect_keys_count', 'ceph_paxos_collect_keys_sum', 'ceph_paxos_collect_latency_count', 'ceph_paxos_collect_latency_sum', 'ceph_paxos_collect_timeout', 'ceph_paxos_collect_uncommitted', 'ceph_paxos_commit', 'ceph_paxos_commit_bytes_count', 'ceph_paxos_commit_bytes_sum', 'ceph_paxos_commit_keys_count', 'ceph_paxos_commit_keys_sum', 'ceph_paxos_commit_latency_count', 'ceph_paxos_commit_latency_sum', 'ceph_paxos_lease_ack_timeout', 'ceph_paxos_lease_timeout', 'ceph_paxos_new_pn', 'ceph_paxos_new_pn_latency_count', 'ceph_paxos_new_pn_latency_sum', 'ceph_paxos_refresh', 'ceph_paxos_refresh_latency_count', 'ceph_paxos_refresh_latency_sum', 'ceph_paxos_restart', 'ceph_paxos_share_state', 'ceph_paxos_share_state_bytes_count', 'ceph_paxos_share_state_bytes_sum', 'ceph_paxos_share_state_keys_count', 'ceph_paxos_share_state_keys_sum', 'ceph_paxos_start_leader', 'ceph_paxos_start_peon', 'ceph_paxos_store_state', 'ceph_paxos_store_state_bytes_count', 'ceph_paxos_store_state_bytes_sum', 'ceph_paxos_store_state_keys_count', 'ceph_paxos_store_state_keys_sum', 'ceph_paxos_store_state_latency_count', 'ceph_paxos_store_state_latency_sum', 
'ceph_rgw_cache_hit', 'ceph_rgw_cache_miss', 'ceph_rgw_failed_req', 'ceph_rgw_get', 'ceph_rgw_get_b', 'ceph_rgw_get_initial_lat_count', 'ceph_rgw_get_initial_lat_sum', 'ceph_rgw_keystone_token_cache_hit', 'ceph_rgw_keystone_token_cache_miss', 'ceph_rgw_put', 'ceph_rgw_put_b', 'ceph_rgw_put_initial_lat_count', 'ceph_rgw_put_initial_lat_sum', 'ceph_rgw_qactive', 'ceph_rgw_qlen', 'ceph_rgw_req', 
'ceph_rocksdb_compact', 'ceph_rocksdb_compact_queue_len', 'ceph_rocksdb_compact_queue_merge', 'ceph_rocksdb_compact_range', 'ceph_rocksdb_get', 'ceph_rocksdb_get_latency_count', 'ceph_rocksdb_get_latency_sum', 'ceph_rocksdb_rocksdb_write_delay_time_count', 'ceph_rocksdb_rocksdb_write_delay_time_sum', 'ceph_rocksdb_rocksdb_write_memtable_time_count', 'ceph_rocksdb_rocksdb_write_memtable_time_sum', 'ceph_rocksdb_rocksdb_write_pre_and_post_time_count', 'ceph_rocksdb_rocksdb_write_pre_and_post_time_sum', 'ceph_rocksdb_rocksdb_write_wal_time_count', 'ceph_rocksdb_rocksdb_write_wal_time_sum', 'ceph_rocksdb_submit_latency_count', 'ceph_rocksdb_submit_latency_sum', 'ceph_rocksdb_submit_sync_latency_count', 'ceph_rocksdb_submit_sync_latency_sum'
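Whether a given Prometheus endpoint exposes these names can be checked mechanically by diffing the expected list against the names in the scraped exposition text. A minimal sketch (the sample text and the trimmed expected list below are illustrative, not taken from a real cluster):

```python
# Sketch: given raw Prometheus exposition-format text, report which
# expected metric names are absent. Sample input and expected list
# are illustrative.
import re

def missing_metrics(exposition_text, expected):
    # A metric name is the leading token of each non-comment sample line,
    # before any label block such as {ceph_daemon="osd.0"}.
    present = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = re.match(r"[a-zA-Z_:][a-zA-Z0-9_:]*", line)
        if m:
            present.add(m.group(0))
    return sorted(set(expected) - present)

sample = """\
# HELP ceph_osd_op Client operations
ceph_osd_op{ceph_daemon="osd.0"} 1234
ceph_mon_num_sessions{ceph_daemon="mon.a"} 3
"""
expected = ["ceph_osd_op", "ceph_osd_numpg", "ceph_mon_num_sessions"]
print(missing_metrics(sample, expected))  # → ['ceph_osd_numpg']
```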

Version of all relevant components (if applicable):
ODF- 4.13.0-186.stable
OCP- 4.13.0-0.nightly-2023-05-10-062807


Is there any workaround available to the best of your knowledge?
No

Is this issue reproducible?
Yes (seen in about 8 CI runs)

Can this issue be reproduced from the UI?
Yes: go to "Observe" -> "Metrics" and enter the metric name

If this is a regression, please provide more details to justify this:
Logs for failed testcase
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/vkathole-wl29m413/vkathole-wl29m413_20230330T054932/logs/ocs-ci-logs-1680159780/by_outcome/failed/tests/e2e/workloads/ocp/monitoring/test_monitoring_on_negative_scenarios.py/TestMonitoringBackedByOCS/test_monitoring_shutdown_mgr_pod/logs

Steps to Reproduce:
1. Install OCP/ODF cluster
2. After installation, check whether Prometheus provides values for the
   metrics listed above.
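The check in step 2 can also be scripted against the Prometheus HTTP API (`GET /api/v1/query?query=<metric>`). Fetching is cluster-specific (route URL and bearer token are placeholders), so the sketch below only shows how to interpret the query response; the JSON samples are illustrative:

```python
# Sketch: decide whether a Prometheus /api/v1/query response contains
# any samples for the queried metric. In-cluster you would GET
#   https://<prometheus-route>/api/v1/query?query=<metric>
# with a bearer token; both are placeholders here.
import json

def has_samples(response_json):
    body = json.loads(response_json)
    if body.get("status") != "success":
        return False
    # An instant query returns a (possibly empty) vector of samples.
    return len(body["data"]["result"]) > 0

# Illustrative responses in the shape the API returns:
empty = '{"status":"success","data":{"resultType":"vector","result":[]}}'
hit = ('{"status":"success","data":{"resultType":"vector","result":'
       '[{"metric":{"__name__":"ceph_osd_op"},"value":[1685000000,"42"]}]}}')
print(has_samples(empty), has_samples(hit))  # → False True
```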


Actual results:
OCP Prometheus provides no values for any of the metrics listed above.

Expected results:
OCP Prometheus provides values for all metrics listed above.


Additional info:
The issue seems to be related to the performance counters of OSD daemons. Please refer to https://bugzilla.redhat.com/show_bug.cgi?id=2203795#c16
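Since the mgr's Prometheus exporter builds these metrics from each daemon's perf counters, a first triage step is to confirm the OSD itself still reports them, e.g. with `ceph daemon osd.<id> perf dump` on the OSD host. A sketch for checking which counter sections appear in that dump (the sample JSON is a heavily trimmed, illustrative fragment of the real output):

```python
# Sketch: verify that expected counter sections appear in the JSON
# emitted by `ceph daemon osd.<id> perf dump`. The sample is an
# illustrative fragment, not real cluster output.
import json

def present_sections(perf_dump_json, wanted):
    dump = json.loads(perf_dump_json)
    return [s for s in wanted if s in dump]

sample = json.dumps({
    "osd": {"op": 1234, "numpg": 57},
    "bluefs": {"num_files": 12},
})
print(present_sections(sample, ["osd", "bluefs", "bluestore"]))
# → ['osd', 'bluefs']
```

If the sections are present in the perf dump but absent from Prometheus, the gap is on the exporter side rather than in the OSD daemons, which matches the eventual resolution of this bug.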

Comment 3 Harish NV Rao 2023-05-26 08:04:56 UTC
@athakkar should the product of this BZ be Red Hat Ceph Storage and component be RADOS?

Comment 6 Mudit Agarwal 2023-05-30 07:01:10 UTC
Vishakha, can you please check the same (as comment #5) on a downstream 6.1 cluster?

Comment 17 Mudit Agarwal 2023-05-31 15:32:20 UTC
It needs a fix in rook, more details https://bugzilla.redhat.com/show_bug.cgi?id=2203795#c24

Closing the Ceph bug.

