Bug 2210027

Summary: OSD daemons not providing perf counters
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vishakha Kathole <vkathole>
Component: RADOS
Assignee: Radoslaw Zarzynski <rzarzyns>
Status: CLOSED NOTABUG
QA Contact: Pawan <pdhiran>
Severity: high
Docs Contact:
Priority: urgent
Version: 6.1
CC: athakkar, bhubbard, bniver, ceph-eng-bugs, cephqe-warriors, hakumar, hnallurv, jolmomar, muagarwa, nojha, ocs-bugs, paarora, rzarzyns, sostapov, vumrao
Target Milestone: ---
Keywords: Automation, Regression
Target Release: 6.1z1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-05-31 15:32:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Vishakha Kathole 2023-05-25 14:01:03 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

ODF Monitoring is missing some of the ceph_* metric values

List of missing metric values:
'ceph_bluefs_bytes_written_slow', 'ceph_bluefs_bytes_written_sst', 'ceph_bluefs_bytes_written_wal', 'ceph_bluefs_db_total_bytes', 'ceph_bluefs_db_used_bytes', 'ceph_bluefs_log_bytes', 'ceph_bluefs_logged_bytes', 'ceph_bluefs_num_files', 'ceph_bluefs_slow_total_bytes', 'ceph_bluefs_slow_used_bytes', 'ceph_bluefs_wal_total_bytes', 'ceph_bluefs_wal_used_bytes',
'ceph_bluestore_commit_lat_count', 'ceph_bluestore_commit_lat_sum', 'ceph_bluestore_kv_final_lat_count', 'ceph_bluestore_kv_final_lat_sum', 'ceph_bluestore_kv_flush_lat_count', 'ceph_bluestore_kv_flush_lat_sum', 'ceph_bluestore_kv_sync_lat_count', 'ceph_bluestore_kv_sync_lat_sum', 'ceph_bluestore_read_lat_count', 'ceph_bluestore_read_lat_sum', 'ceph_bluestore_state_aio_wait_lat_count', 'ceph_bluestore_state_aio_wait_lat_sum', 'ceph_bluestore_submit_lat_count', 'ceph_bluestore_submit_lat_sum', 'ceph_bluestore_throttle_lat_count', 'ceph_bluestore_throttle_lat_sum',
'ceph_mon_election_call', 'ceph_mon_election_lose', 'ceph_mon_election_win', 'ceph_mon_num_elections', 'ceph_mon_num_sessions', 'ceph_mon_session_add', 'ceph_mon_session_rm', 'ceph_mon_session_trim', 
'ceph_objecter_op_active', 'ceph_objecter_op_r', 'ceph_objecter_op_rmw', 'ceph_objecter_op_w',
'ceph_osd_numpg', 'ceph_osd_numpg_removing', 'ceph_osd_op', 'ceph_osd_op_in_bytes', 'ceph_osd_op_latency_count', 'ceph_osd_op_latency_sum', 'ceph_osd_op_out_bytes', 'ceph_osd_op_prepare_latency_count', 'ceph_osd_op_prepare_latency_sum', 'ceph_osd_op_process_latency_count', 'ceph_osd_op_process_latency_sum', 'ceph_osd_op_r', 'ceph_osd_op_r_latency_count', 'ceph_osd_op_r_latency_sum', 'ceph_osd_op_r_out_bytes', 'ceph_osd_op_r_prepare_latency_count', 'ceph_osd_op_r_prepare_latency_sum', 'ceph_osd_op_r_process_latency_count', 'ceph_osd_op_r_process_latency_sum', 'ceph_osd_op_rw', 'ceph_osd_op_rw_in_bytes', 'ceph_osd_op_rw_latency_count', 'ceph_osd_op_rw_latency_sum', 'ceph_osd_op_rw_out_bytes', 'ceph_osd_op_rw_prepare_latency_count', 'ceph_osd_op_rw_prepare_latency_sum', 'ceph_osd_op_rw_process_latency_count', 'ceph_osd_op_rw_process_latency_sum', 'ceph_osd_op_w', 'ceph_osd_op_w_in_bytes', 'ceph_osd_op_w_latency_count', 'ceph_osd_op_w_latency_sum', 'ceph_osd_op_w_prepare_latency_count', 'ceph_osd_op_w_prepare_latency_sum', 'ceph_osd_op_w_process_latency_count', 'ceph_osd_op_w_process_latency_sum', 'ceph_osd_op_wip', 'ceph_osd_recovery_bytes', 'ceph_osd_recovery_ops', 'ceph_osd_stat_bytes', 'ceph_osd_stat_bytes_used', 
'ceph_paxos_accept_timeout', 'ceph_paxos_begin', 'ceph_paxos_begin_bytes_count', 'ceph_paxos_begin_bytes_sum', 'ceph_paxos_begin_keys_count', 'ceph_paxos_begin_keys_sum', 'ceph_paxos_begin_latency_count', 'ceph_paxos_begin_latency_sum', 'ceph_paxos_collect', 'ceph_paxos_collect_bytes_count', 'ceph_paxos_collect_bytes_sum', 'ceph_paxos_collect_keys_count', 'ceph_paxos_collect_keys_sum', 'ceph_paxos_collect_latency_count', 'ceph_paxos_collect_latency_sum', 'ceph_paxos_collect_timeout', 'ceph_paxos_collect_uncommitted', 'ceph_paxos_commit', 'ceph_paxos_commit_bytes_count', 'ceph_paxos_commit_bytes_sum', 'ceph_paxos_commit_keys_count', 'ceph_paxos_commit_keys_sum', 'ceph_paxos_commit_latency_count', 'ceph_paxos_commit_latency_sum', 'ceph_paxos_lease_ack_timeout', 'ceph_paxos_lease_timeout', 'ceph_paxos_new_pn', 'ceph_paxos_new_pn_latency_count', 'ceph_paxos_new_pn_latency_sum', 'ceph_paxos_refresh', 'ceph_paxos_refresh_latency_count', 'ceph_paxos_refresh_latency_sum', 'ceph_paxos_restart', 'ceph_paxos_share_state', 'ceph_paxos_share_state_bytes_count', 'ceph_paxos_share_state_bytes_sum', 'ceph_paxos_share_state_keys_count', 'ceph_paxos_share_state_keys_sum', 'ceph_paxos_start_leader', 'ceph_paxos_start_peon', 'ceph_paxos_store_state', 'ceph_paxos_store_state_bytes_count', 'ceph_paxos_store_state_bytes_sum', 'ceph_paxos_store_state_keys_count', 'ceph_paxos_store_state_keys_sum', 'ceph_paxos_store_state_latency_count', 'ceph_paxos_store_state_latency_sum', 
'ceph_rgw_cache_hit', 'ceph_rgw_cache_miss', 'ceph_rgw_failed_req', 'ceph_rgw_get', 'ceph_rgw_get_b', 'ceph_rgw_get_initial_lat_count', 'ceph_rgw_get_initial_lat_sum', 'ceph_rgw_keystone_token_cache_hit', 'ceph_rgw_keystone_token_cache_miss', 'ceph_rgw_put', 'ceph_rgw_put_b', 'ceph_rgw_put_initial_lat_count', 'ceph_rgw_put_initial_lat_sum', 'ceph_rgw_qactive', 'ceph_rgw_qlen', 'ceph_rgw_req', 
'ceph_rocksdb_compact', 'ceph_rocksdb_compact_queue_len', 'ceph_rocksdb_compact_queue_merge', 'ceph_rocksdb_compact_range', 'ceph_rocksdb_get', 'ceph_rocksdb_get_latency_count', 'ceph_rocksdb_get_latency_sum', 'ceph_rocksdb_rocksdb_write_delay_time_count', 'ceph_rocksdb_rocksdb_write_delay_time_sum', 'ceph_rocksdb_rocksdb_write_memtable_time_count', 'ceph_rocksdb_rocksdb_write_memtable_time_sum', 'ceph_rocksdb_rocksdb_write_pre_and_post_time_count', 'ceph_rocksdb_rocksdb_write_pre_and_post_time_sum', 'ceph_rocksdb_rocksdb_write_wal_time_count', 'ceph_rocksdb_rocksdb_write_wal_time_sum', 'ceph_rocksdb_submit_latency_count', 'ceph_rocksdb_submit_latency_sum', 'ceph_rocksdb_submit_sync_latency_count', 'ceph_rocksdb_submit_sync_latency_sum'
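The list above spans several subsystems (BlueFS, BlueStore, mon, objecter, OSD, Paxos, RGW, RocksDB). As a rough aid for triage, the names can be grouped by the token that follows the `ceph_` prefix. This is only a sketch: the helper name `group_by_subsystem` is hypothetical, and treating the second token as the subsystem is an assumption based on the naming pattern visible in the list, not a documented rule.

```python
from collections import defaultdict


def group_by_subsystem(metric_names):
    """Group metric names by the token after the 'ceph_' prefix.

    Assumption: names look like 'ceph_<subsystem>_<counter>'. Names that
    do not fit that pattern are collected under 'other'. Duplicate names
    are collapsed.
    """
    groups = defaultdict(set)
    for name in metric_names:
        parts = name.split("_", 2)
        if len(parts) == 3 and parts[0] == "ceph":
            groups[parts[1]].add(name)
        else:
            groups["other"].add(name)
    # Sort both the subsystem keys and the names for stable output.
    return {key: sorted(names) for key, names in sorted(groups.items())}
```

Feeding the full list through this helper gives one entry per subsystem, which makes it easier to see which daemons are not exporting counters.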

Version of all relevant components (if applicable):
ODF: 4.13.0-186.stable
OCP: 4.13.0-0.nightly-2023-05-10-062807


Is there any workaround available to the best of your knowledge?
No

Is this issue reproducible?
Yes (seen in about 8 CI runs)

Can this issue be reproduced from the UI?
Yes. It can be checked under "Observe" -> "Metrics" by entering the metric name.

If this is a regression, please provide more details to justify this:
Logs for the failed test case:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/vkathole-wl29m413/vkathole-wl29m413_20230330T054932/logs/ocs-ci-logs-1680159780/by_outcome/failed/tests/e2e/workloads/ocp/monitoring/test_monitoring_on_negative_scenarios.py/TestMonitoringBackedByOCS/test_monitoring_shutdown_mgr_pod/logs

Steps to Reproduce:
1. Install OCP/ODF cluster
2. After installation, check whether Prometheus provides values for the
   metrics listed above.
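The check in step 2 can be scripted. The sketch below is not the code used by the CI test; it is a minimal illustration that assumes the Prometheus instance is reachable over its standard HTTP API (`GET /api/v1/query`). The function names, the base URL, and the bearer token are all hypothetical, and an OCP cluster may need additional TLS or authentication setup that is not shown here.

```python
import json
import urllib.parse
import urllib.request


def has_samples(payload):
    """Return True if an instant-query response contains at least one sample."""
    if payload.get("status") != "success":
        return False
    return bool(payload.get("data", {}).get("result"))


def query_prometheus(base_url, metric, token=None):
    """Run an instant query for `metric` via the Prometheus HTTP API.

    `base_url` and `token` are placeholders for whatever the cluster
    under test actually exposes.
    """
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": metric})
    request = urllib.request.Request(url)
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def find_missing(metrics, fetch):
    """Return the metrics for which fetch(metric) yields no samples, in input order."""
    return [metric for metric in metrics if not has_samples(fetch(metric))]
```

With a reachable endpoint, something like `find_missing(names, lambda m: query_prometheus(base_url, m, token))` would return the subset of `names` that Prometheus has no values for. Passing the fetch function in as a parameter keeps the comparison logic testable without a live cluster.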


Actual results:
OCP Prometheus provides no values for any of the metrics listed above.

Expected results:
OCP Prometheus provides values for all metrics listed above.


Additional info:
The issue seems to be related to the performance counters of the OSD daemons. Please refer to https://bugzilla.redhat.com/show_bug.cgi?id=2203795#c16

Comment 3 Harish NV Rao 2023-05-26 08:04:56 UTC
@athakkar, should the product of this BZ be Red Hat Ceph Storage and the component be RADOS?

Comment 6 Mudit Agarwal 2023-05-30 07:01:10 UTC
Vishakha, can you please check the same (as comment #5) on a downstream 6.1 cluster?

Comment 17 Mudit Agarwal 2023-05-31 15:32:20 UTC
This needs a fix in Rook. For more details, see https://bugzilla.redhat.com/show_bug.cgi?id=2203795#c24

Closing the ceph bug.