Bug 2253429 - ODF Monitoring is missing some of the metric values 4.14 [NEEDINFO]
Summary: ODF Monitoring is missing some of the metric values 4.14
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.14.6
Assignee: Divyansh Kamboj
QA Contact: Daniel Osypenko
URL:
Whiteboard:
Duplicates: 2253428 2262307
Depends On: 2258861
Blocks: 2242324 2244409
 
Reported: 2023-12-07 11:59 UTC by Daniel Osypenko
Modified: 2024-04-01 09:17 UTC
CC List: 14 users

Fixed In Version: 4.14.6-1
Doc Type: No Doc Update
Doc Text:
Clone Of: 2221488
Environment:
Last Closed: 2024-04-01 09:17:35 UTC
Embargoed:
athakkar: needinfo? (fbalak)
muagarwa: needinfo? (fbalak)
nthomas: needinfo? (dkamboj)
sheggodu: needinfo? (dkamboj)




Links
Github: red-hat-storage/rook pull 566 (open), "Bug 2262307: exporter: Don't delete exporter service on daemon deletion", last updated 2024-02-09 09:15:13 UTC
Red Hat Product Errata: RHBA-2024:1579, last updated 2024-04-01 09:17:49 UTC

Description Daniel Osypenko 2023-12-07 11:59:10 UTC
+++ This bug was initially created as a clone of Bug #2221488 +++

This bug was initially created as a copy of Bug #2203795


I am copying this bug because: 

----
ODF 4.14.1-14

The same list of 142 metrics is missing, on a non-external mode deployment.
https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/557/17055/827985/827986/827990/log?logParams=history%3D827990%26page.page%3D1
----

Cloned bug:

Even though the missing metric names differ from the metrics missing in 4.13, the description of the problem and the parties to the discussion should be the same.

Description of problem (please be as detailed as possible and provide log
snippets):
ODF Monitoring is missing some of the ceph_* metric values. No related epic documenting a metric change or rename was found.

List of missing metric values:
'ceph_bluestore_state_aio_wait_lat_sum',
'ceph_paxos_store_state_latency_sum',
'ceph_osd_op_out_bytes',
'ceph_bluestore_txc_submit_lat_sum',
'ceph_paxos_commit',
'ceph_paxos_new_pn_latency_count',
'ceph_osd_op_r_process_latency_count',
'ceph_bluestore_txc_submit_lat_count',
'ceph_bluestore_kv_final_lat_sum',
'ceph_paxos_collect_keys_sum',
'ceph_paxos_accept_timeout',
'ceph_paxos_begin_latency_count',
'ceph_bluefs_wal_total_bytes',
'ceph_paxos_refresh',
'ceph_bluestore_read_lat_count',
'ceph_mon_num_sessions',
'ceph_bluefs_bytes_written_wal',
'ceph_mon_num_elections',
'ceph_rocksdb_compact',
'ceph_bluestore_kv_sync_lat_sum',
'ceph_osd_op_process_latency_count',
'ceph_osd_op_w_prepare_latency_count',
'ceph_paxos_begin_latency_sum',
'ceph_osd_op_r',
'ceph_osd_op_rw_prepare_latency_sum',
'ceph_paxos_new_pn',
'ceph_rocksdb_get_latency_count',
'ceph_paxos_commit_latency_count',
'ceph_bluestore_txc_throttle_lat_count',
'ceph_paxos_lease_ack_timeout',
'ceph_bluestore_txc_commit_lat_sum',
'ceph_paxos_collect_bytes_sum',
'ceph_osd_op_rw_latency_count',
'ceph_paxos_collect_uncommitted',
'ceph_osd_op_rw_latency_sum',
'ceph_paxos_share_state',
'ceph_osd_op_r_prepare_latency_sum',
'ceph_bluestore_kv_flush_lat_sum',
'ceph_osd_op_rw_process_latency_sum',
'ceph_rocksdb_rocksdb_write_memtable_time_count',
'ceph_paxos_collect_latency_count',
'ceph_osd_op_rw_prepare_latency_count',
'ceph_paxos_collect_latency_sum',
'ceph_rocksdb_rocksdb_write_delay_time_count',
'ceph_paxos_begin_bytes_sum',
'ceph_osd_numpg',
'ceph_osd_stat_bytes',
'ceph_rocksdb_submit_sync_latency_sum',
'ceph_rocksdb_compact_queue_merge',
'ceph_paxos_collect_bytes_count',
'ceph_osd_op',
'ceph_paxos_commit_keys_sum',
'ceph_osd_op_rw_in_bytes',
'ceph_osd_op_rw_out_bytes',
'ceph_bluefs_bytes_written_sst',
'ceph_osd_op_rw_process_latency_count',
'ceph_rocksdb_compact_queue_len',
'ceph_bluestore_txc_throttle_lat_sum',
'ceph_bluefs_slow_used_bytes',
'ceph_osd_op_r_latency_sum',
'ceph_bluestore_kv_flush_lat_count',
'ceph_rocksdb_compact_range',
'ceph_osd_op_latency_sum',
'ceph_mon_session_add',
'ceph_paxos_share_state_keys_count',
'ceph_paxos_collect',
'ceph_osd_op_w_in_bytes',
'ceph_osd_op_r_process_latency_sum',
'ceph_paxos_start_peon',
'ceph_mon_session_trim',
'ceph_rocksdb_get_latency_sum',
'ceph_osd_op_rw',
'ceph_paxos_store_state_keys_count',
'ceph_rocksdb_rocksdb_write_delay_time_sum',
'ceph_osd_recovery_ops',
'ceph_bluefs_logged_bytes',
'ceph_bluefs_db_total_bytes',
'ceph_osd_op_w_latency_count',
'ceph_bluestore_txc_commit_lat_count',
'ceph_bluestore_state_aio_wait_lat_count',
'ceph_paxos_begin_bytes_count',
'ceph_paxos_start_leader',
'ceph_mon_election_call',
'ceph_rocksdb_rocksdb_write_pre_and_post_time_count',
'ceph_mon_session_rm',
'ceph_paxos_store_state',
'ceph_paxos_store_state_bytes_count',
'ceph_osd_op_w_latency_sum',
'ceph_rocksdb_submit_latency_count',
'ceph_paxos_commit_latency_sum',
'ceph_rocksdb_rocksdb_write_memtable_time_sum',
'ceph_paxos_share_state_bytes_sum',
'ceph_osd_op_process_latency_sum',
'ceph_paxos_begin_keys_sum',
'ceph_rocksdb_rocksdb_write_pre_and_post_time_sum',
'ceph_bluefs_wal_used_bytes',
'ceph_rocksdb_rocksdb_write_wal_time_sum',
'ceph_osd_op_wip',
'ceph_paxos_lease_timeout',
'ceph_osd_op_r_out_bytes',
'ceph_paxos_begin_keys_count',
'ceph_bluestore_kv_sync_lat_count',
'ceph_osd_op_prepare_latency_count',
'ceph_bluefs_bytes_written_slow',
'ceph_rocksdb_submit_latency_sum',
'ceph_osd_op_r_latency_count',
'ceph_paxos_share_state_keys_sum',
'ceph_paxos_store_state_bytes_sum',
'ceph_osd_op_latency_count',
'ceph_paxos_commit_bytes_count',
'ceph_paxos_restart',
'ceph_bluefs_slow_total_bytes',
'ceph_paxos_collect_timeout',
'ceph_osd_op_w_process_latency_sum',
'ceph_paxos_collect_keys_count',
'ceph_paxos_share_state_bytes_count',
'ceph_osd_op_w_prepare_latency_sum',
'ceph_bluestore_read_lat_sum',
'ceph_osd_stat_bytes_used',
'ceph_paxos_begin',
'ceph_mon_election_win',
'ceph_osd_op_w_process_latency_count',
'ceph_rocksdb_rocksdb_write_wal_time_count',
'ceph_paxos_store_state_keys_sum',
'ceph_osd_numpg_removing',
'ceph_paxos_commit_keys_count',
'ceph_paxos_new_pn_latency_sum',
'ceph_osd_op_in_bytes',
'ceph_paxos_store_state_latency_count',
'ceph_paxos_refresh_latency_count',
'ceph_osd_op_r_prepare_latency_count',
'ceph_bluefs_num_files',
'ceph_mon_election_lose',
'ceph_osd_op_prepare_latency_sum',
'ceph_bluefs_db_used_bytes',
'ceph_bluestore_kv_final_lat_count',
'ceph_paxos_refresh_latency_sum',
'ceph_osd_recovery_bytes',
'ceph_osd_op_w',
'ceph_paxos_commit_bytes_sum',
'ceph_bluefs_log_bytes',
'ceph_rocksdb_submit_sync_latency_count',

Ceph metrics that should be present on a healthy cluster:
https://github.com/red-hat-storage/ocs-ci/blob/81ca20aed067a30dd109e0f29e026f2a18c752ee/ocs_ci/ocs/metrics.py#L70

Polarion documentation: https://polarion.engineering.redhat.com/polarion/#/project/OpenShiftContainerStorage/workitem?id=OCS-958

Version of all relevant components (if applicable):

OC version:
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.14.0-0.nightly-2023-06-30-131338
Kubernetes Version: v1.27.3+ab0b8ee

OCS version:
ocs-operator.v4.14.0-36.stable              OpenShift Container Storage   4.14.0-36.stable              Succeeded

Cluster version
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-06-30-131338   True        False         4d1h    Cluster version is 4.14.0-0.nightly-2023-06-30-131338

Rook version:
rook: v4.14.0-0.d8ce011027a26218154bcedf63a54e97f020df40
go: go1.20.4

Ceph version:
ceph version 17.2.6-70.el9cp (fe62dcdbb2c6e05782a3e2b67d025b84ff5047cc) quincy (stable)


Is this issue reproducible?
Yes, reproducible both in CI runs and in local runs.


Steps to Reproduce:
1. Install OCP/ODF cluster
2. After installation, check whether Prometheus provides values for the
   metrics listed above (for example, with a query along the lines of the sketch below).
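
A minimal sketch of such a check, assuming the cluster-monitoring Thanos querier route and a bearer token; the host and token below are placeholders, not values taken from this bug:

import requests

# Placeholders: obtain the host with "oc get route thanos-querier -n openshift-monitoring"
# and the token with "oc whoami -t".
PROM_URL = "https://<thanos-querier-route-host>/api/v1/query"
TOKEN = "<bearer-token>"

def metric_has_values(metric_name):
    # Instant query; an empty result list means Prometheus has no values for the metric.
    resp = requests.get(
        PROM_URL,
        params={"query": metric_name},
        headers={"Authorization": "Bearer " + TOKEN},
        verify=False,  # test clusters often use a self-signed router certificate
    )
    resp.raise_for_status()
    return len(resp.json()["data"]["result"]) > 0

for metric in ("ceph_osd_op_r", "ceph_mon_num_sessions"):
    print(metric, "present" if metric_has_values(metric) else "missing")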


Actual results:
OCP Prometheus provides no values for any of the metrics listed above.

Expected results:
OCP Prometheus provides values for all metrics listed above.

logs of the test-run:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-034aikt1c33-t1/j-034aikt1c33-t1_20230704T064403/logs/ocs-ci-logs-1688456400/by_outcome/failed/tests/manage/monitoring/prometheusmetrics/test_monitoring_defaults.py/test_ceph_metrics_available/logs

must-gather logs
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-034aikt1c33-t1/j-034aikt1c33-t1_20230704T064403/logs/testcases_1688456400/

--- Additional comment from RHEL Program Management on 2023-07-09 12:02:36 UTC ---

This bug previously had no release flag set; the release flag 'odf-4.14.0' has now been set to '?', so the bug is being proposed to be fixed in the ODF 4.14.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since the Acks are to be set against a release flag.

--- Additional comment from RHEL Program Management on 2023-07-09 12:02:36 UTC ---

The 'Target Release' is not to be set manually at the Red Hat OpenShift Data Foundation product.

The 'Target Release' will be auto set appropriately, after the 3 Acks (pm,devel,qa) are set to "+" for a specific release flag and that release flag gets auto set to "+".

--- Additional comment from Travis Nielsen on 2023-07-10 16:47:34 UTC ---

Avan PTAL

--- Additional comment from avan on 2023-07-25 17:26:37 UTC ---

@Daniel,
Is this still reproducible?

--- Additional comment from Daniel Osypenko on 2023-07-26 08:21:32 UTC ---

@athakkar OCS 4.14.0-77 still fails

--- Additional comment from avan on 2023-08-01 10:45:40 UTC ---

(In reply to Daniel Osypenko from comment #5)
> @athakkar OCS 4.14.0-77 still fails

Currently the ceph-exporter is disabled for the 4.14 build, as there were some issues detected in upstream Ceph. The plan is to get the fixes delivered in 6.1z2 and then enable the exporter in the 4.14 release branch of the rook repo this week.

--- Additional comment from Travis Nielsen on 2023-08-01 15:12:18 UTC ---

Avan Was the exporter disabled in Ceph? If so, we can move this BZ over to the ceph component

--- Additional comment from avan on 2023-08-02 09:37:26 UTC ---

(In reply to Travis Nielsen from comment #7)
> Avan Was the exporter disabled in Ceph? If so, we can move this BZ over to
> the ceph component

No, I mean it was disabled on the Rook end. By the way, the exporter fixes are merged upstream, so they will soon be backported to downstream 6.1z2.

--- Additional comment from Travis Nielsen on 2023-08-02 16:32:13 UTC ---

Oh right, upstream requires a minimum Ceph version of v18 for the exporter to be enabled, which means it's disabled in 4.14 until we change MinVersionForCephExporter again.
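
For illustration only, a minimal Python sketch of the version gate described above (not Rook's actual Go code, and the exact tuple is an assumption based on this comment): the exporter is deployed only when the detected Ceph version is at or above v18, so a v17 (Quincy) based 4.14 cluster never runs it and the ceph_* perf counters are never scraped.

# Hypothetical illustration of the gate discussed above; not taken from the Rook codebase.
MIN_VERSION_FOR_CEPH_EXPORTER = (18, 0, 0)  # assumed: v18 (Reef)

def exporter_enabled(ceph_version):
    # Compare (major, minor, patch) tuples lexicographically.
    return tuple(ceph_version) >= MIN_VERSION_FOR_CEPH_EXPORTER

print(exporter_enabled((17, 2, 6)))  # False: exporter stays disabled on this 4.14 cluster
print(exporter_enabled((18, 2, 0)))  # True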

--- Additional comment from Red Hat Bugzilla on 2023-08-03 08:28:02 UTC ---

Account disabled by LDAP Audit

--- Additional comment from Mudit Agarwal on 2023-08-08 05:35:34 UTC ---

Avan, please add the link to the Ceph BZ/PR which has the exporter changes.
Also, when are we planning to enable it from the Rook side?

Elad, please provide qa ack.

--- Additional comment from RHEL Program Management on 2023-08-08 06:31:57 UTC ---

This BZ is being approved for the ODF 4.14.0 release, upon receipt of the 3 ACKs (PM, Devel, QA) for the release flag 'odf-4.14.0'.

--- Additional comment from RHEL Program Management on 2023-08-08 06:31:57 UTC ---

Since this bug has been approved for the ODF 4.14.0 release, through release flag 'odf-4.14.0+', the Target Release is being set to 'ODF 4.14.0'.

--- Additional comment from avan on 2023-08-08 06:33:39 UTC ---

(In reply to Mudit Agarwal from comment #11)
> Avan, please add the link of Ceph BZ/PR which has the exporter changes?
> Also, when are we planning it to be enabled from rook side?
> 
> Elad, please provide qa ack.

Ceph BZs 
https://bugzilla.redhat.com/show_bug.cgi?id=2217817
https://bugzilla.redhat.com/show_bug.cgi?id=2229267

Once these BZs have moved to ON_QA (once we have a new build), the exporter can be enabled in Rook for the 4.14 release branch.

--- Additional comment from avan on 2023-08-09 11:20:02 UTC ---

@kdreyer @branto

Given that we have the new Ceph image ready with the required exporter changes (https://bugzilla.redhat.com/show_bug.cgi?id=2217817#c3), can you help make sure that ODF 4.14 uses this new image for testing?

--- Additional comment from Boris Ranto on 2023-08-09 11:53:41 UTC ---

Done, I updated the defaults to use the new RHCS 6.1z2 first build (6-200). We should have it in our builds starting from tomorrow.

--- Additional comment from errata-xmlrpc on 2023-08-10 03:50:33 UTC ---

This bug has been added to advisory RHBA-2023:115514 by ceph-build service account (ceph-build.COM)

--- Additional comment from Daniel Osypenko on 2023-08-31 11:29:03 UTC ---

Fixed. The same automation test that previously failed (test_monitoring_reporting_ok_when_idle) now passes:

13:42:54 - MainThread - /Users/danielosypenko/Work/automation_4/ocs-ci/ocs_ci/utility/prometheus.py - INFO  - No bad values detected
13:42:54 - MainThread - /Users/danielosypenko/Work/automation_4/ocs-ci/ocs_ci/utility/prometheus.py - INFO  - No invalid values detected
13:42:54 - MainThread - test_monitoring_defaults - INFO  - ceph_osd_in metric does indicate no problems with OSDs
PASSED

--- Additional comment from Daniel Osypenko on 2023-09-04 10:13:06 UTC ---

The BZ has been moved to Verified by mistake.

List of missing metrics on OCP 4.14.0-0.nightly-2023-09-02-132842 ODF 4.14.0-125.stable
['ceph_bluestore_state_aio_wait_lat_sum', 'ceph_paxos_store_state_latency_sum', 'ceph_osd_op_out_bytes', 'ceph_bluestore_txc_submit_lat_sum', 'ceph_paxos_commit', 'ceph_paxos_new_pn_latency_count', 'ceph_osd_op_r_process_latency_count', 'ceph_bluestore_txc_submit_lat_count', 'ceph_bluestore_kv_final_lat_sum', 'ceph_paxos_collect_keys_sum', 'ceph_paxos_accept_timeout', 'ceph_paxos_begin_latency_count', 'ceph_bluefs_wal_total_bytes', 'ceph_paxos_refresh', 'ceph_bluestore_read_lat_count', 'ceph_mon_num_sessions', 'ceph_objecter_op_rmw', 'ceph_bluefs_bytes_written_wal', 'ceph_mon_num_elections', 'ceph_rocksdb_compact', 'ceph_bluestore_kv_sync_lat_sum', 'ceph_osd_op_process_latency_count', 'ceph_osd_op_w_prepare_latency_count', 'ceph_objecter_op_active', 'ceph_paxos_begin_latency_sum', 'ceph_osd_op_r', 'ceph_osd_op_rw_prepare_latency_sum', 'ceph_paxos_new_pn', 'ceph_rgw_qlen', 'ceph_rgw_req', 'ceph_rocksdb_get_latency_count', 'ceph_rgw_cache_miss', 'ceph_paxos_commit_latency_count', 'ceph_bluestore_txc_throttle_lat_count', 'ceph_paxos_lease_ack_timeout', 'ceph_bluestore_txc_commit_lat_sum', 'ceph_paxos_collect_bytes_sum', 'ceph_osd_op_rw_latency_count', 'ceph_paxos_collect_uncommitted', 'ceph_osd_op_rw_latency_sum', 'ceph_paxos_share_state', 'ceph_osd_op_r_prepare_latency_sum', 'ceph_bluestore_kv_flush_lat_sum', 'ceph_osd_op_rw_process_latency_sum', 'ceph_rocksdb_rocksdb_write_memtable_time_count', 'ceph_paxos_collect_latency_count', 'ceph_osd_op_rw_prepare_latency_count', 'ceph_paxos_collect_latency_sum', 'ceph_rocksdb_rocksdb_write_delay_time_count', 'ceph_objecter_op_rmw', 'ceph_paxos_begin_bytes_sum', 'ceph_osd_numpg', 'ceph_osd_stat_bytes', 'ceph_rocksdb_submit_sync_latency_sum', 'ceph_rocksdb_compact_queue_merge', 'ceph_paxos_collect_bytes_count', 'ceph_osd_op', 'ceph_paxos_commit_keys_sum', 'ceph_osd_op_rw_in_bytes', 'ceph_osd_op_rw_out_bytes', 'ceph_bluefs_bytes_written_sst', 'ceph_rgw_put', 'ceph_osd_op_rw_process_latency_count', 'ceph_rocksdb_compact_queue_len', 'ceph_bluestore_txc_throttle_lat_sum', 'ceph_bluefs_slow_used_bytes', 'ceph_osd_op_r_latency_sum', 'ceph_bluestore_kv_flush_lat_count', 'ceph_rocksdb_compact_range', 'ceph_osd_op_latency_sum', 'ceph_mon_session_add', 'ceph_paxos_share_state_keys_count', 'ceph_paxos_collect', 'ceph_osd_op_w_in_bytes', 'ceph_osd_op_r_process_latency_sum', 'ceph_paxos_start_peon', 'ceph_mon_session_trim', 'ceph_rocksdb_get_latency_sum', 'ceph_osd_op_rw', 'ceph_paxos_store_state_keys_count', 'ceph_rocksdb_rocksdb_write_delay_time_sum', 'ceph_objecter_op_r', 'ceph_objecter_op_active', 'ceph_objecter_op_w', 'ceph_osd_recovery_ops', 'ceph_bluefs_logged_bytes', 'ceph_bluefs_db_total_bytes', 'ceph_rgw_put_initial_lat_sum', 'ceph_osd_op_w_latency_count', 'ceph_rgw_put_initial_lat_count', 'ceph_bluestore_txc_commit_lat_count', 'ceph_bluestore_state_aio_wait_lat_count', 'ceph_paxos_begin_bytes_count', 'ceph_paxos_start_leader', 'ceph_mon_election_call', 'ceph_rocksdb_rocksdb_write_pre_and_post_time_count', 'ceph_mon_session_rm', 'ceph_paxos_store_state', 'ceph_paxos_store_state_bytes_count', 'ceph_osd_op_w_latency_sum', 'ceph_rgw_keystone_token_cache_hit', 'ceph_rocksdb_submit_latency_count', 'ceph_paxos_commit_latency_sum', 'ceph_rocksdb_rocksdb_write_memtable_time_sum', 'ceph_paxos_share_state_bytes_sum', 'ceph_osd_op_process_latency_sum', 'ceph_paxos_begin_keys_sum', 'ceph_rgw_qactive', 'ceph_rocksdb_rocksdb_write_pre_and_post_time_sum', 'ceph_bluefs_wal_used_bytes', 'ceph_rocksdb_rocksdb_write_wal_time_sum', 'ceph_osd_op_wip', 
'ceph_rgw_get_initial_lat_sum', 'ceph_paxos_lease_timeout', 'ceph_osd_op_r_out_bytes', 'ceph_paxos_begin_keys_count', 'ceph_bluestore_kv_sync_lat_count', 'ceph_osd_op_prepare_latency_count', 'ceph_bluefs_bytes_written_slow', 'ceph_rocksdb_submit_latency_sum', 'ceph_osd_op_r_latency_count', 'ceph_paxos_share_state_keys_sum', 'ceph_paxos_store_state_bytes_sum', 'ceph_osd_op_latency_count', 'ceph_paxos_commit_bytes_count', 'ceph_paxos_restart', 'ceph_rgw_get_initial_lat_count', 'ceph_bluefs_slow_total_bytes', 'ceph_paxos_collect_timeout', 'ceph_osd_op_w_process_latency_sum', 'ceph_paxos_collect_keys_count', 'ceph_paxos_share_state_bytes_count', 'ceph_osd_op_w_prepare_latency_sum', 'ceph_bluestore_read_lat_sum', 'ceph_osd_stat_bytes_used', 'ceph_paxos_begin', 'ceph_mon_election_win', 'ceph_osd_op_w_process_latency_count', 'ceph_rgw_get_b', 'ceph_rgw_failed_req', 'ceph_rocksdb_rocksdb_write_wal_time_count', 'ceph_rgw_keystone_token_cache_miss', 'ceph_paxos_store_state_keys_sum', 'ceph_osd_numpg_removing', 'ceph_paxos_commit_keys_count', 'ceph_paxos_new_pn_latency_sum', 'ceph_osd_op_in_bytes', 'ceph_paxos_store_state_latency_count', 'ceph_paxos_refresh_latency_count', 'ceph_rgw_get', 'ceph_osd_op_r_prepare_latency_count', 'ceph_rgw_cache_hit', 'ceph_objecter_op_w', 'ceph_objecter_op_r', 'ceph_bluefs_num_files', 'ceph_rgw_put_b', 'ceph_mon_election_lose', 'ceph_osd_op_prepare_latency_sum', 'ceph_bluefs_db_used_bytes', 'ceph_bluestore_kv_final_lat_count', 'ceph_paxos_refresh_latency_sum', 'ceph_osd_recovery_bytes', 'ceph_osd_op_w', 'ceph_paxos_commit_bytes_sum', 'ceph_bluefs_log_bytes', 'ceph_rocksdb_submit_sync_latency_count']

--- Additional comment from avan on 2023-09-05 12:24:27 UTC ---

There's a fix under review currently https://github.com/red-hat-storage/rook/pull/516

--- Additional comment from Travis Nielsen on 2023-09-05 15:29:15 UTC ---

PR 516 was merged now.

--- Additional comment from Daniel Osypenko on 2023-09-07 10:12:21 UTC ---

Verified, PASSED: 
test_ceph_metrics_available http://pastebin.test.redhat.com/1108991
test_ceph_rbd_metrics_available http://pastebin.test.redhat.com/1108993

--- Additional comment from Sunil Kumar Acharya on 2023-09-21 05:54:14 UTC ---

Please update the requires_doc_text(RDT) flag/text appropriately.

--- Additional comment from errata-xmlrpc on 2023-11-08 17:53:45 UTC ---

Bug report changed to RELEASE_PENDING status by Errata System.
Advisory RHSA-2023:115514-11 has been changed to PUSH_READY status.
https://errata.devel.redhat.com/advisory/115514

--- Additional comment from errata-xmlrpc on 2023-11-08 18:52:23 UTC ---

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832

Comment 3 Juan Miguel Olmo 2023-12-20 11:11:27 UTC
*** Bug 2253428 has been marked as a duplicate of this bug. ***

Comment 11 Mudit Agarwal 2024-01-29 11:16:41 UTC
Avan, please don't change the bug state to ON_QA until the discussion is concluded and the bug has all the acks.

Also, this bug kept moving from engineering to QA and vice versa. Can we please set up a meeting to discuss and close it, because the turnaround time for such a discussion via the bug page is too long.

Filip, if the fix is not working in 4.14.z then we need a separate bug for 4.14.z.

Comment 12 Daniel Osypenko 2024-01-29 17:11:00 UTC
I have reproduced the issue on a post-upgrade deployment of the IBM Cloud cluster; it represents the failure history of the tests test_ceph_metrics_available and test_ceph_rbd_metrics_available very well and may explain why we did not see it on a live deployment.

The issue happens only when we have OCP 4.15, with both ODF 4.14 and ODF 4.15. It happens only with ceph metrics (not with rbd metrics). It has happened on these platforms: IBM Cloud, Azure, AWS and GCP.

Tested:
Upgrade from OCP 4.14 & ODF 4.14 to OCP 4.14 & ODF 4.14 and a post-upgrade check.

Before the upgrade, data for the metrics were available.
After the upgrade, data are not available for 167 metrics.

Along with this, ocs-storagecluster is in Progressing state, Data resiliency is in Progressing state, and one worker node is not ready (the VM is running; no errors on odf-operator-controller-manager; no errors on ocs-metrics-exporter other than health report issues).
The issue was also observed recently on vSphere with OCP 4.15 & ODF 4.15, NOT post-upgrade.

oc get pods -n openshift-storage
NAME                                                              READY   STATUS    RESTARTS        AGE
csi-addons-controller-manager-58746c86d6-qjggl                    2/2     Running   0               110m
csi-cephfsplugin-gzgxr                                            2/2     Running   2               31h
csi-cephfsplugin-provisioner-85f5789c76-pn2vh                     5/5     Running   0               3h14m
csi-cephfsplugin-provisioner-85f5789c76-pnltj                     5/5     Running   0               166m
csi-cephfsplugin-thprw                                            2/2     Running   2               31h
csi-cephfsplugin-zjlj7                                            2/2     Running   0               31h
csi-rbdplugin-6cblq                                               3/3     Running   3               3h27m
csi-rbdplugin-6r674                                               3/3     Running   0               3h26m
csi-rbdplugin-provisioner-5fb5cc859b-g94br                        6/6     Running   0               3h14m
csi-rbdplugin-provisioner-5fb5cc859b-qts6t                        6/6     Running   0               166m
csi-rbdplugin-qlc7x                                               3/3     Running   3               3h27m
noobaa-core-0                                                     1/1     Running   0               166m
noobaa-db-pg-0                                                    1/1     Running   0               166m
noobaa-endpoint-845d6d9998-lz296                                  1/1     Running   0               166m
noobaa-operator-5bcf546c-mmlpr                                    2/2     Running   0               166m
ocs-metrics-exporter-64755696fb-766qm                             1/1     Running   0               3h14m
ocs-operator-78c8fb9446-4g9x8                                     1/1     Running   2 (177m ago)    3h14m
odf-console-76b8fd5784-wptm4                                      1/1     Running   0               166m
odf-operator-controller-manager-7bff4bf5cf-5ldz9                  2/2     Running   0               166m
rook-ceph-crashcollector-dosypenk-281-i-fd2hc-worker-1-9kvcf5r5   1/1     Running   0               3h14m
rook-ceph-crashcollector-dosypenk-281-i-fd2hc-worker-2-8zwczqwf   1/1     Running   0               166m
rook-ceph-exporter-dosypenk-281-i-fd2hc-worker-1-9kv6q-689sx5vx   1/1     Running   0               3h14m
rook-ceph-exporter-dosypenk-281-i-fd2hc-worker-2-8zwns-874khbrp   1/1     Running   0               166m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-7f6766b5774n4   2/2     Running   11 (150m ago)   3h14m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5b6dd7cb2hfsz   2/2     Running   5 (148m ago)    164m
rook-ceph-mgr-a-7bcc5c969-xzsm2                                   2/2     Running   0               166m
rook-ceph-mon-a-d7d7bf65f-hjk48                                   2/2     Running   0               3h27m
rook-ceph-mon-b-8f64c96cf-gqsll                                   2/2     Running   0               167m
rook-ceph-mon-c-6f6f79d5dc-ztv72                                  0/2     Pending   0               5m30s
rook-ceph-operator-7b7b6b8d5c-c26t5                               1/1     Running   0               3h14m
rook-ceph-osd-0-66948789f4-klprs                                  2/2     Running   0               3h12m
rook-ceph-osd-1-79b6766cff-5nltx                                  0/2     Pending   0               163m
rook-ceph-osd-2-747c74d944-xzzh6                                  2/2     Running   0               3h27m
rook-ceph-tools-57fd4d4d68-9kjgw                                  1/1     Running   0               3h14m


The OCS must-gather is stuck; adding an OCP must-gather and a partial OCS must-gather.

Comment 17 Nishanth Thomas 2024-02-09 08:49:15 UTC
The root cause is similar to https://bugzilla.redhat.com/show_bug.cgi?id=2258861. Divyansh will link the backport PR (4.14.z) and move the BZ to POST.

Comment 25 Daniel Osypenko 2024-03-12 08:07:07 UTC
OCP 4.14.0-0.nightly-2024-03-11-023324
OCS 4.14.6-1

tests passed: 
  test_ceph_metrics_available - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/34898/consoleFull
  test_ceph_rbd_metrics_available - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/34896/console

Comment 26 Travis Nielsen 2024-03-13 20:54:59 UTC
*** Bug 2262307 has been marked as a duplicate of this bug. ***

Comment 30 errata-xmlrpc 2024-04-01 09:17:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.14.6 Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:1579

