Bug 2258479

Summary: [ODF Hackathon]: Ceph metrics timeout when looking for RBD mirroring when it is not configured (internal)
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Ramon Gordillo <ramon.gordillo>
Component: ceph-monitoring
Assignee: Divyansh Kamboj <dkamboj>
Status: CLOSED WORKSFORME
QA Contact: Harish NV Rao <hnallurv>
Severity: low
Docs Contact:
Priority: unspecified
Version: 4.14
CC: ddomingu, etamir, muagarwa, nthomas, odf-bz-bot
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2024-05-02 11:58:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Ramon Gordillo 2024-01-15 15:31:33 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

In an internal Ceph cluster without RBD mirroring, the
ocs-metrics-exporter logs the following:

E0112 05:44:59.705009 1 ceph-block-pool.go:137] Invalid image health for pool ocs-storagecluster-cephblockpool. Must be OK, UNKNOWN, WARNING or ERROR
I0112 05:45:07.332607 1 rbd-mirror.go:296] RBD mirror store resync started at 2024-01-12 05:45:07.332593909 +0000 UTC m=+2061519.616778751
I0112 05:45:07.332637 1 rbd-mirror.go:321] RBD mirror store resync ended at 2024-01-12 05:45:07.332633306 +0000 UTC m=+2061519.616818150
E0112 05:45:18.347842 1 rbd-mirror.go:371] command rbd timedout in 30 seconds
I0112 05:45:18.347892 1 trace.go:236] Trace[1389586998]: "Reflector ListAndWatch" name:/remote-source/app/metrics/internal/collectors/registry.go:63 (12-Jan-2024 05:44:48.338) (total time: 30008ms):
Trace[1389586998]: [30.008962884s] [30.008962884s] END
E0112 05:45:18.347913 1 reflector.go:147] /remote-source/app/metrics/internal/collectors/registry.go:63: Failed to watch *v1.PersistentVolume: unable to sync list result: failed to get image status failed with output : , err: context deadline exceeded
E0112 05:45:26.159054 1 ceph-block-pool.go:137] Invalid image health for pool ocs-storagecluster-cephblockpool. Must be OK, UNKNOWN, WARNING or ERROR

Checking with the Ceph tools in the cluster confirms that RBD mirroring is not configured:

sh-5.1$ rbd mirror pool status ocs-storagecluster-cephblockpool
rbd: mirroring not enabled on the pool

The relevant code is https://github.com/red-hat-storage/ocs-operator/blob/main/metrics/internal/collectors/ceph-block-pool.go#L107-L139


Version of all relevant components (if applicable):

OCP 4.14.7, ODF 4.14.3


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Prometheus is randomly losing some metrics.

Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1

Is this issue reproducible?

Yes

Can this issue be reproduced from the UI?

N/A

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install an ODF cluster without RBD mirroring
2. Check the logs of the ocs-metrics-exporter


Actual results:

Metrics from the ocs-metrics-exporter are sometimes missing

Expected results:

Metrics are scraped and the container logs no errors

Additional info:

Comment 4 Divyansh Kamboj 2024-04-03 11:27:56 UTC
I believe this has been fixed in the latest builds. I'll test it on the latest build and confirm whether the fix is working.

Comment 5 Divyansh Kamboj 2024-05-02 11:58:45 UTC
Tested it on 4.15; the logs no longer show any RBD-related errors. Closing this; feel free to reopen it if you encounter the issue again.