Bug 2258479 - [ODF Hackathon]: Ceph metrics timeout when looking for RBD mirroring when it is not configured (internal)
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.14
Hardware: Unspecified
OS: Unspecified
unspecified
low
Target Milestone: ---
: ---
Assignee: Divyansh Kamboj
QA Contact: Harish NV Rao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2024-01-15 15:31 UTC by Ramon Gordillo
Modified: 2024-05-02 11:58 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-05-02 11:58:45 UTC
Embargoed:



Links
Red Hat Bugzilla 2256771 (Private: 0, Priority: unspecified, Status: NEW): [ODF Hackathon] 4.14 parent case; last updated 2024-09-03 12:53:00 UTC

Internal Links: 2257949

Description Ramon Gordillo 2024-01-15 15:31:33 UTC
Description of problem (please be as detailed as possible and provide log snippets):

In an internal-mode Ceph cluster without RBD mirroring configured,
ocs-metrics-exporter logs the following:

E0112 05:44:59.705009 1 ceph-block-pool.go:137] Invalid image health for pool ocs-storagecluster-cephblockpool. Must be OK, UNKNOWN, WARNING or ERROR
I0112 05:45:07.332607 1 rbd-mirror.go:296] RBD mirror store resync started at 2024-01-12 05:45:07.332593909 +0000 UTC m=+2061519.616778751
I0112 05:45:07.332637 1 rbd-mirror.go:321] RBD mirror store resync ended at 2024-01-12 05:45:07.332633306 +0000 UTC m=+2061519.616818150
E0112 05:45:18.347842 1 rbd-mirror.go:371] command rbd timedout in 30 seconds
I0112 05:45:18.347892 1 trace.go:236] Trace[1389586998]: "Reflector ListAndWatch" name:/remote-source/app/metrics/internal/collectors/registry.go:63 (12-Jan-2024 05:44:48.338) (total time: 30008ms):
Trace[1389586998]: [30.008962884s] [30.008962884s] END
E0112 05:45:18.347913 1 reflector.go:147] /remote-source/app/metrics/internal/collectors/registry.go:63: Failed to watch *v1.PersistentVolume: unable to sync list result: failed to get image status failed with output : , err: context deadline exceeded
E0112 05:45:26.159054 1 ceph-block-pool.go:137] Invalid image health for pool ocs-storagecluster-cephblockpool. Must be OK, UNKNOWN, WARNING or ERROR
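The repeated `Invalid image health` error suggests that the exporter receives a pool image-health value outside the four accepted states when mirroring is off. The Go sketch below shows a more tolerant parser. It is not the ocs-operator code: the `ImageHealth` type and `parseImageHealth` function are hypothetical, and it assumes (unverified) that the value arrives empty when mirroring is disabled.

```go
package main

import (
	"fmt"
	"strings"
)

// ImageHealth models the four states quoted in the exporter's log message.
// Hypothetical type, for illustration only.
type ImageHealth string

const (
	HealthOK      ImageHealth = "OK"
	HealthUnknown ImageHealth = "UNKNOWN"
	HealthWarning ImageHealth = "WARNING"
	HealthError   ImageHealth = "ERROR"
)

// parseImageHealth maps a raw health string to one of the accepted states.
// Assumption: an empty value means no mirroring status was reported (for
// example, because mirroring is disabled), so it becomes UNKNOWN instead of
// being logged as invalid. Any other unexpected value is still an error.
func parseImageHealth(raw string) (ImageHealth, error) {
	switch h := ImageHealth(strings.ToUpper(strings.TrimSpace(raw))); h {
	case HealthOK, HealthUnknown, HealthWarning, HealthError:
		return h, nil
	case "":
		return HealthUnknown, nil
	default:
		return HealthUnknown, fmt.Errorf("invalid image health %q: must be OK, UNKNOWN, WARNING or ERROR", raw)
	}
}

func main() {
	for _, raw := range []string{"OK", "", "bogus"} {
		h, err := parseImageHealth(raw)
		fmt.Printf("%q -> %s (err: %v)\n", raw, h, err)
	}
}
```

Mapping the empty case to UNKNOWN keeps the metric defined without an error on every scrape, while genuinely malformed values are still reported.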

Checking the cluster with the Ceph tools confirms that mirroring is not enabled on the pool:

sh-5.1$ rbd mirror pool status ocs-storagecluster-cephblockpool
rbd: mirroring not enabled on the pool
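A minimal Go sketch of the same check, as an exporter might run it before querying per-image status. `mirroringDisabled` and `poolMirroringEnabled` are hypothetical helpers, and matching on the CLI message shown above is a brittle assumption made only for illustration; it also assumes the `rbd` binary and cluster credentials are available, as in the tools pod.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// notEnabledMsg is the text the rbd CLI printed in the session above.
const notEnabledMsg = "mirroring not enabled on the pool"

// mirroringDisabled reports whether rbd output says mirroring is off.
func mirroringDisabled(output string) bool {
	return strings.Contains(output, notEnabledMsg)
}

// poolMirroringEnabled runs `rbd mirror pool status <pool>` bounded by timeout.
func poolMirroringEnabled(ctx context.Context, pool string, timeout time.Duration) (bool, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	out, err := exec.CommandContext(ctx, "rbd", "mirror", "pool", "status", pool).CombinedOutput()
	if ctx.Err() != nil {
		return false, ctx.Err() // e.g. context.DeadlineExceeded once the timeout fires
	}
	if mirroringDisabled(string(out)) {
		return false, nil // the message decides, whatever the exit status
	}
	if err != nil {
		return false, fmt.Errorf("rbd mirror pool status %s: %w: %s", pool, err, out)
	}
	return true, nil
}

func main() {
	// Parse the output captured in this report; running the real command needs the rbd CLI.
	fmt.Println(mirroringDisabled("rbd: mirroring not enabled on the pool"))
}
```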

The relevant code is https://github.com/red-hat-storage/ocs-operator/blob/main/metrics/internal/collectors/ceph-block-pool.go#L107-L139


Version of all relevant components (if applicable):

OCP 4.14.7, ODF 4.14.3


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

Prometheus intermittently loses some metrics from ocs-metrics-exporter.

Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1

Is this issue reproducible?

Yes

Can this issue be reproduced from the UI?

N/A

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install an ODF cluster without RBD mirroring
2. Check the ocs-metrics-exporter logs


Actual results:

Metrics from ocs-metrics-exporter are intermittently missing.

Expected results:

Metrics are scraped reliably and the container logs no errors.

Additional info:

Comment 4 Divyansh Kamboj 2024-04-03 11:27:56 UTC
I believe this has been fixed in the latest builds. I'll test it on the latest build and confirm whether the fix works.

Comment 5 Divyansh Kamboj 2024-05-02 11:58:45 UTC
Tested it on 4.15; the logs don't show any RBD-related issues. Closing this; feel free to reopen if you encounter it again.
