2258479 – [ODF Hackathon]: Ceph metrics timeout when looking for RBD mirroring when it is not configured (internal)

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	ceph-monitoring
Sub Component:
Version:	4.14
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	low
Target Milestone:	---
Target Release:	---
Assignee:	Divyansh Kamboj
QA Contact:	Harish NV Rao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2024-01-15 15:31 UTC by Ramon Gordillo
Modified:	2024-05-02 11:58 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2024-05-02 11:58:45 UTC
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	2256771	0	unspecified	NEW	[ODF Hackathon] 4.14 parent case	2024-09-03 12:53:00 UTC

Internal Links: 2257949

Description Ramon Gordillo 2024-01-15 15:31:33 UTC

Description of problem (please be as detailed as possible and provide log
snippets):

In an internal Ceph cluster without RBD mirroring, the
ocs-metrics-exporter shows the following logs:

E0112 05:44:59.705009 1 ceph-block-pool.go:137] Invalid image health for pool ocs-storagecluster-cephblockpool. Must be OK, UNKNOWN, WARNING or ERROR
I0112 05:45:07.332607 1 rbd-mirror.go:296] RBD mirror store resync started at 2024-01-12 05:45:07.332593909 +0000 UTC m=+2061519.616778751
I0112 05:45:07.332637 1 rbd-mirror.go:321] RBD mirror store resync ended at 2024-01-12 05:45:07.332633306 +0000 UTC m=+2061519.616818150
E0112 05:45:18.347842 1 rbd-mirror.go:371] command rbd timedout in 30 seconds
I0112 05:45:18.347892 1 trace.go:236] Trace[1389586998]: "Reflector ListAndWatch" name:/remote-source/app/metrics/internal/collectors/registry.go:63 (12-Jan-2024 05:44:48.338) (total time: 30008ms):
Trace[1389586998]: [30.008962884s] [30.008962884s] END
E0112 05:45:18.347913 1 reflector.go:147] /remote-source/app/metrics/internal/collectors/registry.go:63: Failed to watch *v1.PersistentVolume: unable to sync list result: failed to get image status failed with output : , err: context deadline exceeded
E0112 05:45:26.159054 1 ceph-block-pool.go:137] Invalid image health for pool ocs-storagecluster-cephblockpool. Must be OK, UNKNOWN, WARNING or ERROR

Checking with the Ceph tools in the cluster confirms that mirroring is not configured on the pool:

sh-5.1$ rbd mirror pool status ocs-storagecluster-cephblockpool
rbd: mirroring not enabled on the pool

The relevant code is https://github.com/red-hat-storage/ocs-operator/blob/main/metrics/internal/collectors/ceph-block-pool.go#L107-L139


Version of all relevant components (if applicable):

OCP 4.14.7, ODF 4.14.3


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Prometheus is randomly losing some metrics.

Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1

Is this issue reproducible?

Yes

Can this issue be reproduced from the UI?

N/A

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install an ODF cluster without RBD mirroring
2. Check the logs from the ocs-metrics-exporter


Actual results:

Metrics from the ocs-metrics-exporter are sometimes missing

Expected results:

Metrics are scraped and no errors are logged in the container

Additional info:

Comment 4 Divyansh Kamboj 2024-04-03 11:27:56 UTC

I believe this has been fixed in the latest builds. I'll test it on the latest build and confirm whether the fix is working.

Comment 5 Divyansh Kamboj 2024-05-02 11:58:45 UTC

Tested it on 4.15; the logs don't show any issues regarding RBD. Closing this. Feel free to reopen if you encounter it again.
