Bug 2269354 - [RFE] Change the default interval duration for two ServiceMonitors, 'rook-ceph-exporter' and 'rook-ceph-mgr'
Summary: [RFE] Change the default interval duration for two ServiceMonitors, 'rook-ceph-exporter' and 'rook-ceph-mgr'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph-monitoring
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: arun kumar mohan
QA Contact: Daniel Osypenko
URL:
Whiteboard:
Depends On:
Blocks: 2260844
 
Reported: 2024-03-13 10:56 UTC by arun kumar mohan
Modified: 2024-07-17 13:16 UTC
CC: 8 users

Fixed In Version: 4.16.0-94
Doc Type: Bug Fix
Doc Text:
.Low default interval duration for two ServiceMonitors, 'rook-ceph-exporter' and 'rook-ceph-mgr'
Previously, the exporter data collected by Prometheus added load to the system because the Prometheus scrape interval set for the service monitors 'rook-ceph-exporter' and 'rook-ceph-mgr' was only 5 seconds. With this fix, the interval is increased to 30 seconds to balance the Prometheus scraping, thereby reducing the system load.
Clone Of:
Environment:
Last Closed: 2024-07-17 13:16:38 UTC
Embargoed:




Links
Github red-hat-storage ocs-operator pull 2506 (Merged): Adding a monitoring interval for cephcluster. Last updated 2024-04-23 12:31:32 UTC
Red Hat Product Errata RHSA-2024:4591. Last updated 2024-07-17 13:16:47 UTC

Description arun kumar mohan 2024-03-13 10:56:15 UTC
Description of problem (please be as detailed as possible and provide log snippets):

Currently we have set a very aggressive default interval of '5s' for the two ServiceMonitors 'rook-ceph-exporter' and 'rook-ceph-mgr'. A customer noticed this while reviewing the Prometheus monitoring we provide, and notified us through the ocs-tech-list <ocs-tech-list> email list with the subject "Prometheus scrape interval 5s".

So this BZ is an RFE to increase the aggressive '5s' default interval. OpenShift's default scrape interval is currently '30s', so one suggestion is to raise these ServiceMonitor intervals to '30s' as well.

Version of all relevant components (if applicable):
Any ODF version, as the default interval has never been changed.

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
We could directly edit the ServiceMonitors and change the 'interval' field, as sketched below.
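
A minimal sketch of that workaround, assuming 'oc edit servicemonitor rook-ceph-exporter -n openshift-storage' (and likewise for 'rook-ceph-mgr'); only the relevant field is shown, and note that the rook operator may reconcile a manual edit back to its default:

spec:
  endpoints:
  - interval: 30s   # raised from the default 5s; other endpoint fields left as-is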

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
NA

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
No

Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 4 arun kumar mohan 2024-03-13 15:53:52 UTC
We could fix the issue in either of the following two ways:

A. We can change the default values for the ServiceMonitors in their respective files in the rook repo, that is, change these two YAML files (an illustrative excerpt follows):
         https://github.com/rook/rook/blob/master/deploy/examples/monitoring/exporter-service-monitor.yaml (for rook-ceph-exporter)
         https://github.com/rook/rook/blob/master/deploy/examples/monitoring/service-monitor.yaml (for rook-ceph-mgr)
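
An illustrative excerpt of the endpoints section in those example files as it stands today (only the interval is shown; other endpoint fields such as port and path are omitted, and the exact upstream layout may differ):

spec:
  endpoints:
  - interval: 5s   # the aggressive default this RFE proposes raising to 30s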


OR

B. We can change the CephCluster creation in the ocs-operator repo, adding an additional 'Interval' field to Spec->Monitoring; the rook operator will then read it and make the needed changes to both SMs (rook-ceph-exporter and rook-ceph-mgr).
ocs-operator code: https://github.com/red-hat-storage/ocs-operator/blob/main/controllers/storagecluster/cephcluster.go#L455
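
For illustration, a rough sketch of the CephCluster monitoring section that option B would produce (excerpt only; the rest of the CR is unchanged):

spec:
  monitoring:
    enabled: true
    interval: 30s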

Will discuss further with the team which path is optimal.

Comment 5 arun kumar mohan 2024-03-13 16:14:26 UTC
Submitted the RFE Google form.
PS: not entirely sure which customer name to add in the form.

Comment 6 Travis Nielsen 2024-03-13 16:18:03 UTC
(In reply to arun kumar mohan from comment #4)
> We could fix the issue in either of the following two ways,
> 
> A. We can either change the default values for the ServiceMonitors in their
> respective files in rook repo,
> That is to change in these two yaml files,
> https://github.com/rook/rook/blob/master/deploy/examples/monitoring/exporter-service-monitor.yaml (for rook-ceph-exporter)
> https://github.com/rook/rook/blob/master/deploy/examples/monitoring/service-monitor.yaml (for rook-ceph-mgr)
> 
> 
> OR
> 
> B. We can change the cephcluster creation in ocs-operator repo, adding an
> additional 'Interval' field to Spec->Monitoring, which will then be read by
> rook-operator and make the needed changes to both the SMs
> (rook-ceph-exporter and rook-ceph-mgr)
> ocs-operator code: https://github.com/red-hat-storage/ocs-operator/blob/main/controllers/storagecluster/cephcluster.go#L455
> 
> Will discuss further, with the team, on which (optimal) path to take.

Option B would be recommended. The monitoring.interval setting is the intended way to override this value.

Comment 8 arun kumar mohan 2024-03-14 06:08:40 UTC
Thanks Travis.
Created PR: https://github.com/red-hat-storage/ocs-operator/pull/2506

Comment 9 arun kumar mohan 2024-04-25 12:07:46 UTC
A Jira ticket has been raised (in the RHSTOR project): https://issues.redhat.com/browse/RHSTOR-5765
Please take a look.

Comment 13 Daniel Osypenko 2024-05-15 10:11:33 UTC
oc get cephclusters.ceph.rook.io ocs-storagecluster-cephcluster -n openshift-storage -o=jsonpath={'.spec.monitoring'}
{"enabled":true,"interval":"30s"}

oc get servicemonitor rook-ceph-exporter -n openshift-storage -o jsonpath='{.spec.endpoints[0].interval}'
30s

oc get servicemonitor rook-ceph-mgr -n openshift-storage -o jsonpath='{.spec.endpoints[0].interval}' 
30s


OC version:
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.16.0-0.nightly-2024-05-15-001800
Kubernetes Version: v1.29.4+4a87b53

OCS version:
ocs-operator.v4.16.0-99.stable              OpenShift Container Storage        4.16.0-99.stable              Succeeded

Cluster version:
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-05-15-001800   True        False         79m     Cluster version is 4.16.0-0.nightly-2024-05-15-001800

Rook version:
2024/05/15 10:10:52 maxprocs: Leaving GOMAXPROCS=16: CPU quota undefined
rook: v4.16.0-0.32d64a561bd504448dedcbda3a7a4e6083227ad5
go: go1.21.9 (Red Hat 1.21.9-1.el9_4)

Ceph version:
ceph version 18.2.1-167.el9cp (e8c836edb24adb7717a6c8ba1e93a07e3efede29) reef (stable)

Verified

Comment 14 Daniel Osypenko 2024-05-15 10:13:23 UTC
Additionally, tests from tests/functional/pod_and_daemons/test_mgr_pods.py passed.

Comment 16 arun kumar mohan 2024-05-29 17:08:14 UTC
Providing the RDT details; please take a look.

Comment 18 errata-xmlrpc 2024-07-17 13:16:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

