Description of problem (please be as detailed as possible and provide log snippets):
Currently we have set a very aggressive default interval duration of '5s' for the following two ServiceMonitors, 'rook-ceph-exporter' and 'rook-ceph-mgr'. This was noticed by a customer during their review of the Prometheus monitoring we provide (and was reported to us through the ocs-tech-list <ocs-tech-list> email, with the subject: "Prometheus scrape interval 5s"). So this BZ is an RFE to increase the aggressive '5s' default interval. OpenShift's scrape interval is currently '30s', so we could increase these ServiceMonitor intervals to '30s' (a suggestion).

Version of all relevant components (if applicable):
Any ODF version, as we haven't changed the default interval.

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
We could directly edit the ServiceMonitors and change the 'interval' value (a sketch follows after this description).

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
NA

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
No

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
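For reference, a minimal sketch of the workaround, assuming the ServiceMonitor shape from the rook example manifests linked in the next comment (the port name, path, and namespace here are illustrative assumptions, not taken from a live cluster):

# Hypothetical excerpt of the rook-ceph-mgr ServiceMonitor; the workaround is
# to edit the 'interval' under spec.endpoints directly on the live object.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rook-ceph-mgr
  namespace: openshift-storage
spec:
  endpoints:
    - port: http-metrics
      path: /metrics
      interval: 30s   # default is 5s; relaxed to match OpenShift's 30s scrape interval

Note that a manual edit like this can be reverted the next time the operator reconciles the ServiceMonitor, which is why the fix proposals below change the default instead.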
We could fix the issue in either of the following two ways:

A. Change the default values for the ServiceMonitors in their respective files in the rook repo, i.e. in these two yaml files:
https://github.com/rook/rook/blob/master/deploy/examples/monitoring/exporter-service-monitor.yaml (for rook-ceph-exporter)
https://github.com/rook/rook/blob/master/deploy/examples/monitoring/service-monitor.yaml (for rook-ceph-mgr)

OR

B. Change the CephCluster creation in the ocs-operator repo, adding an additional 'Interval' field to Spec->Monitoring, which will then be read by the rook operator, which in turn makes the needed changes to both SMs (rook-ceph-exporter and rook-ceph-mgr). A sketch of the resulting spec follows below.
ocs-operator code: https://github.com/red-hat-storage/ocs-operator/blob/main/controllers/storagecluster/cephcluster.go#L455

Will discuss further with the team on which (optimal) path to take.
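For illustration, a minimal sketch of what option B could look like on the resulting CephCluster CR, assuming rook's monitoring spec honours an interval override as described above; the '30s' value is only the suggested default, not a decided one:

# Hypothetical excerpt of the ocs-storagecluster-cephcluster CR after option B;
# the rook operator would propagate this interval to both ServiceMonitors.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: ocs-storagecluster-cephcluster
  namespace: openshift-storage
spec:
  monitoring:
    enabled: true
    interval: 30s   # scrape interval applied to rook-ceph-mgr and rook-ceph-exporter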
Submitted the RFE Google form. PS: I'm not entirely sure which customer name to add in the form.
(In reply to arun kumar mohan from comment #4)
> We could fix the issue in either of the following two ways,
>
> A. We can either change the default values for the ServiceMonitors in their
> respective files in rook repo,
> That is to change in these two yaml files,
> https://github.com/rook/rook/blob/master/deploy/examples/monitoring/exporter-service-monitor.yaml (for rook-ceph-exporter)
> https://github.com/rook/rook/blob/master/deploy/examples/monitoring/service-monitor.yaml (for rook-ceph-mgr)
>
> OR
>
> B. We can change the cephcluster creation in ocs-operator repo, adding an
> additional 'Interval' field to Spec->Monitoring, which will then be read by
> rook-operator and make the needed changes to both the SMs
> (rook-ceph-exporter and rook-ceph-mgr)
> ocs-operator code:
> https://github.com/red-hat-storage/ocs-operator/blob/main/controllers/storagecluster/cephcluster.go#L455
>
> Will discuss further, with the team, on which (optimal) path to take.

Option B would be recommended. The monitoring.interval setting is the intended way to override this value.
Thanks Travis. Created PR: https://github.com/red-hat-storage/ocs-operator/pull/2506
A Jira ticket has been raised (in the RHSTOR project): https://issues.redhat.com/browse/RHSTOR-5765. Please take a look.
$ oc get cephclusters.ceph.rook.io ocs-storagecluster-cephcluster -n openshift-storage -o=jsonpath={'.spec.monitoring'}
{"enabled":true,"interval":"30s"}

$ oc get servicemonitor rook-ceph-exporter -n openshift-storage -o jsonpath='{.spec.endpoints[0].interval}'
30s

$ oc get servicemonitor rook-ceph-mgr -n openshift-storage -o jsonpath='{.spec.endpoints[0].interval}'
30s

OC version:
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.16.0-0.nightly-2024-05-15-001800
Kubernetes Version: v1.29.4+4a87b53

OCS version:
ocs-operator.v4.16.0-99.stable   OpenShift Container Storage   4.16.0-99.stable   Succeeded

Cluster version:
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-05-15-001800   True        False         79m     Cluster version is 4.16.0-0.nightly-2024-05-15-001800

Rook version:
2024/05/15 10:10:52 maxprocs: Leaving GOMAXPROCS=16: CPU quota undefined
rook: v4.16.0-0.32d64a561bd504448dedcbda3a7a4e6083227ad5
go: go1.21.9 (Red Hat 1.21.9-1.el9_4)

Ceph version:
ceph version 18.2.1-167.el9cp (e8c836edb24adb7717a6c8ba1e93a07e3efede29) reef (stable)

Verified
Additionally, tests from tests/functional/pod_and_daemons/test_mgr_pods.py passed.
Providing the RDT details; please take a look.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591