Bug 2037762
Summary: | Wrong ServiceMonitor definition is causing failure during Prometheus configuration reload and preventing changes from being applied | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Simon Reber <sreber>
Component: | Monitoring | Assignee: | Jayapriya Pai <janantha>
Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao>
Severity: | medium | Docs Contact: | Brian Burt <bburt>
Priority: | high | |
Version: | 4.9 | CC: | amuller, anpicker, bburt, gparente, hongyli, janantha, jfajersk, pgough, spasquie
Target Milestone: | --- | Keywords: | Reopened
Target Release: | 4.11.0 | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | Before this update, prometheus-operator accepted any valid duration for ScrapeTimeout, even one greater than ScrapeInterval. With this update, prometheus-operator validates that the specified ScrapeTimeout is not greater than ScrapeInterval and rejects the configuration if it is. | |
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2022-08-10 10:41:42 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Simon Reber
2022-01-06 14:01:08 UTC
Created the PR with the fix in the upstream prometheus-operator repo: https://github.com/prometheus-operator/prometheus-operator/pull/4491

Tested with the PR; validation for scrapeTimeout is added. First, set an invalid value, scrapeTimeout: 120S:

    # oc -n ns1 get servicemonitor prometheus-example-monitor -oyaml
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      creationTimestamp: "2022-01-29T03:20:58Z"
      generation: 1
      name: prometheus-example-monitor
      namespace: ns1
      resourceVersion: "37684"
      uid: 903cd3e0-651c-49b8-8771-c02acf2514c6
    spec:
      endpoints:
      - interval: 30s
        port: web
        scheme: http
        scrapeTimeout: 120S
      namespaceSelector:
        matchNames:
        - ns1
      selector:
        matchLabels:
          app: prometheus-example-app

The operator rejects it as an invalid duration string:

    # oc -n openshift-user-workload-monitoring logs -c prometheus-operator prometheus-operator-5f96f69b58-zq6qr
    ...
    level=warn ts=2022-01-29T03:16:52.59730499Z caller=operator.go:1825 component=prometheusoperator msg="skipping servicemonitor" error="invalid scrapeTimeout: \"120S\": not a valid duration string: \"120S\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload
    level=warn ts=2022-01-29T03:20:58.085474155Z caller=operator.go:1825 component=prometheusoperator msg="skipping servicemonitor" error="invalid scrapeTimeout: \"120S\": not a valid duration string: \"120S\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload

Next, edit the ServiceMonitor from scrapeTimeout: 120S to scrapeTimeout: 120s (a valid duration, but greater than the 30s interval):

    # oc -n openshift-user-workload-monitoring logs -c prometheus-operator prometheus-operator-5f96f69b58-zq6qr
    level=warn ts=2022-01-29T03:35:48.568227619Z caller=operator.go:1825 component=prometheusoperator msg="skipping servicemonitor" error="scrapeTimeout \"120s\" greater than scrapeInterval \"30s\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload

In both scenarios, the configuration is not loaded into Prometheus; the grep below returns no result:

    # oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1

After changing the scrapeTimeout value to less than scrapeInterval, the configuration is loaded into Prometheus:

    # oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1
    - job_name: serviceMonitor/ns1/prometheus-example-monitor/0
      - ns1
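For reference, the validation exercised above amounts to parsing both durations and comparing them. Below is a minimal Go sketch of that check; it is illustrative only, the helper name validateScrapeTimeout is hypothetical, and this is not the actual code from the upstream PR:

    package main

    import (
        "fmt"

        "github.com/prometheus/common/model"
    )

    // validateScrapeTimeout is a hypothetical helper mirroring the two error
    // cases seen in the operator logs above: a timeout that is not a valid
    // Prometheus duration, and a timeout greater than the scrape interval.
    func validateScrapeTimeout(scrapeInterval, scrapeTimeout string) error {
        interval, err := model.ParseDuration(scrapeInterval)
        if err != nil {
            return fmt.Errorf("invalid scrapeInterval: %q: %w", scrapeInterval, err)
        }
        timeout, err := model.ParseDuration(scrapeTimeout)
        if err != nil {
            return fmt.Errorf("invalid scrapeTimeout: %q: %w", scrapeTimeout, err)
        }
        if timeout > interval {
            return fmt.Errorf("scrapeTimeout %q greater than scrapeInterval %q", scrapeTimeout, scrapeInterval)
        }
        return nil
    }

    func main() {
        // "120S" fails to parse (Prometheus duration units are lowercase),
        // "120s" parses but exceeds the 30s interval, "10s" is accepted.
        for _, timeout := range []string{"120S", "120s", "10s"} {
            if err := validateScrapeTimeout("30s", timeout); err != nil {
                fmt.Println("rejected:", err)
                continue
            }
            fmt.Println("accepted:", timeout)
        }
    }

This also explains why 120S and 120s behave differently: Prometheus duration strings are case-sensitive, so only the lowercase unit parses at all, and the parsed value is then checked against the interval.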
Retested with 4.11.0-0.nightly-2022-03-03-061758 (prometheus-operator 0.54.1); the validation for scrapeTimeout is in place. Set an invalid value, scrapeTimeout: 120S:

    # oc -n ns1 get servicemonitor prometheus-example-monitor -oyaml
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      creationTimestamp: "2022-03-03T10:44:39Z"
      generation: 2
      name: prometheus-example-monitor
      namespace: ns1
      resourceVersion: "100562"
      uid: 5b92ae30-3194-4315-acaf-bf99dde6150b
    spec:
      endpoints:
      - interval: 30s
        port: web
        scheme: http
        scrapeTimeout: 120S
      namespaceSelector:
        matchNames:
        - ns1
      selector:
        matchLabels:
          app: prometheus-example-app

    # oc -n openshift-user-workload-monitoring logs -c prometheus-operator prometheus-operator-549fbb5cc8-m96cx
    level=warn ts=2022-03-03T10:47:01.537431456Z caller=operator.go:1837 component=prometheusoperator msg="skipping servicemonitor" error="invalid scrapeTimeout: \"120S\": not a valid duration string: \"120S\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload

The configuration is not loaded into Prometheus; the grep below returns no result:

    # oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1

Edit the ServiceMonitor from scrapeTimeout: 120S to scrapeTimeout: 120s:

    # oc -n openshift-user-workload-monitoring logs -c prometheus-operator prometheus-operator-549fbb5cc8-m96cx | grep scrapeTimeout
    level=warn ts=2022-03-03T10:49:26.919371633Z caller=operator.go:1837 component=prometheusoperator msg="skipping servicemonitor" error="scrapeTimeout \"120s\" greater than scrapeInterval \"30s\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload

Again the configuration is not loaded into Prometheus; the grep below returns no result:

    # oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1

After changing the scrapeTimeout value to less than scrapeInterval, there is no error in prometheus-operator and the configuration is loaded into Prometheus:

    # oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1
    - job_name: serviceMonitor/ns1/prometheus-example-monitor/0
      - ns1

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
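For clusters that do not yet carry this validation, a hypothetical helper could scan a namespace for ServiceMonitor endpoints that the fixed operator would reject. This is a sketch, not part of the product: it assumes the prometheus-operator Go client bindings (github.com/prometheus-operator/prometheus-operator/pkg/client/versioned) and a kubeconfig path in $KUBECONFIG; the namespace "ns1" matches the reproduction steps above.

    package main

    import (
        "context"
        "fmt"
        "log"
        "os"

        monitoringclient "github.com/prometheus-operator/prometheus-operator/pkg/client/versioned"
        "github.com/prometheus/common/model"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Assumption: $KUBECONFIG points at a kubeconfig for the cluster.
        cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
        if err != nil {
            log.Fatal(err)
        }
        client, err := monitoringclient.NewForConfig(cfg)
        if err != nil {
            log.Fatal(err)
        }
        sms, err := client.MonitoringV1().ServiceMonitors("ns1").List(context.Background(), metav1.ListOptions{})
        if err != nil {
            log.Fatal(err)
        }
        for _, sm := range sms.Items {
            for i, ep := range sm.Spec.Endpoints {
                if ep.ScrapeTimeout == "" {
                    continue // no explicit timeout; the default applies
                }
                timeout, err := model.ParseDuration(string(ep.ScrapeTimeout))
                if err != nil {
                    fmt.Printf("%s endpoint %d: invalid scrapeTimeout %q\n", sm.Name, i, ep.ScrapeTimeout)
                    continue
                }
                if ep.Interval == "" {
                    continue // interval falls back to the global default
                }
                interval, err := model.ParseDuration(string(ep.Interval))
                if err == nil && timeout > interval {
                    fmt.Printf("%s endpoint %d: scrapeTimeout %q greater than scrapeInterval %q\n",
                        sm.Name, i, ep.ScrapeTimeout, ep.Interval)
                }
            }
        }
    }

The two messages it prints correspond to the two warn logs seen during verification, so it can help locate ServiceMonitors that an older operator is skipping without surfacing an error to the user.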