Bug 2037762
| Summary: | Wrong ServiceMonitor definition is causing failure during Prometheus configuration reload and preventing changes from being applied | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Simon Reber <sreber> |
| Component: | Monitoring | Assignee: | Jayapriya Pai <janantha> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | Brian Burt <bburt> |
| Priority: | high | | |
| Version: | 4.9 | CC: | amuller, anpicker, bburt, gparente, hongyli, janantha, jfajersk, pgough, spasquie |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.11.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Before this update, the prometheus-operator accepted any valid duration value for ScrapeTimeout, even one greater than ScrapeInterval. With this update, it validates the value and rejects the configuration if the specified ScrapeTimeout is greater than ScrapeInterval. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-10 10:41:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Simon Reber
2022-01-06 14:01:08 UTC
Created the PR with the fix in the upstream prometheus-operator repo: https://github.com/prometheus-operator/prometheus-operator/pull/4491
Tested with the PR; validation for ScrapeTimeout is added.
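For illustration only, not the upstream patch itself: the check described by the PR can be sketched in Go with the Prometheus duration parser from github.com/prometheus/common/model. The helper name validateScrapeTimeout, the standalone main, and the accepted example value 10s are assumptions made for this sketch, not code taken from prometheus-operator.

// Sketch of the validation behavior: reject an endpoint whose scrapeTimeout
// is not a valid Prometheus duration or is greater than its scrapeInterval.
package main

import (
    "fmt"

    "github.com/prometheus/common/model"
)

// validateScrapeTimeout is a hypothetical helper mirroring the two failure
// modes seen in the operator logs: an unparsable duration ("120S") and a
// timeout greater than the scrape interval.
func validateScrapeTimeout(scrapeInterval, scrapeTimeout string) error {
    interval, err := model.ParseDuration(scrapeInterval)
    if err != nil {
        return fmt.Errorf("invalid scrapeInterval: %q: %w", scrapeInterval, err)
    }
    timeout, err := model.ParseDuration(scrapeTimeout)
    if err != nil {
        return fmt.Errorf("invalid scrapeTimeout: %q: %w", scrapeTimeout, err)
    }
    if timeout > interval {
        return fmt.Errorf("scrapeTimeout %q greater than scrapeInterval %q", scrapeTimeout, scrapeInterval)
    }
    return nil
}

func main() {
    // The three cases exercised in the verification steps below.
    fmt.Println(validateScrapeTimeout("30s", "120S")) // rejected: not a valid duration string
    fmt.Println(validateScrapeTimeout("30s", "120s")) // rejected: greater than the interval
    fmt.Println(validateScrapeTimeout("30s", "10s"))  // accepted: prints <nil>
}

The two error branches correspond to the two "skipping servicemonitor" warnings shown in the operator logs below; the accepted case corresponds to the final step where the timeout is lowered below the interval.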
Set an invalid value, scrapeTimeout: 120S:
# oc -n ns1 get servicemonitor prometheus-example-monitor -oyaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
creationTimestamp: "2022-01-29T03:20:58Z"
generation: 1
name: prometheus-example-monitor
namespace: ns1
resourceVersion: "37684"
uid: 903cd3e0-651c-49b8-8771-c02acf2514c6
spec:
endpoints:
- interval: 30s
port: web
scheme: http
scrapeTimeout: 120S
namespaceSelector:
matchNames:
- ns1
selector:
matchLabels:
app: prometheus-example-app
# oc -n openshift-user-workload-monitoring logs -c prometheus-operator prometheus-operator-5f96f69b58-zq6qr
...
level=warn ts=2022-01-29T03:16:52.59730499Z caller=operator.go:1825 component=prometheusoperator msg="skipping servicemonitor" error="invalid scrapeTimeout: \"120S\": not a valid duration string: \"120S\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload
level=warn ts=2022-01-29T03:20:58.085474155Z caller=operator.go:1825 component=prometheusoperator msg="skipping servicemonitor" error="invalid scrapeTimeout: \"120S\": not a valid duration string: \"120S\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload
Edit the ServiceMonitor to change scrapeTimeout: 120S to scrapeTimeout: 120s, and check the prometheus-operator logs:
# oc -n openshift-user-workload-monitoring logs -c prometheus-operator prometheus-operator-5f96f69b58-zq6qr
level=warn ts=2022-01-29T03:35:48.568227619Z caller=operator.go:1825 component=prometheusoperator msg="skipping servicemonitor" error="scrapeTimeout \"120s\" greater than scrapeInterval \"30s\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload
In both scenarios, the configuration is not loaded into Prometheus:
# oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1
no result
Change the scrapeTimeout value to one less than scrapeInterval; the configuration is then loaded into Prometheus:
# oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1
- job_name: serviceMonitor/ns1/prometheus-example-monitor/0
- ns1
Retested with 4.11.0-0.nightly-2022-03-03-061758 (prometheus operator 0.54.1); validation for ScrapeTimeout is added.
Set an invalid value, scrapeTimeout: 120S:
# oc -n ns1 get servicemonitor prometheus-example-monitor -oyaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
creationTimestamp: "2022-03-03T10:44:39Z"
generation: 2
name: prometheus-example-monitor
namespace: ns1
resourceVersion: "100562"
uid: 5b92ae30-3194-4315-acaf-bf99dde6150b
spec:
endpoints:
- interval: 30s
port: web
scheme: http
scrapeTimeout: 120S
namespaceSelector:
matchNames:
- ns1
selector:
matchLabels:
app: prometheus-example-app
# oc -n openshift-user-workload-monitoring logs -c prometheus-operator prometheus-operator-549fbb5cc8-m96cx
level=warn ts=2022-03-03T10:47:01.537431456Z caller=operator.go:1837 component=prometheusoperator msg="skipping servicemonitor" error="invalid scrapeTimeout: \"120S\": not a valid duration string: \"120S\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload
The configuration is not loaded into Prometheus:
# oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1
no result
Edit the ServiceMonitor to change scrapeTimeout: 120S to scrapeTimeout: 120s, and check the prometheus-operator logs:
# oc -n openshift-user-workload-monitoring logs -c prometheus-operator prometheus-operator-549fbb5cc8-m96cx | grep scrapeTimeout
level=warn ts=2022-03-03T10:49:26.919371633Z caller=operator.go:1837 component=prometheusoperator msg="skipping servicemonitor" error="scrapeTimeout \"120s\" greater than scrapeInterval \"30s\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload
The configuration is not loaded into Prometheus:
# oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1
no result
Change the scrapeTimeout value to one less than scrapeInterval; there is no error in prometheus-operator, and the configuration is loaded into Prometheus:
# oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1
- job_name: serviceMonitor/ns1/prometheus-example-monitor/0
- ns1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069