Bug 2037762 - Wrong ServiceMonitor definition is causing failure during Prometheus configuration reload and preventing changes from being applied
Summary: Wrong ServiceMonitor definition is causing failure during Prometheus configuration reload and preventing changes from being applied
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.9
Hardware: x86_64
OS: Linux
Severity: high
Priority: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Jayapriya Pai
QA Contact: Junqi Zhao
Docs Contact: Brian Burt
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-01-06 14:01 UTC by Simon Reber
Modified: 2022-10-18 03:31 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Before this update, the prometheus-operator accepted any valid time value for scrapeTimeout, even one greater than the scrape interval. With this update, the operator validates that the specified scrapeTimeout does not exceed the scrape interval and rejects the configuration if it does.
Clone Of:
Environment:
Last Closed: 2022-08-10 10:41:42 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift prometheus-operator pull 151 0 None open Bug 2037762: bump openshift/prometheus-operator to v0.54.0 2022-02-01 10:57:32 UTC
Red Hat Knowledge Base (Solution) 6627021 0 None None None 2022-01-06 14:23:45 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:42:09 UTC

Description Simon Reber 2022-01-06 14:01:08 UTC
Description of problem:

Enabling monitoring for user-defined projects (following https://docs.openshift.com/container-platform/4.9/monitoring/enabling-monitoring-for-user-defined-projects.html) and then creating a ServiceMonitor with an invalid configuration causes Prometheus to fail to load its configuration, which prevents all subsequent changes from being applied. This impacts every project using user workload monitoring and therefore needs to be fixed by pre-validating the configuration before it is applied.

Version-Release number of selected component (if applicable):

 - OpenShift Container Platform 4.9.10

How reproducible:

 - Always

Steps to Reproduce:
1. Enable monitoring for user-defined projects following https://docs.openshift.com/container-platform/4.9/monitoring/enabling-monitoring-for-user-defined-projects.html#enabling-monitoring-for-user-defined-projects_enabling-monitoring-for-user-defined-projects
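For reference, the enablement in step 1 boils down to a ConfigMap like the following (keys as described in the linked 4.9 documentation):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
```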

2. Create a ServiceMonitor as shown below

{
    "apiVersion": "monitoring.coreos.com/v1",
    "kind": "ServiceMonitor",
    "metadata": {
        "name": "python-metrics",
        "namespace": "project-101"
    },
    "spec": {
        "endpoints": [
            {
                "interval": "60s",
                "port": "http",
                "scrapeTimeout": "120s"
            }
        ],
        "jobLabel": "app.kubernetes.io/name",
        "selector": {
            "matchLabels": {
                "app": "httpd"
            }
        }
    }
}

3. Check the logs of the Prometheus Container in `openshift-user-workload-monitoring` project.

$ oc logs prometheus-user-workload-0 -c prometheus 
[...]
level=info ts=2022-01-05T16:17:48.209Z caller=main.go:801 msg="Server is ready to receive web requests."
level=info ts=2022-01-05T16:17:48.296Z caller=main.go:986 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2022-01-05T16:17:48.298Z caller=kubernetes.go:284 component="discovery manager notify" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-01-05T16:17:48.300Z caller=main.go:1023 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=3.208219ms db_storage=1.189µs remote_storage=1.765µs web_handler=547ns query_engine=1.243µs scrape=54.128µs scrape_sd=3.571µs notify=177.875µs notify_sd=2.452636ms rules=56.123µs
level=info ts=2022-01-06T10:50:02.926Z caller=main.go:986 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=error ts=2022-01-06T10:50:02.927Z caller=main.go:763 msg="Error reloading config" err="couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: scrape timeout greater than scrape interval for scrape config with job name \"serviceMonitor/project-101/python-metrics/0\""
[...] (the same "Error reloading config" pair repeats every 5 seconds)

Actual results:

level=error ts=2022-01-06T10:51:02.928Z caller=main.go:763 msg="Error reloading config" err="couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: scrape timeout greater than scrape interval for scrape config with job name \"serviceMonitor/project-101/python-metrics/0\"" is reported, and all future configuration changes are not applied because of this problem.

Expected results:

Pre-validation of the configuration should happen so that an invalid configuration is rejected. Alternatively, the invalid configuration snippet could be skipped and the error highlighted, while everything that is valid is still applied.
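The requested pre-validation amounts to a duration parse plus a comparison. A minimal sketch in Go (the prometheus-operator's implementation language), using the standard library's time.ParseDuration as a stand-in for the Prometheus duration parser; the function name and error messages here are illustrative, not the actual operator code:

```go
package main

import (
	"fmt"
	"time"
)

// validateScrapeTimeout mirrors the kind of check the fix introduces:
// reject a scrape timeout that is not a valid duration string, or that
// exceeds the scrape interval. With time.ParseDuration, an uppercase
// unit such as "120S" fails to parse, as in the QA verification below.
func validateScrapeTimeout(interval, timeout string) error {
	iv, err := time.ParseDuration(interval)
	if err != nil {
		return fmt.Errorf("invalid interval %q: %w", interval, err)
	}
	to, err := time.ParseDuration(timeout)
	if err != nil {
		return fmt.Errorf("invalid scrapeTimeout %q: %w", timeout, err)
	}
	if to > iv {
		return fmt.Errorf("scrapeTimeout %q greater than scrapeInterval %q", timeout, interval)
	}
	return nil
}
```

With this check done before the config is rendered, an invalid ServiceMonitor endpoint is skipped (with a warning) instead of breaking the Prometheus configuration reload for everyone.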

Additional info:
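As a workaround until validation exists, keeping scrapeTimeout at or below interval avoids the reload failure. For example, the endpoint from the reproducer above becomes valid with only the scrapeTimeout value changed:

```json
"endpoints": [
    {
        "interval": "60s",
        "port": "http",
        "scrapeTimeout": "30s"
    }
]
```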

Comment 1 Jayapriya Pai 2022-01-11 12:26:49 UTC
Created the PR with fix in upstream prometheus-operator repo https://github.com/prometheus-operator/prometheus-operator/pull/4491

Comment 8 Junqi Zhao 2022-01-29 03:48:45 UTC
Tested with the PR; validation for scrapeTimeout is added.
Set the invalid value scrapeTimeout: 120S:
# oc -n ns1 get servicemonitor prometheus-example-monitor -oyaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2022-01-29T03:20:58Z"
  generation: 1
  name: prometheus-example-monitor
  namespace: ns1
  resourceVersion: "37684"
  uid: 903cd3e0-651c-49b8-8771-c02acf2514c6
spec:
  endpoints:
  - interval: 30s
    port: web
    scheme: http
    scrapeTimeout: 120S
  namespaceSelector:
    matchNames:
    - ns1
  selector:
    matchLabels:
      app: prometheus-example-app

# oc -n openshift-user-workload-monitoring logs -c prometheus-operator prometheus-operator-5f96f69b58-zq6qr
...
level=warn ts=2022-01-29T03:16:52.59730499Z caller=operator.go:1825 component=prometheusoperator msg="skipping servicemonitor" error="invalid scrapeTimeout: \"120S\": not a valid duration string: \"120S\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload
level=warn ts=2022-01-29T03:20:58.085474155Z caller=operator.go:1825 component=prometheusoperator msg="skipping servicemonitor" error="invalid scrapeTimeout: \"120S\": not a valid duration string: \"120S\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload

Edit the ServiceMonitor from scrapeTimeout: 120S to scrapeTimeout: 120s:

# oc -n openshift-user-workload-monitoring logs -c prometheus-operator prometheus-operator-5f96f69b58-zq6qr | grep scrapeTimeout
level=warn ts=2022-01-29T03:35:48.568227619Z caller=operator.go:1825 component=prometheusoperator msg="skipping servicemonitor" error="scrapeTimeout \"120s\" greater than scrapeInterval \"30s\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload


In both scenarios, the configuration is not loaded into Prometheus:
#  oc -n openshift-user-workload-monitoring  exec -c prometheus prometheus-user-workload-0  -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1
no result

After changing the scrapeTimeout value to less than scrapeInterval, the configuration is loaded into Prometheus:
# oc -n openshift-user-workload-monitoring  exec -c prometheus prometheus-user-workload-0  -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1
- job_name: serviceMonitor/ns1/prometheus-example-monitor/0
      - ns1

Comment 10 Junqi Zhao 2022-03-03 11:00:31 UTC
Retested with 4.11.0-0.nightly-2022-03-03-061758 (prometheus-operator 0.54.1); validation for scrapeTimeout is added.
Set the invalid value scrapeTimeout: 120S:
# oc -n ns1 get servicemonitor prometheus-example-monitor -oyaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2022-03-03T10:44:39Z"
  generation: 2
  name: prometheus-example-monitor
  namespace: ns1
  resourceVersion: "100562"
  uid: 5b92ae30-3194-4315-acaf-bf99dde6150b
spec:
  endpoints:
  - interval: 30s
    port: web
    scheme: http
    scrapeTimeout: 120S
  namespaceSelector:
    matchNames:
    - ns1
  selector:
    matchLabels:
      app: prometheus-example-app

# oc -n openshift-user-workload-monitoring logs -c prometheus-operator prometheus-operator-549fbb5cc8-m96cx 
level=warn ts=2022-03-03T10:47:01.537431456Z caller=operator.go:1837 component=prometheusoperator msg="skipping servicemonitor" error="invalid scrapeTimeout: \"120S\": not a valid duration string: \"120S\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload

The configuration is not loaded into Prometheus:
#  oc -n openshift-user-workload-monitoring  exec -c prometheus prometheus-user-workload-0  -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1
no result

Edit the ServiceMonitor from scrapeTimeout: 120S to scrapeTimeout: 120s:
# oc -n openshift-user-workload-monitoring logs -c prometheus-operator prometheus-operator-549fbb5cc8-m96cx | grep scrapeTimeout
level=warn ts=2022-03-03T10:49:26.919371633Z caller=operator.go:1837 component=prometheusoperator msg="skipping servicemonitor" error="scrapeTimeout \"120s\" greater than scrapeInterval \"30s\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload

The configuration is not loaded into Prometheus:
#  oc -n openshift-user-workload-monitoring  exec -c prometheus prometheus-user-workload-0  -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1
no result


After changing the scrapeTimeout value to less than scrapeInterval, there is no error in prometheus-operator, and the configuration is loaded into Prometheus:
# oc -n openshift-user-workload-monitoring  exec -c prometheus prometheus-user-workload-0  -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1
- job_name: serviceMonitor/ns1/prometheus-example-monitor/0
      - ns1

Comment 16 errata-xmlrpc 2022-08-10 10:41:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

