Bug 2037762

Summary: Wrong ServiceMonitor definition is causing failure during Prometheus configuration reload and preventing changes from being applied
Product: OpenShift Container Platform
Reporter: Simon Reber <sreber>
Component: Monitoring
Assignee: Jayapriya Pai <janantha>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Docs Contact: Brian Burt <bburt>
Priority: high
Version: 4.9
CC: amuller, anpicker, bburt, gparente, hongyli, janantha, jfajersk, pgough, spasquie
Target Milestone: ---
Keywords: Reopened
Target Release: 4.11.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Before this update, prometheus-operator accepted any valid duration value for scrapeTimeout, even one greater than the scrape interval. With this update, it validates whether the specified scrapeTimeout is greater than the scrapeInterval and rejects the configuration if it is.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-08-10 10:41:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Simon Reber 2022-01-06 14:01:08 UTC
Description of problem:

Enabling monitoring for user-defined projects (following https://docs.openshift.com/container-platform/4.9/monitoring/enabling-monitoring-for-user-defined-projects.html) and then creating a ServiceMonitor with an invalid configuration causes Prometheus to fail to load its configuration, which in turn prevents all subsequent changes from being applied. This impacts every project that uses user workload monitoring, so the configuration needs to be pre-validated before it is applied.

Version-Release number of selected component (if applicable):

 - OpenShift Container Platform 4.9.10

How reproducible:

 - Always

Steps to Reproduce:
1. Enable monitoring for user-defined projects following https://docs.openshift.com/container-platform/4.9/monitoring/enabling-monitoring-for-user-defined-projects.html#enabling-monitoring-for-user-defined-projects_enabling-monitoring-for-user-defined-projects

2. Create a ServiceMonitor as shown below

{
    "apiVersion": "monitoring.coreos.com/v1",
    "kind": "ServiceMonitor",
    "metadata": {
        "name": "python-metrics",
        "namespace": "project-101"
    },
    "spec": {
        "endpoints": [
            {
                "interval": "60s",
                "port": "http",
                "scrapeTimeout": "120s"
            }
        ],
        "jobLabel": "app.kubernetes.io/name",
        "selector": {
            "matchLabels": {
                "app": "httpd"
            }
        }
    }
}
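
The endpoint above is invalid because its scrapeTimeout (120s) is greater than its interval (60s), which Prometheus rejects at config load time. As a rough illustration of the kind of pre-validation later added to prometheus-operator (a hypothetical Python sketch, not the operator's actual Go code, supporting only a subset of Prometheus's duration grammar):

```python
import re

# Duration units in milliseconds (subset of Prometheus's duration grammar).
_UNITS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}

def parse_duration_ms(value):
    """Parse a duration string such as '60s' or '1m30s' into milliseconds.

    Raises ValueError for anything that is not a valid duration,
    e.g. '120S' (Prometheus duration units are lowercase).
    """
    if not re.fullmatch(r"(?:\d+(?:ms|s|m|h|d))+", value):
        raise ValueError(f"not a valid duration string: {value!r}")
    return sum(int(n) * _UNITS[u] for n, u in re.findall(r"(\d+)(ms|s|m|h|d)", value))

def validate_endpoint(endpoint):
    """Reject an endpoint whose scrapeTimeout exceeds its interval."""
    interval_raw = endpoint.get("interval", "30s")
    timeout_raw = endpoint.get("scrapeTimeout", "10s")
    if parse_duration_ms(timeout_raw) > parse_duration_ms(interval_raw):
        raise ValueError(
            f"scrapeTimeout {timeout_raw!r} greater than scrapeInterval {interval_raw!r}"
        )

# The endpoint from the ServiceMonitor above fails this check:
endpoint = {"interval": "60s", "port": "http", "scrapeTimeout": "120s"}
```

Calling validate_endpoint(endpoint) raises the same kind of error seen in the Prometheus logs; note that real Prometheus durations also allow w and y units and constrain unit order, which this sketch does not enforce.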

3. Check the logs of the Prometheus Container in `openshift-user-workload-monitoring` project.

$ oc logs prometheus-user-workload-0 -c prometheus 
[...]
level=info ts=2022-01-05T16:17:48.209Z caller=main.go:801 msg="Server is ready to receive web requests."
level=info ts=2022-01-05T16:17:48.296Z caller=main.go:986 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2022-01-05T16:17:48.298Z caller=kubernetes.go:284 component="discovery manager notify" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-01-05T16:17:48.300Z caller=main.go:1023 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=3.208219ms db_storage=1.189µs remote_storage=1.765µs web_handler=547ns query_engine=1.243µs scrape=54.128µs scrape_sd=3.571µs notify=177.875µs notify_sd=2.452636ms rules=56.123µs
level=info ts=2022-01-06T10:50:02.926Z caller=main.go:986 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=error ts=2022-01-06T10:50:02.927Z caller=main.go:763 msg="Error reloading config" err="couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: scrape timeout greater than scrape interval for scrape config with job name \"serviceMonitor/project-101/python-metrics/0\""
[... the same pair of "Loading configuration file" / "Error reloading config" messages repeats every 5 seconds ...]
level=info ts=2022-01-06T10:51:02.927Z caller=main.go:986 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=error ts=2022-01-06T10:51:02.928Z caller=main.go:763 msg="Error reloading config" err="couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: scrape timeout greater than scrape interval for scrape config with job name \"serviceMonitor/project-101/python-metrics/0\""

Actual results:

level=error ts=2022-01-06T10:51:02.928Z caller=main.go:763 msg="Error reloading config" err="couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: scrape timeout greater than scrape interval for scrape config with job name \"serviceMonitor/project-101/python-metrics/0\"" is reported, and all subsequent changes are not applied because of this problem.

Expected results:

The configuration should be pre-validated so that invalid configuration is rejected. An alternative approach would be to skip the invalid configuration snippet, highlight the error, and then proceed with everything that is valid.

Additional info:

Comment 1 Jayapriya Pai 2022-01-11 12:26:49 UTC
Created the PR with fix in upstream prometheus-operator repo https://github.com/prometheus-operator/prometheus-operator/pull/4491

Comment 8 Junqi Zhao 2022-01-29 03:48:45 UTC
Tested with the PR; validation for scrapeTimeout is added.
Set an invalid value, scrapeTimeout: 120S:
# oc -n ns1 get servicemonitor prometheus-example-monitor -oyaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2022-01-29T03:20:58Z"
  generation: 1
  name: prometheus-example-monitor
  namespace: ns1
  resourceVersion: "37684"
  uid: 903cd3e0-651c-49b8-8771-c02acf2514c6
spec:
  endpoints:
  - interval: 30s
    port: web
    scheme: http
    scrapeTimeout: 120S
  namespaceSelector:
    matchNames:
    - ns1
  selector:
    matchLabels:
      app: prometheus-example-app

# oc -n openshift-user-workload-monitoring logs -c prometheus-operator prometheus-operator-5f96f69b58-zq6qr
...
level=warn ts=2022-01-29T03:16:52.59730499Z caller=operator.go:1825 component=prometheusoperator msg="skipping servicemonitor" error="invalid scrapeTimeout: \"120S\": not a valid duration string: \"120S\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload
level=warn ts=2022-01-29T03:20:58.085474155Z caller=operator.go:1825 component=prometheusoperator msg="skipping servicemonitor" error="invalid scrapeTimeout: \"120S\": not a valid duration string: \"120S\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload
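
The first warning ("not a valid duration string: \"120S\"") comes from the duration-parsing step: Prometheus duration units are lowercase, so 120S fails before the timeout/interval comparison is even reached. A simplified, hypothetical check of that grammar (looser than Prometheus's real parser, which also constrains unit order and repetition):

```python
import re

# One or more number+unit groups, lowercase units only (ms, s, m, h, d, w, y).
# Prometheus's real grammar is stricter: each unit at most once, in descending order.
DURATION_RE = re.compile(r"(?:\d+(?:ms|s|m|h|d|w|y))+")

def is_valid_duration(value):
    """Return True for strings like '30s' or '1h30m'; False for '120S' or ''."""
    return DURATION_RE.fullmatch(value) is not None
```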

Edit the ServiceMonitor from scrapeTimeout: 120S to scrapeTimeout: 120s, and:

# oc -n openshift-user-workload-monitoring logs -c prometheus prometheus-user-workload-0 
level=warn ts=2022-01-29T03:35:48.568227619Z caller=operator.go:1825 component=prometheusoperator msg="skipping servicemonitor" error="scrapeTimeout \"120s\" greater than scrapeInterval \"30s\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload


In both scenarios, the configuration is not loaded into Prometheus:
#  oc -n openshift-user-workload-monitoring  exec -c prometheus prometheus-user-workload-0  -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1
no result

Change the scrapeTimeout value to less than scrapeInterval; the configuration is then loaded into Prometheus:
# oc -n openshift-user-workload-monitoring  exec -c prometheus prometheus-user-workload-0  -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1
- job_name: serviceMonitor/ns1/prometheus-example-monitor/0
      - ns1

Comment 10 Junqi Zhao 2022-03-03 11:00:31 UTC
Retested with 4.11.0-0.nightly-2022-03-03-061758 (prometheus-operator 0.54.1); validation for scrapeTimeout is added.
Set an invalid value, scrapeTimeout: 120S:
# oc -n ns1 get servicemonitor prometheus-example-monitor -oyaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2022-03-03T10:44:39Z"
  generation: 2
  name: prometheus-example-monitor
  namespace: ns1
  resourceVersion: "100562"
  uid: 5b92ae30-3194-4315-acaf-bf99dde6150b
spec:
  endpoints:
  - interval: 30s
    port: web
    scheme: http
    scrapeTimeout: 120S
  namespaceSelector:
    matchNames:
    - ns1
  selector:
    matchLabels:
      app: prometheus-example-app

# oc -n openshift-user-workload-monitoring logs -c prometheus-operator prometheus-operator-549fbb5cc8-m96cx 
level=warn ts=2022-03-03T10:47:01.537431456Z caller=operator.go:1837 component=prometheusoperator msg="skipping servicemonitor" error="invalid scrapeTimeout: \"120S\": not a valid duration string: \"120S\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload

The configuration is not loaded into Prometheus:
#  oc -n openshift-user-workload-monitoring  exec -c prometheus prometheus-user-workload-0  -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1
no result

Edit the ServiceMonitor from scrapeTimeout: 120S to scrapeTimeout: 120s, and:
# oc -n openshift-user-workload-monitoring logs -c prometheus-operator prometheus-operator-549fbb5cc8-m96cx | grep scrapeTimeout
level=warn ts=2022-03-03T10:49:26.919371633Z caller=operator.go:1837 component=prometheusoperator msg="skipping servicemonitor" error="scrapeTimeout \"120s\" greater than scrapeInterval \"30s\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload

The configuration is not loaded into Prometheus:
#  oc -n openshift-user-workload-monitoring  exec -c prometheus prometheus-user-workload-0  -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1
no result


Change the scrapeTimeout value to less than scrapeInterval; there is no error in prometheus-operator, and the configuration is loaded into Prometheus:
# oc -n openshift-user-workload-monitoring  exec -c prometheus prometheus-user-workload-0  -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1
- job_name: serviceMonitor/ns1/prometheus-example-monitor/0
      - ns1

Comment 16 errata-xmlrpc 2022-08-10 10:41:42 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069