Description of problem:

Enabling monitoring for user-defined projects (following https://docs.openshift.com/container-platform/4.9/monitoring/enabling-monitoring-for-user-defined-projects.html) with a ServiceMonitor that has an invalid configuration causes Prometheus to fail to load the configuration and therefore to stop applying all subsequent changes. This impacts all projects using user workload monitoring and thus needs to be fixed by pre-validating the configuration before it is applied.

Version-Release number of selected component (if applicable):
- OpenShift Container Platform 4.9.10

How reproducible:
- Always

Steps to Reproduce:

1. Enable monitoring for user-defined projects following https://docs.openshift.com/container-platform/4.9/monitoring/enabling-monitoring-for-user-defined-projects.html#enabling-monitoring-for-user-defined-projects_enabling-monitoring-for-user-defined-projects

2. Create a ServiceMonitor as shown below (note that scrapeTimeout is greater than interval):

{
  "apiVersion": "monitoring.coreos.com/v1",
  "kind": "ServiceMonitor",
  "metadata": {
    "name": "python-metrics",
    "namespace": "project-101"
  },
  "spec": {
    "endpoints": [
      {
        "interval": "60s",
        "port": "http",
        "scrapeTimeout": "120s"
      }
    ],
    "jobLabel": "app.kubernetes.io/name",
    "selector": {
      "matchLabels": {
        "app": "httpd"
      }
    }
  }
}

3. Check the logs of the Prometheus container in the `openshift-user-workload-monitoring` project.

$ oc logs prometheus-user-workload-0 -c prometheus
[...]
level=info ts=2022-01-05T16:17:48.209Z caller=main.go:801 msg="Server is ready to receive web requests."
level=info ts=2022-01-05T16:17:48.296Z caller=main.go:986 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2022-01-05T16:17:48.298Z caller=kubernetes.go:284 component="discovery manager notify" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-01-05T16:17:48.300Z caller=main.go:1023 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=3.208219ms db_storage=1.189µs remote_storage=1.765µs web_handler=547ns query_engine=1.243µs scrape=54.128µs scrape_sd=3.571µs notify=177.875µs notify_sd=2.452636ms rules=56.123µs
level=info ts=2022-01-06T10:50:02.926Z caller=main.go:986 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=error ts=2022-01-06T10:50:02.927Z caller=main.go:763 msg="Error reloading config" err="couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: scrape timeout greater than scrape interval for scrape config with job name \"serviceMonitor/project-101/python-metrics/0\""
level=info ts=2022-01-06T10:50:07.924Z caller=main.go:986 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=error ts=2022-01-06T10:50:07.925Z caller=main.go:763 msg="Error reloading config" err="couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: scrape timeout greater than scrape interval for scrape config with job name \"serviceMonitor/project-101/python-metrics/0\""
level=info ts=2022-01-06T10:50:12.929Z caller=main.go:986 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=error ts=2022-01-06T10:50:12.930Z caller=main.go:763 msg="Error reloading config" err="couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: scrape timeout greater than scrape interval for scrape config with job name \"serviceMonitor/project-101/python-metrics/0\""
level=info ts=2022-01-06T10:50:17.929Z caller=main.go:986 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=error ts=2022-01-06T10:50:17.930Z caller=main.go:763 msg="Error reloading config" err="couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: scrape timeout greater than scrape interval for scrape config with job name \"serviceMonitor/project-101/python-metrics/0\""
level=info ts=2022-01-06T10:50:22.929Z caller=main.go:986 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=error ts=2022-01-06T10:50:22.930Z caller=main.go:763 msg="Error reloading config" err="couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: scrape timeout greater than scrape interval for scrape config with job name \"serviceMonitor/project-101/python-metrics/0\""
level=info ts=2022-01-06T10:50:27.929Z caller=main.go:986 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=error ts=2022-01-06T10:50:27.930Z caller=main.go:763 msg="Error reloading config" err="couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: scrape timeout greater than scrape interval for scrape config with job name \"serviceMonitor/project-101/python-metrics/0\""
level=info ts=2022-01-06T10:50:32.929Z caller=main.go:986 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=error ts=2022-01-06T10:50:32.929Z caller=main.go:763 msg="Error reloading config" err="couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: scrape timeout greater than scrape interval for scrape config with job name \"serviceMonitor/project-101/python-metrics/0\""
level=info ts=2022-01-06T10:50:37.929Z caller=main.go:986 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=error ts=2022-01-06T10:50:37.930Z caller=main.go:763 msg="Error reloading config" err="couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: scrape timeout greater than scrape interval for scrape config with job name \"serviceMonitor/project-101/python-metrics/0\""
level=info ts=2022-01-06T10:50:42.926Z caller=main.go:986 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=error ts=2022-01-06T10:50:42.927Z caller=main.go:763 msg="Error reloading config" err="couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: scrape timeout greater than scrape interval for scrape config with job name \"serviceMonitor/project-101/python-metrics/0\""
level=info ts=2022-01-06T10:50:47.929Z caller=main.go:986 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=error ts=2022-01-06T10:50:47.930Z caller=main.go:763 msg="Error reloading config" err="couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: scrape timeout greater than scrape interval for scrape config with job name \"serviceMonitor/project-101/python-metrics/0\""
level=info ts=2022-01-06T10:50:52.929Z caller=main.go:986 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=error ts=2022-01-06T10:50:52.930Z caller=main.go:763 msg="Error reloading config" err="couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: scrape timeout greater than scrape interval for scrape config with job name \"serviceMonitor/project-101/python-metrics/0\""
level=info ts=2022-01-06T10:50:57.929Z caller=main.go:986 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=error ts=2022-01-06T10:50:57.930Z caller=main.go:763 msg="Error reloading config" err="couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: scrape timeout greater than scrape interval for scrape config with job name \"serviceMonitor/project-101/python-metrics/0\""
level=info ts=2022-01-06T10:51:02.927Z caller=main.go:986 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=error ts=2022-01-06T10:51:02.928Z caller=main.go:763 msg="Error reloading config" err="couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: scrape timeout greater than scrape interval for scrape config with job name \"serviceMonitor/project-101/python-metrics/0\""

Actual results:

level=error ts=2022-01-06T10:51:02.928Z caller=main.go:763 msg="Error reloading config" err="couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: scrape timeout greater than scrape interval for scrape config with job name \"serviceMonitor/project-101/python-metrics/0\""

is reported, and no future configuration changes are applied because of this problem.

Expected results:

Pre-validation of the configuration, so that invalid configuration is rejected. Another approach could be to skip the invalid configuration snippet, highlight the error, and then proceed with everything that is valid.

Additional info:
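For comparison, the reload succeeds when scrapeTimeout does not exceed the scrape interval. A minimal corrected version of the ServiceMonitor from step 2 might look like the following (the 30s timeout is only an example value, not taken from the reproducer):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: python-metrics
  namespace: project-101
spec:
  endpoints:
  - interval: 60s
    port: http
    scrapeTimeout: 30s   # example value; must not be greater than the 60s interval
  jobLabel: app.kubernetes.io/name
  selector:
    matchLabels:
      app: httpd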
Created a PR with the fix in the upstream prometheus-operator repo: https://github.com/prometheus-operator/prometheus-operator/pull/4491
Tested with the PR; validation for scrapeTimeout is added.

Set an invalid value, scrapeTimeout: 120S:

# oc -n ns1 get servicemonitor prometheus-example-monitor -oyaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2022-01-29T03:20:58Z"
  generation: 1
  name: prometheus-example-monitor
  namespace: ns1
  resourceVersion: "37684"
  uid: 903cd3e0-651c-49b8-8771-c02acf2514c6
spec:
  endpoints:
  - interval: 30s
    port: web
    scheme: http
    scrapeTimeout: 120S
  namespaceSelector:
    matchNames:
    - ns1
  selector:
    matchLabels:
      app: prometheus-example-app

# oc -n openshift-user-workload-monitoring logs -c prometheus-operator prometheus-operator-5f96f69b58-zq6qr
...
level=warn ts=2022-01-29T03:16:52.59730499Z caller=operator.go:1825 component=prometheusoperator msg="skipping servicemonitor" error="invalid scrapeTimeout: \"120S\": not a valid duration string: \"120S\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload
level=warn ts=2022-01-29T03:20:58.085474155Z caller=operator.go:1825 component=prometheusoperator msg="skipping servicemonitor" error="invalid scrapeTimeout: \"120S\": not a valid duration string: \"120S\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload

Edit the ServiceMonitor from scrapeTimeout: 120S to scrapeTimeout: 120s, and check the prometheus-operator logs again:

# oc -n openshift-user-workload-monitoring logs -c prometheus-operator prometheus-operator-5f96f69b58-zq6qr
level=warn ts=2022-01-29T03:35:48.568227619Z caller=operator.go:1825 component=prometheusoperator msg="skipping servicemonitor" error="scrapeTimeout \"120s\" greater than scrapeInterval \"30s\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload

In both scenarios, the configuration is not loaded into Prometheus:

# oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1
no result

After changing the scrapeTimeout value to less than the scrapeInterval, the configuration is loaded into Prometheus:

# oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1
- job_name: serviceMonitor/ns1/prometheus-example-monitor/0
        - ns1
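The exact passing value is not recorded above; an endpoints stanza that satisfies the validation, assuming the same 30s interval and with 10s chosen only as an example value, would look roughly like this:

spec:
  endpoints:
  - interval: 30s
    port: web
    scheme: http
    scrapeTimeout: 10s   # example value; any duration no greater than the 30s interval passes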
Retested with 4.11.0-0.nightly-2022-03-03-061758 (prometheus-operator 0.54.1); validation for scrapeTimeout is added.

Set an invalid value, scrapeTimeout: 120S:

# oc -n ns1 get servicemonitor prometheus-example-monitor -oyaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2022-03-03T10:44:39Z"
  generation: 2
  name: prometheus-example-monitor
  namespace: ns1
  resourceVersion: "100562"
  uid: 5b92ae30-3194-4315-acaf-bf99dde6150b
spec:
  endpoints:
  - interval: 30s
    port: web
    scheme: http
    scrapeTimeout: 120S
  namespaceSelector:
    matchNames:
    - ns1
  selector:
    matchLabels:
      app: prometheus-example-app

# oc -n openshift-user-workload-monitoring logs -c prometheus-operator prometheus-operator-549fbb5cc8-m96cx
level=warn ts=2022-03-03T10:47:01.537431456Z caller=operator.go:1837 component=prometheusoperator msg="skipping servicemonitor" error="invalid scrapeTimeout: \"120S\": not a valid duration string: \"120S\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload

The configuration is not loaded into Prometheus:

# oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1
no result

Edit the ServiceMonitor from scrapeTimeout: 120S to scrapeTimeout: 120s, and check the prometheus-operator logs again:

# oc -n openshift-user-workload-monitoring logs -c prometheus-operator prometheus-operator-549fbb5cc8-m96cx | grep scrapeTimeout
level=warn ts=2022-03-03T10:49:26.919371633Z caller=operator.go:1837 component=prometheusoperator msg="skipping servicemonitor" error="scrapeTimeout \"120s\" greater than scrapeInterval \"30s\"" servicemonitor=ns1/prometheus-example-monitor namespace=openshift-user-workload-monitoring prometheus=user-workload

The configuration is not loaded into Prometheus:

# oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1
no result

Change the scrapeTimeout value to less than the scrapeInterval: there is no error in prometheus-operator, and the configuration is loaded into Prometheus:

# oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml | grep ns1
- job_name: serviceMonitor/ns1/prometheus-example-monitor/0
        - ns1
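Beyond grepping the generated config file, the active targets can also be inspected through the Prometheus API. This is only a sketch, assuming the prometheus-user-workload pod exposes the default Prometheus listen port 9090 for a port-forward; any equivalent method works:

# oc -n openshift-user-workload-monitoring port-forward pod/prometheus-user-workload-0 9090 &
# curl -s 'http://localhost:9090/api/v1/targets?state=active' | grep ns1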
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069