Created attachment 1851514
deploy prometheus operator from UI

Description of problem:
As an admin user, create a project, go to "Operators -> OperatorHub" in the console, find Prometheus Operator, and deploy it under the user namespace.

# oc -n test get po
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-7bfb4f858f-l4ww5   1/1     Running   0          69s

Then go to "Operators -> Installed Operators", click Prometheus Operator, and create a Prometheus instance from the Details page. Add one invalid setting, evaluationInterval: "30", to the config file:

# oc -n test get prometheus example -oyaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  creationTimestamp: "2022-01-18T04:44:24Z"
  generation: 1
  name: example
  namespace: test
  resourceVersion: "131001"
  uid: 522ea32a-5c05-4fa0-94c0-ad4d69a85d4e
spec:
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: test
      port: web
  evaluationInterval: "30"
  podMonitorSelector: {}
  probeSelector: {}
  replicas: 2
  ruleSelector: {}
  serviceAccountName: prometheus-k8s
  serviceMonitorSelector: {}

The prometheus pods go into CrashLoopBackOff:

# oc -n test get po
NAME                                   READY   STATUS             RESTARTS      AGE
prometheus-example-0                   1/2     CrashLoopBackOff   5 (72s ago)   4m20s
prometheus-example-1                   1/2     CrashLoopBackOff   5 (66s ago)   4m20s
prometheus-operator-7bfb4f858f-l4ww5   1/1     Running            0             5m19s

# oc -n test logs -c prometheus prometheus-example-0
ts=2022-01-18T07:22:13.892Z caller=main.go:437 level=error msg="Error loading config (--config.file=/etc/prometheus/config_out/prometheus.env.yaml)" err="parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: not a valid duration string: \"30\""

The error in config-reloader is expected:

# oc -n test logs -c config-reloader prometheus-example-0
level=info ts=2022-01-18T07:19:08.02517663Z caller=main.go:147 msg="Starting prometheus-config-reloader" version="(version=0.47.0, branch=refs/tags/pkg/client/v0.47.0, revision=539108b043e9ecc53c4e044083651e2ebfbd3492)"
level=info ts=2022-01-18T07:19:08.025223572Z caller=main.go:148 build_context="(go=go1.16.3, user=simonpasquier, date=20210413-15:46:43)"
level=info ts=2022-01-18T07:19:08.025371336Z caller=main.go:182 msg="Starting web server for metrics" listen=:8080
level=error ts=2022-01-18T07:19:08.026886949Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"http://localhost:9090/-/reload\": dial tcp [::1]:9090: connect: connection refused"
level=error ts=2022-01-18T07:19:13.028050033Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"http://localhost:9090/-/reload\": dial tcp [::1]:9090: connect: connection refused"
...

Update evaluationInterval to a correct value:

# oc -n test get prometheus example -oyaml | grep evaluationInterval
  evaluationInterval: 30s

The pods still fail with the same error:

# oc -n test get po
NAME                                   READY   STATUS             RESTARTS      AGE
prometheus-example-0                   1/2     CrashLoopBackOff   6 (96s ago)   7m32s
prometheus-example-1                   1/2     CrashLoopBackOff   6 (93s ago)   7m32s
prometheus-operator-7bfb4f858f-l4ww5   1/1     Running            0             8m31s

# oc -n test get ep
NAME                  ENDPOINTS   AGE
prometheus-operated               10m

# oc -n test logs -c prometheus prometheus-example-0
ts=2022-01-18T04:47:15.454Z caller=main.go:437 level=error msg="Error loading config (--config.file=/etc/prometheus/config_out/prometheus.env.yaml)" err="parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: not a valid duration string: \"30\""

# oc -n test logs -c config-reloader prometheus-example-0
level=info ts=2022-01-18T07:19:08.02517663Z caller=main.go:147 msg="Starting prometheus-config-reloader" version="(version=0.47.0, branch=refs/tags/pkg/client/v0.47.0, revision=539108b043e9ecc53c4e044083651e2ebfbd3492)"
level=info ts=2022-01-18T07:19:08.025223572Z caller=main.go:148 build_context="(go=go1.16.3, user=simonpasquier, date=20210413-15:46:43)"
level=info ts=2022-01-18T07:19:08.025371336Z caller=main.go:182 msg="Starting web server for metrics" listen=:8080
level=error ts=2022-01-18T07:19:08.026886949Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"http://localhost:9090/-/reload\": dial tcp [::1]:9090: connect: connection refused"
...
level=error ts=2022-01-18T07:27:03.031020167Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"http://localhost:9090/-/reload\": dial tcp [::1]:9090: connect: connection refused"
level=error ts=2022-01-18T07:27:08.031472105Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"http://localhost:9090/-/reload\": dial tcp [::1]:9090: connect: connection refused"

The workaround is to delete the prometheus-example StatefulSet and let the pods be recreated:

# oc -n test delete sts prometheus-example
statefulset.apps "prometheus-example" deleted

# oc -n test get po
NAME                                   READY   STATUS    RESTARTS      AGE
prometheus-example-0                   2/2     Running   1 (23s ago)   25s
prometheus-example-1                   2/2     Running   1 (22s ago)   25s
prometheus-operator-7bfb4f858f-l4ww5   1/1     Running   0             13m

# oc -n test get ep
NAME                  ENDPOINTS                            AGE
prometheus-operated   10.128.2.63:9090,10.131.0.112:9090   12m

# oc -n test logs -c prometheus prometheus-example-0
ts=2022-01-18T07:30:46.189Z caller=main.go:515 level=info msg="Starting Prometheus" version="(version=2.32.1, branch=HEAD, revision=41f1a8125e664985dd30674e5bdf6b683eff5d32)"
...

NOTE: this issue only happens when the update starts from an invalid value. Setting a correct value and then updating it to another correct value does not trigger the issue.

Version-Release number of selected component (if applicable):
Prometheus Operator 0.47.0
Prometheus 2.32.1

How reproducible:
always

Steps to Reproduce:
1. See the description.

Actual results:
The prometheus pods stay in CrashLoopBackOff after evaluationInterval is updated from the invalid value to a correct value.

Expected results:
The prometheus pods are healthy.
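The crash above starts from a duration without a unit ("30" instead of "30s"). As a rough pre-flight check before editing the Prometheus resource, a value can be matched against the unit suffixes Prometheus durations use (y, w, d, h, m, s, ms). This is only a sketch: is_valid_duration is a hypothetical helper, and the pattern is an approximation of the duration syntax, not something taken from the operator.

```shell
#!/bin/sh
# Hypothetical helper (not part of oc or the operator): succeeds when the
# argument looks like a Prometheus duration such as 30s, 5m or 1h30m.
# The pattern is an assumption modeled on the y/w/d/h/m/s/ms suffixes.
is_valid_duration() {
  # Every group in the pattern is optional, so reject the empty string first.
  [ -n "$1" ] || return 1
  printf '%s\n' "$1" | grep -Eq '^(([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?$'
}

# Check the two values from this report.
for value in "30" "30s"; do
  if is_valid_duration "$value"; then
    echo "$value: looks like a valid duration"
  else
    echo "$value: missing or unknown unit"
  fi
done
```

A check like this only catches the bad value before it is applied. It does not help once the pods are already in CrashLoopBackOff, where the workaround above still applies.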
The Prometheus operator from OLM is a community project, so I don't think that it deserves a BZ. Having said that, the issue has been fixed upstream in v0.52.0 [1].

[1] https://github.com/prometheus-operator/prometheus-operator/releases/tag/v0.52.0
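To tell whether an installed operator already carries the fix, its version can be compared against 0.52.0. The sketch below assumes plain MAJOR.MINOR.PATCH version strings with an optional leading "v". version_ge is a hypothetical helper, and the 0.47.0 value is copied by hand from the config-reloader log in the report rather than queried from the cluster.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Assumes both arguments are MAJOR.MINOR.PATCH, optionally prefixed with "v".
version_ge() {
  have=${1#v}
  want=${2#v}
  # Split both versions on "." into six positional parameters.
  old_ifs=$IFS
  IFS=.
  set -- $have $want
  IFS=$old_ifs
  if [ "$1" -ne "$4" ]; then [ "$1" -gt "$4" ]; return; fi
  if [ "$2" -ne "$5" ]; then [ "$2" -gt "$5" ]; return; fi
  [ "$3" -ge "$6" ]
}

# Version taken from the config-reloader log in this report.
installed="0.47.0"
if version_ge "$installed" "0.52.0"; then
  echo "$installed includes the upstream fix"
else
  echo "$installed predates the v0.52.0 fix"
fi
```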