Bug 2041725 - prometheus pod is still CrashLoopBackOff after prometheus field changed from invalid value to valid value
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-01-18 07:39 UTC by Junqi Zhao
Modified: 2022-01-18 11:58 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-01-18 11:58:30 UTC
Target Upstream Version:
Embargoed:


Attachments
deploy prometheus operator from UI (259.32 KB, image/png)
2022-01-18 07:39 UTC, Junqi Zhao

Description Junqi Zhao 2022-01-18 07:39:03 UTC
Created attachment 1851514
deploy prometheus operator from UI

Description of problem:
As an admin user, create a project, go to "Operators -> OperatorHub" in the console, find Prometheus Operator, and deploy it in the user namespace:
# oc -n test get po
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-7bfb4f858f-l4ww5   1/1     Running   0          69s

Then go to "Operators -> Installed Operators", click Prometheus Operator, and create a Prometheus instance from the Details page, adding one invalid setting, evaluationInterval: "30", to the config:
# oc -n test get prometheus example -oyaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  creationTimestamp: "2022-01-18T04:44:24Z"
  generation: 1
  name: example
  namespace: test
  resourceVersion: "131001"
  uid: 522ea32a-5c05-4fa0-94c0-ad4d69a85d4e
spec:
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: test
      port: web
  evaluationInterval: "30"
  podMonitorSelector: {}
  probeSelector: {}
  replicas: 2
  ruleSelector: {}
  serviceAccountName: prometheus-k8s
  serviceMonitorSelector: {}

The prometheus pods are in CrashLoopBackOff:
# oc -n test get po
NAME                                   READY   STATUS             RESTARTS      AGE
prometheus-example-0                   1/2     CrashLoopBackOff   5 (72s ago)   4m20s
prometheus-example-1                   1/2     CrashLoopBackOff   5 (66s ago)   4m20s
prometheus-operator-7bfb4f858f-l4ww5   1/1     Running            0             5m19s


# oc -n test logs -c prometheus prometheus-example-0
ts=2022-01-18T07:22:13.892Z caller=main.go:437 level=error msg="Error loading config (--config.file=/etc/prometheus/config_out/prometheus.env.yaml)" err="parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: not a valid duration string: \"30\""
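
The error above comes from the duration string lacking a unit suffix. A minimal sketch of a pre-check that could catch this before the value is applied, assuming the pattern below approximates the duration syntax Prometheus accepts (optional y, w, d, h, m, s, ms groups in that order, or a bare 0). The helper name is_prom_duration is hypothetical and is not part of oc or Prometheus:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if $1 looks like a Prometheus duration string.
# The regex is an approximation of the accepted syntax, not the upstream parser.
is_prom_duration() {
    # Every unit group in the regex is optional, so reject the empty string first.
    [ -n "$1" ] || return 1
    printf '%s\n' "$1" |
        grep -Eq '^(([0-9]+y)?([0-9]+w)?([0-9]+d)?([0-9]+h)?([0-9]+m)?([0-9]+s)?([0-9]+ms)?|0)$'
}

# The value from this report has no unit and is rejected; the corrected one passes.
is_prom_duration "30"  || echo 'invalid duration: "30"'
is_prom_duration "30s" && echo 'valid duration: "30s"'
```

A check like this only covers the format of the value; it does not replace validation by the operator or by Prometheus itself.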

The error in config-reloader is expected:
# oc -n test logs -c config-reloader prometheus-example-0
level=info ts=2022-01-18T07:19:08.02517663Z caller=main.go:147 msg="Starting prometheus-config-reloader" version="(version=0.47.0, branch=refs/tags/pkg/client/v0.47.0, revision=539108b043e9ecc53c4e044083651e2ebfbd3492)"
level=info ts=2022-01-18T07:19:08.025223572Z caller=main.go:148 build_context="(go=go1.16.3, user=simonpasquier, date=20210413-15:46:43)"
level=info ts=2022-01-18T07:19:08.025371336Z caller=main.go:182 msg="Starting web server for metrics" listen=:8080
level=error ts=2022-01-18T07:19:08.026886949Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"http://localhost:9090/-/reload\": dial tcp [::1]:9090: connect: connection refused"
level=error ts=2022-01-18T07:19:13.028050033Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"http://localhost:9090/-/reload\": dial tcp [::1]:9090: connect: connection refused"
...


Update evaluationInterval to a valid value:
# oc -n test get prometheus example -oyaml | grep evaluationInterval
  evaluationInterval: 30s

The pods still fail with the same error:
# oc -n test get po
NAME                                   READY   STATUS             RESTARTS      AGE
prometheus-example-0                   1/2     CrashLoopBackOff   6 (96s ago)   7m32s
prometheus-example-1                   1/2     CrashLoopBackOff   6 (93s ago)   7m32s
prometheus-operator-7bfb4f858f-l4ww5   1/1     Running            0             8m31s

# oc -n test get ep
NAME                  ENDPOINTS   AGE
prometheus-operated               10m

# oc -n test logs -c prometheus prometheus-example-0
ts=2022-01-18T04:47:15.454Z caller=main.go:437 level=error msg="Error loading config (--config.file=/etc/prometheus/config_out/prometheus.env.yaml)" err="parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: not a valid duration string: \"30\""

# oc -n test logs -c config-reloader prometheus-example-0
level=info ts=2022-01-18T07:19:08.02517663Z caller=main.go:147 msg="Starting prometheus-config-reloader" version="(version=0.47.0, branch=refs/tags/pkg/client/v0.47.0, revision=539108b043e9ecc53c4e044083651e2ebfbd3492)"
level=info ts=2022-01-18T07:19:08.025223572Z caller=main.go:148 build_context="(go=go1.16.3, user=simonpasquier, date=20210413-15:46:43)"
level=info ts=2022-01-18T07:19:08.025371336Z caller=main.go:182 msg="Starting web server for metrics" listen=:8080
level=error ts=2022-01-18T07:19:08.026886949Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"http://localhost:9090/-/reload\": dial tcp [::1]:9090: connect: connection refused"
...
level=error ts=2022-01-18T07:27:03.031020167Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"http://localhost:9090/-/reload\": dial tcp [::1]:9090: connect: connection refused"
level=error ts=2022-01-18T07:27:08.031472105Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post \"http://localhost:9090/-/reload\": dial tcp [::1]:9090: connect: connection refused"


The workaround is to delete the prometheus-example StatefulSet and let the pods be recreated:
# oc -n test delete sts prometheus-example
statefulset.apps "prometheus-example" deleted

# oc -n test get po
NAME                                   READY   STATUS    RESTARTS      AGE
prometheus-example-0                   2/2     Running   1 (23s ago)   25s
prometheus-example-1                   2/2     Running   1 (22s ago)   25s
prometheus-operator-7bfb4f858f-l4ww5   1/1     Running   0             13m

# oc -n test get ep
NAME                  ENDPOINTS                            AGE
prometheus-operated   10.128.2.63:9090,10.131.0.112:9090   12m


# oc -n test logs -c prometheus prometheus-example-0
ts=2022-01-18T07:30:46.189Z caller=main.go:515 level=info msg="Starting Prometheus" version="(version=2.32.1, branch=HEAD, revision=41f1a8125e664985dd30674e5bdf6b683eff5d32)"
...
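
Before applying the workaround, it can help to confirm which pods are actually stuck. A minimal sketch, assuming the default table layout of `oc get po` shown above (NAME, READY, STATUS as the first three columns). The helper name crashlooping_pods is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: reads `oc get po` table output on stdin and prints the
# names of pods whose STATUS column is CrashLoopBackOff (header row skipped).
crashlooping_pods() {
    awk 'NR > 1 && $3 == "CrashLoopBackOff" { print $1 }'
}

# Usage against a live cluster (namespace and names taken from this report):
#   oc -n test get po | crashlooping_pods
#   oc -n test delete sts prometheus-example
```

Deleting the StatefulSet is the step taken in this report; the helper only lists candidates and deletes nothing by itself.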

NOTE: this issue only happens when starting from an invalid value. Setting a valid value and then updating it to another valid value does not trigger the issue.

Version-Release number of selected component (if applicable):
Prometheus Operator 0.47.0
Prometheus 2.32.1


How reproducible:
always

Steps to Reproduce:
1. See the description.

Actual results:
The prometheus pods remain in CrashLoopBackOff after the Prometheus resource is updated from an invalid value to a valid one.

Expected results:
The prometheus pods are healthy.

Additional info:

Comment 3 Simon Pasquier 2022-01-18 11:58:30 UTC
The Prometheus operator from OLM is a community project, so I don't think it deserves a BZ. That said, the issue has been fixed upstream in v0.52.0 [1].

[1] https://github.com/prometheus-operator/prometheus-operator/releases/tag/v0.52.0
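
Since the fix is tied to a specific upstream release, a quick way to tell whether an installed operator already includes it is to compare version strings. A minimal sketch, assuming `sort -V` (version sort, as in GNU coreutils) is available. The helper name version_at_least is hypothetical, and the version strings are passed without the leading "v":

```shell
#!/bin/sh
# Hypothetical helper: succeeds when version $1 is greater than or equal to $2.
# Assumes `sort -V` is supported by the local sort implementation.
version_at_least() {
    # The smaller of the two versions sorts first; if that is $2, then $1 >= $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# 0.47.0 is the operator version from this report; 0.52.0 is the fixed release.
if version_at_least "0.47.0" "0.52.0"; then
    echo "operator includes the upstream fix"
else
    echo "operator predates the upstream fix"
fi
```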

