Bug 2014954 - The prometheus-k8s-{0,1} pods are CrashLoopBackoff repeatedly
Summary: The prometheus-k8s-{0,1} pods are CrashLoopBackoff repeatedly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Sunil Thaha
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-10-18 05:01 UTC by Hakyong Do
Modified: 2022-11-28 08:32 UTC
CC List: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-12 04:39:16 UTC
Target Upstream Version:
Embargoed:




Links:
- GitHub prometheus/prometheus pull 9606 (open): fix: panic when creating segment buffer reader (last updated 2021-10-28 04:53:38 UTC)
- GitHub prometheus/prometheus pull 9687 (Merged): fix: panic when checkpoint directory is empty (last updated 2021-11-18 07:01:27 UTC)
- Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-12 04:39:32 UTC)

Description Hakyong Do 2021-10-18 05:01:46 UTC
Description of problem:
The prometheus-k8s-{0,1} pods are in CrashLoopBackOff status. If these pods are deleted, Prometheus starts normally, but a few days later it goes into CrashLoopBackOff again with the error message below:


level=info ts=2021-10-05T01:41:12.108Z caller=main.go:418 msg="Starting Prometheus" version="(version=2.26.1, branch=rhaos-4.8-rhel-8, revision=c052078834283fc9419d39dc92fa98deca2074dd)"
level=info ts=2021-10-05T01:41:12.108Z caller=main.go:423 build_context="(go=go1.16.6, user=root@22ba12e0d3d8, date=20210729-16:32:07)"
level=info ts=2021-10-05T01:41:12.109Z caller=main.go:424 host_details="(Linux 4.18.0-305.10.2.el8_4.x86_64 #1 SMP Mon Jul 12 04:43:18 EDT 2021 x86_64 prometheus-k8s-1 (none))"
level=info ts=2021-10-05T01:41:12.109Z caller=main.go:425 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2021-10-05T01:41:12.109Z caller=main.go:426 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2021-10-05T01:41:12.185Z caller=web.go:540 component=web msg="Start listening for connections" address=127.0.0.1:9090
level=info ts=2021-10-05T01:41:12.186Z caller=main.go:795 msg="Starting TSDB ..."
level=info ts=2021-10-05T01:41:12.186Z caller=tls_config.go:191 component=web msg="TLS is disabled." http2=false
level=info ts=2021-10-05T01:41:12.210Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1631793600032 maxt=1631858400000 ulid=01FFSPM4CA77E4T8A8BE0B62N0
level=info ts=2021-10-05T01:41:12.212Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1631858400032 maxt=1631923200000 ulid=01FFVMDNYC3X486S79Q4E9HTKY
level=info ts=2021-10-05T01:41:12.213Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1631923200032 maxt=1631988000000 ulid=01FFXJ779WB4R2XK81T0W9H5GH
level=info ts=2021-10-05T01:41:12.214Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1631988000032 maxt=1632052800000 ulid=01FFZG0RMVS92P3BG9FV6D7ZCZ
level=info ts=2021-10-05T01:41:12.215Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1632052800032 maxt=1632117600000 ulid=01FG1DT8NNX3QPJY3KMXCZ9GKE
level=info ts=2021-10-05T01:41:12.216Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1632117600032 maxt=1632182400000 ulid=01FG3BKRYZY063P8EQ7XR6RPFX
level=info ts=2021-10-05T01:41:12.217Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1632182400032 maxt=1632247200000 ulid=01FGBY8TQQFT44BD6MERRMCB7D
level=info ts=2021-10-05T01:41:12.218Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1632471579722 maxt=1632506400000 ulid=01FGD0KJBTB0QETRQHNRE7W06N
level=info ts=2021-10-05T01:41:12.219Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1632506400035 maxt=1632571200000 ulid=01FGNP29DT69EPKMDFKRBWE8W1
level=info ts=2021-10-05T01:41:12.220Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1632571200033 maxt=1632592800000 ulid=01FGP0ZQ3DYZFK0EB3V030Z7RJ
level=info ts=2021-10-05T01:41:12.221Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1632811744270 maxt=1632830400000 ulid=01FGPNJYN03ATA585HX6H4V4F5
level=info ts=2021-10-05T01:41:12.222Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1632830400032 maxt=1632895200000 ulid=01FGRKCRY1SRA5ACCPFYVA38ZA
level=info ts=2021-10-05T01:41:12.224Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1632895200032 maxt=1632960000000 ulid=01FGTH6CADDF4CMCPH0HYVHEWR
level=info ts=2021-10-05T01:41:12.225Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1632960000032 maxt=1633024800000 ulid=01FGWEZTR3NTX1Y7BCEFSWJPR2
level=info ts=2021-10-05T01:41:12.226Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1633024800032 maxt=1633046400000 ulid=01FGX3JQFM8SC9W5YM85X52K2E
level=info ts=2021-10-05T01:41:12.226Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1633068000032 maxt=1633075200000 ulid=01FGXH9VNYYBDQ9CV2P03FWP8M
level=info ts=2021-10-05T01:41:12.228Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1633075200032 maxt=1633082400000 ulid=01FGXR5JXYT7DSEVFCDZW8ZQAG
level=info ts=2021-10-05T01:41:12.229Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1633046400032 maxt=1633068000000 ulid=01FGXR5XCKX9RN5DQZJ5MAJ57X
level=info ts=2021-10-05T01:41:12.230Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1633082400032 maxt=1633089600000 ulid=01FGXZ1A5XS06ZPQJFKAMRA2P6
level=info ts=2021-10-05T01:41:12.231Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1633089600032 maxt=1633096800000 ulid=01FGY5X1DX6QP29VEKVD5P8AQQ
level=info ts=2021-10-05T01:41:12.232Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1633096800032 maxt=1633104000000 ulid=01FGYCRRNWS18WA0YNTAXABFBR
level=info ts=2021-10-05T01:41:12.530Z caller=head.go:696 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2021-10-05T01:41:12.530Z caller=head.go:710 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=3.481µs
level=info ts=2021-10-05T01:41:12.530Z caller=head.go:716 component=tsdb msg="Replaying WAL, this may take a while"
panic: runtime error: index out of range [0] with length 0

goroutine 297 [running]:
github.com/prometheus/prometheus/tsdb/wal.NewSegmentBufReader(...)
    /go/src/github.com/prometheus/prometheus/tsdb/wal/wal.go:862
github.com/prometheus/prometheus/tsdb/wal.NewSegmentsRangeReader(0xc0012a9a50, 0x1, 0x1, 0x23, 0x4ef, 0x0, 0x0)
    /go/src/github.com/prometheus/prometheus/tsdb/wal/wal.go:844 +0x770
github.com/prometheus/prometheus/tsdb/wal.NewSegmentsReader(...)
    /go/src/github.com/prometheus/prometheus/tsdb/wal/wal.go:816
github.com/prometheus/prometheus/tsdb.(*Head).Init(0xc0003f8000, 0x17c3c957400, 0x0, 0x0)
    /go/src/github.com/prometheus/prometheus/tsdb/head.go:726 +0x13ed
github.com/prometheus/prometheus/tsdb.open(0x7fff974e5138, 0xb, 0x3290520, 0xc00097e0c0, 0x32c8a90, 0xc00014ed20, 0xc0011f6ea0, 0xc00128e000, 0x3, 0xa, ...)
    /go/src/github.com/prometheus/prometheus/tsdb/db.go:695 +0x858
github.com/prometheus/prometheus/tsdb.Open(0x7fff974e5138, 0xb, 0x3290520, 0xc00097e0c0, 0x32c8a90, 0xc00014ed20, 0xc0011f6ea0, 0xc000560e48, 0x7ac9e2, 0xc0004da9c0)
    /go/src/github.com/prometheus/prometheus/tsdb/db.go:540 +0xbc
main.openDBWithMetrics(0x7fff974e5138, 0xb, 0x3290520, 0xc00098cdb0, 0x32c8a90, 0xc00014ed20, 0xc0011f6ea0, 0x0, 0x0, 0x0)
    /go/src/github.com/prometheus/prometheus/cmd/prometheus/main.go:888 +0x10d
main.main.func20(0x0, 0x0)
    /go/src/github.com/prometheus/prometheus/cmd/prometheus/main.go:801 +0x1ff
github.com/oklog/run.(*Group).Run.func1(0xc0011f6f00, 0xc0005543c0, 0xc0011fd350)
    /go/src/github.com/prometheus/prometheus/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/oklog/run.(*Group).Run
    /go/src/github.com/prometheus/prometheus/vendor/github.com/oklog/run/group.go:37 +0xbb
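
For reference, the panic appears to mean that the WAL segment reader was handed an empty list of segments (for example an empty or truncated segment range in the wal directory) and then indexed its first element unconditionally. A minimal, self-contained Go sketch of that pattern, with illustrative names only (not the actual prometheus/tsdb/wal code):

package main

// segment stands in for a WAL segment file; illustrative only.
type segment struct{ name string }

// newSegmentBufReader mirrors the failing pattern: it reads segs[0]
// unconditionally, so an empty slice panics with
// "index out of range [0] with length 0".
func newSegmentBufReader(segs ...*segment) *segment {
	return segs[0]
}

func main() {
	var segs []*segment // e.g. no usable WAL segments were found
	_ = newSegmentBufReader(segs...)
}
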
The output of `oc describe` for the pods is below:


Name:                 prometheus-k8s-0
Namespace:            openshift-monitoring
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 oci-ocpprdin03.oci-secocp.ocp.io/10.199.107.108
Start Time:           Tue, 05 Oct 2021 10:40:18 +0900
Labels:               app=prometheus
                      app.kubernetes.io/component=prometheus
                      app.kubernetes.io/instance=k8s
                      app.kubernetes.io/managed-by=prometheus-operator
                      app.kubernetes.io/name=prometheus
                      app.kubernetes.io/part-of=openshift-monitoring
                      app.kubernetes.io/version=2.26.1
                      controller-revision-hash=prometheus-k8s-78d55d7898
                      operator.prometheus.io/name=k8s
                      operator.prometheus.io/shard=0
                      prometheus=k8s
                      statefulset.kubernetes.io/pod-name=prometheus-k8s-0
Annotations:          k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "openshift-sdn",
                            "interface": "eth0",
                            "ips": [
                                "40.105.1.97"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "openshift-sdn",
                            "interface": "eth0",
                            "ips": [
                                "40.105.1.97"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      kubectl.kubernetes.io/default-container: prometheus
                      openshift.io/scc: nonroot
                      workload.openshift.io/warning: only single-node clusters support workload partitioning
Status:               Running
IP:                   40.105.1.97
IPs:
  IP:           40.105.1.97
Controlled By:  StatefulSet/prometheus-k8s
Containers:
  prometheus:
    Container ID:  cri-o://27f73ce1bd11c727aec89397c068b1ebc2ae925d46ba95ef95058b832af7f8e4
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e1c622a6e8d0c61d23e3dcd01e8b4dc6108c997fe111a5966f65c43dae468e39
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e1c622a6e8d0c61d23e3dcd01e8b4dc6108c997fe111a5966f65c43dae468e39
    Port:          <none>
    Host Port:     <none>
    Args:
      --web.console.templates=/etc/prometheus/consoles
      --web.console.libraries=/etc/prometheus/console_libraries
      --config.file=/etc/prometheus/config_out/prometheus.env.yaml
      --storage.tsdb.path=/prometheus
      --storage.tsdb.retention.time=15d
      --web.enable-lifecycle
      --storage.tsdb.no-lockfile
      --web.external-url=https://prometheus-k8s-openshift-monitoring.apps.oci-secocp.ocp.io/
      --web.route-prefix=/
      --web.listen-address=127.0.0.1:9090
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   ng on-disk memory mappable chunks if any"
level=info ts=2021-10-07T04:20:59.999Z caller=head.go:710 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=3.875µs
level=info ts=2021-10-07T04:20:59.999Z caller=head.go:716 component=tsdb msg="Replaying WAL, this may take a while"
panic: runtime error: index out of range [0] with length 0

goroutine 279 [running]:
github.com/prometheus/prometheus/tsdb/wal.NewSegmentBufReader(...)
  /go/src/github.com/prometheus/prometheus/tsdb/wal/wal.go:862
github.com/prometheus/prometheus/tsdb/wal.NewSegmentsRangeReader(0xc001039a50, 0x1, 0x1, 0x23, 0x5b9, 0x0, 0x0)
  /go/src/github.com/prometheus/prometheus/tsdb/wal/wal.go:844 +0x770
github.com/prometheus/prometheus/tsdb/wal.NewSegmentsReader(...)
  /go/src/github.com/prometheus/prometheus/tsdb/wal/wal.go:816
github.com/prometheus/prometheus/tsdb.(*Head).Init(0xc001454000, 0x17c56554000, 0x0, 0x0)
  /go/src/github.com/prometheus/prometheus/tsdb/head.go:726 +0x13ed
github.com/prometheus/prometheus/tsdb.open(0x7ffc1bc75138, 0xb, 0x3290520, 0xc000b76150, 0x32c8a90, 0xc0000bad70, 0xc000204720, 0xc000b7e000, 0x3, 0xa, ...)
  /go/src/github.com/prometheus/prometheus/tsdb/db.go:695 +0x858
github.com/prometheus/prometheus/tsdb.Open(0x7ffc1bc75138, 0xb, 0x3290520, 0xc000b76150, 0x32c8a90, 0xc0000bad70, 0xc000204720, 0xc000fc4e48, 0x7ac9e2, 0xc0004d2800)
  /go/src/github.com/prometheus/prometheus/tsdb/db.go:540 +0xbc
main.openDBWithMetrics(0x7ffc1bc75138, 0xb, 0x3290520, 0xc000490a50, 0x32c8a90, 0xc0000bad70, 0xc000204720, 0xc000f96958, 0x1, 0x2a)
  /go/src/github.com/prometheus/prometheus/cmd/prometheus/main.go:888 +0x10d
main.main.func20(0x1, 0x26)
  /go/src/github.com/prometheus/prometheus/cmd/prometheus/main.go:801 +0x1ff
github.com
      Exit Code:    2
      Started:      Thu, 07 Oct 2021 13:20:59 +0900
      Finished:     Thu, 07 Oct 2021 13:21:00 +0900
    Ready:          False
    Restart Count:  140
    Requests:
      cpu:        70m
      memory:     1Gi
    Readiness:    exec [sh -c if [ -x "$(command -v curl)" ]; then exec curl http://localhost:9090/-/ready; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready; else exit 1; fi] delay=0s timeout=3s period=5s #success=1 #failure=120
    Environment:  <none>
    Mounts:
      /etc/pki/ca-trust/extracted/pem/ from prometheus-trusted-ca-bundle (ro)
      /etc/prometheus/certs from tls-assets (ro)
      /etc/prometheus/config_out from config-out (ro)
      /etc/prometheus/configmaps/kubelet-serving-ca-bundle from configmap-kubelet-serving-ca-bundle (ro)
      /etc/prometheus/configmaps/serving-certs-ca-bundle from configmap-serving-certs-ca-bundle (ro)
      /etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw)
      /etc/prometheus/secrets/kube-etcd-client-certs from secret-kube-etcd-client-certs (ro)
      /etc/prometheus/secrets/kube-rbac-proxy from secret-kube-rbac-proxy (ro)
      /etc/prometheus/secrets/prometheus-k8s-htpasswd from secret-prometheus-k8s-htpasswd (ro)
      /etc/prometheus/secrets/prometheus-k8s-proxy from secret-prometheus-k8s-proxy (ro)
      /etc/prometheus/secrets/prometheus-k8s-thanos-sidecar-tls from secret-prometheus-k8s-thanos-sidecar-tls (ro)
      /etc/prometheus/secrets/prometheus-k8s-tls from secret-prometheus-k8s-tls (ro)
      /prometheus from prometheus-k8s-db (rw,path="prometheus-db")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x4cft (ro)
  config-reloader:
    Container ID:  cri-o://9af8a016d4459effb75e90c2bfba2708f339c6790940a1b7db1afa1073a6dc69
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0bbe9ebac91cf9f0e2909dd022e83a9e0f2011bdbfb4890c869df39405fb931f
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0bbe9ebac91cf9f0e2909dd022e83a9e0f2011bdbfb4890c869df39405fb931f
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/prometheus-config-reloader
    Args:
      --listen-address=localhost:8080
      --reload-url=http://localhost:9090/-/reload
      --config-file=/etc/prometheus/config/prometheus.yaml.gz
      --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
      --watched-dir=/etc/prometheus/rules/prometheus-k8s-rulefiles-0
    State:          Running
      Started:      Tue, 05 Oct 2021 10:40:22 +0900
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  10Mi
    Environment:
      POD_NAME:  prometheus-k8s-0 (v1:metadata.name)
      SHARD:     0
    Mounts:
      /etc/prometheus/config from config (rw)
      /etc/prometheus/config_out from config-out (rw)
      /etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x4cft (ro)
  thanos-sidecar:
    Container ID:  cri-o://f2c67741d96bf03b6631caba6fcf56b77acb66ccebd606785493aa0a07c45fa1
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fedb316063c42baa890555591cb49a151c24ba02d347d8ccccfafd12ba781067
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fedb316063c42baa890555591cb49a151c24ba02d347d8ccccfafd12ba781067
    Ports:         10902/TCP, 10901/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      sidecar
      --prometheus.url=http://localhost:9090/
      --tsdb.path=/prometheus
      --grpc-address=[$(POD_IP)]:10901
      --http-address=127.0.0.1:10902
      --grpc-server-tls-cert=/etc/tls/grpc/server.crt
      --grpc-server-tls-key=/etc/tls/grpc/server.key
      --grpc-server-tls-client-ca=/etc/tls/grpc/ca.crt
    State:          Running
      Started:      Tue, 05 Oct 2021 10:40:23 +0900
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  25Mi
    Environment:
      POD_IP:   (v1:status.podIP)
    Mounts:
      /etc/tls/grpc from secret-grpc-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x4cft (ro)
  prometheus-proxy:
    Container ID:  cri-o://51f1cf269a9a0d060ff0ccd84b199268974b6f5fbb63e8516b56f050f2d2d02a
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:864b658b93adf38b3b6613225f00e8f3236299cb2f2f02aa16cf6b43eaa19229
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:864b658b93adf38b3b6613225f00e8f3236299cb2f2f02aa16cf6b43eaa19229
    Port:          9091/TCP
    Host Port:     0/TCP
    Args:
      -provider=openshift
      -https-address=:9091
      -http-address=
      -email-domain=*
      -upstream=http://localhost:9090
      -htpasswd-file=/etc/proxy/htpasswd/auth
      -openshift-service-account=prometheus-k8s
      -openshift-sar={"resource": "namespaces", "verb": "get"}
      -openshift-delegate-urls={"/": {"resource": "namespaces", "verb": "get"}}
      -tls-cert=/etc/tls/private/tls.crt
      -tls-key=/etc/tls/private/tls.key
      -client-secret-file=/var/run/secrets/kubernetes.io/serviceaccount/token
      -cookie-secret-file=/etc/proxy/secrets/session_secret
      -openshift-ca=/etc/pki/tls/cert.pem
      -openshift-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    State:          Running
      Started:      Tue, 05 Oct 2021 10:40:23 +0900
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  20Mi
    Environment:
      HTTP_PROXY:   
      HTTPS_PROXY:  
      NO_PROXY:     
    Mounts:
      /etc/pki/ca-trust/extracted/pem/ from prometheus-trusted-ca-bundle (ro)
      /etc/proxy/htpasswd from secret-prometheus-k8s-htpasswd (rw)
      /etc/proxy/secrets from secret-prometheus-k8s-proxy (rw)
      /etc/tls/private from secret-prometheus-k8s-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x4cft (ro)
  kube-rbac-proxy:
    Container ID:  cri-o://d73c51bda80f74ab226c79e9ddcbc041286d93c7b64e3ea6876dc5717f5ff6fd
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6d57bfd91fac9b68eb72d27226bc297472ceb136c996628b845ecc54a48b31cb
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6d57bfd91fac9b68eb72d27226bc297472ceb136c996628b845ecc54a48b31cb
    Port:          9092/TCP
    Host Port:     0/TCP
    Args:
      --secure-listen-address=0.0.0.0:9092
      --upstream=http://127.0.0.1:9095
      --config-file=/etc/kube-rbac-proxy/config.yaml
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
      --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
      --logtostderr=true
      --v=10
    State:          Running
      Started:      Tue, 05 Oct 2021 10:40:23 +0900
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        1m
      memory:     15Mi
    Environment:  <none>
    Mounts:
      /etc/kube-rbac-proxy from secret-kube-rbac-proxy (rw)
      /etc/tls/private from secret-prometheus-k8s-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x4cft (ro)
  prom-label-proxy:
    Container ID:  cri-o://3e1db3eb81a79e21fad603e426b8d2adcb9c7813977a53d84150dfbfdb3f0b95
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1e9b9b51d11596199cd6a1aebc4c7b6080d3c250b007dbbf083d22e020588b9b
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1e9b9b51d11596199cd6a1aebc4c7b6080d3c250b007dbbf083d22e020588b9b
    Port:          <none>
    Host Port:     <none>
    Args:
      --insecure-listen-address=127.0.0.1:9095
      --upstream=http://127.0.0.1:9090
      --label=namespace
    State:          Running
      Started:      Tue, 05 Oct 2021 10:40:24 +0900
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        1m
      memory:     15Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x4cft (ro)
  kube-rbac-proxy-thanos:
    Container ID:  cri-o://81549d73c214f0262075e1293d1dc1610b934ae68a91e424171b4aec7a1e3732
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6d57bfd91fac9b68eb72d27226bc297472ceb136c996628b845ecc54a48b31cb
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6d57bfd91fac9b68eb72d27226bc297472ceb136c996628b845ecc54a48b31cb
    Port:          10902/TCP
    Host Port:     0/TCP
    Args:
      --secure-listen-address=[$(POD_IP)]:10902
      --upstream=http://127.0.0.1:10902
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
      --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
      --allow-paths=/metrics
      --logtostderr=true
    State:          Running
      Started:      Tue, 05 Oct 2021 10:40:24 +0900
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  10Mi
    Environment:
      POD_IP:   (v1:status.podIP)
    Mounts:
      /etc/tls/private from secret-prometheus-k8s-thanos-sidecar-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x4cft (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  prometheus-k8s-db:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  prometheus-k8s-db-prometheus-k8s-0
    ReadOnly:   false
  config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s
    Optional:    false
  tls-assets:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s-tls-assets
    Optional:    false
  config-out:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  prometheus-k8s-rulefiles-0:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-k8s-rulefiles-0
    Optional:  false
  secret-kube-etcd-client-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kube-etcd-client-certs
    Optional:    false
  secret-prometheus-k8s-tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s-tls
    Optional:    false
  secret-prometheus-k8s-proxy:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s-proxy
    Optional:    false
  secret-prometheus-k8s-htpasswd:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s-htpasswd
    Optional:    false
  secret-prometheus-k8s-thanos-sidecar-tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s-thanos-sidecar-tls
    Optional:    false
  secret-kube-rbac-proxy:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kube-rbac-proxy
    Optional:    false
  configmap-serving-certs-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      serving-certs-ca-bundle
    Optional:  false
  configmap-kubelet-serving-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kubelet-serving-ca-bundle
    Optional:  false
  secret-grpc-tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s-grpc-tls-5n5q32u80g297
    Optional:    false
  prometheus-trusted-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-trusted-ca-bundle-d34s91lhv300e
    Optional:  true
  kube-api-access-x4cft:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              node-role.kubernetes.io/infra=
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Normal   Pulled   38m (x133 over 2d2h)    kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e1c622a6e8d0c61d23e3dcd01e8b4dc6108c997fe111a5966f65c43dae468e39" already present on machine
  Warning  BackOff  3m27s (x3267 over 11h)  kubelet  Back-off restarting failed container
The error logs look similar to the following upstream GitHub issue:
https://github.com/prometheus/prometheus/issues/6976


So I suggested removing all files in the 'wal' directory and restarting Prometheus.
The Prometheus pods started successfully, but a few days later they went into CrashLoopBackOff again with the error above.



Version-Release number of selected component (if applicable):
OCP: 4.8.4
Prometheus: 2.26.1

How reproducible:

Steps to Reproduce:
1. Delete the prometheus-k8s-{0,1} pods.
2. The prometheus-k8s-{0,1} pods are recreated automatically and start normally.
3. One or two days later, the prometheus-k8s-{0,1} pods are in CrashLoopBackOff again.

Actual results:
Because the Prometheus pods have crashed, the OCP cluster cannot be monitored and HPA cannot be used.

Expected results:
The Prometheus pods run without crashing.

Additional info:
- Prometheus is using NFS storage provided by Oracle Cloud.

Comment 1 Junqi Zhao 2021-10-18 05:50:58 UTC
(In reply to Hakyong Do from comment #0)
> Additional info:
> - Prometheus is using the storage as NFS provided by Oracle cloud.

Using NFS as storage is not recommended.

Comment 2 Hakyong Do 2021-10-18 06:36:10 UTC
Hi,

The customer doesn't have object storage, and block storage (like hostPath) can't preserve data across the monitoring nodes, so they use NFS storage as the PV.
Is there any possibility that this is a Prometheus bug? Does this issue happen because they use NFS storage?

If you need any further information to analyze this issue, please let me know.

Thank you.

Comment 3 Philip Gough 2021-10-18 09:17:27 UTC
Hi, as per the Prometheus docs https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects, NFS is not a supported option.

The related upstream issue (https://github.com/prometheus/prometheus/issues/6976) was marked as resolved in v2.20.0.

Comment 4 Hakyong Do 2021-10-18 09:33:00 UTC
Hi,
The Prometheus version they use is 2.26.1, so I don't think it's the same issue as that upstream one.

Thank you for your attention.

Comment 5 Junqi Zhao 2021-10-19 06:51:38 UTC
FYI: https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects
CAUTION: Non-POSIX compliant filesystems are not supported for Prometheus' local storage as unrecoverable corruptions may happen. NFS filesystems (including AWS's EFS) are not supported. NFS could be POSIX-compliant, but most implementations are not. It is strongly recommended to use a local filesystem for reliability.

Comment 6 Sunil Thaha 2021-10-25 08:32:47 UTC
@hdo 

To isolate whether this issue is due to NFS, would it be possible to run Prometheus without persistent storage and see if the issue recurs?

As Junqi mentioned above, Prometheus doesn't support NFS; however, we will certainly raise an upstream bug fix to shut down gracefully on a WAL replay error instead of panicking.
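
To illustrate the general shape of such a fix (a sketch only, with made-up names, not the actual upstream fix): the segment reader constructor would check for an empty segment list and return an error that the caller can log before exiting, rather than indexing the first segment unconditionally.

package main

import (
	"errors"
	"fmt"
)

// segment stands in for a WAL segment file; illustrative only, not the
// real prometheus/tsdb/wal types.
type segment struct{ name string }

// newSegmentBufReader returns an error when no segments are found (for
// example an empty checkpoint directory) instead of indexing segs[0]
// unconditionally, so WAL replay can fail with a logged error rather
// than a panic.
func newSegmentBufReader(segs ...*segment) (*segment, error) {
	if len(segs) == 0 {
		return nil, errors.New("no WAL segments to read")
	}
	return segs[0], nil
}

func main() {
	if _, err := newSegmentBufReader(); err != nil {
		fmt.Println("opening storage failed:", err) // graceful error path instead of a panic
	}
}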

Comment 7 Hakyong Do 2021-10-29 02:36:26 UTC
I asked the customer to use local storage instead of NFS.
You'll be updated if the same issue occurs again.

Thank you!!

Comment 8 Sunil Thaha 2021-11-18 07:01:28 UTC
The upstream PR https://github.com/prometheus/prometheus/pull/9687 may fix the bug; it certainly does if the checkpoint directory is empty.

@hdo, the next time this occurs, could you please run `tree` on the data directory?

Comment 9 Sunil Thaha 2021-11-25 08:32:23 UTC
Although the patch is merged upstream, the downstream fork will get the fix only after Prometheus 2.32.0 is released.
Hence, setting the status back to ASSIGNED.

Comment 10 Sunil Thaha 2022-01-17 08:01:33 UTC
Prometheus 2.32.0 has been merged downstream, hence moving this bug to MODIFIED.

Comment 12 Junqi Zhao 2022-01-18 02:36:56 UTC
Tested with 4.10.0-0.nightly-2022-01-17-182202; the Prometheus version is 2.32.1 and no regression issues were found.
# oc -n openshift-monitoring logs -c prometheus prometheus-k8s-0 | head
ts=2022-01-17T11:22:21.268Z caller=main.go:532 level=info msg="Starting Prometheus" version="(version=2.32.1, branch=rhaos-4.10-rhel-8, revision=428381c4197ea73aa4a18b1920f9021ba8dd8b23)"

Comment 15 errata-xmlrpc 2022-03-12 04:39:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

