Description of problem:
The prometheus-k8s-{0,1} pods are in CrashLoopBackOff status. If these pods are deleted, Prometheus starts normally, but a few days later it goes into CrashLoopBackOff again with the error message below:

level=info ts=2021-10-05T01:41:12.108Z caller=main.go:418 msg="Starting Prometheus" version="(version=2.26.1, branch=rhaos-4.8-rhel-8, revision=c052078834283fc9419d39dc92fa98deca2074dd)"
level=info ts=2021-10-05T01:41:12.108Z caller=main.go:423 build_context="(go=go1.16.6, user=root@22ba12e0d3d8, date=20210729-16:32:07)"
level=info ts=2021-10-05T01:41:12.109Z caller=main.go:424 host_details="(Linux 4.18.0-305.10.2.el8_4.x86_64 #1 SMP Mon Jul 12 04:43:18 EDT 2021 x86_64 prometheus-k8s-1 (none))"
level=info ts=2021-10-05T01:41:12.109Z caller=main.go:425 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2021-10-05T01:41:12.109Z caller=main.go:426 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2021-10-05T01:41:12.185Z caller=web.go:540 component=web msg="Start listening for connections" address=127.0.0.1:9090
level=info ts=2021-10-05T01:41:12.186Z caller=main.go:795 msg="Starting TSDB ..."
level=info ts=2021-10-05T01:41:12.186Z caller=tls_config.go:191 component=web msg="TLS is disabled." http2=false
level=info ts=2021-10-05T01:41:12.210Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1631793600032 maxt=1631858400000 ulid=01FFSPM4CA77E4T8A8BE0B62N0
level=info ts=2021-10-05T01:41:12.212Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1631858400032 maxt=1631923200000 ulid=01FFVMDNYC3X486S79Q4E9HTKY
level=info ts=2021-10-05T01:41:12.213Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1631923200032 maxt=1631988000000 ulid=01FFXJ779WB4R2XK81T0W9H5GH
level=info ts=2021-10-05T01:41:12.214Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1631988000032 maxt=1632052800000 ulid=01FFZG0RMVS92P3BG9FV6D7ZCZ
level=info ts=2021-10-05T01:41:12.215Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1632052800032 maxt=1632117600000 ulid=01FG1DT8NNX3QPJY3KMXCZ9GKE
level=info ts=2021-10-05T01:41:12.216Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1632117600032 maxt=1632182400000 ulid=01FG3BKRYZY063P8EQ7XR6RPFX
level=info ts=2021-10-05T01:41:12.217Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1632182400032 maxt=1632247200000 ulid=01FGBY8TQQFT44BD6MERRMCB7D
level=info ts=2021-10-05T01:41:12.218Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1632471579722 maxt=1632506400000 ulid=01FGD0KJBTB0QETRQHNRE7W06N
level=info ts=2021-10-05T01:41:12.219Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1632506400035 maxt=1632571200000 ulid=01FGNP29DT69EPKMDFKRBWE8W1
level=info ts=2021-10-05T01:41:12.220Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1632571200033 maxt=1632592800000 ulid=01FGP0ZQ3DYZFK0EB3V030Z7RJ
level=info ts=2021-10-05T01:41:12.221Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1632811744270 maxt=1632830400000 ulid=01FGPNJYN03ATA585HX6H4V4F5
level=info ts=2021-10-05T01:41:12.222Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1632830400032 maxt=1632895200000 ulid=01FGRKCRY1SRA5ACCPFYVA38ZA
level=info ts=2021-10-05T01:41:12.224Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1632895200032 maxt=1632960000000 ulid=01FGTH6CADDF4CMCPH0HYVHEWR
level=info ts=2021-10-05T01:41:12.225Z caller=repair.go:57 component=tsdb msg="Found healthy
block" mint=1632960000032 maxt=1633024800000 ulid=01FGWEZTR3NTX1Y7BCEFSWJPR2 level=info ts=2021-10-05T01:41:12.226Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1633024800032 maxt=1633046400000 ulid=01FGX3JQFM8SC9W5YM85X52K2E level=info ts=2021-10-05T01:41:12.226Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1633068000032 maxt=1633075200000 ulid=01FGXH9VNYYBDQ9CV2P03FWP8M level=info ts=2021-10-05T01:41:12.228Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1633075200032 maxt=1633082400000 ulid=01FGXR5JXYT7DSEVFCDZW8ZQAG level=info ts=2021-10-05T01:41:12.229Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1633046400032 maxt=1633068000000 ulid=01FGXR5XCKX9RN5DQZJ5MAJ57X level=info ts=2021-10-05T01:41:12.230Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1633082400032 maxt=1633089600000 ulid=01FGXZ1A5XS06ZPQJFKAMRA2P6 level=info ts=2021-10-05T01:41:12.231Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1633089600032 maxt=1633096800000 ulid=01FGY5X1DX6QP29VEKVD5P8AQQ level=info ts=2021-10-05T01:41:12.232Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1633096800032 maxt=1633104000000 ulid=01FGYCRRNWS18WA0YNTAXABFBR level=info ts=2021-10-05T01:41:12.530Z caller=head.go:696 component=tsdb msg="Replaying on-disk memory mappable chunks if any" level=info ts=2021-10-05T01:41:12.530Z caller=head.go:710 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=3.481µs level=info ts=2021-10-05T01:41:12.530Z caller=head.go:716 component=tsdb msg="Replaying WAL, this may take a while" panic: runtime error: index out of range [0] with length 0 goroutine 297 [running]: github.com/prometheus/prometheus/tsdb/wal.NewSegmentBufReader(...) /go/src/github.com/prometheus/prometheus/tsdb/wal/wal.go:862 github.com/prometheus/prometheus/tsdb/wal.NewSegmentsRangeReader(0xc0012a9a50, 0x1, 0x1, 0x23, 0x4ef, 0x0, 0x0) /go/src/github.com/prometheus/prometheus/tsdb/wal/wal.go:844 +0x770 github.com/prometheus/prometheus/tsdb/wal.NewSegmentsReader(...) /go/src/github.com/prometheus/prometheus/tsdb/wal/wal.go:816 github.com/prometheus/prometheus/tsdb.(*Head).Init(0xc0003f8000, 0x17c3c957400, 0x0, 0x0) /go/src/github.com/prometheus/prometheus/tsdb/head.go:726 +0x13ed github.com/prometheus/prometheus/tsdb.open(0x7fff974e5138, 0xb, 0x3290520, 0xc00097e0c0, 0x32c8a90, 0xc00014ed20, 0xc0011f6ea0, 0xc00128e000, 0x3, 0xa, ...) 
/go/src/github.com/prometheus/prometheus/tsdb/db.go:695 +0x858 github.com/prometheus/prometheus/tsdb.Open(0x7fff974e5138, 0xb, 0x3290520, 0xc00097e0c0, 0x32c8a90, 0xc00014ed20, 0xc0011f6ea0, 0xc000560e48, 0x7ac9e2, 0xc0004da9c0) /go/src/github.com/prometheus/prometheus/tsdb/db.go:540 +0xbc main.openDBWithMetrics(0x7fff974e5138, 0xb, 0x3290520, 0xc00098cdb0, 0x32c8a90, 0xc00014ed20, 0xc0011f6ea0, 0x0, 0x0, 0x0) /go/src/github.com/prometheus/prometheus/cmd/prometheus/main.go:888 +0x10d main.main.func20(0x0, 0x0) /go/src/github.com/prometheus/prometheus/cmd/prometheus/main.go:801 +0x1ff github.com/oklog/run.(*Group).Run.func1(0xc0011f6f00, 0xc0005543c0, 0xc0011fd350) /go/src/github.com/prometheus/prometheus/vendor/github.com/oklog/run/group.go:38 +0x27 created by github.com/oklog/run.(*Group).Run /go/src/github.com/prometheus/prometheus/vendor/github.com/oklog/run/group.go:37 +0xbb The describe of pods is below: Name: prometheus-k8s-0 Namespace: openshift-monitoring Priority: 2000000000 Priority Class Name: system-cluster-critical Node: oci-ocpprdin03.oci-secocp.ocp.io/10.199.107.108 Start Time: Tue, 05 Oct 2021 10:40:18 +0900 Labels: app=prometheus app.kubernetes.io/component=prometheus app.kubernetes.io/instance=k8s app.kubernetes.io/managed-by=prometheus-operator app.kubernetes.io/name=prometheus app.kubernetes.io/part-of=openshift-monitoring app.kubernetes.io/version=2.26.1 controller-revision-hash=prometheus-k8s-78d55d7898 operator.prometheus.io/name=k8s operator.prometheus.io/shard=0 prometheus=k8s statefulset.kubernetes.io/pod-name=prometheus-k8s-0 Annotations: k8s.v1.cni.cncf.io/network-status: [{ "name": "openshift-sdn", "interface": "eth0", "ips": [ "40.105.1.97" ], "default": true, "dns": {} }] k8s.v1.cni.cncf.io/networks-status: [{ "name": "openshift-sdn", "interface": "eth0", "ips": [ "40.105.1.97" ], "default": true, "dns": {} }] kubectl.kubernetes.io/default-container: prometheus openshift.io/scc: nonroot workload.openshift.io/warning: only single-node clusters support workload partitioning Status: Running IP: 40.105.1.97 IPs: IP: 40.105.1.97 Controlled By: StatefulSet/prometheus-k8s Containers: prometheus: Container ID: cri-o://27f73ce1bd11c727aec89397c068b1ebc2ae925d46ba95ef95058b832af7f8e4 Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e1c622a6e8d0c61d23e3dcd01e8b4dc6108c997fe111a5966f65c43dae468e39 Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e1c622a6e8d0c61d23e3dcd01e8b4dc6108c997fe111a5966f65c43dae468e39 Port: <none> Host Port: <none> Args: --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries --config.file=/etc/prometheus/config_out/prometheus.env.yaml --storage.tsdb.path=/prometheus --storage.tsdb.retention.time=15d --web.enable-lifecycle --storage.tsdb.no-lockfile --web.external-url=https://prometheus-k8s-openshift-monitoring.apps.oci-secocp.ocp.io/ --web.route-prefix=/ --web.listen-address=127.0.0.1:9090 State: Waiting Reason: CrashLoopBackOff Last State: Terminated Reason: Error Message: ng on-disk memory mappable chunks if any" level=info ts=2021-10-07T04:20:59.999Z caller=head.go:710 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=3.875µs level=info ts=2021-10-07T04:20:59.999Z caller=head.go:716 component=tsdb msg="Replaying WAL, this may take a while" panic: runtime error: index out of range [0] with length 0 goroutine 279 [running]: github.com/prometheus/prometheus/tsdb/wal.NewSegmentBufReader(...) 
/go/src/github.com/prometheus/prometheus/tsdb/wal/wal.go:862 github.com/prometheus/prometheus/tsdb/wal.NewSegmentsRangeReader(0xc001039a50, 0x1, 0x1, 0x23, 0x5b9, 0x0, 0x0) /go/src/github.com/prometheus/prometheus/tsdb/wal/wal.go:844 +0x770 github.com/prometheus/prometheus/tsdb/wal.NewSegmentsReader(...) /go/src/github.com/prometheus/prometheus/tsdb/wal/wal.go:816 github.com/prometheus/prometheus/tsdb.(*Head).Init(0xc001454000, 0x17c56554000, 0x0, 0x0) /go/src/github.com/prometheus/prometheus/tsdb/head.go:726 +0x13ed github.com/prometheus/prometheus/tsdb.open(0x7ffc1bc75138, 0xb, 0x3290520, 0xc000b76150, 0x32c8a90, 0xc0000bad70, 0xc000204720, 0xc000b7e000, 0x3, 0xa, ...) /go/src/github.com/prometheus/prometheus/tsdb/db.go:695 +0x858 github.com/prometheus/prometheus/tsdb.Open(0x7ffc1bc75138, 0xb, 0x3290520, 0xc000b76150, 0x32c8a90, 0xc0000bad70, 0xc000204720, 0xc000fc4e48, 0x7ac9e2, 0xc0004d2800) /go/src/github.com/prometheus/prometheus/tsdb/db.go:540 +0xbc main.openDBWithMetrics(0x7ffc1bc75138, 0xb, 0x3290520, 0xc000490a50, 0x32c8a90, 0xc0000bad70, 0xc000204720, 0xc000f96958, 0x1, 0x2a) /go/src/github.com/prometheus/prometheus/cmd/prometheus/main.go:888 +0x10d main.main.func20(0x1, 0x26) /go/src/github.com/prometheus/prometheus/cmd/prometheus/main.go:801 +0x1ff github.com Exit Code: 2 Started: Thu, 07 Oct 2021 13:20:59 +0900 Finished: Thu, 07 Oct 2021 13:21:00 +0900 Ready: False Restart Count: 140 Requests: cpu: 70m memory: 1Gi Readiness: exec [sh -c if [ -x "$(command -v curl)" ]; then exec curl http://localhost:9090/-/ready; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready; else exit 1; fi] delay=0s timeout=3s period=5s #success=1 #failure=120 Environment: <none> Mounts: /etc/pki/ca-trust/extracted/pem/ from prometheus-trusted-ca-bundle (ro) /etc/prometheus/certs from tls-assets (ro) /etc/prometheus/config_out from config-out (ro) /etc/prometheus/configmaps/kubelet-serving-ca-bundle from configmap-kubelet-serving-ca-bundle (ro) /etc/prometheus/configmaps/serving-certs-ca-bundle from configmap-serving-certs-ca-bundle (ro) /etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw) /etc/prometheus/secrets/kube-etcd-client-certs from secret-kube-etcd-client-certs (ro) /etc/prometheus/secrets/kube-rbac-proxy from secret-kube-rbac-proxy (ro) /etc/prometheus/secrets/prometheus-k8s-htpasswd from secret-prometheus-k8s-htpasswd (ro) /etc/prometheus/secrets/prometheus-k8s-proxy from secret-prometheus-k8s-proxy (ro) /etc/prometheus/secrets/prometheus-k8s-thanos-sidecar-tls from secret-prometheus-k8s-thanos-sidecar-tls (ro) /etc/prometheus/secrets/prometheus-k8s-tls from secret-prometheus-k8s-tls (ro) /prometheus from prometheus-k8s-db (rw,path="prometheus-db") /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x4cft (ro) config-reloader: Container ID: cri-o://9af8a016d4459effb75e90c2bfba2708f339c6790940a1b7db1afa1073a6dc69 Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0bbe9ebac91cf9f0e2909dd022e83a9e0f2011bdbfb4890c869df39405fb931f Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0bbe9ebac91cf9f0e2909dd022e83a9e0f2011bdbfb4890c869df39405fb931f Port: <none> Host Port: <none> Command: /bin/prometheus-config-reloader Args: --listen-address=localhost:8080 --reload-url=http://localhost:9090/-/reload --config-file=/etc/prometheus/config/prometheus.yaml.gz --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml --watched-dir=/etc/prometheus/rules/prometheus-k8s-rulefiles-0 
State: Running Started: Tue, 05 Oct 2021 10:40:22 +0900 Ready: True Restart Count: 0 Requests: cpu: 1m memory: 10Mi Environment: POD_NAME: prometheus-k8s-0 (v1:metadata.name) SHARD: 0 Mounts: /etc/prometheus/config from config (rw) /etc/prometheus/config_out from config-out (rw) /etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x4cft (ro) thanos-sidecar: Container ID: cri-o://f2c67741d96bf03b6631caba6fcf56b77acb66ccebd606785493aa0a07c45fa1 Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fedb316063c42baa890555591cb49a151c24ba02d347d8ccccfafd12ba781067 Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fedb316063c42baa890555591cb49a151c24ba02d347d8ccccfafd12ba781067 Ports: 10902/TCP, 10901/TCP Host Ports: 0/TCP, 0/TCP Args: sidecar --prometheus.url=http://localhost:9090/ --tsdb.path=/prometheus --grpc-address=[$(POD_IP)]:10901 --http-address=127.0.0.1:10902 --grpc-server-tls-cert=/etc/tls/grpc/server.crt --grpc-server-tls-key=/etc/tls/grpc/server.key --grpc-server-tls-client-ca=/etc/tls/grpc/ca.crt State: Running Started: Tue, 05 Oct 2021 10:40:23 +0900 Ready: True Restart Count: 0 Requests: cpu: 1m memory: 25Mi Environment: POD_IP: (v1:status.podIP) Mounts: /etc/tls/grpc from secret-grpc-tls (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x4cft (ro) prometheus-proxy: Container ID: cri-o://51f1cf269a9a0d060ff0ccd84b199268974b6f5fbb63e8516b56f050f2d2d02a Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:864b658b93adf38b3b6613225f00e8f3236299cb2f2f02aa16cf6b43eaa19229 Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:864b658b93adf38b3b6613225f00e8f3236299cb2f2f02aa16cf6b43eaa19229 Port: 9091/TCP Host Port: 0/TCP Args: -provider=openshift -https-address=:9091 -http-address= -email-domain=* -upstream=http://localhost:9090 -htpasswd-file=/etc/proxy/htpasswd/auth -openshift-service-account=prometheus-k8s -openshift-sar={"resource": "namespaces", "verb": "get"} -openshift-delegate-urls={"/": {"resource": "namespaces", "verb": "get"}} -tls-cert=/etc/tls/private/tls.crt -tls-key=/etc/tls/private/tls.key -client-secret-file=/var/run/secrets/kubernetes.io/serviceaccount/token -cookie-secret-file=/etc/proxy/secrets/session_secret -openshift-ca=/etc/pki/tls/cert.pem -openshift-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt State: Running Started: Tue, 05 Oct 2021 10:40:23 +0900 Ready: True Restart Count: 0 Requests: cpu: 1m memory: 20Mi Environment: HTTP_PROXY: HTTPS_PROXY: NO_PROXY: Mounts: /etc/pki/ca-trust/extracted/pem/ from prometheus-trusted-ca-bundle (ro) /etc/proxy/htpasswd from secret-prometheus-k8s-htpasswd (rw) /etc/proxy/secrets from secret-prometheus-k8s-proxy (rw) /etc/tls/private from secret-prometheus-k8s-tls (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x4cft (ro) kube-rbac-proxy: Container ID: cri-o://d73c51bda80f74ab226c79e9ddcbc041286d93c7b64e3ea6876dc5717f5ff6fd Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6d57bfd91fac9b68eb72d27226bc297472ceb136c996628b845ecc54a48b31cb Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6d57bfd91fac9b68eb72d27226bc297472ceb136c996628b845ecc54a48b31cb Port: 9092/TCP Host Port: 0/TCP Args: --secure-listen-address=0.0.0.0:9092 --upstream=http://127.0.0.1:9095 --config-file=/etc/kube-rbac-proxy/config.yaml --tls-cert-file=/etc/tls/private/tls.crt --tls-private-key-file=/etc/tls/private/tls.key 
--tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305 --logtostderr=true --v=10 State: Running Started: Tue, 05 Oct 2021 10:40:23 +0900 Ready: True Restart Count: 0 Requests: cpu: 1m memory: 15Mi Environment: <none> Mounts: /etc/kube-rbac-proxy from secret-kube-rbac-proxy (rw) /etc/tls/private from secret-prometheus-k8s-tls (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x4cft (ro) prom-label-proxy: Container ID: cri-o://3e1db3eb81a79e21fad603e426b8d2adcb9c7813977a53d84150dfbfdb3f0b95 Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1e9b9b51d11596199cd6a1aebc4c7b6080d3c250b007dbbf083d22e020588b9b Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1e9b9b51d11596199cd6a1aebc4c7b6080d3c250b007dbbf083d22e020588b9b Port: <none> Host Port: <none> Args: --insecure-listen-address=127.0.0.1:9095 --upstream=http://127.0.0.1:9090 --label=namespace State: Running Started: Tue, 05 Oct 2021 10:40:24 +0900 Ready: True Restart Count: 0 Requests: cpu: 1m memory: 15Mi Environment: <none> Mounts: /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x4cft (ro) kube-rbac-proxy-thanos: Container ID: cri-o://81549d73c214f0262075e1293d1dc1610b934ae68a91e424171b4aec7a1e3732 Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6d57bfd91fac9b68eb72d27226bc297472ceb136c996628b845ecc54a48b31cb Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6d57bfd91fac9b68eb72d27226bc297472ceb136c996628b845ecc54a48b31cb Port: 10902/TCP Host Port: 0/TCP Args: --secure-listen-address=[$(POD_IP)]:10902 --upstream=http://127.0.0.1:10902 --tls-cert-file=/etc/tls/private/tls.crt --tls-private-key-file=/etc/tls/private/tls.key --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305 --allow-paths=/metrics --logtostderr=true State: Running Started: Tue, 05 Oct 2021 10:40:24 +0900 Ready: True Restart Count: 0 Requests: cpu: 1m memory: 10Mi Environment: POD_IP: (v1:status.podIP) Mounts: /etc/tls/private from secret-prometheus-k8s-thanos-sidecar-tls (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x4cft (ro) Conditions: Type Status Initialized True Ready False ContainersReady False PodScheduled True Volumes: prometheus-k8s-db: Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace) ClaimName: prometheus-k8s-db-prometheus-k8s-0 ReadOnly: false config: Type: Secret (a volume populated by a Secret) SecretName: prometheus-k8s Optional: false tls-assets: Type: Secret (a volume populated by a Secret) SecretName: prometheus-k8s-tls-assets Optional: false config-out: Type: EmptyDir (a temporary directory that shares a pod's lifetime) Medium: SizeLimit: <unset> prometheus-k8s-rulefiles-0: Type: ConfigMap (a volume populated by a ConfigMap) Name: prometheus-k8s-rulefiles-0 Optional: false secret-kube-etcd-client-certs: Type: Secret (a volume populated by a Secret) SecretName: kube-etcd-client-certs Optional: false secret-prometheus-k8s-tls: Type: Secret (a volume populated by a Secret) SecretName: prometheus-k8s-tls Optional: false secret-prometheus-k8s-proxy: Type: Secret (a volume populated by a Secret) SecretName: 
prometheus-k8s-proxy Optional: false secret-prometheus-k8s-htpasswd: Type: Secret (a volume populated by a Secret) SecretName: prometheus-k8s-htpasswd Optional: false secret-prometheus-k8s-thanos-sidecar-tls: Type: Secret (a volume populated by a Secret) SecretName: prometheus-k8s-thanos-sidecar-tls Optional: false secret-kube-rbac-proxy: Type: Secret (a volume populated by a Secret) SecretName: kube-rbac-proxy Optional: false configmap-serving-certs-ca-bundle: Type: ConfigMap (a volume populated by a ConfigMap) Name: serving-certs-ca-bundle Optional: false configmap-kubelet-serving-ca-bundle: Type: ConfigMap (a volume populated by a ConfigMap) Name: kubelet-serving-ca-bundle Optional: false secret-grpc-tls: Type: Secret (a volume populated by a Secret) SecretName: prometheus-k8s-grpc-tls-5n5q32u80g297 Optional: false prometheus-trusted-ca-bundle: Type: ConfigMap (a volume populated by a ConfigMap) Name: prometheus-trusted-ca-bundle-d34s91lhv300e Optional: true kube-api-access-x4cft: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 3607 ConfigMapName: kube-root-ca.crt ConfigMapOptional: <nil> DownwardAPI: true ConfigMapName: openshift-service-ca.crt ConfigMapOptional: <nil> QoS Class: Burstable Node-Selectors: node-role.kubernetes.io/infra= Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists node.kubernetes.io/not-ready:NoExecute op=Exists for 300s node.kubernetes.io/unreachable:NoExecute op=Exists for 300s Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Pulled 38m (x133 over 2d2h) kubelet Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e1c622a6e8d0c61d23e3dcd01e8b4dc6108c997fe111a5966f65c43dae468e39" already present on machine Warning BackOff 3m27s (x3267 over 11h) kubelet Back-off restarting failed container

I think the error logs are similar to those in this GitHub issue: https://github.com/prometheus/prometheus/issues/6976
So I suggested removing all files in the 'wal' directory and restarting Prometheus. The Prometheus pods started successfully, but a few days later they went into CrashLoopBackOff again with the error above.

Version-Release number of selected component (if applicable):
OCP: 4.8.4
Prometheus: 2.26.1

How reproducible:

Steps to Reproduce:
1. Delete the prometheus-k8s-{0,1} pods.
2. prometheus-k8s-{0,1} are recreated automatically.
3. One or two days later, prometheus-k8s-{0,1} are in CrashLoopBackOff again.

Actual results:
Because the Prometheus pods crash, the OCP cluster cannot be monitored and HPA cannot be used.

Expected results:
The Prometheus pods run without crashing.

Additional info:
- Prometheus is using NFS storage provided by Oracle Cloud.
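For illustration, the panic above comes from the WAL segment-buffer reader indexing the first element of an empty segment list (wal.go:862 in the trace), which can happen when the WAL or checkpoint directory exists but contains no readable segment files. Below is a minimal, self-contained Go sketch of that failure mode; listSegments and newSegmentBufReader are illustrative stand-ins, not the actual prometheus/tsdb/wal functions.

// Simplified sketch of the failure mode (illustrative only; not the
// actual prometheus/tsdb/wal source).
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// listSegments stands in for the WAL/checkpoint directory scan. On the
// affected volume the directory exists but contains no readable segment
// files, so the returned slice is empty.
func listSegments(dir string) []string {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil
	}
	var segs []string
	for _, e := range entries {
		if !e.IsDir() {
			segs = append(segs, filepath.Join(dir, e.Name()))
		}
	}
	return segs
}

// newSegmentBufReader mirrors the shape of the constructor in the stack
// trace: it indexes the first segment without checking the length, which
// is what produces "index out of range [0] with length 0" for an empty slice.
func newSegmentBufReader(segs ...string) string {
	return segs[0]
}

func main() {
	dir, err := os.MkdirTemp("", "checkpoint") // hypothetical empty checkpoint directory
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	segs := listSegments(dir)
	fmt.Println(newSegmentBufReader(segs...)) // panics: index out of range [0] with length 0
}

Running this panics with the same "index out of range [0] with length 0" message seen in the pod logs.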
(In reply to Hakyong Do from comment #0)
> Additional info:
> - Prometheus is using NFS storage provided by Oracle Cloud.

Using NFS as storage is not recommended.
Hi,
The customer doesn't have object storage, and block storage (like hostPath) can't keep the data available across monitoring nodes, so they use NFS storage as the PV.
Is there any possibility that this is a Prometheus bug, or does this issue happen only because they use NFS storage?
If you need any further information to analyze this issue, please let me know. Thank you.
Hi, as per the Prometheus docs (https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects), NFS is not a supported option. The related upstream issue (https://github.com/prometheus/prometheus/issues/6976) was marked as resolved in v2.20.0.
Hi,
The Prometheus version they use is 2.26.1, so I think this is not the same issue as the upstream one. Thank you for your attention.
FYI: https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects
CAUTION: Non-POSIX compliant filesystems are not supported for Prometheus' local storage as unrecoverable corruptions may happen. NFS filesystems (including AWS's EFS) are not supported. NFS could be POSIX-compliant, but most implementations are not. It is strongly recommended to use a local filesystem for reliability.
@hdo To isolate whether this issue is due to NFS, would it be possible to run Prometheus without persistent storage and see if the issue recurs? As Junqi mentioned above, Prometheus doesn't support NFS; however, we will certainly raise an upstream bug fix to shut down gracefully on a WAL replay error instead of panicking.
I asked the customer to use local storage instead of NFS. I will update you if the same issue occurs again. Thank you!!
Upstream PR https://github.com/prometheus/prometheus/pull/9687 may fix the bug; it certainly does if the checkpoint directory is empty.
@hdo, next time this occurs, could you please run `tree` on the data directory?
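For context, the behaviour we want from such a fix is that an empty segment list during WAL replay surfaces as an ordinary error Prometheus can log and recover from, rather than a panic that crashes the pod. A rough Go sketch of that shape is below; it is only an illustration under that assumption, not the actual diff from PR 9687, and newSegmentBufReaderChecked is a hypothetical name.

// Illustrative sketch only; not the actual change from PR 9687.
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
)

// ErrNoSegments reports that there were no WAL/checkpoint segments to read.
var ErrNoSegments = errors.New("wal: no segments to read")

type segmentBufReader struct {
	buf  *bufio.Reader
	segs []io.Reader
}

// newSegmentBufReaderChecked is a hypothetical checked variant of the
// constructor seen in the stack trace: it refuses an empty segment list
// up front instead of indexing segs[0] unconditionally.
func newSegmentBufReaderChecked(segs ...io.Reader) (*segmentBufReader, error) {
	if len(segs) == 0 {
		return nil, ErrNoSegments
	}
	return &segmentBufReader{
		buf:  bufio.NewReaderSize(segs[0], 16*4096), // buffered reader over the first segment
		segs: segs,
	}, nil
}

func main() {
	// Empty WAL/checkpoint: the caller gets an error it can log and handle,
	// e.g. by treating it as "nothing to replay", rather than a panic.
	if _, err := newSegmentBufReaderChecked(); err != nil {
		fmt.Println("WAL replay skipped:", err)
	}
}

With a guard of this shape, the WAL replay path could treat an empty checkpoint/WAL directory as "nothing to replay" instead of aborting the whole process.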
Although the patch is merged upstream, the downstream fork will get the fix only after Prometheus 2.32.0 is released. Hence setting the status back to ASSIGNED.
The Prometheus 2.32.0 update has been merged downstream, hence moving this bug to MODIFIED.
Tested with 4.10.0-0.nightly-2022-01-17-182202; the Prometheus version is 2.32.1 and there are no regression issues.
# oc -n openshift-monitoring logs -c prometheus prometheus-k8s-0 | head
ts=2022-01-17T11:22:21.268Z caller=main.go:532 level=info msg="Starting Prometheus" version="(version=2.32.1, branch=rhaos-4.10-rhel-8, revision=428381c4197ea73aa4a18b1920f9021ba8dd8b23)"
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056