Description of problem:
-----------------------
During continuous ocs-osd-controller-manager pod restarts in the provider cluster, it was observed that the prometheus-managed-ocs-prometheus-0 pod kept fluctuating between OOMKilled -> CrashLoopBackOff -> Running state.

$ oc get pods -w
prometheus-managed-ocs-prometheus-0           2/2   Running            25 (5m11s ago)   2d3h
prometheus-managed-ocs-prometheus-0           1/2   OOMKilled          25 (5m17s ago)   2d3h
ocs-osd-controller-manager-576fd978f4-7jczn   3/3   Terminating        0                25s
ocs-osd-controller-manager-576fd978f4-7vzfb   0/3   Pending            0
prometheus-managed-ocs-prometheus-0           1/2   CrashLoopBackOff   25 (4s ago)      2d3h

The following were seen in the oc describe output of the prometheus pod:

Containers:
  prometheus:
    Container ID:  cri-o://172cac7ba9dc27c353c7c7088deaeeb4fcd8d6a5481e3df929698855a5f8b314
    Image:         registry.redhat.io/openshift4/ose-prometheus:v4.8.0-202108130208.p0.git.f3beb88.assembly.stream
    Image ID:      registry.redhat.io/openshift4/ose-prometheus@sha256:de65c4885d8de2b47fd6a9f90f824f40f289cfca16ff499312e39d291af3a672
    Port:          9090/TCP
    Host Port:     0/TCP
    Args:
      --web.console.templates=/etc/prometheus/consoles
      --web.console.libraries=/etc/prometheus/console_libraries
      --config.file=/etc/prometheus/config_out/prometheus.env.yaml
      --storage.tsdb.path=/prometheus
      --storage.tsdb.retention.time=24h
      --web.enable-lifecycle
      --storage.tsdb.no-lockfile
      --web.route-prefix=/
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    OOMKilled
      Message:   768 component=tsdb msg="WAL segment loaded" segment=25 maxSegment=29
level=info ts=2022-04-13T08:42:49.499Z caller=head.go:768 component=tsdb msg="WAL segment loaded" segment=26 maxSegment=29
level=info ts=2022-04-13T08:42:49.509Z caller=head.go:768 component=tsdb msg="WAL segment loaded" segment=27 maxSegment=29
level=info ts=2022-04-13T08:42:49.509Z caller=head.go:768 component=tsdb msg="WAL segment loaded" segment=28 maxSegment=29
level=info ts=2022-04-13T08:42:49.510Z caller=head.go:768 component=tsdb msg="WAL segment loaded" segment=29 maxSegment=29
level=info ts=2022-04-13T08:42:49.510Z caller=head.go:773 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=33.326579ms wal_replay_duration=2.870536075s total_replay_duration=2.908673001s
level=info ts=2022-04-13T08:42:49.590Z caller=main.go:815 fs_type=XFS_SUPER_MAGIC
level=info ts=2022-04-13T08:42:49.590Z caller=main.go:818 msg="TSDB started"
level=info ts=2022-04-13T08:42:49.590Z caller=main.go:944 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2022-04-13T08:42:49.592Z caller=kubernetes.go:266 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-04-13T08:42:49.594Z caller=kubernetes.go:266 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-04-13T08:42:49.596Z caller=kubernetes.go:266 component="discovery manager notify" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-04-13T08:42:49.612Z caller=main.go:975 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=22.411756ms remote_storage=2.57µs web_handler=717ns query_engine=757ns scrape=205.332µs scrape_sd=3.678399ms notify=21.891µs notify_sd=2.117141ms rules=14.810505ms
level=info ts=2022-04-13T08:42:49.612Z caller=main.go:767 msg="Server is ready to receive web requests."
Events:
  Type     Reason      Age                    From     Message
  ----     ------      ----                   ----     -------
  Normal   Started     5m42s (x4 over 2d1h)   kubelet  Started container prometheus
  Warning  ProbeError  5m39s (x6 over 6m25s)  kubelet  Readiness probe error: HTTP probe failed with statuscode: 503
body: Service Unavailable
  Warning  Unhealthy   5m39s (x6 over 6m25s)  kubelet  Readiness probe failed: HTTP probe failed with statuscode: 503
  Normal   Created     4m47s (x5 over 2d1h)   kubelet  Created container prometheus
  Normal   Pulled      4m47s (x4 over 2d1h)   kubelet  Container image "registry.redhat.io/openshift4/ose-prometheus:v4.8.0-202108130208.p0.git.f3beb88.assembly.stream" already present on machine
  Warning  BackOff     77s (x18 over 5m56s)   kubelet  Back-off restarting failed container

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
OCP: 4.10.5

ODF:
$ oc get csv --show-labels
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE        LABELS
mcg-operator.v4.10.0                      NooBaa Operator               4.10.0                                                      Succeeded    operators.coreos.com/mcg-operator.openshift-storage=
ocs-operator.v4.10.0                      OpenShift Container Storage   4.10.0                                                      Succeeded    full_version=4.10.0-219,operatorframework.io/arch.amd64=supported,operatorframework.io/arch.ppc64le=supported,operatorframework.io/arch.s390x=supported,operators.coreos.com/ocs-operator.openshift-storage=
ocs-osd-deployer.v2.0.0                   OCS OSD Deployer              2.0.0                                                       Installing   operators.coreos.com/ocs-osd-deployer.openshift-storage=
odf-operator.v4.10.0                      OpenShift Data Foundation     4.10.0                                                      Succeeded    full_version=4.10.0-219,operatorframework.io/arch.amd64=supported,operatorframework.io/arch.ppc64le=supported,operatorframework.io/arch.s390x=supported,operators.coreos.com/odf-operator.openshift-storage=
ose-prometheus-operator.4.8.0             Prometheus Operator           4.8.0                                                       Succeeded    operators.coreos.com/ose-prometheus-operator.openshift-storage=
route-monitor-operator.v0.1.408-c2256a2   Route Monitor Operator        0.1.408-c2256a2   route-monitor-operator.v0.1.406-54ff884   Succeeded    olm.copiedFrom=openshift-route-monitor-operator

$ oc describe csv ocs-osd-deployer.v2.0.0 | grep -i image
Image:  quay.io/openshift/origin-kube-rbac-proxy:4.10.0
Image:  quay.io/osd-addons/ocs-osd-deployer:2.0.0-8
Image:  quay.io/osd-addons/ocs-osd-deployer:2.0.0-8

$ oc describe catsrc redhat-operators | grep -i image
f:image:
Image:  quay.io/rhceph-dev/odf4-odf-compose-operator-bundle-index:4.10.0-219

How reproducible:
-----------------
3/3

Steps to Reproduce:
-------------------
1. Restart ocs-osd-controller-manager pod in a loop in the provider cluster:
   while true; do for i in `oc get pods | grep ocs-osd-controller | cut -d " " -f1`; do echo $i; oc delete pod $i; sleep 20; done; done
2. Keep checking the status of the pods in the provider cluster

Actual results:
---------------
During the ocs-osd-controller-manager pod restarts, the prometheus-managed-ocs-prometheus pod started going into OOMKilled -> CLBO state
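For anyone re-running the reproducer, a minimal watch loop to track the Prometheus pod's memory and restart count while the deletion loop from step 1 runs (a sketch only; it assumes the openshift-storage namespace and that the metrics API behind `oc adm top` is available on the cluster):

```
# Print the Prometheus pod's memory usage and restart state every 30s.
while true; do
  date
  oc adm top pods -n openshift-storage | grep prometheus-managed-ocs-prometheus
  oc get pods -n openshift-storage | grep prometheus-managed-ocs-prometheus
  sleep 30
done
```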
@rgeorge I tried to reproduce this bug, but there was no error on the prometheus-managed-ocs pod; for me, it didn't even restart. Are you still able to reproduce this?
@omitrani @rg
I was able to reproduce this issue, and the problem is not with the prometheus-managed-ocs-prom-0 pod but with the way our operator consumes resources. When we run the script that deletes the ocs-osd-deployer pod again and again, every re-initialization increases every pod's memory usage by some percentage. After a certain time, when the memory of prometheus-managed-ocs-prom-0 exceeds its memory limit (currently 150Mb), it goes into the OOMKilled state. So if we keep running this command long enough, at some point most of the pods with small memory limits will end up in the same state (though re-creating the deployer pod more than 100-200 times is a very rare case). One solution I can think of would be to increase the resource limit for the Prometheus pod. @omitrani @dbindra
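To confirm the limit in effect on a given cluster (a sketch, using the pod and container names from the description above; the value is whatever the deployer currently sets, so it may differ from 150Mb):

```
# Memory limit on the "prometheus" container of the affected pod.
oc get pod prometheus-managed-ocs-prometheus-0 -n openshift-storage \
  -o jsonpath='{.spec.containers[?(@.name=="prometheus")].resources.limits.memory}'
```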
@akkhanna If I understood your last comment correctly, what you describe is very alarming: it means that no matter what resource limits we set for the pods, at some point, after some number of ocs-osd-operator restarts, all pods will go into OOM. We need an RCA for why all of the pods increase their memory consumption after an ocs-osd-deployer restart in order to come up with a proper mitigation.
It looks like Prometheus needs more memory than the limit we set (200M); for example, if you look at node ip-10-0-160-105.ec2.internal you can see it using 300M. The memory usage of the operator depends on the scraping interval and the number of metrics. Akshay, can you try increasing the pod limit to 400M and see if this still reproduces?
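If someone wants to test that quickly, one possible way is to patch the Prometheus CR that the prometheus-operator renders into the prometheus-managed-ocs-prometheus-0 pod. This is a sketch only: the CR name `managed-ocs-prometheus` is inferred from the pod name, and the ocs-osd-deployer may reconcile a manual change away, so the durable fix would be in the deployer's defaults.

```
# Raise the memory request/limit on the (assumed) Prometheus CR to 400Mi;
# the deployer may revert this on its next reconcile.
oc patch prometheus managed-ocs-prometheus -n openshift-storage --type merge \
  -p '{"spec":{"resources":{"requests":{"memory":"400Mi"},"limits":{"memory":"400Mi"}}}}'
```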
Orit, I tried raising the memory limit to 350Mb, but it still didn't work. We can try setting it to 400Mb, but I doubt that would help, because the memory just keeps climbing higher and higher.
Created a JIRA issue on ODF side to track the progress https://issues.redhat.com/browse/RHSTOR-3243
@akkhanna There is automatic mirroring of BZs to JIRA, so there is no need to create another JIRA ticket. Please close the one you created and keep updating here; everything here will be mirrored to the JIRA ticket as well.
"When we run the script to delete the ocs-osd-deployer-pod again and again" - why would we do it? What is again and again? 10 times or 100 times?
Yaniv, this is all part of the RCA process for this bug. The initial symptom was that deployer restarts caused a memory increase in the Prometheus instance until the instance got OOM killed, so we created a script for the reproduction. I do not think the restart count is relevant for the RCA, but it is relevant for deciding the severity of the issue.
Hello, we believe this issue (or a similar one) is happening on `r-eu-prd-01.l4x7.p1.openshiftapps.com-ocs-consumer`; see https://redhat.pagerduty.com/incidents/Q0FWZQZV1DVLX9. This is a production cluster, and because of this issue it has not had any monitoring since at least August 22, 2022, and likely much longer. The prometheus pod is in CrashLoopBackOff because the prometheus container is constantly getting OOMKilled. We have also seen this alert triggered on `r-us-tst-odf`, but it is currently not triggered. These alerts can also be especially disruptive to the SRE team, because the page can self-resolve and then re-trigger, creating a new page for us, possibly in the middle of the night.
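For whoever triages a page like this, a quick check (a sketch, assuming the openshift-storage namespace and the container name `prometheus` from the describe output above) to confirm the crash loop is really OOM-driven:

```
# Why the prometheus container last died, and how many times it has restarted.
oc get pod prometheus-managed-ocs-prometheus-0 -n openshift-storage \
  -o jsonpath='{.status.containerStatuses[?(@.name=="prometheus")].lastState.terminated.reason}{"\n"}'
oc get pod prometheus-managed-ocs-prometheus-0 -n openshift-storage \
  -o jsonpath='{.status.containerStatuses[?(@.name=="prometheus")].restartCount}{"\n"}'
```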
Update: a page was just triggered and then self-resolved on `r-eu-tst-01` https://redhat.pagerduty.com/incidents/Q155TFRJ8PALNQ
Here's the must-gather output of the above problem, which happened again with another cluster: https://drive.google.com/drive/folders/1m04Yxmp_kOn51J2n08e3OJo0LJ0luNIZ?usp=sharing

Cluster: 1s84t100kags7mmg8t4428kob6ufncuc (r-eu-tst-03, https://api.r-eu-tst-03.12w0.p1.openshiftapps.com:6443, 4.10.14, rosa, aws, eu-west-1, ready)

Here are the executions of a few commands:
```
❯ oc get pods -n openshift-storage | grep "prom"
prometheus-managed-ocs-prometheus-0    2/3   CrashLoopBackOff   270 (4m53s ago)   6d19h
prometheus-operator-8547cc9f89-mx9bp   1/1   Running            0                 97d

❯ oc get pods -n openshift-storage | grep "ocs-operator"
ocs-operator-79bd8d6464-2fm9l          0/1   CrashLoopBackOff   400 (3m20s ago)   37h
```
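A possible next step (a sketch, assuming live access to the cluster rather than only the must-gather; the pod names are the ones from the output above and will differ on other clusters) is to pull the logs of the previously crashed container instances for both pods:

```
# Logs from the previous (crashed) container instances.
oc logs prometheus-managed-ocs-prometheus-0 -c prometheus -n openshift-storage --previous
oc logs ocs-operator-79bd8d6464-2fm9l -n openshift-storage --previous
```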