Bug 2074938

Summary: prometheus pod goes into OOMKilled state during ocs-osd-controller-manager pod restarts
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Rachael <rgeorge>
Component: odf-managed-service    Assignee: Ohad <omitrani>
Status: ASSIGNED --- QA Contact: Elena Bondarenko <ebondare>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.10    CC: aeyal, dbindra, fbalak, gathomas, mmuench, nberry, odf-bz-bot, owasserm, ykukreja
Target Milestone: ---    Keywords: Tracking
Target Release: ---    Flags: omitrani: needinfo? (akkhanna)
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2089342 (view as bug list) Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2089342, 2123653    

Description Rachael 2022-04-13 10:37:12 UTC
Description of problem:
-----------------------

During continuous ocs-osd-controller-manager pod restarts in the provider cluster, it was observed that the prometheus-managed-ocs-prometheus-0 pod kept cycling through the OOMKilled -> CrashLoopBackOff -> Running states.


$ oc get pods -w

prometheus-managed-ocs-prometheus-0                               2/2     Running             25 (5m11s ago)   2d3h
prometheus-managed-ocs-prometheus-0                               1/2     OOMKilled           25 (5m17s ago)   2d3h
ocs-osd-controller-manager-576fd978f4-7jczn                       3/3     Terminating         0                25s
ocs-osd-controller-manager-576fd978f4-7vzfb                       0/3     Pending             0  
prometheus-managed-ocs-prometheus-0                               1/2     CrashLoopBackOff    25 (4s ago)      2d3h


The following were seen in the oc describe output of the prometheus pod:

Containers:
  prometheus:
    Container ID:  cri-o://172cac7ba9dc27c353c7c7088deaeeb4fcd8d6a5481e3df929698855a5f8b314
    Image:         registry.redhat.io/openshift4/ose-prometheus:v4.8.0-202108130208.p0.git.f3beb88.assembly.stream
    Image ID:      registry.redhat.io/openshift4/ose-prometheus@sha256:de65c4885d8de2b47fd6a9f90f824f40f289cfca16ff499312e39d291af3a672
    Port:          9090/TCP
    Host Port:     0/TCP
    Args:
      --web.console.templates=/etc/prometheus/consoles
      --web.console.libraries=/etc/prometheus/console_libraries
      --config.file=/etc/prometheus/config_out/prometheus.env.yaml
      --storage.tsdb.path=/prometheus
      --storage.tsdb.retention.time=24h
      --web.enable-lifecycle
      --storage.tsdb.no-lockfile
      --web.route-prefix=/
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    OOMKilled
      Message:   768 component=tsdb msg="WAL segment loaded" segment=25 maxSegment=29
level=info ts=2022-04-13T08:42:49.499Z caller=head.go:768 component=tsdb msg="WAL segment loaded" segment=26 maxSegment=29
level=info ts=2022-04-13T08:42:49.509Z caller=head.go:768 component=tsdb msg="WAL segment loaded" segment=27 maxSegment=29
level=info ts=2022-04-13T08:42:49.509Z caller=head.go:768 component=tsdb msg="WAL segment loaded" segment=28 maxSegment=29
level=info ts=2022-04-13T08:42:49.510Z caller=head.go:768 component=tsdb msg="WAL segment loaded" segment=29 maxSegment=29
level=info ts=2022-04-13T08:42:49.510Z caller=head.go:773 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=33.326579ms wal_replay_duration=2.870536075s total_replay_duration=2.908673001s
level=info ts=2022-04-13T08:42:49.590Z caller=main.go:815 fs_type=XFS_SUPER_MAGIC
level=info ts=2022-04-13T08:42:49.590Z caller=main.go:818 msg="TSDB started"
level=info ts=2022-04-13T08:42:49.590Z caller=main.go:944 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2022-04-13T08:42:49.592Z caller=kubernetes.go:266 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-04-13T08:42:49.594Z caller=kubernetes.go:266 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-04-13T08:42:49.596Z caller=kubernetes.go:266 component="discovery manager notify" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2022-04-13T08:42:49.612Z caller=main.go:975 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=22.411756ms remote_storage=2.57µs web_handler=717ns query_engine=757ns scrape=205.332µs scrape_sd=3.678399ms notify=21.891µs notify_sd=2.117141ms rules=14.810505ms
level=info ts=2022-04-13T08:42:49.612Z caller=main.go:767 msg="Server is ready to receive web requests."



Events:
  Type     Reason      Age                    From     Message
  ----     ------      ----                   ----     -------
  Normal   Started     5m42s (x4 over 2d1h)   kubelet  Started container prometheus
  Warning  ProbeError  5m39s (x6 over 6m25s)  kubelet  Readiness probe error: HTTP probe failed with statuscode: 503
body: Service Unavailable
  Warning  Unhealthy  5m39s (x6 over 6m25s)  kubelet  Readiness probe failed: HTTP probe failed with statuscode: 503
  Normal   Created    4m47s (x5 over 2d1h)   kubelet  Created container prometheus
  Normal   Pulled     4m47s (x4 over 2d1h)   kubelet  Container image "registry.redhat.io/openshift4/ose-prometheus:v4.8.0-202108130208.p0.git.f3beb88.assembly.stream" already present on machine
  Warning  BackOff    77s (x18 over 5m56s)   kubelet  Back-off restarting failed container



Version-Release number of selected component (if applicable):
-------------------------------------------------------------
OCP: 4.10.5
ODF: 
$ oc get csv --show-labels
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE        LABELS
mcg-operator.v4.10.0                      NooBaa Operator               4.10.0                                                      Succeeded    operators.coreos.com/mcg-operator.openshift-storage=
ocs-operator.v4.10.0                      OpenShift Container Storage   4.10.0                                                      Succeeded    full_version=4.10.0-219,operatorframework.io/arch.amd64=supported,operatorframework.io/arch.ppc64le=supported,operatorframework.io/arch.s390x=supported,operators.coreos.com/ocs-operator.openshift-storage=
ocs-osd-deployer.v2.0.0                   OCS OSD Deployer              2.0.0                                                       Installing   operators.coreos.com/ocs-osd-deployer.openshift-storage=
odf-operator.v4.10.0                      OpenShift Data Foundation     4.10.0                                                      Succeeded    full_version=4.10.0-219,operatorframework.io/arch.amd64=supported,operatorframework.io/arch.ppc64le=supported,operatorframework.io/arch.s390x=supported,operators.coreos.com/odf-operator.openshift-storage=
ose-prometheus-operator.4.8.0             Prometheus Operator           4.8.0                                                       Succeeded    operators.coreos.com/ose-prometheus-operator.openshift-storage=
route-monitor-operator.v0.1.408-c2256a2   Route Monitor Operator        0.1.408-c2256a2   route-monitor-operator.v0.1.406-54ff884   Succeeded    olm.copiedFrom=openshift-route-monitor-operator

$ oc describe csv ocs-osd-deployer.v2.0.0|grep -i image
                Image:  quay.io/openshift/origin-kube-rbac-proxy:4.10.0
                Image:             quay.io/osd-addons/ocs-osd-deployer:2.0.0-8
                Image:             quay.io/osd-addons/ocs-osd-deployer:2.0.0-8

$ oc describe catsrc redhat-operators |grep -i image
        f:image:
  Image:         quay.io/rhceph-dev/odf4-odf-compose-operator-bundle-index:4.10.0-219



How reproducible:
-----------------
3/3


Steps to Reproduce:
-------------------
1. Restart ocs-osd-controller-manager pod in a loop in the provider cluster:

while true; do
  for i in $(oc get pods | grep ocs-osd-controller | cut -d " " -f1); do
    echo $i
    oc delete pod $i
    sleep 20
  done
done

2. Keep checking the status of the pods in the provider cluster
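
For example (illustrative only; the namespace is assumed to be openshift-storage, as in the later comments), step 2 can be done with:

$ oc -n openshift-storage get pods -w | grep -E 'prometheus-managed-ocs|ocs-osd-controller'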

Actual results:
---------------
During the ocs-osd-controller-manager pod restarts, the prometheus-managed-ocs-prometheus-0 pod repeatedly went into the OOMKilled -> CrashLoopBackOff state.

Comment 3 Akshay Khanna 2022-04-21 12:34:50 UTC
@rgeorge I tried to reproduce this bug, but there was no error on the prometheus-managed-ocs pod. For me, it did not even restart. Are you still able to reproduce this?

Comment 4 Akshay Khanna 2022-04-22 17:54:17 UTC
@omitrani @rg

Comment 5 Akshay Khanna 2022-04-22 18:04:56 UTC
I was able to reproduce this issue. The problem is not with the prometheus-managed-ocs-prometheus-0 pod itself but with the way our operator consumes resources.

When we run the script that deletes the ocs-osd-deployer pod again and again, every pod's memory consumption increases by some percentage each time the deployer re-initializes. After a while, once the memory of prometheus-managed-ocs-prometheus-0 exceeds its memory limit (currently 150 MB), the pod goes into the OOMKilled state. If we keep running this command long enough, most of the pods with small memory limits will eventually end up in the same state (although re-creating the deployer pod more than 100-200 times is a very rare case).

One solution that I can think of would be to increase the resource limit for the Prometheus pod, for example as sketched below.
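
A minimal sketch of that workaround, assuming the Prometheus CR behind the pod is named managed-ocs-prometheus in openshift-storage (inferred from the pod name) and using 400Mi as an example value; the deployer will likely reconcile this back, so the permanent change would have to go into the deployer's Prometheus template:

$ oc -n openshift-storage patch prometheus managed-ocs-prometheus --type merge \
    -p '{"spec":{"resources":{"requests":{"memory":"400Mi"},"limits":{"memory":"400Mi"}}}}'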

@omitrani @dbindra

Comment 6 Ohad 2022-04-23 12:15:17 UTC
@akkhanna If I understood your last comment correctly, what you describe is very alarming: it means that no matter what resource limits we set for the pods, after some number of ocs-osd-operator restarts, all of the pods will eventually go OOM.

We need an RCA for why all of the pods increase their memory consumption after an ocs-osd-deployer restart in order to come up with a proper mitigation.
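
One way to gather data for that RCA (an illustrative sketch using only standard oc commands) is to sample the memory of the affected pods while the restart loop from the description is running:

$ while true; do
    date
    oc -n openshift-storage adm top pods | grep -E 'prometheus-managed-ocs|ocs-osd-controller'
    sleep 30
  done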

Comment 7 Orit Wasserman 2022-05-17 12:40:34 UTC
It looks like Prometheus needs more memory than the limit we set (200M); for example, if you look at node ip-10-0-160-105.ec2.internal, you can see it using around 300M.
The memory usage of Prometheus depends on the scraping interval and the number of metrics.
Akshay, can you try increasing the pod memory limit to 400M and see if this still reproduces?
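
A quick check along these lines can be used to compare actual usage against the configured limit (the container is named "prometheus", as shown in the describe output above):

$ oc -n openshift-storage adm top pod prometheus-managed-ocs-prometheus-0
$ oc -n openshift-storage get pod prometheus-managed-ocs-prometheus-0 \
    -o jsonpath='{.spec.containers[?(@.name=="prometheus")].resources.limits.memory}'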

Comment 8 Akshay Khanna 2022-05-17 12:51:42 UTC
Orit, I tried raising the memory limit up to 350 MB, but it still didn't work. We can try setting it to 400 MB, but I doubt this would help because the memory just keeps climbing higher and higher.

Comment 9 Akshay Khanna 2022-05-23 05:51:13 UTC
Created a JIRA issue on ODF side to track the progress
https://issues.redhat.com/browse/RHSTOR-3243

Comment 10 Ohad 2022-05-23 11:13:55 UTC
@akkhanna There is automatic mirroring of BZs to JIRA, so there is no need to create another JIRA ticket.
Please close the one you created and keep updating here; everything here will be mirrored to the JIRA ticket as well.

Comment 11 Yaniv Kaul 2022-05-29 14:48:32 UTC
"When we run the script to delete the ocs-osd-deployer-pod again and again" - why would we do it? What is again and again? 10 times or 100 times?

Comment 12 Ohad 2022-05-30 07:16:14 UTC
Yaniv, this is all part of the RCA process for this bug. 
The initial symptom was that deployer restarts caused a memory increase in the Prometheus instance until the instance got OOM killed, so we created a script for the reproduction.
I do not think the restart count is relevant for the RCA, but it is relevant for deciding the severity of the issue.

Comment 16 gathomas 2022-08-31 08:49:48 UTC
Hello, we believe this issue (or a similar issue) is happening on `r-eu-prd-01.l4x7.p1.openshiftapps.com-ocs-consumer`; see https://redhat.pagerduty.com/incidents/Q0FWZQZV1DVLX9. This is a production cluster and, because of this, has not had any monitoring since at least August 22, 2022, but likely much longer. The prometheus pod is in CrashLoopBackOff because the prometheus container is constantly getting OOMKilled.

We have also seen this alert triggered on `r-us-tst-odf`, but it is currently not triggered.

These alerts can also be especially disruptive to the SRE team, because the page can self-resolve and then re-trigger, creating a new page for us, possibly in the middle of the night.

Comment 17 gathomas 2022-08-31 12:59:05 UTC
Update: a page was just triggered and then self-resolved on `r-eu-tst-01` https://redhat.pagerduty.com/incidents/Q155TFRJ8PALNQ

Comment 18 Yashvardhan Kukreja 2022-09-08 07:27:15 UTC
Here's the must-gather output of the above problem which happened again with another cluster - 

https://drive.google.com/drive/folders/1m04Yxmp_kOn51J2n08e3OJo0LJ0luNIZ?usp=sharing

Cluster - 

1s84t100kags7mmg8t4428kob6ufncuc  r-eu-tst-03                   https://api.r-eu-tst-03.12w0.p1.openshiftapps.com:6443      4.10.14             rosa            aws             eu-west-1       ready

Here are the executions of a few commands:

```
❯ oc get pods -n openshift-storage | grep "prom"
prometheus-managed-ocs-prometheus-0                               2/3     CrashLoopBackOff   270 (4m53s ago)   6d19h
prometheus-operator-8547cc9f89-mx9bp                              1/1     Running            0                 97d

~                                                                                                                                                                        11:49:47
❯ oc get pods -n openshift-storage | grep "ocs-operator"
ocs-operator-79bd8d6464-2fm9l                                     0/1     CrashLoopBackOff   400 (3m20s ago)   37h

~
```