Bug 2121314

Summary: Prometheus pods in openshift-storage failing due to OOMKills triggering DMS alerts for MT-SRE
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Yashvardhan Kukreja <ykukreja>
Component: odf-managed-service
Assignee: Dhruv Bindra <dbindra>
Status: NEW
QA Contact: Filip Balák <fbalak>
Severity: unspecified
Priority: unspecified
Version: 4.10
CC: aeyal, apahim, dbindra, fbalak, gathomas, lgangava, nberry, nschiede, odf-bz-bot, rchikatw
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Flags: rchikatw: needinfo-
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2023-03-14 15:25:49 UTC
Type: Bug

Description Yashvardhan Kukreja 2022-08-25 06:30:15 UTC
Description of the problem:

Currently, DMS (Dead Man's Snitch) alerts are firing sporadically on some clusters. The usual trigger is DMS being unable to reach the Prometheus pods running in the customer's clusters and receive a 200 Ack from them.
In the case of ocs-consumer, the Prometheus pod in question is prometheus-managed-ocs-prometheus-0 in the openshift-storage namespace.

Because the pod needs more memory than the 250Mi limit allocated to it, it gets OOMKilled and never gets the opportunity to serve a 200 Ack to DMS, which triggers the DMS alert to MT-SRE.

The high memory consumption is the result of a combination of factors, such as certain metrics having high cardinality and the scrape interval being too frequent.
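
For triage, a quick way to confirm the OOMKill and the current limit is to query the pod directly. A minimal sketch using oc with JSONPath, assuming the Prometheus container inside the pod is named "prometheus":

```
# Why the prometheus container last terminated (expect: OOMKilled)
oc -n openshift-storage get pod prometheus-managed-ocs-prometheus-0 \
  -o jsonpath='{.status.containerStatuses[?(@.name=="prometheus")].lastState.terminated.reason}{"\n"}'

# Memory limit currently set on the prometheus container (expect: 250Mi)
oc -n openshift-storage get pod prometheus-managed-ocs-prometheus-0 \
  -o jsonpath='{.spec.containers[?(@.name=="prometheus")].resources.limits.memory}{"\n"}'
```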


Reproducing the issue:

Currently, there isn't any surefire way of reproducing this issue other than somehow making the Prometheus pods consume more memory.

Component controlling this:

The Prometheus pod is controlled by the StatefulSet prometheus-managed-ocs-prometheus in openshift-storage.
This StatefulSet is controlled/owned by the Prometheus CR managed-ocs-prometheus in the openshift-storage namespace.
This Prometheus resource is in turn controlled/owned by the ManagedOCS resource managedocs in the openshift-storage namespace.
This managedocs resource is getting reconciled by the ocs-operator.

Hence, overall, it is the ocs-operator that determines and controls the memory limits that eventually propagate to the Prometheus pods.
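
For reference, this chain can be traced directly from the ownerReferences on each object. A minimal sketch, using the resource names listed above:

```
# StatefulSet -> owned by the Prometheus CR
oc -n openshift-storage get statefulset prometheus-managed-ocs-prometheus \
  -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}/{.name}{"\n"}{end}'

# Prometheus CR -> owned by the ManagedOCS resource
oc -n openshift-storage get prometheus managed-ocs-prometheus \
  -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}/{.name}{"\n"}{end}'
```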

Proposed Solution:

ocs-operator should be reprogrammed so that reconciling the managedocs resource ends up creating the Prometheus CR with a bumped-up memory limit.

Specifically, this line (https://github.com/red-hat-storage/ocs-osd-deployer/blob/02ebe3916210326d00fae53bf55cbfef53ac1edb/utils/resources.go#L70) has to be changed to resource.MustParse("750Mi").
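
Once such a change is in place, one way to confirm that the new limit has propagated down the chain, assuming the Prometheus CR carries the limit under .spec.resources (a sketch, not a verified procedure):

```
# Limit rendered on the Prometheus CR (expect 750Mi after the fix)
oc -n openshift-storage get prometheus managed-ocs-prometheus \
  -o jsonpath='{.spec.resources.limits.memory}{"\n"}'

# Limit templated into the StatefulSet
oc -n openshift-storage get statefulset prometheus-managed-ocs-prometheus \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="prometheus")].resources.limits.memory}{"\n"}'
```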

Expected Results:
Prometheus pods running without facing any OOMKill failures and without needing any restarts.

Actual Results:
❯ oc get pod prometheus-managed-ocs-prometheus-0 -n openshift-storage
NAME                                  READY   STATUS    RESTARTS       AGE
prometheus-managed-ocs-prometheus-0   3/3     Running   41 (13m ago)   26h
Prometheus periodically gets OOMKilled and restarts itself in the hope of working properly.

Comment 1 Leela Venkaiah Gangavarapu 2022-08-29 06:42:28 UTC
@yash,

> This managedocs resource is getting reconciled by the ocs-operator.
- ocs-osd-deployer reconciles this resource; I believe this was a typo on your part.

- The bug is probably related to https://bugzilla.redhat.com/show_bug.cgi?id=2074938
- Many discussions similar to this one (https://coreos.slack.com/archives/C0VMT03S5/p1652858588689779) took place for the above bug, and based on past experience, simply bumping up the resource request for Prometheus is unlikely to work.
- I propose we wait for the next DMS alert, look at all the operator restarts in the openshift-storage namespace, and continue from there. WDYT?

thanks,
leela.

Comment 2 Yashvardhan Kukreja 2022-08-31 11:20:29 UTC
I'd still strongly suggest bumping the memory limit to 750Mi to give Prometheus a bigger resource envelope to operate within, given how critical this component is. Right now, the low limit compromises the entire monitoring setup of an ODF installation, which has a massive impact.
If it goes down, it makes us blind to the customer cluster's state.

As for fixing this, there aren't any relevant workarounds except scaling the ocs-osd-controller-manager down to zero replicas, so that it stops reconciling away any fixes/workarounds we apply to the Prometheus resources.

Although turning off ocs-osd-controller-manager would allow us to bump the Prometheus limits at runtime (and fix this issue), it would also compromise the other important work ocs-osd-controller-manager performs.
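
For completeness, that runtime workaround would look roughly like the following. This is only a sketch: it stops reconciliation entirely, so it should be treated as temporary, and it assumes the deployment and CR names referenced earlier in this BZ:

```
# Stop the deployer from reverting manual changes
oc -n openshift-storage scale deployment ocs-osd-controller-manager --replicas=0

# Bump the memory limit directly on the Prometheus CR
oc -n openshift-storage patch prometheus managed-ocs-prometheus --type merge \
  -p '{"spec":{"resources":{"limits":{"memory":"750Mi"}}}}'

# Scale the deployer back up once a proper fix ships (replica count assumed to be 1)
# oc -n openshift-storage scale deployment ocs-osd-controller-manager --replicas=1
```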

Comment 3 Ohad 2022-09-01 13:21:38 UTC
*** Bug 2119491 has been marked as a duplicate of this bug. ***

Comment 4 Yashvardhan Kukreja 2022-09-08 07:22:13 UTC
Here's the must-gather output for the above problem, which happened again on another cluster:

https://drive.google.com/drive/folders/1m04Yxmp_kOn51J2n08e3OJo0LJ0luNIZ?usp=sharing

Here is the output of a few commands:

```
❯ oc get pods -n openshift-storage | grep "prom"
prometheus-managed-ocs-prometheus-0                               2/3     CrashLoopBackOff   270 (4m53s ago)   6d19h
prometheus-operator-8547cc9f89-mx9bp                              1/1     Running            0                 97d

❯ oc get pods -n openshift-storage | grep "ocs-operator"
ocs-operator-79bd8d6464-2fm9l                                     0/1     CrashLoopBackOff   400 (3m20s ago)   37h

```

Comment 6 Dhruv Bindra 2022-10-03 08:16:34 UTC
The two BZ links in the link section are the backport requests to ODF v4.11 and v4.10. This BZ will be fixed once the backport is done and the managed service starts using the ODF version that has the fix for this issue.

Comment 7 gathomas 2022-10-24 13:25:58 UTC
Hi, I ran into a situation where the Prometheus pod is being OOMKilled but the ocs-operator has not been restarted even once:

kubectl get pods                    
NAME                                                              READY   STATUS             RESTARTS        AGE
...
ocs-operator-5c77756ddd-8zfqp                                     1/1     Running            0               33d
...
prometheus-managed-ocs-prometheus-0                               2/3     CrashLoopBackOff   202 (19s ago)   8d


We see that the fix raised the ocs-operator pod memory limit but not the Prometheus pod memory limits, so it seems that was not sufficient.
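
A quick way to compare what actually changed is to read the memory limits off both workloads. A minimal sketch, assuming the workload names match the pod names above (Deployment ocs-operator and StatefulSet prometheus-managed-ocs-prometheus):

```
# Memory limits on the ocs-operator containers
oc -n openshift-storage get deployment ocs-operator \
  -o jsonpath='{.spec.template.spec.containers[*].resources.limits.memory}{"\n"}'

# Memory limits on the Prometheus containers
oc -n openshift-storage get statefulset prometheus-managed-ocs-prometheus \
  -o jsonpath='{.spec.template.spec.containers[*].resources.limits.memory}{"\n"}'
```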

Comment 8 gathomas 2022-10-27 15:01:47 UTC
I forgot to mention that this was on the r-eu-tst-03 cluster.

Comment 13 Dhruv Bindra 2023-01-23 06:31:58 UTC
Needs to be tested with the latest build.

Comment 19 Filip Balák 2023-03-06 14:37:14 UTC
The reproducer was taken from https://bugzilla.redhat.com/show_bug.cgi?id=2123653#c15 (tracked BZ) and updated with simultaneous deletion of ocs-operator, as described in comment 18.

Commands run at the same time:

```
for j in {1..400}; do date; echo $j; for i in `oc get pods | grep ocs-osd-controller | cut -d " " -f1`; do oc get pod $i -o wide; oc delete pod $i; done; sleep 20; echo ==============================; done
for j in {1..400}; do date; echo $j; for i in `oc get pods | grep ocs-operator | cut -d " " -f1`; do oc get pod $i -o wide; oc delete pod $i; done; sleep 20; echo ==============================; done
```

Command for monitoring the status:

```
for j in {1..450}; do date; for i in `oc get pods | grep prometheus | cut -d " " -f1`; do oc get pod $i -o wide; done; sleep 20; echo ===========================; done
```

Pods at the end of the test:
$ oc get pods
NAME                                                              READY   STATUS      RESTARTS   AGE
addon-ocs-provider-qe-catalog-gzdfw                               1/1     Running     0          6h30m
alertmanager-managed-ocs-alertmanager-0                           2/2     Running     0          6h26m
b5c66ab5e7f90122be20fd041ac62139a65d985117a45c88c21d845e65qnl7m   0/1     Completed   0          6h29m
bc7c211d2e048e04f08db7871b3ed242a893dda70bb2d102964f72ff93hgmzg   0/1     Completed   0          6h29m
csi-addons-controller-manager-759b488df-56m98                     2/2     Running     0          6h27m
ocs-metrics-exporter-5dd96c885b-8jgll                             1/1     Running     0          6h26m
ocs-operator-6888799d6b-4ffgm                                     1/1     Running     0          10m
ocs-osd-aws-data-gather-bc7c46cf9-tb76l                           1/1     Running     0          6h28m
ocs-osd-controller-manager-6988ff8577-x4qrq                       3/3     Running     0          15m
ocs-provider-server-6cd96b5ccb-htb78                              1/1     Running     0          6h26m
odf-console-57b8476cd4-whbvf                                      1/1     Running     0          6h27m
odf-operator-controller-manager-6f44676f4f-qww5r                  2/2     Running     0          6h27m
prometheus-managed-ocs-prometheus-0                               3/3     Running     0          6h26m
prometheus-operator-8547cc9f89-qv555                              1/1     Running     0          6h28m
rook-ceph-crashcollector-084922fac2286ea9642c71584d0cf0d4-j66bn   1/1     Running     0          6h23m
rook-ceph-crashcollector-65dd33fb18b9ccca0b14f252fa6088d6-rbnhl   1/1     Running     0          6h23m
rook-ceph-crashcollector-6d1aea9ead3d2c6a2556aa39d30e694b-hbt96   1/1     Running     0          6h23m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6bd646668vf96   2/2     Running     0          6h22m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-55ccbf6bb2kbg   2/2     Running     0          6h22m
rook-ceph-mgr-a-fb74b5cd-ht46r                                    2/2     Running     0          6h23m
rook-ceph-mon-a-855bf455b4-7scns                                  2/2     Running     0          6h24m
rook-ceph-mon-b-5c7754454d-f5x7n                                  2/2     Running     0          6h24m
rook-ceph-mon-c-574fcc65df-h8hhk                                  2/2     Running     0          6h23m
rook-ceph-operator-548b87d44b-5w82s                               1/1     Running     0          6h26m
rook-ceph-osd-0-bb5976898-znnct                                   2/2     Running     0          6h22m
rook-ceph-osd-1-f46b8875f-gqdwr                                   2/2     Running     0          6h22m
rook-ceph-osd-2-5d566968cf-dhm68                                  2/2     Running     0          6h22m
rook-ceph-osd-prepare-default-0-data-0f7czh-nzfbf                 0/1     Completed   0          6h22m
rook-ceph-osd-prepare-default-1-data-09wl5d-bdcr9                 0/1     Completed   0          6h22m
rook-ceph-osd-prepare-default-2-data-04wm7m-2z79r                 0/1     Completed   0          6h22m
rook-ceph-tools-7c8c77bd96-hm2lt                                  1/1     Running     0          6h26m

--> VERIFIED

Tested with:
ocs-operator.v4.10.9
ocs-osd-deployer.v2.0.11

Comment 20 Ritesh Chikatwar 2023-03-14 15:25:49 UTC
Closing the bug as it's fixed in v2.0.11 and verified by QA.

Comment 21 apahim 2023-05-13 07:45:06 UTC
Reopening as multiple clusters are facing the issue again with ocs-osd-deployer.v2.0.12.

prometheus-managed-ocs-prometheus-0                2/3     CrashLoopBackOff   153 (2m9s ago)   3d18h

~$ oc -n openshift-storage get csv ocs-osd-deployer.v2.0.12
NAME                       DISPLAY            VERSION   REPLACES                   PHASE
ocs-osd-deployer.v2.0.12   OCS OSD Deployer   2.0.12    ocs-osd-deployer.v2.0.11   Succeeded