Description of the problem:
Currently, DMS (DeadManSnitch) alerts are firing sporadically on some clusters. The usual trigger is DMS being unable to reach the Prometheus pods running in the customer's clusters and get a 200 Ack from them. In the case of ocs-consumer, the pod in question is prometheus-managed-ocs-prometheus-0 in the openshift-storage namespace. Because it needs more memory than the 250Mi memory limit allocated to it, it gets OOMKilled and never gets the opportunity to serve a 200 Ack to DMS, which triggers the DMS alert to the MT-SRE. The high memory consumption is the result of a combination of factors, such as certain metrics having high cardinality, the scrape interval being too frequent, etc.

Reproducing the issue:
Currently there is no reliable way of reproducing this issue other than somehow making the Prometheus pods consume more memory.

Component controlling this:
The Prometheus pod is controlled by the statefulset prometheus-managed-ocs-prometheus in openshift-storage. This statefulset is controlled/owned by the Prometheus CR managed-ocs-prometheus in the openshift-storage namespace. That Prometheus resource is in turn controlled/owned by the ManagedOCS resource managedocs in the openshift-storage namespace, and the managedocs resource is reconciled by the ocs-operator. Overall, it is therefore the ocs-operator that determines and controls the memory limits that eventually propagate to the Prometheus pods.

Proposed solution:
The ocs-operator should be changed so that its reconciliation of the managedocs resource creates the Prometheus CR with bumped-up memory limits. Specifically, this line (https://github.com/red-hat-storage/ocs-osd-deployer/blob/02ebe3916210326d00fae53bf55cbfef53ac1edb/utils/resources.go#L70) has to be changed to resource.MustParse("750Mi").

Expected results:
Prometheus pods running without any OOMKill failures and without needing any restarts.

Actual results:
```
❯ oc get pod prometheus-managed-ocs-prometheus-0 -n openshift-storage
NAME                                  READY   STATUS    RESTARTS       AGE
prometheus-managed-ocs-prometheus-0   3/3     Running   41 (13m ago)   26h
```
Prometheus is periodically getting OOMKilled and restarting itself in the hope of working properly.
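For anyone triaging this, a minimal sketch of how to confirm the limit being propagated down to Prometheus and whether the restarts are really OOMKills; it assumes the Prometheus CR carries the limits under spec.resources and that the container in the statefulset is named "prometheus":

```
# Memory limit set on the Prometheus CR rendered by the operator
oc -n openshift-storage get prometheus managed-ocs-prometheus \
  -o jsonpath='{.spec.resources.limits.memory}{"\n"}'

# Same limit as propagated to the prometheus container of the statefulset
oc -n openshift-storage get statefulset prometheus-managed-ocs-prometheus \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="prometheus")].resources.limits.memory}{"\n"}'

# Reason for the last termination of the prometheus container (expected: OOMKilled)
oc -n openshift-storage get pod prometheus-managed-ocs-prometheus-0 \
  -o jsonpath='{.status.containerStatuses[?(@.name=="prometheus")].lastState.terminated.reason}{"\n"}'
```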
@yash,

> This managedocs resource is getting reconciled by the ocs-operator.

- ocs-osd-deployer reconciles this resource; I believe you mistyped it.
- The bug is probably related to https://bugzilla.redhat.com/show_bug.cgi?id=2074938
- A lot of discussion similar to this, e.g. https://coreos.slack.com/archives/C0VMT03S5/p1652858588689779, happened for the above bug, and simply bumping up the resource request for Prometheus is unlikely to work based on past experience.
- I propose we await the next DMS alert, look at all the operator restarts in the openshift-storage namespace, and continue from there. wdyt?

thanks,
leela
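If we go that route, something like the following should surface the restart counts quickly when the next alert fires (a rough sketch; it sorts by the restart count of each pod's first container):

```
# Pods in openshift-storage ordered by restart count of their first container
oc -n openshift-storage get pods \
  --sort-by='.status.containerStatuses[0].restartCount'
```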
I'd still strongly suggest bumping the memory limits to 750Mi so that Prometheus has a bigger resource scope to operate in, given the criticality of this component. Right now the low limits compromise the entire monitoring setup of an ODF installation, which has a massive impact: if it goes down, we are blind to the customer cluster's state. And there is no real workaround for fixing this other than scaling ocs-osd-controller-manager down to zero replicas so that it stops reconciling away any fixes/workarounds we apply to the Prometheus. Turning off ocs-osd-controller-manager would let us bump the Prometheus limits at runtime (and fix this issue), but it would also compromise the other important work ocs-osd-controller-manager would otherwise be performing.
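For the record, roughly what that runtime workaround looks like (a temporary sketch, assuming the deployer Deployment is named ocs-osd-controller-manager and that patching spec.resources on the Prometheus CR is enough for prometheus-operator to roll the new limit out):

```
# Stop the deployer so it cannot reconcile the limit back to 250Mi (temporary!)
oc -n openshift-storage scale deployment ocs-osd-controller-manager --replicas=0

# Bump the Prometheus memory limit at runtime; prometheus-operator updates the statefulset
oc -n openshift-storage patch prometheus managed-ocs-prometheus --type merge \
  -p '{"spec":{"resources":{"limits":{"memory":"750Mi"}}}}'

# Scale the deployer back up once a build carrying the higher limit is installed;
# while it is scaled down, its other reconciliation work stays disabled
oc -n openshift-storage scale deployment ocs-osd-controller-manager --replicas=1
```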
*** Bug 2119491 has been marked as a duplicate of this bug. ***
Here's the must-gather output of the above problem, which happened again on another cluster: https://drive.google.com/drive/folders/1m04Yxmp_kOn51J2n08e3OJo0LJ0luNIZ?usp=sharing

Here is the output of a few commands:
```
❯ oc get pods -n openshift-storage | grep "prom"
prometheus-managed-ocs-prometheus-0    2/3   CrashLoopBackOff   270 (4m53s ago)   6d19h
prometheus-operator-8547cc9f89-mx9bp   1/1   Running            0                 97d

❯ oc get pods -n openshift-storage | grep "ocs-operator"
ocs-operator-79bd8d6464-2fm9l          0/1   CrashLoopBackOff   400 (3m20s ago)   37h
```
The two BZ links in the link section are for the backport requests to ODF v4.11 and v4.10. This BZ will get fixed once the backports are done and managed service starts using the ODF version that has the fix for this issue.
Hi, I ran into a situation where the Prometheus pod is being OOMKilled but the ocs-operator has not been restarted even once:
```
kubectl get pods
NAME                                  READY   STATUS             RESTARTS        AGE
...
ocs-operator-5c77756ddd-8zfqp         1/1     Running            0               33d
...
prometheus-managed-ocs-prometheus-0   2/3     CrashLoopBackOff   202 (19s ago)   8d
```
We see that the fix raised the ocs-operator pod's memory limit, but the Prometheus pod's memory limits were not raised, so it seems that was not sufficient.
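A quick way to compare what the fix actually changed on such a cluster (a sketch; it assumes both workloads live in openshift-storage and have memory limits set on each container):

```
# Memory limits on the ocs-operator deployment containers
oc -n openshift-storage get deployment ocs-operator \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.resources.limits.memory}{"\n"}{end}'

# Memory limits on the prometheus statefulset containers
oc -n openshift-storage get statefulset prometheus-managed-ocs-prometheus \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.resources.limits.memory}{"\n"}{end}'
```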
I forgot to mention this was the r-eu-tst-03 cluster
Needs to be tested with the latest build.
The reproducer was taken from https://bugzilla.redhat.com/show_bug.cgi?id=2123653#c15 (tracked BZ) and updated with simultaneous deletion of ocs-operator as described in comment 18.

Commands run at the same time:
```
for j in {1..400};do date;echo $j;for i in `oc get pods|grep ocs-osd-controller|cut -d " " -f1`; do oc get pod $i -o wide; oc delete pod $i ;done;sleep 20;echo ==============================;done

for j in {1..400};do date;echo $j;for i in `oc get pods|grep ocs-operator|cut -d " " -f1`; do oc get pod $i -o wide; oc delete pod $i ;done;sleep 20;echo ==============================;done
```

Command for monitoring the status:
```
for j in {1..450};do date;for i in `oc get pods|grep prometheus|cut -d " " -f1`;do oc get pod $i -o wide;done;sleep 20;echo ===========================;done
```

Pods at the end of the test:
```
$ oc get pods
NAME                                                              READY   STATUS      RESTARTS   AGE
addon-ocs-provider-qe-catalog-gzdfw                               1/1     Running     0          6h30m
alertmanager-managed-ocs-alertmanager-0                           2/2     Running     0          6h26m
b5c66ab5e7f90122be20fd041ac62139a65d985117a45c88c21d845e65qnl7m   0/1     Completed   0          6h29m
bc7c211d2e048e04f08db7871b3ed242a893dda70bb2d102964f72ff93hgmzg   0/1     Completed   0          6h29m
csi-addons-controller-manager-759b488df-56m98                     2/2     Running     0          6h27m
ocs-metrics-exporter-5dd96c885b-8jgll                             1/1     Running     0          6h26m
ocs-operator-6888799d6b-4ffgm                                     1/1     Running     0          10m
ocs-osd-aws-data-gather-bc7c46cf9-tb76l                           1/1     Running     0          6h28m
ocs-osd-controller-manager-6988ff8577-x4qrq                       3/3     Running     0          15m
ocs-provider-server-6cd96b5ccb-htb78                              1/1     Running     0          6h26m
odf-console-57b8476cd4-whbvf                                      1/1     Running     0          6h27m
odf-operator-controller-manager-6f44676f4f-qww5r                  2/2     Running     0          6h27m
prometheus-managed-ocs-prometheus-0                               3/3     Running     0          6h26m
prometheus-operator-8547cc9f89-qv555                              1/1     Running     0          6h28m
rook-ceph-crashcollector-084922fac2286ea9642c71584d0cf0d4-j66bn   1/1     Running     0          6h23m
rook-ceph-crashcollector-65dd33fb18b9ccca0b14f252fa6088d6-rbnhl   1/1     Running     0          6h23m
rook-ceph-crashcollector-6d1aea9ead3d2c6a2556aa39d30e694b-hbt96   1/1     Running     0          6h23m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6bd646668vf96   2/2     Running     0          6h22m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-55ccbf6bb2kbg   2/2     Running     0          6h22m
rook-ceph-mgr-a-fb74b5cd-ht46r                                    2/2     Running     0          6h23m
rook-ceph-mon-a-855bf455b4-7scns                                  2/2     Running     0          6h24m
rook-ceph-mon-b-5c7754454d-f5x7n                                  2/2     Running     0          6h24m
rook-ceph-mon-c-574fcc65df-h8hhk                                  2/2     Running     0          6h23m
rook-ceph-operator-548b87d44b-5w82s                               1/1     Running     0          6h26m
rook-ceph-osd-0-bb5976898-znnct                                   2/2     Running     0          6h22m
rook-ceph-osd-1-f46b8875f-gqdwr                                   2/2     Running     0          6h22m
rook-ceph-osd-2-5d566968cf-dhm68                                  2/2     Running     0          6h22m
rook-ceph-osd-prepare-default-0-data-0f7czh-nzfbf                 0/1     Completed   0          6h22m
rook-ceph-osd-prepare-default-1-data-09wl5d-bdcr9                 0/1     Completed   0          6h22m
rook-ceph-osd-prepare-default-2-data-04wm7m-2z79r                 0/1     Completed   0          6h22m
rook-ceph-tools-7c8c77bd96-hm2lt                                  1/1     Running     0          6h26m
```

--> VERIFIED

Tested with:
ocs-operator.v4.10.9
ocs-osd-deployer.v2.0.11
Closing the bug as it's fixed in v2.0.11 and verified by QA.
Reopening as multiple clusters are facing the issue again with ocs-osd-deployer.v2.0.12.
```
prometheus-managed-ocs-prometheus-0   2/3   CrashLoopBackOff   153 (2m9s ago)   3d18h

$ oc -n openshift-storage get csv ocs-osd-deployer.v2.0.12
NAME                       DISPLAY            VERSION   REPLACES                   PHASE
ocs-osd-deployer.v2.0.12   OCS OSD Deployer   2.0.12    ocs-osd-deployer.v2.0.11   Succeeded
```