Bug 2121314
| Summary: | Prometheus pods in openshift-storage failing due to OOMKills triggering DMS alerts for MT-SRE | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Yashvardhan Kukreja <ykukreja> |
| Component: | odf-managed-service | Assignee: | Dhruv Bindra <dbindra> |
| Status: | NEW --- | QA Contact: | Filip Balák <fbalak> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.10 | CC: | aeyal, apahim, dbindra, fbalak, gathomas, lgangava, nberry, nschiede, odf-bz-bot, rchikatw |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | Flags: | rchikatw: needinfo- |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-03-14 15:25:49 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Yashvardhan Kukreja
2022-08-25 06:30:15 UTC
@yash,

> This managedocs resource is getting reconciled by the ocs-operator.

- The ocs-osd-deployer reconciles this resource; I believe you mistyped it.
- The bug is probably related to https://bugzilla.redhat.com/show_bug.cgi?id=2074938
- A lot of discussion similar to https://coreos.slack.com/archives/C0VMT03S5/p1652858588689779 happened for the above bug, and simply bumping up the resource request for Prometheus is unlikely to work based on past experience.
- I propose we wait for the next DMS alert, look at all the operator restarts in the openshift-storage namespace, and continue from there. wdyt?

thanks, leela.

I'd still strongly suggest bumping the memory limit to 750Mi so that Prometheus has a bigger resource scope to operate under, because of the criticality of this component. Right now, the low limits compromise the entire monitoring setup of an ODF installation, which has a massive impact: if Prometheus goes down, we are blind to the customer cluster's state. There is no relevant workaround for this other than scaling ocs-osd-controller-manager down to zero replicas so that it stops reconciling away any fixes/workarounds we apply to the Prometheus resources. Turning off ocs-osd-controller-manager would let us bump the Prometheus limits at runtime (and fix this issue), but it would also compromise the other important functions ocs-osd-controller-manager performs.

*** Bug 2119491 has been marked as a duplicate of this bug. ***

Here's the must-gather output of the above problem, which happened again with another cluster: https://drive.google.com/drive/folders/1m04Yxmp_kOn51J2n08e3OJo0LJ0luNIZ?usp=sharing

Here are the executions of a few commands:

```
❯ oc get pods -n openshift-storage | grep "prom"
prometheus-managed-ocs-prometheus-0    2/3   CrashLoopBackOff   270 (4m53s ago)   6d19h
prometheus-operator-8547cc9f89-mx9bp   1/1   Running            0                 97d

❯ oc get pods -n openshift-storage | grep "ocs-operator"
ocs-operator-79bd8d6464-2fm9l   0/1   CrashLoopBackOff   400 (3m20s ago)   37h
```

The two BZ links in the links section are the backport requests to ODF v4.11 and v4.10. This BZ will get fixed once the backport is done and the managed service starts using the ODF version that contains the fix for this issue.

Hi, I ran into a situation where the Prometheus pod is being OOM killed but the ocs-operator has not been restarted even once:

```
kubectl get pods
NAME                                  READY   STATUS             RESTARTS        AGE
...
ocs-operator-5c77756ddd-8zfqp         1/1     Running            0               33d
...
prometheus-managed-ocs-prometheus-0   2/3     CrashLoopBackOff   202 (19s ago)   8d
```

We see that the fix raised the ocs-operator pod memory limit but not the Prometheus pod memory limits, so it seems that was not sufficient. I forgot to mention this was the r-eu-tst-03 cluster.

Needs to be tested with the latest build.

The reproducer was taken from https://bugzilla.redhat.com/show_bug.cgi?id=2123653#c15 (tracked BZ) and updated with simultaneous deletion of the ocs-operator as described in comment 18.
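For illustration, the workaround discussed above (stop the deployer from reconciling, then raise the Prometheus limits at runtime) might look roughly like the following sketch. The deployment name `ocs-osd-controller-manager` and the Prometheus CR name `managed-ocs-prometheus` are assumptions inferred from the pod names in this bug, and the 750Mi value is the one proposed above; none of these are confirmed as the supported procedure.

```
# Hypothetical workaround sketch -- resource names and limit values are assumptions,
# not confirmed in this bug.

# Stop the deployer from reconciling (and reverting) manual changes.
oc scale deployment ocs-osd-controller-manager -n openshift-storage --replicas=0

# Raise the memory limit on the Prometheus CR so prometheus-operator recreates
# prometheus-managed-ocs-prometheus-0 with the new limit.
oc patch prometheus managed-ocs-prometheus -n openshift-storage --type merge \
  -p '{"spec":{"resources":{"requests":{"memory":"750Mi"},"limits":{"memory":"750Mi"}}}}'

# Once a proper fix ships, restore the deployer (it will reconcile its own defaults back).
oc scale deployment ocs-osd-controller-manager -n openshift-storage --replicas=1
```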
Commands run at the same time:

```
for j in {1..400}; do date; echo $j; for i in `oc get pods | grep ocs-osd-controller | cut -d " " -f1`; do oc get pod $i -o wide; oc delete pod $i; done; sleep 20; echo ==============================; done

for j in {1..400}; do date; echo $j; for i in `oc get pods | grep ocs-operator | cut -d " " -f1`; do oc get pod $i -o wide; oc delete pod $i; done; sleep 20; echo ==============================; done
```

Command for monitoring the status:

```
for j in {1..450}; do date; for i in `oc get pods | grep prometheus | cut -d " " -f1`; do oc get pod $i -o wide; done; sleep 20; echo ===========================; done
```

Pods at the end of the test:

```
$ oc get pods
NAME                                                              READY   STATUS      RESTARTS   AGE
addon-ocs-provider-qe-catalog-gzdfw                               1/1     Running     0          6h30m
alertmanager-managed-ocs-alertmanager-0                           2/2     Running     0          6h26m
b5c66ab5e7f90122be20fd041ac62139a65d985117a45c88c21d845e65qnl7m   0/1     Completed   0          6h29m
bc7c211d2e048e04f08db7871b3ed242a893dda70bb2d102964f72ff93hgmzg   0/1     Completed   0          6h29m
csi-addons-controller-manager-759b488df-56m98                     2/2     Running     0          6h27m
ocs-metrics-exporter-5dd96c885b-8jgll                             1/1     Running     0          6h26m
ocs-operator-6888799d6b-4ffgm                                     1/1     Running     0          10m
ocs-osd-aws-data-gather-bc7c46cf9-tb76l                           1/1     Running     0          6h28m
ocs-osd-controller-manager-6988ff8577-x4qrq                       3/3     Running     0          15m
ocs-provider-server-6cd96b5ccb-htb78                              1/1     Running     0          6h26m
odf-console-57b8476cd4-whbvf                                      1/1     Running     0          6h27m
odf-operator-controller-manager-6f44676f4f-qww5r                  2/2     Running     0          6h27m
prometheus-managed-ocs-prometheus-0                               3/3     Running     0          6h26m
prometheus-operator-8547cc9f89-qv555                              1/1     Running     0          6h28m
rook-ceph-crashcollector-084922fac2286ea9642c71584d0cf0d4-j66bn   1/1     Running     0          6h23m
rook-ceph-crashcollector-65dd33fb18b9ccca0b14f252fa6088d6-rbnhl   1/1     Running     0          6h23m
rook-ceph-crashcollector-6d1aea9ead3d2c6a2556aa39d30e694b-hbt96   1/1     Running     0          6h23m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6bd646668vf96   2/2     Running     0          6h22m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-55ccbf6bb2kbg   2/2     Running     0          6h22m
rook-ceph-mgr-a-fb74b5cd-ht46r                                    2/2     Running     0          6h23m
rook-ceph-mon-a-855bf455b4-7scns                                  2/2     Running     0          6h24m
rook-ceph-mon-b-5c7754454d-f5x7n                                  2/2     Running     0          6h24m
rook-ceph-mon-c-574fcc65df-h8hhk                                  2/2     Running     0          6h23m
rook-ceph-operator-548b87d44b-5w82s                               1/1     Running     0          6h26m
rook-ceph-osd-0-bb5976898-znnct                                   2/2     Running     0          6h22m
rook-ceph-osd-1-f46b8875f-gqdwr                                   2/2     Running     0          6h22m
rook-ceph-osd-2-5d566968cf-dhm68                                  2/2     Running     0          6h22m
rook-ceph-osd-prepare-default-0-data-0f7czh-nzfbf                 0/1     Completed   0          6h22m
rook-ceph-osd-prepare-default-1-data-09wl5d-bdcr9                 0/1     Completed   0          6h22m
rook-ceph-osd-prepare-default-2-data-04wm7m-2z79r                 0/1     Completed   0          6h22m
rook-ceph-tools-7c8c77bd96-hm2lt                                  1/1     Running     0          6h26m
```

--> VERIFIED

Tested with:
ocs-operator.v4.10.9
ocs-osd-deployer.v2.0.11

Closing the bug as it's fixed in v2.0.11 and verified by QA.

Reopening as multiple clusters are facing the issue again with ocs-osd-deployer.v2.0.12:

```
prometheus-managed-ocs-prometheus-0   2/3   CrashLoopBackOff   153 (2m9s ago)   3d18h

~$ oc -n openshift-storage get csv ocs-osd-deployer.v2.0.12
NAME                       DISPLAY            VERSION   REPLACES                   PHASE
ocs-osd-deployer.v2.0.12   OCS OSD Deployer   2.0.12    ocs-osd-deployer.v2.0.11   Succeeded
```
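For anyone triaging the reopened occurrence, one way to confirm that the CrashLoopBackOff restarts are in fact OOMKills (rather than some other crash) is to inspect the last terminated state of the pod's containers. This is a minimal sketch, assuming the pod name seen above; it is not part of the original reproducer.

```
# Hypothetical check -- assumes the pod name seen in this bug.
# Prints each container name and the reason its previous instance terminated;
# an OOM-killed container shows "OOMKilled".
oc -n openshift-storage get pod prometheus-managed-ocs-prometheus-0 \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
```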