Description of problem:
Prometheus pods crash due to OOM events when limits are defined. When limits are not defined, they consume >15-30GB of memory and 10-15 cores.
Version-Release number of selected component (if applicable):
atomic-openshift-3.11.117-1.git.0.14e54a3.el7.x86_64
RHEL 7.6, kernel-3.10.0-957.12.2.el7
How reproducible:
The customer is able to reproduce both events readily
Steps to Reproduce:
To crash pods:
1. Configure a 3.11 OCP cluster with ~2000 total pods across all namespaces and >140 nodes, running 2 Prometheus pods
2. Define reasonable limits as recommended in the documentation[0].
3. Prometheus pods then crash due to OOM events
To have pods consume a large amount of memory:
1. Same as above
2. Do not define limits
3. After some time has passed, review the memory and CPU consumption of the Prometheus pods
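The review in the last step could be scripted against the Prometheus HTTP API. The following is a minimal sketch, not the customer's procedure: the base URL, the PromQL expression, and the `pod_name` label are assumptions that depend on the cluster and the Prometheus version. Only the `/api/v1/query` endpoint and its JSON response layout come from the Prometheus HTTP API itself.

```python
import json
from urllib.parse import urlencode

# Assumed values, not taken from this report: adjust the route host and the
# PromQL expression (metric and label names differ between versions).
PROMETHEUS_URL = "https://prometheus-k8s-openshift-monitoring.example.com"
QUERY = (
    "container_memory_working_set_bytes"
    '{namespace="openshift-monitoring",container_name="prometheus"}'
)


def build_query_url(base_url, promql):
    """Return the instant-query URL for the Prometheus HTTP API."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


def memory_gib_by_pod(response_text, pod_label="pod_name"):
    """Map each pod to its memory usage in GiB from an instant-query response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error", "unknown error"))
    usage = {}
    for sample in payload["data"]["result"]:
        pod = sample["metric"].get(pod_label, "<unknown>")
        # An instant-vector sample is a [timestamp, "value-as-string"] pair.
        _timestamp, value = sample["value"]
        usage[pod] = float(value) / 2**30
    return usage
```

Fetching `build_query_url(PROMETHEUS_URL, QUERY)` with an authenticated HTTP client and passing the response body to `memory_gib_by_pod` would give one figure per Prometheus pod, which can then be compared against the limits being tested.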
Actual results:
Pods crash when limits are defined
OR
Pods consume >25-30GB of memory and >10-15 cores
Expected results:
- Pods do not crash when reasonable limits are defined
- When limits are not defined, pods should not consume so much memory.
Additional info:
The below details are specific to the customer's environment and the different things tried and their outcome:
Limit defined:
10GB memory
6 cores
Outcome: Pods crash due to OOM killer events
--
Limit defined:
8GB memory
6 cores
Outcome: Pods crash due to OOM killer events
--
Limit defined:
15GB memory
10 cores
Outcome: Pods crash due to OOM killer events
--
Default retention period changed to 7 days
Outcome: Pods still crash and/or consume too much memory within one hour
--
Limits removed
Outcome: Pods run fine, but consume more than 25-30 GB of memory and 10-15 cores.
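--
One way to read the attempts above is to compare each memory limit tried with the lowest consumption reported once limits were removed (25 GB). The sketch below does only that arithmetic on the figures from this report. It is an interpretation of the numbers, not a root-cause analysis, and the variable and function names are illustrative.

```python
# Limits tried in the customer's environment, copied from this report.
ATTEMPTED_LIMITS = [
    {"memory_gb": 10, "cores": 6},
    {"memory_gb": 8, "cores": 6},
    {"memory_gb": 15, "cores": 10},
]

# Lower bound of the memory consumption reported with limits removed.
OBSERVED_MEMORY_GB = 25


def memory_shortfalls(limits, observed_memory_gb):
    """Return, per attempt, how far the limit sits below the observed usage."""
    return [observed_memory_gb - limit["memory_gb"] for limit in limits]


for limit, gap in zip(ATTEMPTED_LIMITS, memory_shortfalls(ATTEMPTED_LIMITS, OBSERVED_MEMORY_GB)):
    print("limit %d GB: %d GB below observed usage" % (limit["memory_gb"], gap))
```

Every limit tried is at least 10 GB below the unbounded consumption, which would be consistent with the OOM kills seen in all three attempts.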
Environment:
OCP 3.11
2224 total pods in all namespaces
142 total nodes in the cluster
Total number of Prometheus nodes: 2
[0] - https://docs.openshift.com/container-platform/3.11/scaling_performance/scaling_cluster_monitoring.html#cluster-monitoring-recommendations-for-OCP
Moving to the active development branch (4.4). For any needed fixes where backports are required/requested, BZ clones will be created targeting those specific z-stream releases.