
Bug 1790265

Summary: Prometheus pods are consuming a large amount of memory and crashing when any limits are defined
Product: OpenShift Container Platform
Reporter: Sara Ferguson <sferguso>
Component: Monitoring
Assignee: Christian Heidenreich <cvogel>
Status: CLOSED NOTABUG
QA Contact: Junqi Zhao <juzhao>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 3.11.0
CC: alegrand, anpicker, erooth, fbranczy, kakkoyun, lcosic, mloibl, pkrupa, scuppett, surbania, syangsao
Target Milestone: ---
Keywords: Reopened
Target Release: 3.11.z
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-03-04 07:37:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Sara Ferguson 2020-01-13 01:55:43 UTC
Description of problem:

Prometheus pods crash due to OOM events when limits are defined. When limits are not defined, they consume >15-30GB of memory and 10-15 cores.

Version-Release number of selected component (if applicable):

atomic-openshift-3.11.117-1.git.0.14e54a3.el7.x86_64
RHEL 7.6, kernel-3.10.0-957.12.2.el7

How reproducible:

The customer is able to reproduce both events readily 

Steps to Reproduce:

To crash pods: 

1. Configure a 3.11 OCP cluster with ~2000 total pods across all namespaces and >140 nodes, running 2 Prometheus pods
2. Set reasonable limits as described in the documentation [0].
3. Prometheus pods then crash due to OOM events

To have pods consume large amount of memory: 

1. Same as above
2. Do not define limits
3. After some time has passed, review the memory and CPU usage of the Prometheus pods
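
One possible way to do the review in the last step is to ask the Prometheus HTTP API for the working-set memory of the Prometheus containers. This is a minimal sketch and not a procedure taken from this report: the base URL is a placeholder, the namespace `openshift-monitoring`, the pod name pattern `prometheus-k8s-.*` and the container name `prometheus` are assumed defaults, and the label names (`pod_name`, `container_name`) differ between versions. It only builds the query URL and parses a response body; authentication and the HTTP request itself are left out.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, namespace="openshift-monitoring",
                    pod_regex="prometheus-k8s-.*"):
    """Build an instant-query URL for the Prometheus HTTP API.

    The namespace, pod pattern, and label names are assumptions;
    adjust them to match the cluster being inspected.
    """
    promql = (
        "container_memory_working_set_bytes"
        f'{{namespace="{namespace}",pod_name=~"{pod_regex}",'
        'container_name="prometheus"}'
    )
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def working_set_gib(response_text):
    """Map pod name to working-set memory in GiB from an instant-query response."""
    body = json.loads(response_text)
    usage = {}
    for sample in body["data"]["result"]:
        pod = sample["metric"].get("pod_name", "unknown")
        # Each instant-vector sample is [timestamp, "value-as-string"].
        usage[pod] = float(sample["value"][1]) / 2**30
    return usage
```

The response text passed to `working_set_gib` would come from an authenticated GET against the URL returned by `build_query_url`.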

Actual results:

Pods crash when limits are defined

OR 

Pods consume >25-30GB of memory and >10-15 cores

Expected results:

- Pods do not crash when reasonable limits are defined
- When limits are not defined, pods should not consume so much memory.

Additional info:

The below details are specific to the customer's environment and the different things tried and their outcome: 

Limit defined: 
10GB memory
6 cores

Outcome: Pods crash due to OOM killer events

--

Limit defined: 
8GB memory
6 cores

Outcome: Pods crash due to OOM killer events

--

Limit defined: 
15 GB memory
10 cores

Outcome: Pods crash due to OOM killer events

--

Change default retention period to 7 days

Outcome: Pods still crash and/or consume too much memory within one hour

--

Limits removed

Outcome: Pods run fine, but consume more than 25-30 GB of memory and 10-15 cores. 
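
The outcomes above are consistent with simple arithmetic: every memory limit that was tried is below the 25 GB lower bound the pods reach when no limit is set. The sketch below makes that comparison explicit. Treating 25 GB as the memory demand is an assumption made for illustration; actual demand varies over time and with the workload.

```python
# Memory limits tried, in GB, as listed in this report.
TRIED_LIMITS_GB = [10, 8, 15]

# Lower bound of memory use observed without limits (assumed here as the demand).
UNCONSTRAINED_MIN_GB = 25


def shortfall_gb(limit_gb, demand_gb=UNCONSTRAINED_MIN_GB):
    """Return how far a limit falls below the assumed demand (0 if it covers it)."""
    return max(0, demand_gb - limit_gb)


for limit in TRIED_LIMITS_GB:
    print(f"limit {limit} GB: short by at least {shortfall_gb(limit)} GB")
```

Under this assumption, none of the tried limits leaves headroom, which matches the OOM killer events reported for each of them.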

Environment: 

OCP 3.11
2224 total pods in all namespaces
142 total nodes in the cluster
Total number of Prometheus nodes: 2

[0] - https://docs.openshift.com/container-platform/3.11/scaling_performance/scaling_cluster_monitoring.html#cluster-monitoring-recommendations-for-OCP

Comment 2 Stephen Cuppett 2020-01-13 12:22:27 UTC
Moving to the active development branch (4.4). For any needed fixes where backports are required/requested, BZ clones will be created targeting those specific z-stream releases.