Bug 1790265

Summary:	Prometheus pods are consuming a large amount of memory and crashing when any limits are defined
Product:	OpenShift Container Platform	Reporter:	Sara Ferguson <sferguso>
Component:	Monitoring	Assignee:	Christian Heidenreich <cvogel>
Status:	CLOSED NOTABUG	QA Contact:	Junqi Zhao <juzhao>
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	3.11.0	CC:	alegrand, anpicker, erooth, fbranczy, kakkoyun, lcosic, mloibl, pkrupa, scuppett, surbania, syangsao
Target Milestone:	---	Keywords:	Reopened
Target Release:	3.11.z
Hardware:	Unspecified
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-03-04 07:37:57 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Sara Ferguson 2020-01-13 01:55:43 UTC

Description of problem:

Prometheus pods crash due to OOM events when limits are defined. When limits are not defined, they consume >15-30GB of memory and 10-15 cores.

Version-Release number of selected component (if applicable):

atomic-openshift-3.11.117-1.git.0.14e54a3.el7.x86_64
RHEL 7.6, kernel-3.10.0-957.12.2.el7

How reproducible:

The customer is able to reproduce both events readily.

Steps to Reproduce:

To crash pods: 

1. Configure a 3.11 OCP cluster with ~2000 total pods across all namespaces and >140 nodes, running 2 Prometheus pods.
2. Set reasonable limits as described in the documentation [0].
3. Prometheus pods then crash due to OOM events.

To have pods consume a large amount of memory:

1. Same as above.
2. Do not define limits.
3. After some time has passed, review the memory and CPU consumption of the Prometheus pods.

Actual results:

Pods crash when limits are defined

OR 

Pods consume >25-30GB of memory and >10-15 cores

Expected results:

- Pods do not crash when reasonable limits are defined
- When limits are defined, pods should not consume so much memory.

Additional info:

The details below are specific to the customer's environment and list each configuration tried and its outcome:

Limit defined: 
10GB memory
6 cores

Outcome: Pods crash due to OOM killer events

--

Limit defined: 
8GB memory
6 cores

Outcome: Pods crash due to OOM killer events

--

Limit defined: 
15GB memory
10 cores

Outcome: Pods crash due to OOM killer events

--

Change default retention period to 7 days

Outcome: Pods still crash and/or consume too much memory within one hour

--

Limits removed

Outcome: Pods run fine, but consume more than 25-30 GB of memory and 10-15 cores. 
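
As a rough sanity check on the figures above, the sketch below compares each memory limit that was tried against the low end of the usage observed with limits removed. It is only illustrative arithmetic on the numbers in this report. The names LIMITS_TRIED_GB, UNLIMITED_USAGE_GB, shortfall_gb and summarize are hypothetical, and it assumes that the unconstrained usage (25-30 GB) reflects what the workload actually needs in steady state, which has not been confirmed.

```python
# Figures copied from this report; all values are in GB as stated there.
LIMITS_TRIED_GB = [10, 8, 15]
# Usage seen with limits removed ("more than 25-30 GB"); treated here as a
# rough lower bound on real demand, which is an assumption.
UNLIMITED_USAGE_GB = (25, 30)


def shortfall_gb(limit_gb, observed_low_gb=UNLIMITED_USAGE_GB[0]):
    """Return how far a limit falls below the low end of observed usage."""
    return max(0, observed_low_gb - limit_gb)


def summarize(limits=LIMITS_TRIED_GB):
    """Map each tried limit to its shortfall against unconstrained usage."""
    return {limit: shortfall_gb(limit) for limit in limits}


if __name__ == "__main__":
    for limit, gap in summarize().items():
        print(f"limit {limit} GB: at least {gap} GB below unconstrained usage")
```

Under that assumption, every limit tried sits at least 10 GB below what the pods use when unconstrained, which would be consistent with the OOM kills seen in each case.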

Environment: 

OCP 3.11
2224 total pods in all namespaces
142 total nodes in the cluster
Total number of Prometheus nodes: 2

[0] - https://docs.openshift.com/container-platform/3.11/scaling_performance/scaling_cluster_monitoring.html#cluster-monitoring-recommendations-for-OCP

Comment 2 Stephen Cuppett 2020-01-13 12:22:27 UTC

Moving to the active development branch (4.4). For any needed fixes where backports are required/requested, BZ clones will be created targeting those specific z-stream releases.