Bug 1929875

Summary: prometheus memory spikes
Product: OpenShift Container Platform
Component: Monitoring
Version: 4.6
Hardware: All
OS: Linux
Status: CLOSED DUPLICATE
Severity: medium
Priority: unspecified
Reporter: dtarabor
Assignee: Sergiusz Urbaniak <surbania>
QA Contact: Junqi Zhao <juzhao>
CC: alegrand, anpicker, erooth, kakkoyun, lcosic, pkrupa, surbania
Type: Bug
Last Closed: 2021-02-18 09:37:50 UTC

Description dtarabor 2021-02-17 19:53:20 UTC
Description of problem:
The prometheus-k8s-0 and prometheus-k8s-1 pods show unexplained memory spikes after upgrading to 4.6.16. These spikes (16 GB+) drive the nodes into NotReady, after which pods can no longer be scheduled to those nodes.
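
For triage, a quick way to spot the affected nodes is to list every node whose Ready condition is not True. A minimal sketch using the Python Kubernetes client (assumes a kubeconfig with access to the cluster):

    # Sketch: print nodes that are not Ready, e.g. after a memory spike.
    from kubernetes import client, config

    config.load_kube_config()  # uses the current kubeconfig context
    for node in client.CoreV1Api().list_node().items:
        for cond in node.status.conditions or []:
            if cond.type == "Ready" and cond.status != "True":
                print(f"{node.metadata.name}: Ready={cond.status} ({cond.reason})")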

Workaround:
Deleting the wal/ directory appears to work around the issue.
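
A minimal sketch of applying that workaround with the Python Kubernetes client follows. The namespace (openshift-monitoring), container name (prometheus), and data directory (/prometheus) are the usual cluster-monitoring-operator defaults, but verify them on the affected cluster first; the pods will likely also need a restart so Prometheus starts from a clean WAL.

    # Sketch: remove the WAL on both Prometheus replicas via pod exec.
    # Namespace, container, and path are assumed defaults; verify first.
    from kubernetes import client, config
    from kubernetes.stream import stream

    config.load_kube_config()
    api = client.CoreV1Api()

    for pod in ("prometheus-k8s-0", "prometheus-k8s-1"):
        out = stream(
            api.connect_get_namespaced_pod_exec,
            pod,
            "openshift-monitoring",
            container="prometheus",
            command=["rm", "-rf", "/prometheus/wal"],
            stderr=True, stdin=False, stdout=True, tty=False,
        )
        print(f"{pod}: {out!r}")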

Version-Release number of selected component (if applicable):
OCP 4.6.16

How reproducible:
I was not able to reproduce this on my cluster.

Steps to Reproduce:
1. Upgrade cluster to 4.6.16
2. Prometheus pods spike to very high memory usage (see the sketch after this list).
3. Nodes become overwhelmed.
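
To watch for the spike, one option is to poll the platform Prometheus for the working-set memory of its own pods. In the sketch below, PROM_URL and TOKEN are placeholders; on OCP the API is typically reachable through the prometheus-k8s route in openshift-monitoring with a bearer token.

    # Sketch: query Prometheus for its own pods' working-set memory.
    # PROM_URL and TOKEN are placeholders for the real route and token.
    import requests

    PROM_URL = "https://prometheus-k8s-openshift-monitoring.apps.example.com"
    TOKEN = "<bearer-token>"

    query = (
        'container_memory_working_set_bytes{'
        'namespace="openshift-monitoring",'
        'pod=~"prometheus-k8s-.*",container="prometheus"}'
    )
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": query},
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        gib = float(result["value"][1]) / 2**30
        print(f"{result['metric']['pod']}: {gib:.1f} GiB")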

Expected results:
No memory spikes; Prometheus memory usage remains stable after the upgrade.

Additional info:

Appears to be related to https://github.com/prometheus/prometheus/issues/6934.
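
The linked upstream issue appears to concern high memory consumption while replaying the WAL, which would be consistent with the WAL-deletion workaround above. If so, the on-disk WAL size should correlate with the spike; a sketch that reports it per replica, under the same assumed defaults as the workaround sketch:

    # Sketch: report WAL size on each replica (same assumed defaults:
    # openshift-monitoring namespace, prometheus container, /prometheus).
    from kubernetes import client, config
    from kubernetes.stream import stream

    config.load_kube_config()
    api = client.CoreV1Api()

    for pod in ("prometheus-k8s-0", "prometheus-k8s-1"):
        out = stream(
            api.connect_get_namespaced_pod_exec,
            pod,
            "openshift-monitoring",
            container="prometheus",
            command=["du", "-sh", "/prometheus/wal"],
            stderr=True, stdin=False, stdout=True, tty=False,
        )
        print(f"{pod}: {out.strip()}")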