Bug 1929875

Summary: prometheus memory spikes
Product: OpenShift Container Platform
Component: Monitoring
Version: 4.6
Hardware: All
OS: Linux
Status: CLOSED DUPLICATE
Severity: medium
Priority: unspecified
Reporter: dtarabor
Assignee: Sergiusz Urbaniak <surbania>
QA Contact: Junqi Zhao <juzhao>
CC: alegrand, anpicker, erooth, kakkoyun, lcosic, pkrupa, surbania
Type: Bug
Last Closed: 2021-02-18 09:37:50 UTC

Description dtarabor 2021-02-17 19:53:20 UTC
Description of problem:
The prometheus-k8s-0 and prometheus-k8s-1 pods show unexplained memory spikes after upgrading to 4.6.16. These spikes (16 GB+) drive the nodes into NotReady, after which pods can no longer be scheduled to those nodes.
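
For triage, a quick way to spot the affected nodes is to list every node whose Ready condition is not True. A minimal sketch using the Python Kubernetes client (assumes a kubeconfig with access to the cluster):

    # Sketch: print nodes that are not Ready, e.g. after a memory spike.
    from kubernetes import client, config

    config.load_kube_config()  # uses the current kubeconfig context
    for node in client.CoreV1Api().list_node().items:
        for cond in node.status.conditions or []:
            if cond.type == "Ready" and cond.status != "True":
                print(f"{node.metadata.name}: Ready={cond.status} ({cond.reason})")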

Workaround:
Deleting the wal/ directory appears to work around the issue.
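
A minimal sketch of applying that workaround with the Python Kubernetes client follows. The namespace (openshift-monitoring), container name (prometheus), and data directory (/prometheus) are the usual cluster-monitoring-operator defaults, but verify them on the affected cluster first; the pods will likely also need a restart so Prometheus starts from a clean WAL.

    # Sketch: remove the WAL on both Prometheus replicas via pod exec.
    # Namespace, container, and path are assumed defaults; verify first.
    from kubernetes import client, config
    from kubernetes.stream import stream

    config.load_kube_config()
    api = client.CoreV1Api()

    for pod in ("prometheus-k8s-0", "prometheus-k8s-1"):
        out = stream(
            api.connect_get_namespaced_pod_exec,
            pod,
            "openshift-monitoring",
            container="prometheus",
            command=["rm", "-rf", "/prometheus/wal"],
            stderr=True, stdin=False, stdout=True, tty=False,
        )
        print(f"{pod}: {out!r}")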

Version-Release number of selected component (if applicable):
OCP 4.6.16

How reproducible:
I was not able to reproduce this on my cluster.

Steps to Reproduce:
1. Upgrade cluster to 4.6.16
2. Prometheus pods spike to very high memory usage (see the sketch after this list).
3. Nodes become overwhelmed.
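
To watch for the spike, one option is to poll the platform Prometheus for the working-set memory of its own pods. In the sketch below, PROM_URL and TOKEN are placeholders; on OCP the API is typically reachable through the prometheus-k8s route in openshift-monitoring with a bearer token.

    # Sketch: query Prometheus for its own pods' working-set memory.
    # PROM_URL and TOKEN are placeholders for the real route and token.
    import requests

    PROM_URL = "https://prometheus-k8s-openshift-monitoring.apps.example.com"
    TOKEN = "<bearer-token>"

    query = (
        'container_memory_working_set_bytes{'
        'namespace="openshift-monitoring",'
        'pod=~"prometheus-k8s-.*",container="prometheus"}'
    )
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": query},
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        gib = float(result["value"][1]) / 2**30
        print(f"{result['metric']['pod']}: {gib:.1f} GiB")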

Expected results:
No memory spikes; Prometheus memory usage remains stable after the upgrade.

Additional info:

Appears to be related to https://github.com/prometheus/prometheus/issues/6934.
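
The linked upstream issue appears to concern high memory consumption while replaying the WAL, which would be consistent with the WAL-deletion workaround above. If so, the on-disk WAL size should correlate with the spike; a sketch that reports it per replica, under the same assumed defaults as the workaround sketch:

    # Sketch: report WAL size on each replica (same assumed defaults:
    # openshift-monitoring namespace, prometheus container, /prometheus).
    from kubernetes import client, config
    from kubernetes.stream import stream

    config.load_kube_config()
    api = client.CoreV1Api()

    for pod in ("prometheus-k8s-0", "prometheus-k8s-1"):
        out = stream(
            api.connect_get_namespaced_pod_exec,
            pod,
            "openshift-monitoring",
            container="prometheus",
            command=["du", "-sh", "/prometheus/wal"],
            stderr=True, stdin=False, stdout=True, tty=False,
        )
        print(f"{pod}: {out.strip()}")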