Bug 1810111

Summary: Prometheus WAL replay memory consumption far exceeds steady state usage
Product: OpenShift Container Platform
Reporter: Scott Dodson <sdodson>
Component: Monitoring
Assignee: Lili Cosic <lcosic>
Status: CLOSED UPSTREAM
QA Contact: Junqi Zhao <juzhao>
Severity: low
Docs Contact:
Priority: unspecified
Version: 4.3.0
CC: alegrand, anpicker, erooth, kakkoyun, lbednar, lcosic, mloibl, nelluri, pkrupa, surbania, wking
Target Milestone: ---
Keywords: Upgrades
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-05-06 08:03:22 UTC
Type: Bug

Description Scott Dodson 2020-03-04 14:59:34 UTC
Description of problem:
When Prometheus pods are restarted, replaying the write-ahead log (WAL) consumes significantly more memory than steady-state operation. A cluster that is currently healthy may therefore deadlock after a rolling restart or upgrade, because memory consumption spikes while the pods boot.


Version-Release number of selected component (if applicable):
4.3.2, but likely all versions

How reproducible:
100% when steady-state consumption nears node resource limits

Steps to Reproduce:
1. Populate Prometheus with data until its memory consumption nears node resource limits
2. Restart the Prometheus pods (see the sketch below)
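
A minimal sketch of the reproduction steps, assuming the default openshift-monitoring namespace, the default prometheus-k8s-0/prometheus-k8s-1 pod names, and an oc binary on PATH with sufficient privileges (all of these are assumptions; adjust for the cluster at hand):

import subprocess

NAMESPACE = "openshift-monitoring"               # assumption: default install location
PODS = ["prometheus-k8s-0", "prometheus-k8s-1"]  # assumption: default two replicas

for pod in PODS:
    # Deleting the pod makes the StatefulSet recreate it, which forces Prometheus
    # to replay its WAL from the persistent volume on startup (step 2 above).
    subprocess.run(["oc", "-n", NAMESPACE, "delete", "pod", pod], check=True)
    # Wait for the replacement pod before touching the next replica.
    subprocess.run(
        ["oc", "-n", NAMESPACE, "wait", "--for=condition=Ready",
         "pod/" + pod, "--timeout=15m"],
        check=True,
    )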

Actual results:
The memory consumption burst causes Prometheus to crashloop during bootup, with no warning before the restart that the system had already exceeded healthy limits.

Expected results:
Either reduce the startup memory burst or devise an alerting pattern that ensures admins are aware they have exceeded the limits of available resources before critical events such as power outages, rolling restarts, or upgrades.
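
As an illustration only, a minimal sketch of the kind of headroom check such an alert could encode; this is not an actual rule shipped with the product. It assumes Prometheus is reachable on localhost:9090 (for example via oc -n openshift-monitoring port-forward prometheus-k8s-0 9090), and the 16 GiB allocatable figure and 40% threshold are made-up placeholders:

import requests

PROM_URL = "http://localhost:9090"        # assumption: reached via port-forward
QUERY = ('max by (pod) (container_memory_working_set_bytes'
         '{namespace="openshift-monitoring",container="prometheus"})')
NODE_ALLOCATABLE_BYTES = 16 * 1024**3     # assumption: 16 GiB allocatable on the node
REPLAY_HEADROOM_FACTOR = 0.40             # assumption: room left for the replay burst

resp = requests.get(PROM_URL + "/api/v1/query", params={"query": QUERY})
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    pod = sample["metric"]["pod"]
    used = float(sample["value"][1])
    if used > REPLAY_HEADROOM_FACTOR * NODE_ALLOCATABLE_BYTES:
        print("WARNING: %s uses %.1f GiB at steady state; a restart may not fit on "
              "this node once the WAL replay burst is added" % (pod, used / 1024 ** 3))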

Additional info:
This burst of consumption has knock-on effects due to a kubelet bug that results in system processes being OOMKilled and further degradation of the node; see https://bugzilla.redhat.com/show_bug.cgi?id=1808429

Comment 2 Scott Dodson 2020-03-04 15:30:25 UTC
Known workaround: remove the wal directory from the volume. This will result in data loss for any samples that have not yet been committed to TSDB blocks.
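
A minimal sketch of that workaround, assuming the default /prometheus/wal data path inside the prometheus container and that the container stays up long enough to exec into; if it is already crashlooping, the directory has to be removed by other means, for example from a debug pod with the persistent volume mounted:

import subprocess

NAMESPACE = "openshift-monitoring"
POD = "prometheus-k8s-0"          # repeat for each affected replica

# Remove the write-ahead log; anything not yet compacted into TSDB blocks is lost.
subprocess.run(
    ["oc", "-n", NAMESPACE, "exec", POD, "-c", "prometheus",
     "--", "rm", "-rf", "/prometheus/wal"],
    check=True,
)
# Restart the pod so Prometheus comes up without replaying the removed WAL.
subprocess.run(["oc", "-n", NAMESPACE, "delete", "pod", POD], check=True)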

Comment 4 Sergiusz Urbaniak 2020-03-26 13:48:33 UTC
*** Bug 1808358 has been marked as a duplicate of this bug. ***