Bug 1810111 - Prometheus WAL replay memory consumption far exceeds steady state usage
Summary: Prometheus WAL replay memory consumption far exceeds steady state usage
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.5.0
Assignee: Lili Cosic
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1808358
Depends On:
Blocks:
 
Reported: 2020-03-04 14:59 UTC by Scott Dodson
Modified: 2020-10-20 02:59 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-06 08:03:22 UTC
Target Upstream Version:
Embargoed:



Description Scott Dodson 2020-03-04 14:59:34 UTC
Description of problem:
When Prometheus pods are restarted, replaying the WAL consumes significantly more memory than steady-state operation. A cluster that is currently healthy may therefore deadlock after a rolling restart or an upgrade, because of the increased consumption while the WAL is replayed during boot-up.


Version-Release number of selected component (if applicable):
4.3.2 but likely all versions

How reproducible:
100% when steady-state consumption nears node resource limits

Steps to Reproduce:
1. Populate Prometheus data until it nears node resource limits
2. Restart Prometheus (for example by deleting the pods; see the sketch below)
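
For step 2, a minimal way to force the restart on an OpenShift cluster (a sketch, assuming the default openshift-monitoring stack with two replicas named prometheus-k8s-0 and prometheus-k8s-1):

  # delete the pods; the statefulset recreates them and WAL replay starts on boot
  oc -n openshift-monitoring delete pod prometheus-k8s-0 prometheus-k8s-1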

Actual results:
The memory burst causes Prometheus to crashloop during boot-up, with no warning before the restart that the system had already exceeded healthy limits.
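
The crashloop is visible from the pod status and events (a sketch, assuming the default openshift-monitoring namespace and pod names):

  # watch the pods cycle through restarts during WAL replay
  oc -n openshift-monitoring get pods -w
  # inspect the last state / termination reason of the prometheus container
  oc -n openshift-monitoring describe pod prometheus-k8s-0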

Expected results:
Either reduce the startup memory burst, or devise an alerting pattern that ensures admins are aware they have exceeded the limits of available resources before critical events such as power outages, rolling restarts, or upgrades.
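
One possible alerting pattern (a sketch only; the metric labels and the join to node capacity are assumptions and would need to be adapted to the environment) is to fire when the Prometheus container's working set approaches a fixed fraction of the memory its node can provide, warning the admin before the next restart, e.g. something along the lines of:

  container_memory_working_set_bytes{namespace="openshift-monitoring", container="prometheus"}
    > 0.4 * (node allocatable memory, e.g. joined from kube_node_status_allocatable)

The fraction would have to leave enough headroom for the WAL replay burst described above.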

Additional info:
This burst of consumption has knock-on effects due to a kubelet bug that results in system processes being OOMKilled and further degradation of the node; see https://bugzilla.redhat.com/show_bug.cgi?id=1808429.

Comment 2 Scott Dodson 2020-03-04 15:30:25 UTC
Known workarounds: remove the wal directory from the volume. This will result in data loss for anything that has not yet been committed to the TSDB.
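
A minimal sketch of that workaround, assuming the default data path /prometheus inside the prometheus container and the default pod names (repeat for each replica); as noted, this drops anything that was only in the WAL:

  # delete the WAL so the next start skips replay (data loss for uncommitted samples)
  oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- rm -rf /prometheus/wal

If the container is already crashlooping, the same deletion may have to be done from a debug pod or from the node that mounts the volume.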

Comment 4 Sergiusz Urbaniak 2020-03-26 13:48:33 UTC
*** Bug 1808358 has been marked as a duplicate of this bug. ***

