Bug 1925061 - Prometheus backed by a PVC may start consuming a lot of RAM after 4.6 -> 4.7 upgrade due to series churn
Summary: Prometheus backed by a PVC may start consuming a lot of RAM after 4.6 -> 4.7 upgrade due to series churn
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 1945849
Blocks: 1931896
 
Reported: 2021-02-04 10:41 UTC by Vadim Rutkovsky
Modified: 2023-09-22 04:08 UTC
CC List: 21 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1931896
Environment:
Last Closed: 2021-07-27 22:40:58 UTC
Target Upstream Version:
Embargoed:


Attachments
RSS memory and in-use heap when Prometheus starts (124.88 KB, image/png), uploaded 2021-02-15 09:32 UTC by Simon Pasquier
4.7-4.8 upgrades with PVCs (1.27 MB, image/png), uploaded 2021-06-15 09:38 UTC by Junqi Zhao


Links
Github openshift cluster-monitoring-operator pull 1052 (closed): Bug 1925061: Remove the "instance" and "pod" labels for kube-state-metrics metrics (last updated 2021-04-15 08:35:07 UTC)
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 22:41:22 UTC)

Comment 1 Vadim Rutkovsky 2021-02-04 20:42:52 UTC
Jobs are timing out as must-gather step takes too long:
>Pod e2e-aws-upgrade-gather-must-gather succeeded after 50m40s

This probably means a performance regression

Comment 2 Tim Rozet 2021-02-04 23:29:00 UTC
We found that the worker node was running out of RAM because Prometheus was taking up around 4 GB of RSS:
Tasks: 391 total,  6 running, 382 sleeping,  0 stopped,  3 zombie 
%Cpu(s): 78.3 us, 19.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 2.7 hi, 0.0 si, 0.0 st 
MiB Mem :  7664.7 total,   116.9 free,  7135.3 used,   412.5 buff/cache 
MiB Swap:     0.0 total,     0.0 free,     0.0 used.   155.9 avail Mem 
   PID USER     PR NI   VIRT   RES   SHR S %CPU %MEM    TIME+ COMMAND                                                                                                                                               
 10410 nfsnobo+ 20  0 5018064  3.9g     0 R  0.2 51.5  5:19.79 prometheus            

This would eventually cause the node to go NotReady. Killing the process brings the node back to Ready and everything works until the Prometheus container restarts and balloons again to half of the worker's RAM. Moving this to the monitoring team to investigate.
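
For reference, a rough sketch of how this can be checked on a live cluster (assuming the default openshift-monitoring namespace and the usual prometheus-k8s-* pod names, not the exact commands used here):

# per-container memory of a Prometheus replica, via the metrics API
oc -n openshift-monitoring adm top pod prometheus-k8s-0 --containers

# or, as a PromQL query against the in-cluster Prometheus:
# RSS of the prometheus containers over time
container_memory_rss{namespace="openshift-monitoring", container="prometheus"}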

Comment 3 Sergiusz Urbaniak 2021-02-05 10:57:26 UTC
@simon: do you mind having a look and assessing the severity?

Comment 4 Vadim Rutkovsky 2021-02-05 12:03:54 UTC
Possibly related - https://bugzilla.redhat.com/show_bug.cgi?id=1918683

Comment 5 Scott Dodson 2021-02-10 23:42:25 UTC
This is believed to block all upgrades when OVN is in use. Marking as a blocker.

Comment 6 Sergiusz Urbaniak 2021-02-11 09:29:50 UTC
*** Bug 1927448 has been marked as a duplicate of this bug. ***

Comment 7 Pawel Krupa 2021-02-11 10:00:18 UTC
This is expected and in line with how Prometheus works. During a cluster upgrade, the Prometheus pods are rotated to the new version, which causes the new Prometheus process to replay its write-ahead log (WAL) from the old data if that data exists. Since this process is memory- and CPU-intensive, it can cause OOM kills if not enough RAM is available.

For the CI tests, a workaround would be to not attach a PVC to Prometheus or to increase the memory of the underlying nodes.
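
A rough sketch of how to confirm the WAL replay on restart (log wording can vary by Prometheus version) and where the PVC configuration lives:

# startup logs show the replay and how long it took
oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus | grep -i 'replay'

# the PVC is only attached when a volumeClaimTemplate is set for prometheusK8s
# in the cluster-monitoring-config ConfigMap; without one the data lives in an
# emptyDir and there is no WAL to replay when the pod is rotated
oc -n openshift-monitoring get configmap cluster-monitoring-config -o yaml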

Comment 8 Simon Pasquier 2021-02-11 10:30:04 UTC
It is expected that Prometheus memory increases during an upgrade. There are at least 2 factors explaining this:
* Most of the pods (if not all) are rescheduled, leading to new series being created (especially for kubelet and kube-state-metrics).
* Prometheus replays its write-ahead log (WAL) data to rebuild its in-memory state.

I didn't find that the number of series is significantly larger for OVN upgrade jobs compared to other upgrade jobs.

However, the release-openshift-origin-installer-e2e-aws-ovn-upgrade-4.6-stable-to-4.7-ci job provisions m5.large instances for the worker nodes (8GB of RAM each), so the total memory available for workers is about 24GB.

Other upgrade jobs [1] provision m4.xlarge instances (16GB each), which means the total memory of the workers is about 50GB.

I would recommend upgrading the worker instances for the OVN jobs and seeing if it resolves the issue.

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1359736918631256064
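
This kind of comparison can be made with standard Prometheus TSDB metrics; a sketch of the queries (run against the in-cluster Prometheus, not necessarily the exact ones used for the CI analysis):

# total number of in-memory series per replica
prometheus_tsdb_head_series

# rate at which new series are created, a proxy for churn during the upgrade
rate(prometheus_tsdb_head_series_created_total[5m])

# which scrape jobs add the most new series
topk(10, sum by (job) (scrape_series_added))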

Comment 9 Simon Pasquier 2021-02-11 10:32:44 UTC
Let's keep it open for now.

Comment 10 Vadim Rutkovsky 2021-02-11 11:38:18 UTC
> I would recommend upgrading the worker instances for the OVN jobs and see if it resolves the issue.

An 8GB worker is a supported configuration, and m5.large is the default instance type for compute nodes. This might resolve the failing prow jobs, but customers would hit it anyway.

Comment 19 Simon Pasquier 2021-02-15 09:32:14 UTC
Created attachment 1757047 [details]
RSS memory and in-use heap when Prometheus starts

I've downloaded the Prometheus data and replayed it on my local machine with GODEBUG=madvdontneed=1 (which should be the default for 4.7) to see if the WAL replay led to a memory spike that would explain the process being OOM killed.

The graph shows that after the WAL replay, the process uses about 3.2GB of memory. After 2 minutes (i.e. when garbage collection is forced), the in-use heap memory goes below 2GB, which is expected since Prometheus isn't scraping any targets. We can see that the OS slowly reclaims the unused memory from the process, and after a few minutes the RSS and heap size values get close to each other.
On a real cluster being upgraded, the process memory would continue to rise after the WAL replay because Prometheus would start scraping targets. However, garbage collection should kick in and reclaim the memory consumed during the WAL replay, keeping memory usage to a minimum.
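
Reproducing this locally looks roughly like the following (a sketch: the local data path and config file name are placeholders, the flags are standard Prometheus flags):

# copy the TSDB data (including the WAL) out of the pod
oc -n openshift-monitoring cp prometheus-k8s-0:/prometheus ./prometheus-data -c prometheus

# replay it locally with the same madvise behaviour as the 4.7 build
GODEBUG=madvdontneed=1 ./prometheus \
  --config.file=minimal.yml \
  --storage.tsdb.path=./prometheus-data

# then watch process_resident_memory_bytes and go_memstats_heap_inuse_bytes
# on http://localhost:9090/metrics for the RSS vs in-use heap comparison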

Comment 25 Simon Pasquier 2021-03-12 15:17:47 UTC
Moving back to POST. Other modifications are still needed.

Comment 29 Junqi Zhao 2021-06-15 09:38:20 UTC
Created attachment 1791215 [details]
4.7-4.8 upgrades with PVCs

Comment 30 Junqi Zhao 2021-06-15 09:55:06 UTC
Upgraded from 4.7.0-0.nightly-2021-06-12-151209 to 4.8.0-0.nightly-2021-06-14-145150; the bugs in Comment 27 are fixed and memory usage for the prometheus pods stays around 2-3Gi most of the time.
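
This can be verified with a query along these lines (standard cAdvisor metric, assuming the default openshift-monitoring namespace):

# working-set memory of the prometheus containers during and after the upgrade
max by (pod) (container_memory_working_set_bytes{namespace="openshift-monitoring", container="prometheus"})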

Comment 33 errata-xmlrpc 2021-07-27 22:40:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

