Bug 1931896

Summary: Prometheus backed by a PVC may start consuming a lot of RAM after 4.6 -> 4.7 upgrade due to series churn
Product: OpenShift Container Platform
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: medium
Priority: urgent
Target Milestone: ---
Target Release: 4.7.z
Reporter: Simon Pasquier <spasquie>
Assignee: Jan Fajerski <jfajersk>
QA Contact: Junqi Zhao <juzhao>
Docs Contact:
CC: alegrand, alkazako, anbhat, anpicker, erooth, fdeutsch, hongyli, jeder, jfajersk, jluhrsen, juzhao, kakkoyun, lcosic, nmalik, pkrupa, rsandu, shzhou, spasquie, travi, trozet, vpickard, vrutkovs, wking
Keywords: Reopened, ServiceDeliveryImpact
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1925061
Environment:
Last Closed: 2021-05-25 15:08:47 UTC
Embargoed:
Bug Depends On: 1925061, 1945851, 1946597
Bug Blocks:

Description Simon Pasquier 2021-02-23 14:12:01 UTC
+++ This bug was initially created as a clone of Bug #1925061 +++

Recent AWS OVN 4.6 -> 4.7.ci upgrades have failed:
* https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-ovn-upgrade-4.6-stable-to-4.7-ci/1356171223960129536
*  https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-ovn-upgrade-4.6-stable-to-4.7-ci/1356917720284663808
* https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-ovn-upgrade-4.6-stable-to-4.7-ci/1356579523700723712
* https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-ovn-upgrade-4.6-stable-to-4.7-ci/1355087205629956096

--- Additional comment from Vadim Rutkovsky on 2021-02-04 20:42:52 UTC ---

Jobs are timing out because the must-gather step takes too long:
>Pod e2e-aws-upgrade-gather-must-gather succeeded after 50m40s

This probably indicates a performance regression.

--- Additional comment from Tim Rozet on 2021-02-04 23:29:00 UTC ---

We found that the worker node was running out of RAM because Prometheus was taking up around 4 GB of RSS:
Tasks: 391 total,  6 running, 382 sleeping,  0 stopped,  3 zombie 
%Cpu(s): 78.3 us, 19.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 2.7 hi, 0.0 si, 0.0 st 
MiB Mem :  7664.7 total,   116.9 free,  7135.3 used,   412.5 buff/cache 
MiB Swap:     0.0 total,     0.0 free,     0.0 used.   155.9 avail Mem 
   PID USER     PR NI   VIRT   RES   SHR S %CPU %MEM    TIME+ COMMAND                                                                                                                                               
 10410 nfsnobo+ 20  0 5018064  3.9g     0 R  0.2 51.5  5:19.79 prometheus            

This would eventually cause the node to go NotReady. Killing the process brings the node back to Ready and everything works until the Prometheus container restarts and balloons again to half of the worker's RAM. Moving this to the monitoring team to investigate.
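
For reference, a rough way to confirm the same picture from the cluster side (a sketch; it assumes the default openshift-monitoring namespace and that pod metrics are available to "oc adm top"):

# Per-container memory usage of the monitoring pods; the prometheus containers
# of prometheus-k8s-0/-1 are the ones ballooning here.
oc -n openshift-monitoring adm top pods --containers

# Check whether the affected worker reports memory pressure or goes NotReady
# (<worker-node> is a placeholder).
oc get nodes
oc describe node <worker-node> | grep -A1 -i memorypressure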

--- Additional comment from Sergiusz Urbaniak on 2021-02-05 10:57:26 UTC ---

@simon: do you mind having a look and assessing the severity?

--- Additional comment from Vadim Rutkovsky on 2021-02-05 12:03:54 UTC ---

Possibly related - https://bugzilla.redhat.com/show_bug.cgi?id=1918683

--- Additional comment from Scott Dodson on 2021-02-10 23:42:25 UTC ---

This is believed to block all upgrades when ovn is in use. Marking as a blocker.

--- Additional comment from Sergiusz Urbaniak on 2021-02-11 09:29:50 UTC ---



--- Additional comment from Pawel Krupa on 2021-02-11 10:00:18 UTC ---

This is expected and in line with how Prometheus works. During a cluster upgrade, the Prometheus pods are rotated to the new version; this causes the new Prometheus process to replay its write-ahead log (WAL) from the older data, if that data exists. Since this process is memory and CPU intensive, it can cause OOM kills if not enough RAM is available.

For CI tests, the solution would be either to not attach a PVC to Prometheus or to increase the underlying node memory.
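
For reference, the PVC attachment in question is driven by the cluster-monitoring-config ConfigMap; a minimal sketch of how to check it (the storage class and size shown are illustrative, not taken from these CI runs):

# If a volumeClaimTemplate is set under prometheusK8s in data.config.yaml,
# the TSDB (including the WAL) lives on the PVC and is replayed whenever the
# pod restarts, e.g. during an upgrade.
oc -n openshift-monitoring get configmap cluster-monitoring-config -o yaml

# The relevant fragment of data.config.yaml looks roughly like this
# (illustrative values):
#
#   prometheusK8s:
#     volumeClaimTemplate:
#       spec:
#         storageClassName: gp2
#         resources:
#           requests:
#             storage: 40Gi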

--- Additional comment from Simon Pasquier on 2021-02-11 10:30:04 UTC ---

It is expected that Prometheus memory increases during an upgrade. There are at least two factors explaining this:
* Most of the pods (if not all) are rescheduled, leading to new series being created (especially for kubelet and kube-state-metrics).
* Prometheus replays its write-ahead log (WAL) to rebuild its in-memory data.

I didn't find that the number of series is significantly larger for OVN upgrade jobs compared to other upgrade jobs.

However, the release-openshift-origin-installer-e2e-aws-ovn-upgrade-4.6-stable-to-4.7-ci job provisions m5.large instances for the worker nodes (i.e. 8 GB of RAM each), so the total memory available for workers is about 24 GB.

Other upgrade jobs [1] provision m4.xlarge instances (16 GB each), which means the total memory of the workers is about 50 GB.

I would recommend upgrading the worker instances for the OVN jobs and see if it resolves the issue.

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1359736918631256064
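
For reference, a minimal sketch of how the series counts can be compared, assuming port-forward access to the in-cluster Prometheus (pod name and port follow the default openshift-monitoring layout):

# prometheus_tsdb_head_series is the number that actually drives memory usage.
oc -n openshift-monitoring port-forward pod/prometheus-k8s-0 9090:9090 &
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=prometheus_tsdb_head_series'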

--- Additional comment from Simon Pasquier on 2021-02-11 10:32:44 UTC ---

Let's keep it open for now.

--- Additional comment from Vadim Rutkovsky on 2021-02-11 11:38:18 UTC ---

> I would recommend upgrading the worker instances for the OVN jobs and see if it resolves the issue.

An 8 GB worker is a supported configuration, and m5.large is the default setting for compute nodes. This might resolve the failing prow jobs, but customers would hit it anyway.

--- Additional comment from Sergiusz Urbaniak on 2021-02-11 13:21:57 UTC ---

To summarize:

- there is no concrete defect or memory leak in Prometheus, hence lowering severity to medium and setting blocker-
- the 8 GB limit is new to us; is there any OEP that we missed or support documentation that specifies this upper bound on memory?
- we do have a performance optimization epic, but it is deprioritized in favor of supporting OSD/layered service use cases and single node
- feel free to escalate to PM and/or mgmt if you feel this needs very urgent attention

In short, for the short term we advise raising the resources for those CI machines.

--- Additional comment from Aniket Bhat on 2021-02-11 22:13:18 UTC ---

FYI, this doesn't seem specific to OVN-K as the network plugin. I see the AWS upgrade job failing with similar Prometheus memory-related errors on OpenShift SDN as well.

For instance:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-stable-to-4.7-ci/1348495838053142528

--- Additional comment from Simon Pasquier on 2021-02-12 09:18:48 UTC ---

@Aniket, looking at the logs it doesn't seem to be the same issue. From the link you've pasted (Jan 11th 2021), CMO reports Degraded because the Thanos Query deployment doesn't reach the expected number of replicas (1/2). It should have been addressed by bug 1906496, which was fixed on Jan 22 2021 (for the 4.7 release).

--- Additional comment from Sergiusz Urbaniak on 2021-02-12 13:23:26 UTC ---

To have more technical backing, a couple of facts:

- the number of ingested metrics in CI runs did not rise significantly (we counted 1124 in 4.6 vs. 1130 in 4.7 for two CI runs).
- however, the number of series (which is the actual driver of resource usage) can increase significantly (as indicated by the prometheus_tsdb_head_series metric) if PVCs are attached during upgrades, as there is high pod churn.
- here, the apiserver_request_duration_seconds_bucket metric in particular is one of the biggest offenders, and it is amplified further by PVC usage.

The next two screenshots show the 4.5 -> 4.6 upgrade resource usage with emptyDir for the TSDB and the 4.6 -> 4.7 upgrade with PVCs attached. Note that the upgrade also takes significantly longer, causing a longer period of API requests and thus even more churn on the apiserver_request_duration_seconds_bucket metric.
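
A sketch of the kind of queries behind these numbers, assuming port-forward access to the in-cluster Prometheus as in the earlier sketch (the topk value and the time-range placeholders are illustrative):

# Top metric names by active series; apiserver_request_duration_seconds_bucket
# shows up near the top during upgrades because of the pod/apiserver churn.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, count by (__name__) ({__name__=~".+"}))'

# Series growth over the upgrade window (<start>/<end> are placeholders,
# RFC3339 or Unix timestamps).
curl -sG 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=prometheus_tsdb_head_series' \
  --data-urlencode 'start=<start>' --data-urlencode 'end=<end>' \
  --data-urlencode 'step=60'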

--- Additional comment from Sergiusz Urbaniak on 2021-02-12 13:26:38 UTC ---



--- Additional comment from Sergiusz Urbaniak on 2021-02-12 13:27:15 UTC ---



--- Additional comment from Sergiusz Urbaniak on 2021-02-12 14:48:24 UTC ---

@lindsey: as discussed, let's find a good place where we can document the behavior of Prometheus using a significant amount of resources when

a) Prometheus has a PVC attached
b) an upgrade happens


I am not sure where this can best be stated prominently, but we should make sure it is visible to users who have the above-mentioned PVC configuration.

--- Additional comment from jamo luhrsen on 2021-02-12 16:20:48 UTC ---

(In reply to Sergiusz Urbaniak from comment #11)
> To summarize:
> 
> - there is no concrete defect or memory leak in Prometheus, hence lowering
> severity to medium and setting blocker-
> - the 8 GB limit is new to us; is there any OEP that we missed or support
> documentation that specifies this upper bound on memory?
> - we do have a performance optimization epic, but it is deprioritized in
> favor of supporting OSD/layered service use cases and single node
> - feel free to escalate to PM and/or mgmt if you feel this needs very urgent
> attention
> 
> In short, for the short term we advise raising the resources for those CI
> machines.

I couldn't quickly find the 4.6 docs that indicate the minimum requirement
is 8 GB for nodes, but it's here for v3:
https://docs.openshift.com/container-platform/3.10/install/prerequisites.html#:~:text=Therefore%2C%20the%20recommended%20size%20of,and%2019%20GB%20of%20RAM.

If that's what we advertise, then I'd prefer that our non-stress/perf/scale CI
environments run on that minimum setup (at least some of them) so that we can
catch these types of issues before we learn about them from customers.
Maybe the answer is that our minimum requirement is no longer 8 GB?

--- Additional comment from Simon Pasquier on 2021-02-15 09:32:14 UTC ---

I've downloaded the Prometheus data and replayed it on my local machine with GODEBUG=madvdontneed=1 (which should be the default for 4.7) to see if the WAL replay led to a memory spike that would explain the process being OOM killed.

The graph shows that after the WAL replay, the process uses about 3.2 GB of memory. After 2 minutes (i.e. after a forced garbage collection), the in-use heap memory goes below 2 GB, which is expected since Prometheus doesn't scrape any targets. We can see that the OS slowly reclaims the unused memory from the process, and after a few minutes the RSS and heap size values get close to each other.
On a real cluster being upgraded, the process memory would continue to rise after the WAL replay because Prometheus would start scraping targets. However, garbage collection should kick in and reclaim the memory consumed during the WAL replay, keeping memory usage to a minimum.
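
A minimal sketch of this replay setup (the data directory and config file names are assumptions; the point is a config with no scrape targets so that only the WAL replay contributes to memory usage):

# Replay the downloaded TSDB locally with the same GODEBUG setting as 4.7.
# ./data is the copied Prometheus data directory; minimal.yml is an (assumed)
# config file with no scrape_configs.
GODEBUG=madvdontneed=1 ./prometheus \
  --config.file=minimal.yml \
  --storage.tsdb.path=./data

# Compare RSS against the in-use heap while the OS reclaims the madvise'd pages.
curl -s http://localhost:9090/metrics | \
  grep -E '^(process_resident_memory_bytes|go_memstats_heap_inuse_bytes) '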

--- Additional comment from Sergiusz Urbaniak on 2021-02-15 11:14:32 UTC ---

The general consensus is that we will work on the engineering side to alleviate the problem, but documentation would certainly help in the meantime.

--- Additional comment from Jeremy Eder on 2021-02-16 13:55:27 UTC ---

I believe the "bug" here is the fact that we don't default to PVC enabled for the monitoring operator.

--- Additional comment from Simon Pasquier on 2021-02-23 14:09:06 UTC ---

Changing target release to 4.8.0 since 4.7.0 is (almost) out. I'll clone the bug against 4.7.z.