Recent AWS OVN 4.6 -> 4.7.ci upgrades have failed:
Jobs are timing out because the must-gather step takes too long:
>Pod e2e-aws-upgrade-gather-must-gather succeeded after 50m40s
This probably indicates a performance regression.
We found that the worker node was running out of RAM because Prometheus was taking up around 4 GB of RSS:
Tasks: 391 total, 6 running, 382 sleeping, 0 stopped, 3 zombie
%Cpu(s): 78.3 us, 19.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 2.7 hi, 0.0 si, 0.0 st
MiB Mem : 7664.7 total, 116.9 free, 7135.3 used, 412.5 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 155.9 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10410 nfsnobo+ 20 0 5018064 3.9g 0 R 0.2 51.5 5:19.79 prometheus
This would eventually cause the node to go NotReady. Killing the process brings the node back to Ready, and everything works until the Prometheus container restarts and balloons again to half of the worker's RAM. Moving this to the monitoring team to investigate.
@simon: do you mind taking a look and assessing the severity?
Possibly related - https://bugzilla.redhat.com/show_bug.cgi?id=1918683
This is believed to block all upgrades when OVN is in use. Marking as a blocker.
*** Bug 1927448 has been marked as a duplicate of this bug. ***
This is expected and in line with how Prometheus works. During a cluster upgrade, the Prometheus pods are rotated to the new version, which causes the new Prometheus process to replay its write-ahead log (WAL) from the older data if that data exists. Since this process is memory- and CPU-intensive, it can cause OOM kills if not enough RAM is available.
For the CI tests, the solution would be to not attach a PVC to Prometheus or to increase the underlying node's memory.
It is expected that Prometheus memory increases during an upgrade. There are at least 2 factors explaining this:
* Most of the pods (if not all) are rescheduled leading to new series being created (especially kubelet and kube-state-metrics)
* Prometheus replays its write-ahead-log data to rebuild its in-memory data.
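To put rough numbers on the series-churn factor, here is a back-of-envelope sketch. The per-series figure is a commonly cited rule of thumb and the series counts are illustrative assumptions, not measurements from this cluster:

```python
# Rough head-block sizing sketch. The ~8 KiB-per-series figure is a
# commonly cited rule of thumb for Prometheus 2.x; the series counts
# below are illustrative assumptions, not measurements from this bug.
BYTES_PER_SERIES = 8 * 1024

def head_memory_gib(active_series: int) -> float:
    """Approximate head memory in GiB for a given number of active series."""
    return active_series * BYTES_PER_SERIES / 2**30

# During an upgrade most pods are rescheduled, so old and new series
# coexist in the head block until it is compacted, roughly doubling
# the active series count.
steady = head_memory_gib(500_000)     # ~3.8 GiB
upgrade = head_memory_gib(1_000_000)  # ~7.6 GiB
```

On an 8GB worker, even a temporary doubling of this kind leaves little headroom for the WAL replay on top.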
I didn't find that the number of series is significantly larger for OVN upgrade jobs compared to other upgrade jobs.
However, the release-openshift-origin-installer-e2e-aws-ovn-upgrade-4.6-stable-to-4.7-ci job provisions m5.large instances for the worker nodes (i.e. 8GB of RAM each), so the total memory available for workers is about 24GB.
Other upgrade jobs provision m4.xlarge instances (16GB each), which means the total memory of the workers is about 48GB.
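The totals above follow from the per-instance sizes; a 3-worker cluster is assumed, inferred from the quoted totals:

```python
# Worker memory comparison between the two job configurations.
# A 3-worker cluster is assumed, matching the totals quoted above.
M5_LARGE_GIB = 8    # OVN upgrade job workers
M4_XLARGE_GIB = 16  # other upgrade job workers
WORKERS = 3

ovn_total = WORKERS * M5_LARGE_GIB     # 24 GiB for the OVN job
other_total = WORKERS * M4_XLARGE_GIB  # 48 GiB for the other jobs
```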
I would recommend upgrading the worker instances for the OVN jobs and see if it resolves the issue.
Let's keep it open for now.
> I would recommend upgrading the worker instances for the OVN jobs and see if it resolves the issue.
An 8GB worker is a supported configuration, and m5.large is the default setting for compute. This might fix the failing prowjobs, but customers would hit it anyway.
Created attachment 1757047 [details]
RSS memory and in-use heap when Prometheus starts
I've downloaded the Prometheus data and replayed it on my local machine with GODEBUG=madvdontneed=1 (which should be the default for 4.7) to see if the WAL replay leads to a memory spike that would explain the process being OOM-killed.
The graph shows that after the WAL replay, the process uses about 3.2GB of memory. After 2 minutes (i.e. after a forced garbage collection), the in-use heap goes below 2GB, which is expected since this Prometheus instance doesn't scrape any targets. We can see that the OS slowly reclaims the unused memory from the process, and after a few minutes the RSS and heap size values get close to each other.
On a real cluster being upgraded, the process memory would continue to rise after the WAL replay because Prometheus would start scraping targets. However, garbage collection should kick in and reclaim the memory consumed during the WAL replay, keeping memory usage to a minimum.
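The RSS-versus-heap gap described above can be read directly from the gauges Prometheus exposes on its own /metrics endpoint; a minimal parsing sketch follows (the sample values are illustrative, not taken from this run):

```python
# Parse the two gauges compared in the attached graph from Prometheus
# /metrics exposition text. Sample values are illustrative, not
# measurements from this bug.
SAMPLE = """\
go_memstats_heap_inuse_bytes 1.9e+09
process_resident_memory_bytes 3.2e+09
"""

def parse_metrics(text: str) -> dict:
    """Return {metric_name: value} for simple unlabeled gauge lines."""
    out = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, value = line.split()
            out[name] = float(value)
    return out

m = parse_metrics(SAMPLE)
# The gap is memory the Go runtime has freed but the OS has not yet
# reclaimed from the process.
gap_gib = (m["process_resident_memory_bytes"]
           - m["go_memstats_heap_inuse_bytes"]) / 2**30  # ~1.2 GiB
```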
Moving back to POST. There are other modifications that are needed.
Created attachment 1791215 [details]
4.7-4.8 upgrades with PVCs
Upgraded from 4.7.0-0.nightly-2021-06-12-151209 to 4.8.0-0.nightly-2021-06-14-145150; the bugs in Comment 27 are fixed, and memory usage for the Prometheus pods is about 2~3Gi most of the time.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.