Bug 2164593
| Summary: | High memory request (Windows VM) hitting KubevirtVmHighMemoryUsage alert | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Jenifer Abrams <jhopper> |
| Component: | Virtualization | Assignee: | Itamar Holder <iholder> |
| Status: | CLOSED ERRATA | QA Contact: | Denys Shchedrivyi <dshchedr> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.11.3 | CC: | acardace, fdeutsch, ibezukh, kbidarka, sradco |
| Target Milestone: | --- | | |
| Target Release: | 4.13.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | v4.13.1.rhel9-79 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-06-20 13:41:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Jenifer Abrams
2023-01-25 18:52:11 UTC
From a first glance, my concern here is that memory allocated to the kernel is also accounted. As we've already seen with similar bugs, this memory is not negligible and is charged to the container's memory. This memory, though, is reclaimable by the kernel when needed.

I see that this alert is calculated as `<container-request> - <container-working-set>`. @sradco - can you confirm this is really the case? Does it make sense in your opinion to also subtract `container_memory_cache` from this amount to better reflect reality? If we're talking about reclaimable memory only, then I would close this as "not a bug" and fix the alert.

For the "700Gi" memory request case, the virt-launcher pod gets `memory: 718448Mi`. Once the alert fires (it takes ~12 min of an idle guest / Windows initializing memory), I see:

```
cache 0
rss 0
rss_huge 0
shmem 0
mapped_file 0
dirty 0
writeback 0
swap 0
pgpgin 0
pgpgout 0
pgfault 0
pgmajfault 0
inactive_anon 0
active_anon 0
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 786715865088
hierarchical_memsw_limit 9223372036854771712
total_cache 8105984
total_rss 751762849792
total_rss_huge 751654928384
total_shmem 32768
total_mapped_file 4079616
total_dirty 0
total_writeback 0
total_swap 0
total_pgpgin 425764
total_pgpgout 39523
total_pgfault 455951
total_pgmajfault 17
total_inactive_anon 731076292608
total_active_anon 20690808832
total_inactive_file 1478656
total_active_file 6594560
total_unevictable 0
```

```
request: 718448 * 1024 * 1024 = 753,347,330,048
memory.usage_in_bytes         = 756,951,343,104
memory.kmem.usage_in_bytes    =   5,176,557,568

memory.usage_in_bytes - request: 756,951,343,104 - 753,347,330,048 = 3,604,013,056
```

A bit later (usage changes slightly over time):

```
container_memory_working_set_bytes{pod="virt-launcher-win10-full-4pcrf"} = 756,951,277,568
container_memory_cache{pod="virt-launcher-win10-full-4pcrf"}             =       8,105,984

request - working_set - cache: 753,347,330,048 - 756,951,277,568 - 8,105,984 = -3,612,053,504
```

The metric output is actually showing a value of -3.6G? (see screenshot) I know there are currently no kmem.usage metrics, but then again kmem counts towards the cgroup limit, so I'm not sure whether the alert should take it into account. I will send an email with access info as well.

Thanks a lot @jhopper!

First, regarding the free-memory metric calculation (`kubevirt_vm_container_free_memory_bytes_based_on_working_set_bytes`): the working set is calculated as follows (see the cAdvisor code [1]):

```
working set = memory.usage_in_bytes - total_inactive_file
            = 756,951,343,104 - 1,478,656
            = 756,949,864,448
```

The metric is then calculated as [2]:

```
free-memory-metric = request - working set
                   = 753,347,330,048 - 756,949,864,448
                   = -3,602,534,400
```

While this metric can be improved (e.g. by subtracting kernel usage, cache data, etc.), I think the calculation is actually correct and makes sense. It basically says that the free memory is -3.6G, i.e. the working set exceeds the request by 3.6G.

[1] https://github.com/google/cadvisor/blob/v0.47.1/container/libcontainer/handler.go#L836
[2] https://github.com/kubevirt/kubevirt/blob/v0.59.0-rc.1/pkg/virt-operator/resource/generate/components/prometheus.go#L451

Regarding the memory allocation itself: it seems the vast majority of the memory is used by anonymous pages. In the environment you tested on, IIUC, swap is disabled, so this memory cannot be reclaimed at all. But even if swap were enabled, 750+ GB is much more than the usual swap capacity.
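To make the arithmetic above easier to follow, here is a minimal standalone sketch (not KubeVirt code; the program and variable names are illustrative) that reproduces the working-set and free-memory calculation from the cgroup numbers quoted above, following the formulas referenced in [1] and [2]:

```go
// Illustrative sketch only: reproduces the arithmetic behind
// kubevirt_vm_container_free_memory_bytes_based_on_working_set_bytes
// for the numbers reported in this bug.
package main

import "fmt"

func main() {
	// Values taken from the virt-launcher cgroup in this report.
	request := int64(718448) * 1024 * 1024 // memory request: 718448Mi in bytes
	usageInBytes := int64(756_951_343_104) // memory.usage_in_bytes
	totalInactiveFile := int64(1_478_656)  // total_inactive_file from memory.stat

	// Working set as computed by cAdvisor: usage minus inactive file pages.
	workingSet := usageInBytes - totalInactiveFile

	// Free-memory metric as computed by KubeVirt: request minus working set.
	// A negative value means the working set already exceeds the request.
	free := request - workingSet

	fmt.Printf("request:       %d bytes\n", request)
	fmt.Printf("working set:   %d bytes\n", workingSet)
	fmt.Printf("free (metric): %d bytes (~%.1f GB)\n", free, float64(free)/1e9)
}
```

Running it prints a free value of roughly -3.6 GB, matching the metric output shown in the screenshot.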
It seems that QEMU allocates a lot of internal memory, probably used for internal caches, buffers and data structures. We already know that we don't calculate the virt infrastructure overhead accurately, so it makes sense that this problem does not scale well with huge memory amounts. In other words, the inaccuracy grows with the amount of memory allocated to the guest. This is similar to another bug [3] about the same issue. This PR [4] serves as a good first aid for the problem. We do need to improve our monitoring and overhead calculation, but as explained in the PR, it will never be fully accurate.

[3] https://bugzilla.redhat.com/show_bug.cgi?id=2165618
[4] https://github.com/kubevirt/kubevirt/pull/9322

Thanks for the explanation! I agree it is difficult to calculate the correct overhead for all cases; maybe over time we can document some examples, and having the new headroom setting from the PR available will be a good way to add an extra buffer. Glad the HighMem metric is working as expected: it provides a clue that extra buffer is a good idea, hopefully warning admins before any real memory-pressure scenarios.

Deferring to 4.13.1 due to capacity.

Verified on CNV-v4.13.1.rhel9-79: the alert still fires for a VM with 700Gi of memory (with the default `additionalGuestMemoryOverheadRatio` parameter). However, the metric output is better than before: -700M now instead of -3.6G (screenshot is attached). With `additionalGuestMemoryOverheadRatio=2` I don't see the alert firing; the metric shows more than +1G of free memory (screenshot attached).

Hey all,

Thanks @dshchedr for verifying! These results are excellent and completely expected. As written in the PR [4]:

> Not only that this overhead currently suffers from known issues and non-accurate calculation which needs to be fixed - this calculation is in essence an educated guess / estimation, and not an accurate calculation. The reason is that even if a careful profiling will take place (which is a very difficult task to do, since the environments on which we would profile makes the results biased), there are still many components we cannot control, e.g. kernel drivers, kernel configuration, inner QEMU buffer allocations, etc.
>
> To solve this problem, we need to both keep improving the overhead estimations, but also provide a solution for the cluster admin to explicitly add some overhead.

In other words, the only thing KubeVirt can do is make an educated guess. While this guess works fine for many cases, it doesn't for others, especially when a huge amount of memory is allocated to the VM. Because we're aware that the overhead amount is not accurate (and never will be), we've introduced `additionalGuestMemoryOverheadRatio`, which is shown to be effective here and practically solves the problem.

Thanks Itamar for the confirmation! Moving this BZ to the Verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Virtualization 4.13.1 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:3686

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
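For reference, here is a minimal sketch of how the `additionalGuestMemoryOverheadRatio` setting used during verification might be applied. The field placement below is an assumption based on PR [4] (it targets `spec.configuration` of the KubeVirt CR); in a CNV deployment the HyperConverged CR may expose an equivalent field, so verify against the documentation for your version:

```yaml
# Sketch only: field placement assumed from kubevirt/kubevirt PR #9322;
# check your KubeVirt/CNV version before applying.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    # Multiplies KubeVirt's estimated guest memory overhead, giving
    # virt-launcher extra headroom beyond the built-in calculation.
    additionalGuestMemoryOverheadRatio: "2"
```

As I understand the PR, the ratio multiplies the computed overhead, so a value of "2" roughly doubles the headroom virt-launcher receives, which matches the configuration that stopped the alert from firing in the verification above.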