Bug 2164593
| Summary: | High memory request (Windows VM) hitting KubevirtVmHighMemoryUsage alert | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Jenifer Abrams <jhopper> |
| Component: | Virtualization | Assignee: | Itamar Holder <iholder> |
| Status: | CLOSED ERRATA | QA Contact: | Denys Shchedrivyi <dshchedr> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.11.3 | CC: | acardace, fdeutsch, ibezukh, kbidarka, sradco |
| Target Milestone: | --- | | |
| Target Release: | 4.13.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | v4.13.1.rhel9-79 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-06-20 13:41:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Jenifer Abrams
2023-01-25 18:52:11 UTC
From a first glance, my concern here is that memory allocated to the kernel is also accounted. As we've already seen with similar bugs, this memory is not negligible and is charged to the container's memory. This memory, though, is reclaimable by the kernel when needed.

I see that this alert is calculated as `<container-request> - <container-working-set>`. @sradco - can you confirm this is really the case? Does it make sense in your opinion to also subtract `container_memory_cache` from this amount to better reflect reality? If we're talking about reclaimable memory only, then I would close this as "not a bug" and fix the alert.

For the "700Gi" memory request case, the virt-launcher pod gets `memory: 718448Mi`. Once the alert fires (it takes ~12 min of an idle guest / Windows initializing memory), I see:

```
cache 0
rss 0
rss_huge 0
shmem 0
mapped_file 0
dirty 0
writeback 0
swap 0
pgpgin 0
pgpgout 0
pgfault 0
pgmajfault 0
inactive_anon 0
active_anon 0
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 786715865088
hierarchical_memsw_limit 9223372036854771712
total_cache 8105984
total_rss 751762849792
total_rss_huge 751654928384
total_shmem 32768
total_mapped_file 4079616
total_dirty 0
total_writeback 0
total_swap 0
total_pgpgin 425764
total_pgpgout 39523
total_pgfault 455951
total_pgmajfault 17
total_inactive_anon 731076292608
total_active_anon 20690808832
total_inactive_file 1478656
total_active_file 6594560
total_unevictable 0
```

```
request: 718448 * 1024 * 1024 = 753,347,330,048
memory.usage_in_bytes         = 756,951,343,104
memory.kmem.usage_in_bytes    =   5,176,557,568

memory.usage_in_bytes - request: 756,951,343,104 - 753,347,330,048 = 3,604,013,056
```

A bit later (usage changes slightly over time):

```
container_memory_working_set_bytes{pod="virt-launcher-win10-full-4pcrf"} = 756,951,277,568
container_memory_cache{pod="virt-launcher-win10-full-4pcrf"}             =       8,105,984

request - working_set - cache: 753,347,330,048 - 756,951,277,568 - 8,105,984 = -3,612,053,504
```

The metric output is actually showing a value of -3.6G? (see screenshot) I know there are currently no kmem.usage metrics, but then again kmem counts towards the cgroup limit, so I'm not sure whether the alert should take it into account. I will send an email with access info as well.

Thanks a lot @jhopper!

First, regarding the free-memory metric calculation (`kubevirt_vm_container_free_memory_bytes_based_on_working_set_bytes`): the working set is calculated as follows (see the cAdvisor code [1]):

```
working set = memory.usage_in_bytes - total_inactive_file
            = 756,951,343,104 - 1,478,656
            = 756,949,864,448
```

The metric is then calculated as [2]:

```
free-memory-metric = request - working set
                   = 753,347,330,048 - 756,949,864,448
                   = -3,602,534,400
```

While this metric can be improved (e.g. by subtracting kernel usage, cache data, etc.), I think the calculation is actually correct and makes sense. It basically says that the free memory is -3.6G, i.e. the working set exceeds the request by 3.6G.

[1] https://github.com/google/cadvisor/blob/v0.47.1/container/libcontainer/handler.go#L836
[2] https://github.com/kubevirt/kubevirt/blob/v0.59.0-rc.1/pkg/virt-operator/resource/generate/components/prometheus.go#L451

Regarding the memory allocation itself: it seems the vast majority of the memory is used by anonymous pages. In the environment you tested on, IIUC, swap is disabled, so this memory cannot be reclaimed at all. But even if swap were enabled, 750+ GB is much more than the usual swap capacity.
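To make the arithmetic above easier to follow, here is a minimal standalone sketch (not KubeVirt code; the program and variable names are illustrative) that reproduces the working-set and free-memory calculation from the cgroup numbers quoted above, following the formulas referenced in [1] and [2]:

```go
// Illustrative sketch only: reproduces the arithmetic behind
// kubevirt_vm_container_free_memory_bytes_based_on_working_set_bytes
// for the numbers reported in this bug.
package main

import "fmt"

func main() {
	// Values taken from the virt-launcher cgroup in this report.
	request := int64(718448) * 1024 * 1024 // memory request: 718448Mi in bytes
	usageInBytes := int64(756_951_343_104) // memory.usage_in_bytes
	totalInactiveFile := int64(1_478_656)  // total_inactive_file from memory.stat

	// Working set as computed by cAdvisor: usage minus inactive file pages.
	workingSet := usageInBytes - totalInactiveFile

	// Free-memory metric as computed by KubeVirt: request minus working set.
	// A negative value means the working set already exceeds the request.
	free := request - workingSet

	fmt.Printf("request:       %d bytes\n", request)
	fmt.Printf("working set:   %d bytes\n", workingSet)
	fmt.Printf("free (metric): %d bytes (~%.1f GB)\n", free, float64(free)/1e9)
}
```

Running it prints a free value of roughly -3.6 GB, matching the metric output shown in the screenshot.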
It seems that QEMU allocates a lot of internal memory, probably used for internal caches, buffers and data structures. We already know that we don't calculate the virt infrastructure overhead accurately, so it makes sense that this problem does not scale well with huge memory amounts. In other words, the inaccuracy grows with the amount of memory allocated to the guest. This is similar to another bug [3] about the same issue. This PR [4] serves as a good first aid for the problem. We do need to improve our monitoring and overhead calculation, but as explained in the PR, it will never be fully accurate.

[3] https://bugzilla.redhat.com/show_bug.cgi?id=2165618
[4] https://github.com/kubevirt/kubevirt/pull/9322

Thanks for the explanation! I agree it is difficult to calculate the correct overhead for all cases; maybe over time we can document some examples, and having the new headroom setting from the PR available will be a good way to add an extra buffer. Glad the HighMem metric is working as expected: it provides a clue that extra buffer is a good idea, hopefully warning admins before any real memory-pressure scenarios.

Deferring to 4.13.1 due to capacity.

Verified on CNV-v4.13.1.rhel9-79: the alert still fires for a VM with 700Gi of memory (with the default `additionalGuestMemoryOverheadRatio` parameter). However, the metric output is better than before: -700M now instead of -3.6G (screenshot is attached). With `additionalGuestMemoryOverheadRatio=2` I don't see the alert firing; the metric shows more than +1G of free memory (screenshot attached).

Hey all,

Thanks @dshchedr for verifying! These results are excellent and completely expected. As written in the PR [4]:

> Not only that this overhead currently suffers from known issues and non-accurate calculation which needs to be fixed - this calculation is in essence an educated guess / estimation, and not an accurate calculation. The reason is that even if a careful profiling will take place (which is a very difficult task to do, since the environments on which we would profile makes the results biased), there are still many components we cannot control, e.g. kernel drivers, kernel configuration, inner QEMU buffer allocations, etc.
>
> To solve this problem, we need to both keep improving the overhead estimations, but also provide a solution for the cluster admin to explicitly add some overhead.

In other words, the only thing KubeVirt can do is make an educated guess. While this guess works fine for many cases, it doesn't for others, especially when a huge amount of memory is allocated to the VM. Because we're aware that the overhead amount is not accurate (and never will be), we've introduced `additionalGuestMemoryOverheadRatio`, which is shown to be effective here and practically solves the problem.

Thanks Itamar for the confirmation! Moving this BZ to the Verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Virtualization 4.13.1 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:3686

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
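For reference, here is a minimal sketch of how the `additionalGuestMemoryOverheadRatio` setting used during verification might be applied. The field placement below is an assumption based on PR [4] (it targets `spec.configuration` of the KubeVirt CR); in a CNV deployment the HyperConverged CR may expose an equivalent field, so verify against the documentation for your version:

```yaml
# Sketch only: field placement assumed from kubevirt/kubevirt PR #9322;
# check your KubeVirt/CNV version before applying.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    # Multiplies KubeVirt's estimated guest memory overhead, giving
    # virt-launcher extra headroom beyond the built-in calculation.
    additionalGuestMemoryOverheadRatio: "2"
```

As I understand the PR, the ratio multiplies the computed overhead, so a value of "2" roughly doubles the headroom virt-launcher receives, which matches the configuration that stopped the alert from firing in the verification above.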