Bug 2232120 - Estimate typical and maximum memory usage based on domxml [NEEDINFO]
Summary: Estimate typical and maximum memory usage based on domxml
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: libvirt
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Assignee: Jaroslav Suchanek
QA Contact: liang cong
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-08-15 13:00 UTC by Dan Kenigsberg
Modified: 2023-08-17 13:21 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:
kkiwi: needinfo? (danken)
kkiwi: needinfo? (pbonzini)




Links
Red Hat Issue Tracker RHELPLAN-165836 (last updated 2023-08-15 13:02:47 UTC)

Description Dan Kenigsberg 2023-08-15 13:00:35 UTC
Description of problem:

KubeVirt runs libvirt+qemu in a Pod called virt-launcher. To be a good citizen of Kubernetes, KubeVirt has to declare ahead of time how much memory virt-launcher requests. KubeVirt currently does that in a very coarse fashion, guesstimating how much memory virt-launcher+libvirt+qemu+etc. typically requires.

Things become even trickier in multi-tenant Kubernetes clusters, where a cluster-admin enforces memory limits on namespaces to limit the maximum noise one tenant can cause to others. In this case, KubeVirt has to estimate ahead of time the memory limit of virt-launcher, above which the Pod would be killed. Overestimation wastes resources; underestimation causes premature OOM kills.
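
For illustration, this is roughly the shape of the declaration KubeVirt has to make up front - a minimal sketch using the Kubernetes Go API, with made-up numbers; the real virt-launcher overhead calculation is more involved:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Made-up numbers: 2Gi of guest RAM plus a guessed overhead
	// for libvirt, QEMU and the auxiliary processes.
	request := resource.MustParse("2Gi")
	overhead := resource.MustParse("256Mi")
	request.Add(overhead)

	resources := corev1.ResourceRequirements{
		Requests: corev1.ResourceList{corev1.ResourceMemory: request},
		// In a namespace with enforced limits, the limit must also be
		// declared up front; the Pod is OOM-killed if it exceeds it.
		Limits: corev1.ResourceList{corev1.ResourceMemory: request},
	}
	fmt.Println(resources.Requests.Memory().String()) // prints "2304Mi"
}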

KubeVirt's estimates are likely to be configurable: some cluster-admins would care more about protecting their resources from rogue bursts, while others would care more about keeping their bursting VMs alive.

To improve KubeVirt's estimation of requested memory and memory limits, I would like libvirt to provide functions similar to

int estimateTypicalMemoryUse(domxml)
int estimateMaximumMemoryUse(domxml)

Users may want to add an "int generosity" argument, expressing on a range from 1 to 100 how generous they feel about this specific VM, i.e., how much RAM they are willing to pay to keep it alive.

Super-smart users may be able to ignore KubeVirt's libvirt-based recommendation, but most VM users are unlikely to know better than us how much RAM our software consumes.
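
To make the idea concrete, here is a sketch of how a consumer might combine the two estimates with such a knob - purely hypothetical: neither function exists in libvirt, and the interpolation formula and numbers are invented for illustration:

package main

import "fmt"

// Stand-ins for the hypothetical libvirt APIs proposed above;
// the return values (in MiB) are made up.
func estimateTypicalMemoryUse(domxml string) int { return 2304 }
func estimateMaximumMemoryUse(domxml string) int { return 3072 }

// podMemoryLimit interpolates between the typical and maximum estimates:
// generosity=1 sits near the typical figure, generosity=100 at the maximum.
func podMemoryLimit(domxml string, generosity int) int {
	typical := estimateTypicalMemoryUse(domxml)
	max := estimateMaximumMemoryUse(domxml)
	return typical + (max-typical)*generosity/100
}

func main() {
	fmt.Println(podMemoryLimit("<domain>...</domain>", 50)) // prints 2688
}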

Comment 1 Klaus Heinrich Kiwi 2023-08-15 13:55:41 UTC
FYI, this has been a long running topic of discussion.

My understanding is that there are several challenges to accomplishing this the way Dan suggested (i.e., with a Libvirt API):

 * Even if Libvirt can theoretically do the math (well, CNV could too), Libvirt simply doesn't own the information of how much memory each type of device + machine type is "theoretically" using
 * It gets even worse if we consider QEMU/libvirt dependencies - i.e., a libc/gnutls/gcc/libiscsi change can create additional memory requirements. Even RUNTIME configurations, like the availability of caches, can influence how much memory pressure something has in practice
 * Getting that information from QEMU would require a new QAPI, but...
 * ... QEMU would need to maintain that "static table" somewhere, and there's no current process for reaching that determination across platforms / build configurations and maintaining it.

So during our last 1x1, Amnon and I discussed an interesting alternative:

 What if, instead of trying to figure out the theoretical maximum envelope through a process / API, we could empirically make that determination during Downstream productization, for the relatively few scenarios / permutations that CNV is interested in?

I.e., QE can do a scale test and reach a determination of the max config for each machine type we care about. Arguably, this needs to be done somewhere anyway to ensure we can support customers in these configurations.

My proposal is a partnership between Virt QE and CNV QE:

1) CNV should tell us what machine types and devices they are interested in.
2) Virt QE would need to do capacity tests on the valid permutations of the pieces listed above. While doing that (and validating that they work and can be supported), they would also profile the tests and document the amount of memory used.
3) CNV QE takes that information, validates additional things like upgrade paths (e.g., upgrading CNV from RHEL 9.0 to RHEL 9.2 may have different memory requirements / pressures, which may require adapting any eventual hard resource limits) and builds a table.
4) We document all that in KBs and official documentation, but CNV can optionally use that data to provide hints to customers about sensible limits for their chosen VMs - at creation, hot-plugging/unplugging, migration and upgrade paths.

I also suggest that this is fundamentally the same solution for other limit / max config scenarios as well, such as https://issues.redhat.com/browse/CNV-31831

Your thoughts, Dan Ken, Michal, Daniel Berrange, Paolo Bonzini, etc.?

Comment 2 Michal Privoznik 2023-08-15 15:54:37 UTC
First and foremost, it's impossible to tell how much memory a program is going to need for a given input (Turing machines, undecidable problems and stuff). Now, it's true that the statement is about a "general" program and QEMU is not a general program. But on the other hand, we do not know the input. I mean, memory consumption can vary, and indeed will. What Klaus suggests might work, but only to some extent. Our QEs surely won't run VMs the same way our customers do (active use of services inside the VM, say a DB server). Therefore, any estimate we come up with will need some adjusting, at which point it's as good as an estimate pulled out of thin air.

Libvirt has fought/is fighting this problem on two fronts:

1) it used to calculate the hard limit (the peak memory QEMU is allowed to use), and
2) it is trying to estimate the memlock limit.

These two are fundamentally different, though. While any wrong estimate in 1) means the OOM killer kills QEMU (because the hard limit is set in cgroups), a (slightly) wrong estimate in 2) means the guest can still run.
We used to do 1), but because it was impossible to get right we stopped doing that and documented the following:

hard_limit

    The optional hard_limit element is the maximum memory the guest can use. The units for this value are kibibytes (i.e. blocks of 1024 bytes). Users of QEMU and KVM are strongly advised not to set this limit as domain may get killed by the kernel if the guess is too low, and determining the memory needed for a process to run is an undecidable problem.

Now, for 2) we still keep up the fight, although we take many shortcuts (because we can overshoot the actual limit without the kernel killing QEMU). The code lives here (function qemuDomainGetMemLockLimitBytes()):

https://gitlab.com/libvirt/libvirt/-/blob/master/src/qemu/qemu_domain.c?ref_type=heads#L9661

and KubeVirt can take inspiration from it.
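
As a taste of what that code does, here is a drastically simplified sketch in Go - the 1 GiB headroom for VFIO passthrough mirrors the constant libvirt uses on x86, while the function shape and the non-passthrough default are invented for illustration:

package main

import "fmt"

const gib = 1024 * 1024 * 1024

// memLockLimitBytes mimics the kind of shortcut libvirt takes in
// qemuDomainGetMemLockLimitBytes(): overshooting is fine, because
// exceeding the memlock limit does not get the guest killed.
func memLockLimitBytes(guestMemoryBytes uint64, hasVFIODevice bool) uint64 {
	if hasVFIODevice {
		// With VFIO passthrough all guest RAM must be lockable for DMA,
		// plus roughly 1GiB of headroom for the device's IO space.
		return guestMemoryBytes + 1*gib
	}
	// Without passthrough only a little memory needs locking; a fixed
	// default is good enough (64MiB here, purely illustrative).
	return 64 * 1024 * 1024
}

func main() {
	fmt.Println(memLockLimitBytes(4*gib, true)) // prints 5368709120 (5GiB)
}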


Another reason why this should not be in libvirt: if the API returns a wrong value, the roundtrip between fixing it in libvirt and KubeVirt/CNV picking up the fix is needlessly long, leaving customers unable to run their machines in the meantime. Compare this to the situation where the guessing algorithm lives in KubeVirt and the fix is distributed just there.

And a final thought - from an upstream POV, QEMU can be compiled with/without support for a lot of stuff, which affects what libraries are then linked into the QEMU binary. This would lead to spaghetti code. And don't forget that the API would need to account for libvirt's own memory consumption (!) and for all the helper processes.

Comment 3 John Ferlan 2023-08-15 21:32:13 UTC
In a former career I had a part in creating https://support.hpe.com/hpesc/public/docDisplay?docId=c02157777&docLocale=en_US - lots of hours were put into profiling characteristics, but we controlled the whole stack, which is not the case here. It also changed from release to release - very painful to keep up to date. New device types brought the need for updated calculations, but invariably we didn't always have the resources. In one instance, adding a particular device added a set amount of memory for each device - way cool, until some magic number was reached and the kernel needed to allocate 1G more of memory to map something it used to track device data. We got lucky finding that, but then had to figure out how to document it.

It'd be great to have specifics, but I question whether we have "all" the "resources" necessary to create the "rules" that are desired and then keep them up to date.

Comment 4 Daniel Berrangé 2023-08-17 13:21:49 UTC
(In reply to Klaus Heinrich Kiwi from comment #1)
> 1) CNV should tell us what machine types and devices they are interested in.
> 2) Virt QE would need to do capacity tests on the valid permutations of the
> pieces listed above. While doing that (and validating that they work and can
> be supported), they would also profile the tests and document the amount of
> memory used.

"tests on the valid permutations" is really glossing over a massive amount of work here.

RHEL already ships a finite number of machine types and devices.  CNV uses a subset of these.

Even with this subset, the number of possible permutations is going to be incredibly large.

Add into that the number of possible *configurations* of these permutations, along with possible guest workloads, and you have effectively an infinite number of possibilities.

IOW, we can't test our way to a solution for determining "maximum memory usage", unless we're expecting the tested outcome to be "unlimited memory usage" - which is what libvirt decided it was.

