Bug 2232120
| Summary: | Estimate typical and maximum memory usage based on domxml | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Dan Kenigsberg <danken> |
| Component: | libvirt | Assignee: | Jaroslav Suchanek <jsuchane> |
| libvirt sub component: | CLI & API | QA Contact: | liang cong <lcong> |
| Status: | NEW --- | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | ailan, berrange, kkiwi, lmen, lpivarc, mprivozn, pbonzini, rjones, virt-maint |
| Version: | unspecified | Keywords: | Triaged |
| Target Milestone: | rc | Flags: | kkiwi: needinfo? (danken), kkiwi: needinfo? (pbonzini) |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Dan Kenigsberg
2023-08-15 13:00:35 UTC
FYI, this has been a long running topic of discussion. My understanding is that there are several challenges to accomplishing this the way Dan suggested (i.e., with a Libvirt API):

* Even if Libvirt could theoretically do the math (well, CNV could too), Libvirt simply doesn't own the information of how much memory each type of device + machine type is "theoretically" using.
* It gets even worse if we consider QEMU/libvirt dependencies - i.e., a libc/gnutls/gcc/libiscsi change can create additional memory requirements. Even RUNTIME configuration, like the availability of caches, etc., can influence how much memory pressure something has in practice.
* Getting that information from QEMU would require a new QAPI, but...
* ...QEMU would have to maintain that "static table" somewhere, and there is no current process for reaching that determination across platforms / build configurations and maintaining it.

So during our last 1x1, Amnon and I discussed an interesting alternative: what if, instead of trying to figure out the theoretical maximum envelope through a process / API, we made that determination empirically during downstream productization, for the relatively few scenarios / permutations that CNV is interested in? I.e., QE can do a scale test and reach a determination of the max config for each machine type we care about. Theoretically, this needs to be done somewhere anyway to ensure we can support customers in these configurations.

My proposal is a partnership between Virt QE and CNV QE:

1) CNV should tell us what machine types and devices they are interested in.
2) Virt QE would need to do capacity tests on the valid permutations of those pieces pointed above. While doing it (and validating that they work and can be supported), they also profile the test and document the amount of memory used.
3) CNV QE takes that information, validates additional things like upgrade paths (e.g., upgrading CNV from RHEL 9.0 to RHEL 9.2 may have different memory requirements / pressures, requiring any hard limiting of resources to be adapted), and builds a table.
4) We document all that in KBs and official documentation, but CNV can optionally use that data to provide hints to customers about sensible limits for their chosen VMs - at creation, hot-plugging/unplugging, migration and upgrade paths.

I also suggest that this is fundamentally the same solution for other limit / max config scenarios as well, such as https://issues.redhat.com/browse/CNV-31831

Your thoughts, Dan Ken, Michal, Daniel Berrange, Paolo Bonzini, etc.?

First and foremost, it is impossible to tell how much memory a program is going to need for a given input (Turing machines, the undecidability problem, and such). Now, it is true that the statement is about a "general" program and QEMU is not a general program. But on the other hand, we do not know the input. I mean, memory consumption can vary, and indeed will. What Klaus suggests might work, but only to some extent. Our QEs surely won't run VMs the same way our customers do (active use of services inside the VM, say a DB server). Therefore, any estimate we come up with will need some adjusting, at which point it's as good as an estimate pulled out of thin air.
Libvirt has fought/is fighting this problem on two fronts:
1) it used to calculate hard limit (the peak memory QEMU is allowed to use), and
2) it is trying to estimate the memlock limit.
These two are fundamentally different, though. While any wrong estimate in 1) means the OOM killer kills QEMU (because the hard limit is set in cgroups), a (slightly) wrong estimate in 2) means the guest can still run.
We used to do 1), but because it was impossible to get right, we stopped doing that and documented the following:
hard_limit
The optional hard_limit element is the maximum memory the guest can use. The units for this value are kibibytes (i.e. blocks of 1024 bytes). Users of QEMU and KVM are strongly advised not to set this limit as domain may get killed by the kernel if the guess is too low, and determining the memory needed for a process to run is an undecidable problem.
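For reference, this is where the hard limit from 1) lives in the domain XML; the snippet is only illustrative and the value is just a placeholder:

```xml
<memtune>
  <!-- strongly discouraged: if the guess is too low, QEMU gets OOM-killed -->
  <hard_limit unit='KiB'>4718592</hard_limit>
</memtune>
```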
Now, for 2) we still keep up the fight, although we take many shortcuts (because we can overshoot the actual limit without the kernel killing QEMU). The code lives here (function qemuDomainGetMemLockLimitBytes()):
https://gitlab.com/libvirt/libvirt/-/blob/master/src/qemu/qemu_domain.c?ref_type=heads#L9661
and KubeVirt can take inspiration from it.
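For illustration only, here is a minimal sketch in Go (since the consumer here would be KubeVirt) of the kind of deliberately pessimistic heuristic such a consumer could maintain on its own side. The type, field names and constants are assumptions made up for this example; they are not taken from qemuDomainGetMemLockLimitBytes().

```go
package main

import "fmt"

// domainSpec is a stand-in for the few domain properties the estimate
// below looks at; a real consumer would read these from its own VM spec.
type domainSpec struct {
	guestMemoryKiB uint64 // current <memory> of the guest, in KiB
	hasVFIODevices bool   // any host device passthrough present
	vcpuCount      uint64
}

// estimateMemLockKiB returns a deliberately generous upper bound for the
// memlock limit: overshooting is acceptable (the guest still runs), while
// undershooting can make device assignment fail.
func estimateMemLockKiB(d domainSpec) uint64 {
	limit := d.guestMemoryKiB
	if d.hasVFIODevices {
		// Assume all guest RAM may get pinned for DMA; the extra
		// 1 GiB of slack for bookkeeping is an assumed safety margin.
		limit += 1024 * 1024
	}
	// Assumed fixed per-vCPU overhead for emulator threads and queues.
	limit += d.vcpuCount * 8 * 1024
	return limit
}

func main() {
	vm := domainSpec{guestMemoryKiB: 4 * 1024 * 1024, hasVFIODevices: true, vcpuCount: 4}
	fmt.Printf("suggested memlock limit: %d KiB\n", estimateMemLockKiB(vm))
}
```

The point of keeping such a heuristic in the consumer is exactly the one made below: when the guess turns out wrong, the fix ships with the consumer instead of waiting on a libvirt update.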
Another reason why this should not be in libvirt: if the API returns a wrong value, then a roundtrip between fixing it in libvirt and KubeVirt/CNV picking up the fix is needed, which is needlessly long and leaves customers unable to run their machines in the meantime. Compare this to the situation where the guessing algorithm lives in KubeVirt and the fix is distributed just there.
And a final thought - from an upstream POV, QEMU can be compiled with or without support for a lot of features, which affects which libraries are then linked into the QEMU binary. Accounting for all of that would lead to spaghetti code. And don't forget that the API would also need to account for libvirt's own memory consumption (!) and for all the helper processes.
In a former career I had a part in creating https://support.hpe.com/hpesc/public/docDisplay?docId=c02157777&docLocale=en_US - lots of hours were put into profiling characteristics, but we controlled the whole stack, which is not the case here. It also changed from release to release - very painful to keep up to date. New device types brought a need for updated calculations, but invariably we didn't always have the resources. In one instance, adding a particular device added a set amount of memory per device - way cool, until some magic number was reached and the kernel needed to allocate 1G more of memory to map something it used to track device data. We got lucky to find that, but then had to figure out how to document it. It'd be great to have specifics, but I question whether we have "all" the "resources" necessary to create the "rules" that are desired and then keep them up to date.

(In reply to Klaus Heinrich Kiwi from comment #1)
> 1) CNV should tell us what machine types and devices they are interested in.
> 2) Virt QE would need to do capacity tests on the valid permutations of
> those pieces pointed above. While doing it (and validating that they work
> and can be supported), they also profile the test and document the amount of
> memory used

"Tests on the valid permutations" is really glossing over a massive amount of work here. RHEL already ships a finite number of machine types and devices. CNV uses a subset of these. Even with this subset, the number of possible permutations is going to be incredibly large. Add to that the number of possible *configurations* of these permutations, along with possible guest workloads, and you have effectively an infinite number of possibilities.

IOW, we can't test our way to a solution for determining "maximum memory usage" unless we're expecting the tested outcome to be "unlimited memory usage", which is what libvirt decided it was.