Bug 2232120
| Summary: | Estimate typical and maximum memory usage based on domxml | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Dan Kenigsberg <danken> |
| Component: | libvirt | Assignee: | Jaroslav Suchanek <jsuchane> |
| libvirt sub component: | CLI & API | QA Contact: | liang cong <lcong> |
| Status: | NEW --- | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | ailan, berrange, kkiwi, lmen, lpivarc, mprivozn, pbonzini, rjones, virt-maint |
| Version: | unspecified | Keywords: | Triaged |
| Target Milestone: | rc | Flags: | kkiwi: needinfo? (danken), kkiwi: needinfo? (pbonzini) |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Dan Kenigsberg
2023-08-15 13:00:35 UTC
FYI, this has been a long running topic of discussion. My understanding is that there are several challenges to accomplishing this the way Dan suggested (i.e., with a Libvirt API):

* Even if Libvirt could theoretically do the math (well, CNV could too), Libvirt simply doesn't own the information of how much memory each type of device + machine type is "theoretically" using.
* It gets even worse if we consider QEMU/libvirt dependencies - i.e., a libc/gnutls/gcc/libiscsi change can create additional memory requirements. Even RUNTIME configuration, like the availability of caches, etc., can influence how much memory pressure something has in practice.
* Getting that information from QEMU would require a new QAPI, but...
* ...QEMU would have to maintain that "static table" somewhere, and there is no current process for reaching that determination across platforms / build configurations and maintaining it.

So during our last 1x1, Amnon and I discussed an interesting alternative: what if, instead of trying to figure out the theoretical maximum envelope through a process / API, we made that determination empirically during downstream productization, for the relatively few scenarios / permutations that CNV is interested in? I.e., QE can do a scale test and reach a determination of the max config for each machine type we care about. Theoretically, this needs to be done somewhere anyway to ensure we can support customers in these configurations.

My proposal is a partnership between Virt QE and CNV QE:

1) CNV should tell us what machine types and devices they are interested in.
2) Virt QE would need to do capacity tests on the valid permutations of those pieces pointed above. While doing it (and validating that they work and can be supported), they also profile the test and document the amount of memory used.
3) CNV QE takes that information, validates additional things like upgrade paths (e.g., upgrading CNV from RHEL 9.0 to RHEL 9.2 may have different memory requirements / pressures, requiring any hard limiting of resources to be adapted), and builds a table.
4) We document all that in KBs and official documentation, but CNV can optionally use that data to provide hints to customers about sensible limits for their chosen VMs - at creation, hot-plugging/unplugging, migration and upgrade paths.

I also suggest that this is fundamentally the same solution for other limit / max config scenarios as well, such as https://issues.redhat.com/browse/CNV-31831

Your thoughts, Dan Ken, Michal, Daniel Berrange, Paolo Bonzini, etc.?

First and foremost, it is impossible to tell how much memory a program is going to need for a given input (Turing machines, the undecidability problem, and such). Now, it is true that the statement is about a "general" program and QEMU is not a general program. But on the other hand, we do not know the input. I mean, memory consumption can vary, and indeed will. What Klaus suggests might work, but only to some extent. Our QEs surely won't run VMs the same way our customers do (active use of services inside the VM, say a DB server). Therefore, any estimate we come up with will need some adjusting, at which point it's as good as an estimate pulled out of thin air.
Libvirt has fought/is fighting this problem on two fronts:
1) it used to calculate hard limit (the peak memory QEMU is allowed to use), and
2) it is trying to estimate the memlock limit.
These two are fundamentally different, though. While any wrong estimate in 1) means the OOM killer kills QEMU (because the hard limit is set in cgroups), a (slightly) wrong estimate in 2) means the guest can still run.
We used to do 1), but because it was impossible to get right, we stopped doing that and documented the following:
hard_limit
The optional hard_limit element is the maximum memory the guest can use. The units for this value are kibibytes (i.e. blocks of 1024 bytes). Users of QEMU and KVM are strongly advised not to set this limit as domain may get killed by the kernel if the guess is too low, and determining the memory needed for a process to run is an undecidable problem.
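For reference, this is where the hard limit from 1) lives in the domain XML; the snippet is only illustrative and the value is just a placeholder:

```xml
<memtune>
  <!-- strongly discouraged: if the guess is too low, QEMU gets OOM-killed -->
  <hard_limit unit='KiB'>4718592</hard_limit>
</memtune>
```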
Now, for 2) we still keep up the fight, although we take many shortcuts (because we can overshoot the actual limit without the kernel killing QEMU). The code lives here (function qemuDomainGetMemLockLimitBytes()):
https://gitlab.com/libvirt/libvirt/-/blob/master/src/qemu/qemu_domain.c?ref_type=heads#L9661
and KubeVirt can take inspiration from it.
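For illustration only, here is a minimal sketch in Go (since the consumer here would be KubeVirt) of the kind of deliberately pessimistic heuristic such a consumer could maintain on its own side. The type, field names and constants are assumptions made up for this example; they are not taken from qemuDomainGetMemLockLimitBytes().

```go
package main

import "fmt"

// domainSpec is a stand-in for the few domain properties the estimate
// below looks at; a real consumer would read these from its own VM spec.
type domainSpec struct {
	guestMemoryKiB uint64 // current <memory> of the guest, in KiB
	hasVFIODevices bool   // any host device passthrough present
	vcpuCount      uint64
}

// estimateMemLockKiB returns a deliberately generous upper bound for the
// memlock limit: overshooting is acceptable (the guest still runs), while
// undershooting can make device assignment fail.
func estimateMemLockKiB(d domainSpec) uint64 {
	limit := d.guestMemoryKiB
	if d.hasVFIODevices {
		// Assume all guest RAM may get pinned for DMA; the extra
		// 1 GiB of slack for bookkeeping is an assumed safety margin.
		limit += 1024 * 1024
	}
	// Assumed fixed per-vCPU overhead for emulator threads and queues.
	limit += d.vcpuCount * 8 * 1024
	return limit
}

func main() {
	vm := domainSpec{guestMemoryKiB: 4 * 1024 * 1024, hasVFIODevices: true, vcpuCount: 4}
	fmt.Printf("suggested memlock limit: %d KiB\n", estimateMemLockKiB(vm))
}
```

The point of keeping such a heuristic in the consumer is exactly the one made below: when the guess turns out wrong, the fix ships with the consumer instead of waiting on a libvirt update.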
Another reason why this should not be in libvirt: if the API returns a wrong value, then a roundtrip between fixing it in libvirt and KubeVirt/CNV picking up the fix is needed, which is needlessly long and leaves customers unable to run their machines in the meantime. Compare this to the situation where the guessing algorithm lives in KubeVirt and the fix is distributed just there.
And a final thought - from an upstream POV, QEMU can be compiled with or without support for a lot of features, which affects which libraries are then linked into the QEMU binary. Accounting for all of that would lead to spaghetti code. And don't forget that the API would also need to account for libvirt's own memory consumption (!) and for all the helper processes.
In a former career I had a part in creating https://support.hpe.com/hpesc/public/docDisplay?docId=c02157777&docLocale=en_US - lots of hours were put into profiling characteristics, but we controlled the whole stack, which is not the case here. It also changed from release to release - very painful to keep up to date. New device types brought a need for updated calculations, but invariably we didn't always have the resources. In one instance, adding a particular device added a set amount of memory per device - way cool, until some magic number was reached and the kernel needed to allocate 1G more of memory to map something it used to track device data. We got lucky to find that, but then had to figure out how to document it. It'd be great to have specifics, but I question whether we have "all" the "resources" necessary to create the "rules" that are desired and then keep them up to date.

(In reply to Klaus Heinrich Kiwi from comment #1)
> 1) CNV should tell us what machine types and devices they are interested in.
> 2) Virt QE would need to do capacity tests on the valid permutations of
> those pieces pointed above. While doing it (and validating that they work
> and can be supported), they also profile the test and document the amount of
> memory used

"Tests on the valid permutations" is really glossing over a massive amount of work here. RHEL already ships a finite number of machine types and devices. CNV uses a subset of these. Even with this subset, the number of possible permutations is going to be incredibly large. Add to that the number of possible *configurations* of these permutations, along with possible guest workloads, and you have effectively an infinite number of possibilities.

IOW, we can't test our way to a solution for determining "maximum memory usage" unless we're expecting the tested outcome to be "unlimited memory usage", which is what libvirt decided it was.