Bug 1377735

Summary: [Feature Request] Add ability to set qcow2 l2-cache-size
Product: [Community] Virtualization Tools
Reporter: Frank Myhr <fmyhr>
Component: libvirt
Assignee: Libvirt Maintainers <libvirt-maint>
Severity: unspecified
Priority: unspecified
Version: unspecified
CC: akarlsso, berrange, berto, coli, greg, iloginovskikh, jsuchane, kchamart, libvirt-maint, m40636067, piyush.shivam, sct, yasushi.shoji
Hardware: Unspecified
OS: Unspecified
Fixed In Version: libvirt-7.0.0
Last Closed: 2021-05-25 13:10:43 UTC
Type: Bug

Description Frank Myhr 2016-09-20 13:45:08 UTC
QEMU's default qcow2 L2 cache size is too small for large images (and small cluster sizes), resulting in very bad performance.

Benchmarks from Alberto Garcia's blog post show a huge performance hit for a 20GB qcow2 with the default 64kB cluster size:

L2 Cache, MiB   Average IOPS
1 (default)             5100
1.5                     7300
2                      12700
2.5                    63600

The same post also gives the formula:
optimal L2 cache size = L2 table size = (8 bytes) × (disk size) / (cluster size)

and the QEMU command line for setting L2 cache size, which must be specified at each invocation, for example:
qemu-system-x86_64 -drive file=hd.qcow2,l2-cache-size=2621440
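As a sanity check, the formula can be evaluated directly; this small helper (hypothetical, not QEMU code) reproduces the 2621440-byte figure used on the command line above:

```python
def optimal_l2_cache_bytes(disk_size, cluster_size):
    """Optimal qcow2 L2 cache: 8 bytes of cache per cluster in the image."""
    return 8 * disk_size // cluster_size

# 20 GiB image with the default 64 KiB cluster size
print(optimal_l2_cache_bytes(20 * 2**30, 64 * 2**10))  # 2621440 (2.5 MiB)
```

The same calculation gives 128 MiB for a 1 TiB image, matching the RAM-footprint concern raised later in this thread.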

It would be great if libvirt allowed specifying the qcow2 l2-cache-size. It is apparently easy to add this option; I opened this bug in case I or someone else wants to do it.


Comment 1 Daniel Berrangé 2016-09-20 14:25:25 UTC
IMHO if the default cache size is so bad, then we should fix the defaults in QEMU so every app gets improved performance out of the box without having to be modified to contain the same formula to override the default.

Comment 2 Alberto Garcia 2016-09-21 10:21:58 UTC

I don't think the default size is necessarily so bad, it just depends
on the size of the disk image and the usage pattern.

The numbers on my blog that Frank pasted on this bug report represent
a worst case scenario (pure random I/O).

There are several possible default configurations for the L2 cache and
all of them have drawbacks:

a) Have a small(ish) size, like now (1 MB per disk image).
   + Pros: low memory footprint, good enough for many common scenarios.
   + Cons: bad (or very bad) performance with images larger than 8GB
           and lots of I/O.

b) Have the maximum possible cache size (1MB per 8GB of disk image if
   using the default cluster size).
   + Pros: good performance in all cases, no need to worry about it.
   + Cons: it can be very wasteful of RAM. A 1TB disk image takes
           128MB of RAM for the L2 cache alone. If there are more disk
           drives (or backing images) the problem gets worse. In most
           cases you're not going to perform random I/O on the whole
           disk, so you don't need such a big cache.

c) Have a large cache size (like in b), and remove unused entries
   periodically (cache-clean-interval setting).
   + Pros: it provides the best of both worlds, you'll get good
           performance and the unused memory will be returned to the
           system.
   + Cons: you can still have peaks of RAM usage. We'd still need to
           decide what's the best length for the cache cleaning
           interval. The memory footprint of the VM becomes more
           volatile and difficult to control.

I'm not sure if there's a good default that suits all use cases.
Allowing the user to configure the L2 cache seems like a good idea to
me.

Comment 3 Frank Myhr 2016-09-22 15:35:38 UTC
I'm using qcow2 images of modest size (5 - 100 GB) without backing files. I much prefer spending 13MB RAM per 100GB image to eliminate L2 cache misses as a source of performance problems.

Someone using multiple 1TB images with backing files in a RAM-constrained system is a very different scenario indeed.

In order to accommodate this wide range of use cases, it appears that the l2-cache-size parameter is in fact needed. I'll propose what will probably be the least-popular option: modify both QEMU and libvirt as follows:

1) Add ability to parse l2-cache-size “%” suffix:
l2-cache-size=100% means l2-cache-size= (8 Byte) * (disk size) / (cluster size)

2) Set default l2-cache-size=100%

3) QEMU accepts l2-cache-size=[0-500(?)%] and rounds up to the nearest size in bytes that is a multiple of the cluster size.

4) QEMU continues to accept l2-cache-size with no suffix (-> bytes) or “M” suffix, and in this case sets a constant L2 cache size independent of image size and cluster size, same as it does now.

* Modify libvirt to accept an l2-cache-size disk parameter and pass it to QEMU's -drive option

+ Everyone gets the optimum (smallest maximum-performance) L2 cache size by default.
+ QEMU itself does calculation of optimum L2 cache size rather than relying on user to get it right.
+ Easy for the user to override the default L2 cache size, using % or absolute bytes.

- Requires modifications to both QEMU & libvirt
- “%” character is problematic in XML; maybe use “p” instead?
- Proposed default l2-cache-size is changed from current behavior.
- l2-cache-size parameter is specific to qcow2 with QEMU, not expected to apply to other libvirt virtualization backends.
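A rough sketch of the proposed parsing rules; all names here are hypothetical and nothing in this snippet is existing QEMU or libvirt code:

```python
def parse_l2_cache_size(value, disk_size, cluster_size):
    """Proposed l2-cache-size syntax: a '%' (or 'p') suffix is relative
    to the optimal size; plain bytes or an 'M' suffix keep current behavior."""
    optimal = 8 * disk_size // cluster_size  # bytes needed to cover the image
    if value.endswith(("%", "p")):
        pct = float(value[:-1])
        if not 0 <= pct <= 500:
            raise ValueError("percentage out of range")
        size = int(optimal * pct / 100)
        # round up to the nearest multiple of the cluster size
        return -(-size // cluster_size) * cluster_size
    if value.endswith("M"):
        return int(value[:-1]) * 2**20
    return int(value)  # plain bytes, constant size as today
```

For the 20GB / 64kB example above, "100%" would yield 2621440 bytes, the same value passed explicitly on the command line.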

Comment 4 Frank Myhr 2016-10-04 18:28:30 UTC
An entirely different way of allowing users to tune L2 cache size would be to modify the qcow2 format by adding a suitable header extension:
http://git.qemu.org/?p=qemu.git;a=blob;f=docs/specs/qcow2.txt
qemu-img would set the L2 cache size at qcow2 creation / modification time. Then QEMU (and libvirt) would simply read the L2 cache size from the qcow2 and do the right thing, without requiring any additional parameters.

This would be roughly analogous to different physical hard drives having different cache sizes.

Comment 5 Alberto Garcia 2016-10-05 14:41:01 UTC
(In reply to Frank Myhr from comment #4)
> An entirely different way of allowing users to tune L2 cache size would be
> to modify the qcow2 format by adding a suitable header extension:
> http://git.qemu.org/?p=qemu.git;a=blob;f=docs/specs/qcow2.txt

I'm not sure if that's worth the effort... I would be more in favor of
a '%' suffix like you propose, or some other way to achieve the same
effect.

In case you're interested there's right now a patch (and a debate) on
the QEMU mailing list about this very feature:


Comment 6 Sirius Rayner-Karlsson 2017-11-21 16:29:21 UTC
Seems this one is back, https://www.redhat.com/archives/libvir-list/2017-September/msg00996.html, and there is a later attempt as well, https://www.redhat.com/archives/libvir-list/2017-November/msg00536.html which points back at the thread from September.

There is renewed interest in this from NFV/Telco side as well via a partner I co-manage.

Once this lands upstream, and as qemu-kvm-rhev supports the setting, what are the odds of a backport in to the RHEL libvirt (which is version 3.2 based)?

Kind regards,


Comment 7 Alberto Garcia 2018-08-06 08:07:07 UTC
A quick update: we have resumed this discussion on the QEMU mailing list.

So it seems that there are two important things that should be taken into account:

a) Users would like to be able to configure the L2 cache size, and in
   particular they would like to maximize the I/O performance without
   having to calculate the cache size manually for each image.

b) Users would also like to be able to prevent the cache from being
   too large. So while it's good to have a way to say "I want to
   maximize the I/O performance" we don't necessarily want to do that
   in a very large image if that means that we need, say, half a
   gigabyte just for the L2 cache.

So the idea is to make the existing l2-cache-size option work as a
hard maximum ("whatever happens, never allocate more than this for an
image"), and at the same time guarantee that QEMU will never allocate
more than what an image can use ("if a 2MB cache is enough for the
whole image, never allocate more than that even if l2-cache-size is
larger").
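The proposed semantics can be sketched as follows (a hypothetical helper, not actual QEMU code):

```python
def effective_l2_cache_bytes(disk_size, cluster_size, l2_cache_size):
    """l2-cache-size acts as a hard maximum; QEMU never allocates more
    cache than the image can actually use."""
    max_useful = 8 * disk_size // cluster_size  # enough to cover the whole image
    return min(l2_cache_size, max_useful)

# a 16 GiB image only ever needs 2 MiB of L2 cache,
# so even a 32 MiB l2-cache-size limit is never exhausted
```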

Comment 8 Brian-W 2019-09-08 19:32:05 UTC
I fully understand you want to make the best solution when implementing this, but it seems, as always, aiming for the "perfect" solution can be a difficult path.
Couldn't you consider implementing a "simple" option to specify the l2-cache, and otherwise keep things as they are?
This setting has such a huge impact on performance! It should have been possible to set this property in the XML file 3-4 years ago.
It's better to have the option, even though it's not "perfect", than not having it at all, as is the case now (I assume).

Comment 9 Alberto Garcia 2019-09-09 12:49:34 UTC
I don't know what the status of this is in libvirt, but QEMU at the moment (since v3.1.0) defaults to 32MB for the L2 cache. That's enough for a 256GB image with 64KB clusters.
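The 256GB figure follows from inverting the earlier formula, i.e. computing the largest image a given cache fully covers:

```python
def max_covered_disk_bytes(cache_size, cluster_size):
    # inverse of: cache = 8 * disk / cluster
    return cache_size * cluster_size // 8

print(max_covered_disk_bytes(32 * 2**20, 64 * 2**10) // 2**30)  # 256 (GiB)
```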

Comment 10 Gregory P. Smith 2021-02-28 21:25:05 UTC
It also sounds like as of the (very recent) QEMU 5.2, the new subcluster allocation feature can be enabled when creating qcow2 images making them require 16x less L2 cache space.  https://blogs.igalia.com/berto/2020/12/03/subcluster-allocation-for-qcow2-images/
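A back-of-the-envelope check of the 16x figure, under the assumptions from the linked post: extended L2 entries are 16 bytes instead of 8, and subcluster allocation lets you raise the cluster size 32x (e.g. 64 KiB to 2 MiB) while keeping the same allocation granularity:

```python
def l2_cache_bytes(disk_size, cluster_size, entry_bytes=8):
    # bytes of L2 cache needed to cover the whole image
    return entry_bytes * disk_size // cluster_size

standard = l2_cache_bytes(2**40, 64 * 2**10)                 # 1 TiB, 64 KiB clusters: 128 MiB
extended = l2_cache_bytes(2**40, 2 * 2**20, entry_bytes=16)  # 32x larger clusters: 8 MiB
print(standard // extended)  # 16
```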

Comment 11 Jaroslav Suchanek 2021-05-25 13:10:43 UTC
The feature was implemented upstream and released in libvirt-7.0.0.

commit dc837a412f67b709373247003a07e4b387cec1b8
Author: Peter Krempa <pkrempa>
Date:   Wed Jan 6 18:20:29 2021 +0100

    qemu: Implement '<metadata_cache><max_size>' control for qcow2
    qemu's qcow2 driver allows control of the metadata cache of qcow2 driver
    by the 'cache-size' property. Wire it up to the recently introduced
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit 06380cb587ca61d321459c46664f9aec6e14c8be
Author: Peter Krempa <pkrempa>
Date:   Thu Jan 7 15:30:21 2021 +0100

    conf: snapshot: Add support for <metadata_cache>
    Similarly to the domain config code it may be beneficial to control the
    cache size of images introduced as snapshots into the backing chain.
    Wire up handling of the 'metadata_cache' element.
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit 154df5840d800661a6988ccba59facd28ac06599
Author: Peter Krempa <pkrempa>
Date:   Wed Jan 6 18:20:22 2021 +0100

    conf: Introduce <metadata_cache> subelement of <disk><driver>
    In certain specific cases it might be beneficial to be able to control
    the metadata caching of storage image format drivers of a hypervisor.
    Introduce XML machinery to set the maximum size of the metadata cache
    which will be used by qemu's qcow2 driver.
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit a01726e9cf426e8cbe553139c3cee888de63c1f2
Author: Peter Krempa <pkrempa>
Date:   Thu Jan 7 15:03:57 2021 +0100

    virDomainSnapshotDiskDefFormat: Use virXMLFormatElement
    Refactor the code to use modern XML formatting approach.
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit de69f963652bb10d5e1a56d5bc702f25868e045e
Author: Peter Krempa <pkrempa>
Date:   Wed Jan 6 22:34:57 2021 +0100

    virDomainDiskDefFormatDriver: Rename 'driverBuf' to 'attrBuf'
    Unify the code with other places using virXMLFormatElement.
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>
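Per the commits above, the resulting domain XML (as of libvirt 7.0.0) looks roughly like the fragment below; the file path and target device are illustrative, and the 2621440-byte value matches the 20GB example from the original report:

```xml
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'>
    <metadata_cache>
      <max_size unit='bytes'>2621440</max_size>
    </metadata_cache>
  </driver>
  <source file='/var/lib/libvirt/images/hd.qcow2'/>
  <target dev='vda' bus='virtio'/>
</disk>
```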