QEMU's default qcow2 L2 cache size is too small for large images (and small cluster sizes), resulting in very bad performance.

https://blogs.igalia.com/berto/2015/12/17/improving-disk-io-performance-in-qemu-2-5-with-the-qcow2-l2-cache/ shows a huge performance hit for a 20GB qcow2 with the default 64kB cluster size:

  L2 Cache, MiB    Average IOPS
  1 (default)       5100
  1.5               7300
  2                12700
  2.5              63600

The above link also gives the formula:

  optimal L2 cache size = L2 table size = (8 bytes) * (disk size) / (cluster size)

and the QEMU command line for setting the L2 cache size, which must be specified at each invocation, for example:

  qemu-system-x86_64 -drive file=hd.qcow2,l2-cache-size=2621440

It would be great if libvirt allowed specifying the qcow2 l2-cache-size. It is apparently easy to add this option. Opened this bug in case I or someone else wants to do it.

Ref: https://www.redhat.com/archives/libvirt-users/2016-September/msg00032.html
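The formula above can be sketched as a few lines of Python; the 20 GB / 64 kB example reproduces the 2621440-byte value used in the qemu command line:

```python
def optimal_l2_cache_size(disk_size, cluster_size=64 * 1024):
    """L2 cache (bytes) that maps the whole image: 8 bytes per cluster."""
    return 8 * disk_size // cluster_size

GiB = 1024 ** 3
# 20 GB image with the default 64 kB cluster size:
print(optimal_l2_cache_size(20 * GiB))  # 2621440 bytes = 2.5 MiB
```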
IMHO if the default cache size is so bad, then we should fix the defaults in QEMU so every app gets improved performance out of the box without having to be modified to contain the same formula to override the default.
Hey, I don't think the default size is necessarily so bad, it just depends on the size of the disk image and the usage pattern. The numbers from my blog that Frank pasted on this bug report represent a worst-case scenario (pure random I/O).

There are several possible default configurations for the L2 cache, and all of them have drawbacks:

a) Have a small(ish) size, like now (1 MB per disk image).
   + Pros: low memory footprint, good enough for many common scenarios.
   + Cons: bad (or very bad) performance with images larger than 8GB and lots of I/O.

b) Have the maximum possible cache size (1MB per 8GB of disk image if using the default cluster size).
   + Pros: good performance in all cases, no need to worry about it.
   + Cons: it can be very wasteful of RAM. A 1TB disk image takes 128MB of RAM for the L2 cache alone. If there are more disk drives (or backing images) the problem gets worse. In most cases you're not going to perform random I/O on the whole disk, so you don't need such a big cache.

c) Have a large cache size (like in b), and remove unused entries periodically (cache-clean-interval setting).
   + Pros: it provides the best of both worlds; you'll get good performance and the unused memory will be returned to the system.
   + Cons: you can still have peaks of RAM usage. We'd still need to decide the best length for the cache cleaning interval. The memory footprint of the VM becomes more volatile and difficult to control.

I'm not sure there's a good default that suits all use cases. Allowing the user to configure the L2 cache seems like a good idea to me.
I'm using qcow2 images of modest size (5 - 100 GB) without backing files. I much prefer spending 13MB of RAM per 100GB image to eliminate L2 cache misses as a source of performance problems. Someone using multiple 1TB images with backing files on a RAM-constrained system is a very different scenario indeed. To accommodate this wide range of use cases, it appears that the l2-cache-size parameter is in fact needed.

I'll propose what will probably be the least popular option: modify both QEMU and libvirt as follows:

QEMU:
1) Add the ability to parse an l2-cache-size "%" suffix:
   l2-cache-size=100% means l2-cache-size = (8 bytes) * (disk size) / (cluster size)
2) Set the default to l2-cache-size=100%.
3) QEMU accepts l2-cache-size=[0-500(?)%] and rounds up to the nearest size in bytes that is a multiple of the cluster size.
4) QEMU continues to accept l2-cache-size with no suffix (-> bytes) or an "M" suffix, and in this case sets a constant L2 cache size independent of image size and cluster size, same as it does now.

libvirt:
* Modify to accept a -drive l2-cache-size parameter and pass it to QEMU.

Pro:
+ Everyone gets the optimum (smallest maximum-performance) L2 cache size by default.
+ QEMU itself calculates the optimum L2 cache size rather than relying on the user to get it right.
+ Easy for the user to override the default L2 cache size, using % or absolute bytes.

Con:
- Requires modifications to both QEMU & libvirt.
- The "%" character is problematic in XML; maybe use "p" instead?
- The proposed default l2-cache-size is a change from current behavior.
- The l2-cache-size parameter is specific to qcow2 with QEMU, and not expected to apply to other libvirt virtualization backends.
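The proposed parsing could look roughly like this; note this is a hypothetical sketch of the proposal, not an existing QEMU interface (the '%' suffix and the 0-500% range are the proposal itself):

```python
def parse_l2_cache_size(value, disk_size, cluster_size=64 * 1024):
    """Parse the proposed 'N%' / 'NM' / plain-bytes forms into bytes."""
    full = 8 * disk_size // cluster_size  # cache that covers the whole image
    if value.endswith('%'):
        pct = int(value[:-1])
        if not 0 <= pct <= 500:
            raise ValueError("percentage out of range")
        size = full * pct // 100
    elif value.endswith('M'):
        size = int(value[:-1]) * 1024 * 1024
    else:
        size = int(value)
    # round up to the nearest multiple of the cluster size
    return -(-size // cluster_size) * cluster_size

GiB = 1024 ** 3
print(parse_l2_cache_size('100%', 20 * GiB))  # 2621440, same as the formula
print(parse_l2_cache_size('2M', 20 * GiB))    # 2097152, constant size
```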
An entirely different way of allowing users to tune the L2 cache size would be to modify the qcow2 format by adding a suitable header extension:

http://git.qemu.org/?p=qemu.git;a=blob;f=docs/specs/qcow2.txt

qemu-img would set the L2 cache size at qcow2 creation / modification time. Then QEMU (and libvirt) would simply read the L2 cache size from the qcow2 and do the right thing, without requiring any additional parameters. This would be roughly analogous to different physical hard drives having different cache sizes.
(In reply to Frank Myhr from comment #4)
> An entirely different way of allowing users to tune L2 cache size would be
> to modify the qcow2 format by adding a suitable header extension:
>
> http://git.qemu.org/?p=qemu.git;a=blob;f=docs/specs/qcow2.txt

I'm not sure if that's worth the effort... I would be more in favor of a '%' suffix like you propose, or some other way to achieve the same result.

In case you're interested, there's right now a patch (and a debate) on the QEMU mailing list about this very feature:

https://lists.gnu.org/archive/html/qemu-block/2016-10/msg00036.html
Seems this one is back: https://www.redhat.com/archives/libvir-list/2017-September/msg00996.html, and there is a later attempt as well, https://www.redhat.com/archives/libvir-list/2017-November/msg00536.html, which points back at the thread from September.

There is renewed interest in this from the NFV/Telco side as well, via a partner I co-manage. Once this lands upstream, and as qemu-kvm-rhev supports the setting, what are the odds of a backport into the RHEL libvirt (which is version 3.2 based)?

Kind regards,
/Anders
A quick update: we have retaken this discussion on the QEMU mailing list. It seems there are two important things that should be taken into account:

a) Users would like to be able to configure the L2 cache size, and in particular they would like to maximize the I/O performance without having to calculate the cache size manually for each image.

b) Users would also like to be able to prevent the cache from being too large. So while it's good to have a way to say "I want to maximize the I/O performance", we don't necessarily want to do that on a very large image if that means that we need, say, half a gigabyte just for the L2 cache.

So the idea is to make the existing l2-cache-size option work as a hard maximum ("whatever happens, never allocate more than this for an image"), and at the same time guarantee that QEMU will never allocate more than what an image can use ("if a 2MB cache is enough for the whole image, never allocate more than that even if l2-cache-size is larger").
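The proposed semantics reduce to taking the minimum of the configured cap and the size the image can actually use; a small sketch of that rule:

```python
def effective_l2_cache(disk_size, l2_cache_size, cluster_size=64 * 1024):
    """l2-cache-size acts as a hard cap; never allocate beyond what the image needs."""
    max_useful = 8 * disk_size // cluster_size
    return min(l2_cache_size, max_useful)

GiB, MiB = 1024 ** 3, 1024 ** 2
# A 16 GiB image only ever needs 2 MiB of L2 cache, so a 32 MiB cap is never hit:
print(effective_l2_cache(16 * GiB, 32 * MiB))  # 2097152 bytes = 2 MiB
```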
I fully understand that you want to make the best solution when implementing this, but it seems, as always, that aiming for the "perfect" solution can be a difficult path. Couldn't you consider implementing a "simple" option to specify the L2 cache, and otherwise keep things as they are? This setting has such a huge impact on performance! It should have been possible to set this property in the XML file 3-4 years ago. It's better to have the option, even if it's not "perfect", than not to have it at all, as is the case now (I assume).
I don't know what the status of this is in libvirt, but QEMU at the moment (since v3.1.0) defaults to 32MB for the L2 cache. That's enough for a 256GB image with 64KB clusters.
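The "enough for 256GB" claim follows directly from the earlier formula; a quick arithmetic check:

```python
GiB, MiB, KiB = 1024 ** 3, 1024 ** 2, 1024
# Each 8-byte L2 entry maps one 64 KiB cluster, so a 32 MiB cache maps:
covered = 32 * MiB * (64 * KiB) // 8
print(covered // GiB)  # 256
```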
It also sounds like as of the (very recent) QEMU 5.2, the new subcluster allocation feature can be enabled when creating qcow2 images making them require 16x less L2 cache space. https://blogs.igalia.com/berto/2020/12/03/subcluster-allocation-for-qcow2-images/
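The 16x figure can be checked with the formula from earlier in this bug; per the linked blog post, extended L2 entries are twice as large (16 bytes instead of 8), but subclusters make much larger clusters practical (the 2 MiB cluster size below is the example used there, an assumption rather than a fixed default):

```python
KiB, MiB, GiB = 1024, 1024 ** 2, 1024 ** 3
disk = 256 * GiB
standard = 8 * disk // (64 * KiB)   # 64 KiB clusters, 8-byte L2 entries
extended = 16 * disk // (2 * MiB)   # 2 MiB clusters, 16-byte extended entries
print(standard // extended)  # 16
```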
The feature was fixed upstream and released in libvirt-7.0.0.

commit dc837a412f67b709373247003a07e4b387cec1b8
Author: Peter Krempa <pkrempa>
Date:   Wed Jan 6 18:20:29 2021 +0100

    qemu: Implement '<metadata_cache><max_size>' control for qcow2

    qemu's qcow2 driver allows control of the metadata cache of qcow2
    driver by the 'cache-size' property. Wire it up to the recently
    introduced elements.

    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit 06380cb587ca61d321459c46664f9aec6e14c8be
Author: Peter Krempa <pkrempa>
Date:   Thu Jan 7 15:30:21 2021 +0100

    conf: snapshot: Add support for <metadata_cache>

    Similarly to the domain config code it may be beneficial to control
    the cache size of images introduced as snapshots into the backing
    chain. Wire up handling of the 'metadata_cache' element.

    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit 154df5840d800661a6988ccba59facd28ac06599
Author: Peter Krempa <pkrempa>
Date:   Wed Jan 6 18:20:22 2021 +0100

    conf: Introduce <metadata_cache> subelement of <disk><driver>

    In certain specific cases it might be beneficial to be able to control
    the metadata caching of storage image format drivers of a hypervisor.
    Introduce XML machinery to set the maximum size of the metadata cache
    which will be used by qemu's qcow2 driver.

    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit a01726e9cf426e8cbe553139c3cee888de63c1f2
Author: Peter Krempa <pkrempa>
Date:   Thu Jan 7 15:03:57 2021 +0100

    virDomainSnapshotDiskDefFormat: Use virXMLFormatElement

    Refactor the code to use modern XML formatting approach.

    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

commit de69f963652bb10d5e1a56d5bc702f25868e045e
Author: Peter Krempa <pkrempa>
Date:   Wed Jan 6 22:34:57 2021 +0100

    virDomainDiskDefFormatDriver: Rename 'driverBuf' to 'attrBuf'

    Unify the code with other places using virXMLFormatElement.

    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

v6.10.0-395-gdc837a412f
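For reference, a minimal sketch of where the new element sits in the domain XML, per the commit messages above; the 2 MiB value, the file path, and the target device are just examples, and the sketch assumes libvirt's usual size-element convention (a 'unit' attribute on the size element):

```
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'>
    <metadata_cache>
      <max_size unit='MiB'>2</max_size>
    </metadata_cache>
  </driver>
  <source file='/var/lib/libvirt/images/hd.qcow2'/>
  <target dev='vda' bus='virtio'/>
</disk>
```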