Bug 2074000
Summary: Make memory preallocation threads NUMA aware
Product: Red Hat Enterprise Linux 9
Reporter: Michal Privoznik <mprivozn>
Component: qemu-kvm
Assignee: David Hildenbrand <dhildenb>
qemu-kvm sub component: Devices
QA Contact: Mario Casquero <mcasquer>
Status: CLOSED ERRATA
Severity: low
Priority: low
CC: aquini, chayang, chuhu, coli, cswanson, ddutile, dhildenb, djdumas, jdenemar, jinzhao, jmario, jsuchane, juzhang, lcong, lhuang, llong, lmen, lwoodman, mcasquer, mm-maint, mprivozn, mrezanin, nilal, nkoenig, npache, rsibley, virt-maint, xuzhang, yalzhang, yfu, yuhuang
Version: 9.1
Keywords: RFE, Triaged
Target Milestone: rc
Target Release: 9.2
Hardware: All
OS: Linux
Fixed In Version: qemu-kvm-7.2.0-1.el9
Clone Of: 2064194
Last Closed: 2023-05-09 07:19:33 UTC
Type: Feature Request
Bug Depends On: 2135806
Bug Blocks: 1788991
Description (Michal Privoznik, 2022-04-11 10:12:05 UTC):
Created attachment 1871968 [details]
0001-NUMA-aware-mem-prealloc.patch

Alright, I just realized (while testing my patches on top of a QEMU that implements this) that libvirt by default starts QEMU with:

    -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny

and 'resourcecontrol=deny' means that syscalls which control resources will always fail; libseccomp makes sure of that. And indeed, that is what I am seeing: sched_setaffinity() fails with EPERM, because that is what QEMU told libseccomp to return.

While libvirt could remove resourcecontrol=deny, that would open a Pandora's box, because there are other syscalls in the group (https://gitlab.com/qemu-project/qemu/-/blob/master/softmmu/qemu-seccomp.c#L257): setpriority, sched_setparam, sched_setscheduler, sched_setscheduler_arg, sched_setaffinity. Honestly, I have no idea whether setaffinity could be dropped from the list; I can see how a compromised guest could use setaffinity to move QEMU around and slow down other tasks running on the host.

Meanwhile, let me attach my patch for anybody else to pick up.

David Hildenbrand (comment #3):

Of course, if we have a limited number of allocation threads (e.g., 16), we could simply set them up during QEMU startup and reuse them on actual preallocation.

libvirt could move them around as it pleases.

This would work as long as we only have one preallocation that intends to use multiple threads happening at the same time. IIRC, all threads (including the single thread that does preallocation for virtio-mem) are currently serialized using the BQL. While this might change in the future for virtio-mem, it's what should hold right now.

It's sub-optimal, though. It would be better if QEMU would just accept a list of physical CPUs to use for preallocation, and libvirt could allow QEMU to set the affinity itself to these physical CPUs (but not more).

Michal Privoznik (comment #4):

(In reply to David Hildenbrand from comment #3)
> Of course, if we have a limited number of allocation threads (e.g., 16), we
> could simply set them up during QEMU startup and reuse them on actual
> preallocation.
>
> libvirt could move them around as it pleases.
>
> This would work as long as we only have one preallocation that intends to
> use multiple threads happening at the same time. IIRC, all threads
> (including the single thread that does preallocation for virtio-mem) are
> currently serialized using the BQL. While this might change in the future
> for virtio-mem, it's what should hold right now.
>
> It's sub-optimal, though. It would be better if QEMU would just accept a
> list of physical CPUs to use for preallocation, and libvirt could allow
> QEMU to set the affinity itself to these physical CPUs (but not more).

I don't think that's how it works. We have preallocation for every memory backend on its own. For instance, when there are two memory-backend-file objects, each with preallocation threads set to, say, 16, the first 16 threads are created to serve the first object, and only after they have finished is the next set of 16 threads created to serve the other object.

And don't forget about memory hotplug, where preallocation happens too. Speaking of which, I wonder what's going to happen when a large amount of memory is hotplugged into a running guest: I suspect that the monitor is going to be unresponsive for a long time.
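To make the scenario above concrete, here is a hedged sketch of the kind of command line involved. Only the -sandbox line is taken verbatim from the comment; the backend ids, hugepage path, sizes and thread counts are invented for illustration, and prealloc-threads is assumed to be the backend property that sets the number of preallocation threads.

    # (remaining -machine/-m/-numa options omitted)
    qemu-system-x86_64 \
      -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
      -object memory-backend-file,id=m0,mem-path=/dev/hugepages,size=8G,prealloc=on,prealloc-threads=16 \
      -object memory-backend-file,id=m1,mem-path=/dev/hugepages,size=8G,prealloc=on,prealloc-threads=16

    # With resourcecontrol=deny in place, any sched_setaffinity() call made by the
    # preallocation threads fails with EPERM, and the 16 threads serving m0 run to
    # completion before the 16 threads serving m1 are even created.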
David Hildenbrand:

(In reply to Michal Privoznik from comment #4)
> (In reply to David Hildenbrand from comment #3)
> > Of course, if we have a limited number of allocation threads (e.g., 16),
> > we could simply set them up during QEMU startup and reuse them on actual
> > preallocation.
> >
> > libvirt could move them around as it pleases.
> >
> > This would work as long as we only have one preallocation that intends to
> > use multiple threads happening at the same time. IIRC, all threads
> > (including the single thread that does preallocation for virtio-mem) are
> > currently serialized using the BQL. While this might change in the future
> > for virtio-mem, it's what should hold right now.
> >
> > It's sub-optimal, though. It would be better if QEMU would just accept a
> > list of physical CPUs to use for preallocation, and libvirt could allow
> > QEMU to set the affinity itself to these physical CPUs (but not more).
>
> I don't think that's how it works.

As I said, "we could simply set them up during QEMU startup and reuse them on actual preallocation", meaning QEMU code would have to be changed.

I'm not particularly happy about either approach; it feels like we're working around a limitation imposed by upper management. That said, I wonder what performance gain we can actually obtain via NUMA awareness and whether it is even worth the trouble here. Meaning: what performance boost in preallocation can we obtain? It might be fairly easy to measure with your prototype patch, leaving libvirt out of the picture and driving it via QEMU directly.

> Speaking of which, I wonder what's going to happen when a large amount of
> memory is hotplugged into a running guest: I suspect that the monitor is
> going to be unresponsive for a long time.

I recall that that's how it works. At least we only return from object-add after preallocation is over and the new memory backend is ready to be used.

David - since you seem to already be thinking about this, I'll assign it to you directly, mainly because with the ITR set there's a need for it to be assigned to someone...

Joe Mario (comment #12):

Michal and David:
I was curious how much benefit there would be if the page-zeroing was NUMA-aware.

To test it, I wrote a test that mmaps the specified number of reserved 1 GB hugepages and then touches each of them. The pages can be touched by threads that are:
1) pinned to one NUMA node;
2) allowed to float across all NUMA nodes;
3) equally distributed and pinned across all NUMA nodes.

I ran this on various systems. Nils ran it on his SAP HANA system, and our hardware partner ran it on their 32-socket, 47 TB system.

The results show:
- On the 4-socket systems, the speedup from NUMA-aware threads was in the 25% range.
- On the 32-socket system, NUMA-aware threads were 2.3 times faster.

During my tests I ran Intel's PCM tool to watch the memory I/O bandwidth to all the NUMA nodes. When the threads were allowed to float, they would often bunch up on a given NUMA node, whose memory controller bandwidth then hit its maximum, and there was a lot of cross-NUMA activity (threads on one node zeroing pages located on a remote NUMA node), with noticeable run-to-run variation. When the threads were equally distributed and pinned across all NUMA nodes, the bandwidth to memory was equal on all nodes.

The test program, an example run script, and my test results are attached to this BZ.

In summary, there is upside to adding NUMA awareness.
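For readers who want to get a feel for the three placement modes without the attached program, here is a rough shell sketch. The binary name ./touch_hugepages and its --pages flag are hypothetical stand-ins for Joe's attached test (the real program and run script are in the BZ attachments), and a 4-node host with 64 reserved 1 GB hugepages is assumed.

    # 1) all touching threads pinned to a single NUMA node
    time numactl --cpunodebind=0 ./touch_hugepages --pages 64

    # 2) threads left free to float across all NUMA nodes
    time ./touch_hugepages --pages 64

    # 3) the work split evenly, one pinned worker per NUMA node
    time bash -c 'for n in 0 1 2 3; do
        numactl --cpunodebind=$n ./touch_hugepages --pages 16 &
    done; wait'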
Michal Privoznik:

Yeah, I've run a QEMU that I patched so that it sets the affinity of the allocator threads to the NUMA nodes where the memory is being allocated, and I saw similar speedups. What stopped me from sending the patch is the discussion on the corresponding libvirt bug 2064194#c61 (for instance). Anyway, that discussion should be happening on qemu-devel or libvir-list, so let me post the patch to start it.

David Hildenbrand:

(In reply to Joe Mario from comment #12)
> Michal and David:
> I was curious how much benefit there would be if the page-zeroing was
> NUMA-aware.
>
> To test it, I wrote a test that mmaps the specified number of reserved
> 1 GB hugepages and then touches each of them. The pages can be touched by
> threads that are:
> 1) pinned to one NUMA node;
> 2) allowed to float across all NUMA nodes;
> 3) equally distributed and pinned across all NUMA nodes.
>
> I ran this on various systems. Nils ran it on his SAP HANA system, and our
> hardware partner ran it on their 32-socket, 47 TB system.
>
> The results show:
> - On the 4-socket systems, the speedup from NUMA-aware threads was in the
>   25% range.
> - On the 32-socket system, NUMA-aware threads were 2.3 times faster.
>
> During my tests I ran Intel's PCM tool to watch the memory I/O bandwidth to
> all the NUMA nodes. When the threads were allowed to float, they would
> often bunch up on a given NUMA node, whose memory controller bandwidth then
> hit its maximum, and there was a lot of cross-NUMA activity (threads on one
> node zeroing pages located on a remote NUMA node), with noticeable
> run-to-run variation. When the threads were equally distributed and pinned
> across all NUMA nodes, the bandwidth to memory was equal on all nodes.
>
> The test program, an example run script, and my test results are attached
> to this BZ.
>
> In summary, there is upside to adding NUMA awareness.

Thanks for the results, very helpful! I'll start thinking about possible alternatives to let QEMU and libvirt get along when it comes to preallocation thread placement.

David Hildenbrand:

QEMU proposal sent upstream ~1-2 weeks ago: "[PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext" https://lkml.kernel.org/r/20220721120732.118133-1-david@redhat.com

David Hildenbrand (comment #27):

Patches were merged into QEMU upstream and will be part of QEMU 7.2.0:

bd77c30df9 vl: Allow ThreadContext objects to be created before the sandbox option
e681645862 hostmem: Allow for specifying a ThreadContext for preallocation
e04a34e55c util: Make qemu_prealloc_mem() optionally consume a ThreadContext
10218ae6d0 util: Add write-only "node-affinity" property for ThreadContext
e2de2c497e util: Introduce ThreadContext user-creatable object
7730f32c28 util: Introduce qemu_thread_set_affinity() and qemu_thread_get_affinity()
6556aadc18 util: Cleanup and rename os_mem_prealloc()

IIRC, this should automatically go into RHEL 9.2/CS9 via the rebase to QEMU 7.2.0. @mrezanin, is that correct?
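A minimal sketch of how the merged ThreadContext feature might be wired up, for orientation only. The thread-context object and its node-affinity property come from the commit subjects above; the prealloc-context backend property name, the placement of the object before -sandbox, and all ids, paths and sizes are assumptions, so check the QEMU 7.2.0 documentation before relying on them.

    # Create the thread context before -sandbox (see the "before the sandbox option"
    # commit), bind it to host NUMA node 0, and let the backend's preallocation
    # threads inherit that affinity via prealloc-context.
    # (remaining -machine/-m/-smp/guest NUMA wiring omitted)
    qemu-system-x86_64 \
      -object thread-context,id=tc0,node-affinity=0 \
      -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
      -object memory-backend-file,id=m0,mem-path=/dev/hugepages,size=16G,prealloc=on,prealloc-threads=16,prealloc-context=tc0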
mrezanin:

(In reply to David Hildenbrand from comment #27)
> Patches were merged into QEMU upstream and will be part of QEMU 7.2.0:
>
> bd77c30df9 vl: Allow ThreadContext objects to be created before the sandbox option
> e681645862 hostmem: Allow for specifying a ThreadContext for preallocation
> e04a34e55c util: Make qemu_prealloc_mem() optionally consume a ThreadContext
> 10218ae6d0 util: Add write-only "node-affinity" property for ThreadContext
> e2de2c497e util: Introduce ThreadContext user-creatable object
> 7730f32c28 util: Introduce qemu_thread_set_affinity() and qemu_thread_get_affinity()
> 6556aadc18 util: Cleanup and rename os_mem_prealloc()
>
> IIRC, this should automatically go into RHEL 9.2/CS9 via the rebase to QEMU
> 7.2.0. @mrezanin, is that correct?

Yes, we will rebase to 7.2.0 for 9.2, so any upstream patches will come in via the rebase.

Since the rebase BZ has already moved to MODIFIED, this BZ was moved directly from ASSIGNED to MODIFIED. John, Mirek, David, and I discussed this BZ and made these changes; please feel free to change anything in case we missed something. Thanks.

QE bot (pre-verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Please provide qa_ack+ to get release+.

Removing needinfo since qa_ack+ has been provided.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: qemu-kvm security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2162