Bug 2074000 - Make memory preallocation threads NUMA aware
Summary: Make memory preallocation threads NUMA aware
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.1
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Target Release: 9.2
Assignee: David Hildenbrand
QA Contact: Mario Casquero
URL:
Whiteboard:
Depends On: 2135806
Blocks: 1788991
 
Reported: 2022-04-11 10:12 UTC by Michal Privoznik
Modified: 2024-01-31 06:42 UTC
CC List: 31 users

Fixed In Version: qemu-kvm-7.2.0-1.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2064194
Environment:
Last Closed: 2023-05-09 07:19:33 UTC
Type: Feature Request
Target Upstream Version:
Embargoed:


Attachments
0001-NUMA-aware-mem-prealloc.patch (7.77 KB, patch)
2022-04-12 13:58 UTC, Michal Privoznik
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-118525 0 None None None 2022-04-11 10:30:51 UTC
Red Hat Product Errata RHSA-2023:2162 0 None None None 2023-05-09 07:20:28 UTC

Description Michal Privoznik 2022-04-11 10:12:05 UTC
+++ This bug was initially created as a clone of Bug #2064194 +++

Libvirt now allows specifying the number of threads used to allocate memory:

b8d6ecc70c qemu_command: Generate prealloc-threads property
75a4e0165e qemu_validate: Validate prealloc threads against qemuCpas
a30dac15dc qemu_capabilities: Detect memory-backend-*.prealloc-threads property
ba7f98126f conf: Introduce memory allocation threads

v8.1.0-212-gb8d6ecc70c

These commits result in libvirt using the prealloc-threads property of memory-backend-* objects. However, these threads are not introspectable, so no management application (like libvirt) can move them onto the appropriate NUMA nodes. What QEMU can do, however, is move them onto the nodes specified in the host-nodes attribute.
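
For illustration, the generated backend ends up looking roughly like this on the QEMU command line (values invented for this example; prealloc-threads, host-nodes and policy are existing memory-backend-file properties):

  -object memory-backend-file,id=mem0,size=16G,mem-path=/dev/hugepages,prealloc=on,prealloc-threads=16,host-nodes=0,policy=bind

The threads spawned to preallocate this backend are the ones that should ideally run on the CPUs of host node 0.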

Comment 2 Michal Privoznik 2022-04-12 13:58:27 UTC
Created attachment 1871968 [details]
0001-NUMA-aware-mem-prealloc.patch

Alright, I just realized (as I was testing my patches that implement this on top of QEMU) that libvirt by default starts qemu with:

  -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny

and this 'resourcecontrol=deny' means that syscalls which control resources will always fail; libseccomp makes sure of that. And indeed, that's what I am seeing: sched_setaffinity() fails (with EPERM, because that's what QEMU told libseccomp to return). And while libvirt could remove resourcecontrol=deny, that would open a Pandora's box, because there are other syscalls in the group:

https://gitlab.com/qemu-project/qemu/-/blob/master/softmmu/qemu-seccomp.c#L257

setpriority, sched_setparam, sched_setscheduler, sched_setscheduler_arg, sched_setaffinity.

Honestly, I have no idea whether setaffinity could be dropped from the list. I can see how an infected guest could use setaffinity to move QEMU around and slow down other tasks running on the host.
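
For reference, relaxing only that knob would mean starting QEMU with resourcecontrol=allow instead, e.g.:

  -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=allow

but that re-allows the whole syscall group above, not just sched_setaffinity().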

Meanwhile, let me attach my patch for anybody else to pick up.

Comment 3 David Hildenbrand 2022-04-13 11:05:06 UTC
Of course, if we have a limited number of allocation threads (e.g., 16), we could simply set them up during QEMU startup and reuse them on actual preallocation.

libvirt could move them around as it pleases.

This would work as long as we only have one preallocation that intends to use multiple threads happening at the same time. IIRC, all threads (including the single thread that does preallocation for virtio-mem) are currently serialized using the BQL. While this might change in the future for virtio-mem, that's what holds right now.

It's sub-optimal, though. It would be better if QEMU simply accepted a list of physical CPUs to use for preallocation, and libvirt allowed QEMU to set the affinity itself to these physical CPUs (but not more).

Comment 4 Michal Privoznik 2022-04-13 13:03:06 UTC
(In reply to David Hildenbrand from comment #3)
> Of course, if we have a limited number of allocation threads (e.g., 16), we
> could simply set them up during QEMU startup and reuse them on actual
> preallocation.
> 
> libvirt could move them around as it pleases.
> 
> This would work as long as we only have one preallocation that intends to
> use multiple threads happening at the same time. IIRC, all threads
> (including the single thread that does preallocation for virtio-mem) are all
> currently serialized using the BQL. While this might change in the future
> for virtio-mem, it's what should hold right now.
> 
> It's sub-optimal though. It would be better if QEMU would just accept a list
> of physical CPUs to use for preallocation, and libvirt could allow QEMU to
> set the affinity itself to these physical CPUs (but not more).

I don't think that's how it works. We have preallocation for every memory backend on its own. For instance, when there are two memory-backend-file objects, each with, say, prealloc-threads=16, then the first 16 threads are created to serve the first object, and only after they have finished is the next set of 16 threads created to serve the other object. And don't forget about memory hotplug, where preallocation happens too. Speaking of which, I wonder what's going to happen when a large amount of memory is hotplugged on a running guest - I suspect that the monitor is going to be unresponsive for a long time.
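
For example (illustrative values), with:

  -object memory-backend-file,id=mem0,size=8G,mem-path=/dev/hugepages,prealloc=on,prealloc-threads=16
  -object memory-backend-file,id=mem1,size=8G,mem-path=/dev/hugepages,prealloc=on,prealloc-threads=16

the 16 threads preallocating mem0 are created and joined first, and only then is another set of 16 threads created for mem1.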

Comment 5 David Hildenbrand 2022-04-13 16:12:42 UTC
(In reply to Michal Privoznik from comment #4)
> (In reply to David Hildenbrand from comment #3)
> > Of course, if we have a limited number of allocation threads (e.g., 16), we
> > could simply set them up during QEMU startup and reuse them on actual
> > preallocation.
> > 
> > libvirt could move them around as it pleases.
> > 
> > This would work as long as we only have one preallocation that intends to
> > use multiple threads happening at the same time. IIRC, all threads
> > (including the single thread that does preallocation for virtio-mem) are all
> > currently serialized using the BQL. While this might change in the future
> > for virtio-mem, it's what should hold right now.
> > 
> > It's sub-optimal though. It would be better if QEMU would just accept a list
> > of physical CPUs to use for preallocation, and libvirt could allow QEMU to
> > set the affinity itself to these physical CPUs (but not more).
> 
> I don't think that's how it works. 

As I said "we could simply set them up during QEMU startup and reuse them on actual preallocation.", meaning, QEMU code would have to be changed.

I'm not particularly happy about either approach. It feels like we're working around a limitation imposed by upper management.

That said, I wonder what performance gain we can actually obtain via NUMA awareness and whether it is even worth the trouble here. Meaning: what performance boost in preallocation can we obtain?


It might be fairly easy to measure with your prototype patch, leaving libvirt out of the picture and just doing it via QEMU directly.

> Speaking of which, I wonder what's going to happen when a large amount of
> memory is hotplugged on a running guest - I suspect that the monitor is 
> going to be unresponsive for a long time.

I recall that that's how it works. At least we only return from object-add after preallocation is over and the new memory backend is ready to be used.

Comment 7 John Ferlan 2022-04-18 13:52:23 UTC
David - since you seem to already be thinking about this, I'll assign it to you directly, mainly because with the ITR set, there's a need for it to be assigned to someone...

Comment 12 Joe Mario 2022-05-09 22:42:09 UTC
Michal and David:
I was curious how much benefit there would be if the page zeroing were NUMA-aware.

To test it, I wrote a test that mmaps the specified number of reserved 1 GB hugepages and then touches each of them.
The pages can be touched by threads that are:
 1) Pinned to one NUMA node
 2) Allowed to float across all NUMA nodes
 3) Equally distributed and pinned across all NUMA nodes


I ran this on various systems. Nils ran it on his SAP HANA system, and our hardware partner ran it on their 32-socket, 47 TB system.

The results show:
 - On the 4-socket systems, the speedup from NUMA-aware threads was in the 25% range.
 - On the 32-socket system, NUMA-aware threads were about 2.3 times faster.

During my tests, I ran Intel's PCM tool to watch the memory I/O bandwidth to all the NUMA nodes.

When the threads were allowed to float, they would often bunch up on a given NUMA node, where the node's memory controller bandwidth hit its maximum. There was then a lot of cross-NUMA activity (threads on one node zeroing pages located on a remote NUMA node), and there was run-to-run variation in that activity.

When the threads were equally distributed and pinned across all NUMA nodes, the bandwidth to memory was equal on all the NUMA nodes.

The test program, example run script, and my test results are attached to this BZ.

In summary, there is upside to adding NUMA awareness.
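
A rough shell-level sketch of the first two modes, using a hypothetical page-touching binary (the real program and run script are in the attachments mentioned above):

  # reserve 64 x 1 GB hugepages (sysfs path may differ per system)
  echo 64 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

  # mode 1: all touch threads pinned to NUMA node 0
  numactl --cpunodebind=0 ./touch-hugepages --threads 16 --pages 64

  # mode 2: threads free to float across all nodes
  ./touch-hugepages --threads 16 --pages 64

Mode 3 (threads equally distributed and pinned across the nodes) needs per-thread pinning and is handled inside the test program itself.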

Comment 13 Michal Privoznik 2022-05-10 06:33:16 UTC
Yeah, I've run a patched QEMU that sets the affinity of the allocator threads to the NUMA nodes where the memory is being allocated, and I saw similar speedups. What stopped me from sending the patch is the discussion on the corresponding libvirt bug 2064194#c61 (for instance). Anyway, that discussion should be happening on qemu-devel or libvir-list, so let me post the patch to start it.

Comment 14 Michal Privoznik 2022-05-10 07:18:18 UTC
Posted here:

https://lists.gnu.org/archive/html/qemu-devel/2022-05/msg01835.html

Comment 21 David Hildenbrand 2022-05-30 09:13:17 UTC
(In reply to Joe Mario from comment #12)
> Michal and David:
> I was curious how much benefit there would be if the page-zeroing was
> numa-aware.
> 
> To test it, I wrote a test that mmaps the specified number of reserved 1Gb
> hugepages, and then touches each of them.  
> The pages can be touched by threads that are:
>  1) Pinned to one numa node
>  2) Allowed to float across all numa nodes
>  3) Equally distributed and pinned across all numa nodes.
> 
> 
> I ran this on various systems.  Nils ran it on his SAP HANA system, and our
> hardware partner ran it on their 32-socket 47 Tb system.
> 
> The results show:
>  - On the 4-socket systems, the speedup from numa-aware-threads was in the
> 25% range.
>  - On the 32-socket system, the speedup from numa-aware-threads was 2.3
> times faster.
> 
> During the tests that I did, I ran Intel's PCM tool to watch the memory I/O
> bandwidth to all the numa nodes.  
> 
> When the threads were allowed to float, they would often bunch up on a given
> numa node, where the node's memory controller bandwidth hit it's max speed. 
> And then there was lots of cross-numa activity (where a thread on one node
> was zeroing a page located on a remote numa node.)  And there was run-to-run
> variation of that activity.
> 
> When the threads were equally distributed and pinned across all numa nodes,
> the bandwidth to memory was equal on all the numa nodes.
> 
> The test program, example run script, and my test results are attached to
> this BZ.
> 
> In summary, there is upside to adding numa-awareness.

Thanks for the results, very helpful! I'll start thinking about possible alternatives to let QEMU+libvirt get along when it comes to preallocation thread placement.

Comment 25 David Hildenbrand 2022-08-01 15:03:25 UTC
QEMU proposal sent ~1-2 weeks ago: "[PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext"

https://lkml.kernel.org/r/20220721120732.118133-1-david@redhat.com

Comment 27 David Hildenbrand 2022-11-22 12:09:50 UTC
Patches were merged into QEMU upstream and will be part of QEMU 7.2.0.

bd77c30df9 vl: Allow ThreadContext objects to be created before the sandbox option
e681645862 hostmem: Allow for specifying a ThreadContext for preallocation
e04a34e55c util: Make qemu_prealloc_mem() optionally consume a ThreadContext
10218ae6d0 util: Add write-only "node-affinity" property for ThreadContext
e2de2c497e util: Introduce ThreadContext user-creatable object
7730f32c28 util: Introduce qemu_thread_set_affinity() and qemu_thread_get_affinity()
6556aadc18 util: Cleanup and rename os_mem_prealloc()

IIRC, this should automatically go into RHEL9.2/CS9 via the rebase to QEMU 7.2.0.

@mrezanin is that correct?
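
For completeness, the new wiring looks roughly like this on the QEMU command line (illustrative values; thread-context and the hostmem prealloc-context property are what the patches above add):

  # context whose threads are pinned to the CPUs of host NUMA node 0
  -object thread-context,id=tc1,node-affinity=0
  # let this backend's preallocation threads run in that context
  -object memory-backend-file,id=mem0,size=16G,mem-path=/dev/hugepages,prealloc=on,prealloc-threads=16,host-nodes=0,policy=bind,prealloc-context=tc1

Because thread-context objects can be created before the -sandbox option is processed (first patch in the list), setting the affinity is not blocked by resourcecontrol=deny.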

Comment 28 Miroslav Rezanina 2022-11-28 05:10:46 UTC
(In reply to David Hildenbrand from comment #27)
> Patches were merged into QEMU upstream and will be part of QEMU 7.2.0.
> 
> bd77c30df9 vl: Allow ThreadContext objects to be created before the sandbox
> option
> e681645862 hostmem: Allow for specifying a ThreadContext for preallocation
> e04a34e55c util: Make qemu_prealloc_mem() optionally consume a ThreadContext
> 10218ae6d0 util: Add write-only "node-affinity" property for ThreadContext
> e2de2c497e util: Introduce ThreadContext user-creatable object
> 7730f32c28 util: Introduce qemu_thread_set_affinity() and
> qemu_thread_get_affinity()
> 6556aadc18 util: Cleanup and rename os_mem_prealloc()
> 
> IIRC, this should automatically go into RHEL9.2/CS9 via the rebase to QEMU
> 7.2.0.
> 
> @mrezanin is that correct?

Yes, we will rebase to 7.2.0 for 9.2 so any upstream patches will come via rebase.

Comment 29 Nitesh Narayan Lal 2022-12-20 15:34:00 UTC
Since the rebase BZ has already moved to MODIFIED, this BZ was moved directly from ASSIGNED to MODIFIED.
John, Mirek, David, and I discussed this BZ and made these changes.
Please feel free to change anything in case we missed something.

Thanks

Comment 30 Yanan Fu 2022-12-23 02:27:09 UTC
QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 31 John Ferlan 2022-12-23 14:18:32 UTC
please provide qa_ack+ to get release+

Comment 32 Mario Casquero 2023-01-03 06:43:19 UTC
Removing needinfo since qa_ack+ has been provided

Comment 39 errata-xmlrpc 2023-05-09 07:19:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: qemu-kvm security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2162

