
Bug 2168910

Summary: Incorrect CPU affinity with numatune mode strict
Product: Red Hat Enterprise Linux 9
Component: libvirt
libvirt sub component: General
Version: 9.2
Hardware: Unspecified
OS: Linux
Status: CLOSED MIGRATED
Severity: unspecified
Priority: unspecified
Target Milestone: rc
Target Release: ---
Reporter: Adrian Tomasov <atomasov>
Assignee: Virtualization Maintenance <virt-maint>
QA Contact: liang cong <lcong>
Docs Contact:
CC: aokuliar, atomasov, eskultet, jhladky, lmen, mprivozn, osabart, virt-maint
Keywords: MigratedToJIRA, Triaged
Flags: pm-rhel: mirror+
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-07-07 21:32:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Adrian Tomasov 2023-02-10 14:12:45 UTC
Description of problem:
We have been trying to set the NUMA affinity of a specific VM using libvirt by adding this to the VM config:
<numatune>
        <memory mode='strict' nodeset='1'/>
</numatune>
This should set both the memory and the vCPU affinity to the specified NUMA node. The memory affinity was correct:
PID              Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
7103 (qemu-kvm)       0   2126      2      2      2      2      1      2  2198

However, the emulator and vCPU threads have their affinity set to a different NUMA node:
# taskset -cp 7103
pid 7103's current affinity list: 12-15,44-47
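
For reference, the CPU list of each host NUMA node can be checked with numactl, so the affinity list above can be compared against the CPUs of node 1 (command shown as a sketch; the output depends on the host):
# numactl --hardware | grep cpus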


Version-Release number of selected component (if applicable):
libvirt-8.5.0-7.el9_1.x86_64

How reproducible:
always

Steps to Reproduce:
1. Install hypervisor with RHEL9.1.0 and virtualization packages
2. Install VM
3. Add this into the VM config:
<numatune>
        <memory mode='strict' nodeset='1'/>
</numatune>
<vcpu placement='auto'>8</vcpu>

4. Start VM
5. Check memory affinity using numastat -c qemu-kvm
6. Check CPU affinity using taskset -cp <PID>
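
One way to obtain <PID> for step 6, assuming only one qemu-kvm guest is running on the host:
# pgrep qemu-kvm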

Actual results:
Correct memory, but incorrect CPU affinity.


Expected results:
Correct memory and CPU affinity list.

Additional info:
We can provide you with a hypervisor with many NUMA nodes for further investigation of this issue.

Comment 4 Michal Privoznik 2023-02-10 15:49:45 UTC
Yeah, I remember talking with Erik Skultety about this yesterday. I've found what looks like evidence in the code that this used to work, because I couldn't recall whether we used to set affinity on vCPUs based solely on <numatune>. Meanwhile, explicit vCPU pinning should work if you're in need of a workaround.
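
As an illustration of that workaround (the CPU numbers here are hypothetical and would have to match the host CPUs of NUMA node 1), explicit pinning can be expressed directly in the domain XML:

<vcpu placement='static' cpuset='8-11,40-43'>8</vcpu>

The same pinning can also be adjusted on a running domain with virsh vcpupin and virsh emulatorpin.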

Comment 5 liang cong 2023-02-13 04:26:38 UTC
(In reply to Michal Privoznik from comment #4)
> Yeah, I remember talking with Erik Skultety about this yesterday. I've
> found what looks like evidence in the code that this used to work, because
> I couldn't recall whether we used to set affinity on vCPUs based solely on
> <numatune>. Meanwhile, explicit vCPU pinning should work if you're in need
> of a workaround.

Hi Michal,
From the NUMA node tuning part of the libvirt docs (https://libvirt.org/formatdomain.html#numa-node-tuning), we can only see that the memory allocation is decided; there is no information about CPU affinity.
AFAIK, if the auto placement mode is set, the numad service decides the advisory nodeset, as we can see with:
# cat /run/libvirt/qemu/vm1.xml | grep nodeset
  <numad nodeset='X' cpuset='1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47'/>

The CPU affinity is then the same as that cpuset.

Even if we set the NUMA memory mode to strict on node 1, the advisory nodeset may be different from that.
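
As a purely hypothetical illustration of that mismatch (the values below are invented, not taken from the reporter's host), numad may advise a different node than the one requested via <numatune>:

# cat /run/libvirt/qemu/vm1.xml | grep nodeset
  <numad nodeset='3' cpuset='12-15,44-47'/>

in which case the vCPU and emulator affinity follows the advisory cpuset rather than node 1.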

Comment 6 liang cong 2023-02-13 04:29:24 UTC
(In reply to Adrian Tomasov from comment #0)
> We have been trying to set the NUMA affinity of a specific VM using libvirt
> by adding this to the VM config:
> <numatune>
>         <memory mode='strict' nodeset='1'/>
> </numatune>
> [...]
> However, the emulator and vCPU threads have their affinity set to a
> different NUMA node:
> # taskset -cp 7103
> pid 7103's current affinity list: 12-15,44-47
> [...]

Hi Adrian,

Could you check whether the following command reports the CPU affinity you saw in your test ("12-15,44-47")? Thanks.
# cat /run/libvirt/qemu/${domain-name}.xml | grep nodeset

Comment 7 Michal Privoznik 2023-02-13 09:10:17 UTC
(In reply to liang cong from comment #5)
> From the NUMA node tuning part of the libvirt docs
> (https://libvirt.org/formatdomain.html#numa-node-tuning), we can only see
> that the memory allocation is decided; there is no information about CPU
> affinity.
> [...]
> Even if we set the NUMA memory mode to strict on node 1, the advisory
> nodeset may be different from that.

Yeah, we do not document this behavior in the public docs, but I think @eskultet found it somewhere in the RHEL docs. Nevertheless, the code behaves erratically: first it sets the affinity, but then undoes it. This is the root cause of the problem. And I do agree that the ordering of affinity sources should be as follows:

1) domain XML,
2) numad recommendation,
3) <numatune/>

I mean, it only makes sense to set the affinity of vCPU threads so that they are local to the memory they work with. Ideally, the kernel would move threads around to achieve locality, but apparently that isn't happening and the scheduler needs a bit of help. Mind you, affinity is a "recommendation", not a hard restriction. The kernel can still schedule a thread to run on a different host CPU (should it need to), but the CPUs from the affinity set are preferred.
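
For completeness, the vCPU and emulator affinity that libvirt actually applied can also be inspected through virsh rather than taskset; a sketch, assuming the domain is named vm1:

# virsh vcpupin vm1
# virsh emulatorpin vm1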

Comment 8 Erik Skultety 2023-02-13 10:50:47 UTC
The thing is, as I noticed recently, those docs were for RHEL-7 (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_tuning_and_optimization_guide/sect-virtualization_tuning_optimization_guide-numa-numa_and_libvirt#sect-Virtualization_Tuning_Optimization_Guide-NUMA-NUMA_and_libvirt-Domain_Processes) and I could not find anything similar in the RHEL-8/9 docs (was that just a mistake in the docs back then?). Still, even though affinity is a recommendation, it was clear from the testing that, in terms of the code that handles this in libvirt, we could do better; after that it's up to the kernel.

Comment 9 liang cong 2023-02-16 01:37:12 UTC
IMO, the configuration from the description does not really demonstrate the problem:
<numatune>
        <memory mode='strict' nodeset='1'/>
</numatune>
<vcpu placement='auto'>8</vcpu>

For this setting, <vcpu placement='auto'>8</vcpu> means the domain process will be pinned to the advisory nodeset obtained by querying numad. That advisory nodeset can differ from the nodeset 1 specified in the memory tuning.
So I think the setting should be:
<numatune>
        <memory mode='strict' nodeset='1'/>
</numatune>
<vcpu placement='static'>8</vcpu>

But if we set it like that, the libvirt docs say: if placement is "static" but no cpuset is specified, the domain process will be pinned to all the available physical CPUs, and that is the result I got on build libvirt-9.0.0-3.el9.x86_64.

So the current behavior matches the documented logic.
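
Following that logic, a configuration that keeps both memory and vCPUs on node 1 without relying on numad would combine strict numatune with explicit static pinning; a sketch, with hypothetical CPU numbers that would have to match the host CPUs of node 1:

<vcpu placement='static' cpuset='8-11,40-43'>8</vcpu>
<numatune>
        <memory mode='strict' nodeset='1'/>
</numatune>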