Bug 2185184 - Specifying restrictive numa tuning mode per each guest numa node doesn't work
Summary: Specifying restrictive numa tuning mode per each guest numa node doesn't work
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: libvirt
Version: 9.2
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Martin Kletzander
QA Contact: liang cong
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-04-07 11:13 UTC by liang cong
Modified: 2023-11-07 09:40 UTC
CC List: 5 users

Fixed In Version: libvirt-9.3.0-1.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-11-07 08:31:17 UTC
Type: Bug
Target Upstream Version: 9.3.0
Embargoed:


Attachments: none


Links
System ID                                Last Updated
Red Hat Issue Tracker RHELPLAN-154233    2023-04-07 11:15:11 UTC
Red Hat Product Errata RHSA-2023:6409    2023-11-07 08:31:49 UTC

Description liang cong 2023-04-07 11:13:49 UTC
Description of problem:
Specifying restrictive numa tuning mode per each guest numa node doesn't work

Version-Release number of selected component (if applicable):
libvirt-9.0.0-10.el9_2.x86_64

How reproducible:
100%

Steps to Reproduce:
Scenario 1:
1.1 Define and start a guest with memory and numa tuning config as below:
<maxMemory slots='16' unit='KiB'>52428800</maxMemory>
<memory unit='KiB'>2072576</memory>
<currentMemory unit='KiB'>2072576</currentMemory>
...
<numatune>
     <memnode cellid="0" mode="restrictive" nodeset="1"/>
</numatune>
...
<cpu>
...
<numa>
      <cell id='0' cpus='0-1' memory='1048576' unit='KiB'/>
      <cell id='1' cpus='2-3' memory='1024000' unit='KiB'/>
 </numa>
</cpu>

1.2 Check guest numa cell 0 memory allocation on host
# grep -B1 1048576 /proc/`pidof qemu-kvm`/smaps
7f8013e00000-7f8053e00000 rw-p 00000000 00:00 0
Size:            1048576 kB

# grep 7f8013e00000 /proc/`pidof qemu-kvm`/numa_maps
7f8013e00000 default anon=62464 dirty=62464 active=61952 N0=7680 N1=54784 kernelpagesize_kB=4
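
For convenience, the per-node counts for that mapping can be pulled out in one pass; a minimal sketch, assuming a single running qemu-kvm process, with the address taken from the smaps output above (the Nx= values are page counts, 4 kB pages here per kernelpagesize_kB=4):

# ADDR=7f8013e00000   # start of the 1048576 kB mapping reported by smaps
# awk -v a="$ADDR" '$1 == a { for (i = 2; i <= NF; i++) if ($i ~ /^N[0-9]+=/) print $i }' /proc/$(pidof qemu-kvm)/numa_maps
N0=7680
N1=54784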

1.3 Check the cpuset.mems cgroup setting:
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d18\\x2dvm1.scope/libvirt/emulator/cpuset.mems


Actual results:
The guest NUMA cell 0 memory is allocated on both host NUMA node 0 and node 1; it is not restricted to the nodeset given in the numa tuning setting.

Scenario 2:
2.1 Define and start a guest with memory and numa tuning config as below:
<maxMemory slots='16' unit='KiB'>52428800</maxMemory>
<memory unit='KiB'>2072576</memory>
<currentMemory unit='KiB'>2072576</currentMemory>
...
<numatune>
     <memory mode='restrictive' nodeset='1' />
     <memnode cellid="0" mode="restrictive" nodeset="0"/>
</numatune>
...
<cpu>
...
<numa>
      <cell id='0' cpus='0-1' memory='1048576' unit='KiB'/>
      <cell id='1' cpus='2-3' memory='1024000' unit='KiB'/>
 </numa>
</cpu>

2.2 Check guest numa cell 0 memory allocation on host
# grep -B1 1048576 /proc/`pidof qemu-kvm`/smaps
7f60c3e00000-7f6103e00000 rw-p 00000000 00:00 0
Size:            1048576 kB

# grep 7f60c3e00000 /proc/`pidof qemu-kvm`/numa_maps
7f60c3e00000 default anon=261140 dirty=261140 active=74772 N0=249876 N1=11264 kernelpagesize_kB=4

2.3 Check guest numa cell 1 memory allocation on host
# grep -B1 1024000 /proc/`pidof qemu-kvm`/smaps
7f6085400000-7f60c3c00000 rw-p 00000000 00:00 0
Size:            1024000 kB

# grep 7f6085400000 /proc/`pidof qemu-kvm`/numa_maps
7f6085400000 default anon=228352 dirty=228352 active=218624 N0=153600 N1=74752 kernelpagesize_kB=4

2.4 Check the cpuset.mems cgroup setting:
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d17\\x2dvm1.scope/libvirt/emulator/cpuset.mems
1

Actual results:
Guest NUMA cell 0 and cell 1 memory is allocated on both host NUMA node 0 and node 1; it is not restricted as the numa tuning settings specify.

Scenario 3:
3.1 Define and start a guest with memory and numa tuning config as below:
<maxMemory slots='16' unit='KiB'>52428800</maxMemory>
<memory unit='KiB'>2072576</memory>
<currentMemory unit='KiB'>2072576</currentMemory>
...
<numatune>
     <memory mode='restrictive' nodeset='0-1' />
     <memnode cellid="0" mode="restrictive" nodeset="0"/>
</numatune>
...
<cpu>
...
<numa>
      <cell id='0' cpus='0-1' memory='1048576' unit='KiB'/>
      <cell id='1' cpus='2-3' memory='1024000' unit='KiB'/>
 </numa>
</cpu>

3.2 Check guest numa cell 0 memory allocation on host
# grep -B1 1048576 /proc/`pidof qemu-kvm`/smaps
7fef07e00000-7fef47e00000 rw-p 00000000 00:00 0
Size:            1048576 kB

# grep 7fef07e00000 /proc/`pidof qemu-kvm`/numa_maps
7fef07e00000 default anon=70670 dirty=70670 N0=3584 N1=67086 kernelpagesize_kB=4

3.3 Check the cpuset.mems cgroup setting:
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d21\\x2dvm1.scope/libvirt/emulator/cpuset.mems
0-1

Actual results:
The guest NUMA cell 0 memory is allocated on both host NUMA node 0 and node 1; it is not restricted to the nodeset given in the numa tuning setting.


Expected results:
Specifying restrictive numa tuning mode for each guest NUMA node should restrict that node's memory allocation to the specified host nodeset.

Additional info:
From the above scenarios, restrictive numa tuning mode per guest NUMA node does not work in any combination (if the per-node nodeset is the same as the host-level numa tuning nodeset, the per-node setting is redundant anyway). If only the host-level restrictive numa mode (<memory mode='restrictive' nodeset='0-1'/>) works on its own, there should be some check that forbids these combinations during the virsh define process.

Comment 1 Michal Privoznik 2023-04-13 11:15:06 UTC
Yeah, I don't think we can use mode="restrictive" for individual guest NUMA nodes (/numatune/memnode). It's not like a NUMA node is a different thread/process (i.e. units that CGroups understand). Martin, what do you think?

Comment 2 Martin Kletzander 2023-04-13 15:13:06 UTC
With `restrictive` the only setting we can do (and this mode was introduced precisely for this reason) is to limit the vCPU threads (not the emulator thread) with cpuset.mems and hope for the best (either that the allocation will be done after the setting or that it might migrate).  It depends on the system and is done so that we can change the numa node(s) during runtime (which is also not guaranteed to migrate the memory).

Instead of looking into `emulator/cpuset.mems` peek into `vcpu*/cpuset.mems`.
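
For example, a minimal sketch that dumps cpuset.mems for the emulator and every vCPU in one go, assuming a single running domain whose cgroup scope matches the machine-qemu*vm1.scope glob and the cgroup v2 layout used above:

# for f in /sys/fs/cgroup/machine.slice/machine-qemu*vm1.scope/libvirt/{emulator,vcpu*}/cpuset.mems; do printf '%s: %s\n' "$f" "$(cat "$f")"; done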

Also, in the first example the node has only 244MB of allocated memory; if you allocate more while it is running, it should, potentially (if there's enough room), allocate it from the right node *if* you also make sure you are allocating that memory from the right guest numa node.

One more note, with cgroups v1 we explicitly set `cpuset.memory_migrate` to `1`, but cgroups v2 behave differently.  When a task is migrated to a cgroup, the resources (including memory allocations) are not migrated with it, but once anyone writes to `cpuset.mems` the memory is migrated.  I will check whether we write to that file after the vcpu is moved there or before.  Anyway, it is all based on the assumption that what is using the node's memory is the vcpu of that node, and we can't do much more.
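
For reference, a quick way to tell which of the two behaviours applies on a given host; a minimal sketch (cpuset.memory_migrate is a v1-only knob, and a unified v2 mount reports cgroup2fs):

# stat -fc %T /sys/fs/cgroup    # prints cgroup2fs on cgroups v2, tmpfs on the legacy v1 layout
# ls /sys/fs/cgroup/cpuset/cpuset.memory_migrate 2>/dev/null || echo "no cpuset.memory_migrate knob, so cgroups v2 semantics apply"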

Comment 7 Martin Kletzander 2023-04-20 08:52:08 UTC
I posted some fix for this:

https://www.mail-archive.com/libvir-list@redhat.com/msg237420.html

Comment 8 Martin Kletzander 2023-04-20 10:48:03 UTC
Fixed upstream with v9.2.0-271-g383caddea103 and v9.2.0-272-g2f4f381871d2:

commit 383caddea103eaab7bb495ec446b43748677f749
Author: Martin Kletzander <mkletzan>
Date:   Fri Apr 14 12:08:59 2023 +0200

    qemu, ch: Move threads to cgroup dir before changing parameters

commit 2f4f381871d253e3ec34f32b452c32570459bdde
Author: Martin Kletzander <mkletzan>
Date:   Thu Apr 20 08:51:14 2023 +0200

    docs: Clarify restrictive numatune mode
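
For anyone tracking this downstream, a minimal sketch of checking which release first contains these commits, assuming a clone of the upstream libvirt git repository:

# git describe --contains 383caddea103
# git describe --contains 2f4f381871d2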

Comment 9 liang cong 2023-04-25 06:44:40 UTC
Preverified on upstream libvirt v9.2.0-277-gd063389f10

Test steps:
Scenario 1: with restrictive mode
1.1 Define and start a guest with numatune and numa config xml:
<numatune>
     <memory mode='restrictive' nodeset='1' />
     <memnode cellid="0" mode="restrictive" nodeset="0"/>
</numatune>
<cpu>
<numa>
      <cell id='0' cpus='0' memory='1024000' unit='KiB'/>
      <cell id='1' cpus='1' memory='1048576' unit='KiB'/>
</numa>
...
</cpu>

1.2 check the cgroup cpuset.mems config
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d2\\x2dvm1.scope/libvirt/emulator/cpuset.mems
1
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d2\\x2dvm1.scope/libvirt/vcpu0/cpuset.mems
0
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d2\\x2dvm1.scope/libvirt/vcpu1/cpuset.mems
1

1.3 in guest consume the memory
# swapoff -a
# memhog 1200000KB
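
After the guest has consumed the memory, the resulting per-node placement can be re-checked from the host; a minimal sketch, assuming the numactl package (which provides numastat) is installed and a single qemu-kvm process is running:

# numastat -p $(pidof qemu-kvm)    # per-node resident memory of the qemu-kvm process, in MB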

Scenario 2: with interleave mode
2.1 Define and start a guest with numatune and numa config xml:
<numatune>
     <memory mode='interleave' nodeset='1' />
     <memnode cellid="0" mode="interleave" nodeset="0"/>
</numatune>
<cpu>
<numa>
      <cell id='0' cpus='0' memory='1024000' unit='KiB'/>
      <cell id='1' cpus='1' memory='1048576' unit='KiB'/>
</numa>
...
</cpu>

2.2 check the cgroup cpuset.mems config
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d4\\x2dvm1.scope/libvirt/emulator/cpuset.mems

# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d4\\x2dvm1.scope/libvirt/vcpu0/cpuset.mems

# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d4\\x2dvm1.scope/libvirt/vcpu1/cpuset.mems

2.3 in guest consume the memory
# swapoff -a
# memhog 1200000KB

Scenario 3: with strict mode
3.1 Define and start a guest with numatune and numa config xml:
<numatune>
     <memory mode='strict' nodeset='0-1' />
     <memnode cellid="0" mode="strict" nodeset="0"/>
</numatune>
<cpu>
<numa>
      <cell id='0' cpus='0' memory='1024000' unit='KiB'/>
      <cell id='1' cpus='1' memory='1048576' unit='KiB'/>
</numa>
...
</cpu>

3.2 check the cgroup cpuset.mems config
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d6\\x2dvm1.scope/libvirt/emulator/cpuset.mems
0-1
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d6\\x2dvm1.scope/libvirt/vcpu0/cpuset.mems
0-1
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d6\\x2dvm1.scope/libvirt/vcpu1/cpuset.mems
0-1

3.3 in guest consume the memory
# swapoff -a
# memhog 1200000KB


Also checked other scenarios, such as: with vcpupin, with emulatorpin, and changing numa tuning in restrictive mode.

Hi Martin, I tested the code change with the above test steps; do you think any more test scenarios need to be covered?
And for the doc update:
Note that for ``memnode`` this will only guide the memory access for the vCPU
threads or similar mechanism and is very hypervisor-specific.  This does not
guarantee the placement of the node's memory allocation.  For proper
restriction other means should be used (e.g. different mode, preallocated
hugepages).
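
As an illustration of such "other means", a hedged sketch of hugepage-backed guest memory in the domain XML; the 2048 KiB page size is only an example, and matching hugepages must already be preallocated on the host:

<memoryBacking>
  <hugepages>
    <page size='2048' unit='KiB' nodeset='0'/>  <!-- backs guest NUMA cell 0 -->
    <page size='2048' unit='KiB' nodeset='1'/>  <!-- backs guest NUMA cell 1 -->
  </hugepages>
</memoryBacking>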

IMO this explanation only applies to memnode with restrictive mode, right? If so, I think we'd better state that in the doc, thanks.

Comment 10 Martin Kletzander 2023-04-25 10:36:24 UTC
I don't think restrictive mode for memnodes needs much testing, of course the matrix can explode very easily.

This explanation is meant for memnode, but there's added docs for numatune/memory as well, although the whole domain is restricted before launch in the latter case and that should work a bit better.

Comment 11 liang cong 2023-04-26 02:22:21 UTC
Marking it as tested per comment 9.

Comment 14 liang cong 2023-05-18 02:27:01 UTC
Verified on:
# rpm -q libvirt qemu-kvm
libvirt-9.3.0-2.el9.x86_64
qemu-kvm-8.0.0-3.el9.x86_64

Test steps:
Scenario 1: restrictive mode
1.1 Define and start a guest with numatune and numa config xml:
<numatune>
     <memory mode='restrictive' nodeset='1' />
     <memnode cellid="0" mode="restrictive" nodeset="0"/>
</numatune>
<cpu>
<numa>
      <cell id='0' cpus='0' memory='1024000' unit='KiB'/>
      <cell id='1' cpus='1' memory='1048576' unit='KiB'/>
</numa>
...
</cpu>

1.2 check the cgroup cpuset.mems config
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d7\\x2dvm1.scope/libvirt/emulator/cpuset.mems
1
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d7\\x2dvm1.scope/libvirt/vcpu0/cpuset.mems
0
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d7\\x2dvm1.scope/libvirt/vcpu1/cpuset.mems
1

1.3 in guest consume the memory
# swapoff -a
# memhog 1200000KB

Scenario 2: restrictive with interleave mode
2.1 Define and start a guest with numatune and numa config xml:
<numatune>
     <memory mode='interleave' nodeset='1' />
     <memnode cellid="0" mode="restrictive" nodeset="0"/>
</numatune>
<cpu>
<numa>
      <cell id='0' cpus='0' memory='1024000' unit='KiB'/>
      <cell id='1' cpus='1' memory='1048576' unit='KiB'/>
</numa>
...
</cpu>

2.2 check the cgroup cpuset.mems config
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d8\\x2dvm1.scope/libvirt/emulator/cpuset.mems

# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d8\\x2dvm1.scope/libvirt/vcpu0/cpuset.mems
0
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d8\\x2dvm1.scope/libvirt/vcpu1/cpuset.mems

2.3 in guest consume the memory
# swapoff -a
# memhog 1200000KB

Scenario 3: strict mode
3.1 Define and start a guest with numatune and numa config xml:
<numatune>
     <memory mode='strict' nodeset='0-1' />
     <memnode cellid="0" mode="strict" nodeset="0"/>
</numatune>
<cpu>
<numa>
      <cell id='0' cpus='0' memory='1024000' unit='KiB'/>
      <cell id='1' cpus='1' memory='1048576' unit='KiB'/>
</numa>
...
</cpu>

3.2 check the cgroup cpuset.mems config
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d9\\x2dvm1.scope/libvirt/emulator/cpuset.mems
0-1
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d9\\x2dvm1.scope/libvirt/vcpu0/cpuset.mems
0-1
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d9\\x2dvm1.scope/libvirt/vcpu1/cpuset.mems
0-1

3.3 in guest consume the memory
# swapoff -a
# memhog 1200000KB

Also checked other scenarios, such as: with vcpupin, with emulatorpin, and changing numa tuning in restrictive mode.

Comment 16 errata-xmlrpc 2023-11-07 08:31:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: libvirt security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6409

