Bug 2185184
| Summary: | Specifying restrictive NUMA tuning mode per guest NUMA node doesn't work | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | liang cong <lcong> |
| Component: | libvirt | Assignee: | Martin Kletzander <mkletzan> |
| libvirt sub component: | General | QA Contact: | liang cong <lcong> |
| Status: | VERIFIED --- | Docs Contact: | |
| Severity: | medium | ||
| Priority: | medium | CC: | jdenemar, lmen, mkletzan, mprivozn, virt-maint |
| Version: | 9.2 | Keywords: | AutomationTriaged, Triaged |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | libvirt-9.3.0-1.el9 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | Type: | Bug | |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | 9.3.0 |
| Embargoed: | |||
Yeah, I don't think we can use mode="restrictive" for individual guest NUMA nodes (/numatune/memnode). It's not like a NUMA node is a separate thread/process (i.e. a unit that cgroups understand). Martin, what do you think?

With `restrictive` the only setting we can apply (and this mode was introduced precisely for this reason) is to limit the vCPU threads (not the emulator thread) with cpuset.mems and hope for the best: either the allocation happens after the setting is applied, or the memory might migrate. It depends on the system, and it is done this way so that we can change the NUMA node(s) at runtime (which is also not guaranteed to migrate the memory). Instead of looking into `emulator/cpuset.mems`, peek into `vcpu*/cpuset.mems`. Also, in the first example the node has only 244MB of memory allocated; if you allocate more while it is running it should, potentially, if there's enough room, allocate it from the right node *if* you also make sure you are allocating that memory from the right guest NUMA node.

One more note: with cgroups v1 we explicitly set `cpuset.memory_migrate` to `1`, but cgroups v2 behaves differently. When a task is migrated to a cgroup, its resources (including memory allocations) are not migrated with it, but once anyone writes to `cpuset.mems` the memory is migrated. I will check whether we write to that file before or after the vCPU is moved there. Anyway, it is all based on the fact that what uses a node's memory is the vCPU of that node, and we can't do much more.

I posted a fix for this: https://www.mail-archive.com/libvir-list@redhat.com/msg237420.html

Fixed upstream with v9.2.0-271-g383caddea103 and v9.2.0-272-g2f4f381871d2:
commit 383caddea103eaab7bb495ec446b43748677f749
Author: Martin Kletzander <mkletzan>
Date: Fri Apr 14 12:08:59 2023 +0200
qemu, ch: Move threads to cgroup dir before changing parameters
commit 2f4f381871d253e3ec34f32b452c32570459bdde
Author: Martin Kletzander <mkletzan>
Date: Thu Apr 20 08:51:14 2023 +0200
docs: Clarify restrictive numatune mode
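Side note on the cgroups v1/v2 difference discussed above: a minimal hedged sketch for checking which cgroup version is in effect on the host. The v1 path below is only an assumed example layout (adjust the machine scope to your own guest); `stat` prints cgroup2fs on a unified (v2) host and tmpfs on a legacy (v1) setup, and on v1 the flag libvirt sets should read back as 1:
# stat -fc %T /sys/fs/cgroup
cgroup2fs
# cat /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\\x2d2\\x2dvm1.scope/emulator/cpuset.memory_migrate
1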
Preverified on upstream libvirt v9.2.0-277-gd063389f10
Test steps:
Scenario 1: restrictive mode
1.1 Define and start a guest with numatune and numa config xml:
<numatune>
<memory mode='restrictive' nodeset='1' />
<memnode cellid="0" mode="restrictive" nodeset="0"/>
</numatune>
<cpu>
<numa>
<cell id='0' cpus='0' memory='1024000' unit='KiB'/>
<cell id='1' cpus='1' memory='1048576' unit='KiB'/>
</numa>
...
</cpu>
1.2 check the cgroup cpuset.mems config
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d2\\x2dvm1.scope/libvirt/emulator/cpuset.mems
1
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d2\\x2dvm1.scope/libvirt/vcpu0/cpuset.mems
0
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d2\\x2dvm1.scope/libvirt/vcpu1/cpuset.mems
1
1.3 In the guest, consume the memory
# swapoff -a
# memhog 1200000KB
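To cross-check where the cell 0 memory actually lands after the memhog run, the same smaps/numa_maps technique from the original description can be reused on the host; a minimal sketch, assuming numastat is installed and with the region address left as a placeholder:
# numastat -p `pidof qemu-kvm`
# grep -B1 1048576 /proc/`pidof qemu-kvm`/smaps
# grep <start-address-of-the-1GiB-region> /proc/`pidof qemu-kvm`/numa_maps
(the N0=/N1= page counters show how much of the region sits on each host node)
Inside the guest, binding the allocation to guest node 0 makes sure it is the cell-0 restriction that gets exercised (assuming numactl is available in the guest):
# numactl --membind=0 memhog 500M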
Scenario 2: interleave mode
2.1 Define and start a guest with numatune and numa config xml:
<numatune>
<memory mode='interleave' nodeset='1' />
<memnode cellid="0" mode="interleave" nodeset="0"/>
</numatune>
<cpu>
<numa>
<cell id='0' cpus='0' memory='1024000' unit='KiB'/>
<cell id='1' cpus='1' memory='1048576' unit='KiB'/>
</numa>
...
</cpu>
2.2 check the cgroup cpuset.mems config
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d4\\x2dvm1.scope/libvirt/emulator/cpuset.mems
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d4\\x2dvm1.scope/libvirt/vcpu0/cpuset.mems
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d4\\x2dvm1.scope/libvirt/vcpu1/cpuset.mems
2.3 In the guest, consume the memory
# swapoff -a
# memhog 1200000KB
Scenario 3: strict mode
3.1 Define and start a guest with numatune and numa config xml:
<numatune>
<memory mode='strict' nodeset='0-1' />
<memnode cellid="0" mode="strict" nodeset="0"/>
</numatune>
<cpu>
<numa>
<cell id='0' cpus='0' memory='1024000' unit='KiB'/>
<cell id='1' cpus='1' memory='1048576' unit='KiB'/>
</numa>
...
</cpu>
3.2 check the cgroup cpuset.mems config
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d6\\x2dvm1.scope/libvirt/emulator/cpuset.mems
0-1
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d6\\x2dvm1.scope/libvirt/vcpu0/cpuset.mems
0-1
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d6\\x2dvm1.scope/libvirt/vcpu1/cpuset.mems
0-1
3.3 In the guest, consume the memory
# swapoff -a
# memhog 1200000KB
Also checked other scenarios, such as: with vcpupin, with emulatorpin, and changing the numatune setting while in restrictive mode.
Hi Martin, I tested the code change with the above test steps; do you think any additional test scenarios need to be covered?
And regarding the doc update:
Note that for ``memnode`` this will only guide the memory access for the vCPU
threads or similar mechanism and is very hypervisor-specific. This does not
guarantee the placement of the node's memory allocation. For proper
restriction other means should be used (e.g. different mode, preallocated
hugepages).
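As an illustration of the "preallocated hugepages" alternative mentioned in that doc text, a hedged sketch; the page size, page count, and nodesets below are examples, not values from this bug. On the host, preallocate 2 MiB pages on node 0, then back guest cell 0 with hugepages and pin it with a strict memnode so the placement is decided at allocation time rather than via cpuset.mems:
# echo 1024 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
<memoryBacking>
<hugepages>
<page size='2048' unit='KiB' nodeset='0'/>
</hugepages>
</memoryBacking>
<numatune>
<memnode cellid='0' mode='strict' nodeset='0'/>
</numatune>
(here the nodeset on <page> refers to guest NUMA nodes, while the memnode nodeset refers to host NUMA nodes)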
IMO this explanation only applies to memnode with restrictive mode, right? If so, I think we'd better state that explicitly in the doc, thanks.
I don't think restrictive mode for memnodes needs much testing; of course the test matrix can explode very easily. This explanation is meant for memnode, but there are added docs for numatune/memory as well, although in the latter case the whole domain is restricted before launch and that should work a bit better.

Marking it tested as in comment 9.

Verified on:
# rpm -q libvirt qemu-kvm
libvirt-9.3.0-2.el9.x86_64
qemu-kvm-8.0.0-3.el9.x86_64
Test steps:
Scenario 1: restrictive mode
1.1 Define and start a guest with numatune and numa config xml:
<numatune>
<memory mode='restrictive' nodeset='1' />
<memnode cellid="0" mode="restrictive" nodeset="0"/>
</numatune>
<cpu>
<numa>
<cell id='0' cpus='0' memory='1024000' unit='KiB'/>
<cell id='1' cpus='1' memory='1048576' unit='KiB'/>
</numa>
...
</cpu>
1.2 check the cgroup cpuset.mems config
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d7\\x2dvm1.scope/libvirt/emulator/cpuset.mems
1
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d7\\x2dvm1.scope/libvirt/vcpu0/cpuset.mems
0
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d7\\x2dvm1.scope/libvirt/vcpu1/cpuset.mems
1
1.3 In the guest, consume the memory
# swapoff -a
# memhog 1200000KB
Scenario 2: restrictive memnode with interleave memory mode
2.1 Define and start a guest with numatune and numa config xml:
<numatune>
<memory mode='interleave' nodeset='1' />
<memnode cellid="0" mode="restrictive" nodeset="0"/>
</numatune>
<cpu>
<numa>
<cell id='0' cpus='0' memory='1024000' unit='KiB'/>
<cell id='1' cpus='1' memory='1048576' unit='KiB'/>
</numa>
...
</cpu>
2.2 check the cgroup cpuset.mems config
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d8\\x2dvm1.scope/libvirt/emulator/cpuset.mems
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d8\\x2dvm1.scope/libvirt/vcpu0/cpuset.mems
0
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d8\\x2dvm1.scope/libvirt/vcpu1/cpuset.mems
2.3 In the guest, consume the memory
# swapoff -a
# memhog 1200000KB
Scenario 3: strict mode
3.1 Define and start a guest with numatune and numa config xml:
<numatune>
<memory mode='strict' nodeset='0-1' />
<memnode cellid="0" mode="strict" nodeset="0"/>
</numatune>
<cpu>
<numa>
<cell id='0' cpus='0' memory='1024000' unit='KiB'/>
<cell id='1' cpus='1' memory='1048576' unit='KiB'/>
</numa>
...
</cpu>
3.2 check the cgroup cpuset.mems config
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d9\\x2dvm1.scope/libvirt/emulator/cpuset.mems
0-1
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d9\\x2dvm1.scope/libvirt/vcpu0/cpuset.mems
0-1
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d9\\x2dvm1.scope/libvirt/vcpu1/cpuset.mems
0-1
3.3 In the guest, consume the memory
# swapoff -a
# memhog 1200000KB
Also checked other scenarios, such as: with vcpupin, with emulatorpin, and changing the numatune setting while in restrictive mode.
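A hedged sketch of the kind of commands that exercise those extra scenarios; the domain name vm1 and the CPU/node sets are illustrative, not taken from this bug:
# virsh vcpupin vm1 0 0-1 --live
# virsh emulatorpin vm1 2-3 --live
# virsh numatune vm1 --nodeset 1 --live
# virsh numatune vm1
(the last command queries the current numa_mode/numa_nodeset so the change can be confirmed)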
Description of problem:
Specifying restrictive NUMA tuning mode per guest NUMA node doesn't work

Version-Release number of selected component (if applicable):
libvirt-9.0.0-10.el9_2.x86_64

How reproducible:
100%

Steps to Reproduce:

Scenario 1:
1.1 Define and start a guest with memory and NUMA tuning config as below:
<maxMemory slots='16' unit='KiB'>52428800</maxMemory>
<memory unit='KiB'>2072576</memory>
<currentMemory unit='KiB'>2072576</currentMemory>
...
<numatune>
<memnode cellid="0" mode="restrictive" nodeset="1"/>
</numatune>
...
<cpu>
...
<numa>
<cell id='0' cpus='0-1' memory='1048576' unit='KiB'/>
<cell id='1' cpus='2-3' memory='1024000' unit='KiB'/>
</numa>
</cpu>
1.2 Check guest NUMA cell 0 memory allocation on the host
# grep -B1 1048576 /proc/`pidof qemu-kvm`/smaps
7f8013e00000-7f8053e00000 rw-p 00000000 00:00 0
Size: 1048576 kB
# grep 7f8013e00000 /proc/`pidof qemu-kvm`/numa_maps
7f8013e00000 default anon=62464 dirty=62464 active=61952 N0=7680 N1=54784 kernelpagesize_kB=4
1.3 Check the cpuset.mems cgroup setting:
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d18\\x2dvm1.scope/libvirt/emulator/cpuset.mems

Actual results:
Guest NUMA cell 0 memory is allocated on both host NUMA node 0 and node 1, and is not restricted according to the NUMA tuning setting.

Scenario 2:
2.1 Define and start a guest with memory and NUMA tuning config as below:
<maxMemory slots='16' unit='KiB'>52428800</maxMemory>
<memory unit='KiB'>2072576</memory>
<currentMemory unit='KiB'>2072576</currentMemory>
...
<numatune>
<memory mode='restrictive' nodeset='1' />
<memnode cellid="0" mode="restrictive" nodeset="0"/>
</numatune>
...
<cpu>
...
<numa>
<cell id='0' cpus='0-1' memory='1048576' unit='KiB'/>
<cell id='1' cpus='2-3' memory='1024000' unit='KiB'/>
</numa>
</cpu>
2.2 Check guest NUMA cell 0 memory allocation on the host
# grep -B1 1048576 /proc/`pidof qemu-kvm`/smaps
7f60c3e00000-7f6103e00000 rw-p 00000000 00:00 0
Size: 1048576 kB
# grep 7f60c3e00000 /proc/`pidof qemu-kvm`/numa_maps
7f60c3e00000 default anon=261140 dirty=261140 active=74772 N0=249876 N1=11264 kernelpagesize_kB=4
2.3 Check guest NUMA cell 1 memory allocation on the host
# grep -B1 1024000 /proc/`pidof qemu-kvm`/smaps
7f6085400000-7f60c3c00000 rw-p 00000000 00:00 0
Size: 1024000 kB
# grep 7f6085400000 /proc/`pidof qemu-kvm`/numa_maps
7f6085400000 default anon=228352 dirty=228352 active=218624 N0=153600 N1=74752 kernelpagesize_kB=4
2.4 Check the cpuset.mems cgroup setting:
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d17\\x2dvm1.scope/libvirt/emulator/cpuset.mems
1

Actual results:
Guest NUMA cell 0 and cell 1 memory is allocated on both host NUMA node 0 and node 1, and is not restricted according to the NUMA tuning setting.

Scenario 3:
3.1 Define and start a guest with memory and NUMA tuning config as below:
<maxMemory slots='16' unit='KiB'>52428800</maxMemory>
<memory unit='KiB'>2072576</memory>
<currentMemory unit='KiB'>2072576</currentMemory>
...
<numatune>
<memory mode='restrictive' nodeset='0-1' />
<memnode cellid="0" mode="restrictive" nodeset="0"/>
</numatune>
...
<cpu>
...
<numa>
<cell id='0' cpus='0-1' memory='1048576' unit='KiB'/>
<cell id='1' cpus='2-3' memory='1024000' unit='KiB'/>
</numa>
</cpu>
3.2 Check guest NUMA cell 0 memory allocation on the host
# grep -B1 1048576 /proc/`pidof qemu-kvm`/smaps
7fef07e00000-7fef47e00000 rw-p 00000000 00:00 0
Size: 1048576 kB
# grep 7fef07e00000 /proc/`pidof qemu-kvm`/numa_maps
7fef07e00000 default anon=70670 dirty=70670 N0=3584 N1=67086 kernelpagesize_kB=4
3.3 Check the cpuset.mems cgroup setting:
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d21\\x2dvm1.scope/libvirt/emulator/cpuset.mems
0-1

Actual results:
Guest NUMA cell 0 memory is allocated on both host NUMA node 0 and node 1, and is not restricted according to the NUMA tuning setting.

Expected results:
Specifying restrictive NUMA tuning mode per guest NUMA node should restrict the memory allocation to the specified node(s).

Additional info:
From the above scenarios, no variant of per-guest-NUMA-node restrictive tuning works (and if it had to match the host-level NUMA tuning nodeset, the per-node setting would be pointless). If only the host-level restrictive mode (<memory mode='restrictive' nodeset='0-1' />) works on its own, there should be a check that forbids these combinations during virsh define.