Bug 2138150
| Summary: | With different nodeset, strict host numa memory binding and guest specified numa memory binding make guest vm fail to start | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | liang cong <lcong> |
| Component: | libvirt | Assignee: | Michal Privoznik <mprivozn> |
| libvirt sub component: | General | QA Contact: | liang cong <lcong> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | jdenemar, lmen, mprivozn, virt-maint, xuzhang |
| Version: | 9.1 | Keywords: | AutomationTriaged, Triaged, Upstream |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | libvirt-9.5.0-0rc1.1.el9 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-11-07 08:30:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | 9.5.0 |
| Embargoed: | | | |
Description (liang cong, 2022-10-27 12:07:58 UTC)
So this was changed in the following commit: https://gitlab.com/libvirt/libvirt/-/commit/f136b83139c and, reading through the reasoning in the commit message, I start to wonder whether we should just forbid this configuration. Ideally, such a configuration means "allocate all memory on host node #0, except the memory for guest NUMA node 0, which should come from host node #1". But there are some caveats. The first one is that when QEMU is restricted (via CGroups) to allocate on node #0 and then tries to allocate memory from node #1 (because it is given -object memory-backend-ram,id=ram-node0,size=2147483648,host-nodes=1,policy=bind), it will inevitably see a failing mbind(). One way around this is to compute the union of all the nodesets, let QEMU allocate its memory, and then restrict it back to the original <memory nodeset/>. But the referenced commit lists reasons why that did not work (e.g. QEMU might have locked memory, which is then not movable).

Just for the record, the reason example 2.1 works is that mode="restrictive" means no host-nodes= is generated onto the command line, so QEMU does not call mbind(); libvirt relies solely on CGroups to restrict QEMU. And example 2.2 is actually what makes perfect sense and what I suggest in the paragraph above should have been used instead.
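For illustration only, here is a minimal sketch of the restrictive-mode variant described above. The exact XML of "example 2.1" is not quoted in this thread, so the nodesets and domain name below are assumptions; the point is that with mode="restrictive" libvirt emits no host-nodes=/policy=bind properties on the memory backends, QEMU never calls mbind(), and placement is enforced purely through cpuset CGroups.

```sh
# Hypothetical restrictive-mode numatune (illustrative nodesets, not the
# reporter's exact "example 2.1"); add it to the domain with `virsh edit vm1`:
#
#   <numatune>
#     <memory mode="restrictive" nodeset="0"/>
#     <memnode cellid="0" mode="restrictive" nodeset="1"/>
#   </numatune>

virsh start vm1

# The memory-backend objects should carry no NUMA binding policy, i.e. no
# "host-nodes"/"policy":"bind" keys, so QEMU performs no mbind() itself
# (use -C qemu-kvm instead on RHEL builds):
ps -o args= -C qemu-system-x86_64 | grep -o '"qom-type":"memory-backend-ram"[^}]*}'
```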
Patches posted on the list: https://listman.redhat.com/archives/libvir-list/2023-May/239964.html

Merged as:

e53291514c qemu_hotplug: Temporarily allow emulator thread to access other NUMA nodes during mem hotplug
3ec6d586bc qemu: Start emulator thread with more generous cpuset.mems
c4a7f8007c qemuProcessSetupPid: Use @numatune variable more

v9.3.0-116-ge53291514c

Tested on upstream build libvirt v9.4.0-12-gf26923fb2e

Test steps:

1. Define a guest vm with the following numatune and numa topology xml:

<numatune>
  <memory mode="strict" nodeset="0"/>
  <memnode cellid="0" mode="strict" nodeset="1"/>
</numatune>
...
<numa>
  <cell id="0" cpus="0" memory="1024000" unit="KiB"/>
  <cell id="1" cpus="1" memory="1048576" unit="KiB"/>
</numa>

2. Start the guest vm
# virsh start vm1
Domain 'vm1' started

3. Get the qemu cmd line
# ps -ef | grep qemu
... -object {"qom-type":"memory-backend-ram","id":"ram-node0","size":1048576000,"host-nodes":[1],"policy":"bind"} -numa node,nodeid=0,cpus=0-1,memdev=ram-node0 -object {"qom-type":"memory-backend-ram","id":"ram-node1","size":1073741824,"host-nodes":[0],"policy":"bind"} -numa node,nodeid=1,cpus=2-3,memdev=ram-node1...

4. Check the guest numa memory allocation
# pidof qemu-system-x86_64
3454 3430
# grep -B1 1024000 /proc/3454/smaps
7fd8c9600000-7fd907e00000 rw-p 00000000 00:00 0
Size: 1024000 kB
# grep 7fd8c9600000 /proc/3454/numa_maps
7fd8c9600000 bind:0 anon=82944 dirty=82944 active=0 N0=82944 kernelpagesize_kB=4

We can see that guest NUMA node 0 memory is bound to host node 0, which differs from what the qemu cmd line and the setting <memnode cellid="0" mode="strict" nodeset="1"/> indicate.

Hi Michal, could you help to have a check? The current result makes me confused, thx.

(In reply to liang cong from comment #4)
> 1. Define a guest vm with the following numatune and numa topology xml:
> <numa>
>   <cell id="0" cpus="0" memory="1024000" unit="KiB"/>
>   <cell id="1" cpus="1" memory="1048576" unit="KiB"/>
> </numa>

So here you have vCPU#0 assigned to guest NUMA node #0 and vCPU#1 to node #1 ...

> 3. Get the qemu cmd line
> ... -numa node,nodeid=0,cpus=0-1,memdev=ram-node0 ... -numa node,nodeid=1,cpus=2-3,memdev=ram-node1...

... but here you have it differently.

> 4. Check the guest numa memory allocation
> # pidof qemu-system-x86_64
> 3454 3430

And this also suggests you might be looking at a different QEMU process.

(In reply to Michal Privoznik from comment #5)
> So here you have vCPU#0 assigned to guest NUMA node #0 and vCPU#1 to node #1 ...

Yeah, the numa topology and numa tuning setting should be:

<numatune>
  <memory mode="strict" nodeset="0"/>
  <memnode cellid="0" mode="strict" nodeset="1"/>
</numatune>
<numa>
  <cell id="0" cpus="0-1" memory="1024000" unit="KiB"/>
  <cell id="1" cpus="2-3" memory="1048576" unit="KiB"/>
</numa>

Guest cell 0 memory should be allocated on host NUMA node 1, and guest cell 1 memory should be allocated on host NUMA node 0.

> ... but here you have it differently.

The qemu cmd line shows the same thing as the domain xml setting.

> And this also suggests you might be looking at a different QEMU process.

For the qemu process I get 2, just as below:

# ps -ef | grep qemu
qemu 3430    1  0 06:21 ? 00:02:52 /usr/bin/qemu-system-x86_64 -name guest=vm1,debug-threads=on -S -object {"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-6-vm1/master-key.aes"} ... -object {"qom-type":"memory-backend-ram","id":"ram-node0","size":1048576000,"host-nodes":[1],"policy":"bind"} -numa node,nodeid=0,cpus=0-1,memdev=ram-node0 -object {"qom-type":"memory-backend-ram","id":"ram-node1","size":1073741824,"host-nodes":[0],"policy":"bind"} -numa node,nodeid=1,cpus=2-3,memdev=ram-node1...
qemu 3454 3430  0 06:21 ? 00:00:01 /usr/bin/qemu-system-x86_64 -name guest=vm1,debug-threads=on -S -object {"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-6-vm1/master-key.aes"} ... -object {"qom-type":"memory-backend-ram","id":"ram-node0","size":1048576000,"host-nodes":[1],"policy":"bind"} -numa node,nodeid=0,cpus=0-1,memdev=ram-node0 -object {"qom-type":"memory-backend-ram","id":"ram-node1","size":1073741824,"host-nodes":[0],"policy":"bind"} -numa node,nodeid=1,cpus=2-3,memdev=ram-node1...

The NUMA memory parts of the two processes' cmd lines are the same. I tried both processes, 3430 and 3454 (they seem to share the same memory), and they give the same result: guest cell 0 memory is allocated on host NUMA node 0, which differs from what I defined and from what the qemu cmd line indicates.

# grep -B1 1024000 /proc/3430/smaps
7fd8c9600000-7fd907e00000 rw-p 00000000 00:00 0
Size: 1024000 kB
# grep 7fd8c9600000 /proc/3430/numa_maps
7fd8c9600000 bind:0 anon=110080 dirty=110080 active=0 N0=110080 kernelpagesize_kB=4

# grep -B1 1024000 /proc/3454/smaps
7fd8c9600000-7fd907e00000 rw-p 00000000 00:00 0
Size: 1024000 kB
# grep 7fd8c9600000 /proc/3454/numa_maps
7fd8c9600000 bind:0 anon=110080 dirty=110080 active=0 N0=110080 kernelpagesize_kB=4
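The smaps/numa_maps lookup above (find the guest RAM mapping by its size, then read its NUMA policy and per-node page counts) is repeated several times in this report. A small helper along these lines can do it in one step; this is only a convenience sketch, and the script name and arguments are made up here, not part of the reported procedure.

```sh
#!/bin/bash
# Hypothetical helper, not part of the reported test steps.
# Usage: ./numa-bind-of-region.sh <qemu-pid> <region-size-in-kB>
# Finds the first mapping of the given size in /proc/<pid>/smaps and prints
# its NUMA policy (e.g. bind:1) and N<node>=<pages> counts from
# /proc/<pid>/numa_maps, the same check done manually with grep -B1 above.
pid=$1
size_kb=$2

# First address-range header line whose following "Size:" line matches.
addr=$(grep -B1 "^Size: *${size_kb} kB$" "/proc/${pid}/smaps" | head -n1 | cut -d- -f1)

if [ -z "$addr" ]; then
    echo "no ${size_kb} kB mapping found in PID ${pid}" >&2
    exit 1
fi

grep "^${addr} " "/proc/${pid}/numa_maps"
```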
Yeah, the problem here is that while libvirt starts QEMU with cpuset.mems set to 0-1, it then overwrites this to just 0, which causes all the memory to move to NUMA node #0. Let me see if I can fix it.

Patches posted on the list: https://listman.redhat.com/archives/libvir-list/2023-June/240222.html

Merged upstream as:

d09b73b560 (HEAD -> master, origin/master, origin/HEAD) qemu: Drop @unionMems argument from qemuProcessSetupPid()
83adba541a qemu: Allow more generous cpuset.mems for vCPUs and IOThreads
fddbb2f12f qemu: Don't try to 'fix up' cpuset.mems after QEMU's memory allocation

v9.4.0-52-gd09b73b560
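In effect, with these patches the emulator, vCPU and IOThread cgroups keep the more generous union of the configured nodesets (here <memory nodeset="0"/> plus <memnode nodeset="1"/>, i.e. 0-1) instead of being narrowed back after QEMU has allocated its memory, while per-node placement is left to the host-nodes/policy=bind properties on the memory backends. A quick way to eyeball this on a running domain, assuming cgroup v2 and the machine-qemu\x2d<id>\x2d<name>.scope layout seen in the verification steps below (adjust the scope name to your domain id and name):

```sh
# Print cpuset.mems for the emulator thread, every vCPU and every IOThread of
# the domain; with the fix applied each should show the union nodeset (0-1 for
# the configuration in this bug) rather than a single narrowed-down node.
scope='/sys/fs/cgroup/machine.slice/machine-qemu\x2d2\x2dvm1.scope/libvirt'
for f in "$scope"/emulator/cpuset.mems \
         "$scope"/vcpu*/cpuset.mems \
         "$scope"/iothread*/cpuset.mems; do
    printf '%s: %s\n' "$f" "$(cat "$f")"
done
```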
Pre-verified on upstream build: v9.4.0-66-ga5bf2c4bf9

Test steps:

1. Define a guest vm with the following numatune and numa topology xml:

<iothreads>1</iothreads>
<numatune>
  <memory mode="strict" nodeset="0"/>
  <memnode cellid="0" mode="strict" nodeset="1"/>
</numatune>
...
<numa>
  <cell id="0" cpus="0" memory="1024000" unit="KiB"/>
  <cell id="1" cpus="1" memory="1048576" unit="KiB"/>
</numa>

2. Start the guest vm
# virsh start vm1
Domain 'vm1' started

3. Get the qemu cmd line
# ps -ef | grep qemu
... -object {"qom-type":"memory-backend-ram","id":"ram-node0","size":1048576000,"host-nodes":[1],"policy":"bind"} -numa node,nodeid=0,cpus=0,memdev=ram-node0 -object {"qom-type":"memory-backend-ram","id":"ram-node1","size":1073741824,"host-nodes":[0],"policy":"bind"} -numa node,nodeid=1,cpus=1,memdev=ram-node1...

4. Check the guest numa memory allocation
# pidof qemu-system-x86_64
40630
# grep -B1 1024000 /proc/40630/smaps
7f8f25600000-7f8f63e00000 rw-p 00000000 00:00 0
Size: 1024000 kB
# grep 7f8f25600000 /proc/40630/numa_maps
7f8f25600000 bind:1 anon=108046 dirty=108046 active=0 N1=108046 kernelpagesize_kB=4
# grep -B1 1048576 /proc/40630/smaps
7f8ee5400000-7f8f25400000 rw-p 00000000 00:00 0
Size: 1048576 kB
# grep 7f8ee5400000 /proc/40630/numa_maps
7f8ee5400000 bind:0 anon=116224 dirty=116224 active=0 N0=116224 kernelpagesize_kB=4

5. Check cgroup settings
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d2\\x2dvm1.scope/libvirt/iothread1/cpuset.mems
0-1
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d2\\x2dvm1.scope/libvirt/emulator/cpuset.mems
0-1
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d2\\x2dvm1.scope/libvirt/vcpu0/cpuset.mems
0-1
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d2\\x2dvm1.scope/libvirt/vcpu1/cpuset.mems
0-1

Also tested with other modes such as interleave, preferred, restrictive.

Verified on build:
# rpm -q libvirt qemu-kvm
libvirt-9.5.0-3.el9.x86_64
qemu-kvm-8.0.0-9.el9.x86_64

Test steps:

1. Define a guest vm with the following numatune and numa topology xml:

<iothreads>1</iothreads>
<numatune>
  <memory mode="strict" nodeset="1"/>
  <memnode cellid="0" mode="strict" nodeset="0"/>
</numatune>
...
<numa>
  <cell id='0' cpus='0' memory='1048576' unit='KiB'/>
  <cell id='1' cpus='1' memory='1024000' unit='KiB'/>
</numa>

2. Start the guest vm
# virsh start vm1
Domain 'vm1' started

3. Get the qemu cmd line
# ps -ef | grep qemu
... -object {"qom-type":"memory-backend-ram","id":"ram-node0","size":1073741824,"host-nodes":[0],"policy":"bind"} -numa node,nodeid=0,cpus=0,memdev=ram-node0 -object {"qom-type":"memory-backend-ram","id":"ram-node1","size":1048576000,"host-nodes":[1],"policy":"bind"} -numa node,nodeid=1,cpus=1,memdev=ram-node1...

4. Check the guest numa memory allocation
# pidof qemu-kvm
29472
# grep -B1 1048576 /proc/29472/smaps
7f50e3e00000-7f5123e00000 rw-p 00000000 00:00 0
Size: 1048576 kB
# grep 7f50e3e00000 /proc/29472/numa_maps
7f50e3e00000 bind:0 anon=71182 dirty=71182 active=68622 N0=71182 kernelpagesize_kB=4
# grep -B1 1024000 /proc/29472/smaps
7f50a5400000-7f50e3c00000 rw-p 00000000 00:00 0
Size: 1024000 kB
# grep 7f50a5400000 /proc/29472/numa_maps
7f50a5400000 bind:1 anon=146944 dirty=146944 active=123904 N1=146944 kernelpagesize_kB=4

5. Check cgroup settings
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d1\\x2dvm1.scope/libvirt/iothread1/cpuset.mems
0-1
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d1\\x2dvm1.scope/libvirt/emulator/cpuset.mems
0-1
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d1\\x2dvm1.scope/libvirt/vcpu0/cpuset.mems
0-1
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d1\\x2dvm1.scope/libvirt/vcpu1/cpuset.mems
0-1

Also tested with other modes such as interleave, preferred, restrictive.

Marking it verified per comment 12.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: libvirt security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6409