Bug 2138150 - With different nodeset, strict host numa memory binding and guest specified numa memory binding make guest vm fail to start
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: libvirt
Version: 9.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Michal Privoznik
QA Contact: liang cong
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-10-27 12:07 UTC by liang cong
Modified: 2023-11-07 09:36 UTC
CC List: 5 users

Fixed In Version: libvirt-9.5.0-0rc1.1.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-11-07 08:30:47 UTC
Type: Bug
Target Upstream Version: 9.5.0
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-137726 0 None None None 2022-10-27 12:13:36 UTC
Red Hat Product Errata RHSA-2023:6409 0 None None None 2023-11-07 08:31:29 UTC

Description liang cong 2022-10-27 12:07:58 UTC
Description of problem:
With different nodesets, strict host NUMA memory binding combined with guest-specified NUMA memory binding makes the guest VM fail to start.

Version-Release number of selected component (if applicable):
# rpm -q libvirt qemu-kvm
libvirt-8.5.0-7.el9_1.x86_64
qemu-kvm-7.0.0-13.el9.x86_64


How reproducible:
100%

Steps to Reproduce:
1.There are 2 nodes on the host:
# numactl --hard
available: 2 nodes (0-1)
node 0 cpus: 0 1
node 0 size: 3671 MB
node 0 free: 1545 MB
node 1 cpus: 2 3
node 1 size: 4020 MB
node 1 free: 2210 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 

2.Define a guest vm with the following numatune and numa topology xml:
<numatune>
    <memory mode="strict" nodeset="0"/>
    <memnode cellid="0" mode="strict" nodeset="1"/>
</numatune>
...
<numa>
    <cell id="0" cpus="0" memory="1048576" unit="KiB"/>
    <cell id="1" cpus="1" memory="1048576" unit="KiB"/>
</numa>
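
These snippets are fragments of the domain XML; one way to apply them to the domain (named vm1 in the steps below) is:

# virsh edit vm1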

3.Start guest vm
# virsh start vm1
error: Failed to start domain 'vm1'
error: internal error: qemu unexpectedly closed the monitor: 2022-10-27T11:57:09.527171Z qemu-kvm: cannot bind memory to host NUMA nodes: Invalid argument


Actual results:
Guest vm fails to start

Expected results:
Guest could start successfully

Additional info:
1.numatune settings like the ones below have the same issue:
1.1 "interleave" mode guest specified numa memory binding
<numatune>
    <memory mode="strict" nodeset="0"/>
    <memnode cellid="0" mode="interleave" nodeset="1"/>
</numatune>
1.2 "preferred" mode guest specified numa memory binding
<numatune>
    <memory mode="strict" nodeset="0"/>
    <memnode cellid="0" mode="preferred" nodeset="1"/>
</numatune>



2.But with settings like the ones below, the guest vm can start up successfully:
2.1 "restrictive" mode guest specified numa memory binding
<numatune>
    <memory mode="strict" nodeset="0"/>
    <memnode cellid="0" mode="restrictive" nodeset="1"/>
</numatune>
2.2 "strict" host numa memory binding nodeset covers the guest specified numa memory binding nodeset
<numatune>
    <memory mode="strict" nodeset="0-1"/>
    <memnode cellid="0" mode="strict" nodeset="1"/>
</numatune>

Comment 1 Michal Privoznik 2022-10-27 12:59:06 UTC
So this was changed in the following commit:

https://gitlab.com/libvirt/libvirt/-/commit/f136b83139c

and reading through the reasoning in the commit message, I start to wonder whether we should just forbid this configuration. I mean, ideally, such a configuration means "allocate all memory on host node #0, except the memory for guest NUMA node 0, which should come from host node #1". But there are some caveats. The first one is that when QEMU is restricted (via CGroups) to allocate on node #0 and then tries to allocate memory from node #1 (because it is given -object memory-backend-ram,id=ram-node0,size=2147483648,host-nodes=1,policy=bind), it will inevitably see mbind() fail. One way around it is to compute the union of all the sets, let QEMU allocate its memory, and then restrict it back to the original <memory nodeset/>. But the referenced commit lists reasons why that did not work (e.g. QEMU might have locked memory which is then not movable).

Just for the record, the reason example 2.1 works is that mode="restrictive" means no host-nodes= is generated onto the cmd line and thus QEMU does not call mbind(); libvirt relies solely on CGroups to restrict QEMU. And example 2.2 is what actually makes perfect sense, and is what I suggest in the paragraph above should have been used instead.
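
As an illustrative aside (not part of the original report; the systemd-run property and the node numbers are assumptions), the same kernel behaviour can be reproduced without libvirt: restrict a shell to host node 0, then ask for a strict binding to node 1. Because the requested nodes have an empty intersection with the caller's cpuset.mems, the bind policy is refused, which is what QEMU surfaces as "cannot bind memory to host NUMA nodes: Invalid argument".

# systemd-run --scope -p AllowedMemoryNodes=0 bash
# numactl --membind=1 true
(the second command fails, since node 1 lies outside the scope's allowed memory nodes)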

Comment 2 Michal Privoznik 2023-05-23 10:08:09 UTC
Patches posted on the list:

https://listman.redhat.com/archives/libvir-list/2023-May/239964.html

Comment 3 Michal Privoznik 2023-05-23 15:23:03 UTC
Merged as:

e53291514c qemu_hotplug: Temporarily allow emulator thread to access other NUMA nodes during mem hotplug
3ec6d586bc qemu: Start emulator thread with more generous cpuset.mems
c4a7f8007c qemuProcessSetupPid: Use @numatune variable more

v9.3.0-116-ge53291514c

Comment 4 liang cong 2023-06-05 10:36:04 UTC
Tested on upstream build libvirt v9.4.0-12-gf26923fb2e

Test steps:
1.Define a guest vm with the following numatune and numa topology xml:
<numatune>
    <memory mode="strict" nodeset="0"/>
    <memnode cellid="0" mode="strict" nodeset="1"/>
</numatune>
...
<numa>
    <cell id="0" cpus="0" memory="1024000" unit="KiB"/>
    <cell id="1" cpus="1" memory="1048576" unit="KiB"/>
</numa>

2.Start guest vm
# virsh start vm1
Domain 'vm1' started

3.Get the qemu cmd line
# ps -ef | grep qemu
... -object {"qom-type":"memory-backend-ram","id":"ram-node0","size":1048576000,"host-nodes":[1],"policy":"bind"} -numa node,nodeid=0,cpus=0-1,memdev=ram-node0 -object {"qom-type":"memory-backend-ram","id":"ram-node1","size":1073741824,"host-nodes":[0],"policy":"bind"} -numa node,nodeid=1,cpus=2-3,memdev=ram-node1...

4.Check the guest numa memory allocation
# pidof qemu-system-x86_64
3454 3430

# grep -B1 1024000 /proc/3454/smaps
7fd8c9600000-7fd907e00000 rw-p 00000000 00:00 0 
Size:            1024000 kB

# grep 7fd8c9600000  /proc/3454/numa_maps
7fd8c9600000 bind:0 anon=82944 dirty=82944 active=0 N0=82944 kernelpagesize_kB=4

We can see that guest NUMA node 0 memory is bound to host node 0, which differs from what the qemu command line indicates and from the setting "<memnode cellid="0" mode="strict" nodeset="1"/>".
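
For repeated checks, a small helper along these lines (hypothetical; pid and cell size are parameters, and it only inspects the first mapping of that size) automates the smaps/numa_maps lookup above:

# pid=3454; size_kb=1024000
# addr=$(grep -B1 "Size: *${size_kb} kB" /proc/$pid/smaps | head -n1 | cut -d- -f1)
# grep "^$addr " /proc/$pid/numa_maps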

Hi Michal, could you help take a look? The current result confuses me, thanks.

Comment 5 Michal Privoznik 2023-06-05 10:53:14 UTC
(In reply to liang cong from comment #4)
> Testd on upstream build libvirt v9.4.0-12-gf26923fb2e
> 
> Test steps:
> 1.Define a guest vm with following numatune and numa topology xml:
> <numatune>
>     <memory mode="strict" nodeset="0"/>
>     <memnode cellid="0" mode="strict" nodeset="1"/>
> </numatune>
> ...
> <numa>
>     <cell id="0" cpus="0" memory="1024000" unit="KiB"/>
>     <cell id="1" cpus="1" memory="1048576" unit="KiB"/>

So here you have vCPU#0 assigned to guest NUMA node #0 and vCPU#1 to node #1 ... 

> </numa>
> 
> 2.Start guest vm
> # virsh start vm1
> Domain 'vm1' started
> 
> 3.Get the qemu cmd line
> # ps -ef | grep qemu
> ... -object
> {"qom-type":"memory-backend-ram","id":"ram-node0","size":1048576000,"host-
> nodes":[1],"policy":"bind"} -numa node,nodeid=0,cpus=0-1,memdev=ram-node0
> -object
> {"qom-type":"memory-backend-ram","id":"ram-node1","size":1073741824,"host-
> nodes":[0],"policy":"bind"} -numa node,nodeid=1,cpus=2-3,memdev=ram-node1...

... but here you have it differently.

> 
> 4.Check the guest numa memory allocation
> # pidof qemu-system-x86_64
> 3454 3430

And this also suggests you might be looking at a different QEMU process.

Comment 6 liang cong 2023-06-06 01:42:51 UTC
(In reply to Michal Privoznik from comment #5)
> (In reply to liang cong from comment #4)
> > Testd on upstream build libvirt v9.4.0-12-gf26923fb2e
> > 
> > Test steps:
> > 1.Define a guest vm with following numatune and numa topology xml:
> > <numatune>
> >     <memory mode="strict" nodeset="0"/>
> >     <memnode cellid="0" mode="strict" nodeset="1"/>
> > </numatune>
> > ...
> > <numa>
> >     <cell id="0" cpus="0" memory="1024000" unit="KiB"/>
> >     <cell id="1" cpus="1" memory="1048576" unit="KiB"/>
> 
> So here you have vCPU#0 assigned to guest NUMA node #0 and vCPU#1 to node #1

Yeah, the NUMA topology and NUMA tuning settings should be:
<numatune>
         <memory mode="strict" nodeset="0"/>
         <memnode cellid="0" mode="strict" nodeset="1"/>
  </numatune>

<numa>
  <cell id="0" cpus="0-1" memory="1024000" unit="KiB" />
  <cell id="1" cpus="2-3" memory="1048576" unit="KiB"/>
</numa>

guest cell 0 memory should be allocated to host numa node 1 
guest cell 1 memory should be allocated to host numa node 0 

> ... 
> 
> > </numa>
> > 
> > 2.Start guest vm
> > # virsh start vm1
> > Domain 'vm1' started
> > 
> > 3.Get the qemu cmd line
> > # ps -ef | grep qemu
> > ... -object
> > {"qom-type":"memory-backend-ram","id":"ram-node0","size":1048576000,"host-
> > nodes":[1],"policy":"bind"} -numa node,nodeid=0,cpus=0-1,memdev=ram-node0
> > -object
> > {"qom-type":"memory-backend-ram","id":"ram-node1","size":1073741824,"host-
> > nodes":[0],"policy":"bind"} -numa node,nodeid=1,cpus=2-3,memdev=ram-node1...
> 
> ... but here you have it differently.
> 
> > 

The qemu command line shows the same configuration as the domain XML setting.

> > 4.Check the guest numa memory allocation
> > # pidof qemu-system-x86_64
> > 3454 3430
> 
> And this also suggest you might be looking at a different QEMU process.

For the qemu process I get 2, as shown below:
# ps -ef | grep qemu
qemu        3430       1  0 06:21 ?        00:02:52 /usr/bin/qemu-system-x86_64 -name guest=vm1,debug-threads=on -S -object {"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-6-vm1/master-key.aes"} ....-object {"qom-type":"memory-backend-ram","id":"ram-node0","size":1048576000,"host-nodes":[1],"policy":"bind"} -numa node,nodeid=0,cpus=0-1,memdev=ram-node0 -object {"qom-type":"memory-backend-ram","id":"ram-node1","size":1073741824,"host-nodes":[0],"policy":"bind"} -numa node,nodeid=1,cpus=2-3,memdev=ram-node1...

qemu        3454    3430  0 06:21 ?        00:00:01 /usr/bin/qemu-system-x86_64 -name guest=vm1,debug-threads=on -S -object {"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-6-vm1/master-key.aes"}...-object {"qom-type":"memory-backend-ram","id":"ram-node0","size":1048576000,"host-nodes":[1],"policy":"bind"} -numa node,nodeid=0,cpus=0-1,memdev=ram-node0 -object {"qom-type":"memory-backend-ram","id":"ram-node1","size":1073741824,"host-nodes":[0],"policy":"bind"} -numa node,nodeid=1,cpus=2-3,memdev=ram-node1...

The NUMA memory parts of the two processes' command lines are the same.

And I tried both processes, 3430 and 3454 (they seem to share the same memory); they give the same result: the guest cell 0 memory is allocated on host NUMA node 0, which differs from what I defined and from what the qemu command line indicates.

# grep -B1 1024000 /proc/3430/smaps
7fd8c9600000-7fd907e00000 rw-p 00000000 00:00 0 
Size:            1024000 kB

# grep 7fd8c9600000  /proc/3430/numa_maps
7fd8c9600000 bind:0 anon=110080 dirty=110080 active=0 N0=110080 kernelpagesize_kB=4


# grep -B1 1024000 /proc/3454/smaps
7fd8c9600000-7fd907e00000 rw-p 00000000 00:00 0 
Size:            1024000 kB

# grep 7fd8c9600000  /proc/3454/numa_maps
7fd8c9600000 bind:0 anon=110080 dirty=110080 active=0 N0=110080 kernelpagesize_kB=4

Comment 7 Michal Privoznik 2023-06-06 11:35:36 UTC
Yeah, the problem here is that while libvirt starts QEMU with cpuset.mems set to 0-1, it then overwrites this to just 0, which causes all the memory to move to NUMA node #0. Let me see if I can fix it.
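
As a sketch of the mechanism (cgroup v2 behaviour; the scope path below is modelled on the verification comments and the vm/scope name is an assumption): shrinking cpuset.mems migrates pages that were already allocated outside the new set, so narrowing the emulator cgroup from 0-1 to 0 pulls the cell 0 memory, which QEMU had bound to host node 1, back onto node 0:

# cg=/sys/fs/cgroup/machine.slice/machine-qemu\\x2d1\\x2dvm1.scope/libvirt/emulator
# cat $cg/cpuset.mems
# echo 0 > $cg/cpuset.mems

After the write, regions that numa_maps previously reported as N1=... show up under N0, the same migration that produced the result in comment 6.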

Comment 9 Michal Privoznik 2023-06-07 14:47:28 UTC
Patches posted on the list:

https://listman.redhat.com/archives/libvir-list/2023-June/240222.html

Comment 10 Michal Privoznik 2023-06-08 07:52:25 UTC
Merged upstream as:

d09b73b560 (HEAD -> master, origin/master, origin/HEAD) qemu: Drop @unionMems argument from qemuProcessSetupPid()
83adba541a qemu: Allow more generous cpuset.mems for vCPUs and IOThreads
fddbb2f12f qemu: Don't try to 'fix up' cpuset.mems after QEMU's memory allocation

v9.4.0-52-gd09b73b560

Comment 11 liang cong 2023-06-21 06:15:48 UTC
Pre-verified on upstream build: v9.4.0-66-ga5bf2c4bf9

Test steps:
1.Define a guest vm with the following numatune and numa topology xml:
<iothreads>1</iothreads>
<numatune>
    <memory mode="strict" nodeset="0"/>
    <memnode cellid="0" mode="strict" nodeset="1"/>
</numatune>
...
<numa>
    <cell id="0" cpus="0" memory="1024000" unit="KiB"/>
    <cell id="1" cpus="1" memory="1048576" unit="KiB"/>
</numa>

2.Start guest vm
# virsh start vm1
Domain 'vm1' started

3.Get the qemu cmd line
# ps -ef | grep qemu
-object {"qom-type":"memory-backend-ram","id":"ram-node0","size":1048576000,"host-nodes":[1],"policy":"bind"} -numa node,nodeid=0,cpus=0,memdev=ram-node0 -object {"qom-type":"memory-backend-ram","id":"ram-node1","size":1073741824,"host-nodes":[0],"policy":"bind"} -numa node,nodeid=1,cpus=1,memdev=ram-node1...

4.Check the guest numa memory allocation
# pidof qemu-system-x86_64
40630

# grep -B1 1024000 /proc/40630/smaps
7f8f25600000-7f8f63e00000 rw-p 00000000 00:00 0 
Size:            1024000 kB

# grep 7f8f25600000  /proc/40630/numa_maps
7f8f25600000 bind:1 anon=108046 dirty=108046 active=0 N1=108046 kernelpagesize_kB=4

# grep -B1 1048576 /proc/40630/smaps
7f8ee5400000-7f8f25400000 rw-p 00000000 00:00 0 
Size:            1048576 kB

# grep 7f8ee5400000  /proc/40630/numa_maps
7f8ee5400000 bind:0 anon=116224 dirty=116224 active=0 N0=116224 kernelpagesize_kB=4

5. Check cgroup settings
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d2\\x2dvm1.scope/libvirt/iothread1/cpuset.mems
0-1

# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d2\\x2dvm1.scope/libvirt/emulator/cpuset.mems
0-1

# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d2\\x2dvm1.scope/libvirt/vcpu0/cpuset.mems
0-1

# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d2\\x2dvm1.scope/libvirt/vcpu1/cpuset.mems
0-1
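
The four checks above can also be done in one go; grep with multiple files prints each path next to its value:

# grep . /sys/fs/cgroup/machine.slice/machine-qemu\\x2d2\\x2dvm1.scope/libvirt/*/cpuset.mems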

Also tested with other modes such as interleave, preferred, and restrictive.

Comment 12 liang cong 2023-07-25 09:04:55 UTC
Verified on build:
# rpm -q libvirt qemu-kvm
libvirt-9.5.0-3.el9.x86_64
qemu-kvm-8.0.0-9.el9.x86_64

Test steps:
1.Define a guest vm with the following numatune and numa topology xml:
<iothreads>1</iothreads>
<numatune>
          <memory mode="strict" nodeset="1"/>
          <memnode cellid="0" mode="strict" nodeset="0"/>
  </numatune>
...
<numa>
  <cell id='0' cpus='0' memory='1048576' unit='KiB'/>
  <cell id='1' cpus='1' memory='1024000' unit='KiB'/>
</numa>


2.Start guest vm
# virsh start vm1
Domain 'vm1' started

3.Get the qemu cmd line
# ps -ef | grep qemu
...-object {"qom-type":"memory-backend-ram","id":"ram-node0","size":1073741824,"host-nodes":[0],"policy":"bind"} -numa node,nodeid=0,cpus=0,memdev=ram-node0 -object {"qom-type":"memory-backend-ram","id":"ram-node1","size":1048576000,"host-nodes":[1],"policy":"bind"} -numa node,nodeid=1,cpus=1,memdev=ram-node1...

4.Check the guest numa memory allocation
# pidof qemu-kvm
29472

# grep -B1 1048576 /proc/29472/smaps
7f50e3e00000-7f5123e00000 rw-p 00000000 00:00 0 
Size:            1048576 kB

# grep 7f50e3e00000  /proc/29472/numa_maps
7f50e3e00000 bind:0 anon=71182 dirty=71182 active=68622 N0=71182 kernelpagesize_kB=4

# grep -B1 1024000 /proc/29472/smaps
7f50a5400000-7f50e3c00000 rw-p 00000000 00:00 0 
Size:            1024000 kB

# grep 7f50a5400000  /proc/29472/numa_maps
7f50a5400000 bind:1 anon=146944 dirty=146944 active=123904 N1=146944 kernelpagesize_kB=4

5. Check cgroup settings
# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d1\\x2dvm1.scope/libvirt/iothread1/cpuset.mems
0-1

# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d1\\x2dvm1.scope/libvirt/emulator/cpuset.mems
0-1

# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d1\\x2dvm1.scope/libvirt/vcpu0/cpuset.mems
0-1

# cat /sys/fs/cgroup/machine.slice/machine-qemu\\x2d1\\x2dvm1.scope/libvirt/vcpu1/cpuset.mems
0-1

Also tested with other modes such as interleave, preferred, and restrictive.

Comment 15 liang cong 2023-08-07 03:21:51 UTC
Marking it verified per comment 12.

Comment 17 errata-xmlrpc 2023-11-07 08:30:47 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: libvirt security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6409

