Bug 2154750 - [numatune][cputune] qemu-kvm: Setting CPU affinity failed: Invalid argument
Summary: [numatune][cputune] qemu-kvm: Setting CPU affinity failed: Invalid argument
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: libvirt
Version: 9.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assignee: Michal Privoznik
QA Contact: Luyao Huang
URL:
Whiteboard:
Duplicates: 2157060 2167527 (view as bug list)
Depends On:
Blocks: 2185039
 
Reported: 2022-12-19 05:50 UTC by Yanghang Liu
Modified: 2023-11-07 09:38 UTC (History)
CC List: 23 users (show)

Fixed In Version: libvirt-9.2.0-1.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2185039 (view as bug list)
Environment:
Last Closed: 2023-11-07 08:30:47 UTC
Type: ---
Target Upstream Version: 9.2.0
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker LIBVIRTAT-14221 0 None None None 2023-05-22 12:55:57 UTC
Red Hat Issue Tracker RHELPLAN-142851 0 None None None 2022-12-19 06:14:39 UTC
Red Hat Product Errata RHSA-2023:6409 0 None None None 2023-11-07 08:31:29 UTC

Description Yanghang Liu 2022-12-19 05:50:04 UTC
Description of problem:
Failed to start a domain that has specific cputune and numatune XML

Version-Release number of selected component (if applicable):
host:
qemu-kvm-7.2.0-1.el9.x86_64
kernel-5.14.0-212.el9.x86_64
libvirt-8.10.0-2.el9.x86_64


How reproducible:
100%

Steps to Reproduce:
1. start a domain which has cputune and numatune xml

  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <cputune>
    <emulatorpin cpuset='1,3,5,7,9,11'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='0'/>
    <memnode cellid='0' mode='strict' nodeset='0'/>
  </numatune>
  <cpu mode='host-model' check='partial'>
    <numa>
      <cell id='0' cpus='0-3' memory='4194304' unit='KiB'/>
    </numa>
  </cpu>



Actual results:
The domain cannot be started and fails with the following error:
qemu-kvm: Setting CPU affinity failed: Invalid argument

Expected results:
The domain can be started successfully

Additional info:
(1) The same domain can be started successfully in qemu-kvm-7.1.0-6.el9.x86_64


(2) The domain can be started after removing either the numatune or the cputune element

(3) The full xml

<domain type='kvm'>
  <name>rhel9.2</name>
  <uuid>2ec35b42-bb52-11ed-b9a7-20040fec000c</uuid>
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <memoryBacking>
    <hugepages>
      <page size='1048576' unit='KiB'/>
    </hugepages>
    <locked/>
  </memoryBacking>
  <vcpu placement='static'>6</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='31'/>
    <vcpupin vcpu='1' cpuset='29'/>
    <vcpupin vcpu='2' cpuset='30'/>
    <vcpupin vcpu='3' cpuset='28'/>
    <vcpupin vcpu='4' cpuset='26'/>
    <vcpupin vcpu='5' cpuset='24'/>
    <emulatorpin cpuset='1,3,5,7,9,11'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='0'/>
    <memnode cellid='0' mode='strict' nodeset='0'/>
  </numatune>
  <os>
    <type arch='x86_64' machine='pc-q35-rhel9.2.0'>hvm</type>
    <loader readonly='yes' secure='yes' type='pflash'>/usr/share/edk2/ovmf/OVMF_CODE.secboot.fd</loader>
    <nvram template='/usr/share/edk2/ovmf/OVMF_VARS.fd'>/var/lib/libvirt/qemu/nvram/rhel9.2_VARS.fd</nvram>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <pmu state='off'/>
    <vmport state='off'/>
    <smm state='on'/>
    <ioapic driver='qemu'/>
  </features>
  <cpu mode='host-model' check='partial'>
    <topology sockets='3' dies='1' cores='1' threads='2'/>
    <feature policy='require' name='tsc-deadline'/>
    <numa>
      <cell id='0' cpus='0-5' memory='8388608' unit='KiB' memAccess='shared'/>
    </numa>
  </cpu>
  <clock offset='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none' io='threads' iommu='on' ats='on'/>
      <source file='/mnt/nfv//rhel9.2.qcow2'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='none'/>
    <controller type='pci' index='0' model='pcie-root'/>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x10'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0x11'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0x12'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x2'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0x13'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0x14'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x4'/>
    </controller>
    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='6' port='0x15'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x5'/>
    </controller>
    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0x16'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x6'/>
    </controller>
    <controller type='pci' index='8' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='8' port='0x17'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x7'/>
    </controller>
    <controller type='sata' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <interface type='bridge'>
      <mac address='28:66:da:5f:dd:01'/>
      <source bridge='switch'/>
      <model type='virtio'/>
      <driver name='vhost' iommu='on' ats='on'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </interface>
    <interface type='vhostuser'>
      <mac address='18:66:da:5f:dd:02'/>
      <source type='unix' path='/tmp/vhostuser0.sock' mode='server'/>
      <model type='virtio'/>
      <driver name='vhost' queues='2' rx_queue_size='1024' iommu='on' ats='on' packed='on'/>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </interface>
    <interface type='vhostuser'>
      <mac address='18:66:da:5f:dd:03'/>
      <source type='unix' path='/tmp/vhostuser1.sock' mode='server'/>
      <model type='virtio'/>
      <driver name='vhost' queues='2' rx_queue_size='1024' iommu='on' ats='on' packed='on'/>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
    </interface>
    <interface type='vhostuser'>
      <mac address='18:66:da:5f:dd:04'/>
      <source type='unix' path='/tmp/vhostuser2.sock' mode='server'/>
      <model type='virtio'/>
      <driver name='vhost' rx_queue_size='1024' iommu='on' ats='on' packed='on'/>
      <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <tpm model='tpm-crb'>
      <backend type='emulator' version='2.0'/>
    </tpm>
    <audio id='1' type='none'/>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
      <driver iommu='on' ats='on'/>
    </memballoon>
    <iommu model='intel'>
      <driver intremap='on' caching_mode='on' iotlb='on'/>
    </iommu>
  </devices>
</domain>

Comment 1 John Ferlan 2023-01-03 12:47:34 UTC
David - I see from git history you made some QEMU related changes in this area, so I'll start with you.

Comment 2 David Hildenbrand 2023-01-03 14:40:24 UTC
libvirt seems to configure a thread-context object for the VM with a problematic CPU list. The list isn't empty, because otherwise we'd get a different error.

pthread_setaffinity_np() ends up returning EINVAL, which, according to the man page, happens if either

(1) The affinity bit mask contains no processors that are currently physically on the system and permitted to the thread according to any restrictions that may be imposed by the "cpuset" mechanism described in cpuset(7), or

(2) cpuset specified a CPU that was outside the set supported by the kernel. (The kernel configuration option CONFIG_NR_CPUS defines the range of the set supported by the kernel data type used to represent CPU sets.)

I cannot spot from the XML snippet why the thread-context is created at all -- are we preallocating memory (hugetlb?)?

@Yanghang, please provide the full XML and ideally, the generated QEMU cmdline.

@Michal, any idea which CPU list is generated here and why?

Comment 3 David Hildenbrand 2023-01-03 15:38:01 UTC
I am able to reproduce. We add a prealloc context even though we don't do any preallocation:

-object '{"qom-type":"thread-context","id":"tc-ram-node0","node-affinity":[0]}' \
-object '{"qom-type":"memory-backend-ram","id":"ram-node0","size":4294967296,"host-nodes":[0],"policy":"bind","prealloc-context":"tc-ram-node0"}' \

This seems like it could be avoided.


But the underlying issue seems to be that libvirt restricts the cpuset of QEMU to the list of CPUs in emulatorpin even before QEMU has started. Preallocation will be limited to the emulator threads in that case.

The crash reveals that the intersection of the CPUs in "emulatorpin" and the CPUs specified by "node-affinity" (of the thread-context) is empty. Because then, there are no CPUs to run on and pthread_setaffinity_np() will bail out, just the way it should.

@Michal, we'd have to move the emulator threads to the restricted cpuset after starting QEMU I guess (I thought this used to be the case). Alternatively, we'd have to assign the thread-context affinity manually after starting QEMU (having it paused) and then triggering preallocation.

Comment 4 David Hildenbrand 2023-01-03 15:40:28 UTC
There is not much QEMU can do, unfortunately. Setting component to libvirt.

Comment 5 David Hildenbrand 2023-01-03 15:42:29 UTC
No further info needed at this point

Comment 6 Michal Privoznik 2023-01-05 11:54:52 UTC
Patch posted on the list:

https://listman.redhat.com/archives/libvir-list/2023-January/236658.html

Comment 7 David Hildenbrand 2023-01-05 12:25:33 UTC
(In reply to Michal Privoznik from comment #6)
> Patch posted on the list:
> 
> https://listman.redhat.com/archives/libvir-list/2023-January/236658.html

IIUC, this fixes this BZ, but enabling preallocation in that setup would still fail, no? Restricting preallocation to the emulator threads is the main problem in that regard.

Comment 8 yalzhang@redhat.com 2023-01-06 06:10:45 UTC
*** Bug 2157060 has been marked as a duplicate of this bug. ***

Comment 9 Michal Privoznik 2023-01-13 07:55:31 UTC
Merged upstream as:

commit 8ff8fe3f8a7bb67a150c7f4801c2df5ef743aa8f
Author:     Michal Prívozník <mprivozn>
AuthorDate: Thu Jan 5 09:51:07 2023 +0100
Commit:     Michal Prívozník <mprivozn>
CommitDate: Fri Jan 13 08:43:30 2023 +0100

    qemuBuildThreadContextProps: Generate ThreadContext less frequently
    
    Currently, the ThreadContext object is generated whenever we see
    .host-nodes attribute for a memory-backend-* object. The idea was
    that when the backend is pinned to a specific set of host NUMA
    nodes, then the allocation could be happening on CPUs from those
    nodes too. But this may not be always possible.
    
    Users might configure their guests in such way that vCPUs and
    corresponding guest NUMA nodes are on different host NUMA nodes
    than emulator thread. In this case, ThreadContext won't work,
    because ThreadContext objects live in context of the emulator
    thread (vCPU threads are moved around by us later, when emulator
    thread finished its setup and spawned vCPU threads - see
    qemuProcessSetupVcpus()). Therefore, memory allocation is done by
    emulator thread which is pinned to a subset of host NUMA nodes,
    but tries to create a ThreadContext object with a disjoint subset
    of host NUMA nodes, which fails.
    
    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2154750
    Signed-off-by: Michal Privoznik <mprivozn>
    Reviewed-by: Ján Tomko <jtomko>

v9.0.0-rc1-17-g8ff8fe3f8a

Comment 10 Michal Privoznik 2023-01-13 08:04:24 UTC
(In reply to David Hildenbrand from comment #7)
> (In reply to Michal Privoznik from comment #6)
> > Patch posted on the list:
> > 
> > https://listman.redhat.com/archives/libvir-list/2023-January/236658.html
> 
> IIUC, this fixes this BZ, but enabling preallcoation would in that setup
> would fail, no? Restricting preallocation to emulator threads is the main
> problem in that regard.

Indeed. That's result of the following commit:

https://gitlab.com/libvirt/libvirt/-/commit/0eaa4716e1b8f6eb59d77049aed3735c3b5fbdd6

The reasoning in that commit message is simple: QEMU is started with the cpuset.mems set so that if there's not enough memory on desired NUMA nodes, startup fails. Previously, libvirt would start QEMU without cpuset.mems set and only after initial monitor communications (e.g. vCPU/IOthread TIDs detection) the emulator thread would be moved to desired NUMA nodes using cpuset.mems. Looks like we have two competing approaches here. Let me see if I can come up with something clever. Meanwhile, this particular problem is fixed. But I agree that we need a better solution, so I'll keep this bug open.

Comment 11 Luyao Huang 2023-02-16 06:15:45 UTC
Reproduce this bug on libvirt-8.10.0-2.el9:

0. prepare a host with more than one NUMA node

# numactl --hard
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 31503 MB
node 0 free: 29115 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 1 size: 32244 MB
node 1 free: 24570 MB
node distances:
node   0   1 
  0:  10  32 
  1:  32  10 

1. prepare a guest which has emulatorpin + numatune

# virsh dumpxml vm1

  <cputune>
    <emulatorpin cpuset='16-20'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='0'/>
    <memnode cellid='0' mode='strict' nodeset='0'/>
  </numatune>

2. start the guest

# virsh start vm1
error: Failed to start domain 'vm1'
error: internal error: process exited while connecting to monitor: 2023-02-16T04:24:21.584878Z qemu-kvm: Setting CPU affinity failed: Invalid argument

Comment 12 Michal Privoznik 2023-02-21 09:52:44 UTC
*** Bug 2167527 has been marked as a duplicate of this bug. ***

Comment 19 Yanghang Liu 2023-03-08 08:35:07 UTC
Hi Michal,


My test results show this issue still existed in the following environment:
  qemu-kvm-7.2.0-10.el9.x86_64
  libvirt-9.0.0-7.el9.x86_64
  edk2-ovmf-20221207gitfff6d81270b5-7.el9.noarch


Test result:
# virsh start 2154750
error: Failed to start domain '2154750'
error: internal error: process exited while connecting to monitor: 2023-03-08T08:29:19.229506Z qemu-kvm: Setting CPU affinity failed: Invalid argument

Comment 20 Michal Privoznik 2023-03-08 11:18:25 UTC
Patches posted on the list:

https://listman.redhat.com/archives/libvir-list/2023-March/238536.html

Comment 25 Michal Privoznik 2023-03-15 12:17:22 UTC
Merged upstream as:

902ab2a29b NEWS: Document recent thread-context bug fix
c4b176567b docs: Document memory allocation and emulator pinning limitation
df2ef2e706 qemuBuildThreadContextProps: Prune .node-affinity wrt <emulatorpin/>
45222a83b7 qemu: Add @nodemask argument to qemuBuildThreadContextProps()
9f26f6cc4b qemu: Add @nodemaskRet argument to qemuBuildMemoryBackendProps()
450d932cd9 qemuBuildMemoryBackendProps: Join two conditions
7feed1613d qemu: Fix qemuDomainGetEmulatorPinInfo()
b4ccb0dc41 qemu: Move cpuset preference evaluation into a separate function
95ae91fdd4 qemuxml2argvmock: Drop virNuma* mocks
c4c90063a5 qemuxml2argvdata: Extend vCPUs placement in memory-hotplug-dimm-addr.xml
d91ca262fb qemuxml2argvdata: Adjust maximum NUMA node used
28ec9d86b3 qemuxml2argvtest: Use virnuma mock
213b6822a8 virnumamock: Introduce virNumaGetNodeOfCPU() mock
b6cfd348e9 virnuma: Introduce virNumaCPUSetToNodeset()
01e5111c3c virnuma: Move virNumaNodesetToCPUset() out of WITH_NUMACTL

v9.1.0-221-g902ab2a29b

Comment 35 Luyao Huang 2023-05-18 07:44:41 UTC
Verified this bug using the same steps as in bug 2185039 comment 6, on libvirt-9.3.0-2.el9.x86_64.

Comment 37 errata-xmlrpc 2023-11-07 08:30:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: libvirt security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6409

