Bug 1912967
| Summary: | Unexpected Threads per core on guest for VM when setting NUMA pinning | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-engine | Reporter: | Polina <pagranat> |
| Component: | BLL.Virt | Assignee: | Liran Rotenberg <lrotenbe> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Polina <pagranat> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | CC: | ahadas, bugs, lrotenbe |
| Version: | 4.4.4.5 | Flags: | pm-rhel: ovirt-4.5?, ahadas: planning_ack?, ahadas: devel_ack+, pm-rhel: testing_ack+ |
| Target Milestone: | ovirt-4.5.0 | Target Release: | 4.5.0 |
| Hardware: | x86_64 | OS: | Linux |
| Whiteboard: | | | |
| Fixed In Version: | ovirt-engine-4.5.0 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-04-20 06:33:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
**Description** Polina 2021-01-05 17:16:29 UTC
The problem is not with auto pinning policy. It's with the way we automatically set the cpusets for NUMA nodes.
I will pull the HW details out of the internal pastebin:
Host lscpu:

```
[root@ocelot05 ~]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           1
NUMA node(s):        4
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               1
Model name:          AMD EPYC 7451 24-Core Processor
Stepping:            2
CPU MHz:             2859.092
CPU max MHz:         2300.0000
CPU min MHz:         1200.0000
BogoMIPS:            4599.39
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-5,24-29
NUMA node1 CPU(s):   6-11,30-35
NUMA node2 CPU(s):   12-17,36-41
NUMA node3 CPU(s):   18-23,42-47
```
VM lscpu:

```
[root@vm-30-110 ~]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              46
On-line CPU(s) list: 0-45
Thread(s) per core:  1
Core(s) per socket:  24
Socket(s):           1
NUMA node(s):        4
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               1
Model name:          AMD EPYC Processor
Stepping:            2
CPU MHz:             2299.994
BogoMIPS:            4599.98
Virtualization:      AMD-V
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-11
NUMA node1 CPU(s):   12-23
NUMA node2 CPU(s):   24-34
NUMA node3 CPU(s):   35-45
```
The VM domxml was created with the NUMA configuration below; Eduardo Habkost's investigation led to:
```xml
<numa>
  <cell id='0' cpus='0-11,48,52,56,60,64,68,72,76,80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140,144,148,152,156,160,164,168,172,176,180' memory='262144' unit='KiB'/>
  <cell id='1' cpus='12-23,49,53,57,61,65,69,73,77,81,85,89,93,97,101,105,109,113,117,121,125,129,133,137,141,145,149,153,157,161,165,169,173,177,181' memory='262144' unit='KiB'/>
  <cell id='2' cpus='24-34,46,50,54,58,62,66,70,74,78,82,86,90,94,98,102,106,110,114,118,122,126,130,134,138,142,146,150,154,158,162,166,170,174,178,182' memory='262144' unit='KiB'/>
  <cell id='3' cpus='35-45,47,51,55,59,63,67,71,75,79,83,87,91,95,99,103,107,111,115,119,123,127,131,135,139,143,147,151,155,159,163,167,171,175,179,183' memory='262144' unit='KiB'/>
</numa>
```
vCPU 34 is in node 2, but vCPU 35 is in node 3, which confuses lscpu.

As far as I can see, this configuration is generated by RHV. It looks like the only input in the RHV UI is "number of NUMA nodes", and in this VM the number of cores was not a multiple of the number of NUMA nodes. It's up to the RHV UI designers to decide what to do in this case: it could prevent such a configuration, emit a warning, or split the vCPUs between the NUMA nodes correctly.

I suggest also making sure all cores in a socket stay in the same NUMA node, to avoid similar surprises in the guest's interpretation of the socket/core topology.
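To make the failure mode concrete, here is a minimal sketch (illustrative Python, not oVirt engine code; the function name and the sample data are assumptions derived from the cpusets above) of the check Eduardo describes: with 2 threads per core in the VM's topology, vCPUs 2n and 2n+1 are siblings of one core, so they should always land in the same NUMA cell.

```python
# Hypothetical sketch: flag guest cores whose sibling vCPUs fall into
# different NUMA cells (assumes 2 threads per core, as configured here).
def split_cores(cells, threads_per_core=2):
    vcpu_to_cell = {v: c for c, vcpus in cells.items() for v in vcpus}
    cores = {}
    for vcpu, cell in vcpu_to_cell.items():
        cores.setdefault(vcpu // threads_per_core, set()).add(cell)
    return sorted(core for core, owners in cores.items() if len(owners) > 1)

# Online vCPUs 0-45 as split by the generated domxml above (12/12/11/11).
cells = {0: set(range(0, 12)), 1: set(range(12, 24)),
         2: set(range(24, 35)), 3: set(range(35, 46))}
print(split_cores(cells))  # [17] -> core 17 (vCPUs 34/35) straddles cells 2 and 3
```

With core 17 straddling two cells, the guest cannot present a consistent 2-threads-per-core topology, which is consistent with the VM's lscpu reporting a different threads-per-core value than the VM settings.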
Dr. David Gilbert suggested: it's specifying that one physical CPU (34/35) is split across two NUMA nodes.
If you change it to:

```xml
<cell id='0' cpus='0-11'
<cell id='1' cpus='12-23'
<cell id='2' cpus='24-35'
<cell id='3' cpus='36-45'
```
that depends on the topology within a socket on AMD;
it might want to be at the 'die' level, but if we're not doing dies
then I agree we may as well keep everything in a socket within a NUMA
node.
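For illustration, a minimal sketch (an assumed helper, not the engine's actual allocation code) of splitting the online vCPUs into contiguous per-node blocks that only cut at core boundaries; with 46 vCPUs, 2 threads per core and 4 nodes it reproduces the ranges suggested above.

```python
# Hypothetical sketch: distribute vCPUs across NUMA cells in contiguous
# blocks, cutting only at core boundaries so no core spans two cells.
def allocate(vcpus, numa_nodes, threads_per_core=2):
    cores = (vcpus + threads_per_core - 1) // threads_per_core   # 23 cores for 46 vCPUs
    base, extra = divmod(cores, numa_nodes)                      # 5 cores each, 3 left over
    cells, start = [], 0
    for node in range(numa_nodes):
        n_cores = base + (1 if node < extra else 0)
        end = min(start + n_cores * threads_per_core, vcpus)
        cells.append((start, end - 1))
        start = end
    return cells

print(allocate(46, 4))  # [(0, 11), (12, 23), (24, 35), (36, 45)]
```

Front-loading the remainder keeps each cell's cpuset a single range and never splits a core, at the cost of the last node getting slightly fewer vCPUs.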
We need to decide whether to emit a warning or split it correctly when the number of cores is not a multiple of the number of NUMA nodes.
(In reply to Liran Rotenberg from comment #1)
> We need to decide whether to emit a warning or split it correctly when the number of cores is not a multiple of the number of NUMA nodes.

Note that a warning is not an option for the OCP on RHV use case, so we should probably try to fix the allocation logic.

I think this bug currently talks about two related but different issues:
1. The CPU topology seen from within the guest doesn't match the VM settings (that's what the title says).
2. The allocation of vCPUs to NUMA nodes is not correct/optimal.

The reason for #1 is that when creating a VM with auto_pinning_policy=adjust, we set the CPU topology of the VM according to the CPU topology of the host, but we don't allocate all CPUs (in this case 46 vCPUs are allocated on a host with 48 threads, so we can't reach a 1:24:2 topology from the guest's point of view; maybe in this particular case it would have been better to set the guest CPU topology to 1:23:2, if that's valid). #2 is broader: it can happen regardless of auto_pinning_policy=adjust. Polina, could you please file a separate bug for #2?

Verified on ovirt-engine-tools-4.5.0.2-0.7.el8ev.noarch according to the description:

```
host
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           1
NUMA node(s):        4
```

This bugzilla is included in the oVirt 4.5.0 release, published on April 20th 2022. Since the problem described in this bug report should be resolved in oVirt 4.5.0, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.