Bug 1694711

Summary: Incorrect NUMA pinning due to improper correlation between CPU sockets and NUMA nodes
Product: Red Hat Enterprise Virtualization Manager
Reporter: Nils Koenig <nkoenig>
Component: ovirt-engine
Assignee: Liran Rotenberg <lrotenbe>
Status: CLOSED ERRATA
QA Contact: Polina <pagranat>
Severity: medium
Priority: high
Version: 4.2.7
CC: ahadas, avijayku, djdumas, emarcus, gveitmic, koconnor, lrotenbe, lsurette, michal.skrivanek, mtessun, nkoenig, sgratch, srevivo, ycui
Target Milestone: ovirt-4.4.4
Keywords: Reopened
Target Release: 4.4.4
Hardware: x86_64
OS: Unspecified
Fixed In Version: ovirt-engine-4.4.4
Doc Type: Bug Fix
Doc Text:
Previously, the UI NUMA panel showed an incorrect NUMA node for a corresponding socket. In this release, the NUMA nodes are ordered by the database, and the socket matches the NUMA node.
Last Closed: 2021-02-02 13:58:29 UTC
Type: Bug
oVirt Team: Virt
Attachments:
  NUMA Pinning as shown in RHV-M
  cat /proc/cpuinfo
  Numbering still wrong on RHV 4.4

Description Nils Koenig 2019-04-01 12:45:07 UTC
Created attachment 1550555 [details]
NUMA Pinning as shown in RHV-M

On a 4-socket system, I see the NUMA architecture as shown in the screenshot.
What's confusing is that the NUMA node and socket numbers diverge.

Looking at lscpu / numactl --hardware does not show this swap:

[root@ls3020 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                192
On-line CPU(s) list:   0-191
Thread(s) per core:    2
Core(s) per socket:    24
Socket(s):             4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E7-8890 v4 @ 2.20GHz
Stepping:              1
CPU MHz:               3215.002
CPU max MHz:           3400.0000
CPU min MHz:           1200.0000
BogoMIPS:              4389.36
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              61440K
NUMA node0 CPU(s):     0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140,144,148,152,156,160,164,168,172,176,180,184,188
NUMA node1 CPU(s):     1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97,101,105,109,113,117,121,125,129,133,137,141,145,149,153,157,161,165,169,173,177,181,185,189
NUMA node2 CPU(s):     2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78,82,86,90,94,98,102,106,110,114,118,122,126,130,134,138,142,146,150,154,158,162,166,170,174,178,182,186,190
NUMA node3 CPU(s):     3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79,83,87,91,95,99,103,107,111,115,119,123,127,131,135,139,143,147,151,155,159,163,167,171,175,179,183,187,191
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_pt ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts spec_ctrl intel_stibp flush_l1d

[root@ls3020 ~]# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 68 72 76 80 84 88 92 96 100 104 108 112 116 120 124 128 132 136 140 144 148 152 156 160 164 168 172 176 180 184 188
node 0 size: 262038 MB
node 0 free: 54468 MB
node 1 cpus: 1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81 85 89 93 97 101 105 109 113 117 121 125 129 133 137 141 145 149 153 157 161 165 169 173 177 181 185 189
node 1 size: 262144 MB
node 1 free: 59311 MB
node 2 cpus: 2 6 10 14 18 22 26 30 34 38 42 46 50 54 58 62 66 70 74 78 82 86 90 94 98 102 106 110 114 118 122 126 130 134 138 142 146 150 154 158 162 166 170 174 178 182 186 190
node 2 size: 262144 MB
node 2 free: 59387 MB
node 3 cpus: 3 7 11 15 19 23 27 31 35 39 43 47 51 55 59 63 67 71 75 79 83 87 91 95 99 103 107 111 115 119 123 127 131 135 139 143 147 151 155 159 163 167 171 175 179 183 187 191
node 3 size: 262144 MB
node 3 free: 59266 MB
node distances:
node   0   1   2   3 
  0:  10  21  21  21 
  1:  21  10  21  21 
  2:  21  21  10  21 
  3:  21  21  21  10 

It is not clear to me why NUMA nodes 0/3 and sockets 0/3 are swapped. Moreover, this is quite confusing to the user when configuring NUMA pinning and can lead to erroneous mappings. Is there a good reason for this, or is it just wrong?
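For reference, the kernel's own node-to-socket relationship can be cross-checked in sysfs; a minimal sketch using the standard topology files (illustrative, not output captured from this host):

for node in /sys/devices/system/node/node*; do
    # first CPU listed for this node, e.g. "0,4,8,..." or "0-5,24-29" -> 0
    cpu=$(cut -d, -f1 "$node/cpulist" | cut -d- -f1)
    pkg=$(cat "/sys/devices/system/cpu/cpu$cpu/topology/physical_package_id")
    echo "$(basename "$node") -> socket $pkg"
done

This prints, for each NUMA node, the physical package (socket) its first CPU belongs to, independent of what the RHV-M panel shows.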

This might also be a reason why live migration of NUMA-pinned VMs fails.

Comment 1 Nils Koenig 2019-04-01 12:45:48 UTC
Created attachment 1550556 [details]
cat /proc/cpuinfo

Comment 2 Ryan Barry 2019-04-01 14:07:29 UTC
EL8 or EL7?

Comment 3 Nils Koenig 2019-04-01 14:38:24 UTC
EL7

Comment 4 Ryan Barry 2019-04-01 22:42:25 UTC
Sharon, any ideas?

Nils - AFAIK, the virtual NUMA nodes are not required to map exactly to the physical NUMA nodes (and as long as the CPU grouping is correct, this should not lead to performance deficits). Do you have a bug number for a failure when live migrating NUMA-pinned VMs? It's not in any of my queries, and I don't remember seeing such a bug.

Comment 5 Nils Koenig 2019-04-02 08:37:47 UTC
Ryan, what do you mean by CPU grouping?

In our case, using the "High Performance" VM profile with NUMA, CPU and I/O-thread pinning to align the virtual and physical topology, this is crucial for getting maximum performance.
What I am asking is: where does the numbering for the NUMA nodes come from, and why does it differ from the socket numbering?

My statement regarding the live migration is more of a gut feeling - I don't have a BZ for that.
I am just trying to imagine how live migrating a pinned VM could work, even between identical hosts, if the NUMA node numbering is non-deterministic.
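(For what it's worth, the pinning that libvirt actually applies on the host can be inspected with virsh; <vm-name> below is just a placeholder:)

# virsh vcpupin <vm-name>       # vCPU -> host CPU pinning
# virsh numatune <vm-name>      # guest memory -> host NUMA node binding
# virsh emulatorpin <vm-name>   # emulator thread placement
# virsh iothreadinfo <vm-name>  # I/O thread placement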

Comment 6 Ryan Barry 2019-04-02 11:23:28 UTC
No, this makes sense, Nils. What I meant is that, in an imaginary scenario with the following NUMA nodes:

node0 CPUs: 0,2,4
node1 CPUs: 1,3,5

Even a virtual NUMA map that reverses the nodes should not affect performance: on a vNUMA arrangement where node 1 is mapped to vNUMA node 0, memory operations still do not cross node boundaries, even if the topology isn't a 1:1 socket-to-socket map. Sharon can confirm the behavior, though.
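(To make that concrete, the cell-to-node mapping that ends up on the host is visible in the generated libvirt XML; the output below is hypothetical, just illustrating a reversed-but-consistent mapping:)

# virsh dumpxml <vm-name> | grep -A3 '<numatune>'
  <numatune>
    <memnode cellid='0' mode='strict' nodeset='1'/>
    <memnode cellid='1' mode='strict' nodeset='0'/>
  </numatune>

Here vNUMA cell 0 is backed by host node 1 and vice versa; as long as cell 0's vCPUs are also pinned to host node 1's CPUs, memory access stays node-local.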

There's no BZ for failed live migrations on HP VMs? If this is reproducible, please file a bug next time it comes up.

Comment 7 Ryan Barry 2019-05-15 03:14:27 UTC
Per Sharon, this does not affect performance. Unless there's a functional impact, closing.

Comment 8 Nils Koenig 2020-07-09 10:38:28 UTC
Well, in my opinion it is an issue when you also do CPU pinning:
memory and CPUs are then no longer co-located.
Additionally, it is confusing to the user - which mapping shown in the UI is the correct one?

Comment 9 Nils Koenig 2020-07-09 10:39:57 UTC
Created attachment 1700418 [details]
Numbering still wrong on RHV 4.4

Comment 11 Martin Tessun 2020-07-10 08:03:44 UTC
I think Nils is correct here. The socket and NUMA numbering must match, as otherwise mistakes are easily made, unless we also do implicit vCPU pinning, which we currently don't.
This could specifically be an issue for high-performance workloads.

@Arik, please triage accordingly.

Comment 12 Arik 2020-10-20 16:56:32 UTC
*** Bug 1822841 has been marked as a duplicate of this bug. ***

Comment 15 Polina 2020-11-17 17:33:17 UTC
Verified on ovirt-engine-4.4.4-0.1.el8ev.noarch

Host with 4 NUMA nodes - no swap between NUMA and socket indexes:

[root@ocelot05 ~]# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 24 25 26 27 28 29
node 0 size: 15377 MB
node 0 free: 11301 MB
node 1 cpus: 6 7 8 9 10 11 30 31 32 33 34 35
node 1 size: 31113 MB
node 1 free: 26358 MB
node 2 cpus: 12 13 14 15 16 17 36 37 38 39 40 41
node 2 size: 15841 MB
node 2 free: 13867 MB
node 3 cpus: 18 19 20 21 22 23 42 43 44 45 46 47
node 3 size: 30956 MB
node 3 free: 30244 MB
node distances:
node   0   1   2   3 
  0:  10  16  16  16 
  1:  16  10  16  16 
  2:  16  16  10  16 
  3:  16  16  16  10 

Ran a VM configured with 4 CPUs and 4 vNUMA nodes.

In the NUMA pinning topology:
NUMA0 -> Socket0
NUMA1 -> Socket1
NUMA2 -> Socket2
NUMA3 -> Socket3

Checked the same with a two-NUMA-node host
 - no swap between NUMA and socket indexes.
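(For completeness, the node indexes the engine reports for a host can also be pulled over the REST API and compared with the sockets shown in the pinning dialog; a sketch with placeholder credentials, assuming the usual numanodes sub-collection:)

curl -s -k -u admin@internal:<password> -H 'Accept: application/xml' \
  'https://<engine-fqdn>/ovirt-engine/api/hosts/<host-id>/numanodes'

Each returned host_numa_node carries an index and the CPUs that belong to it.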

Please confirm that no more tests are required to verify.

Comment 16 Liran Rotenberg 2020-11-18 07:13:34 UTC
Nothing else.

Comment 22 errata-xmlrpc 2021-02-02 13:58:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV Engine and Host Common Packages 4.4.z [ovirt-4.4.4]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0312