Bug 1694711

| Summary: | Incorrect NUMA pinning due to improper correlation between CPU sockets and NUMA nodes | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Nils Koenig <nkoenig> |
| Component: | ovirt-engine | Assignee: | Liran Rotenberg <lrotenbe> |
| Status: | CLOSED ERRATA | QA Contact: | Polina <pagranat> |
| Severity: | medium | Priority: | high |
| Version: | 4.2.7 | CC: | ahadas, avijayku, djdumas, emarcus, gveitmic, koconnor, lrotenbe, lsurette, michal.skrivanek, mtessun, nkoenig, sgratch, srevivo, ycui |
| Target Milestone: | ovirt-4.4.4 | Keywords: | Reopened |
| Target Release: | 4.4.4 | Hardware: | x86_64 |
| OS: | Unspecified | oVirt Team: | Virt |
| Fixed In Version: | ovirt-engine-4.4.4 | Doc Type: | Bug Fix |
| Last Closed: | 2021-02-02 13:58:29 UTC | Type: | Bug |
| Doc Text: | Previously, the UI NUMA panel showed an incorrect NUMA node for a corresponding socket. In this release, the NUMA nodes are ordered by the database, and the socket matches the NUMA node. | | |
Created attachment 1550556 [details]
cat /proc/cpuinfo
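The attachment shows /proc/cpuinfo, where the socket appears as "physical id". The same socket-to-node correlation can be read straight from sysfs; below is a minimal shell sketch (assuming an ordinary Linux host, no oVirt-specific tooling) that reduces it to unique socket/node pairs:

```
# Each /sys/devices/system/cpu/cpuN directory carries the CPU's package
# (socket) id and a nodeM subdirectory naming its NUMA node.
for c in /sys/devices/system/cpu/cpu[0-9]*; do
  pkg=$(cat "$c/topology/physical_package_id")   # socket ("physical id")
  node=$(ls -d "$c"/node* | sed 's!.*/!!')       # e.g. node0
  echo "socket $pkg -> $node"
done | sort -u
```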
EL8 or EL7?

EL7

Sharon, any ideas?

Nils - AFAIK, the virtual NUMA nodes are not required to map exactly to the physical NUMA nodes (and as long as the CPU grouping is correct, this should not lead to performance deficits). Do you have a bug number for a failure live migrating NUMA-pinned VMs? It's not on any of my queries, and I don't remember seeing such a bug.

Ryan, what do you mean by CPU grouping? In our case, using the "High Performance" VM profile with NUMA, CPU, and iothread pinning to align the virtual and physical topology is crucial for getting the maximum performance. What I am asking is: where does the numbering for the NUMA nodes come from, and why is it different from the sockets? My statement regarding live migration is more of a gut feeling - I don't have a BZ for that. I am just trying to imagine how live migrating a pinned VM could work, even between identical hosts, if the NUMA node numbering is nondeterministic.

No, this makes sense, Nils. What I meant is that, in an imaginary scenario with the following NUMA nodes:

CPU0: 0,2,4
CPU1: 1,3,5

even a virtual NUMA map which reverses the CPUs should not affect performance, since operations on a vNUMA arrangement where CPU1 is mapped to vCPU0 will still not cross node boundaries as part of the memory operations, even if the topology isn't a 1:1 socket-to-socket map. Sharon can confirm the behavior, though. There's no BZ for failed live migrations on HP VMs? If this is reproducible, please file a bug next time it comes up.

Per Sharon, this does not affect performance. Unless there's a functional impact, closing.

Well, it is an issue in my opinion when you also do CPU pinning: then memory and CPUs become dislocated. Additionally, it is confusing to the user - in the UI, what is the correct mapping then?

Created attachment 1700418 [details]
Numbering still wrong on RHV 4.4
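To make the reversed-mapping example above concrete: what matters for performance is that a process's (or vNUMA node's) CPUs and memory land on the same physical node, not which label that node carries. A minimal sketch of checking locality from the shell, assuming the numactl package is installed and using a hypothetical ./workload binary:

```
# Bind CPUs and memory to the same physical node (node 1 here); accesses
# stay node-local regardless of how the node happens to be numbered.
numactl --cpunodebind=1 --membind=1 ./workload &

# numastat -p prints the per-node memory allocation of a process; with a
# correct binding, nearly all pages should sit on node 1.
numastat -p $!
```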
I think Nils is correct here. The socket and NUMA numbering must match, as otherwise mistakes are easily made, unless we also do an implicit vCPU pinning, which we currently don't. This could specifically be an issue for High Performance workloads. @Arik, please triage accordingly.

*** Bug 1822841 has been marked as a duplicate of this bug. ***

Verified on ovirt-engine-4.4.4-0.1.el8ev.noarch on a host with 4 NUMA nodes - no swap between NUMA and socket indexes:

```
[root@ocelot05 ~]# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 24 25 26 27 28 29
node 0 size: 15377 MB
node 0 free: 11301 MB
node 1 cpus: 6 7 8 9 10 11 30 31 32 33 34 35
node 1 size: 31113 MB
node 1 free: 26358 MB
node 2 cpus: 12 13 14 15 16 17 36 37 38 39 40 41
node 2 size: 15841 MB
node 2 free: 13867 MB
node 3 cpus: 18 19 20 21 22 23 42 43 44 45 46 47
node 3 size: 30956 MB
node 3 free: 30244 MB
node distances:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10
```

Ran a VM configured with 4 CPUs and 4 vNUMA nodes. In the NUMA pinning topology:

NUMA0 -> Socket0
NUMA1 -> Socket1
NUMA2 -> Socket2
NUMA3 -> Socket3

Checked the same on a host with two NUMA nodes - no swap between NUMA and socket indexes. Please confirm that no more tests are required to verify.

Nothing else.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV Engine and Host Common Packages 4.4.z [ovirt-4.4.4]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0312
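As a supplement to the verification steps above: the pinning actually applied to a running high-performance VM can also be inspected from the host with virsh. A sketch, assuming shell access to the host, libvirt credentials, and a placeholder domain name myvm:

```
# vCPU -> host CPU pinning in effect for the running domain
virsh vcpupin myvm

# guest memory -> host NUMA node binding (mode and nodeset)
virsh numatune myvm

# vNUMA topology and pinning as written into the domain XML
virsh dumpxml myvm | grep -E -A3 '<numatune>|<numa>'
```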
Created attachment 1550555 [details]
NUMA Pinning as shown in RHV-M

On a 4-socket system, I see the NUMA architecture as shown in the screenshot. What's confusing is that the NUMA node and socket numbers diverge. Looking at lscpu / numactl --hardware does not show the twist:

```
[root@ls3020 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                192
On-line CPU(s) list:   0-191
Thread(s) per core:    2
Core(s) per socket:    24
Socket(s):             4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E7-8890 v4 @ 2.20GHz
Stepping:              1
CPU MHz:               3215.002
CPU max MHz:           3400.0000
CPU min MHz:           1200.0000
BogoMIPS:              4389.36
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              61440K
NUMA node0 CPU(s):     0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140,144,148,152,156,160,164,168,172,176,180,184,188
NUMA node1 CPU(s):     1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97,101,105,109,113,117,121,125,129,133,137,141,145,149,153,157,161,165,169,173,177,181,185,189
NUMA node2 CPU(s):     2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78,82,86,90,94,98,102,106,110,114,118,122,126,130,134,138,142,146,150,154,158,162,166,170,174,178,182,186,190
NUMA node3 CPU(s):     3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79,83,87,91,95,99,103,107,111,115,119,123,127,131,135,139,143,147,151,155,159,163,167,171,175,179,183,187,191
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_pt ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts spec_ctrl intel_stibp flush_l1d

[root@ls3020 ~]# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 68 72 76 80 84 88 92 96 100 104 108 112 116 120 124 128 132 136 140 144 148 152 156 160 164 168 172 176 180 184 188
node 0 size: 262038 MB
node 0 free: 54468 MB
node 1 cpus: 1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81 85 89 93 97 101 105 109 113 117 121 125 129 133 137 141 145 149 153 157 161 165 169 173 177 181 185 189
node 1 size: 262144 MB
node 1 free: 59311 MB
node 2 cpus: 2 6 10 14 18 22 26 30 34 38 42 46 50 54 58 62 66 70 74 78 82 86 90 94 98 102 106 110 114 118 122 126 130 134 138 142 146 150 154 158 162 166 170 174 178 182 186 190
node 2 size: 262144 MB
node 2 free: 59387 MB
node 3 cpus: 3 7 11 15 19 23 27 31 35 39 43 47 51 55 59 63 67 71 75 79 83 87 91 95 99 103 107 111 115 119 123 127 131 135 139 143 147 151 155 159 163 167 171 175 179 183 187 191
node 3 size: 262144 MB
node 3 free: 59266 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10
```

It is not clear to me why NUMA 0/3 and Socket 0/3 are swapped. Moreover, this is quite confusing to the user when configuring NUMA pinning and can lead to erroneous mapping. Is there a good reason for this, or is it just wrong?
This might also be a reason why live migration of NUMA-pinned VMs fails.
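For reference, whether the host itself correlates sockets and NUMA nodes one-to-one can be cross-checked with lscpu's parsable output (column names per lscpu(1)); a quick sketch:

```
# One CSV line per CPU (comment lines start with '#'): CPU,SOCKET,NODE.
# Reducing to unique socket/node pairs shows how sockets map onto nodes.
lscpu -p=CPU,SOCKET,NODE | grep -v '^#' \
  | awk -F, '{ print "socket " $2 " -> node " $3 }' | sort -u
```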