Bug 1503284
Summary: libvirt memory pinning/migration vs vfio

Product: Red Hat Enterprise Linux Advanced Virtualization
Component: libvirt
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Reporter: Alex Williamson <alex.williamson>
Assignee: Michal Privoznik <mprivozn>
QA Contact: Jing Qi <jinqi>
CC: berrange, chhu, dyuan, fjin, jdenemar, mprivozn, pbonzini, xuzhang, yafu, yalzhang
Keywords: Upstream
Target Milestone: rc
Target Release: 8.0
Fixed In Version: libvirt-5.3.0-1.el8
Type: Bug
Last Closed: 2019-11-06 07:10:39 UTC
Description
Alex Williamson
2017-10-17 18:18:42 UTC
Specifically, I thought that if we saw <memory mode='strict' nodeset='2'/> but <vcpu> did not have any cpuset=, then we should invent an initial cpuset based on the memory nodeset. This ensures the guest gets correct RAM placement and avoids memory page migration when we initialize cpuset.mems. Once we've set cgroups, we can then put the process CPU affinity back to all-1s (to honour the lack of any <vcpu cpuset=...>).

Another important note for the scope of this issue: when using hugepages for the VM, the hugepages are allocated according to the numatune directive, so the VM memory placement is correct in that case. This is likely the real reason I hadn't seen this on the 2-node system noted in comment 0, and possibly the reason we've only just now discovered this issue; users making use of node-local memory will typically also make use of hugepages.

Is there anything that QEMU needs to do, or are "-numa node,memdev=..." and "-object memory-backend-ram,policy=...,host-nodes=..." enough?

I don't think we need anything from QEMU here. The solution I describe ought to be achievable with just a little more intelligence in the way libvirt does setup before exec'ing QEMU.

Patch proposed upstream: https://www.redhat.com/archives/libvir-list/2019-January/msg01228.html

To POST:

commit f136b83139c63f20de0df3285d9e82df2fb97bfc
Author:     Michal Privoznik <mprivozn>
AuthorDate: Wed Jan 30 09:46:23 2019 +0100
Commit:     Michal Privoznik <mprivozn>
CommitDate: Fri Feb 1 12:53:46 2019 +0100

    qemu: Rework setting process affinity

    https://bugzilla.redhat.com/show_bug.cgi?id=1503284

    The way we currently start qemu, from the CPU affinity point of view,
    is as follows:

      1) the child process is given affinity to all online CPUs (unless
         some vcpu pinning was given in the domain XML),

      2) once qemu is running, the cpuset cgroup is configured taking
         memory pinning into account.

    The problem is that we let qemu allocate its memory just anywhere in
    1) and then rely on 2) to be able to move the memory to the configured
    NUMA nodes. This might not always be possible (e.g. qemu might lock
    some parts of its memory) and is very suboptimal (copying large
    amounts of memory between NUMA nodes takes a significant amount of
    time).

    The solution is to set the affinity to one of (in priority order):

      - the CPUs associated with the NUMA memory affinity mask,
      - the CPUs associated with emulator pinning,
      - all online host CPUs.

    Later (once QEMU has allocated its memory) we change this again to
    (again in priority order):

      - the CPUs associated with emulator pinning,
      - the CPUs returned by numad,
      - the CPUs associated with vCPU pinning,
      - all online host CPUs.

    Signed-off-by: Michal Privoznik <mprivozn>
    Reviewed-by: Daniel P. Berrangé <berrange>

v5.0.0-199-gf136b83139
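A minimal C sketch of the two-phase approach the commit message describes, not libvirt's actual code: derive the CPU list of the numatune node from sysfs, narrow the process affinity to those CPUs before guest memory is allocated, then widen the affinity back to all online host CPUs. The node number and the placeholder for the allocation/exec step are assumptions made for illustration.

/* Sketch only: in libvirt the narrowed affinity is set on the child
 * before exec and widened later on the qemu process; here both steps
 * run on the current process for simplicity. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Parse /sys/devices/system/node/node<N>/cpulist (e.g. "3,7,11-15")
 * into a cpu_set_t. */
static int node_cpus(int node, cpu_set_t *set)
{
    char path[64], buf[4096];
    FILE *fp;

    CPU_ZERO(set);
    snprintf(path, sizeof(path), "/sys/devices/system/node/node%d/cpulist", node);
    if (!(fp = fopen(path, "r")))
        return -1;
    if (!fgets(buf, sizeof(buf), fp)) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    for (char *tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
        int lo, hi;
        if (sscanf(tok, "%d-%d", &lo, &hi) == 2) {
            for (int c = lo; c <= hi; c++)
                CPU_SET(c, set);
        } else if (sscanf(tok, "%d", &lo) == 1) {
            CPU_SET(lo, set);
        }
    }
    return 0;
}

int main(void)
{
    int node = 7;  /* hypothetical numatune nodeset, for illustration */
    cpu_set_t narrow, wide;

    /* Phase 1: restrict to the memory node's CPUs so that first-touch
     * allocations land on the intended node. */
    if (node_cpus(node, &narrow) < 0 ||
        sched_setaffinity(0, sizeof(narrow), &narrow) < 0) {
        perror("narrow affinity");
        return EXIT_FAILURE;
    }

    /* ... placeholder: start qemu / allocate guest memory here ... */

    /* Phase 2: widen back to all online CPUs, mimicking the default
     * "pinned to all available physical CPUs" behaviour. */
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    CPU_ZERO(&wide);
    for (long c = 0; c < ncpus; c++)
        CPU_SET((int)c, &wide);
    if (sched_setaffinity(0, sizeof(wide), &wide) < 0) {
        perror("wide affinity");
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}

The point of the narrow phase is that first-touch allocations already land on the intended node, so no pages need to be migrated when cpuset.mems is applied afterwards.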
Built an rpm from upstream code with the patch: libvirt-5.1.0-1.el7.x86_64 (the 8-cell test machine doesn't support RHEL 8).

Here is the CPU info of the NUMA cells:

available: 8 nodes (0-7)
node 0 cpus: 0 4 8 12 16 20 24 28
node 0 size: 16349 MB
node 0 free: 13920 MB
node 1 cpus: 32 36 40 44 48 52 56 60
node 1 size: 16384 MB
node 1 free: 15345 MB
node 2 cpus: 1 5 9 13 17 21 25 29
node 2 size: 16384 MB
node 2 free: 13986 MB
node 3 cpus: 33 37 41 45 49 53 57 61
node 3 size: 16384 MB
node 3 free: 15432 MB
node 4 cpus: 2 6 10 14 18 22 26 30
node 4 size: 16384 MB
node 4 free: 14840 MB
node 5 cpus: 34 38 42 46 50 54 58 62
node 5 size: 16384 MB
node 5 free: 15387 MB
node 6 cpus: 35 39 43 47 51 55 59 63
node 6 size: 16384 MB
node 6 free: 15370 MB
node 7 cpus: 3 7 11 15 19 23 27 31
node 7 size: 16367 MB
node 7 free: 12508 MB

Used the following domain XML with a vfio device:

  <maxMemory slots='8' unit='KiB'>8938496</maxMemory>
  <memory unit='KiB'>2097152</memory>
  <currentMemory unit='KiB'>2097152</currentMemory>
  <vcpu placement='static'>8</vcpu>
  <numatune>
    <memory mode='strict' nodeset='7'/>
  </numatune>
  ....
  <interface type='hostdev' managed='yes'>
    <mac address='52:54:00:c6:b7:84'/>
    <source>
      <address type='pci' domain='0x0000' bus='0x44' slot='0x10' function='0x0'/>
    </source>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0'/>
  </interface>

# virsh start avocado-vt-vm-numa0
Domain avocado-vt-vm-numa0 started

The domain started quickly and its memory is on node 7:

Per-node process memory usage (in MBs) for PID 4197 (qemu-kvm)

            Node 0     Node 1     Node 2
Huge          0.00       0.00       0.00
Heap          0.00       0.00       0.00
Stack         0.00       0.00       0.00
Private       0.00       0.00       0.02
Total         0.00       0.00       0.02

            Node 3     Node 4     Node 5
Huge          0.00       0.00       0.00
Heap          0.00       0.00       0.00
Stack         0.00       0.00       0.00
Private       0.01       0.21       0.00
Total         0.01       0.21       0.00

            Node 6     Node 7      Total
Huge          0.00       0.00       0.00
Heap          0.00      55.36      55.36
Stack         0.00       0.61       0.61
Private       0.00    2525.26    2525.50
Total         0.00    2581.23    2581.48

But the CPUs are not all from node 7:

VCPU:           0
CPU:            60
State:          running
CPU time:       14.4s
CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy

VCPU:           1
CPU:            27
State:          running
CPU time:       1.1s
CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy

VCPU:           2
CPU:            15
State:          running
CPU time:       1.3s
CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy

VCPU:           3
CPU:            35
State:          running
CPU time:       1.2s
CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy

VCPU:           4
CPU:            3
State:          running
CPU time:       1.2s
CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy

VCPU:           5
CPU:            31
State:          running
CPU time:       0.9s
CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy

VCPU:           6
CPU:            34
State:          running
CPU time:       0.8s
CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy

VCPU:           7
CPU:            11
State:          running
CPU time:       1.1s
CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy

Michal, can you please help to check whether the CPU affinity result in the comment above is as expected?

Thanks,
Jing Qi
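The per-node tables above look like numastat -p output. Roughly the same numbers can be recomputed from the kernel's /proc/<pid>/numa_maps by summing the N<node>=<pages> fields, as in the following sketch. It is only an illustration, not the tool used here: it approximates sizes with the base page size, so hugepage-backed mappings would be under-reported.

/* Rough numastat-like tally: sum the "N<node>=<pages>" fields from
 * /proc/<pid>/numa_maps and report per-node totals. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAX_NODES 64

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return EXIT_FAILURE;
    }

    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/numa_maps", argv[1]);
    FILE *fp = fopen(path, "r");
    if (!fp) {
        perror(path);
        return EXIT_FAILURE;
    }

    unsigned long long pages[MAX_NODES] = {0};
    char line[8192];
    while (fgets(line, sizeof(line), fp)) {
        /* Each mapping line carries tokens such as "N7=646464". */
        for (char *tok = strtok(line, " \n"); tok; tok = strtok(NULL, " \n")) {
            int node;
            unsigned long long n;
            if (sscanf(tok, "N%d=%llu", &node, &n) == 2 &&
                node >= 0 && node < MAX_NODES)
                pages[node] += n;
        }
    }
    fclose(fp);

    long psize = sysconf(_SC_PAGESIZE);
    for (int i = 0; i < MAX_NODES; i++)
        if (pages[i])
            printf("Node %d: %llu pages (~%.1f MB)\n",
                   i, pages[i], pages[i] * (double)psize / (1024 * 1024));
    return EXIT_SUCCESS;
}

Pointed at the qemu-kvm PID, it should reproduce the heavy skew toward node 7 seen above.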
(In reply to Jing Qi from comment #13)
> Michal, can you please help to check whether the CPU affinity result in the
> comment above is as expected?

I think the results are expected. After all, you haven't set up any vcpu pinning. To cite the docs: "If both cpuset and placement are not specified or if placement is "static", but no cpuset is specified, the domain process will be pinned to all the available physical CPUs."

And the memory is pinned to the right node, and there was apparently no memory movement, since the guest started instantly as you said.

(In reply to Michal Privoznik from comment #14)
> I think the results are expected. After all, you haven't set up any vcpu
> pinning. ...
>
> And the memory is pinned to the right node, and there was apparently no
> memory movement, since the guest started instantly as you said.

OK, Michal. I asked because of the solution you described in comment 11.

Verified with libvirt-daemon-5.3.0-1.module+el8.1.0+3225+a8268fde.x86_64 and qemu-kvm-2.12.0-73.module+el8.1.0+3196+302d7a44.x86_64.

With an Intel Corporation Ethernet Connection X722 card in cell 0 of a machine with 8 cells in total, the following domain XML is configured:

  <vcpu placement='static'>4</vcpu>
  <numatune>
    <memory mode='strict' nodeset='7'/>
  </numatune>
  <cputune>
    <vcpupin vcpu='0' cpuset='32'/>
    <vcpupin vcpu='1' cpuset='33'/>
    <vcpupin vcpu='2' cpuset='34'/>
    <vcpupin vcpu='3' cpuset='35'/>
  </cputune>
  ....
  <interface type='hostdev' managed='yes'>
    <mac address='be:04:03:30:10:02'/>
    <source>
      <address type='pci' domain='0x0000' bus='0x1a' slot='0x02' function='0x1'/>
    </source>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0'/>
  </interface>

The domain is started; check the vcpu info:

# virsh vcpuinfo avocado-vt-vm2
VCPU:           0
CPU:            32
State:          running
CPU time:       11.2s
CPU Affinity:   --------------------------------y----------------...

VCPU:           1
CPU:            33
State:          running
CPU time:       3.6s
CPU Affinity:   ---------------------------------y----------------...

VCPU:           2
CPU:            34
State:          running
CPU time:       4.2s
CPU Affinity:   ----------------------------------y----------------...

VCPU:           3
CPU:            35
State:          running
CPU time:       2.9s
CPU Affinity:   -----------------------------------y----------------...

Per-node process memory usage (in MBs) for PID 22053 (qemu-kvm)

            Node 0     Node 1     Node 2
Huge          0.00       0.00       0.00
Heap          0.00       0.00       0.00
Stack         0.00       0.00       0.00
Private       0.83       0.22       0.41
Total         0.83       0.22       0.41

            Node 3     Node 4     Node 5
Huge          0.00       0.00       0.00
Heap          0.00       0.00       0.00
Stack         0.00       0.00       0.00
Private       9.16       2.90       2.77
Total         9.16       2.90       2.77

            Node 6     Node 7      Total
Huge          0.00       0.00       0.00
Heap          0.00       4.06       4.06
Stack         0.00       0.03       0.03
Private       0.34    1012.17    1028.80
Total         0.34    1016.26    1032.89

All the vCPUs are allocated according to the cputune settings (on node 2) and the memory is assigned from node 7 according to numatune, as expected.
# numactl --ha
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
node 0 size: 773302 MB
node 0 free: 772819 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
node 1 size: 774137 MB
node 1 free: 773725 MB
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
node 2 size: 774137 MB
node 2 free: 773773 MB
node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
node 3 size: 774137 MB
node 3 free: 773287 MB
node 4 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
node 4 size: 774113 MB
node 4 free: 773738 MB
node 5 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
node 5 size: 774137 MB
node 5 free: 773743 MB
node 6 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
node 6 size: 774137 MB
node 6 free: 773036 MB
node 7 cpus: 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
node 7 size: 774134 MB
node 7 free: 771810 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  21  31  21  21  31  31  31
  1:  21  10  21  31  31  21  31  31
  2:  31  21  10  21  31  31  21  31
  3:  21  31  21  10  31  31  31  21
  4:  21  31  31  31  10  21  21  31
  5:  31  21  31  31  21  10  31  21
  6:  31  31  21  31  21  31  10  21
  7:  31  31  31  21  31  21  21  10
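Earlier in the thread it was noted that the guest started instantly and its memory apparently never had to be moved between nodes. A system-wide way to sanity-check that claim is to sample the kernel's page-migration counters in /proc/vmstat around the guest start. The sketch below is an illustration only; which counter (pgmigrate_success or numa_pages_migrated) actually moves depends on the kernel configuration and on what triggers the migration, and other workloads on the host can also increment it.

/* Sample page-migration counters from /proc/vmstat before and after
 * starting a guest.  If the pre-exec affinity narrowing works, the
 * deltas should stay near zero, i.e. guest RAM was not copied between
 * nodes after allocation. */
#include <stdio.h>
#include <string.h>

/* Return the value of one /proc/vmstat counter, or 0 if it is absent. */
static unsigned long long vmstat_counter(const char *name)
{
    FILE *fp = fopen("/proc/vmstat", "r");
    char key[64];
    unsigned long long val, found = 0;

    if (!fp)
        return 0;
    while (fscanf(fp, "%63s %llu", key, &val) == 2) {
        if (strcmp(key, name) == 0) {
            found = val;
            break;
        }
    }
    fclose(fp);
    return found;
}

int main(void)
{
    const char *counters[] = { "pgmigrate_success", "numa_pages_migrated" };
    unsigned long long before[2], after[2];

    for (int i = 0; i < 2; i++)
        before[i] = vmstat_counter(counters[i]);

    /* ... start the domain now, e.g. "virsh start <domain>" ... */
    printf("start the guest, then press Enter\n");
    getchar();

    for (int i = 0; i < 2; i++) {
        after[i] = vmstat_counter(counters[i]);
        printf("%s delta: %llu\n", counters[i], after[i] - before[i]);
    }
    return 0;
}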
Created attachment 1600839 [details]
code coverage report
Created attachment 1600918 [details]
code coverage report - new
Hi Michal,

I uploaded the test code coverage report for the bug. Can you please help to check whether any other test scenario needs to be added?

The current test starts the domain with a vfio interface (in node 0), with the memory specified to come from node 5 and the CPUs taken from node 2. The domain XML is as below:

  <maxMemory slots='8' unit='KiB'>838860800</maxMemory>
  <memory unit='KiB'>21495808</memory>
  <currentMemory unit='KiB'>20971520</currentMemory>
  <memtune>
    <hard_limit unit='KiB'>4503599627370495</hard_limit>
  </memtune>
  <numatune>
    <memory mode='strict' nodeset='5'/>
  </numatune>
  ....
  <vcpu placement='static' cpuset='0-9'>10</vcpu>
  <iothreads>1</iothreads>
  <iothreadids>
    <iothread id='1'/>
  </iothreadids>
  <cputune>
    <vcpupin vcpu='0' cpuset='32'/>
    <vcpupin vcpu='1' cpuset='33'/>
    <vcpupin vcpu='2' cpuset='34'/>
    <vcpupin vcpu='3' cpuset='35'/>
    <iothreadpin iothread='1' cpuset='40-41'/>
  </cputune>
  ...
  <cpu>
    <numa>
      <cell id='0' cpus='0-9' memory='20971520' unit='KiB'/>
    </numa>
  </cpu>
  ....
  <interface type='hostdev' managed='yes'>
    <mac address='be:04:03:30:10:32'/>
    <driver name='vfio'/>
    <source>
      <address type='pci' domain='0x0000' bus='0x1a' slot='0x06' function='0x3'/>
    </source>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
  </interface>

The host topology ("numactl --ha") is the same as in the listing above.

(In reply to Jing Qi from comment #22)
> I uploaded the test code coverage report for the bug. Can you please help
> to check whether any other test scenario needs to be added?

I think this one is enough.

Thanks, Michal, for your response.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3723
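The coverage scenario above also pins an iothread (iothreadpin cpuset='40-41'). One way to double-check per-thread placement, beyond virsh vcpuinfo, is to list each QEMU thread's allowed CPUs from /proc. The sketch below assumes a Linux /proc layout; how the vCPU and iothread threads are named (for example "CPU 0/KVM" or the iothread id) varies with the QEMU version, so the names are only a hint.

/* List every thread of a process with its name (comm) and the CPUs it
 * is allowed to run on (Cpus_allowed_list from /proc). */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <qemu pid>\n", argv[0]);
        return EXIT_FAILURE;
    }

    char path[256];
    snprintf(path, sizeof(path), "/proc/%s/task", argv[1]);
    DIR *dir = opendir(path);
    if (!dir) {
        perror(path);
        return EXIT_FAILURE;
    }

    struct dirent *de;
    while ((de = readdir(dir))) {
        if (de->d_name[0] == '.')
            continue;

        char comm[64] = "?", cpus[256] = "?";
        FILE *fp;

        /* Thread name, e.g. a vCPU or iothread worker. */
        snprintf(path, sizeof(path), "/proc/%s/task/%s/comm", argv[1], de->d_name);
        if ((fp = fopen(path, "r"))) {
            if (fgets(comm, sizeof(comm), fp))
                comm[strcspn(comm, "\n")] = '\0';
            fclose(fp);
        }

        /* Affinity as a human-readable CPU list. */
        snprintf(path, sizeof(path), "/proc/%s/task/%s/status", argv[1], de->d_name);
        if ((fp = fopen(path, "r"))) {
            char line[512];
            while (fgets(line, sizeof(line), fp)) {
                if (sscanf(line, "Cpus_allowed_list: %255[^\n]", cpus) == 1)
                    break;
            }
            fclose(fp);
        }

        printf("tid %-8s %-20s allowed CPUs: %s\n", de->d_name, comm, cpus);
    }
    closedir(dir);
    return EXIT_SUCCESS;
}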