Bug 1503284

Summary: libvirt memory pinning/migration vs vfio
Product: Red Hat Enterprise Linux Advanced Virtualization
Reporter: Alex Williamson <alex.williamson>
Component: libvirt
Assignee: Michal Privoznik <mprivozn>
Status: CLOSED ERRATA
QA Contact: Jing Qi <jinqi>
Severity: unspecified
Priority: unspecified
Version: 8.0
CC: berrange, chhu, dyuan, fjin, jdenemar, mprivozn, pbonzini, xuzhang, yafu, yalzhang
Target Milestone: rc
Keywords: Upstream
Target Release: 8.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: libvirt-5.3.0-1.el8
Doc Type: If docs needed, set a value
Last Closed: 2019-11-06 07:10:39 UTC
Type: Bug
Attachments:
  code coverage report (none)
  code coverage report - new (none)

Description Alex Williamson 2017-10-17 18:18:42 UTC
Description of problem:

When using <numatune> elements, such as:

  <numatune>
    <memory mode='strict' nodeset='2'/>
  </numatune>

libvirt expects to be able to move VM memory after the qemu process is instantiated.  However, when the VM includes a vfio device, the VM memory is already pinned during that instantiation.  Not only is the user then not using the node-local memory they requested, but this can also incur a very long (3min) startup delay during which libvirtd is otherwise unresponsive to virsh commands and consumes 100% of a cpu.

A workaround for this is to specify an initial locality on the <vcpu> element, for example:

  <vcpu placement='static' cpuset='16-23'>8</vcpu>

In this example the hardware topology is:

node 2 cpus: 16 17 18 19 20 21 22 23 80 81 82 83 84 85 86 87
node 2 size: 16383 MB
node 2 free: 15958 MB

Where the chosen cpuset causes the initial allocation on node 2.  We can then also include:

  <cputune>
    <vcpupin vcpu='0' cpuset='80'/>
    <vcpupin vcpu='1' cpuset='81'/>
    <vcpupin vcpu='2' cpuset='82'/>
    <vcpupin vcpu='3' cpuset='83'/>
    <vcpupin vcpu='4' cpuset='84'/>
    <vcpupin vcpu='5' cpuset='85'/>
    <vcpupin vcpu='6' cpuset='86'/>
    <vcpupin vcpu='7' cpuset='87'/>
    <emulatorpin cpuset='16-23'/>
  </cputune>

if desired to pin vCPUs to separate threads from the emulator (or iothread, not shown).
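Whether the resulting allocation actually landed on the intended node can be checked from the host with numastat (a quick sanity check, assuming numactl/numastat is installed; the per-node tables later in this bug are in that format):

  # numastat -p <qemu-kvm pid>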

Overall, libvirt is expecting to be able to migrate memory after VM instantiation, which does not work.  Daniel Berrange suggested on IRC that a potential solution might be for libvirt to look for memory binding elements prior to instantiating qemu and make use of those if no cpuset is provided in the <vcpu> element.

Version-Release number of selected component (if applicable):
libvirt-3.8.0-1.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create VM with vfio device and numatune memory binding

Actual results:

VM start-up takes an excessively long time (3min) - potentially dependent on host system, pathological case on 8-node system, unnoticeable on 2-node system.

VM is not necessarily using prescribed node-local memory.


Expected results:

VMs with vfio devices should use memory nodes defined by <numatune> elements.


Additional info:

A workaround exists as described above, but it is limited to nodes that have both processors and memory; for example, a memory-only node could not be used for a vfio VM.

Comment 2 Daniel Berrangé 2017-10-17 18:24:15 UTC
Specifically I thought that if we saw:

    <memory mode='strict' nodeset='2'/>

but <vcpu> did not have any cpuset=, then we should invent an initial cpuset based on the memory nodeset.  This ensures the guest gets correct RAM placement and avoids memory page migration when we initialize cpuset.mems.  Once we've set cgroups, we can then put the process CPU affinity back to all-1s (to honour the lack of any <vcpu cpuset=...>).
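For illustration only (this is not the actual libvirt implementation): the node-to-CPU mapping needed to invent such an initial cpuset is available from sysfs, and on the host described in comment 0 it would read roughly:

  # cat /sys/devices/system/node/node2/cpulist
  16-23,80-87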

Comment 3 Alex Williamson 2017-10-17 18:52:57 UTC
Another important note for the scope of this issue: when using hugepages for the VM, the hugepages are allocated according to the numatune directive, so the VM memory placement is correct.  This is likely the real reason I hadn't seen this on the 2-node system noted in comment 0, and possibly the reason we've only just now discovered this issue: users making use of node-local memory will typically also make use of hugepages.

Comment 4 Paolo Bonzini 2017-11-15 22:01:43 UTC
Is there anything that QEMU needs to do or are "-numa node,memdev=..." and "-object memory-backend-ram,policy=...,host-nodes=..." enough?

Comment 5 Daniel Berrangé 2017-11-16 10:27:16 UTC
I don't think we need anything from QEMU here. The solution I describe ought to be achievable with just a little more intelligence in the way libvirt does setup before exec'ing QEMU.
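For reference, the QEMU options mentioned in comment 4 take roughly this form on the command line (the id and size values here are illustrative only):

  -object memory-backend-ram,id=ram-node0,size=2G,policy=bind,host-nodes=2 \
  -numa node,nodeid=0,memdev=ram-node0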

Comment 7 Michal Privoznik 2019-01-30 14:19:53 UTC
Patch proposed upstream:

https://www.redhat.com/archives/libvir-list/2019-January/msg01228.html

Comment 11 Michal Privoznik 2019-02-01 11:55:15 UTC
To POST:

commit f136b83139c63f20de0df3285d9e82df2fb97bfc
Author:     Michal Privoznik <mprivozn>
AuthorDate: Wed Jan 30 09:46:23 2019 +0100
Commit:     Michal Privoznik <mprivozn>
CommitDate: Fri Feb 1 12:53:46 2019 +0100

    qemu: Rework setting process affinity
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1503284
    
    The way we currently start qemu from CPU affinity POV is as
    follows:
    
      1) the child process is set affinity to all online CPUs (unless
      some vcpu pinning was given in the domain XML)
    
      2) Once qemu is running, cpuset cgroup is configured taking
      memory pinning into account
    
    Problem is that we let qemu allocate its memory just anywhere in
    1) and then rely in 2) to be able to move the memory to
    configured NUMA nodes. This might not be always possible (e.g.
    qemu might lock some parts of its memory) and is very suboptimal
    (copying large memory between NUMA nodes takes significant amount
    of time).
    
    The solution is to set affinity to one of (in priority order):
      - The CPUs associated with NUMA memory affinity mask
      - The CPUs associated with emulator pinning
      - All online host CPUs
    
    Later (once QEMU has allocated its memory) we then change this
    again to (again in priority order):
      - The CPUs associated with emulator pinning
      - The CPUs returned by numad
      - The CPUs associated with vCPU pinning
      - All online host CPUs
    
    Signed-off-by: Michal Privoznik <mprivozn>
    Reviewed-by: Daniel P. Berrangé <berrange>


v5.0.0-199-gf136b83139
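With the fix applied, the resulting memory and CPU placement of a running domain can be inspected with the usual virsh queries (a quick sanity check; substitute the domain name under test):

  # virsh numatune <domain>
  # virsh emulatorpin <domain>
  # virsh vcpupin <domain>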

Comment 12 Jing Qi 2019-02-13 06:27:26 UTC
Built an rpm from upstream code with the patch:
libvirt-5.1.0-1.el7.x86_64
(The test machine with 8 cells doesn't support RHEL 8.)

Here is the CPU info for the NUMA cells:

available: 8 nodes (0-7)
node 0 cpus: 0 4 8 12 16 20 24 28
node 0 size: 16349 MB
node 0 free: 13920 MB
node 1 cpus: 32 36 40 44 48 52 56 60
node 1 size: 16384 MB
node 1 free: 15345 MB
node 2 cpus: 1 5 9 13 17 21 25 29
node 2 size: 16384 MB
node 2 free: 13986 MB
node 3 cpus: 33 37 41 45 49 53 57 61
node 3 size: 16384 MB
node 3 free: 15432 MB
node 4 cpus: 2 6 10 14 18 22 26 30
node 4 size: 16384 MB
node 4 free: 14840 MB
node 5 cpus: 34 38 42 46 50 54 58 62
node 5 size: 16384 MB
node 5 free: 15387 MB
node 6 cpus: 35 39 43 47 51 55 59 63
node 6 size: 16384 MB
node 6 free: 15370 MB
node 7 cpus: 3 7 11 15 19 23 27 31
node 7 size: 16367 MB
node 7 free: 12508 MB


Used the below domain XML with a vfio device:

<maxMemory slots='8' unit='KiB'>8938496</maxMemory>
  <memory unit='KiB'>2097152</memory>
  <currentMemory unit='KiB'>2097152</currentMemory>
  <vcpu placement='static'>8</vcpu>
  <numatune>
    <memory mode='strict' nodeset='7'/>
  </numatune>
....
    <interface type='hostdev' managed='yes'>
      <mac address='52:54:00:c6:b7:84'/>
      <source>
        <address type='pci' domain='0x0000' bus='0x44' slot='0x10' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0'/>
    </interface>

 # virsh start avocado-vt-vm-numa0
Domain avocado-vt-vm-numa0 started

The domain started quickly and the memory is on node 7:
Per-node process memory usage (in MBs) for PID 4197 (qemu-kvm)
                           Node 0          Node 1          Node 2
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.00            0.00            0.00
Stack                        0.00            0.00            0.00
Private                      0.00            0.00            0.02
----------------  --------------- --------------- ---------------
Total                        0.00            0.00            0.02

                           Node 3          Node 4          Node 5
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.00            0.00            0.00
Stack                        0.00            0.00            0.00
Private                      0.01            0.21            0.00
----------------  --------------- --------------- ---------------
Total                        0.01            0.21            0.00

                           Node 6          Node 7           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.00           55.36           55.36
Stack                        0.00            0.61            0.61
Private                      0.00         2525.26         2525.50
----------------  --------------- --------------- ---------------
Total                        0.00         2581.23         2581.48


But the CPUs the vCPUs are running on are not all from node 7:


VCPU:           0
CPU:            60
State:          running
CPU time:       14.4s
CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy

VCPU:           1
CPU:            27
State:          running
CPU time:       1.1s
CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy

VCPU:           2
CPU:            15
State:          running
CPU time:       1.3s
CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy

VCPU:           3
CPU:            35
State:          running
CPU time:       1.2s
CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy

VCPU:           4
CPU:            3
State:          running
CPU time:       1.2s
CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy

VCPU:           5
CPU:            31
State:          running
CPU time:       0.9s
CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy

VCPU:           6
CPU:            34
State:          running
CPU time:       0.8s
CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy

VCPU:           7
CPU:            11
State:          running
CPU time:       1.1s
CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy

Comment 13 Jing Qi 2019-02-13 06:39:04 UTC
Michal,

Can you please help check whether the CPU affinity result in the above comment is as expected?

Thanks,
Jing Qi

Comment 14 Michal Privoznik 2019-02-13 12:49:02 UTC
(In reply to Jing Qi from comment #13)
> Michal,
> 
> Can you please help check whether the CPU affinity result in the above
> comment is as expected?
> 
> Thanks,
> Jing Qi

I think the results are expected. After all, you haven't set up any vcpu pinning. To cite the docs: "If both cpuset and placement are not specified or if placement is "static", but no cpuset is specified, the domain process will be pinned to all the available physical CPUs."

And the memory is pinned to the right node and there was no memory movement apparently, since the guest started instantly as you said.

Comment 15 Jing Qi 2019-02-14 01:50:46 UTC
(In reply to Michal Privoznik from comment #14)
> (In reply to Jing Qi from comment #13)
> > Michal,
> > 
> > Can you please help check whether the CPU affinity result in the above
> > comment is as expected?
> > 
> > Thanks,
> > Jing Qi
> 
> I think the results are expected. After all, you haven't set up any vcpu
> pinning. To cite the docs: "If both cpuset and placement are not specified
> or if placement is "static", but no cpuset is specified, the domain process
> will be pinned to all the available physical CPUs."
> 
> And the memory is pinned to the right node and there was no memory movement
> apparently, since the guest started instantly as you said.

OK, Michal. I asked for your input because of the solution you described in comment 11.

Comment 17 Jing Qi 2019-05-27 04:09:18 UTC
Verified with libvirt-daemon-5.3.0-1.module+el8.1.0+3225+a8268fde.x86_64 & qemu-kvm-2.12.0-73.module+el8.1.0+3196+302d7a44.x86_64

With an Intel Corporation Ethernet Connection X722 card in cell 0 of an 8-cell machine, the following domain XML is configured:

<vcpu placement='static'>4</vcpu>
  <numatune>
    <memory mode='strict' nodeset='7'/>
  </numatune>
  <cputune>
    <vcpupin vcpu='0' cpuset='32'/>
    <vcpupin vcpu='1' cpuset='33'/>
    <vcpupin vcpu='2' cpuset='34'/>
    <vcpupin vcpu='3' cpuset='35'/>
  </cputune>
....
   <interface type='hostdev' managed='yes'>
      <mac address='be:04:03:30:10:02'/>
      <source>
        <address type='pci' domain='0x0000' bus='0x1a' slot='0x02' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0'/>
    </interface>

The domain is started. Check the vcpu info:
#virsh vcpuinfo avocado-vt-vm2
VCPU:           0
CPU:            32
State:          running
CPU time:       11.2s
CPU Affinity:   --------------------------------y-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

VCPU:           1
CPU:            33
State:          running
CPU time:       3.6s
CPU Affinity:   ---------------------------------y------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

VCPU:           2
CPU:            34
State:          running
CPU time:       4.2s
CPU Affinity:   ----------------------------------y-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

VCPU:           3
CPU:            35
State:          running
CPU time:       2.9s
CPU Affinity:   -----------------------------------y----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Per-node process memory usage (in MBs) for PID 22053 (qemu-kvm)
                           Node 0          Node 1          Node 2
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.00            0.00            0.00
Stack                        0.00            0.00            0.00
Private                      0.83            0.22            0.41
----------------  --------------- --------------- ---------------
Total                        0.83            0.22            0.41

                           Node 3          Node 4          Node 5
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.00            0.00            0.00
Stack                        0.00            0.00            0.00
Private                      9.16            2.90            2.77
----------------  --------------- --------------- ---------------
Total                        9.16            2.90            2.77

                           Node 6          Node 7           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.00            4.06            4.06
Stack                        0.00            0.03            0.03
Private                      0.34         1012.17         1028.80
----------------  --------------- --------------- ---------------
Total                        0.34         1016.26         1032.89


All the vcpus are allocated according to the cputune settings (on node 2) and the memory is assigned from node 7 according to numatune, as expected.

 


#numactl --ha
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
node 0 size: 773302 MB
node 0 free: 772819 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
node 1 size: 774137 MB
node 1 free: 773725 MB
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
node 2 size: 774137 MB
node 2 free: 773773 MB
node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
node 3 size: 774137 MB
node 3 free: 773287 MB
node 4 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
node 4 size: 774113 MB
node 4 free: 773738 MB
node 5 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
node 5 size: 774137 MB
node 5 free: 773743 MB
node 6 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
node 6 size: 774137 MB
node 6 free: 773036 MB
node 7 cpus: 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
node 7 size: 774134 MB
node 7 free: 771810 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  21  31  21  21  31  31  31 
  1:  21  10  21  31  31  21  31  31 
  2:  31  21  10  21  31  31  21  31 
  3:  21  31  21  10  31  31  31  21 
  4:  21  31  31  31  10  21  21  31 
  5:  31  21  31  31  21  10  31  21 
  6:  31  31  21  31  21  31  10  21 
  7:  31  31  31  21  31  21  21  10

Comment 20 Jing Qi 2019-08-06 06:43:28 UTC
Created attachment 1600839 [details]
code coverage report

Comment 21 Jing Qi 2019-08-06 09:41:44 UTC
Created attachment 1600918 [details]
code coverage report - new

Comment 22 Jing Qi 2019-08-06 09:57:55 UTC
Hi Michal, 

I uploaded the test code coverage report for the bug.  Can you please help check whether any other test scenarios need to be added?

The current test is to start the domain with a vfio interface (on node 0), with the memory specified to come from node 5 and the CPUs taken from node 2.

The domain XML is as below:

<maxMemory slots='8' unit='KiB'>838860800</maxMemory>
  <memory unit='KiB'>21495808</memory>
  <currentMemory unit='KiB'>20971520</currentMemory>
  <memtune>
    <hard_limit unit='KiB'>4503599627370495</hard_limit>
  </memtune>
  <numatune>
    <memory mode='strict' nodeset='5'/>
  </numatune>
....
  <vcpu placement='static' cpuset='0-9'>10</vcpu>
  <iothreads>1</iothreads>
  <iothreadids>
    <iothread id='1'/>
  </iothreadids>
  <cputune>
    <vcpupin vcpu='0' cpuset='32'/>
    <vcpupin vcpu='1' cpuset='33'/>
    <vcpupin vcpu='2' cpuset='34'/>
    <vcpupin vcpu='3' cpuset='35'/>
    <iothreadpin iothread='1' cpuset='40-41'/>
  </cputune>

...

 <cpu>
    <numa>
      <cell id='0' cpus='0-9' memory='20971520' unit='KiB'/>
    </numa>
  </cpu>
....

  <interface type='hostdev' managed='yes'>
      <mac address='be:04:03:30:10:32'/>
      <driver name='vfio'/>
      <source>
        <address type='pci' domain='0x0000' bus='0x1a' slot='0x06' function='0x3'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>



#numactl --ha
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
node 0 size: 773302 MB
node 0 free: 772819 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
node 1 size: 774137 MB
node 1 free: 773725 MB
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
node 2 size: 774137 MB
node 2 free: 773773 MB
node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
node 3 size: 774137 MB
node 3 free: 773287 MB
node 4 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
node 4 size: 774113 MB
node 4 free: 773738 MB
node 5 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
node 5 size: 774137 MB
node 5 free: 773743 MB
node 6 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
node 6 size: 774137 MB
node 6 free: 773036 MB
node 7 cpus: 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
node 7 size: 774134 MB
node 7 free: 771810 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  21  31  21  21  31  31  31 
  1:  21  10  21  31  31  21  31  31 
  2:  31  21  10  21  31  31  21  31 
  3:  21  31  21  10  31  31  31  21 
  4:  21  31  31  31  10  21  21  31 
  5:  31  21  31  31  21  10  31  21 
  6:  31  31  21  31  21  31  10  21 
  7:  31  31  31  21  31  21  21  10

Comment 23 Michal Privoznik 2019-08-06 11:50:23 UTC
(In reply to Jing Qi from comment #22)
> Hi Michal, 
> 
> I uploaded the test code coverage report for the bug.  Can you please help
> check whether any other test scenarios need to be added?

I think this one is enough.

Comment 24 Jing Qi 2019-08-07 01:53:06 UTC
Thanks Michal for your response.

Comment 26 errata-xmlrpc 2019-11-06 07:10:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3723