Bug 1640869 - [POWER]virsh numatune does not move the already allocated memory
Summary: [POWER]virsh numatune does not move the already allocated memory
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.0
Hardware: ppc64le
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.3
Assignee: Daniel Henrique Barboza (IBM)
QA Contact: Dan Zheng
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-10-19 02:34 UTC by Junxiang Li
Modified: 2020-11-17 17:45 UTC
CC List: 19 users

Fixed In Version: libvirt-6.5.0-1.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1198497
Environment:
Last Closed: 2020-11-17 17:44:45 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
guest xml (4.75 KB, text/plain)
2020-06-08 08:54 UTC, Dan Zheng

Description Junxiang Li 2018-10-19 02:34:19 UTC
+++ This bug was initially created as a clone of Bug #1198497 +++

Description of problem:
virsh numatune doesn't migrate memory in a running VM

Version-Release number of selected component (if applicable):


How reproducible:
always

Steps to Reproduce:
1. Start a VM with strict pinning to NUMA Node 0
2. Change the NUMA node via "virsh numatune <VM> --nodeset 1"

Actual results:
New memory gets allocated on the new NUMA node, but memory that was already allocated stays on NUMA node 0


Expected results:
As the allocation policy is strict, all memory should be moved to NUMA node 1


Additional info:
This happens because the cgroup has cpuset.memory_migrate set to 0.
If it is manually set to 1, the memory gets migrated.

Since the memory and the CPU set are typically pinned to the same NUMA node, having memory on another node slows down the VM whenever that memory is accessed.
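
For reference, a minimal sketch of that manual workaround on a cgroup v1 host (the cgroup path follows the cgget output in the reproduction below and will differ on other hosts; cgset is the write counterpart of cgget from libcgroup):

# cgset -r cpuset.memory_migrate=1 /libvirt/qemu/r6
# virsh numatune r6 0 1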

--- Additional comment from Martin Kletzander on 2015-03-09 04:03:56 EDT ---

I remember raising the memory_migrate question upstream, let me find the discussion (if I remember correctly that there was any).  In the meantime, setting it manually to 1 is a valid workaround that will not break libvirt.

--- Additional comment from Martin Kletzander on 2015-03-11 08:50:32 EDT ---

The discussion mentioned in comment #2 was private and mostly irrelevant, so I just composed a patch and sent it upstream:

https://www.redhat.com/archives/libvir-list/2015-March/msg00586.html

--- Additional comment from Martin Kletzander on 2015-03-20 08:45:24 EDT ---

Fixed upstream with v1.2.13-250-gba1dfc5..v1.2.13-251-g3a0e5b0: 

commit ba1dfc5b6a65914ec8ceadbcfbe16c17e83cc760
Author: Martin Kletzander <mkletzan>
Date:   Wed Mar 11 11:15:29 2015 +0100

    cgroup: Add accessors for cpuset.memory_migrate

commit 3a0e5b0c20815f986ac434e3df67f56d5d1aa44c
Author: Martin Kletzander <mkletzan>
Date:   Wed Mar 11 11:17:15 2015 +0100

    qemu: Migrate memory on numatune change

--- Additional comment from Luyao Huang on 2015-04-13 04:47:37 EDT ---

I can reproduce this issue with libvirt-0.10.2-51.el6.x86_64:

1. Prepare a guest with memory bound to the host in strict mode:
# virsh dumpxml r6
...
  <memory unit='KiB'>40240000</memory>
  <currentMemory unit='KiB'>30240000</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
...

2. start this vm

# virsh start r6
Domain r6 started

3. check cgroup and numa mem

# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 31656 (qemu-kvm)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                       196.49            0.00          196.49
Stack                        0.04            0.00            0.04
Private                  30486.73            5.62        30492.35
----------------  --------------- --------------- ---------------
Total                    30683.27            5.62        30688.88

# cgget -g cpuset /libvirt/qemu/r6
/libvirt/qemu/r6:
cpuset.memory_spread_slab: 0
cpuset.memory_spread_page: 0
cpuset.memory_pressure: 0
cpuset.memory_migrate: 0
cpuset.sched_relax_domain_level: -1
cpuset.sched_load_balance: 1
cpuset.mem_hardwall: 0
cpuset.mem_exclusive: 0
cpuset.cpu_exclusive: 0
cpuset.mems: 0
cpuset.cpus: 0-31

4. change the memory bind node

# virsh numatune r6 0 1

5. Recheck cgroup and NUMA:
# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 31656 (qemu-kvm)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                       196.49            0.00          196.49
Stack                        0.04            0.00            0.04
Private                  30494.73            5.62        30500.35
----------------  --------------- --------------- ---------------
Total                    30691.27            5.62        30696.88

# cgget -g cpuset /libvirt/qemu/r6
/libvirt/qemu/r6:
cpuset.memory_spread_slab: 0
cpuset.memory_spread_page: 0
cpuset.memory_pressure: 0
cpuset.memory_migrate: 0
cpuset.sched_relax_domain_level: -1
cpuset.sched_load_balance: 1
cpuset.mem_hardwall: 0
cpuset.mem_exclusive: 0
cpuset.cpu_exclusive: 0
cpuset.mems: 1
cpuset.cpus: 0-31



And verified this bug with libvirt-0.10.2-53.el6.x86_64:

1. Prepare a guest with memory bound to the host in strict mode:
# virsh dumpxml r6
...
  <memory unit='KiB'>40240000</memory>
  <currentMemory unit='KiB'>30240000</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
...

2. start this vm

# virsh start r6
Domain r6 started

3. Check cgroup and NUMA memory:
# cgget -g cpuset /libvirt/qemu/r6
/libvirt/qemu/r6:
cpuset.memory_spread_slab: 0
cpuset.memory_spread_page: 0
cpuset.memory_pressure: 0
cpuset.memory_migrate: 1
cpuset.sched_relax_domain_level: -1
cpuset.sched_load_balance: 1
cpuset.mem_hardwall: 0
cpuset.mem_exclusive: 0
cpuset.cpu_exclusive: 0
cpuset.mems: 0
cpuset.cpus: 0-31

# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 15103 (qemu-kvm)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                       196.49            0.00          196.49
Stack                        0.04            0.00            0.04
Private                     48.64            5.31           53.95
----------------  --------------- --------------- ---------------
Total                      245.17            5.31          250.48

4. Run a memory eater in the guest and recheck memory

# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 15103 (qemu-kvm)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                       196.49            0.00          196.49
Stack                        0.04            0.00            0.04
Private                  29629.27            5.32        29634.59
----------------  --------------- --------------- ---------------
Total                    29825.80            5.32        29831.12

5. migrate memory via virsh numatune:

# time virsh numatune r6 0 1


real	0m38.960s
user	0m0.030s
sys	0m0.034s

6. recheck cgroup and numa memory

# cgget -g cpuset /libvirt/qemu/r6
/libvirt/qemu/r6:
cpuset.memory_spread_slab: 0
cpuset.memory_spread_page: 0
cpuset.memory_pressure: 0
cpuset.memory_migrate: 1
cpuset.sched_relax_domain_level: -1
cpuset.sched_load_balance: 1
cpuset.mem_hardwall: 0
cpuset.mem_exclusive: 0
cpuset.cpu_exclusive: 0
cpuset.mems: 1
cpuset.cpus: 0-31

# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 15103 (qemu-kvm)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.00          196.49          196.49
Stack                        0.00            0.04            0.04
Private                      0.04        29638.54        29638.58
----------------  --------------- --------------- ---------------
Total                        0.04        29835.07        29835.11


7. Also tested with numad and placement='auto'; it works well.

Comment 1 Junxiang Li 2018-10-19 02:37:29 UTC
Version-Release number of selected component (if applicable):
# rpm -q libvirt qemu-kvm kernel numad
libvirt-4.5.0-9.el8+1790+2744186b.ppc64le
qemu-kvm-2.12.0-32.el8+1900+70997154.ppc64le
kernel-4.18.0-27.el8.ppc64le
numad-0.5-26.20150602git.el8.ppc64le

Comment 2 Daniel Henrique Barboza (IBM) 2020-05-28 23:18:42 UTC
I am willing to take a look at this, but I'm afraid I need more information:
test case, guest XML, expected result, and so on.

Comment 3 Dan Zheng 2020-06-08 08:52:00 UTC
Reproduced with
libvirt-6.4.0-1.module+el8.3.0+6881+88468c00.ppc64le
qemu-kvm-5.0.0-0.module+el8.3.0+6620+5d5e1420.ppc64le
kernel-4.18.0-209.el8.ppc64le

Also with RHEL 8.3 slow train:
libvirt-6.0.0-17.module+el8.3.0+6423+e4cb6418.ppc64le
qemu-kvm-4.2.0-19.module+el8.3.0+6473+93e27135.ppc64le

Case: Bind memory to node 0 -> use virsh numatune to migrate memory to node 8 -> check whether memory usage moves to node 8
Guest xml snippet:
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <vcpu placement='static'>20</vcpu>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>

1. Start guest
# virsh start vm1
Domain vm1 started

2. Check cgroup cpuset.memory_migrate; it is enabled:
# cat /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\\x2d1\\x2dvm1.scope/cpuset.memory_migrate 
1

3. Check memory usage
# numactl --hard
available: 2 nodes (0,8)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 0 size: 61281 MB
node 0 free: 55486 MB
node 8 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
node 8 size: 65303 MB
node 8 free: 47137 MB
node distances:
node   0   8 
  0:  10  40 
  8:  40  10 

# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 102664 (qemu-kvm)
                           Node 0          Node 8           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                        23.88            0.00           23.88
Stack                        0.06            0.00            0.06
Private                   1296.19           15.12         1311.31
----------------  --------------- --------------- ---------------
Total                     1320.12           15.12         1335.25


4. Migrate memory from node 0 to node 8
# virsh numatune vm1 0 8

5. Recheck memory usage
# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 102664 (qemu-kvm)
                           Node 0          Node 8           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.00           23.88           23.88
Stack                        0.06            0.00            0.06
Private                   1243.69           67.62         1311.31
----------------  --------------- --------------- ---------------
Total                     1243.75           91.50         1335.25

Actual result: 
As above: only a small part of the memory in use (the heap) is moved from node 0 to node 8; most of it remains on node 0.

Expected result:
Almost all of the memory should be moved from node 0 to node 8.

Comment 4 Dan Zheng 2020-06-08 08:54:25 UTC
Created attachment 1696020 [details]
guest xml

Comment 5 Daniel Henrique Barboza (IBM) 2020-06-09 09:56:03 UTC
Thanks for the info. I'll look into it.

Comment 6 Daniel Henrique Barboza (IBM) 2020-06-11 17:17:43 UTC
The behavior you're seeing is not properly documented in the virsh manpage, but it is
expected.

I can reproduce the exact behavior you're seeing, but only with an idle guest. Any
guest activity that causes memory pages to be used (e.g. stress tests, or rebooting
the guest) will cause the memory to be migrated to the new nodeset, in your case node 8.

There are two reasons for this. First, memory migration, as documented in the Linux
kernel's cgroups documentation, does not happen all at once; it depends on tasks
touching the affected pages. An idle guest touches only a small amount of memory,
which is why you are seeing these results.

The second reason is QEMU itself. QEMU locks parts of the memory for its internal use
(VFIO, for example), and this locked memory will not migrate to node 8.
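
One hedged way to inspect this from the host is to ask the kernel how much of the QEMU
process memory is locked (a diagnostic sketch only; adjust the binary name, qemu-kvm or
qemu-system-ppc64, to your setup):

$ sudo grep VmLck /proc/`pidof qemu-system-ppc64`/status
$ sudo awk '/^Locked:/ {sum += $2} END {print sum, "kB locked"}' /proc/`pidof qemu-system-ppc64`/smaps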


As an experiment I attempted to execute numatune right after the guest started, and
this is the result:

$ sudo ./run tools/virsh start numatunetest

$ sudo ./run tools/virsh numatune numatunetest 0 8 --live

$ sudo numastat -p `pidof qemu-system-ppc64`

Per-node process memory usage (in MBs) for PID 65287 (qemu-system-ppc)
                           Node 0          Node 8           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.00           12.19           12.19
Stack                        0.00            0.06            0.06
Private                      6.25         1164.50         1170.75
----------------  --------------- --------------- ---------------
Total                        6.25         1176.75         1183.00  


As you can see, almost all of the memory was migrated to node 8 because QEMU had not
yet had time to lock all of its memory, and the memory pages used for the guest boot
were allocated on node 8 instead of node 0. Attempting to migrate the memory back to
node 0 while the guest is idle produces results similar to what you are already seeing.


My conclusion is that there is no bug, but rather a lack of documentation letting users
know that the memory migration is not immediate after 'virsh numatune' is executed. I
proposed a doc change in virsh mentioning the two factors above [1]; unfortunately, in
my opinion, that is all that needs to be done for this bug at the moment.



[1] https://www.redhat.com/archives/libvir-list/2020-June/msg00511.html

Comment 7 David Gibson 2020-06-15 04:21:01 UTC
Dan,

Does that seem reasonable?  Can we close this as NOTABUG?

Comment 8 Andrea Bolognani 2020-06-16 11:55:46 UTC
There's nothing inherently sensitive in this bug, so I've edited
Comment 0 to strip out unnecessary / internal information and opened
it up to everyone.

Comment 9 Daniel Henrique Barboza (IBM) 2020-06-16 17:38:31 UTC
The documentation fix was accepted upstream:

commit 3a58613b0cf6a29960b909e6fd7420639ff794bd
Author: Daniel Henrique Barboza <danielhb413>
Date:   Thu Jun 11 14:00:29 2020 -0300

    manpages/virsh.rst: clarify numatune memory migration on Linux

v6.4.0-111-g3a58613b0c

Comment 15 Dan Zheng 2020-06-28 13:31:17 UTC
Comparing with the result on x86_64, I am a little confused by the difference.
Is this difference expected?


Edit guest with
  <memory unit='KiB'>1048576</memory>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>

Start guest


On x86_64,
# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 85320 (qemu-kvm)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         9.09            0.00            9.09
Stack                        0.03            0.00            0.04
Private                    661.05            5.50          666.54
----------------  --------------- --------------- ---------------
Total                      670.17            5.50          675.67

# virsh numatune avocado-vt-vm1 0 1 --live
# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 85320 (qemu-kvm)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.00            9.66            9.66
Stack                        0.00            0.04            0.04
Private                      0.02          666.55          666.57
----------------  --------------- --------------- ---------------
Total                        0.02          676.24          676.26

There is no workload within the guest, but the memory can still be migrated to Node 1.
**********************************************************************
On Power 9,
# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 275357 (qemu-kvm)
                           Node 0          Node 8           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                        24.56            0.00           24.56
Stack                        0.12            0.00            0.12
Private                   1027.81            0.00         1027.81
----------------  --------------- --------------- ---------------
Total                     1052.50            0.00         1052.50

# virsh numatune vm1 0 8 --live

# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 275357 (qemu-kvm)
                           Node 0          Node 8           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.00           24.56           24.56
Stack                        0.12            0.00            0.12
Private                    959.31           68.50         1027.81
----------------  --------------- --------------- ---------------
Total                      959.44           93.06         1052.50

Comment 16 Dan Zheng 2020-06-28 14:28:47 UTC
I also tried the steps below:
Edit guest using 
<memory unit='KiB'>8388608</memory>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>

Start guest
# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 277479 (qemu-kvm)
                           Node 0          Node 8           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                        23.88            0.00           23.88
Stack                        0.12            0.00            0.12
Private                   1254.12            0.00         1254.12
----------------  --------------- --------------- ---------------
Total                     1278.12            0.00         1278.12

Start a stress tool within the guest to consume memory
[in guest]# swapoff -a
[in guest]# free -m
              total        used        free      shared  buff/cache   available
Mem:           7606         610        6660          21         335        6553
# stress --vm 6 --vm-bytes 1G --vm-keep 
stress: info: [1511] dispatching hogs: 0 cpu, 0 io, 6 vm, 0 hdd

Check memory within the guest
# free -m
              total        used        free      shared  buff/cache   available
Mem:           7606        6794         460          21         352         361
Swap:             0           0           0

Check numa stats
# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 277479 (qemu-kvm)
                           Node 0          Node 8           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                        23.88            0.00           23.88
Stack                        0.12            0.00            0.12
Private                   7380.19            0.00         7380.19
----------------  --------------- --------------- ---------------
Total                     7404.19            0.00         7404.19

Here, memory usage on Node 0 increased by more than 6 GB.

Run '# virsh numatune vm1 0 8 --live'

Check numa stats repeatedly for about 10 minutes

# numastat -p `pidof qemu-kvm`
                           Node 0          Node 8           Total

Private                   7055.31          324.88         7380.19
                  --------------- --------------- ---------------
Private                   5037.19         2459.00         7496.19

Finally I can see the NUMA node memory migrating.
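
A convenient way to watch that progress instead of rechecking by hand (a small sketch;
numastat -p also accepts a process name pattern):

# watch -n 10 numastat -p qemu-kvm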

Comment 17 Daniel Henrique Barboza (IBM) 2020-07-02 23:24:46 UTC
Dan,

(In reply to Dan Zheng from comment #15)
> Comparing with the result on x86_64, I am a little confused about the
> difference.
> Is this difference as expected?


Sorry for the delay. It took some time to get an x86 environment to reproduce your x86 test. I can confirm that, on my x86 setup, 'virsh numatune' migrates almost all of the guest memory much more quickly than on my Power9 setup.

In an attempt to rule out the factor "the pseries guest is locking too much memory", I ran a Power9 TCG guest on the same x86 host, and the same behavior reproduces: the memory is migrated to the other NUMA node almost instantly. Although there are obvious differences between a pseries guest running under TCG and under KVM, everything else is the same pseries code. This might indicate that the memory locking happens in KVM-specific code, but that is just a hunch. There is also a difference in NUMA topology: my x86 setup uses contiguous NUMA nodes (0 and 1), while my ppc64 setup uses non-contiguous nodes (0, 8, 252, 253, 254, 255). It is already known that libnuma is broken on ppc64 because of that, and I wonder whether this non-contiguous NUMA layout is a factor here as well.

It's important to remember that what is happening on Power is not bogus, as I stated in comment 6. The issue is that x86 migrates memory much more quickly. I believe a deeper investigation into the QEMU/kernel PPC internals is warranted to help us understand why Power takes longer than its x86 counterpart to execute the same cgroup operation.
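
As a possible way to separate the cgroup-driven path from raw kernel page migration, the
migratepages tool from the numactl package can move a process's pages directly (a sketch;
the node numbers follow the ppc64 host above, and the timing is only indicative):

# time migratepages `pidof qemu-kvm` 0 8
# numastat -p `pidof qemu-kvm`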


Answering your question, yeah, for now we can expect that x86_64 outperforms ppc64 in 'virsh numatune' operations.

Comment 18 Dan Zheng 2020-07-13 10:05:55 UTC
Changing it to ON_QA as the 'Fixed In Version' field is filled in.

Comment 19 Dan Zheng 2020-07-13 10:07:07 UTC
Package:
libvirt-6.5.0-1.module+el8.3.0+7323+d54bb644.ppc64le

Checked man virsh and found that the documentation is already updated for numatune.

numatune
      ...

       For running guests in Linux hosts, the changes made in the domain's numa parameters does not imply that the guest memory will be moved to a different nodeset immediately.
       The memory migration depends on the guest activity, and the memory of an idle guest will remain in its previous nodeset for longer. The presence of VFIO devices will also
       lock parts of the guest memory in the same nodeset used to start the guest, regardless of nodeset changes.



So marking it as verified.

Comment 25 errata-xmlrpc 2020-11-17 17:44:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5137

