+++ This bug was initially created as a clone of Bug #1198497 +++

Description of problem:
virsh numatune doesn't migrate memory in a running VM

Version-Release number of selected component (if applicable):

How reproducible:
always

Steps to Reproduce:
1. Start a VM with strict pinning to NUMA node 0
2. Change the NUMA node via "virsh numatune <VM> --nodeset 1"

Actual results:
New memory gets allocated on the new NUMA node, but already allocated memory stays on node 0.

Expected results:
As the allocation policy is strict, all memory should be moved to NUMA node 1.

Additional info:
This happens because the cgroup has cpuset.memory_migrate set to 0. If it is set to 1 manually, the memory does get migrated. As the memory nodeset and the CPU set are typically pinned to the same NUMA node, having memory on another node slows down the VM whenever that memory is accessed.

--- Additional comment from Martin Kletzander on 2015-03-09 04:03:56 EDT ---

I remember raising the memory_migrate question upstream; let me find the discussion (if I remember correctly that there was one). In the meantime, setting it to 1 manually is a valid workaround that will not break libvirt.
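The manual workaround mentioned above can be sketched as a tiny helper. The cgroup path in the usage example is an assumption — the cpuset cgroup directory for a domain depends on the libvirt version and cgroup layout, so locate yours under /sys/fs/cgroup/cpuset first:

```shell
# Hedged sketch: enable cpuset.memory_migrate for a domain's cpuset cgroup
# so that pages already allocated follow future cpuset.mems changes.
set_memory_migrate() {
    cgdir=$1    # cpuset cgroup directory of the domain
    value=$2    # 0 or 1
    printf '%s\n' "$value" > "$cgdir/cpuset.memory_migrate"
}

# Usage (requires root; the path below is an assumed example, check your host):
# set_memory_migrate /sys/fs/cgroup/cpuset/libvirt/qemu/r6 1
```

Note that cpuset.memory_migrate only affects subsequent changes to cpuset.mems, so it has to be enabled before running virsh numatune.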
--- Additional comment from Martin Kletzander on 2015-03-11 08:50:32 EDT ---

The discussion mentioned in comment #2 was private and mostly irrelevant, so I just composed a patch and sent it upstream:
https://www.redhat.com/archives/libvir-list/2015-March/msg00586.html

--- Additional comment from Martin Kletzander on 2015-03-20 08:45:24 EDT ---

Fixed upstream with v1.2.13-250-gba1dfc5..v1.2.13-251-g3a0e5b0:

commit ba1dfc5b6a65914ec8ceadbcfbe16c17e83cc760
Author: Martin Kletzander <mkletzan>
Date:   Wed Mar 11 11:15:29 2015 +0100

    cgroup: Add accessors for cpuset.memory_migrate

commit 3a0e5b0c20815f986ac434e3df67f56d5d1aa44c
Author: Martin Kletzander <mkletzan>
Date:   Wed Mar 11 11:17:15 2015 +0100

    qemu: Migrate memory on numatune change

--- Additional comment from Luyao Huang on 2015-04-13 04:47:37 EDT ---

I can reproduce this issue with libvirt-0.10.2-51.el6.x86_64:

1. Prepare a VM with guest memory bound to the host in strict mode:

# virsh dumpxml r6
...
  <memory unit='KiB'>40240000</memory>
  <currentMemory unit='KiB'>30240000</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
...

2. Start the VM:

# virsh start r6
Domain r6 started

3. Check the cgroup and NUMA memory usage:

# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 31656 (qemu-kvm)
                  Node 0          Node 1           Total
         --------------- --------------- ---------------
Huge                0.00            0.00            0.00
Heap              196.49            0.00          196.49
Stack               0.04            0.00            0.04
Private         30486.73            5.62        30492.35
         --------------- --------------- ---------------
Total           30683.27            5.62        30688.88

# cgget -g cpuset /libvirt/qemu/r6
/libvirt/qemu/r6:
cpuset.memory_spread_slab: 0
cpuset.memory_spread_page: 0
cpuset.memory_pressure: 0
cpuset.memory_migrate: 0
cpuset.sched_relax_domain_level: -1
cpuset.sched_load_balance: 1
cpuset.mem_hardwall: 0
cpuset.mem_exclusive: 0
cpuset.cpu_exclusive: 0
cpuset.mems: 0
cpuset.cpus: 0-31

4. Change the memory binding node:

# virsh numatune r6 0 1

5. Recheck the cgroup and NUMA memory usage:

# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 31656 (qemu-kvm)
                  Node 0          Node 1           Total
         --------------- --------------- ---------------
Huge                0.00            0.00            0.00
Heap              196.49            0.00          196.49
Stack               0.04            0.00            0.04
Private         30494.73            5.62        30500.35
         --------------- --------------- ---------------
Total           30691.27            5.62        30696.88

# cgget -g cpuset /libvirt/qemu/r6
/libvirt/qemu/r6:
cpuset.memory_spread_slab: 0
cpuset.memory_spread_page: 0
cpuset.memory_pressure: 0
cpuset.memory_migrate: 0
cpuset.sched_relax_domain_level: -1
cpuset.sched_load_balance: 1
cpuset.mem_hardwall: 0
cpuset.mem_exclusive: 0
cpuset.cpu_exclusive: 0
cpuset.mems: 1
cpuset.cpus: 0-31

And I verified this bug with libvirt-0.10.2-53.el6.x86_64:

1. Prepare a VM with guest memory bound to the host in strict mode:

# virsh dumpxml r6
...
  <memory unit='KiB'>40240000</memory>
  <currentMemory unit='KiB'>30240000</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
...

2. Start the VM:

# virsh start r6
Domain r6 started

3. Check the cgroup and NUMA memory usage:

# cgget -g cpuset /libvirt/qemu/r6
/libvirt/qemu/r6:
cpuset.memory_spread_slab: 0
cpuset.memory_spread_page: 0
cpuset.memory_pressure: 0
cpuset.memory_migrate: 1
cpuset.sched_relax_domain_level: -1
cpuset.sched_load_balance: 1
cpuset.mem_hardwall: 0
cpuset.mem_exclusive: 0
cpuset.cpu_exclusive: 0
cpuset.mems: 0
cpuset.cpus: 0-31

# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 15103 (qemu-kvm)
                  Node 0          Node 1           Total
         --------------- --------------- ---------------
Huge                0.00            0.00            0.00
Heap              196.49            0.00          196.49
Stack               0.04            0.00            0.04
Private            48.64            5.31           53.95
         --------------- --------------- ---------------
Total             245.17            5.31          250.48

4. Run a memory eater in the guest and recheck memory usage:

# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 15103 (qemu-kvm)
                  Node 0          Node 1           Total
         --------------- --------------- ---------------
Huge                0.00            0.00            0.00
Heap              196.49            0.00          196.49
Stack               0.04            0.00            0.04
Private         29629.27            5.32        29634.59
         --------------- --------------- ---------------
Total           29825.80            5.32        29831.12

5. Migrate memory via virsh numatune:

# time virsh numatune r6 0 1

real    0m38.960s
user    0m0.030s
sys     0m0.034s

6. Recheck the cgroup and NUMA memory usage:

# cgget -g cpuset /libvirt/qemu/r6
/libvirt/qemu/r6:
cpuset.memory_spread_slab: 0
cpuset.memory_spread_page: 0
cpuset.memory_pressure: 0
cpuset.memory_migrate: 1
cpuset.sched_relax_domain_level: -1
cpuset.sched_load_balance: 1
cpuset.mem_hardwall: 0
cpuset.mem_exclusive: 0
cpuset.cpu_exclusive: 0
cpuset.mems: 1
cpuset.cpus: 0-31

# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 15103 (qemu-kvm)
                  Node 0          Node 1           Total
         --------------- --------------- ---------------
Huge                0.00            0.00            0.00
Heap                0.00          196.49          196.49
Stack               0.00            0.04            0.04
Private             0.04        29638.54        29638.58
         --------------- --------------- ---------------
Total               0.04        29835.07        29835.11

7. Also tested with numad and placement='auto'; it works well.
Version-Release number of selected component (if applicable):

# rpm -q libvirt qemu-kvm kernel numad
libvirt-4.5.0-9.el8+1790+2744186b.ppc64le
qemu-kvm-2.12.0-32.el8+1900+70997154.ppc64le
kernel-4.18.0-27.el8.ppc64le
numad-0.5-26.20150602git.el8.ppc64le
I am willing to take a look into this, but I'm afraid I need more information: test case, guest XML, expected result, and so on.
Reproduced with:
libvirt-6.4.0-1.module+el8.3.0+6881+88468c00.ppc64le
qemu-kvm-5.0.0-0.module+el8.3.0+6620+5d5e1420.ppc64le
kernel-4.18.0-209.el8.ppc64le

Also with RHEL 8.3 slow train:
libvirt-6.0.0-17.module+el8.3.0+6423+e4cb6418.ppc64le
qemu-kvm-4.2.0-19.module+el8.3.0+6473+93e27135.ppc64le

Case: bind memory to node 0 -> run virsh numatune to migrate memory to node 8 -> check whether memory usage has moved to node 8.

Guest XML snippet:

  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <vcpu placement='static'>20</vcpu>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>

1. Start the guest:

# virsh start vm1
Domain vm1 started

2. Check cgroup cpuset.memory_migrate; it is enabled:

# cat /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\\x2d1\\x2dvm1.scope/cpuset.memory_migrate
1

3. Check memory usage:

# numactl --hard
available: 2 nodes (0,8)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 0 size: 61281 MB
node 0 free: 55486 MB
node 8 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
node 8 size: 65303 MB
node 8 free: 47137 MB
node distances:
node   0   8
  0:  10  40
  8:  40  10

# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 102664 (qemu-kvm)
                  Node 0          Node 8           Total
         --------------- --------------- ---------------
Huge                0.00            0.00            0.00
Heap               23.88            0.00           23.88
Stack               0.06            0.00            0.06
Private          1296.19           15.12         1311.31
         --------------- --------------- ---------------
Total            1320.12           15.12         1335.25

4. Migrate memory from node 0 to node 8:

# virsh numatune vm1 0 8

5. Recheck memory usage:

# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 102664 (qemu-kvm)
                  Node 0          Node 8           Total
         --------------- --------------- ---------------
Huge                0.00            0.00            0.00
Heap                0.00           23.88           23.88
Stack               0.06            0.00            0.06
Private          1243.69           67.62         1311.31
         --------------- --------------- ---------------
Total            1243.75           91.50         1335.25

Actual result:
As above. A small part of the used memory is moved from node 0 to node 8 (Heap), but most of it stays on node 0.

Expected result:
Almost all of the memory should be moved from node 0 to node 8.
Created attachment 1696020 [details] guest xml
Thanks for the info. I'll look into it.
The behavior you're seeing is not properly documented in the virsh manpage, but it is expected. I can reproduce the exact behavior you're seeing, but only with an idle guest. Any guest activity that causes memory pages to be used (e.g. stress tests, or rebooting the guest) will cause the memory to be migrated to the new nodeset, in your case node 8.

There are two reasons for this. First, memory migration, as documented in the cgroups documentation of the Linux kernel, does not happen all at once. It depends on tasks using the affected pages. An idle guest uses only a small amount of memory, and this is why you're seeing these results. The second reason is QEMU itself. QEMU locks parts of the memory for its internal use (VFIO, for example), and this memory won't migrate to node 8.

As an experiment I executed numatune right after the guest started, and this is the result:

$ sudo ./run tools/virsh start numatunetest
$ sudo ./run tools/virsh numatune numatunetest 0 8 --live
$ sudo numastat -p `pidof qemu-system-ppc64`

Per-node process memory usage (in MBs) for PID 65287 (qemu-system-ppc)
                  Node 0          Node 8           Total
         --------------- --------------- ---------------
Huge                0.00            0.00            0.00
Heap                0.00           12.19           12.19
Stack               0.00            0.06            0.06
Private             6.25         1164.50         1170.75
         --------------- --------------- ---------------
Total               6.25         1176.75         1183.00

As you can see, almost all memory was migrated to node 8, because QEMU didn't yet have time to lock all of its memory, and the memory pages used for guest boot were allocated on node 8 instead of node 0. Attempting to migrate the memory back to node 0, with the guest idle, produces results similar to what you're already seeing.

My conclusion is that there is no bug, but rather a lack of documentation letting users know that memory migration is not immediate after executing 'virsh numatune'.
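Since the migration is gradual, its progress can be watched by sampling numastat over time. A minimal sketch — field positions are assumed from the numastat output format quoted above ($2 = first node column, $3 = second):

```shell
# Hedged sketch: pull one node's value out of the "Total" row of
# `numastat -p` output, so migration progress can be sampled over time.
total_on_node() {
    out=$1    # captured numastat -p output
    col=$2    # awk field number of the node column
    printf '%s\n' "$out" | awk -v col="$col" '$1 == "Total" { print $col }'
}

# Usage (PID lookup via pidof is an assumption about the host setup):
# while sleep 10; do total_on_node "$(numastat -p "$(pidof qemu-kvm)")" 3; done
```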
I proposed a doc change in virsh mentioning the two factors described above [1], but in my opinion that's all that needs to be done for this bug at the moment.

[1] https://www.redhat.com/archives/libvir-list/2020-June/msg00511.html
Dan,

Does that seem reasonable? Can we close this as NOTABUG?
There's nothing inherently sensitive in this bug, so I've edited Comment 0 to strip out unnecessary / internal information and opened it up to everyone.
The documentation fix was accepted upstream:

commit 3a58613b0cf6a29960b909e6fd7420639ff794bd
Author: Daniel Henrique Barboza <danielhb413>
Date:   Thu Jun 11 14:00:29 2020 -0300

    manpages/virsh.rst: clarify numatune memory migration on Linux

v6.4.0-111-g3a58613b0c
Comparing with the result on x86_64, I am a little confused about the difference. Is this difference expected?

Edit the guest with:

  <memory unit='KiB'>1048576</memory>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>

Start the guest.

On x86_64:

# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 85320 (qemu-kvm)
                  Node 0          Node 1           Total
         --------------- --------------- ---------------
Huge                0.00            0.00            0.00
Heap                9.09            0.00            9.09
Stack               0.03            0.00            0.04
Private           661.05            5.50          666.54
         --------------- --------------- ---------------
Total             670.17            5.50          675.67

# virsh numatune avocado-vt-vm1 0 1 --live

# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 85320 (qemu-kvm)
                  Node 0          Node 1           Total
         --------------- --------------- ---------------
Huge                0.00            0.00            0.00
Heap                0.00            9.66            9.66
Stack               0.00            0.04            0.04
Private             0.02          666.55          666.57
         --------------- --------------- ---------------
Total               0.02          676.24          676.26

There is no workload in the guest, but the memory can still be migrated to Node 1.

**********************************************************************

On Power 9:

# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 275357 (qemu-kvm)
                  Node 0          Node 8           Total
         --------------- --------------- ---------------
Huge                0.00            0.00            0.00
Heap               24.56            0.00           24.56
Stack               0.12            0.00            0.12
Private          1027.81            0.00         1027.81
         --------------- --------------- ---------------
Total            1052.50            0.00         1052.50

# virsh numatune vm1 0 8 --live

# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 275357 (qemu-kvm)
                  Node 0          Node 8           Total
         --------------- --------------- ---------------
Huge                0.00            0.00            0.00
Heap                0.00           24.56           24.56
Stack               0.12            0.00            0.12
Private           959.31           68.50         1027.81
         --------------- --------------- ---------------
Total             959.44           93.06         1052.50
Also I tried the steps below.

Edit the guest with:

  <memory unit='KiB'>8388608</memory>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>

Start the guest:

# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 277479 (qemu-kvm)
                  Node 0          Node 8           Total
         --------------- --------------- ---------------
Huge                0.00            0.00            0.00
Heap               23.88            0.00           23.88
Stack               0.12            0.00            0.12
Private          1254.12            0.00         1254.12
         --------------- --------------- ---------------
Total            1278.12            0.00         1278.12

Start a stress tool within the guest to consume memory:

[in guest]# swapoff -a
[in guest]# free -m
              total        used        free      shared  buff/cache   available
Mem:           7606         610        6660          21         335        6553

# stress --vm 6 --vm-bytes 1G --vm-keep
stress: info: [1511] dispatching hogs: 0 cpu, 0 io, 6 vm, 0 hdd

Check memory within the guest:

# free -m
              total        used        free      shared  buff/cache   available
Mem:           7606        6794         460          21         352         361
Swap:             0           0           0

Check NUMA stats:

# numastat -p `pidof qemu-kvm`

Per-node process memory usage (in MBs) for PID 277479 (qemu-kvm)
                  Node 0          Node 8           Total
         --------------- --------------- ---------------
Huge                0.00            0.00            0.00
Heap               23.88            0.00           23.88
Stack               0.12            0.00            0.12
Private          7380.19            0.00         7380.19
         --------------- --------------- ---------------
Total            7404.19            0.00         7404.19

Here 6G+ of additional memory shows up on Node 0.

Run '# virsh numatune vm1 0 8 --live' and check the NUMA stats repeatedly for about 10 minutes:

# numastat -p `pidof qemu-kvm`
                  Node 0          Node 8           Total
Private          7055.31          324.88         7380.19
         --------------- --------------- ---------------
Private          5037.19         2459.00         7496.19

Finally I can see the NUMA node memory migrating.
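The gradual progress visible in the repeated samples above can be quantified; a small helper (plain awk arithmetic, no libvirt-specific assumptions) turns a target-node value and the total into an integer percentage:

```shell
# Hedged sketch: express migration progress as an integer percentage,
# given the MBs already on the target node and the total MBs, as reported
# by two fields of a numastat sample.
migrated_pct() {
    awk -v on_target="$1" -v total="$2" \
        'BEGIN { printf "%d\n", (total > 0) ? on_target * 100 / total : 0 }'
}

# With the last sample above, roughly a third of the memory had moved:
# migrated_pct 2459.00 7496.19
```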
Dan,

(In reply to Dan Zheng from comment #15)
> Comparing with the result on x86_64, I am a little confused about the
> difference.
> Is this difference as expected?

Sorry for the delay. It took some time to get an x86 environment to reproduce your x86 test.

I can confirm that, on my x86 setup, 'virsh numatune' migrates almost all the guest memory far more quickly than on my Power9 setup. In an attempt to rule out the factor "the pseries guest is locking too much memory", I ran a Power9 TCG guest on the same x86 host, and the same behavior reproduces: the memory is migrated to another NUMA node almost instantly. Although there are quite obvious differences between a pseries guest running TCG and KVM, everything else is the same pseries code. This might indicate that the memory lock happens in KVM-specific code, but it's just a hunch.

There's also the difference in NUMA topology: my x86 setup uses contiguous NUMA nodes (0 and 1), while my ppc64 setup uses disjoint nodes (0, 8, 252, 253, 254, 255). It is already known that libnuma is broken on ppc64 because of that, and I'm wondering if this non-contiguous NUMA setup is at play here as well.

It's important to remember that what is happening on Power is not bogus, as I stated in comment 6. The issue is that x86 migrates memory far more quickly. I believe a deeper investigation into the QEMU/kernel PPC internals is warranted to help us understand why Power takes longer to execute the same cgroup operation as its x86 counterpart.

Answering your question: yes, for now we can expect x86_64 to outperform ppc64 in 'virsh numatune' operations.
Changing it to ON_QA, as the 'Fixed In Version' field is filled in.
Package: libvirt-6.5.0-1.module+el8.3.0+7323+d54bb644.ppc64le

Checked man virsh and found the documentation is already updated for numatune:

numatune
...
For running guests in Linux hosts, the changes made in the domain's numa parameters does not imply that the guest memory will be moved to a different nodeset immediately. The memory migration depends on the guest activity, and the memory of an idle guest will remain in its previous nodeset for longer. The presence of VFIO devices will also lock parts of the guest memory in the same nodeset used to start the guest, regardless of nodeset changes.

So marking it verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5137