Bug 1267027 - numad consumes too much CPU, especially when incorrectly selecting assigned device VM [NEEDINFO]
Status: NEW
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: numad
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 7.3
Assigned To: Jan Synacek
QA Contact: qe-baseos-daemons
Reported: 2015-09-28 16:56 EDT by Alex Williamson
Modified: 2017-09-27 23:20 EDT

Type: Bug
jsynacek: needinfo? (bgray)


Description Alex Williamson 2015-09-28 16:56:53 EDT
Description of problem:

A customer reported numad consuming 100% CPU and interfering with the smooth operation of their GPU assigned VM when numad incorrectly chose an assigned device VM for memory migration.  The case I present is fabricated to make numad misbehave, but the customer was able to achieve this during normal operation.  Furthermore, even without device assignment, VM migration between nodes is CPU and memory intensive and an inopportune balance can significantly affect not only the application performance, but the system performance overall.

In my testing, a non-device-assignment VM using locked memory will migrate between nodes of a 2-socket system at ~1GB/s, so NUMA balancing of a 10GB VM disrupts the system for ~10s. This disruption is expected to scale linearly with VM size. In the case of an assigned device VM, attempting to migrate a 10GB VM disrupts the system for over 3 minutes (shown below). Again, this is expected to scale linearly with VM size, so a 64GB VM would disrupt the system for over 20 minutes.

We can see from the log below that the problem occurs after writing to the cgroup cpuset.mems file for the VM.  In fact, it's easy to reproduce the problem by manually writing these files with `echo`.
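As a concrete illustration, the manual `echo` reproduction might look like the sketch below. The helper name is hypothetical; the cgroup path is the one that appears in the log below, and the writes require root.

```shell
# Hypothetical helper wrapping the two cpuset writes numad performs.
# The cgroup directory is passed in so the sketch can be exercised
# against a scratch directory as well as a real cpuset mount.
move_guest_to_node() {
    cg="$1"    # cpuset cgroup directory for the guest's emulator
    node="$2"  # target NUMA node list, e.g. "0"
    # Ask the kernel to migrate pages whenever cpuset.mems changes...
    echo 1 > "$cg/cpuset.memory_migrate"
    # ...then restrict the guest to the target node. With an assigned
    # device the pinned pages cannot move, and this write is where the
    # 100% CPU spin occurs.
    echo "$node" > "$cg/cpuset.mems"
}

# On the system in this report (root required):
# move_guest_to_node \
#   '/sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dcentos6.scope/emulator' 0
```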

My impression is that a) memory migration should be rate-limited to the point where numad does not interfere with overall system performance, and b) numad should do a better job of detecting processes that cannot be migrated, either by inspecting the process or perhaps by migrating incrementally so that it can determine a migration is ineffective and abort without disrupting the system.

Version-Release number of selected component (if applicable):
numad-0.5-14.20140620git.el7.x86_64
kernel-3.10.0-316.el7.x86_64

How reproducible:
100%

Steps to Reproduce:

The easy way to reproduce is to manually use the sysfs cgroup interface that numad uses and attempt to bounce an assigned-device VM between nodes.

In order to induce an automatic migration with numad, I used a 2-socket system where node0 has more memory than node1. I used sysfs to offline all the CPUs on node0 except CPU0, and used the hugepage allocator to create as many 2MB hugepages as possible on node0. I then launched a VM with an assigned device, with vCPUs and memory approximately equal to node1. The VM is not configured for hugepages and has default locality, so the allocation spans nodes 0-1, but we know that nearly all of the memory is allocated on node1 since node0 is filled with hugepages.

At this point we reverse the load: take all of node1's CPUs offline, bring all of node0's CPUs online, and free all the hugepages on node0. Then restart numad.service. Since node0 now has more memory available and all of the available processors, numad will try to migrate the VM from nodes 0-1 to node0. During this "migration", numad consumes 100% CPU.
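The two-step topology manipulation above can be sketched with the standard sysfs knobs. This is a minimal illustration, not numad's own code: the CPU lists match the node layout shown in the logs below, the hugepage count of 7000 is an arbitrary assumption, and writing these files requires root. The sysfs root is a parameter so the helpers can be exercised against a scratch directory.

```shell
# Toggle the online state (0 or 1) of the listed CPUs under a given
# sysfs root. CPU0 cannot be offlined, so it is never in the list.
set_cpus_online() {
    sys="$1"; state="$2"; shift 2
    for c in "$@"; do
        echo "$state" > "$sys/devices/system/cpu/cpu$c/online"
    done
}

# Set the number of 2MB hugepages reserved on node0.
set_node0_hugepages() {
    sys="$1"; count="$2"
    echo "$count" > "$sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages"
}

# Step 1: offline node0 CPUs and fill node0 with hugepages, then boot
# the VM (CPU numbers from this report's topology):
#   set_cpus_online /sys 0 1 2 3 4 5 12 13 14 15 16 17
#   set_node0_hugepages /sys 7000
# Step 2: reverse the load (offline node1 CPUs, restore node0, free the
# hugepages) and restart numad to trigger the migration attempt:
#   set_cpus_online /sys 0 6 7 8 9 10 11 18 19 20 21 22 23
#   set_cpus_online /sys 1 1 2 3 4 5 12 13 14 15 16 17
#   set_node0_hugepages /sys 0
#   systemctl restart numad.service
```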

Another potential issue with numad here is that a restart is required to recognize the topology change.  It doesn't seem to look for changes on each iteration.

Obviously this requires significant manipulation to achieve the desired misbehavior, but any case of numad trying to migrate an assigned-device VM from a locality of 0-1 to either node 0 or node 1 (which is exactly the purpose of numad) will produce the same results.

Actual results:
numad consumes 100% CPU and disrupts system performance while initiating an ineffective migration.

Expected results:
Kernel interfaces should allow rate limiting and numad should be more aware of non-migratable memory.

Additional info:

Mon Sep 28 13:52:12 2015: Nodes: 2
Min CPUs free: 0, Max CPUs: 616, Avg CPUs: 308, StdDev: 308
Min MBs free: 1585, Max MBs: 15540, Avg MBs: 8562, StdDev: 6977.5
Node 0: MBs_total 16307, MBs_free  15540, CPUs_total 720, CPUs_free  616,  Distance: 10 20  CPUs: 0-5,12-17
Node 1: MBs_total 12288, MBs_free   1585, CPUs_total 0, CPUs_free    0,  Distance: 20 10  CPUs: 
Mon Sep 28 13:52:12 2015: Processes: 431
Mon Sep 28 13:52:12 2015: Candidates: 1
359767: PID 5096: (qemu-kvm), Threads 14, MBs_size  10885, MBs_used  10048, CPUs_used  100, Magnitude 1004800, Nodes: 0-1
Mon Sep 28 13:52:12 2015: PICK NODES FOR:  PID: 5096,  CPUs 117,  MBs 12805
Mon Sep 28 13:52:12 2015: PROCESS_MBs[0]: 19
Mon Sep 28 13:52:12 2015: PROCESS_MBs[1]: 10028
Mon Sep 28 13:52:12 2015:     Node[0]: mem: 136312  cpu: 1435
Mon Sep 28 13:52:12 2015: PROCESS_CPUs[1]: 77
Mon Sep 28 13:52:12 2015:     Node[1]: mem: 116130  cpu: 8
Mon Sep 28 13:52:12 2015: MBs: 12805,  CPUs: 117
Mon Sep 28 13:52:12 2015: Sorted magnitude[0]: 207833202
Mon Sep 28 13:52:12 2015: Sorted magnitude[1]: 987105
Mon Sep 28 13:52:12 2015:     Node[0]: mem: 8262  cpu: 499
Mon Sep 28 13:52:12 2015: Advising pid 5096 (qemu-kvm) move from nodes (0-1) to nodes (0)
Mon Sep 28 13:52:12 2015: Writing 1 to: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dcentos6.scope/emulator/cpuset.memory_migrate
Mon Sep 28 13:52:12 2015: Writing 0 to: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dcentos6.scope/emulator/cpuset.mems
Mon Sep 28 13:55:27 2015: Including PID: 5096 in cpuset: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dcentos6.scope/emulator
Mon Sep 28 13:55:27 2015: Writing 0-5,12-17 to: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dcentos6.scope/emulator/cpuset.cpus
Mon Sep 28 13:55:27 2015: Could not write 0-5,12-17 to /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dcentos6.scope/emulator/cpuset.cpus -- errno: 13
Mon Sep 28 13:55:27 2015: Could not configure cpuset: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dcentos6.scope/emulator

(I've already informed the customer that numad is ineffective for assigned device VMs and advised manual VM placement, but numad should be better behaved for such cases)
Comment 2 Alex Williamson 2015-09-30 15:20:43 EDT
Now I've seen this behavior with absolutely no special configuration. My guest effectively hung, so I ran top and saw numad consuming 100% CPU. The guest was only 4G in this case, so the hang was substantially shorter than above.

Wed Sep 30 13:16:08 2015: Nodes: 2
Min CPUs free: 651, Max CPUs: 714, Avg CPUs: 682, StdDev: 31.504
Min MBs free: 7661, Max MBs: 15236, Avg MBs: 11448, StdDev: 3787.5
Node 0: MBs_total 16307, MBs_free  15236, CPUs_total 720, CPUs_free  714,  Distance: 10 20  CPUs: 0-5,12-17
Node 1: MBs_total 12288, MBs_free   7661, CPUs_total 720, CPUs_free  651,  Distance: 20 10  CPUs: 6-11,18-23
Wed Sep 30 13:16:08 2015: Processes: 432
Wed Sep 30 13:16:08 2015: Candidates: 1
132068: PID 13333: (qemu-kvm), Threads  8, MBs_size   5143, MBs_used   4276, CPUs_used   63, Magnitude 269388, Nodes: 0-1
Wed Sep 30 13:16:08 2015: PICK NODES FOR:  PID: 13333,  CPUs 74,  MBs 6050
Wed Sep 30 13:16:08 2015: PROCESS_MBs[0]: 154
Wed Sep 30 13:16:08 2015: PROCESS_MBs[1]: 4122
Wed Sep 30 13:16:08 2015: PROCESS_CPUs[0]: 1
Wed Sep 30 13:16:08 2015:     Node[0]: mem: 88520  cpu: 1233
Wed Sep 30 13:16:08 2015: PROCESS_CPUs[1]: 42
Wed Sep 30 13:16:08 2015:     Node[1]: mem: 77699  cpu: 1211
Wed Sep 30 13:16:08 2015: MBs: 6050,  CPUs: 74
Wed Sep 30 13:16:08 2015: Sorted magnitude[0]: 115966732
Wed Sep 30 13:16:08 2015: Sorted magnitude[1]: 99974332
Wed Sep 30 13:16:08 2015:     Node[0]: mem: 28020  cpu: 641
Wed Sep 30 13:16:08 2015: Advising pid 13333 (qemu-kvm) move from nodes (0-1) to nodes (0)
Wed Sep 30 13:16:08 2015: Writing 1 to: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dwin8.1\x2d2.scope/emulator/cpuset.memory_migrate
Wed Sep 30 13:16:08 2015: Writing 0 to: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dwin8.1\x2d2.scope/emulator/cpuset.mems
Wed Sep 30 13:16:50 2015: Including PID: 13333 in cpuset: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dwin8.1\x2d2.scope/emulator
Wed Sep 30 13:16:50 2015: Writing 0-5,12-17 to: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dwin8.1\x2d2.scope/emulator/cpuset.cpus
Wed Sep 30 13:16:50 2015: Writing 1 to: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dwin8.1\x2d2.scope/emulator/cpuset.memory_migrate
Wed Sep 30 13:16:50 2015: Writing 0 to: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dwin8.1\x2d2.scope/emulator/cpuset.mems
Wed Sep 30 13:16:50 2015: Writing 0-5,12-17 to: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dwin8.1\x2d2.scope/emulator/cpuset.cpus
Wed Sep 30 13:16:50 2015: Writing 1 to: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dwin8.1\x2d2.scope/vcpu0/cpuset.memory_migrate
Wed Sep 30 13:16:50 2015: Writing 0 to: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dwin8.1\x2d2.scope/vcpu0/cpuset.mems
Wed Sep 30 13:17:00 2015: Writing 0-5,12-17 to: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dwin8.1\x2d2.scope/vcpu0/cpuset.cpus
Wed Sep 30 13:17:00 2015: Writing 1 to: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dwin8.1\x2d2.scope/vcpu1/cpuset.memory_migrate
Wed Sep 30 13:17:00 2015: Writing 0 to: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dwin8.1\x2d2.scope/vcpu1/cpuset.mems
Wed Sep 30 13:17:09 2015: Writing 0-5,12-17 to: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dwin8.1\x2d2.scope/vcpu1/cpuset.cpus
Wed Sep 30 13:17:09 2015: Writing 1 to: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dwin8.1\x2d2.scope/vcpu2/cpuset.memory_migrate
Wed Sep 30 13:17:09 2015: Writing 0 to: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dwin8.1\x2d2.scope/vcpu2/cpuset.mems
Wed Sep 30 13:17:19 2015: Writing 0-5,12-17 to: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dwin8.1\x2d2.scope/vcpu2/cpuset.cpus
Wed Sep 30 13:17:19 2015: Writing 1 to: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dwin8.1\x2d2.scope/vcpu3/cpuset.memory_migrate
Wed Sep 30 13:17:19 2015: Writing 0 to: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dwin8.1\x2d2.scope/vcpu3/cpuset.mems
Wed Sep 30 13:17:29 2015: Writing 0-5,12-17 to: /sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2dwin8.1\x2d2.scope/vcpu3/cpuset.cpus
Wed Sep 30 13:17:29 2015: PID 13333 moved to node(s) 0 in 80.65 seconds
