Description of problem:
NUMA sampling causes very high load on the hypervisor, and the load grows over time.

Version-Release number of selected component (if applicable):
vdsm-4.17.35-1.el7ev.noarch

How reproducible:
100% in a specific environment

Steps to Reproduce:
1. see supervdsm logs

Actual results:
The load on the hypervisor is very high:

20:03:09 up 65 days, 23 min, 1 user, load average: 42.69, 41.55, 38.18

systemctl stop vdsmd

20:04:04 up 65 days, 24 min, 1 user, load average: 33.70, 39.56, 37.71
20:04:28 up 65 days, 24 min, 1 user, load average: 24.64, 36.98, 36.91
20:04:57 up 65 days, 25 min, 1 user, load average: 16.49, 33.83, 35.86
20:05:35 up 65 days, 25 min, 1 user, load average: 11.20, 30.59, 34.70
20:05:48 up 65 days, 26 min, 1 user, load average: 9.78, 29.33, 34.22

Additional info:
The issue was worked around by setting vm_sample_numa_interval = 600. NUMA stats are collected 3171 times in one hour for just 14 VMs.
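For scale: 14 VMs polled at the default 15s interval give 14 * (3600 / 15) = 3360 collections per hour, which lines up with the observed 3171. A minimal sketch of the workaround, assuming vm_sample_numa_interval belongs under the [vars] section of /etc/vdsm/vdsm.conf (the section placement is an assumption here, check your config layout):

# /etc/vdsm/vdsm.conf
[vars]
# Sample per-VM NUMA stats every 600s instead of every 15s
vm_sample_numa_interval = 600

followed by a vdsmd restart (systemctl restart vdsmd) to pick up the change.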
MOM has nothing to do with NUMA; moving to VDSM. There were also some big changes to monitoring in 4.0, so this might just be a matter of backporting. However, there is also the (fixed for at least 4.0 and up) bug about high load caused by disk IO tune queries: https://bugzilla.redhat.com/show_bug.cgi?id=1366556
*** Bug 1398953 has been marked as a duplicate of this bug. ***
msivak, can we consider removing the *VM* NUMA stats entirely? They are used for reporting only. The second option is to relax the interval, but I would prefer that if we don't need them, we just remove them.
It seems it is already removed in the 4.1 engine, but we need to instruct VDSM to limit the collection frequency (and possibly remove the code) too.
The code was dropped in 4.1 in bug 1148039, and it is unused in 3.6/4.0 as well. To minimize changes, we can just increase the poll interval from 15s to 1h.
I meant 600s; that value was actually tested in a real setup already.
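For illustration only, a generic Python sketch of why the interval change helps: a periodic collector fires once per interval, so the sampling load scales inversely with the interval. This is not vdsm's actual scheduler, and the names sample_numa_stats and run_periodic are hypothetical.

import threading

# Hypothetical stand-in for the vdsm.conf option discussed in this bug
# (vm_sample_numa_interval); 600 is the workaround value, 15 the old default.
NUMA_SAMPLE_INTERVAL = 600  # seconds


def sample_numa_stats(vms):
    """Placeholder for the real per-VM NUMA stats collector."""
    for vm in vms:
        print("sampling NUMA stats for", vm)


def run_periodic(operation, interval, vms):
    """Invoke operation(vms) every `interval` seconds until stopped.

    With 14 VMs, interval=15 yields 14 * 3600 / 15 = 3360 per-VM
    collections per hour; interval=600 cuts that to 84.
    """
    stop = threading.Event()

    def loop():
        # Event.wait doubles as an interruptible sleep: it returns False
        # on timeout (keep sampling) and True once stop.set() is called.
        while not stop.wait(interval):
            operation(vms)

    threading.Thread(target=loop, daemon=True).start()
    return stop  # caller invokes stop.set() to cancel the sampler

The sketch only makes the scaling argument concrete; the actual 4.1 fix was to drop the VM NUMA collection code entirely (bug 1148039).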
Verified on vdsm-4.19.2-2.el7ev.x86_64