+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1396910 +++
======================================================================

Description of problem:
NUMA sampling causes very high load on the hypervisor. The load grows over time.

Version-Release number of selected component (if applicable):
vdsm-4.17.35-1.el7ev.noarch

How reproducible:
100% in a specific environment

Steps to Reproduce:
1. See the supervdsm logs

Actual results:
The load on the hypervisor is very high:

20:03:09 up 65 days, 23 min, 1 user, load average: 42.69, 41.55, 38.18

systemctl stop vdsmd

20:04:04 up 65 days, 24 min, 1 user, load average: 33.70, 39.56, 37.71
20:04:28 up 65 days, 24 min, 1 user, load average: 24.64, 36.98, 36.91
20:04:57 up 65 days, 25 min, 1 user, load average: 16.49, 33.83, 35.86
20:05:35 up 65 days, 25 min, 1 user, load average: 11.20, 30.59, 34.70
20:05:48 up 65 days, 26 min, 1 user, load average: 9.78, 29.33, 34.22

Additional info:
The issue was worked around by setting vm_sample_numa_interval = 600.
NUMA stats are collected 3171 times in one hour for just 14 VMs.

(Originally by Roman Hodain)
MOM has nothing to do with NUMA. Moving to VDSM. There also were some big changes to monitoring in 4.0 so this might be just a matter of backporting. However, there is also the (fixed for at least 4.0 and up) bug about high load because of disk IO tune queries: https://bugzilla.redhat.com/show_bug.cgi?id=1366556 (Originally by Martin Sivak)
*** Bug 1398953 has been marked as a duplicate of this bug. *** (Originally by Martin Sivak)
msivak, can we consider removing *VM* NUMA stats entirely? They are used for reporting only. The second option is to relax the interval, but if we don't need them I would prefer to just remove them. (Originally by Roy Golan)
It seems it is already removed in 4.1 engine. But we need to instruct VDSM to limit the collection frequency (and possibly remove the code) too. (Originally by Martin Sivak)
The code was dropped in 4.1 in bug 1148039, and it is unused in 3.6/4.0 as well. To minimize changes, we can just increase the poll interval from 15s to 1h. (Originally by michal.skrivanek)
I meant 600s, that was actually tested in real setup already. (Originally by michal.skrivanek)
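For a rough sense of what the interval change buys, here is a minimal sketch that estimates how many per-VM NUMA collections per hour to expect at the 15s interval versus the 600s workaround. It assumes one collection per VM per interval, and it assumes vm_sample_numa_interval sits in a [vars] section of the vdsm configuration; the helper names samples_per_hour and read_interval are hypothetical and not part of VDSM.

```python
import configparser

# Hypothetical vdsm.conf fragment. The option name and value come from this
# bug; the [vars] section placement is an assumption.
CONF_TEXT = """
[vars]
vm_sample_numa_interval = 600
"""

DEFAULT_INTERVAL_S = 15  # poll interval cited in this thread


def samples_per_hour(interval_s, vm_count):
    """Expected NUMA stat collections per hour, one per VM per interval."""
    return (3600 // interval_s) * vm_count


def read_interval(conf_text, default=DEFAULT_INTERVAL_S):
    """Read the interval from config text, falling back to the default."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.getint("vars", "vm_sample_numa_interval", fallback=default)


if __name__ == "__main__":
    vms = 14  # VM count from the original report
    tuned = read_interval(CONF_TEXT)
    print(samples_per_hour(DEFAULT_INTERVAL_S, vms))  # 3360
    print(samples_per_hour(tuned, vms))               # 84
```

Under these assumptions, the 15s interval gives about 3360 collections per hour for 14 VMs, which is roughly in line with the 3171 observed in the report, while 600s gives 84.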
Package vdsm-4.16.36-1.el6ev.x86_64 does not include the patch.
The right version for 3.6.10 is 4.16.37, please retest
Verified on vdsm-4.17.37-1.el7ev.noarch, vdsm has correct NUMA sampling interval.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2017-0109.html