
Bug 1396910

Summary: Numa sampling causes very high load on the hypervisor.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Roman Hodain <rhodain>
Component: vdsm
Assignee: Martin Polednik <mpoledni>
Status: CLOSED ERRATA
QA Contact: Artyom <alukiano>
Severity: high
Docs Contact:
Priority: high
Version: 3.6.9
CC: bazulay, bcholler, dfediuck, gklein, guchen, lsurette, mgoldboi, michal.skrivanek, mkalinin, mpoledni, srevivo, trichard, ycui, ykaul
Target Milestone: ovirt-4.1.0-alpha
Keywords: Performance, Triaged, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, NUMA sampling could cause an unnecessarily high load on a complex host. This update increases the sampling interval to 10 minutes, which is sufficient for the rarely-changing NUMA topology.
Story Points: ---
Clone Of:
Clones: 1401580, 1401583
Environment:
Last Closed: 2017-04-25 00:41:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1401580, 1401583

Description Roman Hodain 2016-11-21 06:56:16 UTC
Description of problem:
NUMA sampling causes a very high load on the hypervisor, and the load grows over time.

Version-Release number of selected component (if applicable):
vdsm-4.17.35-1.el7ev.noarch

How reproducible:
100% in a specific environment

Steps to Reproduce:
1. see supervdsm logs

Actual results:
The load on the hypervisor is very high:

     20:03:09 up 65 days, 23 min,  1 user,  load average: 42.69, 41.55, 38.18

     systemctl stop vdsmd

     20:04:04 up 65 days, 24 min,  1 user,  load average: 33.70, 39.56, 37.71
     20:04:28 up 65 days, 24 min,  1 user,  load average: 24.64, 36.98, 36.91
     20:04:57 up 65 days, 25 min,  1 user,  load average: 16.49, 33.83, 35.86
     20:05:35 up 65 days, 25 min,  1 user,  load average: 11.20, 30.59, 34.70
     20:05:48 up 65 days, 26 min,  1 user,  load average: 9.78, 29.33, 34.22

Additional info:

The issue was worked around by setting

     vm_sample_numa_interval = 600

NUMA stats are collected 3171 times in one hour for just 14 VMs.
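
For reference, a minimal sketch of the workaround on the host (the [vars] section of /etc/vdsm/vdsm.conf is an assumption about where the option belongs; the option name and value come from this report):

     # /etc/vdsm/vdsm.conf -- assumed location and section for the override
     [vars]
     # sample NUMA stats every 600 s instead of the 15 s default
     vm_sample_numa_interval = 600

After editing the file, restart the service (systemctl restart vdsmd) so the new interval is picked up. As a sanity check on the numbers above: at the 15 s default mentioned in comment 6 below, 14 VMs amount to roughly 14 * (3600 / 15) = 3360 NUMA queries per hour, in line with the 3171 observed; at 600 s this drops to about 84 per hour.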

Comment 1 Martin Sivák 2016-11-21 13:17:23 UTC
MOM has nothing to do with NUMA; moving to VDSM. There were also some big changes to monitoring in 4.0, so this might just be a matter of backporting.

However, there is also the bug (fixed for at least 4.0 and up) about high load caused by disk IO tune queries: https://bugzilla.redhat.com/show_bug.cgi?id=1366556

Comment 2 Martin Sivák 2016-11-28 12:39:01 UTC
*** Bug 1398953 has been marked as a duplicate of this bug. ***

Comment 3 Roy Golan 2016-12-05 13:44:22 UTC
msivak, can we consider removing the *VM* NUMA stats entirely? They are used for reporting only. The second option is to relax the interval, but if we don't need them, I would prefer to just remove them.

Comment 5 Martin Sivák 2016-12-05 14:54:54 UTC
It seems this is already removed in the 4.1 engine, but we need to instruct VDSM to limit the collection frequency (and possibly remove the code) as well.

Comment 6 Michal Skrivanek 2016-12-05 15:03:17 UTC
The code was dropped in 4.1 in bug 1148039 and it is unused in 3.6/4.0 as well. To minimize changes, we can just increase the poll interval from 15s to 1h.

Comment 7 Michal Skrivanek 2016-12-05 15:37:38 UTC
I meant 600s; that was already tested in a real setup.
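
For illustration only, a rough Python sketch of the kind of periodic collection loop whose interval is being tuned here; the function names and the timer-based scheduler are hypothetical stand-ins, not VDSM's actual periodic framework:

     import threading

     def sample_numa_stats(vms):
         # stand-in for the per-VM NUMA topology query that loads the host
         for vm in vms:
             pass

     def start_numa_sampling(vms, interval=600):
         # re-arm a timer every `interval` seconds; moving from the old
         # 15 s default to 600 s cuts the query rate by a factor of 40
         def _tick():
             sample_numa_stats(vms)
             timer = threading.Timer(interval, _tick)
             timer.daemon = True
             timer.start()
         _tick()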

Comment 12 Artyom 2017-01-24 12:55:53 UTC
Verified on vdsm-4.19.2-2.el7ev.x86_64