Bug 1401580 - [z-stream clone - 3.6.10] Numa sampling causes very high load on the hypervisor.
Summary: [z-stream clone - 3.6.10] Numa sampling causes very high load on the hypervisor.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.6.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-3.6.10
Target Release: ---
Assignee: Martin Polednik
QA Contact: Artyom
URL:
Whiteboard:
Depends On: 1396910
Blocks:
 
Reported: 2016-12-05 15:39 UTC by rhev-integ
Modified: 2020-02-16 07:11 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Previously, NUMA sampling could cause an unnecessarily high load on complex hosts. Now, the sampling interval has been increased to 10 minutes, reducing the load on hosts. This is frequent enough, as NUMA topology rarely changes.
Clone Of: 1396910
Environment:
Last Closed: 2017-01-17 18:07:25 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0109 0 normal SHIPPED_LIVE vdsm 3.6.10 bug fix and enhancement update 2017-01-17 22:48:48 UTC
oVirt gerrit 67854 0 None None None 2016-12-06 08:26:43 UTC

Description rhev-integ 2016-12-05 15:39:01 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1396910 +++
======================================================================

Description of problem:
NUMA sampling causes a very high load on the hypervisor. The load grows over time.

Version-Release number of selected component (if applicable):
vdsm-4.17.35-1.el7ev.noarch

How reproducible:
100% in a specific environment

Steps to Reproduce:
1. See supervdsm logs.

Actual results:
The load on the hypervisor is very high:

     20:03:09 up 65 days, 23 min,  1 user,  load average: 42.69, 41.55, 38.18

     systemctl stop vdsmd

     20:04:04 up 65 days, 24 min,  1 user,  load average: 33.70, 39.56, 37.71
     20:04:28 up 65 days, 24 min,  1 user,  load average: 24.64, 36.98, 36.91
     20:04:57 up 65 days, 25 min,  1 user,  load average: 16.49, 33.83, 35.86
     20:05:35 up 65 days, 25 min,  1 user,  load average: 11.20, 30.59, 34.70
     20:05:48 up 65 days, 26 min,  1 user,  load average: 9.78, 29.33, 34.22

Additional info:

The issue was worked around by setting

     vm_sample_numa_interval = 600

NUMA stats are collected 3171 times in one hour for just 14 VMs (roughly consistent with 14 VMs polled at the default 15-second interval: 14 × 240 = 3360 samples per hour).
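
As context for the workaround, a minimal sketch of how the setting might be applied on a host; placing the option under a [vars] section of /etc/vdsm/vdsm.conf is an assumption here, not something stated in this report:

     # /etc/vdsm/vdsm.conf -- the [vars] section name is an assumption
     [vars]
     # Sample NUMA stats every 600 seconds instead of every 15 seconds
     vm_sample_numa_interval = 600

vdsmd would then need a restart (systemctl restart vdsmd) for the new interval to take effect.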

(Originally by Roman Hodain)

Comment 1 rhev-integ 2016-12-05 15:39:10 UTC
MOM has nothing to do with NUMA. Moving to VDSM. There were also some big changes to monitoring in 4.0, so this might just be a matter of backporting.

However, there is also the (fixed for at least 4.0 and up) bug about high load because of disk IO tune queries: https://bugzilla.redhat.com/show_bug.cgi?id=1366556

(Originally by Martin Sivak)

Comment 4 rhev-integ 2016-12-05 15:39:24 UTC
*** Bug 1398953 has been marked as a duplicate of this bug. ***

(Originally by Martin Sivak)

Comment 5 rhev-integ 2016-12-05 15:39:30 UTC
msivak, can we consider removing the *VM* NUMA stats entirely? They are for reporting only. A second option is to relax the interval, but if we don't need them, I would prefer to just remove them.

(Originally by Roy Golan)

Comment 7 rhev-integ 2016-12-05 15:39:41 UTC
It seems it has already been removed in the 4.1 engine. But we need to instruct VDSM to limit the collection frequency (and possibly remove the code) too.

(Originally by Martin Sivak)

Comment 8 rhev-integ 2016-12-05 15:39:47 UTC
The code was dropped in 4.1 in bug 1148039, and it is unused in 3.6/4.0 as well. To minimize changes, we can just increase the poll interval from 15s to 1h.

(Originally by michal.skrivanek)

Comment 9 rhev-integ 2016-12-05 15:39:54 UTC
I meant 600s; that was already tested in a real setup.

(Originally by michal.skrivanek)
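
To make the effect of the interval concrete, here is a minimal sketch (not VDSM's actual scheduler code; all names are illustrative) of a periodic sampling loop whose cost scales with the number of VMs divided by the poll interval:

     # Illustrative sketch only -- not VDSM's real code.
     import threading

     def sample_numa(vm):
         # Placeholder for the real per-VM NUMA stats query.
         pass

     def numa_sampling_loop(vms, interval, stop_event):
         # One pass over all VMs every `interval` seconds, so
         # samples/hour = len(vms) * 3600 / interval.
         while not stop_event.wait(interval):
             for vm in vms:
                 sample_numa(vm)

     # Usage sketch:
     #   stop = threading.Event()
     #   numa_sampling_loop(vms, interval=600, stop_event=stop)

At 15s, 14 VMs yield 14 × 240 = 3360 samples per hour; at 600s the same host issues only 84.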

Comment 11 Artyom 2016-12-18 09:31:02 UTC
Package vdsm-4.16.36-1.el6ev.x86_64 does not include the patch.

Comment 12 Michal Skrivanek 2016-12-19 13:17:56 UTC
The right version for 3.6.10 is 4.16.37; please retest.

Comment 13 Artyom 2016-12-19 15:15:38 UTC
Verified on vdsm-4.17.37-1.el7ev.noarch; vdsm has the correct NUMA sampling interval.
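
For the record, one way such a check could be scripted; the vdsm.config module path and the ConfigParser-style accessor are assumptions based on VDSM's usual layout, not taken from this report:

     # Hypothetical check -- module path and accessor are assumptions.
     from vdsm.config import config
     # Expect 600 after the fix (the 10-minute interval).
     print(config.getint('vars', 'vm_sample_numa_interval'))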

Comment 15 errata-xmlrpc 2017-01-17 18:07:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0109.html

