Bug 1401583 - [z-stream clone - 4.0.6] Numa sampling causes very high load on the hypervisor.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.6.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.0.6
Target Release: ---
Assigned To: Martin Polednik
QA Contact: Artyom
Keywords: Performance, Triaged, ZStream
Depends On: 1396910
Blocks:
Reported: 2016-12-05 10:41 EST by rhev-integ
Modified: 2017-02-01 19:53 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, NUMA sampling could cause an unnecessarily high load on a complex host. This release increases the sampling interval from 15 seconds to 10 minutes, which is sufficient for the rarely changing NUMA topology.
Story Points: ---
Clone Of: 1396910
Environment:
Last Closed: 2017-01-10 12:04:09 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

External Trackers
Tracker ID                             Priority  Status        Summary                                    Last Updated
oVirt gerrit 67855                     None      None          None                                       2016-12-06 03:27 EST
Red Hat Product Errata RHBA-2017:0044  normal    SHIPPED_LIVE  vdsm 4.0.6 bug fix and enhancement update  2017-01-10 16:52:50 EST

Description rhev-integ 2016-12-05 10:41:12 EST
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1396910 +++
======================================================================

Description of problem:
NUMA sampling causes very high load on the hypervisor. The load grows over time.

Version-Release number of selected component (if applicable):
vdsm-4.17.35-1.el7ev.noarch

How reproducible:
100% in a specific environment

Steps to Reproduce:
1. see supervdsm logs

Actual results:
The load on the hypervisor is very high:

     20:03:09 up 65 days, 23 min,  1 user,  load average: 42.69, 41.55, 38.18

     systemctl stop vdsmd

     20:04:04 up 65 days, 24 min,  1 user,  load average: 33.70, 39.56, 37.71
     20:04:28 up 65 days, 24 min,  1 user,  load average: 24.64, 36.98, 36.91
     20:04:57 up 65 days, 25 min,  1 user,  load average: 16.49, 33.83, 35.86
     20:05:35 up 65 days, 25 min,  1 user,  load average: 11.20, 30.59, 34.70
     20:05:48 up 65 days, 26 min,  1 user,  load average: 9.78, 29.33, 34.22

Additional info:

The issue was worked around by setting:

     vm_sample_numa_interval = 600

NUMA stats are collected 3171 times in one hour for just 14 VMs.

(Originally by Roman Hodain)
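
For reference, the workaround in the description could be written as a vdsm configuration override. The sketch below assumes the option belongs in the `[vars]` section of `/etc/vdsm/vdsm.conf`; neither the section name nor the path is stated in this report, so treat both as assumptions. It only parses an in-memory sample with Python's standard `configparser` and does not touch a real host. The helper name `read_numa_interval` is hypothetical.

```python
import configparser

# Hypothetical vdsm.conf fragment; the [vars] section name is an assumption.
SAMPLE_CONF = """
[vars]
vm_sample_numa_interval = 600
"""


def read_numa_interval(conf_text, default=15):
    """Return the NUMA sampling interval (seconds) found in config text.

    The default of 15 s is the pre-fix poll interval mentioned in comment 7.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # fallback covers both a missing section and a missing option
    return parser.getint("vars", "vm_sample_numa_interval", fallback=default)


if __name__ == "__main__":
    print(read_numa_interval(SAMPLE_CONF))
```

On a real host, vdsmd would presumably need to be restarted before a changed value takes effect; that step is not confirmed anywhere in this report.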
Comment 1 rhev-integ 2016-12-05 10:41:21 EST
MOM has nothing to do with NUMA. Moving to VDSM. There also were some big changes to monitoring in 4.0 so this might be just a matter of backporting.

However, there is also the (fixed for at least 4.0 and up) bug about high load because of disk IO tune queries: https://bugzilla.redhat.com/show_bug.cgi?id=1366556

(Originally by Martin Sivak)
Comment 3 rhev-integ 2016-12-05 10:41:27 EST
*** Bug 1398953 has been marked as a duplicate of this bug. ***

(Originally by Martin Sivak)
Comment 4 rhev-integ 2016-12-05 10:41:33 EST
msivak, can we consider removing *VM* NUMA stats entirely? They are used for reporting only. The second option is to relax the interval, but if we don't need the stats, I would prefer to just remove them.

(Originally by Roy Golan)
Comment 6 rhev-integ 2016-12-05 10:41:45 EST
It seems this has already been removed in the 4.1 engine, but we also need to instruct VDSM to limit the collection frequency (and possibly remove the code).

(Originally by Martin Sivak)
Comment 7 rhev-integ 2016-12-05 10:41:50 EST
The code was dropped in 4.1 in bug 1148039, and it is unused in 3.6/4.0 as well. To minimize changes, we can just increase the poll interval from 15s to 1h.

(Originally by michal.skrivanek)
Comment 8 rhev-integ 2016-12-05 10:41:56 EST
I meant 600s; that has already been tested in a real setup.

(Originally by michal.skrivanek)
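
As a rough sanity check on the numbers in this thread, the sketch below compares the expected hourly sample count at the 15 s poll interval (comment 7) with the 600 s interval (comment 8) for the 14 VMs from the description. It assumes exactly one NUMA sample per VM per interval, which the report does not confirm; the function name `samples_per_hour` is hypothetical.

```python
def samples_per_hour(vm_count, interval_s):
    """Expected per-VM NUMA samples in one hour, one sample per VM per interval."""
    return vm_count * (3600 // interval_s)


VMS = 14                   # from the description
OBSERVED_PER_HOUR = 3171   # from the description

before = samples_per_hour(VMS, 15)    # 15 s poll interval (comment 7)
after = samples_per_hour(VMS, 600)    # 600 s interval (comment 8)

if __name__ == "__main__":
    # prints: 3360 84 40
    print(before, after, before // after)
```

Under that assumption, the observed 3171 collections per hour are close to the 3360 expected at a 15 s interval, and moving to 600 s cuts the sampling work by a factor of 40.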
Comment 10 Artyom 2016-12-11 08:01:28 EST
The patch is still not present in this vdsm version:
vdsm-4.18.18-4.git198e48d.el7ev.x86_64
vdsm-yajsonrpc-4.18.18-4.git198e48d.el7ev.noarch
vdsm-api-4.18.18-4.git198e48d.el7ev.noarch
vdsm-jsonrpc-4.18.18-4.git198e48d.el7ev.noarch
vdsm-python-4.18.18-4.git198e48d.el7ev.noarch
vdsm-hook-vmfex-dev-4.18.18-4.git198e48d.el7ev.noarch
vdsm-xmlrpc-4.18.18-4.git198e48d.el7ev.noarch
vdsm-infra-4.18.18-4.git198e48d.el7ev.noarch
vdsm-cli-4.18.18-4.git198e48d.el7ev.noarch
Comment 11 Michal Skrivanek 2016-12-12 08:15:35 EST
The wrong tag was used for the build; it will be rebuilt today.
Comment 12 Artyom 2016-12-13 09:52:05 EST
Verified on vdsm-4.18.20-1.el7ev.x86_64
Comment 14 errata-xmlrpc 2017-01-10 12:04:09 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0044.html
Comment 15 Tahlia Richardson 2017-02-01 19:53:24 EST
Setting a doc type so that the text gets pulled into the Release Notes.