Bug 1396910 - Numa sampling causes very high load on the hypervisor.
Summary: Numa sampling causes very high load on the hypervisor.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.6.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.1.0-alpha
Target Release: ---
Assignee: Martin Polednik
QA Contact: Artyom
URL:
Whiteboard:
Duplicates: 1398953
Depends On:
Blocks: 1401580 1401583
 
Reported: 2016-11-21 06:56 UTC by Roman Hodain
Modified: 2021-08-30 11:50 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, NUMA sampling could cause an unnecessarily high load on a complex host. This update increases the sampling interval to 10 minutes, which is sufficient for the rarely changing NUMA topology.
Clone Of:
Clones: 1401580 1401583
Environment:
Last Closed: 2017-04-25 00:41:12 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1398953 0 unspecified CLOSED [scale] getVcpuNumaMemoryMapping get called 4 time per VM per 1 minute - flooding supervdsm.log when many VMs (100) are ... 2021-02-22 00:41:40 UTC
Red Hat Issue Tracker RHV-43183 0 None None None 2021-08-30 11:50:10 UTC
Red Hat Knowledge Base (Solution) 2823931 0 None None None 2016-12-21 18:26:18 UTC
Red Hat Product Errata RHEA-2017:0998 0 normal SHIPPED_LIVE VDSM bug fix and enhancement update 4.1 GA 2017-04-18 20:11:39 UTC
oVirt gerrit 67838 0 None None None 2016-12-05 15:38:24 UTC

Internal Links: 1398953

Description Roman Hodain 2016-11-21 06:56:16 UTC
Description of problem:
NUMA sampling causes very high load on the hypervisor. The load on the hypervisor grows over time.

Version-Release number of selected component (if applicable):
vdsm-4.17.35-1.el7ev.noarch

How reproducible:
100% in a specific environment

Steps to Reproduce:
1. See the supervdsm logs; the NUMA sampling calls flood supervdsm.log, as shown below.
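
For example, one rough way to gauge the flood is to count the NUMA mapping queries named in duplicate bug 1398953 (the log path below is the default supervdsm log location):

     grep -c getVcpuNumaMemoryMapping /var/log/vdsm/supervdsm.log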

Actual results:
The load on the hypervisor is very high:

     20:03:09 up 65 days, 23 min,  1 user,  load average: 42.69, 41.55, 38.18

     systemctl stop vdsmd

     20:04:04 up 65 days, 24 min,  1 user,  load average: 33.70, 39.56, 37.71
     20:04:28 up 65 days, 24 min,  1 user,  load average: 24.64, 36.98, 36.91
     20:04:57 up 65 days, 25 min,  1 user,  load average: 16.49, 33.83, 35.86
     20:05:35 up 65 days, 25 min,  1 user,  load average: 11.20, 30.59, 34.70
     20:05:48 up 65 days, 26 min,  1 user,  load average: 9.78, 29.33, 34.22

Additional info:

The issue was worked around by setting

     vm_sample_numa_interval = 600

NUMA stats were being collected 3171 times in one hour for just 14 VMs.
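
As a sketch, the workaround can be applied like this, assuming vm_sample_numa_interval is read from the [vars] section of /etc/vdsm/vdsm.conf (the section name is an assumption; merge the key into an existing [vars] section if one is already present):

     # assumption: the option lives under [vars]; do not append a duplicate
     # [vars] section if the file already contains one
     printf '[vars]\nvm_sample_numa_interval = 600\n' >> /etc/vdsm/vdsm.conf
     systemctl restart vdsmd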

Comment 1 Martin Sivák 2016-11-21 13:17:23 UTC
MOM has nothing to do with NUMA; moving to VDSM. There were also some big changes to monitoring in 4.0, so this might just be a matter of backporting.

However, there is also the (fixed for at least 4.0 and up) bug about high load because of disk IO tune queries: https://bugzilla.redhat.com/show_bug.cgi?id=1366556

Comment 2 Martin Sivák 2016-11-28 12:39:01 UTC
*** Bug 1398953 has been marked as a duplicate of this bug. ***

Comment 3 Roy Golan 2016-12-05 13:44:22 UTC
msivak, can we consider removing the *VM* NUMA stats entirely? They are for reporting only. A second option is to relax the interval, but if we don't need them, I would prefer to just remove them.

Comment 5 Martin Sivák 2016-12-05 14:54:54 UTC
It seems it is already removed in the 4.1 engine, but we need to instruct VDSM to limit the collection frequency (and possibly remove the code) too.

Comment 6 Michal Skrivanek 2016-12-05 15:03:17 UTC
The code was dropped in 4.1 in bug 1148039, and it is unused in 3.6/4.0 as well. To minimize changes, we can just increase the poll interval from 15s to 1h.

Comment 7 Michal Skrivanek 2016-12-05 15:37:38 UTC
I meant 600s; that interval was already tested in a real setup.
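
Rough arithmetic for those values, assuming one NUMA query per VM per sampling interval and the 14 VMs from the description:

     # queries/hour = VMs * 3600 / interval_seconds
     echo $(( 14 * 3600 / 15 ))    # 15s  -> 3360/hour (close to the ~3171/hour observed)
     echo $(( 14 * 3600 / 600 ))   # 600s -> 84/hour, a 40x reduction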

Comment 12 Artyom 2017-01-24 12:55:53 UTC
Verified on vdsm-4.19.2-2.el7ev.x86_64

