Description of problem:
NUMA sampling causes very high load on the hypervisor, and the load grows over time.

Version-Release number of selected component (if applicable):
vdsm-4.17.35-1.el7ev.noarch

How reproducible:
100% in a specific environment

Steps to Reproduce:
1. see supervdsm logs

Actual results:
The load on the hypervisor is very high:

20:03:09 up 65 days, 23 min, 1 user, load average: 42.69, 41.55, 38.18

systemctl stop vdsmd

20:04:04 up 65 days, 24 min, 1 user, load average: 33.70, 39.56, 37.71
20:04:28 up 65 days, 24 min, 1 user, load average: 24.64, 36.98, 36.91
20:04:57 up 65 days, 25 min, 1 user, load average: 16.49, 33.83, 35.86
20:05:35 up 65 days, 25 min, 1 user, load average: 11.20, 30.59, 34.70
20:05:48 up 65 days, 26 min, 1 user, load average: 9.78, 29.33, 34.22

Additional info:
The issue was worked around by setting vm_sample_numa_interval = 600. NUMA stats are collected 3171 times in one hour for just 14 VMs.
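For scale: 14 VMs polled at the default 15s interval give 14 * (3600 / 15) = 3360 collections per hour, which lines up with the observed 3171. A minimal sketch of the workaround, assuming vm_sample_numa_interval belongs under the [vars] section of /etc/vdsm/vdsm.conf (the section placement is an assumption here, check your config layout):

# /etc/vdsm/vdsm.conf
[vars]
# Sample per-VM NUMA stats every 600s instead of every 15s
vm_sample_numa_interval = 600

followed by a vdsmd restart (systemctl restart vdsmd) to pick up the change.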
MOM has nothing to do with NUMA; moving to VDSM. There were also some big changes to monitoring in 4.0, so this might just be a matter of backporting. However, there is also the (fixed for at least 4.0 and up) bug about high load caused by disk IO tune queries: https://bugzilla.redhat.com/show_bug.cgi?id=1366556
*** Bug 1398953 has been marked as a duplicate of this bug. ***
msivak, can we consider removing the *VM* NUMA stats entirely? They are used for reporting only. The second option is to relax the interval, but I would prefer that if we don't need them, we just remove them.
It seems it is already removed in the 4.1 engine, but we need to instruct VDSM to limit the collection frequency (and possibly remove the code) too.
The code was dropped in 4.1 in bug 1148039, and it is unused in 3.6/4.0 as well. To minimize changes, we can just increase the poll interval from 15s to 1h.
I meant 600s; that value was actually tested in a real setup already.
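For illustration only, a generic Python sketch of why the interval change helps: a periodic collector fires once per interval, so the sampling load scales inversely with the interval. This is not vdsm's actual scheduler, and the names sample_numa_stats and run_periodic are hypothetical.

import threading

# Hypothetical stand-in for the vdsm.conf option discussed in this bug
# (vm_sample_numa_interval); 600 is the workaround value, 15 the old default.
NUMA_SAMPLE_INTERVAL = 600  # seconds


def sample_numa_stats(vms):
    """Placeholder for the real per-VM NUMA stats collector."""
    for vm in vms:
        print("sampling NUMA stats for", vm)


def run_periodic(operation, interval, vms):
    """Invoke operation(vms) every `interval` seconds until stopped.

    With 14 VMs, interval=15 yields 14 * 3600 / 15 = 3360 per-VM
    collections per hour; interval=600 cuts that to 84.
    """
    stop = threading.Event()

    def loop():
        # Event.wait doubles as an interruptible sleep: it returns False
        # on timeout (keep sampling) and True once stop.set() is called.
        while not stop.wait(interval):
            operation(vms)

    threading.Thread(target=loop, daemon=True).start()
    return stop  # caller invokes stop.set() to cancel the sampler

The sketch only makes the scaling argument concrete; the actual 4.1 fix was to drop the VM NUMA collection code entirely (bug 1148039).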
Verified on vdsm-4.19.2-2.el7ev.x86_64