Bug 1182094

Summary: vdsm NUMA code not effective, slowing down statistics retrieval
Product: Red Hat Enterprise Virtualization Manager Reporter: Michal Skrivanek <michal.skrivanek>
Component: vdsm Assignee: Martin Sivák <msivak>
Status: CLOSED ERRATA QA Contact: Artyom <alukiano>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.5.0 CC: bazulay, dfediuck, fromani, gklein, istein, lpeer, lsurette, melewis, mkalinin, rhodain, sherold, yeylon, ykaul
Target Milestone: ovirt-3.6.0-rc Keywords: Triaged, ZStream
Target Release: 3.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ovirt-3.6.0-alpha1.2 Doc Type: Bug Fix
Doc Text:
Previously, NUMA statistics were collected every time VDSM was queried for host statistics. Because collecting the data required executing an external process, this caused higher load and unnecessary delays. Now, NUMA statistics collection has been moved to the statistics threads, and the host statistics query reports the last collected result.
Story Points: ---
Clone Of:
: 1220113 (view as bug list) Environment:
Last Closed: 2016-03-09 19:29:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: SLA RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1177634, 1185279, 1220113    

Description Michal Skrivanek 2015-01-14 12:39:21 UTC
Following up on the general scaling bug 1177634, opening a specific SLA bug as per https://bugzilla.redhat.com/show_bug.cgi?id=1177634#c46

The NUMA code introduced in 3.5 is very inefficient and, when enabled, significantly slows down the high-profile getAllVmStats call.

Periodic parsing of libvirt's private XML is a very problematic approach and should be replaced with a proper solution; any missing APIs should be requested from the relevant components (libvirt).

In any case, it should be moved out of the stats call, which is supposed to only report information that is gathered asynchronously in a separate thread. (This is the "urgent" part of the bug, since it affects overall performance.)
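
The approach requested above (gather slow data in a separate thread, have the stats call only report the last gathered value) can be sketched roughly as follows. This is a minimal illustration, not VDSM's actual code: the class name CachedSampler, its methods, and the collect/interval parameters are all hypothetical.

```python
import threading
import time


class CachedSampler:
    """Hypothetical helper: runs a slow collector in a background thread
    and caches its most recent result for cheap retrieval."""

    def __init__(self, collect, interval):
        self._collect = collect      # slow callable, e.g. one that spawns an external process
        self._interval = interval    # seconds to wait between samples
        self._lock = threading.Lock()
        self._last = None            # stays None until the first sample completes
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.is_set():
            result = self._collect()
            with self._lock:
                self._last = result
            # Event.wait() returns early once stop() has been called.
            self._stop.wait(self._interval)

    def last(self):
        """Return the cached sample without running the collector."""
        with self._lock:
            return self._last
```

With a helper like this, the stats call would only invoke last(), so its latency no longer depends on how long the collector takes. The trade-off is that the reported value can be up to one interval old.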

Comment 1 Michal Skrivanek 2015-01-23 11:29:35 UTC
In addition, see point 3 in https://bugzilla.redhat.com/show_bug.cgi?id=1185279#c1 for the NUMA issue in host monitoring.

Comment 2 Eyal Edri 2015-02-25 08:40:14 UTC
3.5.1 is already full of bugs (over 80), and since none of these bugs were added as urgent for the 3.5.1 release in the tracker bug, moving to 3.5.2.

Comment 6 Martin Sivák 2015-03-10 16:48:55 UTC
The patch is posted, and the improvement was measured to be about 12 ms per VM per call.

With two NUMA-enabled VMs, timing 100 calls with the following command showed this difference:

time for x in $(seq 100); do vdsClient -s 0 getAllVmStats >/dev/null; done

Old VDSM:

real	0m21.093s
user	0m11.998s
sys	0m1.690s

Updated VDSM:

real	0m18.485s
user	0m12.009s
sys	0m1.846s

And a control timing of two VMs without NUMA:

real	0m18.298s
user	0m11.878s
sys	0m1.699s

As you can see, the time difference for 100 calls was about 2.6 seconds (21.093 s versus 18.485 s).
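
For reference, the per-call and per-VM figures can be derived from the three "real" timings above. The short sketch below only redoes that arithmetic; the function and constant names are made up for illustration.

```python
def overhead_ms(slow_total_s, fast_total_s, calls, vms):
    """Return (per-call, per-VM-per-call) time difference in milliseconds."""
    per_call = (slow_total_s - fast_total_s) / calls * 1000.0
    return per_call, per_call / vms


CALLS, VMS = 100, 2
OLD_NUMA = 21.093  # old VDSM, two NUMA VMs ("real" seconds for 100 calls)
NEW_NUMA = 18.485  # updated VDSM, two NUMA VMs
NO_NUMA = 18.298   # control, two VMs without NUMA

saved = overhead_ms(OLD_NUMA, NEW_NUMA, CALLS, VMS)
residual = overhead_ms(NEW_NUMA, NO_NUMA, CALLS, VMS)

print("saved:    %.1f ms per call, %.1f ms per VM per call" % saved)
print("residual: %.1f ms per call, %.1f ms per VM per call" % residual)
```

This works out to roughly 26 ms saved per call, or about 13 ms per VM per call, which is in line with the "about 12ms" estimate. The remaining gap to the non-NUMA control is under 2 ms per call; these are single runs, so small differences may well be noise.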

Comment 7 Martin Sivák 2015-03-10 16:50:22 UTC
But just to make everything clear: all NUMA-related code was introduced in 3.5, so it should not affect 3.4, and the issue there is something different.

Comment 9 Martin Sivák 2015-05-11 08:36:24 UTC
The main issue is fixed.

Comment 10 Artyom 2015-05-26 14:14:58 UTC
Verified on vdsm-4.17.0-822.git9b11a18.el7.noarch
Ran two VMs with two CPUs, without NUMA:
[root@alma06 ~]# time for x in $(seq 100); do vdsClient -s 0 getAllVmStats >/dev/null; done

real    0m14.549s
user    0m11.643s
sys     0m2.316s

With NUMA:
[root@alma06 ~]# time for x in $(seq 100); do vdsClient -s 0 getAllVmStats >/dev/null; done

real    0m14.570s
user    0m11.632s
sys     0m2.370s

Comment 12 errata-xmlrpc 2016-03-09 19:29:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html