Bug 1435218

Summary: [scale] - getAllVmIoTunePolicies hit the performance
Product: [oVirt] vdsm Reporter: Eldad Marciano <emarcian>
Component: Core Assignee: Andrej Krejcir <akrejcir>
Status: CLOSED CURRENTRELEASE QA Contact: Ilanit Stein <istein>
Severity: high Docs Contact:
Priority: high    
Version: --- CC: akrejcir, bugs, lbopf, michal.skrivanek, stirabos
Target Milestone: ovirt-4.1.3 Keywords: Performance
Target Release: 4.19.16 Flags: rule-engine: ovirt-4.1+
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-07-06 13:31:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: SLA RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Eldad Marciano 2017-03-23 12:12:40 UTC
Description of problem:

On a loaded host with ~500 VMs, getAllVmIoTunePolicies hurts performance and causes other vdsm tasks to be delayed.

This scenario led to a TooManyTasks exception in vdsm:
ERROR (JsonRpcServer) [jsonrpc.JsonRpcServer] could not allocate request thread (__init__:626)

  File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 361, in put

    raise TooManyTasks()
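
For readers unfamiliar with this failure mode, here is a minimal sketch of a bounded task queue that rejects new work once it is full. It is not vdsm's executor code; the class BoundedExecutor, its max_tasks parameter and the run_one helper are hypothetical, and only the exception name is borrowed from the traceback above. It illustrates why slow calls that keep workers busy can end in rejected requests.

```python
import queue


class TooManyTasks(Exception):
    """Raised when the task queue is full (name borrowed from the traceback)."""


class BoundedExecutor:
    """Hypothetical stand-in for an executor with a fixed-size task queue."""

    def __init__(self, max_tasks):
        self._tasks = queue.Queue(maxsize=max_tasks)

    def put(self, task):
        # Reject immediately instead of blocking the caller when the queue is full.
        try:
            self._tasks.put_nowait(task)
        except queue.Full:
            raise TooManyTasks()

    def run_one(self):
        # Run the oldest queued task; in a real executor worker threads do this.
        task = self._tasks.get_nowait()
        return task()
```

If tasks are queued faster than they are drained, for example because long-running calls occupy the workers, put() starts raising and the caller cannot allocate a request.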


To better understand what is going on, we need to profile mom and provide further information.

Most of the workload comes from mom, and while other vdsm tasks are delayed, the host sometimes becomes non-responsive in the engine.

Workaround:
Turn off mom (momd).

Version-Release number of selected component (if applicable):
4.1.1
vdsm-4.19.9-1.el7ev.x86_64
libvirt-client-2.0.0-10.el7_3.4.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Run ~500 VMs.


Actual results:
Most of the workload comes from mom, and while other vdsm tasks are delayed, the host sometimes becomes non-responsive in the engine.


Expected results:
Normal response time for getAllVmIoTunePolicies.

Additional info:

Comment 2 Martin Sivák 2017-03-23 14:30:42 UTC
MOM executes this method once every 15 seconds. That is hardly unreasonable. MOM can't be optimized further: it needs the data, and it needs it reasonably often.

This can probably be improved on the vdsm/libvirt side so the method finishes faster.

Comment 3 Michal Skrivanek 2017-03-24 09:54:45 UTC
The feature design is problematic.
VDSM has all the configuration data, so it doesn't need to go to libvirt for it (VDSM is the only entity that sets it). Since mom is already using a bulk call for VM stats, a trigger can be added to request iotune policies only when they actually change.
I suggest taking Francesco's collectd monitoring work into consideration as well.

In the short term you can increase the 15s interval.
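
A rough sketch of the trigger idea, assuming the bulk VM stats carried some marker that changes whenever a policy is updated. The field name ioTunePolicyGeneration, the PolicyPoller class and the fetch_policies callable are all hypothetical; nothing in this bug says vdsm or mom expose them.

```python
_UNSET = object()  # sentinel: no generation has been seen yet


class PolicyPoller:
    """Hypothetical consumer that refetches policies only when they change."""

    def __init__(self, fetch_policies):
        # fetch_policies: callable performing the expensive full query (assumed).
        self._fetch_policies = fetch_policies
        self._seen_generation = _UNSET
        self._policies = {}

    def on_bulk_stats(self, stats):
        # "ioTunePolicyGeneration" is an assumed field, not a real vdsm stat.
        generation = stats.get("ioTunePolicyGeneration")
        if generation != self._seen_generation:
            self._policies = self._fetch_policies()
            self._seen_generation = generation
        return self._policies
```

With this shape the expensive call runs once per policy change rather than once per polling interval.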

Comment 4 Martin Sivák 2017-03-29 10:43:25 UTC
I think a cache layer should indeed be in vdsm.
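
One possible shape for such a cache layer, as a minimal sketch rather than the actual vdsm fix. It relies on the point from comment 3 that vdsm is the only entity setting these values, so the cache can be updated on every write and served without querying libvirt. The class and method names (IoTunePolicyCache, set_policy, remove_vm, get_all) are hypothetical.

```python
import copy
import threading


class IoTunePolicyCache:
    """Hypothetical in-memory cache of per-VM IO tune policies."""

    def __init__(self):
        self._lock = threading.Lock()
        self._policies = {}

    def set_policy(self, vm_id, policy):
        # Called wherever the policy is written, so the cache never goes stale.
        with self._lock:
            self._policies[vm_id] = copy.deepcopy(policy)

    def remove_vm(self, vm_id):
        # Drop the entry when the VM goes away; a missing key is not an error.
        with self._lock:
            self._policies.pop(vm_id, None)

    def get_all(self):
        # Return a copy so callers cannot mutate the cached state.
        with self._lock:
            return copy.deepcopy(self._policies)
```

A read then costs one dictionary copy regardless of how many VMs are running, instead of one libvirt round trip per VM.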

Comment 7 Eldad Marciano 2017-06-04 22:34:56 UTC
Verified on top of:
vdsm-4.19.17-1.el7ev.x86_64
mom-0.5.9-1.el7ev.noarch

with 500 VMs.

The overall CPU utilization on the host is very stable.
The response time for 'vdsm-client Host getAllVmIoTunePolicies' is in milliseconds due to the cache fix! (The vdsm logs show the same response time.)

This makes the CPU workload nicer and smoother.

No 'TooManyTasks' errors like the one described in https://bugzilla.redhat.com/show_bug.cgi?id=1435218#c0 were found.

Moving to verified.