Description of problem:
On a loaded host with ~500 VMs, getAllVmIoTunePolicies hurts performance and causes other vdsm tasks to be delayed. This scenario led to a TooManyTasks exception in vdsm:

ERROR (JsonRpcServer) [jsonrpc.JsonRpcServer] could not allocate request thread (__init__:626)
  File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 361, in put
    raise TooManyTasks()

To better understand what is going on, we need to profile mom and provide further information. Most of the workload comes from mom, and while other vdsm tasks are delayed, the host sometimes becomes non-responsive in the engine.

Workaround: turn off mom (momd).

Version-Release number of selected component (if applicable):
4.1.1
vdsm-4.19.9-1.el7ev.x86_64
libvirt-client-2.0.0-10.el7_3.4.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Run ~500 VMs.

Actual results:
Most of the workload comes from mom, and while other vdsm tasks are delayed, the host sometimes becomes non-responsive in the engine.

Expected results:
Normal response time for getAllVmIoTunePolicies.

Additional info:
MOM executes this method once every 15 seconds, which is hardly unreasonable. MOM can't be optimized further; it needs the data, and it needs it reasonably often. This can probably be improved on the vdsm/libvirt side so that the method finishes faster.
The feature design is problematic. VDSM has all the configuration data and doesn't need to go to libvirt for it (it's the only entity that sets it). Since mom already uses a bulk call for VM stats, a trigger can be added to request iotune policies only when they actually change. I suggest taking Francesco's collectd monitoring work into consideration as well. In the short term, you can increase the 15s interval.
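A rough sketch of the "request only on change" idea, to make the suggestion concrete. It assumes the bulk stats reply could carry a change counter; the `ioTuneGeneration` field, the `PolicyPoller` class, and the two callables are hypothetical names for illustration, not existing vdsm or mom APIs.

```python
class PolicyPoller(object):
    """Fetch I/O tune policies only when a change counter moves.

    fetch_stats and fetch_policies are caller-supplied callables
    (hypothetical stand-ins for the bulk stats call and for
    getAllVmIoTunePolicies).
    """

    def __init__(self, fetch_stats, fetch_policies):
        self._fetch_stats = fetch_stats
        self._fetch_policies = fetch_policies
        self._seen_generation = None
        self._policies = {}

    def poll(self):
        stats = self._fetch_stats()
        # 'ioTuneGeneration' is an assumed field, not part of the real reply.
        generation = stats.get("ioTuneGeneration")
        # If the counter is missing, fall back to fetching every time.
        if generation is None or generation != self._seen_generation:
            self._policies = self._fetch_policies()
            self._seen_generation = generation
        return self._policies
```

With something like this, the expensive call would run once per policy change instead of once per 15s cycle, while the cheap stats call keeps its current cadence.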
I think a cache layer should indeed be in vdsm.
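For illustration only, a minimal sketch of what such a cache layer might look like, assuming vdsm is the only writer of these policies, so it can record them at set time and answer reads from memory without querying libvirt. The class and method names are hypothetical and this is not the actual vdsm change.

```python
import copy
import threading


class IoTunePolicyCache(object):
    """In-memory store of per-VM I/O tune policies (hypothetical sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._policies = {}

    def set_policy(self, vm_id, policy):
        # Called wherever the policy is applied to the VM, so the cache
        # never has to be refreshed from libvirt.
        with self._lock:
            self._policies[vm_id] = copy.deepcopy(policy)

    def remove_vm(self, vm_id):
        # Drop the entry when the VM goes away; a missing entry is ignored.
        with self._lock:
            self._policies.pop(vm_id, None)

    def get_all(self):
        # Return a copy so callers cannot mutate the cached state.
        with self._lock:
            return copy.deepcopy(self._policies)
```

A read then costs one lock acquisition and a copy, independent of how long libvirt takes to answer for ~500 VMs.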
Verified on top of:
vdsm-4.19.17-1.el7ev.x86_64
mom-0.5.9-1.el7ev.noarch
with 500 VMs.

The overall CPU utilization on the host is very stable. The response time for 'vdsm-client Host getAllVmIoTunePolicies' is in milliseconds thanks to the cache fix (the vdsm logs show the same response time), which makes the CPU load smoother. No 'TooManyTasks' errors of the kind described in https://bugzilla.redhat.com/show_bug.cgi?id=1435218#c0 were found. Moving to verified.