Description of problem:
Currently getAllVmIoTunePolicies is not executed as a periodic task; it runs directly on the JSON-RPC executor. If any storage domain backing the VMs goes down, these calls can block for a long time. All JSON-RPC worker threads then end up occupied with getAllVmIoTunePolicies, so no further tasks from the engine are served and the host becomes non-responsive. This happens even if only the ISO storage domain goes away, when VMs have CDs attached from that domain.

I was able to reproduce this in a 4.1 environment by starting 30 VMs with CDs attached on a host and then blocking the connection between the NFS server and the host. I edited the code to print the JsonRpcServer executor state, just as we already do for the periodic threads, and all 8 workers are blocked in a getAllVmIoTunePolicies task:

====
2017-07-27 21:04:10,721+0530 DEBUG (jsonrpc/3) [jsonrpc.JsonRpcServer] Calling 'Host.getAllVmIoTunePolicies' in bridge with {} (__init__:532)
2017-07-27 21:04:40,739+0530 DEBUG (jsonrpc/7) [jsonrpc.JsonRpcServer] Calling 'Host.getAllVmIoTunePolicies' in bridge with {} (__init__:532)
2017-07-27 21:05:10,746+0530 DEBUG (jsonrpc/5) [jsonrpc.JsonRpcServer] Calling 'Host.getAllVmIoTunePolicies' in bridge with {} (__init__:532)
2017-07-27 21:05:40,764+0530 DEBUG (jsonrpc/1) [jsonrpc.JsonRpcServer] Calling 'Host.getAllVmIoTunePolicies' in bridge with {} (__init__:532)
2017-07-27 21:06:10,771+0530 DEBUG (jsonrpc/6) [jsonrpc.JsonRpcServer] Calling 'Host.getAllVmIoTunePolicies' in bridge with {} (__init__:532)
2017-07-27 21:06:40,795+0530 DEBUG (jsonrpc/0) [jsonrpc.JsonRpcServer] Calling 'Host.getAllVmIoTunePolicies' in bridge with {} (__init__:532)
2017-07-27 21:07:10,820+0530 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer] Calling 'Host.getAllVmIoTunePolicies' in bridge with {} (__init__:532)
2017-07-27 21:07:40,832+0530 DEBUG (jsonrpc/2) [jsonrpc.JsonRpcServer] Calling 'Host.getAllVmIoTunePolicies' in bridge with {} (__init__:532)
2017-07-27 21:09:24,879+0530 DEBUG (JsonRpcServer) [Executor] custom:executor state: count=8 workers=set([<Worker name=jsonrpc/5 running Task(callable=<functools.partial object at 0x7fd8bc6b7100>, timeout=None) task#=78 at 0x3ae4450>, <Worker name=jsonrpc/0 running Task(callable=<functools.partial object at 0x41ed7e0>, timeout=None) task#=66 at 0x3a5a290>, <Worker name=jsonrpc/4 running Task(callable=<functools.partial object at 0x41edd08>, timeout=None) task#=68 at 0x3ae40d0>, <Worker name=jsonrpc/6 running Task(callable=<functools.partial object at 0x3ee8c00>, timeout=None) task#=64 at 0x3ace7d0>, <Worker name=jsonrpc/3 running Task(callable=<functools.partial object at 0x41edf18>, timeout=None) task#=60 at 0x3aced10>, <Worker name=jsonrpc/1 running Task(callable=<functools.partial object at 0x3ee8c58>, timeout=None) task#=71 at 0x3a5a550>, <Worker name=jsonrpc/2 running Task(callable=<functools.partial object at 0x7fd8bc6b9418>, timeout=None) task#=77 at 0x3ac17d0>, <Worker name=jsonrpc/7 running Task(callable=<functools.partial object at 0x7fd8bc25b7e0>, timeout=None) task#=50 at 0x3ae4950>]) (executor:150)
====
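For reference, the one-off debug change that produced the dump above was along these lines. This is a minimal sketch, not the actual patch; the attribute name `_workers` is an assumption about vdsm's executor internals and may differ in the real module:

import threading

def dump_executor_state(executor, log, interval=30.0):
    """Periodically log how many workers exist and what each is running."""
    def _dump():
        # `_workers` is an assumed attribute holding the worker set;
        # the real vdsm Executor implementation may name it differently.
        workers = getattr(executor, '_workers', set())
        log.debug('custom:executor state: count=%d workers=%s',
                  len(workers), workers)
        timer = threading.Timer(interval, _dump)
        timer.daemon = True
        timer.start()
    _dump()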
The hang is visible even when querying libvirt directly with virsh:

==
# time virsh -r blkdeviotune test2e hdc --live
^C

real    3m50.048s
user    0m0.009s
sys     0m0.010s
==

Version-Release number of selected component (if applicable):
vdsm-4.19.10.1-1.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start around 30 VMs on a host and block the NFS connection between the host and the storage.
2. Monitor the JsonRpc executor; all the worker threads will be blocked in getAllVmIoTunePolicies.

Actual results:
The JsonRpc executor is blocked for a long time by getAllVmIoTunePolicies. We may have to call this from the periodic executor, which can discard blocked workers (see the sketch under Additional info below).

Expected results:
The JsonRpc executor should not be blocked for a long time.

Additional info:
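The direction suggested under Actual results could look roughly like the sketch below: refresh the io-tune data from a periodic, discardable task, and have the JSON-RPC verb serve the last cached snapshot instead of touching libvirt directly. All names here (refresh_io_tune_policies, cif.getVMs, vm.io_tune_policy) are illustrative assumptions, not vdsm's actual fix:

import threading

_cache = {}
_cache_lock = threading.Lock()

def refresh_io_tune_policies(cif):
    """Periodic task: query libvirt for each VM's io tune policy.
    If storage is down this may still block, but a periodic worker
    stuck past its timeout is discarded and replaced, instead of
    wedging a JSON-RPC worker."""
    policies = {}
    for vm_id, vm in cif.getVMs().items():      # assumed helper
        policies[vm_id] = vm.io_tune_policy()   # may block on storage
    with _cache_lock:
        _cache.clear()
        _cache.update(policies)

def get_all_vm_io_tune_policies():
    """JSON-RPC verb: return the cached snapshot without touching
    libvirt, so it can never block the RPC workers."""
    with _cache_lock:
        return dict(_cache)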
Duplicate of Bug 1443654. Keeping it open for verification by the bugzilla owner.
Indeed this is fixed as per Bug 1443654, and I can't reproduce it with vdsm-4.19.24-1.el7ev.x86_64. Closing this.

*** This bug has been marked as a duplicate of bug 1443654 ***