Bug 1475971

Summary: getAllVmIoTunePolicies can get blocked making executor queue full and host non responsive
Product: Red Hat Enterprise Virtualization Manager
Reporter: nijin ashok <nashok>
Component: vdsm
Assignee: Dan Kenigsberg <danken>
Status: CLOSED DUPLICATE
QA Contact: Raz Tamir <ratamir>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.1.3
CC: bazulay, lsurette, nashok, rhodain, srevivo, ycui, ykaul
Target Milestone: ---
Flags: lsvaty: testing_plan_complete-
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-07-28 08:31:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description nijin ashok 2017-07-27 16:24:52 UTC
Description of problem:

Currently getAllVmIoTunePolicies is not executed as a periodic task. But if any of the storage domains backing these VMs goes down, these calls can get blocked for a long time. Engine requests then stop being served because all worker threads are occupied by getAllVmIoTunePolicies calls, and the host becomes non-responsive since it can no longer process requests from the manager. This happens even when only the ISO storage domain goes away, if VMs have CDs attached from that domain.

I was able to replicate this in a 4.1 environment by starting 30 VMs with CDs attached on a host and then blocking the connection between the NFS server and the host. I edited the code to print the JsonRpcServer executor state, just as we do for the periodic threads, and I can see that all 8 workers are blocked in the getAllVmIoTunePolicies task.


====
2017-07-27 21:04:10,721+0530 DEBUG (jsonrpc/3) [jsonrpc.JsonRpcServer] Calling 'Host.getAllVmIoTunePolicies' in bridge with {} (__init__:532)
2017-07-27 21:04:40,739+0530 DEBUG (jsonrpc/7) [jsonrpc.JsonRpcServer] Calling 'Host.getAllVmIoTunePolicies' in bridge with {} (__init__:532)
2017-07-27 21:05:10,746+0530 DEBUG (jsonrpc/5) [jsonrpc.JsonRpcServer] Calling 'Host.getAllVmIoTunePolicies' in bridge with {} (__init__:532)
2017-07-27 21:05:40,764+0530 DEBUG (jsonrpc/1) [jsonrpc.JsonRpcServer] Calling 'Host.getAllVmIoTunePolicies' in bridge with {} (__init__:532)
2017-07-27 21:06:10,771+0530 DEBUG (jsonrpc/6) [jsonrpc.JsonRpcServer] Calling 'Host.getAllVmIoTunePolicies' in bridge with {} (__init__:532)
2017-07-27 21:06:40,795+0530 DEBUG (jsonrpc/0) [jsonrpc.JsonRpcServer] Calling 'Host.getAllVmIoTunePolicies' in bridge with {} (__init__:532)
2017-07-27 21:07:10,820+0530 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer] Calling 'Host.getAllVmIoTunePolicies' in bridge with {} (__init__:532)
2017-07-27 21:07:40,832+0530 DEBUG (jsonrpc/2) [jsonrpc.JsonRpcServer] Calling 'Host.getAllVmIoTunePolicies' in bridge with {} (__init__:532)


2017-07-27 21:09:24,879+0530 DEBUG (JsonRpcServer) [Executor] custom:executor state: count=8 workers=set([<Worker name=jsonrpc/5 running Task(callable=<functools.partial object at 0x7fd8bc6b7100>, timeout=None) task#=78 at 0x3ae4450>, <Worker name=jsonrpc/0 running Task(callable=<functools.partial object at 0x41ed7e0>, timeout=None) task#=66 at 0x3a5a290>, <Worker name=jsonrpc/4 running Task(callable=<functools.partial object at 0x41edd08>, timeout=None) task#=68 at 0x3ae40d0>, <Worker name=jsonrpc/6 running Task(callable=<functools.partial object at 0x3ee8c00>, timeout=None) task#=64 at 0x3ace7d0>, <Worker name=jsonrpc/3 running Task(callable=<functools.partial object at 0x41edf18>, timeout=None) task#=60 at 0x3aced10>, <Worker name=jsonrpc/1 running Task(callable=<functools.partial object at 0x3ee8c58>, timeout=None) task#=71 at 0x3a5a550>, <Worker name=jsonrpc/2 running Task(callable=<functools.partial object at 0x7fd8bc6b9418>, timeout=None) task#=77 at 0x3ac17d0>, <Worker name=jsonrpc/7 running Task(callable=<functools.partial object at 0x7fd8bc25b7e0>, timeout=None) task#=50 at 0x3ae4950>]) (executor:150)
====
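The failure mode in the executor state above can be illustrated with a minimal, self-contained model. This is not vdsm code, just a sketch: a fixed pool of workers drains a shared queue, and when every worker picks up a task that blocks on unreachable storage, no worker is left to serve even a trivial request.

```python
# Sketch only (not vdsm's executor): a fixed worker pool drained by
# blocking tasks. All names here are illustrative.
import queue
import threading
import time

def run_executor(num_workers, tasks, drain_seconds):
    task_queue = queue.Queue()
    served = []  # tasks that actually completed
    for t in tasks:
        task_queue.put(t)

    def worker():
        while True:
            try:
                name, duration = task_queue.get(timeout=0.1)
            except queue.Empty:
                return
            time.sleep(duration)  # a call stuck on dead storage never returns
            served.append(name)

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    time.sleep(drain_seconds)
    return served

# 8 workers, 8 blocking getAllVmIoTunePolicies calls, then a request that
# should be trivial to answer -- but no worker is free to pick it up.
blocking = [("getAllVmIoTunePolicies", 999)] * 8
result = run_executor(8, blocking + [("Host.ping", 0)], drain_seconds=0.5)
print(result)  # [] -- not even the cheap ping was served
```

Because the queue is FIFO and there are exactly as many blocking tasks as workers, the cheap request is guaranteed to starve, which matches the 8 stuck jsonrpc workers in the log.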

Even if I use the virsh command directly, I can see that it hangs for a long time.

==
time virsh -r blkdeviotune test2e hdc --live
^C

real	3m50.048s
user	0m0.009s
sys	0m0.010s
==



Version-Release number of selected component (if applicable):
vdsm-4.19.10.1-1.el7ev.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Start around 30 VMs on a host and block the NFS connection between the host and the storage.

2. Monitor the JsonRpc executor. All the worker threads will be blocked in getAllVmIoTunePolicies.

Actual results:

The JsonRpc executor is blocked for a long time by getAllVmIoTunePolicies. We may have to run this call from the periodic executor, which has the ability to discard stuck workers.
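The suggested direction can be sketched as follows. This is an illustration of the discard idea, not vdsm's actual periodic executor: each task gets a timeout, and a worker that exceeds it is abandoned so the pool keeps serving other requests. The helper name `run_with_discard` is hypothetical.

```python
# Sketch only: run a task on a disposable thread and give up on it
# after `timeout` seconds instead of letting it block the pool forever.
import threading
import time

def run_with_discard(task, timeout):
    """Return True if the task finished in time, False if it was discarded."""
    done = threading.Event()

    def wrapper():
        task()
        done.set()

    t = threading.Thread(target=wrapper, daemon=True)
    t.start()
    return done.wait(timeout)  # False -> worker abandoned (discarded)

# A call that blocks "forever" (storage unreachable) is discarded after
# 0.2 s, while a fast call completes normally.
stuck = run_with_discard(lambda: time.sleep(999), timeout=0.2)
fast = run_with_discard(lambda: None, timeout=0.2)
print(stuck, fast)  # False True
```

With this scheme a stuck getAllVmIoTunePolicies call costs one discarded worker instead of freezing the whole JsonRpc pool.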

Expected results:

The JsonRpc executor should not be blocked for a long time.


Additional info:

Comment 2 Roman Hodain 2017-07-28 06:03:06 UTC
Duplicate of Bug 1443654
Keeping it open for verification by the bugzilla owner.

Comment 3 nijin ashok 2017-07-28 08:31:27 UTC
Indeed, this is fixed by Bug 1443654 and I can't reproduce it with vdsm-4.19.24-1.el7ev.x86_64.

Closing this.

*** This bug has been marked as a duplicate of bug 1443654 ***