Description of problem:
Stalling calls to VDSM from within a monitoring cycle might delay other important monitoring work, such as storage domain monitoring.
Version-Release number of selected component (if applicable):
Always. E.g., when a VM is shut down and VURTI needs to send a destroy to VDSM
and the call stalls, the whole VURTI thread is stuck.
Steps to Reproduce:
1. Create a timeout in the destroy call and observe that domain monitoring isn't called while the call is stalled.
Other calls to VDSM cannot be made while the VdsManager lock is held and 1 of the 2 connections to VDSM is unavailable.
The VURTI thread shouldn't stall on calls to VDSM for VM-related work.
VURTI should contain only VDS-related logic, and thus won't need to call VDSM for other VM-related calls.
The VdsManager lock should be free while VDSM calls are in progress and not yet complete (i.e. throughout the lifetime of the network use).
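To illustrate the last point, here is a minimal sketch of keeping a manager lock free for the duration of a remote call. It is not the engine's actual code: HostManagerSketch, its vmStatus field, and the remoteDestroy supplier are hypothetical stand-ins, and the real VdsManager locking may well look different.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

/** Hypothetical stand-in for a per-host manager; not the engine's VdsManager. */
public class HostManagerSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private String vmStatus = "Up";

    /** Runs a possibly slow remote call without holding the manager lock. */
    public String destroyVm(Supplier<String> remoteDestroy) {
        // 1. Hold the lock only long enough to read the state the call depends on.
        lock.lock();
        try {
            if (!"Up".equals(vmStatus)) {
                return vmStatus; // nothing to destroy
            }
        } finally {
            lock.unlock();
        }

        // 2. The network call runs with the lock released, so a stall here
        //    does not block other users of the manager.
        String result = remoteDestroy.get();

        // 3. Re-acquire the lock briefly to record the outcome.
        lock.lock();
        try {
            vmStatus = result;
            return vmStatus;
        } finally {
            lock.unlock();
        }
    }

    /** True when no thread currently holds the manager lock. */
    public boolean isLockFree() {
        return !lock.isLocked();
    }
}
```

The trade-off of this pattern is that the state can change between steps 1 and 3, so the code applying the result has to tolerate that.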
How to reproduce and verify this bug?
Exactly as said in the description: after this fix, a slow VM shutdown shouldn't stall the domain monitoring thread, and thus the host won't go into non-operational, etc.
So you can either load the system with shutdown VM calls, or hack a host to reply to shutdown with a delay, and see that the system behaves.
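The expected behavior can be modelled outside the engine with a small simulation. This is only a sketch under assumptions: StallSimulation and its parameters are hypothetical, the "destroy" is just a sleep, and the monitoring tick is a counter rather than real domain monitoring. It shows that when the slow call runs on its own thread, the periodic cycle keeps firing.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical simulation; not engine or VDSM code. */
public class StallSimulation {

    /** Returns how many monitoring ticks ran while a slow "destroy" was in flight. */
    public static int ticksDuringSlowCall(long stallMillis, long periodMillis) throws Exception {
        ScheduledExecutorService monitor = Executors.newSingleThreadScheduledExecutor();
        ExecutorService vmCalls = Executors.newSingleThreadExecutor();
        AtomicInteger domainTicks = new AtomicInteger();
        try {
            // The stalled VM call runs on its own thread, not on the monitoring thread.
            Future<?> destroy = vmCalls.submit(() -> {
                try {
                    Thread.sleep(stallMillis); // stands in for the delayed reply
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // The monitoring cycle keeps ticking at a fixed rate in the meantime.
            monitor.scheduleAtFixedRate(domainTicks::incrementAndGet,
                    0, periodMillis, TimeUnit.MILLISECONDS);
            destroy.get(); // wait for the slow call to finish
            return domainTicks.get();
        } finally {
            monitor.shutdownNow();
            vmCalls.shutdownNow();
        }
    }
}
```

If the sleep were instead executed inside the scheduled task itself, the counter would stop advancing for the length of the stall, which is the behavior this bug describes.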
Related to this thread:
I added a 120 sec delay in the shutdown and destroy VM methods in vdsm.
While the shutdown was running, refresh VDS capabilities was executed from the engine.
No issues were found.
Is this scenario fair enough?
(In reply to Eldad Marciano from comment #3)
> related to this thread.
> i added 120 sec delay in shutdown and destroy VM methods in vdsm.
> while the shutdown is running refresh vds capabilities were executed from
> the engine.
> no issues were found.
> this scenario is fair enough ?
Yes, that should do the job. Now you need to verify that the update of the pool domains is performed by the Host Monitoring cycle independently of the 120 sec stall.
Since I don't see a specific log for it, @Liron please give a direction on how to verify that VdsManager.ontimer - IrsBrokerCommand.updateVdsDomainsData(cachedVds, storagePoolId, domainsList);
is actually called.
You can create a problematic domain report (by blocking the connection to some domain, for example); you'll get an engine report that the domain is in problem. Then you can perform the operation you added the delay to and unblock the domain.
When the updateVdsDomainsData() method is called, it should log that the domain has recovered from problem.
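As a rough aid for this check, the engine log lines can be scanned for the recovery message. The sketch below is an assumption-laden helper, not part of the engine: RecoveryLogCheck is a hypothetical name, and the substring "recovered from problem" is taken from the wording in this thread rather than from a confirmed log format.

```java
import java.util.List;

/** Hypothetical helper for scanning log lines; not part of the engine. */
public class RecoveryLogCheck {

    // Assumed substring of the recovery message; adjust to the actual log wording.
    private static final String RECOVERY_MARKER = "recovered from problem";

    /** Counts the lines that mention a domain recovering from a problem. */
    public static long countRecoveries(List<String> logLines) {
        return logLines.stream()
                .filter(line -> line.contains(RECOVERY_MARKER))
                .count();
    }
}
```

The lines could be loaded with java.nio.file.Files.readAllLines on the engine log. A non-zero count from entries written while the delayed shutdown was still in progress suggests the domain update ran independently of the stall.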
Once storage was blocked, a VM shutdown was issued (with a latency of 120 sec).
Storage was then unblocked, and recovery messages were logged in the engine log:
"recovered from problem. vds: 'host20-*"
Once the delay for the VM passed, the VM shut down correctly.
Moving to VERIFIED on top of 3.6.3.