Bug 1099068

Summary:	[scale] monitoring: separate VDS and VM monitoring
Product:	[oVirt] ovirt-engine	Reporter:	Roy Golan <rgolan>
Component:	General	Assignee:	Roy Golan <rgolan>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Eldad Marciano <emarcian>
Severity:	high	Docs Contact:
Priority:	high
Version:	---	CC:	bugs, gklein, istein, laravot, michal.skrivanek, mkalinin, rbalakri, rgolan, yeylon, yobshans
Target Milestone:	ovirt-3.6.0-rc	Flags:	rule-engine: ovirt-3.6.0+ ylavi: planning_ack+ rule-engine: devel_ack+ gklein: testing_ack+
Target Release:	3.6.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	ovirt-engine-3.6.0-0.0.master.20150412172306.git55ba764	Doc Type:	Enhancement
Doc Text:	Separation of VM and Host monitoring increases robustness and performance of large scale deployments. Several issues that occurred when hosts became non-responsive were fixed; such hosts no longer affect the rest of the system.	Story Points:	---
Clone Of:
Clones:	1099081		Environment:
Last Closed:	2016-03-11 07:18:46 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	Virt	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1099081

Description Roy Golan 2014-05-19 12:21:17 UTC

Description of problem:

Stalling calls to VDSM from within a monitoring cycle might delay other important monitoring work, such as storage domain monitoring.


Version-Release number of selected component (if applicable):


How reproducible:
Always. For example, when a VM was shut down and VURTI needs to send a destroy to VDSM,
and the call stalls, the whole VURTI thread is stuck.

Steps to Reproduce:
1. Introduce a timeout in the destroy call and observe that domain monitoring isn't being called while the call is stalled.

Actual results:
Other calls to VDSM cannot be made while the VdsManager lock is held, and 1 of the 2 connections to VDSM is unavailable.

Expected results:
The VURTI thread shouldn't stall on calls to VDSM for VM-related work.
VURTI should contain only VDS-related logic, so it won't need to call VDSM for VM-related operations.
The VdsManager lock should be free while VDSM calls are in progress and not yet complete (i.e. throughout the lifetime of the network use).
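
The locking expectation above can be illustrated with a small sketch. This is not ovirt-engine code: `HostManager`, `slow_vm_call`, `host_monitoring_cycle` and `cycles_completed` are hypothetical names, and a sleep stands in for a stalled VDSM call. It only shows that host-only cycles are blocked when the lock is held across the network wait, and proceed when the wait happens outside the lock.

```python
import threading
import time


class HostManager:
    """Hypothetical stand-in for a per-host manager guarded by one lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.started = threading.Event()

    def slow_vm_call(self, seconds, hold_lock):
        """Simulate a VM-related call to the host that stalls for `seconds`."""
        if hold_lock:
            with self.lock:          # lock held for the whole network wait
                self.started.set()
                time.sleep(seconds)
        else:
            self.started.set()
            time.sleep(seconds)      # network wait happens outside the lock
            with self.lock:
                pass                 # only the short state update is guarded

    def host_monitoring_cycle(self, timeout):
        """Try one host-only cycle; return True if the lock was free in time."""
        if not self.lock.acquire(timeout=timeout):
            return False
        try:
            return True              # host-only work (e.g. domain data) goes here
        finally:
            self.lock.release()


def cycles_completed(hold_lock, stall=0.5, cycles=3, timeout=0.05):
    """Count host cycles that ran while a VM call was stalled."""
    mgr = HostManager()
    vm = threading.Thread(target=mgr.slow_vm_call, args=(stall, hold_lock))
    vm.start()
    mgr.started.wait()               # make sure the stalled call is under way
    done = sum(mgr.host_monitoring_cycle(timeout) for _ in range(cycles))
    vm.join()
    return done
```

With `hold_lock=True` none of the host cycles get the lock during the stall; with `hold_lock=False` all of them do.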

Additional info:

Comment 1 Yuri Obshansky 2016-02-15 09:23:35 UTC

How to reproduce and verify this bug?

Comment 2 Roy Golan 2016-02-15 11:43:44 UTC

Exactly as stated in the description: after this fix, a slow VM shutdown shouldn't stall the domain monitoring thread, so the host won't go into non-operational, etc.

So you can either load the system with VM shutdown calls, or hack a host to reply to shutdown with a delay, and see that the system behaves.

Comment 3 Eldad Marciano 2016-02-18 12:42:29 UTC

Related to this thread:
I added a 120 sec delay to the shutdown and destroy VM methods in vdsm.
While the shutdown was running, refresh VDS capabilities was executed from the engine.
No issues were found.
Is this scenario sufficient?
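
For reference, a delay like the one described above could be injected with a wrapper along these lines. This is only a sketch: `delayed` and `FakeVm` are hypothetical names and do not correspond to actual vdsm code; the real shutdown and destroy methods would have to be located and patched on the host.

```python
import functools
import time


def delayed(seconds, sleep=time.sleep):
    """Return a decorator that stalls for `seconds` before each call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            sleep(seconds)           # simulate a slow or stalled host-side call
            return func(*args, **kwargs)
        return wrapper
    return decorator


class FakeVm:
    """Hypothetical VM object, used only to show where the delay goes."""

    @delayed(120)                    # 120 sec, as in the test described above
    def destroy(self):
        return "destroyed"
```

The `sleep` parameter exists so the wrapper can be exercised without actually waiting.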

Comment 4 Roy Golan 2016-02-21 10:47:47 UTC

(In reply to Eldad Marciano from comment #3)
> Related to this thread:
> I added a 120 sec delay to the shutdown and destroy VM methods in vdsm.
> While the shutdown was running, refresh VDS capabilities was executed from
> the engine.
> No issues were found.
> Is this scenario sufficient?

Yes, that should do the job. Now you need to verify that the update of the pool domains is performed by the Host Monitoring cycle independently of the 120 sec stall.
Since I don't see a specific log for it, @Liron, please give a direction on how to verify that VdsManager.ontimer - IrsBrokerCommand.updateVdsDomainsData(cachedVds, storagePoolId, domainsList);
is actually called.

Comment 5 Liron Aravot 2016-02-21 14:19:20 UTC

You can create a problematic domain report (for example, by blocking the connection to some domain); you'll get an engine report that the domain is in problem. Then you can perform the operation you added the delay to and unblock the domain.
When the updateVdsDomainsData() method is called, it should log that the domain has recovered from problem.
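
A minimal sketch for checking the engine log for that recovery message follows. The helper name `find_recoveries` is hypothetical, and the marker text "recovered from problem" is an assumption taken from the wording in this thread; the exact log format and the log file location are not specified here.

```python
def find_recoveries(lines, marker="recovered from problem"):
    """Return the log lines that contain the recovery marker."""
    return [line.rstrip("\n") for line in lines if marker in line]
```

It accepts any iterable of lines, so an open log file object can be passed directly; compare the timestamps of the matches against the 120 sec stall window.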

Comment 6 Eldad Marciano 2016-02-23 11:40:27 UTC

Once storage was blocked, a VM was shut down (with a latency of 120 sec).
Storage was then unblocked, and recovery messages were logged in the engine log:
"recovered from problem. vds: 'host20-*"
Once the delay for the VM passed, the VM shut down correctly.

Moving to verified on top of 3.6.3.