1099068 – [scale] monitoring: separate VDS and VM monitoring

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	ovirt-engine
Classification:	oVirt
Component:	General
Sub Component:
Version:	---
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	ovirt-3.6.0-rc
Target Release:	3.6.0
Assignee:	Roy Golan
QA Contact:	Eldad Marciano
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1099081

Reported:	2014-05-19 12:21 UTC by Roy Golan
Modified:	2016-03-11 07:18 UTC
CC List:	10 users
Fixed In Version:	ovirt-engine-3.6.0-0.0.master.20150412172306.git55ba764
Clone Of:
Clones:	1099081
Environment:
Last Closed:	2016-03-11 07:18:46 UTC
oVirt Team:	Virt
Embargoed:
Dependent Products:
Flags:	rule-engine: ovirt-3.6.0+ ylavi: planning_ack+ rule-engine: devel_ack+ gklein: testing_ack+

Links
System	ID	Priority	Status	Summary	Last Updated
oVirt gerrit	27917	master	MERGED	core: cleanup - order ctr and important public methods first	Never
oVirt gerrit	27918	master	MERGED	core: monitoring - extract saveToDb and afterRefresh of VMs	Never
oVirt gerrit	27919	master	MERGED	core: monitoring - rename VURTI to HostMonitoring	Never
oVirt gerrit	27920	master	MERGED	core: monitoring - split VDS and VMs monitoring apart	Never
oVirt gerrit	27921	master	MERGED	core: monitoring - make host monitoring fetch mem commited from vms monitoring	Never
oVirt gerrit	27922	master	ABANDONED	core: monitoring - Host Monitoring handles Vms Monitoring exeption	Never
oVirt gerrit	28173	master	MERGED	core: monitoring - separate Host Monitoring and Vm Moniting	Never
oVirt gerrit	28661	master	MERGED	core: VM monitoring - no need to handle exception	Never
oVirt gerrit	28662	master	MERGED	core: VM Monitoring abstract fetching/analyzing/monitoring	Never
oVirt gerrit	32586	master	MERGED	core: monitoring - make Vm a managed resource	Never
oVirt gerrit	35521	master	MERGED	core: Monitoring - refactor Host newtork error handling	Never
oVirt gerrit	35741	None	None	None	Never

Description Roy Golan 2014-05-19 12:21:17 UTC

Description of problem:

Stalling calls to VDSM from within a monitoring cycle might delay other important monitoring work, such as storage domain monitoring.


Version-Release number of selected component (if applicable):


How reproducible:
Always. For example, when a VM was shut down and VURTI needs to send a destroy to VDSM
and the call stalls, the whole VURTI thread is stuck.

Steps to Reproduce:
1. Create a timeout in the destroy call and observe that domain monitoring is not called while the call is stalled.

Actual results:
Other calls to VDSM could not be made while the VdsManager lock was held and one of the two connections to VDSM was unavailable.

Expected results:
The VURTI thread shouldn't stall on calls to VDSM for VM-related work.
VURTI shall contain only VDS-related logic and thus won't need to call VDSM for VM-related operations.
The VdsManager lock should be free while VDSM calls are in progress and not yet complete (i.e. throughout the lifetime of the network use).
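
The expected behaviour above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the engine's Java code; `HostState`, `slow_vm_call`, `vm_monitoring`, `host_monitoring` and `run_demo` are hypothetical names introduced only for this illustration. It assumes two independent monitoring threads and shows the lock being held only for short in-memory updates, never across the simulated network call.

```python
import threading
import time


class HostState:
    """Hypothetical stand-in for per-host state guarded by a manager lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.domain_refreshes = 0
        self.vm_calls_done = 0


def slow_vm_call(delay):
    """Simulates a stalled VM-related call (e.g. a destroy) to the host."""
    time.sleep(delay)


def vm_monitoring(state, delay):
    # The simulated network call runs WITHOUT the lock held.
    slow_vm_call(delay)
    # The lock is taken only for the short in-memory update.
    with state.lock:
        state.vm_calls_done += 1


def host_monitoring(state, cycles, interval):
    # Host-only work (e.g. domain data refresh) on its own thread.
    for _ in range(cycles):
        with state.lock:
            state.domain_refreshes += 1
        time.sleep(interval)


def run_demo(vm_delay=1.0, cycles=3, interval=0.01):
    state = HostState()
    vm_thread = threading.Thread(target=vm_monitoring, args=(state, vm_delay))
    host_thread = threading.Thread(
        target=host_monitoring, args=(state, cycles, interval)
    )
    vm_thread.start()
    host_thread.start()
    host_thread.join()
    # Snapshot taken once host monitoring has finished all its cycles.
    refreshes = state.domain_refreshes
    vm_still_running = vm_thread.is_alive()
    vm_thread.join()
    return refreshes, vm_still_running, state.vm_calls_done
```

With the default arguments, host monitoring is expected to finish its cycles while the simulated VM call is still stalled, which is the property this bug asks for.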

Additional info:

Comment 1 Yuri Obshansky 2016-02-15 09:23:35 UTC

How to reproduce and verify this bug?

Comment 2 Roy Golan 2016-02-15 11:43:44 UTC

Exactly as stated in the description: after this fix, a slow VM shutdown shouldn't stall the domain monitoring thread, so the host won't go into non-operational, etc.

So you can either load the system with shutdown VM calls, or hack a host to reply to shutdown with a delay, and check that the system behaves.

Comment 3 Eldad Marciano 2016-02-18 12:42:29 UTC

Related to this thread:
I added a 120 sec delay to the shutdown and destroy VM methods in vdsm.
While the shutdown was running, refresh VDS capabilities was executed from the engine.
No issues were found.
Is this scenario fair enough?
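
One way such a delay could be injected is sketched below. This is only an illustration of the idea, not the change that was actually made to vdsm: `with_delay` and `FakeVm` are hypothetical names, and the real vdsm classes and method signatures are not shown in this bug.

```python
import functools
import time


def with_delay(seconds, sleep=time.sleep):
    """Return a decorator that stalls the wrapped call before running it.

    `sleep` is injectable so the delay can be replaced in tests.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            sleep(seconds)  # simulate a slow host-side operation
            return func(*args, **kwargs)
        return wrapper
    return decorator


class FakeVm:
    """Hypothetical stand-in; NOT the real vdsm VM class."""

    @with_delay(0.1)
    def destroy(self):
        return "destroyed"
```

For the scenario in this comment the delay would be 120 seconds, applied to the shutdown and destroy paths.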

Comment 4 Roy Golan 2016-02-21 10:47:47 UTC

(In reply to Eldad Marciano from comment #3)
> Related to this thread:
> I added a 120 sec delay to the shutdown and destroy VM methods in vdsm.
> While the shutdown was running, refresh VDS capabilities was executed from
> the engine.
> No issues were found.
> Is this scenario fair enough?

Yes, that should do the job. Now you need to verify that the update of the pool domains is performed by the Host Monitoring cycle independently of the 120 sec stall.
Since I don't see a specific log for it, @Liron, please give a direction on how to verify that VdsManager.ontimer - IrsBrokerCommand.updateVdsDomainsData(cachedVds, storagePoolId, domainsList);
is actually called.

Comment 5 Liron Aravot 2016-02-21 14:19:20 UTC

You can create a problematic domain report (for example, by blocking the connection to some domain); the engine will then report that the domain is in problem. Then you can perform the operation you added the delay to and unblock the domain.
When the updateVdsDomainsData() method is called, it should log that the domain has recovered from problem.
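
The log check could be scripted roughly as follows. This is a sketch under assumptions: the marker text is taken from the "recovered from problem" wording used in this thread, the exact log line format is not specified here, and `find_recovery_lines` is a hypothetical helper.

```python
def find_recovery_lines(log_lines, marker="recovered from problem"):
    """Return the log lines that contain the recovery marker.

    `marker` is an assumption based on the wording in this bug;
    adjust it to the message your engine version actually logs.
    """
    return [line.rstrip("\n") for line in log_lines if marker in line]


def scan_log(path):
    """Read a log file and return its recovery lines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_recovery_lines(handle)
```

Calling `scan_log` on the engine log while the delayed shutdown is still in progress should return at least one line if domain monitoring kept running during the stall.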

Comment 6 Eldad Marciano 2016-02-23 11:40:27 UTC

Once the storage was blocked, a VM shutdown was issued (with a latency of 120 sec).
The storage was then unblocked, and recovery messages were logged in the engine log:
"recovered from problem. vds: 'host20-*"
Once the delay for the VM had passed, the VM shut down correctly.

Moving to verified on top of 3.6.3.
