Bug 1099068 - [scale] monitoring: separate VDS and VM monitoring
Summary: [scale] monitoring: separate VDS and VM monitoring
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: ---
Hardware: Unspecified
OS: Unspecified
Target Milestone: ovirt-3.6.0-rc
: 3.6.0
Assignee: Roy Golan
QA Contact: Eldad Marciano
Depends On:
Blocks: 1099081
Reported: 2014-05-19 12:21 UTC by Roy Golan
Modified: 2016-03-11 07:18 UTC (History)

Fixed In Version: ovirt-engine-3.6.0-0.0.master.20150412172306.git55ba764
Doc Type: Enhancement
Doc Text:
Separating VM monitoring from host monitoring improves the robustness and performance of large-scale deployments. Several issues that occurred when hosts became non-responsive were fixed; such hosts no longer affect the rest of the system.
Clone Of:
: 1099081 (view as bug list)
Last Closed: 2016-03-11 07:18:46 UTC
oVirt Team: Virt
rule-engine: ovirt-3.6.0+
ylavi: planning_ack+
rule-engine: devel_ack+
gklein: testing_ack+


System | ID | Private | Priority | Status | Summary | Last Updated
oVirt gerrit | 27917 | 0 | master | MERGED | core: cleanup - order ctr and important public methods first | Never
oVirt gerrit | 27918 | 0 | master | MERGED | core: monitoring - extract saveToDb and afterRefresh of VMs | Never
oVirt gerrit | 27919 | 0 | master | MERGED | core: monitoring - rename VURTI to HostMonitoring | Never
oVirt gerrit | 27920 | 0 | master | MERGED | core: monitoring - split VDS and VMs monitoring apart | Never
oVirt gerrit | 27921 | 0 | master | MERGED | core: monitoring - make host monitoring fetch mem commited from vms monitoring | Never
oVirt gerrit | 27922 | 0 | master | ABANDONED | core: monitoring - Host Monitoring handles Vms Monitoring exeption | Never
oVirt gerrit | 28173 | 0 | master | MERGED | core: monitoring - separate Host Monitoring and Vm Moniting | Never
oVirt gerrit | 28661 | 0 | master | MERGED | core: VM monitoring - no need to handle exception | Never
oVirt gerrit | 28662 | 0 | master | MERGED | core: VM Monitoring abstract fetching/analyzing/monitoring | Never
oVirt gerrit | 32586 | 0 | master | MERGED | core: monitoring - make Vm a managed resource | Never
oVirt gerrit | 35521 | 0 | master | MERGED | core: Monitoring - refactor Host newtork error handling | Never
oVirt gerrit | 35741 | 0 | None | None | None | Never

Description Roy Golan 2014-05-19 12:21:17 UTC
Description of problem:

Stalling calls to VDSM from within a monitoring cycle might delay other important monitoring work, such as storage domain monitoring.

Version-Release number of selected component (if applicable):

How reproducible:
always - e.g. when a VM was shut down and VURTI needs to send a destroy to VDSM:
if the call stalls, the whole VURTI thread is stuck

Steps to Reproduce:
1. Create a timeout in the destroy call and observe that domain monitoring isn't called while the call is stalled

Actual results:
Other calls to VDSM can't be made while the VdsManager lock is held and 1 of the 2 connections to VDSM is unavailable.

Expected results:
The VURTI thread shouldn't stall on calls to VDSM for VM-related work.
VURTI should contain only VDS-related logic, so it won't need to call VDSM for VM-related operations.
The VdsManager lock should be free while VDSM calls are in progress and not yet complete (i.e. throughout the lifetime of the network use).
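
The locking behaviour asked for here can be illustrated with a minimal sketch. It is not the engine's code: LockScopeSketch, hostLock, SlowCall and refresh are hypothetical names, and the "VDSM call" is simulated by a latch. It only shows the idea of holding the per-host lock around the in-memory update rather than across the slow call, so another thread can still take the lock while the call is stalled.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockScopeSketch {
    // Hypothetical stand-in for the per-host VdsManager lock.
    private final ReentrantLock hostLock = new ReentrantLock();
    private String cachedStatus = "Unknown";

    /** Hypothetical slow call to VDSM, e.g. a destroy that stalls. */
    interface SlowCall {
        String run() throws InterruptedException;
    }

    /** Holds the lock only around the in-memory update, never across the slow call. */
    public void refresh(SlowCall call) throws InterruptedException {
        String result = call.run(); // no lock held while waiting on the "network"
        hostLock.lock();
        try {
            cachedStatus = result;  // short critical section
        } finally {
            hostLock.unlock();
        }
    }

    /** Returns true if the calling thread could take the lock within the timeout. */
    public boolean lockAvailable(long timeoutMs) throws InterruptedException {
        if (hostLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
            hostLock.unlock();
            return true;
        }
        return false;
    }

    /** Stalls a simulated call in a background thread and checks the lock stays free meanwhile. */
    public static boolean demo() {
        LockScopeSketch host = new LockScopeSketch();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread vmThread = new Thread(() -> {
            try {
                host.refresh(() -> {
                    started.countDown();
                    release.await(); // simulated stall
                    return "Up";
                });
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        vmThread.start();
        try {
            started.await();
            boolean free = host.lockAvailable(100);
            release.countDown();
            vmThread.join();
            return free;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            release.countDown();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("lock free during stalled call: " + demo());
    }
}
```

If the lock were instead taken before call.run(), lockAvailable would time out for as long as the call stalls, which is the situation described under "Actual results".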

Additional info:

Comment 1 Yuri Obshansky 2016-02-15 09:23:35 UTC
How to reproduce and verify this bug?

Comment 2 Roy Golan 2016-02-15 11:43:44 UTC
Exactly as described: after this fix, a slow VM shutdown shouldn't stall the domain monitoring thread, so the host won't go into non-operational, etc.

So you can either load the system with VM shutdown calls or hack a host to reply to shutdown with a delay, and check that the system behaves.
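
The scenario can also be modelled outside the engine. The sketch below is an assumption-laden toy, not oVirt code: SeparateMonitoringSketch and its two executors are hypothetical, and the delayed shutdown is simulated by a latch. It shows that when VM monitoring and host monitoring run on separate threads, a stalled VM call does not stop host monitoring cycles.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SeparateMonitoringSketch {

    /** Returns how many host monitoring cycles ran while the simulated VM call was stalled. */
    public static int hostCyclesDuringStall(int wantedCycles) {
        // Hypothetical executors: one per kind of monitoring.
        ScheduledExecutorService hostMonitoring = Executors.newSingleThreadScheduledExecutor();
        ScheduledExecutorService vmMonitoring = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger hostCycles = new AtomicInteger();
        CountDownLatch vmStalled = new CountDownLatch(1);
        CountDownLatch stallOver = new CountDownLatch(1);
        CountDownLatch enoughCycles = new CountDownLatch(wantedCycles);
        try {
            // VM monitoring: simulated shutdown/destroy call that does not return.
            vmMonitoring.execute(() -> {
                vmStalled.countDown();
                try {
                    stallOver.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            vmStalled.await();

            // Host monitoring: keeps cycling on its own thread.
            hostMonitoring.scheduleAtFixedRate(() -> {
                hostCycles.incrementAndGet();
                enoughCycles.countDown();
            }, 0, 10, TimeUnit.MILLISECONDS);

            // Wait for the wanted number of cycles while the VM call is still stalled.
            enoughCycles.await(5, TimeUnit.SECONDS);
            return hostCycles.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return hostCycles.get();
        } finally {
            stallOver.countDown();
            hostMonitoring.shutdownNow();
            vmMonitoring.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("host cycles during stall: " + hostCyclesDuringStall(3));
    }
}
```

Submitting both tasks to a single-threaded executor instead would reproduce the original problem in miniature: the host cycle would never run until the stall ends.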

Comment 3 Eldad Marciano 2016-02-18 12:42:29 UTC
Related to this thread:
I added a 120 sec delay to the shutdown and destroy VM methods in vdsm.
While the shutdown was running, refresh VDS capabilities was executed from the engine.
No issues were found.
Is this scenario sufficient?

Comment 4 Roy Golan 2016-02-21 10:47:47 UTC
(In reply to Eldad Marciano from comment #3)
> Related to this thread:
> I added a 120 sec delay to the shutdown and destroy VM methods in vdsm.
> While the shutdown was running, refresh VDS capabilities was executed from
> the engine.
> No issues were found.
> Is this scenario sufficient?

Yes, that should do the job. Now you need to verify that the pool domains are updated by the Host Monitoring cycle independently of the 120 sec stall.
Since I don't see a specific log for it: @Liron, please give some direction on how to verify that VdsManager.ontimer - IrsBrokerCommand.updateVdsDomainsData(cachedVds, storagePoolId, domainsList);
is actually called.

Comment 5 Liron Aravot 2016-02-21 14:19:20 UTC
You can create a problematic domain report (for example, by blocking the connection to some domain); the engine will then report that the domain is in problem. Next, perform the operation you added the delay to, and unblock the domain.
When the updateVdsDomainsData() method is called, it should log that the domain has recovered from problem.
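
In practice this check amounts to searching the engine log for the recovery message. A minimal sketch follows, assuming only that the message contains the phrase "recovered from problem" as worded in this thread; the class name, the method and the sample lines are hypothetical, and the full engine message format is not assumed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class RecoveryLogCheck {
    // Phrase taken from the wording in this thread; the rest of the message is not assumed.
    static final String MARKER = "recovered from problem";

    /** Returns the log lines that mention a domain recovery. */
    public static List<String> recoveryLines(List<String> logLines) {
        return logLines.stream()
                .filter(line -> line.contains(MARKER))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, not real engine output.
        List<String> sample = Arrays.asList(
                "domain 'example' reported in problem. vds: 'host-a'",
                "domain 'example' recovered from problem. vds: 'host-a'");
        for (String line : recoveryLines(sample)) {
            System.out.println(line);
        }
    }
}
```

To tie the result to the stall, compare the timestamp of the matching line with the window in which the delayed shutdown was still running.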

Comment 6 Eldad Marciano 2016-02-23 11:40:27 UTC
Once storage was blocked, a VM shutdown was issued (with a latency of 120 sec).
Storage was then unblocked, and recovery messages were logged in the engine log:
"recovered from problem. vds: 'host20-*"
Once the delay for the VM passed, the VM shut down correctly.

Moving to verified on top of 3.6.3.
