1196735 – [performance] bad getVMList output creates unnecessary calls from Engine

Bug 1196735 - [performance] bad getVMList output creates unnecessary calls from Engine

Summary: [performance] bad getVMList output creates unnecessary calls from Engine

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	vdsm
Sub Component:
Version:	3.5.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	urgent
Target Milestone:	---
Target Release:	3.5.1
Assignee:	Piotr Kliczewski
QA Contact:	Eldad Marciano
Docs Contact:
URL:
Whiteboard:	infra
Duplicates (1):	1199387 (view as bug list)
Depends On:	1196327 1202360
Blocks:	1193058 1198248 1203305
TreeView+	depends on / blocked

Reported:	2015-02-26 15:58 UTC by Omer Frenkel
Modified:	2016-02-10 19:02 UTC (History)
CC List:	26 users (show)
Fixed In Version:
Doc Type:	Known Issue
Doc Text:	An unversioned hotfix of the communication protocol between the Red Hat Enterprise Virtualization Manager and virtualization hosts resulted in exceptions in engine-log, and caused the Manager to be unable to communicate with hosts. To avoid this, when using Manager to host communication via JSON-RPC (which is new in 3.5), you must update the Manager to the new version before updating the virtualization hosts. Any host using JSON-RPC can only be upgraded once the new engine release is running. Alternatively, configure the Manager to use XML-RPC instead.
Clone Of:	1196327
Clones:	1198248 1203305 (view as bug list)
Environment:
Last Closed:	2015-04-28 18:52:13 UTC
oVirt Team:	Infra
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2015:0904	normal	SHIPPED_LIVE	vdsm 3.5.1 - bug fix and enhancement update	2015-04-28 22:50:53 UTC
oVirt gerrit	38172	master	MERGED	json-rpc: fix the Host.getVMList return value	Never
oVirt gerrit	38271	master	MERGED	json-rpc: fix the Host.getVMList return value	Never
oVirt gerrit	38304	None	None	None	Never
oVirt gerrit	38321	None	None	None	Never
oVirt gerrit	38322	None	None	None	Never
oVirt gerrit	38449	master	MERGED	api: introducing onlyUUID parameter for getVMList verb	Never
oVirt gerrit	38461	ovirt-3.5	ABANDONED	api: introducing onlyUUID parameter for getVMList verb	Never
oVirt gerrit	38476	ovirt-3.5	MERGED	api: introducing onlyUUID parameter for getVMList verb	Never
oVirt gerrit	38489	ovirt-3.5	MERGED	schema: fixing rebase issue	Never
oVirt gerrit	38513	ovirt-3.5	MERGED	getVMList: using internal function instead of vm.status	Never
oVirt gerrit	38546	ovirt-3.5	MERGED	getAllVmStats: failing to parse vmids	Never
oVirt gerrit	38573	ovirt-3.5	MERGED	getVMList: let api to manage when send short info	Never
oVirt gerrit	38679	master	MERGED	API: onlyUUID should affect only short status	Never
oVirt gerrit	38709	None	None	None	Never
oVirt gerrit	38711	None	None	None	Never
oVirt gerrit	38867	None	None	None	Never
oVirt gerrit	38868	None	None	None	Never

Description Omer Frenkel 2015-02-26 15:58:49 UTC

+++ This bug was initially created as a clone of Bug #1196327 +++

Description of problem:
Somewhere during the transition to JSON-RPC, the short output of getVMList changed to be a simple list of UUIDs.
Unfortunately, this is not what the monitoring code in Engine expects.

From a VERY high-level, monitoring does the following
- short cycle: monitoring calls getVMList, to fetch VM UUID *and* its status.
  if status chenged since last poll, something interesting happened and *then*
  monitoring calls getVMStats(UUID) to learn what happened

- long cycle: monitoring just calls getAllVmStats to get all the informations 
  about VMs.

Currently, short cycle is 3 seconds, long cycle is 15 seconds.
The whole point of this approach is to minimize the traffic and the VDSM load,
while keeping Engine able to respond quickly.

But, if VDSM retuns just the UUID, then Engine cannot know the status, and enters the recovery mode, so calls getVmStats for each VM.

This is practically equivalent to call getAllVmStats() every 3s (and _also_ every 15s), which is very wasteful.

It is important to point out that this affects *only* performance, the stats
are reported correctly so there is no functional impact.


Version-Release number of selected component (if applicable):
found in VDSM master 380713b80d124d1a19749085f477e7658468bf07, but most likely
introduced earlier

How reproducible:
100% with JSON-RPC protocol

Steps to Reproduce:
1. Configure Engine to use JSON-RPC (default)
2. Run a VM
3. snoop the traffic between VDSM and Engine, see VM.getStats() be called too often

Actual results:
On steady state, VM.getStats() get called after each short cycle, for each running VM

Expected results:
On steady state, VM.getStats() is never called

Additional info:
To be verified: Engine patch may be needd. VDSM-only fix may not be enough.

--- Additional comment from Roy Golan on 2015-02-26 12:05:10 IST ---

related to Bug 1196040 ?

--- Additional comment from Francesco Romani on 2015-02-26 12:07:24 IST ---

(In reply to Roy Golan from comment #1)
> related to Bug 1196040 ?

The only link so far is that I discovered this issue while investigating in that area, but so far 1196040 seems just noise in the logs.

Comment 3 Eyal Edri 2015-03-03 14:31:45 UTC

micahl, please remove the rhevm-3.6.0? flag, otherwise this can't be built in distgit.

Comment 6 Michal Skrivanek 2015-03-03 16:13:34 UTC

vdsm side - all in

Comment 10 Eldad Marciano 2015-03-05 13:48:52 UTC

build vt13.12 successfully installed with no crashes or abnormal behavior, and the following benchmark was taken.

build vt13.7 vs build vt13.12 (both engine & host).

use case:
sampling 600sec 1 host 1 vm.


results:

build vt13.7 reproducing the bug.
19 - getAllVmStats
76 - list
76 - getVmStats 

-----------------------------
build vt13.12 fixing the bug.
20 - getAllVmStats
116 - list
0 - getVmStats

Comment 11 Eldad Marciano 2015-03-05 13:51:05 UTC

(In reply to Eldad Marciano from comment #10)
> build vt13.12 successfully installed with no crashes or abnormal behavior,
> and the following benchmark was taken.
> 
> build vt13.7 vs build vt13.12 (both engine & host).
> 
> use case:
> sampling 600sec 1 host 1 vm.
> 
> 
> results:
> 
> build vt13.7 reproducing the bug.
> 19 - getAllVmStats
> 76 - list
> 76 - getVmStats 
> 
> -----------------------------
> build vt13.12 fixing the bug.
> 20 - getAllVmStats
> 116 - list
> 0 - getVmStats



build vt13.12 successfully installed with no crashes or abnormal behavior, and the following method statistics benchmark was taken.

build vt13.7 vs build vt13.12 (both engine & host).

use case:
sampling 600sec 1 host 1 vm.


method invocations count results:

build vt13.7 reproducing the bug.
19 - getAllVmStats
76 - list
76 - getVmStats 

-----------------------------
build vt13.12 fixing the bug.
20 - getAllVmStats
116 - list
0 - getVmStats

Comment 13 Gil Klein 2015-03-05 18:59:08 UTC

Failed QE as this fix has introduce a regression that is now documented in new BZ #1198680

Comment 14 Francesco Romani 2015-03-06 12:18:11 UTC

the bug is now handled by infra

Comment 15 Fabian Deutsch 2015-03-09 10:11:23 UTC

*** Bug 1199387 has been marked as a duplicate of this bug. ***

Comment 17 Gil Klein 2015-03-11 08:12:03 UTC

I've setup an VT13.11 engine with the latest vdsm build VT13.14, that should fix this issue. Now, I don't see the error that we saw when using VT13.12 build (Host moved to "ERROR" state) but I keep seeing this error in vdsm.log:

Thread-964::ERROR::2015-03-11 09:52:44,842::__init__::493::jsonrpc.JsonRpcServer::(_serveRequest) Internal server error
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/yajsonrpc/__init__.py", line 488, in _serveRequest
    res = method(**params)
  File "/usr/share/vdsm/rpc/Bridge.py", line 278, in _dynamicMethod
    ret = retfield(result)
  File "/usr/share/vdsm/rpc/Bridge.py", line 341, in Host_getVMList_Ret
    return [v['vmId'] for v in ret['vmList']]
TypeError: string indices must be integers, not str
Thread-964::DEBUG::2015-03-11 09:52:44,843::stompReactor::163::yajsonrpc.StompServer::(send) Sending response

Comment 18 Gil Klein 2015-03-11 08:49:06 UTC

Failed QA, Moved back to assigned based on comment #17

Comment 19 Eldad Marciano 2015-03-12 15:37:12 UTC

verified.
build vt13.15 applied on engine & vdsm both.

use case:
sampling 500 sec 1 host 1 vm.


method invocations count results:

16 - getAllVmStats
61 - list
0 - getVmStats

Comment 20 Michal Skrivanek 2015-03-16 11:11:52 UTC

reopening due to a problem in MOM which is being fixed in review 38679

Comment 21 Michal Skrivanek 2015-03-16 13:30:31 UTC

see bug 1202360 for MOM change status, this one is kept open for corresponding vdsm spec file change in 38711 (and 38679)

Comment 23 Michal Skrivanek 2015-03-20 07:54:58 UTC

all merged to ovirt-3.5

Comment 24 Eyal Edri 2015-04-01 15:34:28 UTC

this bug was missing from commit msg, probably due to the fact it was merged a while ago.

please verify this is included in the 3.5.2 branch, and move to on_qa.
(build_id: vt14.2)

Comment 25 Eldad Marciano 2015-04-22 09:47:14 UTC

tested on top of vt14.3 , both engine and vdsm.

1 host, 1 vm
duration 600 sec.

profiler analysis results:

0 - getVmStats

Comment 27 errata-xmlrpc 2015-04-28 18:52:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0904.html

Note You need to log in before you can comment on or make changes to this bug.

bazulay
bkorren
bugs
cshao
ecohen
eedri
emarcian
fdeutsch
fromani
gklein
hadong
huiwa
istein
lbopf
lpeer
lsurette
mgoldboi
michal.skrivanek
mkalinin
ofrenkel
pdwyer
rbalakri
rgolan
sherold
ycui
yeylon