Bug 1196735 - [performance] bad getVMList output creates unnecessary calls from Engine
Summary: [performance] bad getVMList output creates unnecessary calls from Engine
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
high
urgent
Target Milestone: ---
: 3.5.1
Assignee: Piotr Kliczewski
QA Contact: Eldad Marciano
URL:
Whiteboard: infra
: 1199387 (view as bug list)
Depends On: 1196327 1202360
Blocks: 1193058 1198248 1203305
TreeView+ depends on / blocked
 
Reported: 2015-02-26 15:58 UTC by Omer Frenkel
Modified: 2016-02-10 19:02 UTC (History)
26 users (show)

Fixed In Version:
Doc Type: Known Issue
Doc Text:
An unversioned hotfix of the communication protocol between the Red Hat Enterprise Virtualization Manager and virtualization hosts resulted in exceptions in engine-log, and caused the Manager to be unable to communicate with hosts. To avoid this, when using Manager to host communication via JSON-RPC (which is new in 3.5), you must update the Manager to the new version before updating the virtualization hosts. Any host using JSON-RPC can only be upgraded once the new engine release is running. Alternatively, configure the Manager to use XML-RPC instead.
Clone Of: 1196327
: 1198248 1203305 (view as bug list)
Environment:
Last Closed: 2015-04-28 18:52:13 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2015:0904 0 normal SHIPPED_LIVE vdsm 3.5.1 - bug fix and enhancement update 2015-04-28 22:50:53 UTC
oVirt gerrit 38172 0 master MERGED json-rpc: fix the Host.getVMList return value Never
oVirt gerrit 38271 0 master MERGED json-rpc: fix the Host.getVMList return value Never
oVirt gerrit 38304 0 None None None Never
oVirt gerrit 38321 0 None None None Never
oVirt gerrit 38322 0 None None None Never
oVirt gerrit 38449 0 master MERGED api: introducing onlyUUID parameter for getVMList verb Never
oVirt gerrit 38461 0 ovirt-3.5 ABANDONED api: introducing onlyUUID parameter for getVMList verb Never
oVirt gerrit 38476 0 ovirt-3.5 MERGED api: introducing onlyUUID parameter for getVMList verb Never
oVirt gerrit 38489 0 ovirt-3.5 MERGED schema: fixing rebase issue Never
oVirt gerrit 38513 0 ovirt-3.5 MERGED getVMList: using internal function instead of vm.status Never
oVirt gerrit 38546 0 ovirt-3.5 MERGED getAllVmStats: failing to parse vmids Never
oVirt gerrit 38573 0 ovirt-3.5 MERGED getVMList: let api to manage when send short info Never
oVirt gerrit 38679 0 master MERGED API: onlyUUID should affect only short status Never
oVirt gerrit 38709 0 None None None Never
oVirt gerrit 38711 0 None None None Never
oVirt gerrit 38867 0 None None None Never
oVirt gerrit 38868 0 None None None Never

Description Omer Frenkel 2015-02-26 15:58:49 UTC
+++ This bug was initially created as a clone of Bug #1196327 +++

Description of problem:
Somewhere during the transition to JSON-RPC, the short output of getVMList changed to be a simple list of UUIDs.
Unfortunately, this is not what the monitoring code in Engine expects.

From a VERY high-level, monitoring does the following
- short cycle: monitoring calls getVMList, to fetch VM UUID *and* its status.
  if status chenged since last poll, something interesting happened and *then*
  monitoring calls getVMStats(UUID) to learn what happened

- long cycle: monitoring just calls getAllVmStats to get all the informations 
  about VMs.

Currently, short cycle is 3 seconds, long cycle is 15 seconds.
The whole point of this approach is to minimize the traffic and the VDSM load,
while keeping Engine able to respond quickly.

But, if VDSM retuns just the UUID, then Engine cannot know the status, and enters the recovery mode, so calls getVmStats for each VM.

This is practically equivalent to call getAllVmStats() every 3s (and _also_ every 15s), which is very wasteful.

It is important to point out that this affects *only* performance, the stats
are reported correctly so there is no functional impact.


Version-Release number of selected component (if applicable):
found in VDSM master 380713b80d124d1a19749085f477e7658468bf07, but most likely
introduced earlier

How reproducible:
100% with JSON-RPC protocol

Steps to Reproduce:
1. Configure Engine to use JSON-RPC (default)
2. Run a VM
3. snoop the traffic between VDSM and Engine, see VM.getStats() be called too often

Actual results:
On steady state, VM.getStats() get called after each short cycle, for each running VM

Expected results:
On steady state, VM.getStats() is never called

Additional info:
To be verified: Engine patch may be needd. VDSM-only fix may not be enough.

--- Additional comment from Roy Golan on 2015-02-26 12:05:10 IST ---

related to Bug 1196040 ?

--- Additional comment from Francesco Romani on 2015-02-26 12:07:24 IST ---

(In reply to Roy Golan from comment #1)
> related to Bug 1196040 ?

The only link so far is that I discovered this issue while investigating in that area, but so far 1196040 seems just noise in the logs.

Comment 3 Eyal Edri 2015-03-03 14:31:45 UTC
micahl, please remove the rhevm-3.6.0? flag, otherwise this can't be built in distgit.

Comment 6 Michal Skrivanek 2015-03-03 16:13:34 UTC
vdsm side - all in

Comment 10 Eldad Marciano 2015-03-05 13:48:52 UTC
build vt13.12 successfully installed with no crashes or abnormal behavior, and the following benchmark was taken.

build vt13.7 vs build vt13.12 (both engine & host).

use case:
sampling 600sec 1 host 1 vm.


results:

build vt13.7 reproducing the bug.
19 - getAllVmStats
76 - list
76 - getVmStats 

-----------------------------
build vt13.12 fixing the bug.
20 - getAllVmStats
116 - list
0 - getVmStats

Comment 11 Eldad Marciano 2015-03-05 13:51:05 UTC
(In reply to Eldad Marciano from comment #10)
> build vt13.12 successfully installed with no crashes or abnormal behavior,
> and the following benchmark was taken.
> 
> build vt13.7 vs build vt13.12 (both engine & host).
> 
> use case:
> sampling 600sec 1 host 1 vm.
> 
> 
> results:
> 
> build vt13.7 reproducing the bug.
> 19 - getAllVmStats
> 76 - list
> 76 - getVmStats 
> 
> -----------------------------
> build vt13.12 fixing the bug.
> 20 - getAllVmStats
> 116 - list
> 0 - getVmStats



build vt13.12 successfully installed with no crashes or abnormal behavior, and the following method statistics benchmark was taken.

build vt13.7 vs build vt13.12 (both engine & host).

use case:
sampling 600sec 1 host 1 vm.


method invocations count results:

build vt13.7 reproducing the bug.
19 - getAllVmStats
76 - list
76 - getVmStats 

-----------------------------
build vt13.12 fixing the bug.
20 - getAllVmStats
116 - list
0 - getVmStats

Comment 13 Gil Klein 2015-03-05 18:59:08 UTC
Failed QE as this fix has introduce a regression that is now documented in new BZ #1198680

Comment 14 Francesco Romani 2015-03-06 12:18:11 UTC
the bug is now handled by infra

Comment 15 Fabian Deutsch 2015-03-09 10:11:23 UTC
*** Bug 1199387 has been marked as a duplicate of this bug. ***

Comment 17 Gil Klein 2015-03-11 08:12:03 UTC
I've setup an VT13.11 engine with the latest vdsm build VT13.14, that should fix this issue. Now, I don't see the error that we saw when using VT13.12 build (Host moved to "ERROR" state) but I keep seeing this error in vdsm.log:

Thread-964::ERROR::2015-03-11 09:52:44,842::__init__::493::jsonrpc.JsonRpcServer::(_serveRequest) Internal server error
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/yajsonrpc/__init__.py", line 488, in _serveRequest
    res = method(**params)
  File "/usr/share/vdsm/rpc/Bridge.py", line 278, in _dynamicMethod
    ret = retfield(result)
  File "/usr/share/vdsm/rpc/Bridge.py", line 341, in Host_getVMList_Ret
    return [v['vmId'] for v in ret['vmList']]
TypeError: string indices must be integers, not str
Thread-964::DEBUG::2015-03-11 09:52:44,843::stompReactor::163::yajsonrpc.StompServer::(send) Sending response

Comment 18 Gil Klein 2015-03-11 08:49:06 UTC
Failed QA, Moved back to assigned based on comment #17

Comment 19 Eldad Marciano 2015-03-12 15:37:12 UTC
verified.
build vt13.15 applied on engine & vdsm both.

use case:
sampling 500 sec 1 host 1 vm.


method invocations count results:

16 - getAllVmStats
61 - list
0 - getVmStats

Comment 20 Michal Skrivanek 2015-03-16 11:11:52 UTC
reopening due to a problem in MOM which is being fixed in review 38679

Comment 21 Michal Skrivanek 2015-03-16 13:30:31 UTC
see bug 1202360 for MOM change status, this one is kept open for corresponding vdsm spec file change in 38711 (and 38679)

Comment 23 Michal Skrivanek 2015-03-20 07:54:58 UTC
all merged to ovirt-3.5

Comment 24 Eyal Edri 2015-04-01 15:34:28 UTC
this bug was missing from commit msg, probably due to the fact it was merged a while ago.

please verify this is included in the 3.5.2 branch, and move to on_qa.
(build_id: vt14.2)

Comment 25 Eldad Marciano 2015-04-22 09:47:14 UTC
tested on top of vt14.3 , both engine and vdsm.

1 host, 1 vm
duration 600 sec.

profiler analysis results:

0 - getVmStats

Comment 27 errata-xmlrpc 2015-04-28 18:52:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0904.html


Note You need to log in before you can comment on or make changes to this bug.