+++ This bug was initially created as a clone of Bug #1196327 +++

Description of problem:
Somewhere during the transition to JSON-RPC, the short output of getVMList changed to be a simple list of UUIDs. Unfortunately, this is not what the monitoring code in Engine expects.

From a very high level, monitoring does the following:
- short cycle: monitoring calls getVMList to fetch each VM UUID *and* its status. If the status changed since the last poll, something interesting happened, and only *then* does monitoring call getVmStats(UUID) to learn what happened.
- long cycle: monitoring just calls getAllVmStats to get all the information about VMs.

Currently, the short cycle is 3 seconds and the long cycle is 15 seconds. The whole point of this approach is to minimize traffic and VDSM load, while keeping Engine able to respond quickly.

But if VDSM returns just the UUID, then Engine cannot know the status, so it enters recovery mode and calls getVmStats for each VM. This is practically equivalent to calling getAllVmStats() every 3s (and _also_ every 15s), which is very wasteful.

It is important to point out that this affects *only* performance; the stats are reported correctly, so there is no functional impact.

Version-Release number of selected component (if applicable):
Found in VDSM master 380713b80d124d1a19749085f477e7658468bf07, but most likely introduced earlier.

How reproducible:
100% with the JSON-RPC protocol

Steps to Reproduce:
1. Configure Engine to use JSON-RPC (default)
2. Run a VM
3. Snoop the traffic between VDSM and Engine; see VM.getStats() being called too often

Actual results:
In steady state, VM.getStats() gets called after each short cycle, for each running VM.

Expected results:
In steady state, VM.getStats() is never called.

Additional info:
To be verified: an Engine patch may be needed; a VDSM-only fix may not be enough.

--- Additional comment from Roy Golan on 2015-02-26 12:05:10 IST ---

related to Bug 1196040 ?
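The two-cycle scheme above can be sketched in a few lines. This is a minimal, hypothetical illustration (FakeVdsm, short_cycle, and the dict shapes are assumptions, not Engine's or VDSM's actual code); the point is that when getVMList reports UUID *and* status, getVmStats is needed only when a status actually changes:

```python
# Hypothetical sketch of the short monitoring cycle described above.
# FakeVdsm stands in for the real VDSM API; names are illustrative only.

class FakeVdsm:
    def __init__(self, vms):
        self.vms = vms                     # {uuid: status}
        self.stats_calls = 0               # how often full stats were fetched

    def getVMList(self):
        # Healthy reply: UUID *and* status per VM (the pre-regression shape).
        return [{'vmId': u, 'status': s} for u, s in self.vms.items()]

    def getVmStats(self, uuid):
        self.stats_calls += 1
        return {'vmId': uuid, 'status': self.vms[uuid]}

def short_cycle(vdsm, last_status):
    """Poll UUID+status; call getVmStats only when a status changed."""
    for vm in vdsm.getVMList():
        uuid, status = vm['vmId'], vm['status']
        if last_status.get(uuid) != status:
            vdsm.getVmStats(uuid)          # something interesting happened
        last_status[uuid] = status

vdsm = FakeVdsm({'vm-1': 'Up'})
seen = {}
short_cycle(vdsm, seen)   # first poll: status unknown, one getVmStats call
short_cycle(vdsm, seen)   # steady state: status unchanged, no call
```

If getVMList returned bare UUIDs instead, last_status could never match, and every short cycle would degrade into one getVmStats call per VM, which is the waste described in this bug.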
--- Additional comment from Francesco Romani on 2015-02-26 12:07:24 IST ---

(In reply to Roy Golan from comment #1)
> related to Bug 1196040 ?

The only link so far is that I discovered this issue while investigating in that area, but so far 1196040 seems just noise in the logs.
micahl, please remove the rhevm-3.6.0? flag, otherwise this can't be built in distgit.
vdsm side - all in
Build vt13.12 successfully installed with no crashes or abnormal behavior, and the following benchmark was taken.

Build vt13.7 vs build vt13.12 (both engine & host).

Use case: sampling 600 sec, 1 host, 1 vm.

Results:

Build vt13.7, reproducing the bug:
19 - getAllVmStats
76 - list
76 - getVmStats
-----------------------------
Build vt13.12, fixing the bug:
20 - getAllVmStats
116 - list
0 - getVmStats
(In reply to Eldad Marciano from comment #10)
> build vt13.12 successfully installed with no crashes or abnormal behavior,
> and the following benchmark was taken.
> [...]

Build vt13.12 successfully installed with no crashes or abnormal behavior, and the following method statistics benchmark was taken.

Build vt13.7 vs build vt13.12 (both engine & host).

Use case: sampling 600 sec, 1 host, 1 vm.

Method invocation count results:

Build vt13.7, reproducing the bug:
19 - getAllVmStats
76 - list
76 - getVmStats
-----------------------------
Build vt13.12, fixing the bug:
20 - getAllVmStats
116 - list
0 - getVmStats
Failed QE, as this fix has introduced a regression that is now documented in new BZ #1198680.
the bug is now handled by infra
*** Bug 1199387 has been marked as a duplicate of this bug. ***
I've set up a VT13.11 engine with the latest VDSM build VT13.14, which should fix this issue. Now I don't see the error that we saw when using the VT13.12 build (host moved to "ERROR" state), but I keep seeing this error in vdsm.log:

Thread-964::ERROR::2015-03-11 09:52:44,842::__init__::493::jsonrpc.JsonRpcServer::(_serveRequest) Internal server error
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/yajsonrpc/__init__.py", line 488, in _serveRequest
    res = method(**params)
  File "/usr/share/vdsm/rpc/Bridge.py", line 278, in _dynamicMethod
    ret = retfield(result)
  File "/usr/share/vdsm/rpc/Bridge.py", line 341, in Host_getVMList_Ret
    return [v['vmId'] for v in ret['vmList']]
TypeError: string indices must be integers, not str
Thread-964::DEBUG::2015-03-11 09:52:44,843::stompReactor::163::yajsonrpc.StompServer::(send) Sending response
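The traceback points at Bridge.py indexing each vmList entry as a dict while this reply carries bare UUID strings. A minimal reproduction of that mismatch (a sketch under that assumption, not VDSM's actual code path):

```python
# Hypothetical reproduction of the TypeError above: the bridge's return
# marshaller expects a list of dicts carrying 'vmId', but a UUID-only
# getVMList reply hands it plain strings.

def Host_getVMList_Ret(ret):
    # Mirrors the failing line from Bridge.py in the traceback.
    return [v['vmId'] for v in ret['vmList']]

# Dict entries, as the bridge expects: works.
ok = Host_getVMList_Ret({'vmList': [{'vmId': 'abc-123'}]})

# Bare UUID strings: 'abc-123'['vmId'] indexes a str with a str key,
# which is exactly the "string indices must be integers" TypeError.
try:
    Host_getVMList_Ret({'vmList': ['abc-123']})
    raised = False
except TypeError:
    raised = True
```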
Failed QA; moved back to ASSIGNED based on comment #17.
Verified. Build vt13.15 applied on both engine & VDSM.

Use case: sampling 500 sec, 1 host, 1 vm.

Method invocation count results:
16 - getAllVmStats
61 - list
0 - getVmStats
Reopening due to a problem in MOM, which is being fixed in review 38679.
See bug 1202360 for the MOM change status; this one is kept open for the corresponding VDSM spec file change in 38711 (and 38679).
all merged to ovirt-3.5
This bug was missing from the commit message, probably because it was merged a while ago. Please verify this is included in the 3.5.2 branch, and move to ON_QA. (build_id: vt14.2)
Tested on top of vt14.3, both engine and VDSM. 1 host, 1 vm, duration 600 sec.

Profiler analysis results:
0 - getVmStats
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0904.html