Bug 1298527

Summary: KeyError in getVmMemoryStats when VMs are migrating
Product: [oVirt] mom                  Reporter: Francesco Romani <fromani>
Component: Core                       Assignee: bugs <bugs>
Status: CLOSED WONTFIX                QA Contact: Shira Maximov <mshira>
Severity: medium                      Docs Contact:
Priority: low
Version: 0.5.1                        CC: bugs, dfediuck, fromani, mgoldboi, michal.skrivanek, mshira, msivak, pkliczew
Target Milestone: ---                 Flags: sbonazzo: ovirt-4.2-
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:                     Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:                             Environment:
Last Closed: 2018-01-17 11:52:01 UTC  Type: Bug
Regression: ---                       Mount Type: ---
Documentation: ---                    CRM:
Verified Versions:                    Category: ---
oVirt Team: SLA                       RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---                  Target Upstream Version:
Embargoed:
Bug Depends On: 1363728
Bug Blocks:

Description Francesco Romani 2016-01-14 11:02:11 UTC
Description of problem:
Noticed while verifying https://bugzilla.redhat.com/show_bug.cgi?id=1296936
Noticed under sustained migration load, but can happen in other cases

If MOM queries Vdsm when a VM has just migrated away, or in general has just died but has not yet been collected by Engine, we can find

2016-01-14 09:40:19,043 - mom.Monitor - ERROR - Unexpected collection error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/mom/Monitor.py", line 95, in collect
    collected = c.collect()
  File "/usr/lib/python2.7/site-packages/mom/Collectors/GuestMemory.py", line 50, in collect
    stat = self.hypervisor_iface.getVmMemoryStats(self.uuid)
  File "/usr/lib/python2.7/site-packages/mom/HypervisorInterfaces/vdsmxmlrpcInterface.py", line 121, in getVmMemoryStats
    usage = int(vm['memUsage'])
KeyError: 'memUsage'

in MOM logs.

This should cause little harm, because the VM is already down and MOM has not much to do here anyway.

It is not Vdsm's fault in this case: if the VM is down, a getVmStats call
must return ExitedVmStats (not RunningVmStats, see the schema), and memUsage is not part of ExitedVmStats.
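
For illustration, a minimal sketch of the two response shapes and a defensive lookup that would avoid the KeyError (the field names below are examples taken from the traceback and patch in this report, not the full Vdsm schema):

# Illustrative only; the real RunningVmStats/ExitedVmStats carry many more fields.
running_vm_stats = {'vmId': '...', 'status': 'Up', 'memUsage': '42'}  # RunningVmStats
exited_vm_stats = {'vmId': '...', 'status': 'Down'}                   # ExitedVmStats: no 'memUsage'

def extract_mem_usage(vm):
    # Return None instead of raising KeyError when the VM has already exited.
    if 'memUsage' not in vm:
        return None
    return int(vm['memUsage'])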

Version-Release number of selected component (if applicable):
# rpm -qi mom
Name        : mom
Version     : 0.5.1
Release     : 2.1.el7
# rpm -qi vdsm
Name        : vdsm
Version     : 4.17.17
Release     : 3.git4517968.el7.centos
(prerelease)

How reproducible:
Timing-dependent (race between MOM's query and Engine collecting the VM)

Steps to Reproduce:
1. start a bunch of VMs
2. migrate or stop/wait/restart VMs, to increase the chance of querying while a VM is down and not yet collected (VM.destroy() from Engine)
3. check MOM logs

Actual results:
KeyError in the log, error not gracefully handled

Expected results:
MOM gracefully handles this case

Additional info:
To fix this, it should be enough to enhance
mom/HypervisorInterfaces/vdsmxmlrpcInterface.py:getVmStats along these lines:

diff --git a/mom/HypervisorInterfaces/vdsmxmlrpcInterface.py b/mom/HypervisorInterfaces/vdsmxmlrpcInterface.py
index 63b6d0d..950a6b6 100644
--- a/mom/HypervisorInterfaces/vdsmxmlrpcInterface.py
+++ b/mom/HypervisorInterfaces/vdsmxmlrpcInterface.py
@@ -93,15 +93,23 @@ class XmlRpcVdsmInterface(HypervisorInterface):
             vms[vm['vmId']] = vm
         return vms
 
-    def getVmStats(self, vmId):
-        return self.getAllVmStats()[vmId]
-
     def _vmIsRunning(self, vm):
         if vm['status'] == 'Up':
             return True
         else:
             return False
 
+    def getVmStats(self, vmId):
+        try:
+            vm = self.getAllVmStats()[vmId]
+        except KeyError as e:
+            raise HypervisorInterfaceError("VM %s does not exist" % vmId)
+        else:
+            if self._vmIsRunning(vm):
+                return vm
+            raise HypervisorInterfaceError(
+                "VM %s not in Up state" % vmId)
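
With a change along these lines the collector would see a HypervisorInterfaceError instead of a raw KeyError. A sketch of the caller side, mirroring the GuestMemory.py line from the traceback above (illustrative only, not the actual MOM collector code):

# Illustrative handling in the collector; the exception name is the one
# used in the patch above.
try:
    stat = self.hypervisor_iface.getVmMemoryStats(self.uuid)
except HypervisorInterfaceError:
    # VM already exited (e.g. migration finished and Engine destroyed it);
    # skip this collection round instead of logging a traceback.
    stat = None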

Comment 1 Sandro Bonazzola 2016-05-02 10:02:53 UTC
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has already been released and the bug is not ON_QA.

Comment 2 Yaniv Lavi 2016-05-23 13:18:07 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 3 Yaniv Lavi 2016-05-23 13:22:08 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 4 Roy Golan 2016-06-01 10:44:02 UTC
Martin, Francesco, if the VM is already DOWN at vdsm, then we just need to ignore it in the collector, if we are not already doing so today.

Comment 5 Martin Sivák 2016-06-01 12:41:04 UTC
Except we won't know that, because the engine will collect the result first sometimes. We need to be able to start listening to the events between vdsm and engine to handle this properly.

Comment 6 Roy Golan 2016-06-29 09:07:15 UTC
(In reply to Martin Sivák from comment #5)
> Except we won't know that, because the engine will collect the result first
> sometimes. We need to be able to start listening to the events between vdsm
> and engine to handle this properly.

+1

Comment 7 Martin Sivák 2016-09-27 15:56:41 UTC
We can only solve this by monitoring the event stream between vdsm and engine, because we have no way of detecting a finished migration or a stopped VM soon enough (the engine calls destroy almost immediately and the status in vdsm is then lost).

Comment 8 Piotr Kliczewski 2016-09-28 06:43:54 UTC
What about querying vdsm source or destination for the vm?

Comment 9 Martin Sivák 2016-09-29 07:59:28 UTC
We are querying the source (localhost). But the VM just suddenly disappears when the migration is finished: the migration finishes, an event goes to the engine, the engine calls destroy, and vdsm removes all traces of the VM.

MOM will never talk to anything other than the local VDSM or libvirt.

Comment 10 Doron Fediuck 2017-07-26 08:59:54 UTC
Is this still relevant when we move to event monitoring?
If so, is it relevant for 4.2?

Comment 11 Martin Sivák 2017-07-26 09:13:48 UTC
The event monitoring would fix it. This bug actually depends on one of the event monitoring bugs.

Comment 12 Michal Skrivanek 2017-07-26 09:17:08 UTC
how exactly would it fix it? can you enumerate the exact flow of actions?

Comment 13 Martin Sivák 2017-07-26 10:05:46 UTC
By actually telling MOM that the VM is not available anymore? We can then use that information to stop requesting data from VDSM.

Comment 14 Michal Skrivanek 2017-07-26 10:13:44 UTC
how does that solve the situation where MOM receives this event while there is already an in-flight getVmMemoryStats request in vdsm?

Comment 15 Martin Sivák 2017-07-26 14:36:35 UTC
This is about the MOM traceback, not vdsm. And the event means the VM finished and there is a result to be collected. It takes time for the VM structure to be destroyed, and that requires another request from the engine first. So it will not break any in-flight requests. And no further requests will be made, since MOM will know that someone is about to call VM.destroy.
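
For illustration only, a rough sketch of that idea (the event shape and field names below are hypothetical, not the actual vdsm event format):

# Hypothetical event-driven bookkeeping, not actual MOM or vdsm code.
vms_going_down = set()

def on_vdsm_event(event):
    # 'vmId' and 'status' are assumed field names for this sketch.
    if event.get('status') == 'Down':
        vms_going_down.add(event['vmId'])

def should_collect(vm_id):
    # Skip VMs known to have finished or migrated away; the engine is
    # about to call VM.destroy() for them.
    return vm_id not in vms_going_down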

Comment 19 Martin Sivák 2018-01-17 11:52:01 UTC
This is just annoying and we have too many other issues for now. Please reopen if it appears too often.