Red Hat Bugzilla – Bug 1298527
KeyError in getVmMemoryStats when VMs are migrating
Last modified: 2017-11-22 03:46:52 EST
Description of problem:
Noticed while verifying https://bugzilla.redhat.com/show_bug.cgi?id=1296936
Noticed under sustained migration load, but can happen in other cases
If MOM queries Vdsm when a VM just migrated, or in general just died but it is not yet collected by Engine, we could find
2016-01-14 09:40:19,043 - mom.Monitor - ERROR - Unexpected collection error
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/mom/Monitor.py", line 95, in collect
collected = c.collect()
File "/usr/lib/python2.7/site-packages/mom/Collectors/GuestMemory.py", line 50, in collect
stat = self.hypervisor_iface.getVmMemoryStats(self.uuid)
File "/usr/lib/python2.7/site-packages/mom/HypervisorInterfaces/vdsmxmlrpcInterface.py", line 121, in getVmMemoryStats
usage = int(vm['memUsage'])
in MOM logs.
This should be little harm because the VM is already down, so MOM has not much to do here.
It is not Vdsm fault in this case because if Vm is down, a getVmStats call
must return ExitedVmStats (not RunningVmStats, see schema), and memUsage is not part of ExitedVmStats.
Version-Release number of selected component (if applicable):
# rpm -qi mom
Name : mom
Version : 0.5.1
Release : 2.1.el7
# rpm -qi vdsm
Name : vdsm
Version : 4.17.17
Release : 3.git4517968.el7.centos
Steps to Reproduce:
1. start a bunch of VMs
2. migrate or stop/wait/restart VMs, to increse the chance of query while VM is down and not collected (VM.destroy() from Engine)
3. check MOM logs
KeyError in the log, error not gracefully handled
MOM gracefully handles this case
To fix, it should be enough to enhance
diff --git a/mom/HypervisorInterfaces/vdsmxmlrpcInterface.py b/mom/HypervisorInterfaces/vdsmxmlrpcInterface.py
index 63b6d0d..950a6b6 100644
@@ -93,15 +93,23 @@ class XmlRpcVdsmInterface(HypervisorInterface):
vms[vm['vmId']] = vm
- def getVmStats(self, vmId):
- return self.getAllVmStats()[vmId]
def _vmIsRunning(self, vm):
if vm['status'] == 'Up':
+ def getVmStats(self, vmId):
+ vm = self.getAllVmStats()[vmId]
+ except KeyError as e:
+ raise HypervisorInterfaceError("VM %s does not exist" % uuid)
+ if self._vmIsRunning(vm):
+ return vm
+ raise HypervisorInterfaceError(
+ "VM %s not in Up state" % uuid)
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has been already released and bug is not ON_QA.
oVirt 4.0 beta has been released, moving to RC milestone.
Martin, Francesco, if the the VM is already DOWN at vdsm then we just need to ignore it in the collector if we are not doing so today.
Except we won't know that, because the engine will collect the result first sometimes. We need to be able to start listening to the events between vdsm and engine to handle this properly.
(In reply to Martin Sivák from comment #5)
> Except we won't know that, because the engine will collect the result first
> sometimes. We need to be able to start listening to the events between vdsm
> and engine to handle this properly.
We can only solve this by monitoring the event stream between vdsm - engine, because we have no way of detecting a finished migration or stopped VM soon enough (the engine calls destroy almost immediately and the status in vdsm is then lost).
What about querying vdsm source or destination for the vm?
We are querying the source (localhost). But the VM just suddenly disappears when the migration is finished - migration finished, event to engine, engine calls destroy, vdsm removes all traces about the vm.
MOM will never talk to anybody else than the local VDSM or libvirt.
Is this still relevant when we move to event monitoring?
If so, is it relevant for 4.2?
The event monitoring would fix it. This bug actually depends on one of the event monitoring bugs.
how exactly would it fix it? can you enumerate the exact flow of actions?
By actually telling MOM that the VM is not available anymore? We can then use that information to not request data from VDSM..
how does that solve a situation of MOM receiving this event while there is an inflight getVmMemoryStats request in vdsm already?
This is about MOM traceback not vdsm. And the event means the VM finished and there is a result to be collected. It takes time for the VM structure to be destroyed and that is another request that needs to be done from the engine first. So it will not break any in-flight requests. And not other requests will be done since MOM will know that someone is about to call VM.destroy.