Description of problem:
Noticed while verifying https://bugzilla.redhat.com/show_bug.cgi?id=1296936
Noticed under sustained migration load, but it can happen in other cases.

If MOM queries Vdsm when a VM has just migrated, or in general has just died but has not yet been collected by Engine, we could find

2016-01-14 09:40:19,043 - mom.Monitor - ERROR - Unexpected collection error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/mom/Monitor.py", line 95, in collect
    collected = c.collect()
  File "/usr/lib/python2.7/site-packages/mom/Collectors/GuestMemory.py", line 50, in collect
    stat = self.hypervisor_iface.getVmMemoryStats(self.uuid)
  File "/usr/lib/python2.7/site-packages/mom/HypervisorInterfaces/vdsmxmlrpcInterface.py", line 121, in getVmMemoryStats
    usage = int(vm['memUsage'])
KeyError: 'memUsage'

in the MOM logs.

This should cause little harm because the VM is already down, so MOM has not much to do here.
It is not Vdsm's fault in this case: if the VM is down, a getVmStats call must return ExitedVmStats (not RunningVmStats, see the schema), and memUsage is not part of ExitedVmStats.

Version-Release number of selected component (if applicable):
# rpm -qi mom
Name    : mom
Version : 0.5.1
Release : 2.1.el7

# rpm -qi vdsm
Name    : vdsm
Version : 4.17.17
Release : 3.git4517968.el7.centos (prerelease)

How reproducible:
Time-based

Steps to Reproduce:
1. Start a bunch of VMs.
2. Migrate or stop/wait/restart the VMs, to increase the chance of a query while a VM is down and not yet collected (VM.destroy() from Engine).
3. Check the MOM logs.

Actual results:
KeyError in the log; the error is not handled gracefully.

Expected results:
MOM handles this case gracefully.

Additional info:
To fix, it should be enough to enhance mom/HypervisorInterfaces/vdsmxmlrpcInterface.py:getVmStats like

diff --git a/mom/HypervisorInterfaces/vdsmxmlrpcInterface.py b/mom/HypervisorInterfaces/vdsmxmlrpcInterface.py
index 63b6d0d..950a6b6 100644
--- a/mom/HypervisorInterfaces/vdsmxmlrpcInterface.py
+++ b/mom/HypervisorInterfaces/vdsmxmlrpcInterface.py
@@ -93,15 +93,23 @@ class XmlRpcVdsmInterface(HypervisorInterface):
             vms[vm['vmId']] = vm
         return vms
 
-    def getVmStats(self, vmId):
-        return self.getAllVmStats()[vmId]
-
     def _vmIsRunning(self, vm):
         if vm['status'] == 'Up':
             return True
         else:
             return False
 
+    def getVmStats(self, vmId):
+        try:
+            vm = self.getAllVmStats()[vmId]
+        except KeyError as e:
+            raise HypervisorInterfaceError("VM %s does not exist" % vmId)
+        else:
+            if self._vmIsRunning(vm):
+                return vm
+            raise HypervisorInterfaceError(
+                "VM %s not in Up state" % vmId)
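To make the failure mode concrete, here is a minimal, self-contained sketch (not MOM code; the dict layouts and the helper name are illustrative only) of how a caller can tell the two stat shapes apart instead of hitting the KeyError:

# Minimal sketch, assuming the two getVmStats shapes described above:
# RunningVmStats carries 'memUsage', ExitedVmStats does not.

class HypervisorInterfaceError(Exception):
    """Stand-in for MOM's exception of the same name (see the patch above)."""

def vm_memory_usage(vm_stats):
    # Only a VM reported as Up has RunningVmStats with 'memUsage'.
    if vm_stats.get('status') != 'Up' or 'memUsage' not in vm_stats:
        raise HypervisorInterfaceError(
            "VM %s not in Up state" % vm_stats.get('vmId'))
    return int(vm_stats['memUsage'])

running = {'vmId': 'a1b2', 'status': 'Up', 'memUsage': '421'}
exited = {'vmId': 'a1b2', 'status': 'Down', 'exitCode': 0}  # no 'memUsage'

print(vm_memory_usage(running))   # 421
try:
    vm_memory_usage(exited)
except HypervisorInterfaceError as err:
    print(err)                    # handled gracefully instead of a KeyError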
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has already been released and the bug is not ON_QA.
oVirt 4.0 beta has been released, moving to RC milestone.
Martin, Francesco, if the VM is already DOWN at vdsm, then we just need to ignore it in the collector, if we are not already doing so today.
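A hedged sketch of what "ignore it in the collector" could look like; the collector shape (hypervisor_iface, uuid, collect()) mirrors the GuestMemory collector from the traceback, and it assumes getVmMemoryStats raises HypervisorInterfaceError for a down VM as in the patch above. This is an assumption, not actual MOM code:

# Sketch only: a GuestMemory-like collector that skips a VM that is already
# down instead of letting the error surface as "Unexpected collection error".

class HypervisorInterfaceError(Exception):
    """Stand-in for MOM's exception of the same name."""

class GuestMemorySketch(object):
    def __init__(self, hypervisor_iface, uuid):
        self.hypervisor_iface = hypervisor_iface
        self.uuid = uuid

    def collect(self):
        try:
            return self.hypervisor_iface.getVmMemoryStats(self.uuid)
        except HypervisorInterfaceError:
            # The VM is down or gone; report no data for this round.
            return {}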
Except we won't know that, because the engine will collect the result first sometimes. We need to be able to start listening to the events between vdsm and engine to handle this properly.
(In reply to Martin Sivák from comment #5)
> Except we won't know that, because the engine will collect the result first
> sometimes. We need to be able to start listening to the events between vdsm
> and engine to handle this properly.

+1
We can only solve this by monitoring the event stream between vdsm and the engine, because we have no way of detecting a finished migration or a stopped VM soon enough (the engine calls destroy almost immediately and the status in vdsm is then lost).
What about querying the source or destination vdsm for the VM?
We are querying the source (localhost). But the VM just suddenly disappears when the migration is finished: migration finishes, an event goes to the engine, the engine calls destroy, and vdsm removes all traces of the VM. MOM will never talk to anything other than the local VDSM or libvirt.
Is this still relevant when we move to event monitoring? If so, is it relevant for 4.2?
The event monitoring would fix it. This bug actually depends on one of the event monitoring bugs.
how exactly would it fix it? can you enumerate the exact flow of actions?
By actually telling MOM that the VM is not available anymore? We can then use that information to stop requesting data from VDSM.
How does that solve the situation where MOM receives this event while there is already an in-flight getVmMemoryStats request in vdsm?
This is about the MOM traceback, not vdsm. The event means the VM finished and there is a result to be collected. It takes time for the VM structure to be destroyed, and that requires another request from the engine first, so it will not break any in-flight requests. And no further requests will be made, since MOM will know that someone is about to call VM.destroy.
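As a hedged illustration of the flow described in this comment (the class, method, and event names below are hypothetical, not an existing MOM or vdsm API): when the "VM finished" event arrives, the VM is marked as terminating so no new stat requests are issued for it, while any request already in flight simply completes and returns its result.

# Illustrative sketch only; MomGuestTracker and the status strings are
# hypothetical placeholders for whatever the event monitoring exposes.
import threading

class MomGuestTracker(object):
    def __init__(self):
        self._lock = threading.Lock()
        self._terminating = set()

    def on_vm_status_event(self, vm_id, status):
        # Called when an event on the vdsm-engine channel reports that the
        # VM finished (migration done / powered off) and awaits VM.destroy().
        if status in ('Down', 'MigrationDone'):
            with self._lock:
                self._terminating.add(vm_id)

    def should_collect(self, vm_id):
        # Collectors check this before issuing a new stats request; requests
        # already in flight are unaffected and just return their result.
        with self._lock:
            return vm_id not in self._terminating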
This is just annoying and we have too many other issues for now. Please reopen if it appears too often.