Description of problem: Not sure what the root cause or effect is, just saw it when analyzing unrelated ovirt-system-tests (http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/901/artifact/exported-artifacts/basic-suite-master__logs/test_logs/basic-suite-master/post-004_basic_sanity.py/lago-basic-suite-master-host0/_var_log/vdsm/vdsm.log ): 2017-06-05 13:43:07,746-0400 WARN (periodic/3) [virt.periodic.VmDispatcher] could not run <class 'vdsm.virt.periodic.UpdateVolumes'> on [u'b18cf579-5d89-4bbe-96e0-396f0155408c'] (periodic:295) 2017-06-05 13:43:07,761-0400 ERROR (periodic/0) [virt.periodic.Operation] <vdsm.virt.sampling.VMBulkstatsMonitor object at 0x3155490> operation failed (periodic:200) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 198, in __call__ self._func() File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 513, in __call__ self._send_metrics() File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 529, in _send_metrics vm_sample.interval) File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 44, in produce balloon(vm, stats, last_sample) File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 167, in balloon balloon_cur = sample['balloon.current'] KeyError: 'balloon.current'
the VM was paused just before the err due to outgoing migration. I think this case is not covered, many of those stats are not present when the VM is paused. Perhaps just extend the _skip_if_missing_stats check?
Created attachment 1285707 [details] Another balloon.current KeyError Please also look at the attached log snippet, when the error occurs on VM creation. It's from my development Vdsm version, so I can't 100% guarantee it's not induced by something else, but the VM has been happily running from its start to successful migration. So I think it's worth checking whether a contingent fix covers this case as well.
From the provided snippets, it seems to me another benign instance of stats not present (aka, we do much noise in the logs for something not important). Now, the problem with a generic solution is that we cannot distinguish from the outside if the event of missing information is benign or not. It is benign in well known circumstances, like VM starting up, shutting down or migrating away; at steady state, however, we expect this data to be present, so it is strange and suspicious if the data isn't there. This is the reason why our handling of missing data is inconsistent and cumbersome. However, I believe so far we framed the issue in a wrong way. Lack of data reported from libvirt is never actionable. We lack data, we cannot do anything about that, so there no point in bubbling up the error. The best thing we could do is concisely complain in debug mode, and silently pass through in each and other case.
Setting it for 4.2, we can backport if needed, I guess (unsure we need to)
patches merged on master, and we didn't branch out 4.2 branch, so MODIFIED.
Hey, I tried to reproduce the bug on : Software version:4.2.0-0.0.master.20170901193740.git7900511.el7.centos I couldn't reproduce this bug, since the error was caused by race condition. Steps: 1. Configured cluster and VM with memory balloon 2. Migrate the VM back and forth several times 3. Changed the log mode of VDSM to DEBUG to catch the error. Yaniv, I couldn't reproduce the bug, can you reproduce the error on your environment, in DEBUG mode?
(In reply to Shira Maximov from comment #6) > Hey, > > I tried to reproduce the bug on : > Software version:4.2.0-0.0.master.20170901193740.git7900511.el7.centos > > I couldn't reproduce this bug, since the error was caused by race condition. > > Steps: > 1. Configured cluster and VM with memory balloon > 2. Migrate the VM back and forth several times > 3. Changed the log mode of VDSM to DEBUG to catch the error. > > > Yaniv, I couldn't reproduce the bug, can you reproduce the error on your > environment, in DEBUG mode? It's oVirt system tests, not my environment. I can check if I see it.
Doesn't seem to happen anymore - closing as CURRENTRELEASE.
*** Bug 1467696 has been marked as a duplicate of this bug. ***