Bug 1458901

Summary: Error: vdsm.virt.sampling.VMBulkstatsMonitor object at 0x3155490> operation failed (KeyError: 'balloon.current')
Product: [oVirt] vdsm Reporter: Yaniv Kaul <ykaul>
Component: CoreAssignee: Francesco Romani <fromani>
Status: CLOSED CURRENTRELEASE QA Contact: Shira Maximov <mshira>
Severity: low Docs Contact:
Priority: medium    
Version: 4.19.20CC: bugs, fromani, michal.skrivanek, mzamazal, vyerys, ykaul
Target Milestone: ovirt-4.2.0Flags: rule-engine: ovirt-4.2+
Target Release: 4.20.9.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-09-12 09:52:05 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Another balloon.current KeyError none

Description Yaniv Kaul 2017-06-05 19:03:25 UTC
Description of problem:
Not sure what the root cause or effect is, just saw it when analyzing unrelated ovirt-system-tests (http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/901/artifact/exported-artifacts/basic-suite-master__logs/test_logs/basic-suite-master/post-004_basic_sanity.py/lago-basic-suite-master-host0/_var_log/vdsm/vdsm.log ):

2017-06-05 13:43:07,746-0400 WARN  (periodic/3) [virt.periodic.VmDispatcher] could not run <class 'vdsm.virt.periodic.UpdateVolumes'> on [u'b18cf579-5d89-4bbe-96e0-396f0155408c'] (periodic:295)
2017-06-05 13:43:07,761-0400 ERROR (periodic/0) [virt.periodic.Operation] <vdsm.virt.sampling.VMBulkstatsMonitor object at 0x3155490> operation failed (periodic:200)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 198, in __call__
    self._func()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 513, in __call__
    self._send_metrics()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 529, in _send_metrics
    vm_sample.interval)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 44, in produce
    balloon(vm, stats, last_sample)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 167, in balloon
    balloon_cur = sample['balloon.current']
KeyError: 'balloon.current'

Comment 1 Michal Skrivanek 2017-06-07 08:28:41 UTC
the VM was paused just before the err due to outgoing migration. I think this case is not covered, many of those stats are not present when the VM is paused. Perhaps just extend the _skip_if_missing_stats check?

Comment 2 Milan Zamazal 2017-06-07 08:57:43 UTC
Created attachment 1285707 [details]
Another balloon.current KeyError

Please also look at the attached log snippet, when the error occurs on VM creation. It's from my development Vdsm version, so I can't 100% guarantee it's not induced by something else, but the VM has been happily running from its start to successful migration. So I think it's worth checking whether a contingent fix covers this case as well.

Comment 3 Francesco Romani 2017-06-08 07:15:22 UTC
From the provided snippets, it seems to me another benign instance of stats not present (aka, we do much noise in the logs for something not important).

Now, the problem with a generic solution is that we cannot distinguish from the outside if the event of missing information is benign or not.
It is benign in well known circumstances, like VM starting up, shutting down or migrating away; at steady state, however, we expect this data to be present, so it is strange and suspicious if the data isn't there.

This is the reason why our handling of missing data is inconsistent and cumbersome.
However, I believe so far we framed the issue in a wrong way.

Lack of data reported from libvirt is never actionable. We lack data, we cannot do anything about that, so there no point in bubbling up the error. The best thing we could do is concisely complain in debug mode, and silently pass through in each and other case.

Comment 4 Yaniv Kaul 2017-06-08 13:05:59 UTC
Setting it for 4.2, we can backport if needed, I guess (unsure we need to)

Comment 5 Francesco Romani 2017-06-27 13:43:50 UTC
patches merged on master, and we didn't branch out 4.2 branch, so MODIFIED.

Comment 6 Shira Maximov 2017-09-05 11:10:22 UTC
Hey, 

I tried to reproduce the bug on : 
Software version:4.2.0-0.0.master.20170901193740.git7900511.el7.centos

I couldn't reproduce this bug,  since the error was caused by race condition.

Steps:
1. Configured cluster and VM with memory balloon
2. Migrate the VM back and forth several times
3. Changed the log mode of VDSM to DEBUG to catch the error.


Yaniv, I couldn't reproduce the bug, can you reproduce the error on your environment, in DEBUG mode?

Comment 7 Yaniv Kaul 2017-09-05 11:47:57 UTC
(In reply to Shira Maximov from comment #6)
> Hey, 
> 
> I tried to reproduce the bug on : 
> Software version:4.2.0-0.0.master.20170901193740.git7900511.el7.centos
> 
> I couldn't reproduce this bug,  since the error was caused by race condition.
> 
> Steps:
> 1. Configured cluster and VM with memory balloon
> 2. Migrate the VM back and forth several times
> 3. Changed the log mode of VDSM to DEBUG to catch the error.
> 
> 
> Yaniv, I couldn't reproduce the bug, can you reproduce the error on your
> environment, in DEBUG mode?

It's oVirt system tests, not my environment. I can check if I see it.

Comment 8 Yaniv Kaul 2017-09-12 09:52:05 UTC
Doesn't seem to happen anymore - closing as CURRENTRELEASE.

Comment 9 Tomas Jelinek 2017-11-24 14:53:09 UTC
*** Bug 1467696 has been marked as a duplicate of this bug. ***