Bug 1458901 - Error: vdsm.virt.sampling.VMBulkstatsMonitor object at 0x3155490> operation failed (KeyError: 'balloon.current')
Summary: Error: vdsm.virt.sampling.VMBulkstatsMonitor object at 0x3155490> operation f...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.19.20
Hardware: Unspecified
OS: Unspecified
medium
low
Target Milestone: ovirt-4.2.0
: 4.20.9.1
Assignee: Francesco Romani
QA Contact: Shira Maximov
URL:
Whiteboard:
: 1467696 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-06-05 19:03 UTC by Yaniv Kaul
Modified: 2017-12-13 07:04 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-09-12 09:52:05 UTC
oVirt Team: Virt
Embargoed:
rule-engine: ovirt-4.2+


Attachments (Terms of Use)
Another balloon.current KeyError (26.77 KB, application/x-xz)
2017-06-07 08:57 UTC, Milan Zamazal
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 77995 0 master MERGED virt: vmstats: never raise if stat is missing. 2020-12-17 17:00:31 UTC
oVirt gerrit 77996 0 master MERGED virt: vmstats: use dedicated logger 2020-12-17 17:00:01 UTC

Description Yaniv Kaul 2017-06-05 19:03:25 UTC
Description of problem:
Not sure what the root cause or effect is, just saw it when analyzing unrelated ovirt-system-tests (http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/901/artifact/exported-artifacts/basic-suite-master__logs/test_logs/basic-suite-master/post-004_basic_sanity.py/lago-basic-suite-master-host0/_var_log/vdsm/vdsm.log ):

2017-06-05 13:43:07,746-0400 WARN  (periodic/3) [virt.periodic.VmDispatcher] could not run <class 'vdsm.virt.periodic.UpdateVolumes'> on [u'b18cf579-5d89-4bbe-96e0-396f0155408c'] (periodic:295)
2017-06-05 13:43:07,761-0400 ERROR (periodic/0) [virt.periodic.Operation] <vdsm.virt.sampling.VMBulkstatsMonitor object at 0x3155490> operation failed (periodic:200)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 198, in __call__
    self._func()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 513, in __call__
    self._send_metrics()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 529, in _send_metrics
    vm_sample.interval)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 44, in produce
    balloon(vm, stats, last_sample)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmstats.py", line 167, in balloon
    balloon_cur = sample['balloon.current']
KeyError: 'balloon.current'

Comment 1 Michal Skrivanek 2017-06-07 08:28:41 UTC
the VM was paused just before the err due to outgoing migration. I think this case is not covered, many of those stats are not present when the VM is paused. Perhaps just extend the _skip_if_missing_stats check?

Comment 2 Milan Zamazal 2017-06-07 08:57:43 UTC
Created attachment 1285707 [details]
Another balloon.current KeyError

Please also look at the attached log snippet, when the error occurs on VM creation. It's from my development Vdsm version, so I can't 100% guarantee it's not induced by something else, but the VM has been happily running from its start to successful migration. So I think it's worth checking whether a contingent fix covers this case as well.

Comment 3 Francesco Romani 2017-06-08 07:15:22 UTC
From the provided snippets, it seems to me another benign instance of stats not present (aka, we do much noise in the logs for something not important).

Now, the problem with a generic solution is that we cannot distinguish from the outside if the event of missing information is benign or not.
It is benign in well known circumstances, like VM starting up, shutting down or migrating away; at steady state, however, we expect this data to be present, so it is strange and suspicious if the data isn't there.

This is the reason why our handling of missing data is inconsistent and cumbersome.
However, I believe so far we framed the issue in a wrong way.

Lack of data reported from libvirt is never actionable. We lack data, we cannot do anything about that, so there no point in bubbling up the error. The best thing we could do is concisely complain in debug mode, and silently pass through in each and other case.

Comment 4 Yaniv Kaul 2017-06-08 13:05:59 UTC
Setting it for 4.2, we can backport if needed, I guess (unsure we need to)

Comment 5 Francesco Romani 2017-06-27 13:43:50 UTC
patches merged on master, and we didn't branch out 4.2 branch, so MODIFIED.

Comment 6 Shira Maximov 2017-09-05 11:10:22 UTC
Hey, 

I tried to reproduce the bug on : 
Software version:4.2.0-0.0.master.20170901193740.git7900511.el7.centos

I couldn't reproduce this bug,  since the error was caused by race condition.

Steps:
1. Configured cluster and VM with memory balloon
2. Migrate the VM back and forth several times
3. Changed the log mode of VDSM to DEBUG to catch the error.


Yaniv, I couldn't reproduce the bug, can you reproduce the error on your environment, in DEBUG mode?

Comment 7 Yaniv Kaul 2017-09-05 11:47:57 UTC
(In reply to Shira Maximov from comment #6)
> Hey, 
> 
> I tried to reproduce the bug on : 
> Software version:4.2.0-0.0.master.20170901193740.git7900511.el7.centos
> 
> I couldn't reproduce this bug,  since the error was caused by race condition.
> 
> Steps:
> 1. Configured cluster and VM with memory balloon
> 2. Migrate the VM back and forth several times
> 3. Changed the log mode of VDSM to DEBUG to catch the error.
> 
> 
> Yaniv, I couldn't reproduce the bug, can you reproduce the error on your
> environment, in DEBUG mode?

It's oVirt system tests, not my environment. I can check if I see it.

Comment 8 Yaniv Kaul 2017-09-12 09:52:05 UTC
Doesn't seem to happen anymore - closing as CURRENTRELEASE.

Comment 9 Tomas Jelinek 2017-11-24 14:53:09 UTC
*** Bug 1467696 has been marked as a duplicate of this bug. ***


Note You need to log in before you can comment on or make changes to this bug.