Bug 1073478

Summary: VDSM internal exception gathering balloon info while shutting down
Product: Red Hat Enterprise Virtualization Manager
Component: vdsm
Version: 3.4.0
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Ilanit Stein <istein>
Assignee: Francesco Romani <fromani>
QA Contact: Ilanit Stein <istein>
Docs Contact:
CC: acathrow, adahms, bazulay, fromani, iheim, istein, jentrena, lpeer, mavital, michal.skrivanek, ofrenkel, Rhev-m-bugs, sherold, yeylon
Target Milestone: ---
Target Release: 3.5.0
Keywords: Reopened, Triaged, ZStream
Hardware: Unspecified
OS: Unspecified
Whiteboard: virt
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, the state of suspended virtual machines was incorrectly displayed as 'Down' under certain conditions. This was caused by the logic used to poll the status of virtual machines, whereby an uncaught exception would cause the status of virtual machines to be reported incorrectly. Now, this logic has been revised so that the status of virtual machines is reported correctly.
Story Points: ---
Clone Of:
: 1098148 (view as bug list)
Environment:
Last Closed: 2014-06-09 13:29:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1098148, 1142923, 1156165
Attachments:
engine.log (flags: none)
vdsm.log (flags: none)
libvirt log (flags: none)
VM qemu log (flags: none)

Description Ilanit Stein 2014-03-06 13:52:45 UTC
Created attachment 871465 [details]
engine.log

Description of problem:
Suspend a VM (RHEL 6.5), shut it down, run it again, then suspend it again: the VM ended up in the Down state.

Version-Release number of selected component (if applicable):
av2

Expected results:
VM should be suspended

Additional info:
I didn't reproduce it.

Events (1st suspend 13:24, 2nd suspend 13:26):
2014-Mar-06, 13:26
VM downtime is down. Exit message: SaveState succeeded

2014-Mar-06, 13:26
User admin@internal got disconnected from VM downtime.

2014-Mar-06, 13:26
Suspending VM downtime was initiated by User admin (Host: lilach-vdsb.tlv.redhat.com).

2014-Mar-06, 13:25
User admin@internal is connected to VM downtime.

2014-Mar-06, 13:25
user admin initiated console session for VM downtime

2014-Mar-06, 13:25
VM downtime started on Host lilach-vdsb.tlv.redhat.com

2014-Mar-06, 13:24
user admin initiated console session for VM downtime

2014-Mar-06, 13:24
VM downtime was started by admin (Host: lilach-vdsb.tlv.redhat.com).

2014-Mar-06, 13:24
Suspended VM downtime powered off by admin.

2014-Mar-06, 13:24
VM downtime on Host lilach-vdsb.tlv.redhat.com is suspended.

2014-Mar-06, 13:23
Suspending VM downtime was initiated by User admin (Host: lilach-vdsb.tlv.redhat.com).

Comment 1 Ilanit Stein 2014-03-06 13:53:28 UTC
Created attachment 871466 [details]
vdsm.log

Comment 2 Michal Skrivanek 2014-03-09 12:12:14 UTC
Do you still have the vdsm+qemu logs? They would be helpful.

Comment 3 Michal Skrivanek 2014-03-09 12:12:29 UTC
I mean libvirt logs

Comment 4 Ilanit Stein 2014-03-09 13:50:11 UTC
Created attachment 872409 [details]
libvirt log

Comment 5 Ilanit Stein 2014-03-09 13:50:39 UTC
Created attachment 872410 [details]
VM qemu log

Comment 6 Omer Frenkel 2014-03-10 12:03:18 UTC
The VM moves to Down because an exception in getVmStats is not handled well in the engine, so the VM is mistakenly considered down. This is documented and fixed in bug 1072282.
We still need to understand why VDSM fails in getVmStats, though, so this is not a duplicate (in bug 1072282 the exception is due to network issues). A sketch of a possible guard follows the traceback below.

Thread-105985::ERROR::2014-03-06 13:26:40,972::BindingXMLRPC::1072::vds::(wrapper) libvirt error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 1065, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 414, in getAllVmStats
    response = self.vmGetStats(s['vmId'])
  File "/usr/share/vdsm/BindingXMLRPC.py", line 405, in vmGetStats
    return vm.getStats()
  File "/usr/share/vdsm/API.py", line 357, in getStats
    stats = v.getStats().copy()
  File "/usr/share/vdsm/vm.py", line 2750, in getStats
    stats = self._getStatsInternal()
  File "/usr/share/vdsm/vm.py", line 2850, in _getStatsInternal
    stats['balloonInfo'] = self._getBalloonInfo()
  File "/usr/share/vdsm/vm.py", line 4644, in _getBalloonInfo
    cur_mem = self._dom.info()[2]
  File "/usr/share/vdsm/vm.py", line 859, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 92, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1874, in info
    if ret is None: raise libvirtError ('virDomainGetInfo() failed', dom=self)
libvirtError: Domain not found: no domain with matching uuid '534197a4-a88d-4378-8081-7b789b2edc65'
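
For illustration only, here is a minimal sketch of the kind of extra check later suggested in comment 7: tolerate a domain that has already disappeared (for example because the VM is shutting down) when gathering balloon info, instead of letting the libvirtError propagate as an internal error. The function name, return keys and structure are hypothetical and only mirror the shapes visible in the traceback above; this is not the actual VDSM patch.

# Hypothetical sketch, not the actual VDSM fix: guard a libvirt
# domain-info call against the domain vanishing mid-poll.
import libvirt

def get_balloon_info(dom, configured_mem_kb):
    """Return balloon stats, tolerating a domain that is already gone."""
    try:
        # dom.info() returns [state, maxMem, memory, nrVirtCpu, cpuTime]
        cur_mem = dom.info()[2]
    except libvirt.libvirtError as e:
        if e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
            # The VM went away between the engine poll and this call;
            # fall back to the configured value instead of raising.
            return {'balloon_max': configured_mem_kb,
                    'balloon_cur': configured_mem_kb}
        raise
    return {'balloon_max': configured_mem_kb,
            'balloon_cur': cur_mem}

A guard of this kind at the stats-gathering level would make the getAllVmStats call degrade gracefully instead of surfacing a "libvirt error" to the engine.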

Comment 7 Francesco Romani 2014-03-17 15:48:35 UTC
may be a consequence of commit 8fedf8bde3c28edb07add23c3e9b72681cb48e49, and a race between libvirt notification and sampling threads. The fact this is hard to reproduce seems to confirm that. Adding some extra checks in the sampling threads won't hurt.

Comment 8 Francesco Romani 2014-03-17 16:03:02 UTC
(In reply to Francesco Romani from comment #7)
> may be a consequence of commit 8fedf8bde3c28edb07add23c3e9b72681cb48e49, and
> a race between libvirt notification and sampling threads. The fact this is
> hard to reproduce seems to confirm that. Adding some extra checks in the
> sampling threads won't hurt.


Sorry, not sampling threads but engine polling. Still a race, but a different one :)

Comment 9 Michal Skrivanek 2014-03-31 09:24:46 UTC
*** Bug 1070649 has been marked as a duplicate of this bug. ***

Comment 10 Francesco Romani 2014-04-14 08:59:12 UTC
was: Suspend VM ends up with VM in down status

If a getVmStats call fails inside VDSM for whatever reason, older engines are confused (fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1072282) and mistakenly think the VM is down.

This bug is about how to properly handle the failure inside VDSM, as explained in
https://bugzilla.redhat.com/show_bug.cgi?id=1073478#c6
https://bugzilla.redhat.com/show_bug.cgi?id=1073478#c7
https://bugzilla.redhat.com/show_bug.cgi?id=1073478#c8

Comment 12 Ilanit Stein 2014-05-01 07:07:49 UTC
As the original flow reported in this bug is not reproducible, how should this bug be verified?

Comment 13 Francesco Romani 2014-05-06 07:43:24 UTC
Being a timing issue, it is indeed very hard to reproduce.

The only answer I can offer is to try to recreate the race by issuing the getVmStats command VERY frequently (something like 10 times per second), for example with a shell script, and to shut down a VM in the meantime (see the sketch at the end of this comment).

Hopefully in a few tries you can get an error like the one being fixed.

Can't think yet of a better or simpler way to reproduce this issue.
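
For what it's worth, here is a rough Python equivalent of the polling script described above. It assumes the vdsClient tool is available on the host and that its getVmStats verb takes the VM UUID; adjust the invocation to your environment. This is only a reproduction aid, not part of the fix.

# Rough reproduction aid (assumption: vdsClient is installed on the host
# and exposes a "getVmStats <vmId>" verb; adjust to your setup).
import subprocess
import sys
import time

vm_id = sys.argv[1]  # UUID of the VM that is about to be shut down

while True:
    # Poll roughly 10 times per second while the VM is shut down from
    # the engine, then check vdsm.log for the traceback from comment 6.
    subprocess.call(['vdsClient', '-s', '0', 'getVmStats', vm_id])
    time.sleep(0.1)

Start it just before shutting the VM down from the webadmin and stop it once the VM is reported Down.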

Comment 14 Ilanit Stein 2014-05-13 13:02:35 UTC
Closing the bug since it can't be reproduced.

Comment 15 Michal Skrivanek 2014-05-15 10:24:14 UTC
*** Bug 1070649 has been marked as a duplicate of this bug. ***

Comment 16 Michal Skrivanek 2014-05-15 10:26:14 UTC
(In reply to Ilanit Stein from comment #14)
Please do not close the bug if you can't reproduce it after the fix. That means it should be VERIFIED or you should say you can't verify it…but that doesn't mean the fix is invalid - on the contrary, it indicates it seems to work.

Comment 17 Michal Skrivanek 2014-05-15 11:48:26 UTC
Unfortunately it's too late to fix this in 3.4 GA, so retargeting to 3.4.1. It's important to get this fixed.

Comment 19 errata-xmlrpc 2014-06-09 13:29:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0504.html