Bug 1318550 - Vm.status() causes crash of MoM GuestManager
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-4.0.0-alpha
Target Release: 4.0.0
Assignee: Francesco Romani
QA Contact: Shira Maximov
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-03-17 08:36 UTC by Roman Hodain
Modified: 2019-10-10 11:38 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, incorrect handling of internal status updates in VDSM caused API calls to fail with unexpected exceptions. This triggered insufficient error handling in MOM, causing the MOM process to crash. Now, the internal status update handling of VDSM has been improved, reducing the impact of MOM crashes. The complete fix requires improvements in MOM, tracked in a different bug.
Clone Of:
Environment:
Last Closed: 2016-08-23 20:19:01 UTC
oVirt Team: Virt
Target Upstream Version:



Links
System | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHEA-2016:1671 | normal | SHIPPED_LIVE | VDSM 4.0 GA bug fix and enhancement update | 2016-09-02 21:32:03 UTC
oVirt gerrit | 54905 | master | MERGED | vm: conf: proper locking in migration path | 2016-04-06 12:01:15 UTC
oVirt gerrit | 54906 | master | MERGED | vm: conf: proper locking in status() | 2016-04-20 09:23:31 UTC
oVirt gerrit | 54907 | master | MERGED | vm: conf: proper locking in onConnect() | 2016-04-20 09:23:54 UTC
oVirt gerrit | 54908 | master | ABANDONED | vm: conf: proper locking in onDisconnect() | 2016-04-19 13:37:34 UTC
oVirt gerrit | 54909 | master | MERGED | Vm: conf: proper locking in the creation path | 2016-04-07 11:00:23 UTC
oVirt gerrit | 54910 | master | ABANDONED | vm: conf: proper locking in setNumberOfCpus() | 2016-07-27 09:09:41 UTC

Description Roman Hodain 2016-03-17 08:36:30 UTC
Description of problem:
Calling status() on an instance of the Vm class raises the following exception:

2016-01-05 23:41:17,725 - mom.GuestManager - ERROR - Guest Manager crashed
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/mom/GuestManager.py", line 114, in run
  File "/usr/lib/python2.6/site-packages/mom/HypervisorInterfaces/vdsmInterface.py", line 75, in getVmList
  File "/usr/share/vdsm/API.py", line 1380, in getVMList
  File "/usr/share/vdsm/API.py", line 1370, in reportedStatus
  File "/usr/share/vdsm/virt/vm.py", line 2817, in status
  File "/usr/share/vdsm/virt/vm.py", line 2817, in <genexpr>
RuntimeError: dictionary changed size during iteration
2016-01-05 23:41:19,843 - mom - ERROR - Thread 'GuestManager' has exited

Version-Release number of selected component (if applicable):
    vdsm-4.16.35-2.el6ev

How reproducible:
    Randomly (race condition)

Steps to Reproduce:
    Most probably by forcing massive live migration

Actual results:
    the mentioned exception

Expected results:
    No exception is raised

Additional info:

In this case the exception breaks MoM, because the GuestManager thread crashes when it is raised.
The issue has been partially fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1298190
but the exception still occurs even with that fix applied.
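
For illustration, the error in the traceback can be reproduced without threads. This is a minimal sketch, not the vdsm code: `conf` stands in for Vm.conf, the keys and the `mutate` hook are made up, and the hook simulates another flow (for example a migration) adding a key while status() is still iterating.

```python
# Illustrative stand-in for Vm.conf; the keys are invented for this sketch.
conf = {"vmId": "1234", "status": "Up", "_srcDomXML": "<domain/>"}


def status_unsafe(conf, mutate=None):
    """Build a status dict by iterating conf directly, without any locking."""
    def items():
        for key, value in conf.items():
            if mutate is not None:
                # Simulates a concurrent writer changing conf mid-iteration.
                mutate(conf)
            if not key.startswith("_"):
                yield key, value
    return dict(items())


def add_key(conf):
    # Hypothetical update performed by another flow.
    conf["migrationProgress"] = 10


try:
    status_unsafe(conf, mutate=add_key)
    error = None
except RuntimeError as exc:
    error = str(exc)  # "dictionary changed size during iteration"
```

In the real process the writer runs in a different thread, which is why the failure only shows up randomly.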

Comment 2 Michal Skrivanek 2016-03-18 08:08:19 UTC
Of course crashes are not nice, but I suppose the impact is less significant in 3.6, as the mom functionality was separated out of vdsm and is now an independent process. There should be little to no impact on the actual stuff mom does.

Francesco, other thoughts?

Comment 3 Francesco Romani 2016-03-18 08:47:57 UTC
(In reply to Michal Skrivanek from comment #2)
> Of course crashes are not nice, but I suppose impact is less significant in
> 3.6 as the mom functionality was separated out of vdsm and is now an
> independent process. There should be little to no impact on the actual stuff
> mom does. 
> 
> Francesco, other thoughts?

Yes, this is how it should work: the MOM side should be able to recover from those API failures and carry on without crashing as it did in 3.5. However, to be sure, I'll need to review the MOM code, specifically the xmlrpc interface.
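
A rough sketch of the kind of recovery described above, assuming a polling loop on the MOM side. This is not the MOM code: `poll_vm_list` and `get_vm_list` are hypothetical names, and falling back to the previous list is only one possible policy.

```python
import logging

log = logging.getLogger("guest-manager-sketch")


def poll_vm_list(get_vm_list, last_known):
    """Return a fresh VM list, or the last known one if the API call fails.

    `get_vm_list` is a hypothetical callable standing in for the
    hypervisor interface call seen in the traceback.
    """
    try:
        return list(get_vm_list())
    except Exception:
        # Log the failure and keep going; the next polling cycle retries.
        log.exception("getVmList failed, reusing previous VM list")
        return last_known
```

With this shape, a transient API failure costs one polling cycle instead of ending the thread.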

Speaking of Vdsm, we can (and will) further improve the handling of Vm.conf.

The problem is that Vm.conf is indeed abused and misused. Too much data is stored there freely by many complex flows. I acknowledge the fix was partial; the problem is that a clean, complete fix would require a large rewrite.
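
As a sketch of the locking approach named in the linked gerrit patches ("proper locking in status()"), assuming a single lock guards every access to conf. `VmSketch`, `_conf_lock` and `update_conf` are hypothetical names, not the real vdsm API.

```python
import threading


class VmSketch(object):
    """Hypothetical stand-in for Vm; not the real vdsm class."""

    def __init__(self, conf):
        self._conf_lock = threading.Lock()
        self.conf = dict(conf)

    def update_conf(self, **changes):
        # Every writer takes the same lock before touching conf.
        with self._conf_lock:
            self.conf.update(changes)

    def status(self):
        # Copy under the lock, then iterate over the private snapshot,
        # so concurrent writers cannot resize the dict being iterated.
        with self._conf_lock:
            snapshot = dict(self.conf)
        return dict((k, v) for k, v in snapshot.items()
                    if not k.startswith("_"))
```

This only helps if all flows that write to conf go through the lock, which is why a complete fix touches many code paths.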

Comment 5 Mike McCune 2016-03-28 22:37:22 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation. Please contact mmccune@redhat.com with any questions.

Comment 6 Francesco Romani 2016-04-20 10:03:12 UTC
The remaining patches are either unhelpful for this issue or not directly related, so moving to MODIFIED now.

Comment 7 Francesco Romani 2016-05-09 06:47:38 UTC
Not sure this needs doc_text. Added just in case.

Comment 9 Shira Maximov 2016-05-30 08:09:50 UTC
Verified on:
oVirt Engine Version: 4.0.0-0.7.master.el7ev


Verification steps:

Create a massive migration between hosts:
1. Create a pool of 100 VMs and place all 100 VMs on the same host
(each VM has 8 GB of memory, loaded to 90% memory usage).
2. Migrate all 100 VMs.
3. Check the MOM logs.
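
The log check in step 3 could be scripted along these lines. This is a sketch only: the two marker strings are taken from the traceback in the description, the function name is made up, and the location of the MOM log file is left to the caller because it depends on the installation.

```python
def find_guest_manager_crashes(log_lines):
    """Return the log lines that report a GuestManager crash or exit."""
    markers = (
        "Guest Manager crashed",
        "Thread 'GuestManager' has exited",
    )
    return [
        line.rstrip("\n")
        for line in log_lines
        if any(marker in line for marker in markers)
    ]
```

Pass it an open log file (or any iterable of lines); an empty result after the migration means neither marker was logged.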

Comment 11 errata-xmlrpc 2016-08-23 20:19:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1671.html

