Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1318550

Summary: Vm.status() causes crash of MoM GuestManager
Product: Red Hat Enterprise Virtualization Manager
Reporter: Roman Hodain <rhodain>
Component: vdsm
Assignee: Francesco Romani <fromani>
Status: CLOSED ERRATA
QA Contact: Shira Maximov <mshira>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.5.7
CC: bazulay, fromani, gklein, lsurette, mavital, michal.skrivanek, srevivo, trichard, ycui, ykaul
Target Milestone: ovirt-4.0.0-alpha
Target Release: 4.0.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, incorrect handling of internal status updates in VDSM caused API calls to fail with unexpected exceptions. This triggered insufficient error handling in MOM, causing the MOM process to crash. Now, the internal status update handling of VDSM has been improved, reducing the impact of MOM crashes. The complete fix requires improvements in MOM, tracked in a different bug.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-08-23 20:19:01 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Roman Hodain 2016-03-17 08:36:30 UTC
Description of problem:
Calling status() on an instance of class Vm raises the following exception:

2016-01-05 23:41:17,725 - mom.GuestManager - ERROR - Guest Manager crashed
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/mom/GuestManager.py", line 114, in run
  File "/usr/lib/python2.6/site-packages/mom/HypervisorInterfaces/vdsmInterface.py", line 75, in getVmList
  File "/usr/share/vdsm/API.py", line 1380, in getVMList
  File "/usr/share/vdsm/API.py", line 1370, in reportedStatus
  File "/usr/share/vdsm/virt/vm.py", line 2817, in status
  File "/usr/share/vdsm/virt/vm.py", line 2817, in <genexpr>
RuntimeError: dictionary changed size during iteration
2016-01-05 23:41:19,843 - mom - ERROR - Thread 'GuestManager' has exited

Version-Release number of selected component (if applicable):
    vdsm-4.16.35-2.el6ev

How reproducible:
    Randomly (race condition)

Steps to Reproduce:
    Most probably by forcing massive live migration

Actual results:
    the mentioned exception

Expected results:
    The exception should not be raised.

Additional info:

In this case the exception breaks MoM, because the GuestManager thread crashes when it is raised.
The issue was partially fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1298190,
but the exception still occurs with that fix applied.
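
For illustration only: the RuntimeError in the traceback is what Python raises when keys are added to or removed from a dict while it is being iterated. The sketch below reproduces that deterministically and shows one possible mitigation (iterating a snapshot taken under a lock). The Vm class here is a hypothetical stand-in, not the VDSM code, and the names _conf_lock, update_conf, status_unsafe, status_safe and on_item are assumptions made for the example.

```python
import threading


class Vm(object):
    """Hypothetical stand-in for the VDSM Vm class; not the real code."""

    def __init__(self):
        self.conf = {"vmId": "1234", "status": "Up"}
        self._conf_lock = threading.Lock()  # assumed name

    def update_conf(self, key, value):
        # All writers go through the lock.
        with self._conf_lock:
            self.conf[key] = value

    def status_unsafe(self, on_item=None):
        # Iterates the live dict: an insert or delete made while the
        # loop is running makes the next iteration step raise RuntimeError.
        result = {}
        for key, value in self.conf.items():
            if on_item is not None:
                on_item()  # simulates another thread writing mid-iteration
            result[key] = value
        return result

    def status_safe(self, on_item=None):
        # Take a shallow snapshot under the lock, then iterate the snapshot.
        with self._conf_lock:
            snapshot = dict(self.conf)
        result = {}
        for key, value in snapshot.items():
            if on_item is not None:
                on_item()  # writes now touch self.conf, not the snapshot
            result[key] = value
        return result
```

Passing a callback that calls update_conf() with a new key makes status_unsafe() fail in the same way as the traceback, while status_safe() returns the values that were present when the snapshot was taken. Whether a snapshot, a lock, or a redesign of Vm.conf is the right fix for VDSM is not something this sketch decides.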

Comment 2 Michal Skrivanek 2016-03-18 08:08:19 UTC
Of course crashes are not nice, but I suppose the impact is less significant in 3.6, as the mom functionality was separated out of vdsm and is now an independent process. There should be little to no impact on the actual stuff mom does.

Francesco, other thoughts?

Comment 3 Francesco Romani 2016-03-18 08:47:57 UTC
(In reply to Michal Skrivanek from comment #2)
> Of course crashes are not nice, but I suppose impact is less significant in
> 3.6 as the mom functionality was separated out of vdsm and is now an
> independent process. There should be little to no impact on the actual stuff
> mom does. 
> 
> Francesco, other thoughts?

Yes, this is how it should work: the MOM side should be able to recover from those API failures and carry on, instead of crashing as it did in 3.5. However, to be sure I'll need to review the MOM code, specifically the xmlrpc interface.

Speaking of Vdsm, we can (and will) further improve the handling of Vm.conf.

The problem is that Vm.conf is indeed abused and misused. Too much data is stored there freely by many complex flows. I acknowledge the fix was partial; the problem is that a clean, complete fix would require a large rewrite.
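
As a rough sketch of what "recover from those API failures" could look like on the caller side: wrap the VM list query, log the failure, and keep the previous result until the next polling cycle. This is not MOM's actual code; poll_once, get_vm_list and previous are hypothetical names, and catching every Exception is a deliberate simplification.

```python
import logging

logger = logging.getLogger("mom.GuestManager")


def poll_once(get_vm_list, previous):
    """Return a fresh VM list, or the previous one if the query fails.

    get_vm_list is any zero-argument callable that queries the hypervisor
    interface (hypothetical; stands in for a getVmList-style call).
    """
    try:
        return get_vm_list()
    except Exception:
        # Log the traceback and carry on; the next cycle will retry.
        logger.exception("VM list query failed, keeping previous list")
        return previous
```

With this shape, a transient RuntimeError from the API costs one polling cycle rather than the whole GuestManager thread.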

Comment 5 Mike McCune 2016-03-28 22:37:22 UTC
This bug was accidentally moved from POST to MODIFIED by an error in automation. Please see mmccune with any questions.

Comment 6 Francesco Romani 2016-04-20 10:03:12 UTC
The remaining patches are either unhelpful for this issue or not directly related, so moving to MODIFIED now.

Comment 7 Francesco Romani 2016-05-09 06:47:38 UTC
Not sure this needs doc_text. Added just in case.

Comment 9 Shira Maximov 2016-05-30 08:09:50 UTC
Verified on:
oVirt Engine Version: 4.0.0-0.7.master.el7ev

Verification steps:

Trigger a mass migration between hosts:
1. Create a pool of 100 VMs and put all 100 VMs on the same host
(each VM has 8GB of memory and was loaded to 90% of its memory).
2. Migrate all 100 VMs.
3. Check the MOM logs.
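
The log check in step 3 could be scripted along these lines. The two patterns are taken from the log lines quoted in the description; find_crashes is a hypothetical helper name, and the location of the MOM log file is left to the caller because it is not stated in this report.

```python
import re

# Messages copied from the crash log in the bug description.
CRASH_PATTERN = re.compile(
    r"mom\.GuestManager - ERROR - Guest Manager crashed"
    r"|mom - ERROR - Thread 'GuestManager' has exited"
)


def find_crashes(lines):
    """Return the log lines that report a GuestManager crash."""
    return [line.rstrip("\n") for line in lines if CRASH_PATTERN.search(line)]
```

A possible usage is to open the MOM log of the source host after the migration and pass the file object to find_crashes(); an empty result means neither message was logged.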

Comment 11 errata-xmlrpc 2016-08-23 20:19:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1671.html