Bug 1318550 - Vm.status() causes crash of MoM GuestManager
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-4.0.0-alpha
Target Release: 4.0.0
Assigned To: Francesco Romani
QA Contact: Shira Maximov
Depends On:
Blocks:
Reported: 2016-03-17 04:36 EDT by Roman Hodain
Modified: 2016-08-23 16:19 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, incorrect handling of internal status updates in VDSM caused API calls to fail with unexpected exceptions. This triggered insufficient error handling in MOM, causing the MOM process to crash. Now, the internal status update handling of VDSM has been improved, reducing the impact of MOM crashes. The complete fix requires improvements in MOM, tracked in a different bug.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-08-23 16:19:01 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
oVirt gerrit | 54905 | master | MERGED | vm: conf: proper locking in migration path | 2016-04-06 08:01 EDT
oVirt gerrit | 54906 | master | MERGED | vm: conf: proper locking in status() | 2016-04-20 05:23 EDT
oVirt gerrit | 54907 | master | MERGED | vm: conf: proper locking in onConnect() | 2016-04-20 05:23 EDT
oVirt gerrit | 54908 | master | ABANDONED | vm: conf: proper locking in onDisconnect() | 2016-04-19 09:37 EDT
oVirt gerrit | 54909 | master | MERGED | Vm: conf: proper locking in the creation path | 2016-04-07 07:00 EDT
oVirt gerrit | 54910 | master | ABANDONED | vm: conf: proper locking in setNumberOfCpus() | 2016-07-27 05:09 EDT
Red Hat Product Errata | RHEA-2016:1671 | normal | SHIPPED_LIVE | VDSM 4.0 GA bug fix and enhancement update | 2016-09-02 17:32:03 EDT

Description Roman Hodain 2016-03-17 04:36:30 EDT
Description of problem:
Calling status() on an instance of class Vm(object) throws the following exception:

2016-01-05 23:41:17,725 - mom.GuestManager - ERROR - Guest Manager crashed
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/mom/GuestManager.py", line 114, in run
  File "/usr/lib/python2.6/site-packages/mom/HypervisorInterfaces/vdsmInterface.py", line 75, in getVmList
  File "/usr/share/vdsm/API.py", line 1380, in getVMList
  File "/usr/share/vdsm/API.py", line 1370, in reportedStatus
  File "/usr/share/vdsm/virt/vm.py", line 2817, in status
  File "/usr/share/vdsm/virt/vm.py", line 2817, in <genexpr>
RuntimeError: dictionary changed size during iteration
2016-01-05 23:41:19,843 - mom - ERROR - Thread 'GuestManager' has exited

Version-Release number of selected component (if applicable):
    vdsm-4.16.35-2.el6ev

How reproducible:
    Randomly (race condition)

Steps to Reproduce:
    Most probably by forcing massive live migration

Actual results:
    the mentioned exception

Expected results:
    The exception should not be raised

Additional info:

In this case the exception breaks MoM, because the GuestManager thread crashes when it is raised.
The issue has been partially fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1298190
but the exception still occurs even with that fix applied.
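
For illustration only: the RuntimeError in the traceback is what Python raises when a dict changes size while it is being iterated. The sketch below is not the vdsm code; it triggers the same error in a single thread by adding a key inside the loop, which stands in for a second thread updating the dict during status(). The function names and the dict contents are hypothetical.

```python
def iterate_while_growing(conf):
    """Iterate a dict while adding a key, as a concurrent writer might."""
    seen = []
    for key in conf:
        seen.append(key)
        # Hypothetical stand-in for another thread adding an entry.
        conf["extra"] = True
    return seen


def reproduce():
    """Return the error message raised by the unsafe iteration, or None."""
    conf = {"vmId": "hypothetical-id", "status": "Up"}
    try:
        iterate_while_growing(conf)
    except RuntimeError as exc:
        return str(exc)
    return None


print(reproduce())  # prints: dictionary changed size during iteration
```

In the real flow the size change comes from another thread, which is why the failure shows up only randomly.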
Comment 2 Michal Skrivanek 2016-03-18 04:08:19 EDT
Of course crashes are not nice, but I suppose the impact is less significant in 3.6, as the MOM functionality was separated out of vdsm and is now an independent process. There should be little to no impact on the actual stuff MOM does.

Francesco, other thoughts?
Comment 3 Francesco Romani 2016-03-18 04:47:57 EDT
(In reply to Michal Skrivanek from comment #2)
> Of course crashes are not nice, but I suppose impact is less significant in
> 3.6 as the mom functionality was separated out of vdsm and is now an
> independent process. There should be little to no impact on the actual stuff
> mom does. 
> 
> Francesco, other thoughts?

Yes, this is how it should work: the MOM side should be able to recover from those API failures and carry on, instead of crashing as it did in 3.5. However, to be sure I'll need to review the MOM code, specifically the xmlrpc interface.
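
A minimal sketch of that kind of recovery, not taken from the MOM source: the polling step catches a failed API call, logs it, and keeps the last known VM list so the thread survives until the next cycle. The names poll_vm_list, fetch and previous are hypothetical.

```python
import logging

logger = logging.getLogger("guest-poller-sketch")


def poll_vm_list(fetch, previous):
    """Return the result of fetch(), or the previous list if the call fails."""
    try:
        return fetch()
    except Exception:
        # Log the traceback and keep the last known list; the next polling
        # cycle simply retries instead of letting the thread die.
        logger.exception("VM list query failed; keeping previous list")
        return previous
```

Catching the broad Exception here is a deliberate choice for a long-running polling thread; a real implementation might narrow it to the errors the API is known to raise.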

Speaking of Vdsm, we can (and will) further improve the handling of Vm.conf.

The problem is that Vm.conf is indeed abused and misused: too much data is stored there freely by many complex flows. I acknowledge the fix was partial; a clean, complete fix would require a large rewrite.
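
As a rough sketch of the locking idea (the class below is hypothetical and much simpler than Vm.conf): every writer and every reader takes the same lock, and readers get a copy, so no caller ever iterates the live dict while it is changing.

```python
import threading


class GuardedConf:
    """Hypothetical dict wrapper: all access goes through one lock."""

    def __init__(self, initial=None):
        self._lock = threading.Lock()
        self._data = dict(initial or {})

    def update(self, **changes):
        # Writers hold the lock while the dict changes size.
        with self._lock:
            self._data.update(changes)

    def snapshot(self):
        # Readers copy under the same lock and iterate the copy afterwards.
        with self._lock:
            return dict(self._data)
```

The cost is one shallow copy per read, and the approach only helps if every flow that touches the dict goes through the lock.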
Comment 5 Mike McCune 2016-03-28 18:37:22 EDT
This bug was accidentally moved from POST to MODIFIED via an error in automation. Please contact mmccune@redhat.com with any questions.
Comment 6 Francesco Romani 2016-04-20 06:03:12 EDT
The remaining patches are either unhelpful for this issue or not directly related, so moving now to MODIFIED.
Comment 7 Francesco Romani 2016-05-09 02:47:38 EDT
Not sure this needs doc_text. Added just in case.
Comment 9 Shira Maximov 2016-05-30 04:09:50 EDT
Verified on:
oVirt Engine Version: 4.0.0-0.7.master.el7ev

Verification steps:

Trigger a massive migration between hosts:
1. Create a pool of 100 VMs and place all 100 VMs on the same host
(each VM has 8GB of memory, loaded to 90% memory usage).
2. Migrate all 100 VMs.
3. Check the MOM logs.
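
The log check in step 3 could be scripted roughly as follows. This is a sketch under assumptions: it only looks for the two error messages quoted in the description, and the helper name is hypothetical. The log file location is not stated in this report, so the function takes any iterable of lines.

```python
def find_guest_manager_crashes(lines):
    """Return log lines that report a GuestManager crash or thread exit."""
    # Marker strings copied from the log excerpt in the bug description.
    markers = ("Guest Manager crashed", "Thread 'GuestManager' has exited")
    return [
        line.rstrip("\n")
        for line in lines
        if any(marker in line for marker in markers)
    ]


sample = [
    "2016-01-05 23:41:17,725 - mom.GuestManager - ERROR - Guest Manager crashed\n",
    "RuntimeError: dictionary changed size during iteration\n",
    "2016-01-05 23:41:19,843 - mom - ERROR - Thread 'GuestManager' has exited\n",
]
print(len(find_guest_manager_crashes(sample)))  # prints: 2
```

To scan a real log, pass an open file object for the MOM log on the host; an empty result after the migration means neither message was logged.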
Comment 11 errata-xmlrpc 2016-08-23 16:19:01 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1671.html
