Bug 1104030

Summary: Failed VM migrations do not release VM resource lock properly leading to failures in subsequent migration attempts
Product: Red Hat Enterprise Virtualization Manager Reporter: Aval <avyadav>
Component: ovirt-engine Assignee: Arik <ahadas>
Status: CLOSED ERRATA QA Contact: Nisim Simsolo <nsimsolo>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.3.0 CC: iheim, lpeer, mavital, michal.skrivanek, rbalakri, Rhev-m-bugs, sherold, vgaikwad, yeylon
Target Milestone: --- Keywords: ZStream
Target Release: 3.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: virt
Fixed In Version: ovirt-engine-3.5.0_beta Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1114587 Environment:
Last Closed: 2015-02-11 18:03:16 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1114587, 1142923, 1156165    
Attachments:
Description Flags
engine.log none

Description Aval 2014-06-03 06:40:27 UTC
Description of problem: After a VM migration thread acquires the resource lock and then fails for some reason, it does not release the lock properly. Later, when other threads try to perform operations such as migrate or Run VM, they fail while acquiring the lock with the error below:

~~~
2014-06-01 18:15:53,807 INFO  [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (DefaultQuartzScheduler_Worker-3) [36055ac9] Failed to Acquire Lock to object EngineLock [exclusiveLocks= key: 40abfe32-96de-44a7-abd9-e77ecd2bec7b value: VM
, sharedLocks= ]
2014-06-01 18:15:53,808 WARN  [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (DefaultQuartzScheduler_Worker-3) [36055ac9] CanDoAction of action InternalMigrateVm failed. Reasons:VAR__ACTION__MIGRATE,VAR__TYPE__VM,ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED,$VmName zabbix-190
~~~


Version-Release number of selected component (if applicable):

rhevm-3.3.3-0.52.el6ev.noarch  

How reproducible:

A couple of customers reported that VM migrations kept failing while putting a host into maintenance, and that afterwards they were also unable to start or stop VMs.

Steps to Reproduce:
1.
2.
3.

Actual results:
After the first failure, subsequent attempts to run or migrate the VM fail while acquiring the lock.

Expected results:
After the first failure, subsequent attempts to run or migrate the VM should be able to acquire the resource lock.

Additional info:

After restarting the ovirt-engine service, one of the customers was able to start and migrate VMs.

Comment 2 Aval 2014-06-03 07:18:38 UTC
Created attachment 901619
engine.log

Adding engine.log from one of the customers facing this problem.

Comment 3 Arik 2014-06-08 12:38:53 UTC
This bug exposed several issues:

1. The first migrations failed not because the VMs remained locked, but because of a bug that caused 'switch host to maintenance' to be re-triggered too soon, so in the retry attempt the migrations failed since the VMs were already locked. This one is solved by http://gerrit.ovirt.org/#/c/28403

2. NPEs in the migrate operations. These exceptions, raised in migrations that were triggered by the 'switch to maintenance' operation, were already solved by: http://gerrit.ovirt.org/#/c/24639

3. The migrate transaction was aborted, so the migrate operation failed (on 2014-06-01 18:05:47,045). This will not happen anymore, as the migrate operation is no longer transactional.

4. Eventually the VMs remain locked. This happens once the maximum number of retries to migrate the VM is reached (and the migration fails). It was fixed in 3.4. Patch for 3.3: http://gerrit.ovirt.org/#/c/28460
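
To illustrate issue 4, here is a minimal Python sketch of the general pattern: release the per-VM exclusive lock on every exit path, including the one where the retry limit is reached. This is not the engine code or the actual patch. `InMemoryLockManager`, `MigrationFailed`, `migrate_with_retries`, and the `max_retries` default are hypothetical names and values invented for the example, and the real migration flow is asynchronous, whereas this sketch is synchronous for brevity.

```python
class InMemoryLockManager:
    """Hypothetical stand-in for a lock manager: one exclusive lock per VM id."""

    def __init__(self):
        self._exclusive = set()

    def acquire(self, vm_id):
        # Refuse the lock if another operation already holds it, which is
        # the situation behind the "Failed to Acquire Lock" log message.
        if vm_id in self._exclusive:
            return False
        self._exclusive.add(vm_id)
        return True

    def release(self, vm_id):
        self._exclusive.discard(vm_id)


class MigrationFailed(Exception):
    """Raised when the VM is locked or the retry limit is reached."""


def migrate_with_retries(locks, vm_id, attempt_migration, max_retries=3):
    """Try to migrate a VM, always releasing its lock before returning."""
    if not locks.acquire(vm_id):
        raise MigrationFailed(f"VM {vm_id} is already locked")
    try:
        for _ in range(max_retries):
            if attempt_migration(vm_id):
                return True
        raise MigrationFailed(f"VM {vm_id}: gave up after {max_retries} attempts")
    finally:
        # Runs on success, on exhausted retries, and on unexpected
        # exceptions, so a failed migration cannot leave the VM locked.
        locks.release(vm_id)


if __name__ == "__main__":
    locks = InMemoryLockManager()
    vm = "40abfe32-96de-44a7-abd9-e77ecd2bec7b"
    try:
        # Simulate a migration that fails on every attempt.
        migrate_with_retries(locks, vm, lambda _vm: False)
    except MigrationFailed as exc:
        print("migration failed:", exc)
    # A later run/migrate request can take the lock again.
    print("lock available afterwards:", locks.acquire(vm))
```

Without the `finally` clause, the exhausted-retries path would exit with the VM id still in the lock set, and every later run or migrate request for that VM would be rejected until the process restarts, which matches the symptom described in this bug.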

Comment 7 Nisim Simsolo 2014-10-26 12:27:26 UTC
Fixed. Verified using the following builds:
rhevm-3.5.0-0.17.beta.el6ev.noarch
libvirt-0.10.2-46.el6.x86_64
vdsm-4.16.7.1-1.el6ev.x86_64
sanlock-2.8-1.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.448.el6.x86_64

Comment 9 errata-xmlrpc 2015-02-11 18:03:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0158.html