1104030 – Failed VM migrations do not release VM resource lock properly leading to failures in subsequent migration attempts

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	3.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	3.5.0
Assignee:	Arik
QA Contact:	Nisim Simsolo
Docs Contact:
URL:
Whiteboard:	virt
Depends On:
Blocks:	1114587 rhev3.5beta 1156165

Reported:	2014-06-03 06:40 UTC by Aval
Modified:	2019-04-28 09:37 UTC
CC List:	9 users
Fixed In Version:	ovirt-engine-3.5.0_beta
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1114587
Environment:
Last Closed:	2015-02-11 18:03:16 UTC
oVirt Team:	---
Target Upstream Version:
Embargoed:

Attachments
engine.log (870.63 KB, text/x-log) 2014-06-03 07:18 UTC, Aval	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2015:0158	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Virtualization Manager 3.5.0	2015-02-11 22:38:50 UTC
oVirt gerrit	28403	0	master	MERGED	core: fix reattempt to go to maintenance mechanism	2020-03-02 23:14:00 UTC

Description Aval 2014-06-03 06:40:27 UTC

Description of problem: After a VM migration thread acquires the VM resource lock and then fails for some reason, it does not release the lock properly. Later, when other threads try to perform operations such as migrating or running the VM, they fail to acquire the lock with the error below:

~~~
2014-06-01 18:15:53,807 INFO  [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (DefaultQuartzScheduler_Worker-3) [36055ac9] Failed to Acquire Lock to object EngineLock [exclusiveLocks= key: 40abfe32-96de-44a7-abd9-e77ecd2bec7b value: VM
, sharedLocks= ]
2014-06-01 18:15:53,808 WARN  [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (DefaultQuartzScheduler_Worker-3) [36055ac9] CanDoAction of action InternalMigrateVm failed. Reasons:VAR__ACTION__MIGRATE,VAR__TYPE__VM,ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED,$VmName zabbix-190
~~~
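To make the failure mode concrete, here is a minimal Java sketch of the pattern (ovirt-engine itself is written in Java). `VmLockManager` and `MigrateVmSketch` are hypothetical names and are not oVirt's actual lock manager or `InternalMigrateVmCommand`; in the real engine the lock is released asynchronously when the migration ends. The sketch only assumes, as the log above suggests, that the lock is exclusive per VM ID and that acquisition fails fast instead of waiting. The leaky variant returns early on failure without releasing the lock, so every later attempt is rejected; the safe variant frees it in a `finally` block.

~~~java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical exclusive lock manager keyed by VM ID (not oVirt's LockManager). */
final class VmLockManager {
    private final Set<String> lockedVmIds = ConcurrentHashMap.newKeySet();

    /** Fails fast: returns false if the VM is already locked. */
    boolean tryAcquire(String vmId) {
        return lockedVmIds.add(vmId);
    }

    void release(String vmId) {
        lockedVmIds.remove(vmId);
    }

    boolean isLocked(String vmId) {
        return lockedVmIds.contains(vmId);
    }
}

/** Hypothetical migrate command contrasting leaky and safe lock handling. */
final class MigrateVmSketch {
    private final VmLockManager locks;

    MigrateVmSketch(VmLockManager locks) {
        this.locks = locks;
    }

    /** Buggy pattern: the lock is released only on the success path. */
    boolean migrateLeaky(String vmId, boolean migrationFails) {
        if (!locks.tryAcquire(vmId)) {
            return false; // analogous to ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED
        }
        if (migrationFails) {
            return false; // early return: the lock is never released
        }
        locks.release(vmId);
        return true;
    }

    /** Safe pattern: finally releases the lock on every exit path. */
    boolean migrateSafe(String vmId, boolean migrationFails) {
        if (!locks.tryAcquire(vmId)) {
            return false;
        }
        try {
            if (migrationFails) {
                throw new IllegalStateException("simulated migration failure");
            }
            return true;
        } catch (IllegalStateException e) {
            return false;
        } finally {
            locks.release(vmId);
        }
    }
}
~~~

With the leaky variant, one failed `migrateLeaky("vm-1", true)` leaves `vm-1` locked and every later attempt returns `false`, which mirrors the reported symptom; with `migrateSafe`, a failed attempt leaves the VM unlocked and the next attempt can proceed.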


Version-Release number of selected component (if applicable):

rhevm-3.3.3-0.52.el6ev.noarch  

How reproducible:

A couple of customers reported that VM migrations kept failing while putting a host into maintenance, and afterwards they were not able to start/stop VMs either.

Steps to Reproduce:
1.
2.
3.

Actual results:
After the first failure, subsequent attempts to run/migrate the VM fail while acquiring locks.

Expected results:
After the first failure, subsequent attempts to run/migrate the VM should acquire the resource locks.

Additional info:

After restarting the ovirt-engine service, one of the customers was able to start/migrate VMs.

Comment 2 Aval 2014-06-03 07:18:38 UTC

Created attachment 901619 [details]
engine.log

Adding engine.log from one of the customers facing this problem.

Comment 3 Arik 2014-06-08 12:38:53 UTC

This bug exposed several issues:

1. The first migrations failed not because the VMs remained locked, but because of a bug that caused 'switch host to maintenance' to be re-triggered too soon, so on the retry attempt the migrations failed because the VMs were still locked by the previous attempt. This one is solved by http://gerrit.ovirt.org/#/c/28403

2. NPEs (NullPointerExceptions) in the migrate operations. These exceptions, in migrations triggered by the 'switch to maintenance' operation, were already solved by: http://gerrit.ovirt.org/#/c/24639

3. The migrate transaction was aborted, so the migrate operation failed (at 2014-06-01 18:05:47,045). This won't happen anymore because the migrate operation is no longer transactive.

4. Eventually the VMs remain locked. This happens once the maximum number of retries to migrate the VM is reached (and the migration fails). It was fixed in 3.4. Patch for 3.3: http://gerrit.ovirt.org/#/c/28460
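Item 4 can be illustrated with a similar hypothetical Java sketch, assuming that a migration keeps its VM lock while it is re-attempted on other hosts. `RetryingMigrationSketch` and its `maxRetries` parameter are illustrative names, not oVirt code, and the `IntPredicate` merely simulates the outcome of each attempt. The point is that "retries exhausted" is also an exit path and must free the lock, just like the success path.

~~~java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntPredicate;

/** Hypothetical migration that holds a per-VM lock across re-attempts. */
final class RetryingMigrationSketch {
    private final Set<String> lockedVmIds = ConcurrentHashMap.newKeySet();
    private final int maxRetries;

    RetryingMigrationSketch(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    boolean isLocked(String vmId) {
        return lockedVmIds.contains(vmId);
    }

    /**
     * Makes one initial attempt plus up to maxRetries re-attempts.
     * attemptSucceeds.test(n) simulates the outcome of attempt n (0-based).
     */
    boolean migrate(String vmId, IntPredicate attemptSucceeds) {
        if (!lockedVmIds.add(vmId)) {
            return false; // a previous migration of this VM still holds the lock
        }
        try {
            for (int attempt = 0; attempt <= maxRetries; attempt++) {
                if (attemptSucceeds.test(attempt)) {
                    return true;
                }
            }
            return false; // retries exhausted: this path must also unlock
        } finally {
            lockedVmIds.remove(vmId); // runs on success and on exhaustion
        }
    }
}
~~~

Because the release sits in `finally`, a VM whose retries are all exhausted is unlocked and can be started or migrated again without restarting the engine.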

Comment 5 Arik 2014-06-08 12:45:52 UTC

http://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=commit;h=8f6d40c424a9775ba1cbfbbec89a2f433f71d28a

Comment 7 Nisim Simsolo 2014-10-26 12:27:26 UTC

Fixed. Verified using the following builds:
rhevm-3.5.0-0.17.beta.el6ev.noarch
libvirt-0.10.2-46.el6.x86_64
vdsm-4.16.7.1-1.el6ev.x86_64
sanlock-2.8-1.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.448.el6.x86_64

Comment 9 errata-xmlrpc 2015-02-11 18:03:16 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0158.html
