Bug 1153544 - Failed VM migrations do not release VM resource lock properly
Summary: Failed VM migrations do not release VM resource lock properly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.4.1-1
Hardware: x86_64
OS: Linux
high
high
Target Milestone: ---
: 3.5.0
Assignee: Arik
QA Contact: Nisim Simsolo
URL:
Whiteboard: virt
Depends On:
Blocks: 1155571
TreeView+ depends on / blocked
 
Reported: 2014-10-16 07:55 UTC by Roman Hodain
Modified: 2019-08-15 04:04 UTC (History)
13 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, after a failed migration, the target virtual machine remained in a locked state. Further operations on the virtual machine failed with an error: 'Cannot run VM. VM <VM NAME> is being migrated.' With this update, locked virtual machines are released upon migration failure and further operations on the virtual machine are allowed.
Clone Of:
: 1155571 (view as bug list)
Environment:
Last Closed: 2015-02-11 18:09:41 UTC
oVirt Team: ---
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2015:0158 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Virtualization Manager 3.5.0 2015-02-11 22:38:50 UTC

Description Roman Hodain 2014-10-16 07:55:04 UTC
Description of problem:
When a VM live migration fails, the resource lock is not released and further attempts for migration fails. It is also not possible to start the VM if it is powered down. The following message is displayed:

    Cannot run VM. VM <VM NAME> is being migrated.


Version-Release number of selected component (if applicable):
    rhevm-3.4.2-0.2.el6ev.noarch

How reproducible:
    occasionally when some of the migrations fails. 

Steps to Reproduce:
    Not completely clear

Actual results:
    The resource lock is not released and the VM cannot be migrated or started. The only workaround is to restart the engine.

Expected results:
    The resource lock is always released

Additional info:
    There is already a couple of issues fixed in bug https://bugzilla.redhat.com/show_bug.cgi?id=1114587 , but it does not seem to be fully fixed.

Logs:

Here is the first occurrence of the consequence:
var/log/ovirt-engine/engine.log-20141009.gz:
	2014-10-08 16:51:48,711 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-4-thread-34) [42080fcb] Failed to Acquire Lock to object EngineLock [exclusiveLocks= key: 1e4689ca-9b5e-450f-a1d6-eea83b67175a value: VM
	2014-10-08 16:51:48,711 WARN  [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-4-thread-34) [42080fcb] CanDoAction of action RunVm failed. Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VM_IS_BEING_MIGRA  63237 TED,$VmName VM01

The first successful equired lock before this operation happened at 2014-10-06 13:20:06 when we started the VM live migration. The source host (rhevh01.example.com) had some issues and was reinitialized during the live migration.

$ xzgrep -E '/VM01|32962ab0|6aa8269|1e4689ca-9b5e-450f-a1d6-eea83b67175a|310499b1' var/log/ovirt-engine/engine.log-20141007.gz
	2014-10-06 13:20:06,725 INFO  [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (org.ovirt.thread.pool-4-thread-42) [310499b1] Lock Acquired to object EngineLock [exclusiveLocks= key: 1e4689ca-9b5e-450f-a1d6-eea83b67175a value: VM
	2014-10-06 13:20:08,441 INFO  [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (org.ovirt.thread.pool-4-thread-42) [310499b1] Candidate host rhevh02.example.com (bf8fc059-18bb-4333-8851-1e7ea7f2f0c4) was filtered out by VAR__FILTERTYPE__INTERNAL filter Memory
	2014-10-06 13:20:11,125 INFO  [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (org.ovirt.thread.pool-4-thread-42) [310499b1] Running command: InternalMigrateVmCommand internal: true. Entities affected :  ID: 1e4689ca-9b5e-450f-a1d6-eea83b67175a Type: VM
	2014-10-06 13:20:11,287 INFO  [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (org.ovirt.thread.pool-4-thread-42) [310499b1] Candidate host rhevh02.example.com (bf8fc059-18bb-4333-8851-1e7ea7f2f0c4) was filtered out by VAR__FILTERTYPE__INTERNAL filter Memory (correlation id: 310499b1)
	2014-10-06 13:20:13,623 INFO  [org.ovirt.engine.core.bll.scheduling.policyunits.HaReservationWeightPolicyUnit] (org.ovirt.thread.pool-4-thread-42) [310499b1] Started HA reservation scoring method
	2014-10-06 13:20:15,220 INFO  [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (org.ovirt.thread.pool-4-thread-42) [310499b1] START, MigrateVDSCommand(HostName = rhevh01.example.com, HostId = 7df6fab3-1b6e-46e5-9c43-60e46e1ff355, vmId=1e4689ca-9b5e-450f-a1d6-eea83b67175a, srcHost=192.168.1.2, dstVdsId=3f3e360a-9598-4a20-aff6-23727d13b670, dstHost=192.168.1.3:54321, migrationMethod=ONLINE, tunnelMigration=false, migrationDowntime=0), log id: 7c320061
	2014-10-06 13:20:16,688 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (org.ovirt.thread.pool-4-thread-42) [310499b1] VdsBroker::migrate::Entered (vm_guid=1e4689ca-9b5e-450f-a1d6-eea83b67175a, srcHost=192.168.1.2, dstHost=192.168.1.3:54321,  method=online
	2014-10-06 13:20:16,920 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (org.ovirt.thread.pool-4-thread-42) [310499b1] START, MigrateBrokerVDSCommand(HostName = rhevh01.example.com, HostId = 7df6fab3-1b6e-46e5-9c43-60e46e1ff355, vmId=1e4689ca-9b5e-450f-a1d6-eea83b67175a, srcHost=192.168.1.2, dstVdsId=3f3e360a-9598-4a20-aff6-23727d13b670, dstHost=192.168.1.3:54321, migrationMethod=ONLINE, tunnelMigration=false, migrationDowntime=0), log id: 4822c1ad
	2014-10-06 13:20:19,860 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (org.ovirt.thread.pool-4-thread-42) [310499b1] Command org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand return value 
	2014-10-06 13:20:19,860 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (org.ovirt.thread.pool-4-thread-42) [310499b1] HostName = rhevh01.example.com
	2014-10-06 13:20:19,860 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (org.ovirt.thread.pool-4-thread-42) [310499b1] Command MigrateBrokerVDSCommand(HostName = rhevh01.example.com, HostId = 7df6fab3-1b6e-46e5-9c43-60e46e1ff355, vmId=1e4689ca-9b5e-450f-a1d6-eea83b67175a, srcHost=192.168.1.2, dstVdsId=3f3e360a-9598-4a20-aff6-23727d13b670, dstHost=192.168.1.3:54321, migrationMethod=ONLINE, tunnelMigration=false, migrationDowntime=0) execution failed. Exception: VDSRecoveringException: Recovering from crash or Initializing
	2014-10-06 13:20:19,860 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (org.ovirt.thread.pool-4-thread-42) [310499b1] FINISH, MigrateBrokerVDSCommand, log id: 4822c1ad
	2014-10-06 13:20:19,942 ERROR [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (org.ovirt.thread.pool-4-thread-42) [310499b1] Failed Vm migration
	2014-10-06 13:20:19,942 INFO  [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (org.ovirt.thread.pool-4-thread-42) [310499b1] FINISH, MigrateVDSCommand, return: Up, log id: 7c320061
	2014-10-06 13:21:46,195 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-31) [26da4afe] VM VM01 1e4689ca-9b5e-450f-a1d6-eea83b67175a moved from Up --> Paused

There is no lock release for 1e4689ca-9b5e-450f-a1d6-eea83b67175a in the rest of the logs.

Comment 3 Arik 2014-10-19 11:24:43 UTC
fixed by http://gerrit.ovirt.org/#/c/24799/

Comment 4 Roman Hodain 2014-10-20 08:25:33 UTC
(In reply to Arik from comment #3)
> fixed by http://gerrit.ovirt.org/#/c/24799/

Are you sure that this exact issue was fixed by the mentioned fix? As the problem seems to occur on rhevm-3.4.2-0.2.el6ev.noarch. This patch is already included there.

Comment 5 Arik 2014-10-20 15:05:22 UTC
(In reply to Roman Hodain from comment #4)
Yes, I'm sure.
The backport of this patch to 3.4 was a bit complicated so it is not exactly the same patch in both places. look at the 'perform' method in both places:

In http://gerrit.ovirt.org/#/c/24799/19/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/MigrateVmCommand.java,cm (master), in case of exception we're calling runningFailed which release the lock

While in http://gerrit.ovirt.org/#/c/29872/3/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/MigrateVmCommand.java,cm (3.4), in case of exception we're not calling runningFailed so the lock remains

Switching it to ON_QA so they'll verify it as well on 3.5.
If we need it to be fixed in 3.4, we need to make a dedicated fix for 3.4.

Comment 6 Omer Frenkel 2014-10-20 15:08:03 UTC
suggesting 3.4.z flag, if needed/acked we will create a fix for it

Comment 8 Nisim Simsolo 2014-10-26 12:23:43 UTC
Fixed. Verified using the next builds:
rhevm-3.5.0-0.17.beta.el6ev.noarch
libvirt-0.10.2-46.el6.x86_64
vdsm-4.16.7.1-1.el6ev.x86_64
sanlock-2.8-1.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.448.el6.x86_64

Comment 10 errata-xmlrpc 2015-02-11 18:09:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0158.html


Note You need to log in before you can comment on or make changes to this bug.