Description of problem: When a VM live migration fails, the resource lock is not released and further attempts for migration fails. It is also not possible to start the VM if it is powered down. The following message is displayed: Cannot run VM. VM <VM NAME> is being migrated. Version-Release number of selected component (if applicable): rhevm-3.4.2-0.2.el6ev.noarch How reproducible: occasionally when some of the migrations fails. Steps to Reproduce: Not completely clear Actual results: The resource lock is not released and the VM cannot be migrated or started. The only workaround is to restart the engine. Expected results: The resource lock is always released Additional info: There is already a couple of issues fixed in bug https://bugzilla.redhat.com/show_bug.cgi?id=1114587 , but it does not seem to be fully fixed. Logs: Here is the first occurrence of the consequence: var/log/ovirt-engine/engine.log-20141009.gz: 2014-10-08 16:51:48,711 INFO [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-4-thread-34) [42080fcb] Failed to Acquire Lock to object EngineLock [exclusiveLocks= key: 1e4689ca-9b5e-450f-a1d6-eea83b67175a value: VM 2014-10-08 16:51:48,711 WARN [org.ovirt.engine.core.bll.RunVmCommand] (org.ovirt.thread.pool-4-thread-34) [42080fcb] CanDoAction of action RunVm failed. Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VM_IS_BEING_MIGRA 63237 TED,$VmName VM01 The first successful equired lock before this operation happened at 2014-10-06 13:20:06 when we started the VM live migration. The source host (rhevh01.example.com) had some issues and was reinitialized during the live migration. $ xzgrep -E '/VM01|32962ab0|6aa8269|1e4689ca-9b5e-450f-a1d6-eea83b67175a|310499b1' var/log/ovirt-engine/engine.log-20141007.gz 2014-10-06 13:20:06,725 INFO [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (org.ovirt.thread.pool-4-thread-42) [310499b1] Lock Acquired to object EngineLock [exclusiveLocks= key: 1e4689ca-9b5e-450f-a1d6-eea83b67175a value: VM 2014-10-06 13:20:08,441 INFO [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (org.ovirt.thread.pool-4-thread-42) [310499b1] Candidate host rhevh02.example.com (bf8fc059-18bb-4333-8851-1e7ea7f2f0c4) was filtered out by VAR__FILTERTYPE__INTERNAL filter Memory 2014-10-06 13:20:11,125 INFO [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (org.ovirt.thread.pool-4-thread-42) [310499b1] Running command: InternalMigrateVmCommand internal: true. Entities affected : ID: 1e4689ca-9b5e-450f-a1d6-eea83b67175a Type: VM 2014-10-06 13:20:11,287 INFO [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (org.ovirt.thread.pool-4-thread-42) [310499b1] Candidate host rhevh02.example.com (bf8fc059-18bb-4333-8851-1e7ea7f2f0c4) was filtered out by VAR__FILTERTYPE__INTERNAL filter Memory (correlation id: 310499b1) 2014-10-06 13:20:13,623 INFO [org.ovirt.engine.core.bll.scheduling.policyunits.HaReservationWeightPolicyUnit] (org.ovirt.thread.pool-4-thread-42) [310499b1] Started HA reservation scoring method 2014-10-06 13:20:15,220 INFO [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (org.ovirt.thread.pool-4-thread-42) [310499b1] START, MigrateVDSCommand(HostName = rhevh01.example.com, HostId = 7df6fab3-1b6e-46e5-9c43-60e46e1ff355, vmId=1e4689ca-9b5e-450f-a1d6-eea83b67175a, srcHost=192.168.1.2, dstVdsId=3f3e360a-9598-4a20-aff6-23727d13b670, dstHost=192.168.1.3:54321, migrationMethod=ONLINE, tunnelMigration=false, migrationDowntime=0), log id: 7c320061 2014-10-06 13:20:16,688 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (org.ovirt.thread.pool-4-thread-42) [310499b1] VdsBroker::migrate::Entered (vm_guid=1e4689ca-9b5e-450f-a1d6-eea83b67175a, srcHost=192.168.1.2, dstHost=192.168.1.3:54321, method=online 2014-10-06 13:20:16,920 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (org.ovirt.thread.pool-4-thread-42) [310499b1] START, MigrateBrokerVDSCommand(HostName = rhevh01.example.com, HostId = 7df6fab3-1b6e-46e5-9c43-60e46e1ff355, vmId=1e4689ca-9b5e-450f-a1d6-eea83b67175a, srcHost=192.168.1.2, dstVdsId=3f3e360a-9598-4a20-aff6-23727d13b670, dstHost=192.168.1.3:54321, migrationMethod=ONLINE, tunnelMigration=false, migrationDowntime=0), log id: 4822c1ad 2014-10-06 13:20:19,860 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (org.ovirt.thread.pool-4-thread-42) [310499b1] Command org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand return value 2014-10-06 13:20:19,860 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (org.ovirt.thread.pool-4-thread-42) [310499b1] HostName = rhevh01.example.com 2014-10-06 13:20:19,860 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (org.ovirt.thread.pool-4-thread-42) [310499b1] Command MigrateBrokerVDSCommand(HostName = rhevh01.example.com, HostId = 7df6fab3-1b6e-46e5-9c43-60e46e1ff355, vmId=1e4689ca-9b5e-450f-a1d6-eea83b67175a, srcHost=192.168.1.2, dstVdsId=3f3e360a-9598-4a20-aff6-23727d13b670, dstHost=192.168.1.3:54321, migrationMethod=ONLINE, tunnelMigration=false, migrationDowntime=0) execution failed. Exception: VDSRecoveringException: Recovering from crash or Initializing 2014-10-06 13:20:19,860 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateBrokerVDSCommand] (org.ovirt.thread.pool-4-thread-42) [310499b1] FINISH, MigrateBrokerVDSCommand, log id: 4822c1ad 2014-10-06 13:20:19,942 ERROR [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (org.ovirt.thread.pool-4-thread-42) [310499b1] Failed Vm migration 2014-10-06 13:20:19,942 INFO [org.ovirt.engine.core.vdsbroker.MigrateVDSCommand] (org.ovirt.thread.pool-4-thread-42) [310499b1] FINISH, MigrateVDSCommand, return: Up, log id: 7c320061 2014-10-06 13:21:46,195 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-31) [26da4afe] VM VM01 1e4689ca-9b5e-450f-a1d6-eea83b67175a moved from Up --> Paused There is no lock release for 1e4689ca-9b5e-450f-a1d6-eea83b67175a in the rest of the logs.
fixed by http://gerrit.ovirt.org/#/c/24799/
(In reply to Arik from comment #3) > fixed by http://gerrit.ovirt.org/#/c/24799/ Are you sure that this exact issue was fixed by the mentioned fix? As the problem seems to occur on rhevm-3.4.2-0.2.el6ev.noarch. This patch is already included there.
(In reply to Roman Hodain from comment #4) Yes, I'm sure. The backport of this patch to 3.4 was a bit complicated so it is not exactly the same patch in both places. look at the 'perform' method in both places: In http://gerrit.ovirt.org/#/c/24799/19/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/MigrateVmCommand.java,cm (master), in case of exception we're calling runningFailed which release the lock While in http://gerrit.ovirt.org/#/c/29872/3/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/MigrateVmCommand.java,cm (3.4), in case of exception we're not calling runningFailed so the lock remains Switching it to ON_QA so they'll verify it as well on 3.5. If we need it to be fixed in 3.4, we need to make a dedicated fix for 3.4.
suggesting 3.4.z flag, if needed/acked we will create a fix for it
Fixed. Verified using the next builds: rhevm-3.5.0-0.17.beta.el6ev.noarch libvirt-0.10.2-46.el6.x86_64 vdsm-4.16.7.1-1.el6ev.x86_64 sanlock-2.8-1.el6.x86_64 qemu-kvm-rhev-0.12.1.2-2.448.el6.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-0158.html