Description of problem:
After a VM migration thread acquires a resource lock and then fails for some reason, it does not release the lock. Later, when other threads try to perform operations such as Migrate or Run VM, they fail to acquire the lock with the error below:

~~~
2014-06-01 18:15:53,807 INFO [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (DefaultQuartzScheduler_Worker-3) [36055ac9] Failed to Acquire Lock to object EngineLock [exclusiveLocks= key: 40abfe32-96de-44a7-abd9-e77ecd2bec7b value: VM , sharedLocks= ]
2014-06-01 18:15:53,808 WARN [org.ovirt.engine.core.bll.InternalMigrateVmCommand] (DefaultQuartzScheduler_Worker-3) [36055ac9] CanDoAction of action InternalMigrateVm failed. Reasons:VAR__ACTION__MIGRATE,VAR__TYPE__VM,ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED,$VmName zabbix-190
~~~

Version-Release number of selected component (if applicable):
rhevm-3.3.3-0.52.el6ev.noarch

How reproducible:
A couple of customers reported that VM migrations keep failing while a host is being put into maintenance, and that afterwards they were unable to start or stop VMs as well.

Steps to Reproduce:
1.
2.
3.

Actual results:
After the first failure, subsequent attempts to run or migrate the VM fail while acquiring locks.

Expected results:
After the first failure, subsequent attempts to run or migrate the VM should be able to acquire the resource locks.

Additional info:
After restarting the ovirt-engine service, one of the customers was able to start and migrate VMs.
Created attachment 901619 [details]
engine.log

Adding engine.log from one of the customers facing this problem.
This bug exposed several issues:

1. The first migrations failed not because the VMs remained locked, but because of a bug that caused 'switch host to maintenance' to be re-triggered too soon. In the retry attempt the migrations fail because the VMs are already locked. This one is solved by http://gerrit.ovirt.org/#/c/28403

2. NPEs in the migrate operations. These exceptions, in migrations triggered by the 'switch to maintenance' operation, were already solved by http://gerrit.ovirt.org/#/c/24639

3. The migrate transaction was aborted, so the migrate operation failed (on 2014-06-01 18:05:47,045). This won't happen anymore, as the migrate operation is no longer transactional.

4. Eventually the VMs remain locked. This happens after the maximum number of retries to migrate the VM is reached (and the migration fails). It was fixed in 3.4. Patch for 3.3: http://gerrit.ovirt.org/#/c/28460
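To illustrate issue 4 for readers unfamiliar with the pattern, here is a minimal sketch in Python of releasing an exclusive per-VM lock on every exit path, including the "maximum retries reached" path. It is not the engine code and not the content of the patches above; `InMemoryLockManager`, `migrate_with_retries`, and the `attempt_migration` callback are hypothetical names invented for this example, and the real engine may release its lock at a different point (for example, when an asynchronous migration completes).

```python
import threading


class InMemoryLockManager:
    """Hypothetical stand-in for an engine-wide exclusive lock registry."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the set below
        self._exclusive = set()         # keys currently locked exclusively

    def acquire(self, key):
        """Return True if the exclusive lock was taken, False if already held."""
        with self._guard:
            if key in self._exclusive:
                return False
            self._exclusive.add(key)
            return True

    def release(self, key):
        """Drop the exclusive lock on key (no-op if it is not held)."""
        with self._guard:
            self._exclusive.discard(key)

    def is_locked(self, key):
        with self._guard:
            return key in self._exclusive


def migrate_with_retries(locks, vm_id, attempt_migration, max_retries=3):
    """Try to migrate vm_id up to max_retries times.

    The lock is released in a finally block, so it is freed whether the
    migration succeeds, raises, or runs out of retries.
    """
    if not locks.acquire(vm_id):
        return "lock-busy"
    try:
        for _ in range(max_retries):
            try:
                if attempt_migration(vm_id):
                    return "migrated"
            except Exception:
                # Assumption for this sketch: an error counts as a failed attempt.
                continue
        return "failed"
    finally:
        locks.release(vm_id)
```

With this shape, a call such as `migrate_with_retries(locks, "vm-1", lambda vm: False)` returns `"failed"` and leaves `"vm-1"` unlocked, so a later run or migrate request can take the lock again without restarting the service.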
http://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=commit;h=8f6d40c424a9775ba1cbfbbec89a2f433f71d28a
Fixed. Verified using the following builds:
rhevm-3.5.0-0.17.beta.el6ev.noarch
libvirt-0.10.2-46.el6.x86_64
vdsm-4.16.7.1-1.el6ev.x86_64
sanlock-2.8-1.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.448.el6.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-0158.html