Description of problem:
When starting concurrent live storage migration, one of the auto-generated snapshots remains LOCKED.

Version-Release number of selected component (if applicable):
vdsm-4.20.2-90.git6511af5.el7.centos.x86_64
ovirt-engine-4.2.0-0.0.master.20170821071755.git5677f03.el7.centos.noarch

How reproducible:
100% so far

Steps to Reproduce:
1. Create a VM with 4 disks
2. Start the VM
3. Start migrating a disk; wait for the auto-generated snapshot to reach status OK, then start migrating the next disk

Actual results:
The first 2 disks and their snapshots are migrated and deleted; the 3rd snapshot is stuck in status LOCKED.

Expected results:
All disks should migrate successfully and all auto-generated snapshots should be removed.

Additional info:
Correlation ID of migrating the disk with the problematic snapshot: disks_syncAction_f6311339-4d5e-4c75

engine.log:
2017-08-24 14:09:48,561+03 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateVmDisksCommand] (DefaultQuartzScheduler1) [disks_syncAction_f6311339-4d5e-4c75] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateVmDisksCommand' with failure.
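For the record, the per-disk wait in step 3 was done by polling the snapshot status. A minimal sketch of that polling logic (the function and its parameters are illustrative, not the actual test harness; in a real run `get_status` would query the engine's REST API):

```python
import time

def wait_for_snapshot_ok(get_status, timeout=300, interval=5):
    """Poll a snapshot's status until it leaves LOCKED and reaches OK.

    get_status: callable returning the snapshot's current status string
    (e.g. "LOCKED" or "OK"). Raises TimeoutError if the snapshot does
    not reach OK within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_status() == "OK":
            return True
        time.sleep(interval)
    raise TimeoutError("snapshot did not reach OK within %s seconds" % timeout)

# Example with a stubbed status source: the snapshot becomes OK on the 3rd poll.
statuses = iter(["LOCKED", "LOCKED", "OK"])
print(wait_for_snapshot_ok(lambda: next(statuses), interval=0))
```

Only after this returns for the current disk is the next disk's migration started.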
Created attachment 1317642 [details] logs
Benny, can you take a look please?
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Yes, from the logs it seems the deletion fails because the validation failed:

2017-08-24 14:09:45,294+03 WARN [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (DefaultQuartzScheduler10) [disks_syncAction_f6311339-4d5e-4c75] Validation of action 'RemoveSnapshot' failed for user admin@internal-authz. Reasons: VAR__TYPE__SNAPSHOT,VAR__ACTION__REMOVE,ACTION_TYPE_FAILED_VM_IS_DURING_SNAPSHOT

Which is similar to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1465539

I'll check why this is happening again.
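To illustrate the failure mode in the log above (a simplified toy model, not the engine's actual code): removal of one disk's auto-generated snapshot is validated against a VM-wide "during snapshot" condition, so a snapshot operation still in flight for another disk's migration makes the validation fail with ACTION_TYPE_FAILED_VM_IS_DURING_SNAPSHOT:

```python
# Toy model of the validation race (illustrative only, not engine code).
active_snapshot_ops = set()  # snapshot operations currently running on the VM

def start_snapshot_op(disk_id):
    active_snapshot_ops.add(disk_id)

def finish_snapshot_op(disk_id):
    active_snapshot_ops.discard(disk_id)

def can_remove_auto_snapshot():
    # VM-wide check: any in-flight snapshot operation blocks removal,
    # even one belonging to a different disk's migration.
    return not active_snapshot_ops

# Another disk's migration still has a snapshot operation running when
# removal of an earlier auto-generated snapshot is attempted:
start_snapshot_op("disk3")
print(can_remove_auto_snapshot())  # validation fails -> snapshot stays LOCKED
finish_snapshot_op("disk3")
print(can_remove_auto_snapshot())  # removal would now be allowed
```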
Benny, the attached patch is now merged. Should the BZ be moved to MODIFIED, or are we waiting for other patches?
No other patches, moving to modified
(In reply to Benny Zlotnik from comment #6)
> No other patches, moving to modified

Do we need to backport this to 4.1.z?
(In reply to Yaniv Kaul from comment #7)
> (In reply to Benny Zlotnik from comment #6)
> > No other patches, moving to modified
>
> Do we need to backport this to 4.1.z?

Looking at the code, I don't think this is actually a regression; the problem seems to have been in the code for quite some time. Having said that, it's a nasty issue, and the fix seems straightforward. Benny - let's get this into 4.1.7?
Sent a patch to 4.1.7
--------------------------------------
Tested with the following code:
--------------------------------------
rhevm-4.1.7.1-0.1.el7.noarch
vdsm-4.19.32-1.el7ev.x86_64

Tested with the following scenario:

Steps to Reproduce:
1. Create a VM with 4 disks
2. Start the VM
3. Start migrating a disk; wait for the auto-generated snapshot to reach status OK, then start migrating the next disk

Actual results:
All disks migrated successfully and all auto-generated snapshots were removed, as expected.

Moving to VERIFIED!