Bug 1484825 - Auto generated snapshot remains LOCKED after concurrent LSM
Summary: Auto generated snapshot remains LOCKED after concurrent LSM
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.1.7
Target Release: 4.1.7.1
Assignee: Benny Zlotnik
QA Contact: Lilach Zitnitski
URL:
Whiteboard:
Depends On:
Blocks: 1494711
 
Reported: 2017-08-24 11:29 UTC by Lilach Zitnitski
Modified: 2017-11-13 12:24 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-13 12:24:52 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: ovirt-4.2+
rule-engine: blocker+


Attachments
logs (345.28 KB, application/zip)
2017-08-24 11:29 UTC, Lilach Zitnitski


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 81913 0 master MERGED core: fix race in LSM 2020-02-04 12:41:43 UTC
oVirt gerrit 81996 0 ovirt-engine-4.1 MERGED core: fix race in LSM 2020-02-04 12:41:43 UTC

Description Lilach Zitnitski 2017-08-24 11:29:05 UTC
Description of problem:
When starting concurrent live storage migrations, one of the auto-generated snapshots remains LOCKED.

Version-Release number of selected component (if applicable):
vdsm-4.20.2-90.git6511af5.el7.centos.x86_64
ovirt-engine-4.2.0-0.0.master.20170821071755.git5677f03.el7.centos.noarch

How reproducible:
100% so far 

Steps to Reproduce:
1. create vm with 4 disks
2. start the vm
3. start migrating the first disk, wait for the auto-generated snapshot to be in status OK, and then start migrating the next disk (a reproduction sketch follows this list)
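For reference, a minimal reproduction sketch using the oVirt Python SDK (ovirtsdk4). The engine URL, credentials, VM name 'my_vm' and target storage domain 'target_sd' are placeholders, and the polling is deliberately simplified (it just waits until no snapshot of the VM is LOCKED before moving the next disk):

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholders: engine URL, credentials, VM name and target storage domain.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=my_vm')[0]
vm_service = vms_service.vm_service(vm.id)
snapshots_service = vm_service.snapshots_service()
disks_service = connection.system_service().disks_service()

def wait_until_no_locked_snapshot():
    # The LSM auto-generated snapshot shows up as LOCKED while it is being
    # created or removed; wait until all snapshots are back to OK.
    while any(s.snapshot_status == types.SnapshotStatus.LOCKED
              for s in snapshots_service.list()):
        time.sleep(5)

for attachment in vm_service.disk_attachments_service().list():
    # Live storage migration of one disk while the VM is running.
    disks_service.disk_service(attachment.disk.id).move(
        storage_domain=types.StorageDomain(name='target_sd'))
    wait_until_no_locked_snapshot()

connection.close()

In the failing run described below, the third auto-generated snapshot never leaves LOCKED, so a loop like the one above would spin indefinitely; adding a timeout makes the hang visible.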

Actual results:
The first 2 disks are migrated and their auto-generated snapshots are deleted; the 3rd snapshot is stuck in status LOCKED.

Expected results:
All disks should migrate successfully and all auto-generated snapshots should be removed.

Additional info:
The correlation ID of the migration of the disk with the problematic snapshot is disks_syncAction_f6311339-4d5e-4c75.

engine.log

2017-08-24 14:09:48,561+03 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateVmDisksCommand] (DefaultQuartzScheduler1) [disks_syncAction_f6311339-4d5e-4c75] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateVmDisksCommand' with failure.

Comment 1 Lilach Zitnitski 2017-08-24 11:29:31 UTC
Created attachment 1317642 [details]
logs

Comment 2 Allon Mureinik 2017-08-24 11:56:22 UTC
Benny, can you take a look please?

Comment 3 Red Hat Bugzilla Rules Engine 2017-08-24 11:56:27 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 4 Benny Zlotnik 2017-08-25 11:46:24 UTC
Yes, from the logs it seems the snapshot deletion fails because the RemoveSnapshot validation failed:
2017-08-24 14:09:45,294+03 WARN  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (DefaultQuartzScheduler10) [disks_syncAction_f6311339-4d5e-4c75] Validation of action 'RemoveSnapshot' failed for user admin@internal-authz. Reasons: VAR__TYPE__SNAPSHOT,VAR__ACTION__REMOVE,ACTION_TYPE_FAILED_VM_IS_DURING_SNAPSHOT

Which is similar to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1465539
I'll check why this is happening again.
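To make the suspected race concrete, here is a toy model in Python (not the engine's actual Java code; the counters and phase durations are invented for illustration): each LSM creates an auto-generated snapshot, and when it later tries to remove it, the removal's validation is rejected if another concurrent LSM has meanwhile put the VM back into a snapshot operation, so the first flow's snapshot is left LOCKED.

import threading
import time

# Toy model of the suspected interleaving -- NOT the actual engine code.
state_lock = threading.Lock()
snapshot_ops_in_flight = 0      # VM counts as "during snapshot" while > 0
snapshots = {}                  # snapshot name -> status

def lsm(disk_name, sync_time):
    global snapshot_ops_in_flight
    snap = 'Auto-generated for ' + disk_name

    # Phase 1: create the auto-generated snapshot.
    with state_lock:
        snapshot_ops_in_flight += 1
        snapshots[snap] = 'LOCKED'
    time.sleep(0.05)                      # snapshot creation
    with state_lock:
        snapshots[snap] = 'OK'
        snapshot_ops_in_flight -= 1

    # Phase 2: replicate/sync the disk, then remove the snapshot.
    time.sleep(sync_time)
    with state_lock:
        if snapshot_ops_in_flight:        # a concurrent LSM is mid-snapshot
            # Removal validation fails; the snapshot is never cleaned up.
            print(disk_name, 'ACTION_TYPE_FAILED_VM_IS_DURING_SNAPSHOT,',
                  snap, 'stays LOCKED')
            snapshots[snap] = 'LOCKED'
        else:
            del snapshots[snap]

threads = [threading.Thread(target=lsm, args=('disk%d' % i, 0.03 * i))
           for i in range(1, 5)]
for t in threads:
    t.start()
    time.sleep(0.08)    # start the next migration once the snapshot is OK
for t in threads:
    t.join()
print('leftover snapshots:', snapshots)

With these timings, at least one migration's snapshot removal typically lands while a later migration is mid-snapshot, which matches the symptom in the engine log above.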

Comment 5 Allon Mureinik 2017-09-19 05:55:53 UTC
Benny, the attached patch is now merged.
Should the BZ be moved to MODIFIED, or are we waiting for other patches?

Comment 6 Benny Zlotnik 2017-09-19 07:11:28 UTC
No other patches, moving to modified

Comment 7 Yaniv Kaul 2017-09-19 11:43:40 UTC
(In reply to Benny Zlotnik from comment #6)
> No other patches, moving to modified

Do we need to backport this to 4.1.z?

Comment 8 Allon Mureinik 2017-09-19 12:21:57 UTC
(In reply to Yaniv Kaul from comment #7)
> (In reply to Benny Zlotnik from comment #6)
> > No other patches, moving to modified
> 
> Do we need to backport this to 4.1.z?

Looking at the code, I don't think this is actually a regression; the problem seems to have been there for quite some time.
Having said that, it's a nasty issue, and the fix seems straightforward. Benny - let's get this into 4.1.7?

Comment 9 Benny Zlotnik 2017-09-19 13:00:38 UTC
Sent a patch to 4.1.7

Comment 10 Lilach Zitnitski 2017-09-24 12:58:20 UTC
--------------------------------------
Tested with the following code:
----------------------------------------

rhevm-4.1.7.1-0.1.el7.noarch
vdsm-4.19.32-1.el7ev.x86_64

Tested with the following scenario:

Steps to Reproduce:
1. create vm with 4 disks
2. start the vm
3. start migrating the disk, wait for the auto-generated snapshot to be in status OK and then start migrating the next disk

Actual results:
All disks migrated successfully and all auto-generated snapshots were removed.

Expected results:

Moving to VERIFIED!
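For completeness, one way to double-check that outcome from the SDK (a hedged sketch, reusing the placeholder names 'my_vm' and 'target_sd' from the reproduction sketch in the description):

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholders: same engine, VM and target storage domain as in the
# reproduction sketch in the description.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)
vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=my_vm')[0]
vm_service = vms_service.vm_service(vm.id)
disks_service = connection.system_service().disks_service()
target_sd = connection.system_service().storage_domains_service().list(
    search='name=target_sd')[0]

# No snapshot should remain LOCKED after all migrations have finished.
locked = [s.description for s in vm_service.snapshots_service().list()
          if s.snapshot_status == types.SnapshotStatus.LOCKED]
assert not locked, 'snapshots still LOCKED: %s' % locked

# Every disk should now report the target storage domain.
for attachment in vm_service.disk_attachments_service().list():
    disk = disks_service.disk_service(attachment.disk.id).get()
    assert any(sd.id == target_sd.id for sd in disk.storage_domains), \
        'disk %s is not on target_sd' % disk.alias

connection.close()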

