1292130 – [RHEV clone] Live Merge - request to remove images deleted db records without performing merge

Bug 1292130 - [RHEV clone] Live Merge - request to remove images deleted db records without performing merge

Summary: [RHEV clone] Live Merge - request to remove images deleted db records without...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	3.5.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	ovirt-3.5.7
Target Release:	3.5.7
Assignee:	Greg Padgett
QA Contact:	Aharon Canan
Docs Contact:
URL:
Whiteboard:	storage
Depends On:	1280160
Blocks:
TreeView+	depends on / blocked

Reported:	2015-12-16 14:43 UTC by Allon Mureinik
Modified:	2016-02-10 17:31 UTC (History)
CC List:	14 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:	1280160
Environment:
Last Closed:	2016-01-12 20:41:40 UTC
oVirt Team:	Storage
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2016:0029	normal	SHIPPED_LIVE	Red Hat Enterprise Virtualization Manager 3.5.7 update	2016-01-13 01:39:40 UTC
oVirt gerrit	48400	None	None	None	2015-12-16 14:43:16 UTC
oVirt gerrit	48427	None	None	None	2015-12-16 14:43:16 UTC
oVirt gerrit	48441	None	None	None	2015-12-16 14:43:16 UTC

Description Allon Mureinik 2015-12-16 14:43:16 UTC

+++ This bug was initially created as a clone of Bug #1280160 +++

Description of problem:
Due to a race condition, Live Merge execution can claim successful completion after only removing the snapshot record from the database, skipping the merge operation altogether.

Version-Release number of selected component (if applicable):
Witnessed on ovirt master commit 01b8febd9b; however looking through the logs it appears this is a possibility in all versions with this feature.

How reproducible:
Based on some testing on my work laptop:
 - before 3.6.0, approximately 1-in-5000 executions
 - 3.6.0 or later, approximately 1-in-500 executions
(See below for more detail.)

Steps to Reproduce:
1. Execute live merge
2. Hope for the best
3. 
The best way to reproduce the bug is to make a small code change that makes reproduction much more likely.  Changing the loop initialization In CommandExecutor to use TimeUnit.MILLISECONDS instead of TimeUnit.SECONDS allows reproduction after just a handful of attempts.

Actual results:
RemoveSnapshotCommand executes without RemoveSnapshotSingleDiskLiveCommand.  This means:
- the snapshot record is removed from the db with no relevant updates
- the image record remains in the db with no relevant updates
- no vdsm verbs are called, thus the merge is not complete
- data loss after subsequent merges??

Expected results:
- db fields should be updated and removed as appropriate
- the merge should occur

Additional info:

The root cause of this is a race caused by the RemoveSnapshotCommandCallback.doPolling() method not accounting for the NOT_STARTED task status of its RemoveSnapshotSingleDiskLiveCommand children.  When RemoveSnapshotCommand starts its children, the children are executed asynchronously.  Usually (in my tests) the children start before the parent is complete; however, sometimes the order is reversed and this opens up a small window of time during which the callback might execute and find the child commands in the NOT_STARTED state.

When this occurs, the callback errantly considers the children to have completed successfully, causing the end-action callbacks on the parent and child commands to be triggered.  The RemoveSnapshotSingleDiskCommand is thus bypassed, causing the merge to not complete.

Testing on my laptop gives evidence that the behavior is usually what is desired.  8 of 10 tests shows a favorable command sequence, whereas 2 tests showed an small (8ms) window of time where a well-timed interruption would trigger this bug.  The probabilities listed above are based on this small sample.

--- Additional comment from Greg Padgett on 2015-11-11 06:27:40 IST ---

Setting target to 4.0, but this has a very small, low-risk fix that I believe should be backported.

--- Additional comment from Greg Padgett on 2015-11-11 07:19:49 IST ---

FYI this was the commit that made the problem more likely to happen in more recent releases: https://gerrit.ovirt.org/#/c/43428/ (core: capped exponential backoff for CoCo CommandExecutor, sha 3175d5a141)

--- Additional comment from Allon Mureinik on 2015-11-11 10:38:08 IST ---

Definitely should be backported to oVirt 3.6.1.

Frankly, given the triviality of the patch, I'd even backport it to the 3.5.z branch. Even though it's [much] less likely to occur there, if it does it results in a pretty nasty outcome.

Comment 1 Eyal Edri 2015-12-17 13:34:24 UTC

included in the build.

Comment 2 Aharon Canan 2015-12-23 15:59:20 UTC

From description it sounds like in order to reproduce/verify I need to change the code which is can't be part of verification process.

Do we have a way to verify without code change ?

Comment 3 Tal Nisan 2015-12-27 09:08:14 UTC

The issue described here is a race condition, it is close to impossible to reproduce it without modifying the code or using a debugger.
Best course of action will be to run the live merge sanity just to make sure nothing got broken in the flow

Comment 4 Aharon Canan 2015-12-27 14:28:34 UTC

(In reply to Tal Nisan from comment #3)
> The issue described here is a race condition, it is close to impossible to
> reproduce it without modifying the code or using a debugger.
> Best course of action will be to run the live merge sanity just to make sure
> nothing got broken in the flow

Verified

Comment 7 errata-xmlrpc 2016-01-12 20:41:40 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0029.html

Note You need to log in before you can comment on or make changes to this bug.