Bug 1306367

Summary: Live Merge - recovery flow not working properly when DESTROY_IMAGE_CHECK command fails
Product: [oVirt] ovirt-engine
Component: BLL.Storage
Version: 3.6.2
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: unspecified
Target Milestone: ovirt-3.6.3
Target Release: 3.6.3.2
Reporter: Greg Padgett <gpadgett>
Assignee: Greg Padgett <gpadgett>
QA Contact: Elad <ebenahar>
Docs Contact:
CC: acanan, ahino, amureini, bugs, gklein, gpadgett, sbonazzo, tnisan
Flags: rule-engine: ovirt-3.6.z+
       rule-engine: planning_ack+
       tnisan: devel_ack+
       rule-engine: testing_ack+
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-10 12:49:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Greg Padgett 2016-02-10 16:55:14 UTC
Description of problem:
After a Live Merge failure, the engine is supposed to update records in the database so that a retry of the command can identify how far the merge progressed and recover from the failure.

It's not always possible to do this in code, but at least 2 problems have cropped up lately which weren't present in the original design and which prevent automatic recovery in cases where it should be achievable.

1. The addition of DestroyImageCheckCommand (a fix for bug 1227497) was not accounted for when checking whether the merge completed successfully, so retrying a failed live merge in which the merge itself succeeded but the image removal failed may attempt the merge again.  This causes a failure that requires manual intervention to fix (a sketch of the intended retry decision follows this list).

2. It looks like something changed the way the database records are marked in the failure case in general, and the images are no longer being marked illegal.  Consequently, the Live Merge code won't think it has ever attempted a merge on the image.  Even worse, other operations such as snapshot preview may attempt to use a partially-merged snapshot.  If the preview appears to work and the user commits it, the VM may end up with corrupt data.
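
To make the intended recovery decision concrete, here is a minimal illustrative sketch (Java, since ovirt-engine is Java). This is not the actual engine code: the class, enum, and method names are invented for the example, and only the ILLEGAL status mirrors the image status referred to in problem #2.

/**
 * Illustrative sketch only -- not the real ovirt-engine code. It models the
 * decision a live-merge retry should make once DestroyImageCheckCommand is
 * part of the flow (problem #1) and why failed images must be marked ILLEGAL
 * (problem #2). All names here are invented for the example.
 */
public class LiveMergeRecoverySketch {

    enum StepStatus { SUCCEEDED, FAILED, UNKNOWN }

    static String decideNextAction(StepStatus merge, StepStatus destroyImageCheck) {
        if (merge == StepStatus.SUCCEEDED) {
            // The merge itself finished; only removal of the now-unneeded
            // volume failed or is unverified. Re-running the merge here is
            // the failure described in problem #1 -- only the destroy/cleanup
            // step should be repeated.
            return destroyImageCheck == StepStatus.SUCCEEDED
                    ? "nothing to do"
                    : "retry image removal only";
        }
        // The merge did not complete; the image must be marked ILLEGAL in the
        // database (problem #2) so a retry knows a merge was attempted and so
        // snapshot preview cannot use the partially merged volume.
        return "mark image ILLEGAL in the database, then retry the merge";
    }

    public static void main(String[] args) {
        // Problem #1: merge succeeded but DestroyImageCheckCommand failed --
        // the recovery flow must not run the merge again.
        System.out.println(decideNextAction(StepStatus.SUCCEEDED, StepStatus.FAILED));
        // Merge itself failed: problem #2 is that the image is no longer
        // marked ILLEGAL, so a retry cannot tell a merge was ever attempted.
        System.out.println(decideNextAction(StepStatus.FAILED, StepStatus.UNKNOWN));
    }
}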

Version-Release number of selected component (if applicable):
3.6.2 (for problem #1 above; still investigating problem #2)

How reproducible:
Always

Steps to Reproduce:
This may happen in any environment and is trivial to reproduce with a code modification.  It's not easy to reproduce without one (since we designed Live Merge to try not to fail...)

To verify, make sure Live Merge is functional.  I'll try to get some steps for more involved verification of the recovery flows and post them in a comment on this bz.

Actual results:
The database is not marked properly for some failure cases, causing recovery to require manual intervention.

Expected results:
Live merge should work and recover successfully.

Additional info:

Comment 1 Greg Padgett 2016-02-11 01:08:43 UTC
(In reply to Greg Padgett from comment #0)
> 2. It looks like something changed the way the database records are marked
> in the failure case in general, and the images are no longer being marked
> illegal.  Consequently, the Live Merge code won't think it has ever
> attempted a merge on the image.  Even worse, other operations such as
> snapshot preview may attempt to use a partially-merged snapshot.  If it
> appears to work and the user commits the preview, the vm may end up with
> corrupt data.
> 
> Version-Release number of selected component (if applicable):
> 3.6.2 (for problem #1 above; still investigating problem #2)

The second problem does not apply to the 3.6 branch.

Comment 2 Red Hat Bugzilla Rules Engine 2016-02-11 01:33:07 UTC
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

Comment 3 Elad 2016-02-23 10:20:32 UTC
Hi Greg, can you please provide us a way to reproduce the bug?

Comment 4 Ala Hino 2016-03-01 09:26:23 UTC
Hi Elad,

Basically, to reproduce this issue we have to fake the result of getVolumeInfo after invoking DESTROY_IMAGE.
This can be done by stopping vdsm or shutting down the host after you see in the log that DESTROY_IMAGE completed.
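
If it helps to automate the timing of that step, below is a rough standalone sketch (Java, to match the engine codebase) that tails engine.log and stops vdsm on the host as soon as a DestroyImageCheckCommand line appears, i.e. right after DESTROY_IMAGE has completed. The log path, the marker string, and "root@host1" are assumptions to adjust for the environment; doing the same by hand, as described above, works just as well.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StopVdsmAfterDestroyImage {
    public static void main(String[] args) throws IOException, InterruptedException {
        String logPath = "/var/log/ovirt-engine/engine.log"; // assumed default engine log location
        String marker = "DestroyImageCheckCommand";          // logged once the destroy-image check runs

        try (BufferedReader reader = Files.newBufferedReader(Paths.get(logPath))) {
            // Fast-forward past whatever is already in the log.
            while (reader.readLine() != null) { }
            // Poll for new lines and react to the first one containing the marker.
            while (true) {
                String line = reader.readLine();
                if (line == null) {
                    Thread.sleep(200);
                    continue;
                }
                if (line.contains(marker)) {
                    // "root@host1" is a placeholder for the hypervisor running the VM.
                    new ProcessBuilder("ssh", "root@host1", "systemctl", "stop", "vdsmd")
                            .inheritIO().start().waitFor();
                    return;
                }
            }
        }
    }
}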

Comment 5 Elad 2016-03-02 07:59:45 UTC
Thanks, 

Tested accordingly: 
Rebooted the hypervisor right after I saw the following in engine.log:

2016-03-02 07:36:53,715 INFO  [org.ovirt.engine.core.bll.DestroyImageCheckCommand] (pool-5-thread-5) [2e486e8c] Running command: DestroyImageCheckCommand internal: true.
2016-03-02 07:36:55,906 INFO  [org.ovirt.engine.core.bll.DestroyImageCheckCommand] (pool-5-thread-5) [2e486e8c] Requested images were successfully removed

The live merge operation was marked as failed after a few minutes:

2016-03-02 07:41:47,747 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-68) [7162a4dd] All Live Merge child commands have completed, status 'FAILED'
2016-03-02 07:41:48,792 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (DefaultQuartzScheduler_Worker-9) [7162a4dd] Ending command 'org.ovirt.engine.core.bll.RemoveSnapshotCommand' with failure.
2016-03-02 07:41:48,869 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-9) [7162a4dd] Correlation ID: 7162a4dd, Job ID: ba5075ea-36b0-4e79-8218-549e02412b95, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot '1' for VM 'test1'.


Using:
rhevm-3.6.3.3-0.1.el6.noarch
vdsm-4.17.23-0.el7ev.noarch


Greg, please confirm that the test was done properly and the results are as expected, so I can move this to VERIFIED, thanks.

Comment 6 Ala Hino 2016-03-02 12:51:25 UTC
Elad,

The results seem to be as expected.
Can you run the manual recovery script (found in bug 1308501) and see whether it fixes the database?

Comment 7 Elad 2016-03-02 13:32:24 UTC
Thanks Ala, 

Executed the script and the snapshot was removed successfully.
Based on this and on comment #5, moving to VERIFIED.