Bug 1306367 - Live Merge - recovery flow not working properly when DESTROY_IMAGE_CHECK command fails
Live Merge - recovery flow not working properly when DESTROY_IMAGE_CHECK comm...
Status: CLOSED CURRENTRELEASE
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage (Show other bugs)
3.6.2
Unspecified Unspecified
unspecified Severity urgent (vote)
: ovirt-3.6.3
: 3.6.3.2
Assigned To: Greg Padgett
Elad
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2016-02-10 11:55 EST by Greg Padgett
Modified: 2016-05-09 19:12 EDT (History)
8 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-10 07:49:30 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt‑3.6.z+
rule-engine: planning_ack+
tnisan: devel_ack+
rule-engine: testing_ack+


Attachments (Terms of Use)


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 53370 master MERGED core: Live Merge's check for merge completion is wrong 2016-02-14 09:34 EST
oVirt gerrit 53383 ovirt-engine-3.6 MERGED core: Live Merge's check for merge completion is wrong 2016-02-14 09:35 EST
oVirt gerrit 53439 ovirt-engine-3.6.3 MERGED core: Live Merge's check for merge completion is wrong 2016-02-14 09:37 EST

  None (edit)
Description Greg Padgett 2016-02-10 11:55:14 EST
Description of problem:
After a Live Merge failure, it is supposed to update records in the database such that a retry of the command will identify the progress of the merge and recover from the failure.

It's not always possible to do this in code, but there are at least 2 problems that have cropped up lately which weren't present in the original design, inhibiting automatic recovery that should be possible to accomplish in the code.

1. The addition of DestroyImageCheckCommand (a fix for bug 1227497) was not accounted for in checking if the merge was completed successfully, so retrying a failed live merge wherein the merge had succeeded but image removal failed may opt to retry the merge.  This causes a failure that requires manual intervention to fix.

2. It looks like something changed the way the database records are marked in the failure case in general, and the images are no longer being marked illegal.  Consequently, the Live Merge code won't think it has ever attempted a merge on the image.  Even worse, other operations such as snapshot preview may attempt to use a partially-merged snapshot.  If it appears to work and the user commits the preview, the vm may end up with corrupt data.

Version-Release number of selected component (if applicable):
3.6.2 (for problem #1 above; still investigating problem #2)

How reproducible:
Always

Steps to Reproduce:
This may happen on any environment and is trivial to reproduce with a code modification.  It's not too easy without (since we designed Live Merge to try and not fail...)

To verify, make sure Live Merge is functional.  I'll try to get some steps for move involved verification of the recovery flows and post them in a comment to this bz.

Actual results:
The database is not marked properly for some failure cases, causing recovery to require manual intervention.

Expected results:
Live merge should work and recover successfully.

Additional info:
Comment 1 Greg Padgett 2016-02-10 20:08:43 EST
(In reply to Greg Padgett from comment #0)
> 2. It looks like something changed the way the database records are marked
> in the failure case in general, and the images are no longer being marked
> illegal.  Consequently, the Live Merge code won't think it has ever
> attempted a merge on the image.  Even worse, other operations such as
> snapshot preview may attempt to use a partially-merged snapshot.  If it
> appears to work and the user commits the preview, the vm may end up with
> corrupt data.
> 
> Version-Release number of selected component (if applicable):
> 3.6.2 (for problem #1 above; still investigating problem #2)

The second problem does not apply to the 3.6 branch.
Comment 2 Red Hat Bugzilla Rules Engine 2016-02-10 20:33:07 EST
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.
Comment 3 Elad 2016-02-23 05:20:32 EST
Hi Greg, can you please provide us a way to reproduce the bug?
Comment 4 Ala Hino 2016-03-01 04:26:23 EST
Hi Elad,

Basically, to reproduce this issue we have to fake the result of getVolumeInfo after invoking DESTROY_IMAGE.
This can done by stopping vdsm/shutdown the host after you see in the log that DESTROY_IMAGE completed.
Comment 5 Elad 2016-03-02 02:59:45 EST
Thanks, 

Tested accordingly: 
Rebooted the hypervisor right after I saw the following in engine.log:

2016-03-02 07:36:53,715 INFO  [org.ovirt.engine.core.bll.DestroyImageCheckCommand] (pool-5-thread-5) [2e486e8c] Running command: DestroyImageCheckCommand internal: true.
2016-03-02 07:36:55,906 INFO  [org.ovirt.engine.core.bll.DestroyImageCheckCommand] (pool-5-thread-5) [2e486e8c] Requested images were successfully removed

Live merge operation was marked as failed after few minutes:

2016-03-02 07:41:47,747 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-68) [7162a4dd] All Live Merge child commands have complete
d, status 'FAILED'
2016-03-02 07:41:48,792 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (DefaultQuartzScheduler_Worker-9) [7162a4dd] Ending command 'org.ovirt.engine.core.bll.RemoveSnap
shotCommand' with failure.
2016-03-02 07:41:48,869 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-9) [7162a4dd] Correlation ID: 7162a4dd, Job I
D: ba5075ea-36b0-4e79-8218-549e02412b95, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot '1' for VM 'test1'.


Using:
rhevm-3.6.3.3-0.1.el6.noarch
vdsm-4.17.23-0.el7ev.noarch


Greg, please confirm that the test was done properly and the results are the expected so I'll verify, thanks.
Comment 6 Ala Hino 2016-03-02 07:51:25 EST
Elad,

The results seem to be as expected.
Can you run the manual recovery script (found in bug 1308501) and see whether it fixes the database?
Comment 7 Elad 2016-03-02 08:32:24 EST
Thanks Ala, 

Executed the script and the snapshot got removed successfully.
Based on this and on comment #5, moving to VERIFIED.

Note You need to log in before you can comment on or make changes to this bug.