Red Hat Bugzilla – Bug 1306367
Live Merge - recovery flow not working properly when DESTROY_IMAGE_CHECK command fails
Last modified: 2016-05-09 19:12:02 EDT
Description of problem:
After a Live Merge failure, the engine is supposed to update records in the database so that a retry of the command can identify how far the merge progressed and recover from the failure.
This isn't always possible to do in code, but at least two problems have cropped up lately that weren't present in the original design, inhibiting automatic recovery that should be achievable in code.
1. The addition of DestroyImageCheckCommand (a fix for bug 1227497) was not accounted for when checking whether the merge had completed successfully. Retrying a failed live merge in which the merge itself succeeded but image removal failed may therefore re-attempt the merge, causing a failure that requires manual intervention to fix.
2. It looks like something changed how the database records are marked in the failure case in general, and the images are no longer being marked illegal. Consequently, the Live Merge code won't think it has ever attempted a merge on the image. Worse, other operations such as snapshot preview may attempt to use a partially-merged snapshot; if it appears to work and the user commits the preview, the VM may end up with corrupt data.
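To make the two failure modes concrete, here is a minimal Python sketch of the retry decision described above. All names here are hypothetical illustrations, not the actual ovirt-engine code; it only models the states the description mentions (merge done, image removal done, image marked illegal in the database).

```python
# Hypothetical sketch of the Live Merge retry decision described in this bug.
# None of these names come from the real ovirt-engine code base.

def decide_retry_action(merge_succeeded, image_removed, image_marked_illegal):
    """Decide where a Live Merge retry should resume.

    merge_succeeded:      the block merge itself finished on the host
    image_removed:        DESTROY_IMAGE / DestroyImageCheck succeeded
    image_marked_illegal: the database records the volume as illegal
    """
    if merge_succeeded and not image_removed:
        # Problem 1: the original check missed this case and re-issued the
        # merge, which fails and then needs manual intervention.
        return "resume-at-image-removal"
    if not merge_succeeded and not image_marked_illegal:
        # Problem 2: without the illegal marking, the engine cannot tell a
        # merge was ever attempted, so other flows (e.g. snapshot preview)
        # may use the partially merged volume.
        return "unrecoverable-without-db-fix"
    if not merge_succeeded:
        return "retry-merge"
    return "already-complete"
```

For example, `decide_retry_action(True, False, True)` returns `"resume-at-image-removal"`: the fix for problem 1 is essentially teaching the retry path to recognize that state instead of re-merging.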
Version-Release number of selected component (if applicable):
3.6.2 (for problem #1 above; still investigating problem #2)
Steps to Reproduce:
This may happen on any environment and is trivial to reproduce with a code modification. It's not easy to hit otherwise (since we designed Live Merge to try not to fail...).
To verify, make sure Live Merge is functional. I'll try to get some steps for more involved verification of the recovery flows and post them in a comment to this bz.
Actual results:
The database is not marked properly for some failure cases, causing recovery to require manual intervention.
Expected results:
Live merge should work and recover successfully.
(In reply to Greg Padgett from comment #0)
> 2. It looks like something changed the way the database records are marked
> in the failure case in general, and the images are no longer being marked
> illegal. Consequently, the Live Merge code won't think it has ever
> attempted a merge on the image. Even worse, other operations such as
> snapshot preview may attempt to use a partially-merged snapshot. If it
> appears to work and the user commits the preview, the vm may end up with
> corrupt data.
> Version-Release number of selected component (if applicable):
> 3.6.2 (for problem #1 above; still investigating problem #2)
The second problem does not apply to the 3.6 branch.
Hi Greg, can you please provide us a way to reproduce the bug?
Basically, to reproduce this issue we have to fake the result of getVolumeInfo after invoking DESTROY_IMAGE.
This can be done by stopping vdsm or shutting down the host right after the log shows that DESTROY_IMAGE completed.
Rebooted the hypervisor right after I saw the following in engine.log:
2016-03-02 07:36:53,715 INFO [org.ovirt.engine.core.bll.DestroyImageCheckCommand] (pool-5-thread-5) [2e486e8c] Running command: DestroyImageCheckCommand internal: true.
2016-03-02 07:36:55,906 INFO [org.ovirt.engine.core.bll.DestroyImageCheckCommand] (pool-5-thread-5) [2e486e8c] Requested images were successfully removed
The live merge operation was marked as failed after a few minutes:
2016-03-02 07:41:47,747 INFO [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-68) [7162a4dd] All Live Merge child commands have completed, status 'FAILED'
2016-03-02 07:41:48,792 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotCommand] (DefaultQuartzScheduler_Worker-9) [7162a4dd] Ending command 'org.ovirt.engine.core.bll.RemoveSnapshotCommand' with failure.
2016-03-02 07:41:48,869 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-9) [7162a4dd] Correlation ID: 7162a4dd, Job ID: ba5075ea-36b0-4e79-8218-549e02412b95, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot '1' for VM 'test1'.
Greg, please confirm that the test was done properly and the results are as expected, so I can verify. Thanks.
The results seem to be as expected.
Can you run the manual recovery script (found in bug 1308501) and see whether it fixes the database?
Executed the script and the snapshot got removed successfully.
Based on this and on comment #5, moving to VERIFIED.