Bug 1306741

Summary: Live Merge - failure in step 'DESTROY_IMAGE_CHECK' needs manual database fixup
Product: [oVirt] ovirt-engine Reporter: Greg Padgett <gpadgett>
Component: BLL.StorageAssignee: Greg Padgett <gpadgett>
Status: CLOSED WONTFIX QA Contact: Aharon Canan <acanan>
Severity: high Docs Contact:
Priority: medium    
Version: 3.6.0CC: acanan, ahino, alitke, amureini, bugs, glazarov, gpadgett, nsoffer, ratamir, rhodain, tdosek, tnisan, v.astafiev, ylavi
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1302215
: 1308501 (view as bug list) Environment:
Last Closed: 2016-02-17 07:41:38 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1302215    
Bug Blocks: 1306732, 1308501    
Attachments:
Description Flags
Live Merge post-merge failure repair script.
none
VM disk info gathering tool
none
Live Merge post-merge failure repair script (updated) none

Description Greg Padgett 2016-02-11 17:08:14 UTC
Created attachment 1123222 [details]
Live Merge post-merge failure repair script.

In 3.6 we introduced a new step in Live Merge to verify image removal.  There was a bug in the flow (bug 1302215), and if Live Merge fails at this point then manual fixup of the database is required to complete what the Live Merge code would have otherwise done by itself.

Because the series of database queries is somewhat complex, we've provided a .sql script that can be run to perform these changes which you can find attached to this bug.

First, please verify that the image was removed on the vdsm side by investigating the logs or otherwise verifying that the image in question has been removed from the storage.  The image id to check can be retrieved from the engine logs, described below.

Here is how to use the script, copied from a comment on the above bz with minor modifications:

=== Begin instructions ===

Because the cleanup for this is difficult, I wrote a sql script that can be used which performs the database updates done at the end of a successful Live Merge.

It's appropriate to use if Live merge fails with a message like the following in your engine log:

2016-02-09 15:09:43,220 ERROR [org.ovirt.engine.core.bll.DestroyImageCheckCommand] (pool-7-thread-4) [6ab159d3] The following images were not removed: [11aedd61-b3a7-46ae-b7d4-982e9e58b863]
[...]
2016-02-09 15:10:43,214 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-62) [91baf933] Failed child command status for step 'DESTROY_IMAGE_CHECK' 

In this instance, you would run the below commands with the following:
  VM_NAME = the name of the VM with the snapshot you were removing
  SNAPSHOT_DESC = the description of the snapshot you were removing
  IMAGE_ID = the image from the above log message, e.g. 11aedd61-b3a7-46ae-b7d4-982e9e58b863
  (Note: be careful to choose the right image id!  If it's wrong, this
   script will remove references to other, possibly valid images.  It
   performs some verification, but can't guard against all possibilities.)

To use it:

  1. Stop engine
  2. Back up your database
  3. psql <dbname> -U <dbuser> < /path/to/repair_failed_merge.sql
  4. psql <dbname> -U <dbuser> -c "SELECT ovirt_repair_failed_merge('VM_NAME', 'SNAPSHOT_DESC', 'IMAGE_ID')"
  5. psql <dbname> -U <dbuser> -c "DROP FUNCTION ovirt_repair_failed_merge(varchar, varchar, varchar)"

If all went well, step 4 will print "Success".  If it fails or one if its verifications finds an error, it will report it to the console.

=== End instructions ===


See below for more detail on the original bug.

+++ This bug was initially created as a clone of Bug #1302215 +++

Description of problem:
Live merge operation fails noting Failed child command status for step 'DESTROY_IMAGE_CHECK'

Version-Release number of selected component (if applicable):
3.6.2.6-0.1

How reproducible:
100%

Steps to Reproduce:
1. Create a VM with 3 snapshots (cold snapshot), each with unique changes within files on OS
2. Power on the VM
3. Attempt to remove the middle snapshot (2/3)

Actual results:
The live merge operation fails

Expected results:


Additional info:
Pertinent failure from the engine log:
[trimmed]

2016-01-27 09:13:20,193 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-62) [67edccaf] Executing Live Merge command step 'DESTROY_IMAGE_CHECK'
2016-01-27 09:13:20,231 ERROR [org.ovirt.engine.core.bll.DestroyImageCheckCommand] (pool-6-thread-6) [4bd5b450] The following images were not removed: [894cf020-474d-44aa-94f1-e34c809bfe4e]
2016-01-27 09:13:30,282 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-46) [4119c6bc] Failed child command status for step 'DESTROY_IMAGE_CHECK'
2016-01-27 09:13:31,316 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-48) [4119c6bc] Merging of snapshot 'ce039203-6544-41d1-9449-eac1a009ccab' images '256e51cd-87e6-40fe-8346-1035625605fb'..'894cf020-474d-44aa-94f1-e34c809bfe4e' failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.
2016-01-27 09:13:31,383 INFO  [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-48) [snapshots_delete_2a3b8ef0-0690-4b69] All Live Merge child commands have completed, status 'FAILED'
2016-01-27 09:13:32,393 ERROR (DefaultQuartzScheduler_Worker-25) [snapshots_delete_2a3b8ef0-0690-4b69] Ending command 'org.ovirt.engine.core.bll.RemoveSnapshotCommand' with failure.
2016-01-27 09:13:32,438 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-25) [snapshots_delete_2a3b8ef0-0690-4b69] Correlation ID: snapshots_delete_2a3b8ef0-0690-4b69, Job ID: 21b4b54c-50ac-4909-8ade-d9fcd98b1aea, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'snapshot_12215_iscsi_1' for VM 'live_merge_vm_iscsi'.

[trimmed comments]

Comment 2 Adam Litke 2016-02-11 21:40:28 UTC
Created attachment 1123299 [details]
VM disk info gathering tool

When using the Live Merge post-merge failure repair script (also attached to this bug) it's important to make sure that the live merge succeeded and the volume is no longer part of the VM disk's volume chain.  This utility will help you to check this.  The tool should be run on the host where the VM is running.  It will query vdsm for information about all running VMs and print the volume chains of all disks found.

The program takes no arguments but must be run as root (in order to access SSL keys needed to talk to vdsm).

To verify the merge, ensure that the volume ID referenced in the engine logs does not appear in the VMs volume chain.

Comment 3 Red Hat Bugzilla Rules Engine 2016-02-11 22:55:14 UTC
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

Comment 8 Allon Mureinik 2016-02-17 07:41:38 UTC
The script is attached to this bug, for oVirt there's no additional action item.

Comment 9 Greg Padgett 2016-05-03 13:47:21 UTC
Created attachment 1153440 [details]
Live Merge post-merge failure repair script (updated)

An updated, safer script is attached.  It requires a 4th parameter; for use, please see instructions at https://bugzilla.redhat.com/show_bug.cgi?id=1302215#c15

Also note that it should be removed after use; leaving it may cause trouble later on (especially if installed via a different dbuser than the rest of the engine database):

psql <dbname> -U <dbuser> -c 'DROP FUNCTION ovirt_repair_failed_merge(varchar, varchar, varchar, varchar)'