Created attachment 1123222 [details]
Live Merge post-merge failure repair script.
In 3.6 we introduced a new step in Live Merge to verify image removal. There was a bug in the flow (bug 1302215), and if Live Merge fails at this point then manual fixup of the database is required to complete what the Live Merge code would have otherwise done by itself.
Because the series of database queries is somewhat complex, we've provided a .sql script that can be run to perform these changes which you can find attached to this bug.
First, please verify that the image was removed on the vdsm side by investigating the logs or otherwise verifying that the image in question has been removed from the storage. The image id to check can be retrieved from the engine logs, described below.
Here is how to use the script, copied from a comment on the above bz with minor modifications:
=== Begin instructions ===
Because the cleanup for this is difficult, I wrote a sql script that can be used which performs the database updates done at the end of a successful Live Merge.
It's appropriate to use if Live merge fails with a message like the following in your engine log:
2016-02-09 15:09:43,220 ERROR [org.ovirt.engine.core.bll.DestroyImageCheckCommand] (pool-7-thread-4) [6ab159d3] The following images were not removed: [11aedd61-b3a7-46ae-b7d4-982e9e58b863]
2016-02-09 15:10:43,214 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-62) [91baf933] Failed child command status for step 'DESTROY_IMAGE_CHECK'
In this instance, you would run the below commands with the following:
VM_NAME = the name of the VM with the snapshot you were removing
SNAPSHOT_DESC = the description of the snapshot you were removing
IMAGE_ID = the image from the above log message, e.g. 11aedd61-b3a7-46ae-b7d4-982e9e58b863
(Note: be careful to choose the right image id! If it's wrong, this
script will remove references to other, possibly valid images. It
performs some verification, but can't guard against all possibilities.)
To use it:
1. Stop engine
2. Back up your database
3. psql <dbname> -U <dbuser> < /path/to/repair_failed_merge.sql
4. psql <dbname> -U <dbuser> -c "SELECT ovirt_repair_failed_merge('VM_NAME', 'SNAPSHOT_DESC', 'IMAGE_ID')"
5. psql <dbname> -U <dbuser> -c "DROP FUNCTION ovirt_repair_failed_merge(varchar, varchar, varchar)"
If all went well, step 4 will print "Success". If it fails or one if its verifications finds an error, it will report it to the console.
=== End instructions ===
See below for more detail on the original bug.
+++ This bug was initially created as a clone of Bug #1302215 +++
Description of problem:
Live merge operation fails noting Failed child command status for step 'DESTROY_IMAGE_CHECK'
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Create a VM with 3 snapshots (cold snapshot), each with unique changes within files on OS
2. Power on the VM
3. Attempt to remove the middle snapshot (2/3)
The live merge operation fails
Pertinent failure from the engine log:
2016-01-27 09:13:20,193 INFO [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-62) [67edccaf] Executing Live Merge command step 'DESTROY_IMAGE_CHECK'
2016-01-27 09:13:20,231 ERROR [org.ovirt.engine.core.bll.DestroyImageCheckCommand] (pool-6-thread-6) [4bd5b450] The following images were not removed: [894cf020-474d-44aa-94f1-e34c809bfe4e]
2016-01-27 09:13:30,282 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-46) [4119c6bc] Failed child command status for step 'DESTROY_IMAGE_CHECK'
2016-01-27 09:13:31,316 ERROR [org.ovirt.engine.core.bll.RemoveSnapshotSingleDiskLiveCommand] (DefaultQuartzScheduler_Worker-48) [4119c6bc] Merging of snapshot 'ce039203-6544-41d1-9449-eac1a009ccab' images '256e51cd-87e6-40fe-8346-1035625605fb'..'894cf020-474d-44aa-94f1-e34c809bfe4e' failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.
2016-01-27 09:13:31,383 INFO [org.ovirt.engine.core.bll.RemoveSnapshotCommandCallback] (DefaultQuartzScheduler_Worker-48) [snapshots_delete_2a3b8ef0-0690-4b69] All Live Merge child commands have completed, status 'FAILED'
2016-01-27 09:13:32,393 ERROR (DefaultQuartzScheduler_Worker-25) [snapshots_delete_2a3b8ef0-0690-4b69] Ending command 'org.ovirt.engine.core.bll.RemoveSnapshotCommand' with failure.
2016-01-27 09:13:32,438 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-25) [snapshots_delete_2a3b8ef0-0690-4b69] Correlation ID: snapshots_delete_2a3b8ef0-0690-4b69, Job ID: 21b4b54c-50ac-4909-8ade-d9fcd98b1aea, Call Stack: null, Custom Event ID: -1, Message: Failed to delete snapshot 'snapshot_12215_iscsi_1' for VM 'live_merge_vm_iscsi'.
Created attachment 1123299 [details]
VM disk info gathering tool
When using the Live Merge post-merge failure repair script (also attached to this bug) it's important to make sure that the live merge succeeded and the volume is no longer part of the VM disk's volume chain. This utility will help you to check this. The tool should be run on the host where the VM is running. It will query vdsm for information about all running VMs and print the volume chains of all disks found.
The program takes no arguments but must be run as root (in order to access SSL keys needed to talk to vdsm).
To verify the merge, ensure that the volume ID referenced in the engine logs does not appear in the VMs volume chain.
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.
The script is attached to this bug, for oVirt there's no additional action item.
Created attachment 1153440 [details]
Live Merge post-merge failure repair script (updated)
An updated, safer script is attached. It requires a 4th parameter; for use, please see instructions at https://bugzilla.redhat.com/show_bug.cgi?id=1302215#c15
Also note that it should be removed after use; leaving it may cause trouble later on (especially if installed via a different dbuser than the rest of the engine database):
psql <dbname> -U <dbuser> -c 'DROP FUNCTION ovirt_repair_failed_merge(varchar, varchar, varchar, varchar)'