Bug 1227497 - Live Merge - in certain failure cases engine may not correctly mark merged snapshots in the database
Summary: Live Merge - in certain failure cases engine may not correctly mark merged sn...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: 3.5.1
Hardware: x86_64
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ovirt-3.6.2
Target Release: 3.6.2.5
Assignee: Greg Padgett
QA Contact: Elad
Kevin Alon Goldblatt
URL:
Whiteboard:
Depends On:
Blocks: 1306732
 
Reported: 2015-06-02 19:59 UTC by Greg Padgett
Modified: 2016-02-28 14:40 UTC
CC List: 16 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 1213157
Environment:
Last Closed: 2016-02-18 11:03:46 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-3.6.z+
ylavi: planning_ack+
amureini: devel_ack+
rule-engine: testing_ack+


Attachments
VDSM patch for reproduction (620 bytes, patch)
2015-09-17 03:21 UTC, Greg Padgett
Success/failure engine log excerpts (9.10 KB, application/octet-stream)
2015-09-17 03:24 UTC, Greg Padgett


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 46263 0 master ABANDONED core: verify image removal during Live Merge 2016-01-12 15:35:33 UTC
oVirt gerrit 51680 0 master MERGED core: verify image removal during Live Merge 2016-01-12 08:26:32 UTC
oVirt gerrit 51681 0 ovirt-engine-3.6 ABANDONED core: verify image removal during Live Merge 2016-01-12 10:37:45 UTC
oVirt gerrit 51682 0 ovirt-engine-3.6.2 ABANDONED core: verify image removal during Live Merge 2016-01-12 10:37:36 UTC
oVirt gerrit 51683 0 ovirt-engine-3.6 MERGED core: verify image removal during Live Merge 2016-01-12 08:30:22 UTC
oVirt gerrit 51684 0 ovirt-engine-3.6.2 MERGED core: verify image removal during Live Merge 2016-01-12 08:32:41 UTC

Description Greg Padgett 2015-06-02 19:59:00 UTC
Cloned to track the issue of the engine forgetting it had merged snapshots in certain failure cases (see the last 2 comments below).

+++ This bug was initially created as a clone of Bug #1213157 +++

Description of problem:
VM disks become illegal after failed Live Merge

Version-Release number of selected component (if applicable):
ver 3.5.1 vt14.3
rhevm-3.5.1-0.4.el6ev.noarch
vdsm-4.16.13.1-1.el7ev.x86_64

How reproducible:


Steps to Reproduce:
1. Create a VM with several disks, including block preallocated, block thin, NFS preallocated and NFS thin
2. Start the VM
3. Create 3 snapshots: snsa1, snsa2, snsa3
4. Delete snapshot snsa2 and, while the snapshot is locked, restart vdsm >>>>> The snapshot delete operation failed BUT 2 of the block disks (vm11_Disk3 - preallocated and vm11_Disk3 - Thin) are now illegal. All further attempts to delete the snapshot will fail

Actual results:
2 of the block disks (vm11_Disk3 - preallocated and vm11_Disk3 - Thin) are now illegal after the failed Live Snapshot Delete (merge) operation

Expected results:
All disks should be in a legal state after a failed Live Snapshot Delete (merge) operation

Additional info:

THERE IS A 3 HOUR TIME DIFFERENCE BETWEEN THE ENGINE AND THE HOST

ENGINE log
-----------------------
<snipped>

VDSM log
-----------------------------
<snipped>

--- Additional comment from Kevin Alon Goldblatt on 2015-04-19 11:12:02 EDT ---

Libvirt rpms on host
--------------------------------
libvirt-client-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-qemu-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-interface-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-nwfilter-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-config-nwfilter-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-secret-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-nodedev-1.2.8-16.el7_1.2.x86_64
libvirt-python-1.2.8-7.el7_1.1.x86_64
libvirt-daemon-kvm-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-network-1.2.8-16.el7_1.2.x86_64
libvirt-lock-sanlock-1.2.8-16.el7_1.2.x86_64
libvirt-daemon-driver-storage-1.2.8-16.el7_1.2.x86_64

--- Additional comment from Kevin Alon Goldblatt on 2015-04-19 11:19:10 EDT ---

Added logs

NOTE THERE IS A 3 HOUR DIFFERENCE BETWEEN THE ENGINE AND HOST e.g.

Current Engine time: Sun Apr 19 18:18:05 IDT 2015
Current Host time:   Sun Apr 19 21:18:35 IDT 2015

--- Additional comment from Allon Mureinik on 2015-04-19 12:40:32 EDT ---

The error seems unrelated to the restart (missing PVs?), so I'm not quite sure what state I'd want the disks to be in...

Adam/Greg, please take a look.

Kevin - assuming the domain is still functioning (which isn't necessarily evident from the error), if you manually update the DB to mark those disks as OK [1], can you run the VM?


[1] in psql, use:

UPDATE images
SET    imagestatus = 0
WHERE  image_group_id = <the disk's ID>;

COMMIT;
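
For convenience, the same update as a small, self-contained Python script. This is only a sketch: the connection parameters are assumptions and should be taken from the engine's database configuration (typically /etc/ovirt-engine/engine.conf.d/10-setup-database.conf), and the disk ID must be filled in by hand.

# Sketch only: mark every image row of one disk as OK (imagestatus = 0).
import psycopg2

DISK_ID = "00000000-0000-0000-0000-000000000000"  # the disk's image_group_id

# Assumed defaults; use the real values from the engine's DB configuration.
conn = psycopg2.connect(dbname="engine", user="engine",
                        password="secret", host="localhost")
try:
    with conn:  # commits on success, rolls back on error
        with conn.cursor() as cur:
            cur.execute("UPDATE images SET imagestatus = 0 "
                        "WHERE image_group_id = %s", (DISK_ID,))
            print("rows updated:", cur.rowcount)
finally:
    conn.close()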

--- Additional comment from Greg Padgett on 2015-04-19 13:45:13 EDT ---

Just to verify, the illegal state referred to here is illegal on the host?  (I don't see any references to illegal disks in the supplied engine log.)

--- Additional comment from Allon Mureinik on 2015-04-20 07:01:00 EDT ---

(In reply to Greg Padgett from comment #4)
> Just to verify, the illegal state referred to here is illegal on the host? 
> (I don't see any references to illegal disks in the supplied engine log.)
Then again, the vdsm log has no errors at all.
Kevin, are you sure these are the right logs?

--- Additional comment from Kevin Alon Goldblatt on 2015-04-20 10:08:48 EDT ---

(In reply to Greg Padgett from comment #4)
> Just to verify, the illegal state referred to here is illegal on the host? 
> (I don't see any references to illegal disks in the supplied engine log.)

1. NOTE THERE IS A 3 HOUR DIFFERENCE BETWEEN THE ENGINE AND HOST e.g.

Current Engine time: Sun Apr 19 18:18:05 IDT 2015
Current Host time:   Sun Apr 19 21:18:35 IDT 2015


2. ILLEGAL status appears in the host vdsm log e.g.

cbfe5fbe-9f6a-441f-bb5b-4bac3de41fdc::INFO::2015-04-19 19:01:20,332::volume::617::Storage.Volume::(setLegality) sdUUID=17eac76f-4aaa-4d99-8f1a-b7f6e68e4c70 imgUUID=4a31b260-8136-411d-9743-b3e937859097 volUUID = 54c94300-650a-4da9-9656-87969a6c60a4 legality = ILLEGAL 
0bb73a86-25be-42d5-9695-3b6bf1411d5b::INFO::2015-04-19 19:01:24,958::volume::617::Storage.Volume::(setLegality) sdUUID=3eae5544-0c43-483b-bf06-118660fcbfb1 imgUUID=32d03b22-d9a8-40e8-a7f6-fae12cdbb840 volUUID = 2655ffd3-5c64-4cd4-8aac-16599acae719 legality = ILLEGAL 
e4ece9d5-2b01-4fda-b1af-2a6e0a766bed::INFO::2015-04-19 


3. In the GUI, the ILLEGAL status is displayed under Storage > <Storage Domain> > Disk Snapshots > Status.
Added a screen shot.

4. 2 of the disk snapshots have been displayed as Illegal since the failed Live Delete Snapshot operation, and subsequently I can no longer delete this snapshot

--- Additional comment from Kevin Alon Goldblatt on 2015-04-20 10:10:54 EDT ---

(In reply to Allon Mureinik from comment #5)
> (In reply to Greg Padgett from comment #4)
> > Just to verify, the illegal state referred to here is illegal on the host? 
> > (I don't see any references to illegal disks in the supplied engine log.)
> Then again, the vdsm log has no errors at all.
> Kevin, are you sure these are the right logs?

Apologies, the vdsm log rotated before I uploaded it. Adding logs again

--- Additional comment from Kevin Alon Goldblatt on 2015-04-20 10:12:59 EDT ---

Added logs again as the vdsm log was old due to log rotation

--- Additional comment from Kevin Alon Goldblatt on 2015-04-20 10:48:25 EDT ---

Edited the summary of the bz to reflect the problem

--- Additional comment from Adam Litke on 2015-04-20 12:06:00 EDT ---

The included vdsm log does not show the actual merge operation so it's hard to tell if there were any specific errors with that flow.  What I see here are some errors from LVM indicating missing PVs.  It seems like there could have been some external corruption of the storage domain (perhaps someone removed some LUNs on the server side?).

--- Additional comment from Allon Mureinik on 2015-04-20 13:52:40 EDT ---

(In reply to Kevin Alon Goldblatt from comment #6)
> 4. 2 of the disk snapshots are displayed as Illegal since the failed Live
> Delete Snapshot operation. Subsequently I can no longer delete this snapshot

Just to clarify - it's just the snapshot that's marked illegal?
The Disk object is OK?
Can you stop the VM and re-run it?

--- Additional comment from Kevin Alon Goldblatt on 2015-04-21 03:32:33 EDT ---

(In reply to Allon Mureinik from comment #11)
> (In reply to Kevin Alon Goldblatt from comment #6)
> > 4. 2 of the disk snapshots are displayed as Illegal since the failed Live
> > Delete Snapshot operation. Subsequently I can no longer delete this snapshot
> 
> Just to clarify - it's just the snapshot that's marked illegal?
> The Disk object is OK?
> Can you stop the VM and re-run it?

Yes, I can stop and re-run the VM, but any attempt to delete the same snapshot again fails.

--- Additional comment from Kevin Alon Goldblatt on 2015-04-21 03:35:37 EDT ---

(In reply to Adam Litke from comment #10)
> The included vdsm log does not show the actual merge operation so it's hard
> to tell if there were any specific errors with that flow.  What I see here
> are some errors from LVM indicating missing PVs.  It seems like there could
> have been some external corruption of the storage domain (perhaps someone
> removed some LUNs on the server side?

No LUNs are missing on the storage.

--- Additional comment from Kevin Alon Goldblatt on 2015-04-21 03:46:22 EDT ---

Added an additional vdsm log that includes the earlier time, from before the merge:
vdsm.log.22.xz

--- Additional comment from Kevin Alon Goldblatt on 2015-04-21 04:57:57 EDT ---

Upon further investigation I found that the illegal disk is displayed in the Storage > Disks > Disk snapshots but not in the VM > Snapshots > Disks. This means that the database has 2 tables that are not synchronized

--- Additional comment from Allon Mureinik on 2015-04-21 05:00:24 EDT ---

(In reply to Kevin Alon Goldblatt from comment #15)
> Upon further investigation I found that the illegal disk is displayed in the
> Storage > Disks > Disk snapshots but not in the VM > Snapshots > Disks. This
> means that the database has 2 tables that are not synchronized
No, it doesn't.
It means the disk as a whole is OK (i.e., can be used for VM operations like running it), but there's a problem somewhere down the chain which prevents things like merging and requires some intervention.

--- Additional comment from Greg Padgett on 2015-04-21 12:55:42 EDT ---

(In reply to Kevin Alon Goldblatt from comment #15)
> Upon further investigation I found that the illegal disk is displayed in the
> Storage > Disks > Disk snapshots but not in the VM > Snapshots > Disks.
<snip>

Okay, I think I see what's going on.  The reboot happened during the destroy_image phase of the merge attempts and thus the merges were successful but the snapshot volumes are still on the storage.  We see they're illegal, as they should be; however, it's designed so that you can retry the merge on the illegal volumes in order to clean them up.  This last bit isn't working for some reason or another.  I'm looking into it.

--- Additional comment from Greg Padgett on 2015-04-24 17:05:34 EDT ---

After further testing, it appears the live merge could not be retried because the command would not converge after failure to delete the image. [1]  With the linked patch, the failure causes overall command failure, leaving the snapshot and image unlocked and in the proper state to allow a retry.

There may still be a failure due to the timing of the restart, but note that live merge isn't designed to necessarily succeed if restarted at any point--but rather to recover gracefully if re-executed after a failure such as this.

Please test with this patch once it's in the build and let me know if you still see any problems.

[1] In this case, DestroyImageCommand failed, which is the third part (of 3) of the live merge flow on the engine.  This child command is an old-style command (runs an SPM task) whereas its parent which controls the flow uses newer infrastructure (CommandCoordinator).  The two don't always play well together, and we had already added a special case to get the endAction flow working between the two.  This was extended further to avoid an endless loop of status-checking if the child command fails.
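
A rough Python sketch of the status-checking behavior described in [1]. The real code is Java inside the engine; the function, states, and return values below are illustrative only, not actual engine identifiers.

# Conceptual sketch: the parent live-merge command (CommandCoordinator-based)
# periodically polls its old-style SPM-task child (DestroyImageCommand).
def on_poll(child_status):
    if child_status == "ACTIVE":
        return "KEEP_POLLING"    # SPM task still running
    if child_status == "SUCCEEDED":
        return "PROCEED"         # continue the live merge flow
    # Before the fix, a failed old-style child was never treated as terminal,
    # so the parent kept checking its status forever and the snapshot/image
    # stayed locked.  With the fix, failure ends the parent command, releasing
    # the locks and leaving everything in a state that allows a clean retry.
    return "END_WITH_FAILURE"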

--- Additional comment from Greg Padgett on 2015-04-24 23:56:14 EDT ---

Found some issues later in the retry flow, moving back to assigned.

--- Additional comment from Greg Padgett on 2015-04-30 16:10:06 EDT ---

A partial fix is posted to gerrit.  Looks like there's still a case where the engine can "forget" that it performed a merge, which will cause an error when it retries during command recovery.

--- Additional comment from Allon Mureinik on 2015-05-03 05:37:58 EDT ---

(In reply to Greg Padgett from comment #20)
> A partial fix is posted to gerrit.  Looks like there's still a case where
> the engine can "forget" that it perfomed a merge, which will cause an error
> when it retries during command recovery.
So let's clone the bug for better tracking.
We'll have this bug for 3.5.4 to track the fix we DO have, and use the other one to track the remaining problem.

Comment 1 Greg Padgett 2015-09-15 03:50:44 UTC
The cloned comments in the description don't provide a lot of detail on this particular defect.

To elaborate, occasionally Live Merge executions would result in a failure that could not be recovered by retries (as is the convention for Live Merge).  The problem turns out to be improper tracking of the image deletion after the merge is complete.  If the vdsm call to deleteVolume is made and completes but the engine can't immediately retrieve the task status, the engine will never discover the actual state of the image.  Instead, upon retry the engine will again call deleteVolume, which subsequently fails and causes the Live Merge to fail.

This will be fixed once SPM removal is complete, but in the meantime we'll add another command to the RemoveSnapshotSingleDiskLiveCommand chain that checks whether the image has already been deleted, so that in the above scenario the command can complete successfully if deletion succeeded without the engine's knowledge.
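
A minimal sketch of that verification step; `image_exists` and `delete_volume` are hypothetical stand-ins for the engine's host queries and commands, not real identifiers.

# Sketch of the extra check added before retrying the image removal.
def destroy_image_step(image_exists, delete_volume, image_id):
    if not image_exists(image_id):
        # deleteVolume already succeeded before the earlier failure was
        # reported, so there is nothing left to remove -- treat this as
        # success and let the overall live merge command complete.
        return "SUCCEEDED"
    return "SUCCEEDED" if delete_volume(image_id) else "FAILED"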

Comment 2 Greg Padgett 2015-09-17 03:20:37 UTC
Verification:

This one is a bit tricky because the problem only occurs if vdsm dies at a very specific point in time.  Vdsm must be killed right after the image deletion occurs but before the task status is saved.  The best way that I've found to accomplish this is the following:

1. Stop vdsm on the host
2. Edit /usr/share/vdsm/storage/task.py and apply the code in the attached patch file (listed below [1] for convenience--applying the patch is preferred due to python's whitespace treatment)
3. Edit /usr/lib/systemd/system/vdsmd.service, change "Restart=always" to "Restart=no" to prevent vdsm from restarting with the patched code during the test

There may very well be some other way to kill vdsm at just the right moment, but this is the most reliable way I could think of.

Now to reproduce:
1. Ensure the host to be used for snapshot removal is running a vdsm with the patched code from above.
2. Create a snapshot for a vm
3. Ensure the vm is running
4. Remove the snapshot
5. Wait for vdsm to exit; the end of the log should show "HACK: Found deleteVolume in finished state, exiting".  The engine will start raising exceptions due to connection errors; this is normal.
6. Remove the vdsm code patch (best to revert it with `patch -R` to avoid possible syntax errors due to whitespace issues)
7. Start vdsm
8. Remove the same snapshot again

If the engine fix works, this second attempt in step 8 will result in successful snapshot removal.  Prior to the fix, the second attempt would fail leaving the snapshot behind.

I've attached log samples showing excerpts of the engine log for a failure and successful removal.  The errors and start/end of the logs are the most obvious signs; if needed, these can be used as a guide to help correlate with the behavior you're seeing during testing.


[1] patch contents:
--- vdsm/storage/task.py	2015-09-16 23:00:26.936111622 -0400
+++ vdsm/storage/task.py.new	2015-09-16 23:00:16.492108251 -0400
@@ -751,4 +751,9 @@
         taskDir = os.path.join(storPath, self.id + TEMP_EXT)
         self.log.debug("_save: orig %s temp %s", origTaskDir, taskDir)
+        # TODO hack
+        if self.state == State.finished and self.name == 'deleteVolume':
+            self.log.error("HACK: Found deleteVolume in finished state, exiting")
+            os._exit(1)
+            # TODO end hack
         if getProcPool().os.path.exists(taskDir):
             getProcPool().fileUtils.cleanupdir(taskDir)

Comment 3 Greg Padgett 2015-09-17 03:21:39 UTC
Created attachment 1074236 [details]
VDSM patch for reproduction

Comment 4 Greg Padgett 2015-09-17 03:24:06 UTC
Created attachment 1074237 [details]
Success/failure engine log excerpts

Comment 5 Greg Padgett 2015-09-17 03:28:25 UTC
Yaniv, this bug should be a rare occurrence; however, when it happens the snapshot cannot be removed without some manual intervention (i.e., editing the db).  There's more detail about the bug in general in my preceding comments.  Given all that, should this be backported to z-stream?

Comment 6 Yaniv Lavi 2015-09-21 11:32:23 UTC
(In reply to Greg Padgett from comment #5)
> Yaniv, this bug should be a rare occurrence; however, when it happens the
> snapshot cannot be removed without some manual intervention (ie editing the
> db).  There's more detail about the bug in general in my preceding comments.
> Given all that, should this be backported to z-stream?

Can you please add steps to correct the db in case a user hits this?

Comment 7 Sandro Bonazzola 2015-10-26 12:31:41 UTC
This is an automated message. oVirt 3.6.0 RC3 has been released and GA is targeted for next week, Nov 4th 2015.
Please review this bug and, if it is not a blocker, please postpone it to a later release.
All bugs not postponed by the GA release will be automatically re-targeted to

- 3.6.1 if severity >= high
- 4.0 if severity < high

Comment 8 Tal Nisan 2015-10-29 10:41:32 UTC
Greg, we've currently targeted this to 3.6.1, but it depends on the risk; how risky is this change?

Comment 9 Greg Padgett 2015-11-10 03:53:00 UTC
(In reply to Tal Nisan from comment #8)
> Greg, we currently targeted it to 3.6.1 but it depends on the risk, how
> risky is this change?

I'd consider this relatively low risk.  The main live merge flow has been updated, but the change is to call a well-baked vds command to verify the deletion.  Testing for regressions should be straightforward.


(In reply to Yaniv Dary from comment #6)
> (In reply to Greg Padgett from comment #5)
> > Yaniv, this bug should be a rare occurrence; however, when it happens the
> > snapshot cannot be removed without some manual intervention (ie editing the
> > db).  There's more detail about the bug in general in my preceding comments.
> > Given all that, should this be backported to z-stream?
> 
> Can you please add step to correct the db in case a user hits this?

The steps here are rather intricate and involve updating rows in multiple tables as well as altering the vm configuration text stored in the snapshot record.  I'm not sure yet if there is a way to trigger an update to the vm configuration or if it must be done by hand.  Regardless, it may be safer to override the vdsm deleteVolume verb return value, re-run Live Merge, and let the Live Merge flow update the records itself.  Thoughts on this alternative?

Comment 10 Yaniv Lavi 2015-11-22 12:39:12 UTC
(In reply to Greg Padgett from comment #9)
[trimmed]
> Regardless, it may be safer to override the vdsm deleteVolume verb return
> value, re-run Live Merge, and let the Live Merge flow update the records
> itself.  Thoughts on this alternative?

I don't have knowledge of other steps. Tal?

Comment 11 Tal Nisan 2015-11-22 13:56:05 UTC
Seems better and less intrusive than a manual update; I think it can be backported.

Comment 12 Red Hat Bugzilla Rules Engine 2015-12-02 00:04:55 UTC
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

Comment 13 Red Hat Bugzilla Rules Engine 2015-12-02 00:04:55 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 14 boruvka.michal 2015-12-07 08:10:16 UTC
(In reply to Yaniv Dary from comment #10)
[trimmed]
> > Regardless, it may be safer to override the vdsm deleteVolume verb return
> > value, re-run Live Merge, and let the Live Merge flow update the records
> > itself.  Thoughts on this alternative?
> 
> I don't have knowledge of other steps. Tal?

Hello, could you please tell me how to override the vdsm deleteVolume verb return value?

Comment 15 Allon Mureinik 2015-12-07 08:14:50 UTC
(In reply to boruvka.michal from comment #14)
[trimmed]
> Hello, could you please tell me how to override the vdsm deleteVolume verb
> return value?
Hi Michal,

Those were instructions to our QA team to attempt to reproduce the bug in order to see if this fix works.
What exactly are you trying to achieve?

Comment 16 boruvka.michal 2015-12-07 09:14:39 UTC
(In reply to Allon Mureinik from comment #15)
[trimmed]
> Those were instructions to our QA team to attempt to reproduce the bug in
> order to see if this fix works.
> What exactly are you trying to achieve?

I have a VM snapshot with a disk in an illegal state and I can't delete it. The disk has been in the illegal state since a failed Live Merge operation. Could you give me some advice on how to delete this snapshot?

Comment 17 Greg Padgett 2015-12-08 03:51:15 UTC
(In reply to boruvka.michal from comment #16)
> (In reply to Allon Mureinik from comment #15)
> > (In reply to boruvka.michal from comment #14)
[trimmed]
> > Hi Michal,
> > 
> > Those were instructions to our QA team to attempt to reproduce the bug in
> > order to see if this fix works.
> > What exactly are you trying to achieve?
> 
> I have VM snapshot with disk in illegal state and I cant delete it. This
> disk is in illegal state after Live Merge operation. Could you give me some
> advice how to delete this snapshot?

Hi Michal,

Are the symptoms the same as this bug: even after stopping/re-starting the vm, live merge fails during the removal phase (engine log indicates "DESTROY_IMAGE") and the error on the vdsm side is something like "image file not found"?

If not, please check the engine/vdsm/libvirt logs for the root cause of the failure.  However, if this is the case and you don't want to wait on the fix, then overriding the vdsm return value may be the easiest way to resolve the situation.

To do this:
1. Note that you should not perform any volume/storage operations other than the required live merge while this change is in effect, as it may cause inconsistencies within the engine database and/or possible corruption down the road.
2. On the host running the VM, edit /usr/share/vdsm/storage/sp.py and search for the line starting with "    def deleteVolume" (preceded with 8 spaces).  On the very next line, add the following line:
        return
Note that the word "return" is preceded with 8 spaces (no tabs).  A sketch of the edited method follows after these steps.
3. Restart vdsm on the host: service vdsmd restart
4. Wait for the engine to discover the host is back up (possibly re-electing the SPM if one is needed)
5. Perform the merge
6. Undo the change in step 2
7. Restart vdsm as in step 3
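
For orientation only, the edited file would end up looking roughly like the sketch below. The argument list is a placeholder: keep the method's real signature and only insert the single "return" line.

# Excerpt-style sketch of the temporary edit to /usr/share/vdsm/storage/sp.py
# (per the correction in the next comment, apply it on the SPM host).
class StoragePool(object):                    # existing class in sp.py
    def deleteVolume(self, *args, **kwargs):  # placeholder signature
        return  # temporary override: report success without touching storage
        # ... the original method body remains in place below ...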

Comment 18 Greg Padgett 2015-12-09 00:16:05 UTC
(In reply to Greg Padgett from comment #17)
> To do this:
> 1. Note that you should not perform any volume/storage operations other than
> the required live merge while this change is in effect, as it may cause
> inconsistencies within the engine database and/or possible corruption down
> the road.
> 2. On the host running the VM, edit the /usr/share/vdsm/storage/sp.py and
> search for the line starting with "    def deleteVolume" (preceded with 8
> spaces).  On the very next line, add the following line:
>         return
> Note that the word "return" is preceded with 8 spaces (no tabs).

Correction: the deletion is done on the SPM host, not the host running the VM.  Please perform the steps on your SPM host, making sure this host remains SPM throughout the operation (re-elect it if necessary).

> 3. Restart vdsm on the host: service vdsmd restart
> 4. Wait for the engine to discover the host is back up (possibly re-electing
> the SPM if one is needed)
> 5. Perform the merge
> 6. Undo the change in step 2
> 7. Restart vdsm as in step 3

Comment 19 Kevin Alon Goldblatt 2016-02-08 14:54:20 UTC
(In reply to Greg Padgett from comment #2)
> Verification:
> 
> This one is a bit tricky because the problem only occurs if vdsm dies at a
> very specific point in time.
[trimmed]

Ran the verification with the following versions:
-------------------------------------------------
vdsm-4.17.19-0.el7ev.noarch
rhevm-3.6.3-0.1.el6.noarch

Ran the scenario as described in comment #2 above.

After starting vdsm in step 7, the snapshot was automatically deleted during the cleanup.
I did not have to delete it again.

Moving to VERIFIED!

