Bug 1034856 - When Live Storage Migration fails it doesn't remove newly-created volumes from the destination domain.
Summary: When Live Storage Migration fails it doesn't remove newly-created volumes from the destination domain.
Keywords:
Status: CLOSED DUPLICATE of bug 959705
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ovirt-3.6.3
Target Release: 3.6.0
Assignee: Daniel Erez
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Duplicates: 1070863 1097722 1103468 1108577 1195773 1248670 1304810
Depends On: 647386 1083310 1196199
Blocks: 959739 1036111
 
Reported: 2013-11-26 15:47 UTC by Gordon Watson
Modified: 2018-12-05 16:39 UTC
CC: 15 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-06-26 08:05:50 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:




Links:
Red Hat Knowledge Base (Solution) 542243

Description Gordon Watson 2013-11-26 15:47:16 UTC
Description of problem:

There are two parts to the following problem scenario, with suggestions for two possible logic changes.

1) In a specific incident, Live Storage Migration failed due to pathnames not existing in the '/rhev/data-center' tree. This condition had occurred because vdsm had been upgraded incorrectly. The result of the failed LSM attempt was that the following were created:

- new "snapshot image" in the source domain.
- base image in the destination domain.
- "snapshot image" in the destination domain.

By "snapshot image" I mean the image created by the snapshot creation, I realise the snapshot is actually the base image.

In this case the customer has block-based storage, so two new logical volumes were created in the destination domain. However, nothing was copied to these images, so in essence they could simply have been removed when the failure occurred. The first suggestion, therefore, is to have LSM clean up after itself following a failure.
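
A minimal sketch of what such a cleanup could look like, assuming hypothetical helper names (rollback_failed_lsm, delete_volume); this is not the actual ovirt-engine/vdsm code, just an illustration of the suggested behaviour:

def rollback_failed_lsm(dest_domain, created_volume_ids, delete_volume):
    """Remove the volumes a failed LSM attempt left on the destination domain.

    dest_domain        -- destination storage domain ID (hypothetical)
    created_volume_ids -- IDs of the volumes the LSM created, in creation order
    delete_volume      -- callable that removes one volume (hypothetical API)
    """
    errors = []
    # Delete in reverse creation order so a child ("snapshot") volume is
    # removed before its base volume.
    for vol_id in reversed(created_volume_ids):
        try:
            delete_volume(dest_domain, vol_id)
        except Exception as exc:
            # Keep going so one failure doesn't leave the rest behind.
            errors.append((vol_id, exc))
    return errors

if __name__ == "__main__":
    removed = []
    rollback_failed_lsm("dest-sd-id",
                        ["base-image-id", "snapshot-image-id"],
                        lambda sd, vol: removed.append((sd, vol)))
    print(removed)  # snapshot image first, then the base image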


2) To compound this, a subsequent LSM was performed. This also failed, but this time because the base image already existed in the destination domain. As a result, another "snapshot image" got created in the source domain. Unfortunately the customer repeated this a few more times, and the end result was that several VMs with multiple disks now had several snapshots (images existing only in the source domain). Each LV is most likely only 1 GB, but multiplying this by the number of disks and the number of failed LSM attempts, there were now about 50 meaningless LVs, and therefore about 50 GB of wasted space in the source domain. On top of that, if the customer wants to remove these snapshots at a later time, they now have to bring the VM down (this is RHEV 3.2) and then remove multiple snapshots. That may not take long for each of the auto-generated ones (unless they end up merging with the base image), but it still requires time and a solid understanding of exactly what is being removed and what happens during snapshot removal. My concern is that this scenario may not be well understood by customers, and they might end up merging into the base image, which could lead to significant downtime.
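
The space figure above is simple arithmetic; with purely illustrative numbers (the exact VM and disk counts are not stated in the logs):

lv_size_gb = 1            # initial size of each snapshot LV on block storage
disks = 10                # illustrative: total disks across the affected VMs
failed_attempts = 5       # illustrative: failed LSM attempts per disk
wasted_lvs = disks * failed_attempts
print(wasted_lvs, "LVs,", wasted_lvs * lv_size_gb, "GB wasted")  # 50 LVs, 50 GB wasted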

I haven't looked at the LSM logic to see whether it's feasible to check for the existence of images in the destination domain before generating the snapshot, but if that could be done it would be beneficial.
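
A sketch of that suggested ordering, using assumed helper names (volumes_exist, create_auto_snapshot, copy_data) rather than the real engine commands:

def live_storage_migrate(disk, dest_domain,
                         volumes_exist, create_auto_snapshot, copy_data):
    """Validate the destination before creating the auto-generated snapshot."""
    # 1. Fail fast while nothing has been created yet, so a doomed LSM does
    #    not leave an extra snapshot behind in the source domain.
    if volumes_exist(dest_domain, disk):
        raise RuntimeError("image already exists in the destination domain")

    # 2. Only now create the auto-generated snapshot and start copying.
    create_auto_snapshot(disk)
    copy_data(disk, dest_domain)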



Version-Release number of selected component (if applicable):

- RHEV 3.2
- RHEL6.4 host with vdsm-4.10.2-27.0


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:

1) Logical volumes left behind in the destination domain after a specific LSM failure.
2) Snapshots created during failed LSM attempts.


Expected results:

1) Logical volumes removed from the destination domain after this specific LSM failure.
2) Snapshots not created during failed LSM attempts.


Additional info:

I will add supporting data and logs a little later.

Comment 3 Allon Mureinik 2013-11-27 10:33:31 UTC
(In reply to Gordon Watson from comment #0)
> Description of problem:
> 
> There are two parts to the following problem scenario with suggestions for
> two possible logic changes.
> 
> 1) In a specific incident, Live Storage Migration failed due to pathnames
> not existing in the '/rhev/data-center' tree. This condition had occurred
> due to vdsm having been upgraded incorrectly. The result of the failed LSM
> attempt was that the following were created;
I don't understand this statement. Do you mean that VDSM's upgrade procedure is faulty, or did the customer make a mistake?

Comment 6 Daniel Erez 2014-02-19 08:57:43 UTC
According to the logs, the snapshot creation succeeded but the live snapshot operation failed. A rollback is indeed necessary, as we can't proceed with the live disk migration. However, in order to perform the snapshot deletion, live merge support is needed.
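
To illustrate the constraint (assumed names, not the real code): while the VM is running, the auto-generated snapshot can only be removed via a live merge, so without live merge support the rollback has to be deferred.

def rollback_auto_snapshot(vm_is_running, live_merge_supported,
                           live_merge, cold_merge):
    if vm_is_running:
        if not live_merge_supported:
            # Nothing safe to do now; the snapshot stays until the VM is
            # shut down or live merge becomes available.
            return "deferred"
        live_merge()      # merge the auto-generated snapshot into its base
    else:
        cold_merge()      # offline snapshot removal
    return "removed"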

Comment 9 Daniel Erez 2014-06-01 14:27:57 UTC
*** Bug 1103468 has been marked as a duplicate of this bug. ***

Comment 10 Daniel Erez 2014-06-17 13:13:17 UTC
*** Bug 1070863 has been marked as a duplicate of this bug. ***

Comment 11 Daniel Erez 2014-06-17 14:01:25 UTC
*** Bug 1108577 has been marked as a duplicate of this bug. ***

Comment 12 Daniel Erez 2014-06-18 13:47:08 UTC
*** Bug 1097722 has been marked as a duplicate of this bug. ***

Comment 13 Allon Mureinik 2014-06-26 08:05:50 UTC

*** This bug has been marked as a duplicate of bug 959705 ***

Comment 14 Daniel Erez 2015-02-25 08:09:54 UTC
*** Bug 1195773 has been marked as a duplicate of this bug. ***

Comment 15 Daniel Erez 2015-11-10 13:28:33 UTC
*** Bug 1248670 has been marked as a duplicate of this bug. ***

Comment 16 Daniel Erez 2016-02-07 15:50:07 UTC
*** Bug 1304810 has been marked as a duplicate of this bug. ***

