Bug 1034856

Summary: When Live Storage Migration fails it doesn't remove newly-created volumes from the destination domain.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Gordon Watson <gwatson>
Component: ovirt-engine
Assignee: Daniel Erez <derez>
Status: CLOSED DUPLICATE
QA Contact: Aharon Canan <acanan>
Severity: medium
Docs Contact:
Priority: high
Version: 3.2.0
CC: acathrow, amureini, asegundo, gwatson, iheim, jcoscia, kgoldbla, lpeer, mkalinin, mlehrer, nsednev, ogofen, Rhev-m-bugs, scohen, yeylon
Target Milestone: ovirt-3.6.3
Target Release: 3.6.0
Hardware: x86_64
OS: Linux
Whiteboard: storage
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-06-26 08:05:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On: 647386, 1083310, 1196199
Bug Blocks: 959739, 1036111

Description Gordon Watson 2013-11-26 15:47:16 UTC
Description of problem:

There are two parts to the following problem scenario, with suggestions for two possible logic changes.

1) In a specific incident, Live Storage Migration failed due to pathnames not existing in the '/rhev/data-center' tree. This condition had occurred due to vdsm having been upgraded incorrectly. The result of the failed LSM attempt was that the following were created:

- new "snapshot image" in the source domain.
- base image in the destination domain.
- "snapshot image" in the destination domain.

By "snapshot image" I mean the image created by the snapshot creation, I realise the snapshot is actually the base image.

In this case they have block-based storage, and so two new logical volumes got created in the destination domain. However, nothing got copied to these images, so in essence they could have just been removed when the failure occurred. So, the first suggestion is to have LSM clean up after itself after a failure.
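The suggested cleanup-on-failure behaviour can be sketched as follows. This is an illustrative sketch only, assuming a toy `StorageDomain` stand-in; it is not the real ovirt-engine or vdsm API, just the "record what you created, roll it back on failure" pattern being proposed:

```python
class StorageDomain:
    """Minimal stand-in for a block storage domain (hypothetical)."""

    def __init__(self):
        self.volumes = set()

    def create_volume(self, name):
        self.volumes.add(name)
        return name

    def delete_volume(self, name):
        self.volumes.discard(name)


def live_storage_migrate(dest, copy_data):
    """Create destination volumes, copy data, and roll back on failure."""
    created = []
    try:
        for name in ("base_image", "snapshot_image"):
            created.append(dest.create_volume(name))
        copy_data()  # in the reported incident, this step failed
    except Exception:
        # Suggested fix: the destination volumes are still empty at this
        # point, so delete them before re-raising instead of leaving
        # orphaned LVs behind.
        for vol in reversed(created):
            dest.delete_volume(vol)
        raise
```

With this shape, a failed copy leaves the destination domain exactly as it was before the attempt.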


2) To compound this, a subsequent LSM was performed. This also failed, but this time because the base image already existed in the destination domain. As a result, another "snapshot image" got created in the source domain. Unfortunately the customer repeated this a few more times, and the end result was that several VMs with multiple disks now had several snapshots (images only in the source domain). Each LV is most likely only 1 GB, but multiplying this by the number of disks and the number of failed LSM attempts, there were now about 50 meaningless LVs, and therefore about 50 GB of wasted space in the source domain.

Not only that, but if the customer wants to remove these snapshots at a later time, they now have to bring the VM down (this is RHEV 3.2) and then remove multiple snapshots. That may not take too long for each of the auto-generated ones (unless they end up merging with the base image), but it still requires time and sufficient knowledge of exactly what you're doing and what's happening when you're removing snapshots. My concern is that this scenario may not be well understood by customers and they might end up merging into the base image, which could lead to significant downtime.

I haven't looked at the LSM logic to see if it's feasible to not generate the snapshot before checking for the existence of images in the destination domain, but if that could be done then it would be beneficial.
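The reordering being suggested can be sketched as below. Again this is illustrative only, assuming domains modeled as plain sets of image names rather than the real engine flow: validate the destination first, so a doomed attempt never creates an auto-generated snapshot on the source.

```python
def lsm_precheck_then_snapshot(source_imgs, dest_imgs, disk_images):
    """Refuse the migration up front if any target image already exists,
    so no auto-generated snapshot is created for a doomed attempt.

    source_imgs / dest_imgs: sets of image names (toy model).
    disk_images: names of the images LSM would create on the destination.
    """
    conflicts = [img for img in disk_images if img in dest_imgs]
    if conflicts:
        # Fail before touching the source domain at all.
        raise ValueError("images already exist on destination: %s" % conflicts)
    # Only now is the auto-generated snapshot created on the source.
    source_imgs.add("auto_generated_snapshot")
    return source_imgs
```

With the check first, a repeat attempt against a dirty destination fails cleanly instead of accumulating one more orphan snapshot per try.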



Version-Release number of selected component (if applicable):

- RHEV 3.2
- RHEL6.4 host with vdsm-4.10.2-27.0


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:

1) Logical volumes left behind in the destination domain after a specific LSM failure.
2) Snapshots created during failed LSM attempts.


Expected results:

1) Logical volumes removed from the destination domain after this specific LSM failure.
2) Snapshots not created during failed LSM attempts.


Additional info:

I will add supporting data and logs a little later.

Comment 3 Allon Mureinik 2013-11-27 10:33:31 UTC
(In reply to Gordon Watson from comment #0)
> Description of problem:
> 
> There are two parts to the following problem scenario with suggestions for
> two possible logic chnages.
> 
> 1) In a specific incident, Live Storage Migration failed due to pathnames
> not existing in the '/rhev/data-center' tree. This condition had occurred
> due to vdsm having been upgraded incorrectly. The result of the failed LSM
> attempt was that the following were created;
I don't understand this statement. Do you mean that VDSM's upgrade procedure is faulty, or did the customer make a mistake?

Comment 6 Daniel Erez 2014-02-19 08:57:43 UTC
According to logs, the snapshot creation succeeded but the live snapshot operation failed. A roll-back is indeed necessary as we can't go on with the live disk migration. However, in order to perform snapshot deletion, live merge support is needed.

Comment 9 Daniel Erez 2014-06-01 14:27:57 UTC
*** Bug 1103468 has been marked as a duplicate of this bug. ***

Comment 10 Daniel Erez 2014-06-17 13:13:17 UTC
*** Bug 1070863 has been marked as a duplicate of this bug. ***

Comment 11 Daniel Erez 2014-06-17 14:01:25 UTC
*** Bug 1108577 has been marked as a duplicate of this bug. ***

Comment 12 Daniel Erez 2014-06-18 13:47:08 UTC
*** Bug 1097722 has been marked as a duplicate of this bug. ***

Comment 13 Allon Mureinik 2014-06-26 08:05:50 UTC

*** This bug has been marked as a duplicate of bug 959705 ***

Comment 14 Daniel Erez 2015-02-25 08:09:54 UTC
*** Bug 1195773 has been marked as a duplicate of this bug. ***

Comment 15 Daniel Erez 2015-11-10 13:28:33 UTC
*** Bug 1248670 has been marked as a duplicate of this bug. ***

Comment 16 Daniel Erez 2016-02-07 15:50:07 UTC
*** Bug 1304810 has been marked as a duplicate of this bug. ***