Bug 1304810 - [Scale] Live Disk Migration fails when VM forced to pause during Live Migration of large disk
Summary: [Scale] Live Disk Migration fails when VM forced to pause during Live Migrati...
Keywords:
Status: CLOSED DUPLICATE of bug 1034856
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 3.6.2.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-3.6.5
Assignee: Daniel Erez
QA Contact: Aharon Canan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-02-04 16:52 UTC by mlehrer
Modified: 2016-02-22 12:19 UTC
CC: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-02-07 15:50:07 UTC
oVirt Team: Storage
Embargoed:
tnisan: ovirt-3.6.z?
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?


Attachments
ovirt_vdsm_logs (3.16 MB, application/zip)
2016-02-04 16:52 UTC, mlehrer

Description mlehrer 2016-02-04 16:52:15 UTC
Created attachment 1121144 [details]
ovirt_vdsm_logs

Description of problem:

Attempted to move a 1.3 TB disk from one iSCSI storage domain to another iSCSI storage domain. While the disk migration is in progress, a user logs into the VM and deletes and moves files around, temporarily filling the VM disk to 100% of its capacity during the Live Disk Migration.

During the Live Disk Migration, ovirt-engine reports that the VM was temporarily paused and has since been un-paused.

The Live Disk Migration nevertheless continues to copy all data to the target SD and only fails at the end.
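
The pause/un-pause can be confirmed in engine.log while the migration runs. A minimal sketch, assuming the default log location and that the ENOSPC audit messages contain the phrases shown (exact wording may differ between engine versions):

import re

# Assumed default engine log location on the ovirt-engine host.
ENGINE_LOG = "/var/log/ovirt-engine/engine.log"

# Assumed substrings of the ENOSPC pause / resume audit messages;
# adjust to the messages your engine version actually logs.
PAUSE_RE = re.compile(r"paused due to no Storage space", re.IGNORECASE)
RESUME_RE = re.compile(r"recovered from paused", re.IGNORECASE)

def find_pause_events(path=ENGINE_LOG):
    """Return (line_no, line) pairs for pause/resume events in the log."""
    events = []
    with open(path, errors="replace") as log:
        for line_no, line in enumerate(log, start=1):
            if PAUSE_RE.search(line) or RESUME_RE.search(line):
                events.append((line_no, line.rstrip()))
    return events

if __name__ == "__main__":
    for line_no, line in find_pause_events():
        print(f"{line_no}: {line}")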

Results:
Failed Disk Migration after several hours.
Target SD is now full of data from the failed Live Disk Migration.
Target SD shows its used capacity increased as if the migration had completed, but the disk is still mapped to the source SD.





Version-Release number of selected component (if applicable):

vdsm-hook-vmfex-dev-4.17.17-0.el7ev.noarch
vdsm-python-4.17.17-0.el7ev.noarch
vdsm-yajsonrpc-4.17.17-0.el7ev.noarch
vdsm-4.17.17-0.el7ev.noarch
vdsm-xmlrpc-4.17.17-0.el7ev.noarch
vdsm-jsonrpc-4.17.17-0.el7ev.noarch
vdsm-cli-4.17.17-0.el7ev.noarch
vdsm-infra-4.17.17-0.el7ev.noarch

Env contains

50 total SDs of which 
   21 are ISCSI
   30 are NFS

SD Source Domain contains 40 luns
SD Target Domain contains 60 luns
VM used in this test has
   disk 1: 150 GB, 95% full (OS)
   disk 2: 1.3 TB, 100% full [disk used for the attempted migration]

Disk info:

NAME            MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sr0              11:0    1 1024M  0 rom  
vda             252:0    0  150G  0 disk 
├─vda1          252:1    0  500M  0 part /boot
└─vda2          252:2    0 99.5G  0 part 
  ├─vg0-lv_root 253:0    0 19.5G  0 lvm  /
  ├─vg0-lv_swap 253:1    0 15.6G  0 lvm  [SWAP]
  └─vg0-lv_home 253:2    0 64.4G  0 lvm  /home
vdb             252:16   0  1.3T  0 disk /bigdisk


Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/vg0-lv_root   20G   18G 1019M  95% /
devtmpfs                  16G     0   16G   0% /dev
tmpfs                     16G     0   16G   0% /dev/shm
tmpfs                     16G  8.4M   16G   1% /run
tmpfs                     16G     0   16G   0% /sys/fs/cgroup
/dev/vdb                 1.3T  1.3T     0 100% /bigdisk
/dev/vda1                477M   94M  354M  21% /boot
/dev/mapper/vg0-lv_home   64G   17G   44G  29% /home



How reproducible:


Steps to Reproduce:
1. Set up the environment above: two large SDs and a VM with a populated 1.3 TB disk.
2. During Live Migration (move) of the disk, fill the disk to capacity, then immediately delete some files to release space (a sketch of this in-guest step follows the list below).
3. VM will be temporarily paused.
4. Disk Migration completes with error.
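
A minimal sketch of step 2 as run inside the guest, assuming the nearly-full data disk is mounted at /bigdisk as in the environment above; the scratch file name and chunk size are arbitrary:

import os

MOUNTPOINT = "/bigdisk"                          # data disk from the setup above
FILLER = os.path.join(MOUNTPOINT, "filler.bin")  # hypothetical scratch file
CHUNK = b"\0" * (64 * 1024 * 1024)               # 64 MiB per write

def fill_then_release():
    """Write until the filesystem hits ENOSPC, then delete to free space again."""
    try:
        with open(FILLER, "wb") as f:
            while True:
                f.write(CHUNK)
                f.flush()
                os.fsync(f.fileno())
    except OSError as err:   # ENOSPC once the disk is 100% full
        print("stopped writing:", err)
    finally:
        # Immediately release some space, as in step 2; the VM may already
        # have been paused by the engine at this point.
        if os.path.exists(FILLER):
            os.remove(FILLER)

if __name__ == "__main__":
    fill_then_release()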


Actual results:

"Error: VDSM <hostname> command failed: Drive replication error"

Target SD shows capacity consumed by the attempted disk migration.

Expected results:
If the Disk Migration fails, the target SD space should be freed (see the SDK sketch after this list for checking what is left behind).
OR the Disk Migration should be able to handle a paused VM.
OR the Disk Migration should fail gracefully as soon as the error occurs, not after the whole copy completes.
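
For illustration only, here is a sketch of how an admin could list what is occupying the target SD after the failed migration, using the oVirt Python SDK (ovirtsdk4; note this bug predates SDK v4, and the engine URL, credentials, and storage domain name below are placeholders):

import ovirtsdk4 as sdk

# Placeholder connection details -- replace with the real engine URL and credentials.
connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="password",
    insecure=True,
)

try:
    sds_service = connection.system_service().storage_domains_service()
    # "target_sd" is a placeholder for the target storage domain name.
    target = sds_service.list(search="name=target_sd")[0]
    disks = sds_service.storage_domain_service(target.id).disks_service().list()
    for disk in disks:
        actual_gib = (disk.actual_size or 0) / 1024 ** 3
        print("%s: %.1f GiB used, status=%s" % (disk.name or disk.id, actual_gib, disk.status))
finally:
    connection.close()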



Additional info:

Comment 1 Daniel Erez 2016-02-07 15:50:07 UTC
We have two types of rollback which we still don't handle appropriately:
1. Deleting the redundant snapshot volumes on live migration failure; this requires live merge.
2. Cleaning the destination volumes on live migration failure after the VmReplicateDiskFinish step of LSM. Failure at this step means we can't determine whether the VM still uses the source images or has already switched to the destination images. Hence, we can remove neither the destination nor the source volumes.
On failure prior to VmReplicateDiskFinish, we do a proper rollback; i.e., the newly created volumes are removed.
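
For illustration, a minimal sketch of the decision described above (not the actual engine code; the step names are simplified and the cleanup helper methods are hypothetical):

# Illustrative only: mirrors the rollback rules from the comment above.
# LSM = Live Storage Migration; step names are simplified assumptions.

def rollback_on_lsm_failure(failed_step, cleanup):
    """Decide what can safely be cleaned up when LSM fails at `failed_step`."""
    if failed_step in ("CreateSnapshot", "CloneImageStructure", "VmReplicateDiskStart"):
        # Failure before VmReplicateDiskFinish: the VM is still known to use
        # the source images, so the newly created destination volumes can go.
        cleanup.remove_destination_volumes()
        # The auto-generated LSM snapshot still needs live merge (case 1 above).
        cleanup.schedule_live_merge_of_lsm_snapshot()
    else:
        # Case 2: at or after VmReplicateDiskFinish we cannot tell whether the
        # VM already switched to the destination images, so neither the source
        # nor the destination volumes can be removed automatically.
        cleanup.leave_volumes_for_manual_recovery()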

Closing as a duplicate of bug 1034856 (which should be covered by rfe 959705).

*** This bug has been marked as a duplicate of bug 1034856 ***

