Red Hat Bugzilla – Bug 1304810
[Scale] Live Disk Migration fails when VM forced to pause during Live Migration of large disk
Last modified: 2016-02-22 07:19:37 EST
Created attachment 1121144
Description of problem:
Attempted to move disk (1.3 TB) from one ISCSI storage domain to another ISCSI storage domain. During the disk migration process the VM user logs in and deletes and moves files around which temporarily results in the temporary filling of the VM disk capacity to 100% during the Live Disk Migration process.
Ovirt-Engine reports that the VM had been temporarily paused, and is now un-paused. This occurs during the Live Disk Migration process.
The result is that the Live Migration continues to copy over all data to the target SD and then fails at the end.
Failed Disk Migration after several hours.
Target SD is now full of data from from failed Live Disk Migration.
Target SD shows capacity increased as if it completed, but source SD still mapped to disk.
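For diagnosis, a minimal sketch (assuming libvirt-python on the host; the VM name "bigvm" is illustrative, not taken from this environment) of querying why the guest was paused; qemu reports an I/O-error pause when a write hits ENOSPC on the underlying storage:

# Sketch: check why a guest is paused during the migration.
# Assumptions: libvirt-python installed; "bigvm" is an illustrative name.
import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("bigvm")

state, reason = dom.state()
if state == libvirt.VIR_DOMAIN_PAUSED:
    # VIR_DOMAIN_PAUSED_IOERROR is the reason qemu reports when the
    # guest hit an I/O error such as ENOSPC on the underlying storage.
    if reason == libvirt.VIR_DOMAIN_PAUSED_IOERROR:
        print("guest paused on I/O error (e.g. out of space)")
    else:
        print("guest paused, reason code:", reason)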
Version-Release number of selected component (if applicable):
50 total SDs of which
21 are iSCSI
30 are NFS
SD Source Domain contains 40 luns
SD Target Domain contains 60 luns
The VM used in this test has:
disk 1: 150 GB, 95% full (OS)
disk 2: 1.3 TB, 100% full [disk was used for the attempted migration]
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sr0 11:0 1 1024M 0 rom
vda 252:0 0 150G 0 disk
├─vda1 252:1 0 500M 0 part /boot
└─vda2 252:2 0 99.5G 0 part
├─vg0-lv_root 253:0 0 19.5G 0 lvm /
├─vg0-lv_swap 253:1 0 15.6G 0 lvm [SWAP]
└─vg0-lv_home 253:2 0 64.4G 0 lvm /home
vdb 252:16 0 1.3T 0 disk /bigdisk
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg0-lv_root 20G 18G 1019M 95% /
devtmpfs 16G 0 16G 0% /dev
tmpfs 16G 0 16G 0% /dev/shm
tmpfs 16G 8.4M 16G 1% /run
tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/vdb 1.3T 1.3T 0 100% /bigdisk
/dev/vda1 477M 94M 354M 21% /boot
/dev/mapper/vg0-lv_home 64G 17G 44G 29% /home
Steps to Reproduce:
1. Set up the above env of 2 large SDs and a VM with a populated 1.3 TB disk.
2. During the Live Migration (move) of the disk, fill the disk to capacity, then immediately delete some files to release some disk space (a sketch of this step follows the error below).
3. The VM will be temporarily paused.
4. The Disk Migration completes with an error:
"Error: VDSM <hostname> command failed: Drive replication error"
Actual results:
Target SD shows capacity consumed by the attempted disk migration.

Expected results:
If the Disk Migration fails, the target SD space should be freed.
OR the Disk Migration should be able to handle a paused VM.
OR the Disk Migration should fail gracefully UPON error, not after the whole migration process completes.
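For context on the last expectation: the copy phase of LSM is a qemu drive-mirror block job, so in principle an error can be detected while the job is still running. A minimal sketch (assuming libvirt-python; "bigvm" and "vdb" are illustrative names) of polling the job and reacting as soon as it disappears, instead of waiting for the full copy:

# Sketch: poll a drive-mirror block job and react to the first sign of
# trouble instead of letting the whole copy finish and fail at the end.
# Assumptions: libvirt-python installed; "bigvm" and "vdb" are
# illustrative names, not taken from this bug's environment.
import time
import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("bigvm")
disk = "vdb"                      # target device of the disk being mirrored

while True:
    info = dom.blockJobInfo(disk, 0)
    if not info:                  # job vanished: it failed or was cancelled
        print("block job gone; clean up destination volumes now")
        break
    if info["cur"] >= info["end"]:
        # Mirror is in sync; pivot the VM onto the destination volume.
        dom.blockJobAbort(disk, libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)
        break
    time.sleep(5)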
We have two types of rollbacks which we still don't handle appropriately:
1. Deleting the redundant snapshot volumes on live migration failure, which requires live merge.
2. Cleaning the destination volumes on live migration failure after the VmReplicateDiskFinish step of LSM. Failure at this step means that we can't determine whether the VM still uses the source images or has already migrated to the destination images. Hence, we can't remove the destination volumes, nor the source ones.
On failure prior to VmReplicateDiskFinish, we do a proper rollback; i.e. the newly created volumes are removed.
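To make the two failure windows concrete, a schematic sketch of the decision described above (all names are hypothetical placeholders, not oVirt engine APIs):

# Schematic sketch of the two rollback cases described above.
# Everything here is hypothetical illustration, not oVirt engine code.

VM_REPLICATE_DISK_FINISH = 3   # ordinal of the ambiguous LSM step

def remove_volumes(vols):
    print("removing", vols)                  # placeholder for volume deletion

def live_merge_snapshot(vols):
    print("live-merging snapshot on", vols)  # requires live merge support

def mark_for_manual_cleanup(src, dst):
    print("cannot decide owner; leaving", src, "and", dst, "to the admin")

def rollback_lsm(failed_step, source_vols, dest_vols):
    if failed_step < VM_REPLICATE_DISK_FINISH:
        # Before VmReplicateDiskFinish the VM is known to still use the
        # source images, so the new destination volumes can be removed
        # and the auto-generated snapshot live-merged away.
        remove_volumes(dest_vols)
        live_merge_snapshot(source_vols)
    else:
        # At or after VmReplicateDiskFinish we cannot tell which side
        # the VM is using, so neither source nor destination volumes
        # may be deleted automatically.
        mark_for_manual_cleanup(source_vols, dest_vols)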
Closing as a duplicate of bug 1034856 (which should be covered by RFE 959705).
*** This bug has been marked as a duplicate of bug 1034856 ***