Created attachment 1121144 [details]
ovirt_vdsm_logs

Description of problem:
Attempted to move a disk (1.3 TB) from one iSCSI storage domain to another iSCSI storage domain. During the migration, the VM user logs in and deletes and moves files around, which temporarily fills the VM disk capacity to 100% while the live disk migration is in progress. ovirt-engine reports that the VM was temporarily paused and is now un-paused. The live migration then continues to copy all data to the target SD and fails at the end.

Results: Failed disk migration after several hours. The target SD is now full of data from the failed live disk migration. The target SD shows its capacity increased as if the migration had completed, but the disk is still mapped to the source SD.

Version-Release number of selected component (if applicable):
vdsm-hook-vmfex-dev-4.17.17-0.el7ev.noarch
vdsm-python-4.17.17-0.el7ev.noarch
vdsm-yajsonrpc-4.17.17-0.el7ev.noarch
vdsm-4.17.17-0.el7ev.noarch
vdsm-xmlrpc-4.17.17-0.el7ev.noarch
vdsm-jsonrpc-4.17.17-0.el7ev.noarch
vdsm-cli-4.17.17-0.el7ev.noarch
vdsm-infra-4.17.17-0.el7ev.noarch

Environment:
The env contains 50 SDs in total, of which 21 are iSCSI and 30 are NFS.
The source SD contains 40 LUNs.
The target SD contains 60 LUNs.

The VM used in this test has:
disk 1: 150 GB, 95% full (OS)
disk 2: 1.3 TB, 100% full (the disk used for the attempted migration)

Disk info:
NAME              MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sr0                11:0    1 1024M  0 rom
vda               252:0    0  150G  0 disk
├─vda1            252:1    0  500M  0 part /boot
└─vda2            252:2    0 99.5G  0 part
  ├─vg0-lv_root   253:0    0 19.5G  0 lvm  /
  ├─vg0-lv_swap   253:1    0 15.6G  0 lvm  [SWAP]
  └─vg0-lv_home   253:2    0 64.4G  0 lvm  /home
vdb               252:16   0  1.3T  0 disk /bigdisk

Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/vg0-lv_root   20G   18G 1019M  95% /
devtmpfs                  16G     0   16G   0% /dev
tmpfs                     16G     0   16G   0% /dev/shm
tmpfs                     16G  8.4M   16G   1% /run
tmpfs                     16G     0   16G   0% /sys/fs/cgroup
/dev/vdb                 1.3T  1.3T     0 100% /bigdisk
/dev/vda1                477M   94M  354M  21% /boot
/dev/mapper/vg0-lv_home   64G   17G   44G  29% /home

How reproducible:

Steps to Reproduce:
1. Set up the above env of 2 large SDs and a VM with a populated 1.3 TB disk.
2. During live migration (move) of the disk, fill the disk to capacity, then immediately delete some files to release some disk space.
3. The VM will be temporarily paused.
4. The disk migration completes with an error.

Actual results:
"Error: VDSM <hostname> command failed: Drive replication error"
The target SD shows capacity consumed by the attempted disk migration.

Expected results:
If the disk migration fails, the target SD space should be freed; OR
the disk migration should be able to handle a paused VM; OR
the disk migration should fail gracefully upon error, not after the whole migration process completes.

Additional info:
We have two types of rollbacks which we still don't handle appropriately:
1. Deleting the redundant snapshot volumes on live-migrate failure, which requires live merge.
2. Cleaning the destination volumes on live-migrate failure after the VmReplicateDiskFinish step of LSM. Failure of this step means that we can't determine whether the VM still uses the source images or has already migrated to the destination images. Hence, we can't remove the destination volumes, nor the source ones.

On failure prior to VmReplicateDiskFinish, we do a proper rollback; i.e., the newly created volumes are removed.

Closing as a duplicate of bug 1034856 (which should be covered by RFE 959705).

*** This bug has been marked as a duplicate of bug 1034856 ***
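The rollback decision described above can be sketched as follows. The names (`Step`, `rollback_action`) are illustrative assumptions, not actual ovirt-engine or VDSM identifiers; the sketch only captures the stated invariant that VmReplicateDiskFinish is the point of no return.

```python
from enum import Enum

class Step(Enum):
    """Hypothetical ordering of the relevant LSM steps."""
    CREATE_DEST_VOLUMES = 1
    REPLICATE_START = 2
    REPLICATE_FINISH = 3  # VmReplicateDiskFinish

def rollback_action(failed_step):
    """Return which cleanup is safe after a live-migrate failure."""
    if failed_step.value < Step.REPLICATE_FINISH.value:
        # Before VmReplicateDiskFinish the VM is known to still use
        # the source images, so the new destination volumes can be
        # removed safely (the proper rollback described above).
        return "remove-destination-volumes"
    # If VmReplicateDiskFinish itself fails, we cannot tell whether
    # the VM switched to the destination images, so neither the
    # source nor the destination volumes can be removed.
    return "manual-intervention"
```

This is why the reporter sees the target SD's capacity still consumed after the failure: the failure happened at a step where no automatic cleanup is safe.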