Previously, the LV on the target storage domain was not deactivated when live storage migration failed.
In this release, when live storage migration fails, VDSM deactivates the LV that was created on the target storage domain.
Description of problem:
Live storage migration (LSM) fails at the final stage:
~~~
2023-05-05 18:06:48,618+05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-61) [58c7e9a0-5da3-4a29-836f-2f4c0ab075b5] Command 'VmReplicateDiskFinishVDSCommand(HostName = dell-r530-3.gsslab.pnq.redhat.com, VmReplicateDiskParameters:{hostId='786d6c9f-afba-4c0a-beb9-5c88b831c029', vmId='13456b7f-5d75-44f0-a15b-8b273c272840', storagePoolId='71cf243c-edec-11eb-aa8c-002b6a01557c', srcStorageDomainId='c044e714-676f-47b3-93ed-0144fff2863c', targetStorageDomainId='c044e714-676f-47b3-93ed-0144fff2863c', imageGroupId='8476b433-2cf5-4eab-8b50-951d9698c249', imageId='09f3de16-2014-4f56-b3d2-7a066d7bfcc4', diskType='null', needExtend='true'})' execution failed: VDSGenericException: VDSErrorException: Failed in vdscommand to VmReplicateDiskFinishVDS, error = Replication not in progress.: {'vmId': '13456b7f-5d75-44f0-a15b-8b273c272840', 'driveName': 'sdb', 'srcDisk': {'imageID': '8476b433-2cf5-4eab-8b50-951d9698c249', 'poolID': '71cf243c-edec-11eb-aa8c-002b6a01557c', 'volumeID': '09f3de16-2014-4f56-b3d2-7a066d7bfcc4', 'device': 'disk', 'domainID': 'c044e714-676f-47b3-93ed-0144fff2863c'}}
~~~
As a result, the engine deleted the destination volume through the SPM:
~~~
2023-05-05 18:06:48,618+05 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-61) [58c7e9a0-5da3-4a29-836f-2f4c0ab075b5] Attempting to delete the destination of disk '8476b433-2cf5-4eab-8b50-951d9698c249' of vm '13456b7f-5d75-44f0-a15b-8b273c272840'
2023-05-05 18:06:48,675+05 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-61) [717dc32e] START, DeleteImageGroupVDSCommand( DeleteImageGroupVDSCommandParameters:{storagePoolId='71cf243c-edec-11eb-aa8c-002b6a01557c', ignoreFailoverLimit='false', storageDomainId='1bfb1005-b98d-4592-9c1d-5c04292584ed', imageGroupId='8476b433-2cf5-4eab-8b50-951d9698c249', postZeros='false', discard='false', forceDelete='false'}), log id: 7da8ea9
~~~
However, the engine does not deactivate the LV on the host where the VM is running before deleting it, so a stale dm device mapping the old blocks is left behind on that host:
~~~
dmsetup table|grep 2c2aa5b9
1bfb1005--b98d--4592--9c1d--5c04292584ed-2c2aa5b9--1805--4336--9978--52a17aecfcd9: 0 5242880 linear 253:13 331089920
~~~
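The stale state can be confirmed on that host: the LV is already gone from the VG metadata (the SPM removed it), yet the device-mapper node is still present. A minimal check, using the VG and LV UUIDs from this report (on a vdsm host, plain `lvs` may additionally need a permissive `--config 'devices { filter = ... }'` because of the host's LVM filter):
~~~
# The LV no longer exists in the VG metadata (no output expected):
lvs --noheadings -o lv_name 1bfb1005-b98d-4592-9c1d-5c04292584ed | grep 2c2aa5b9
# ...but the stale dm node for it is still present on the VM's host:
dmsetup info -c | grep 2c2aa5b9
~~~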
Now, if a new disk is added, LVM may allocate the same segments, because from LVM's point of view those segments are free:
~~~
offset is the same 331089920 for the new lv 44e5bbc2:
dmsetup table|grep 44e5bbc2
1bfb1005--b98d--4592--9c1d--5c04292584ed-44e5bbc2--b1fd--4198--907f--53cde3c5cff8: 0 6291456 linear 253:5 331089920
~~~
This ends up with two LVs mapping to the same blocks of the underlying LUN.
If the first VM disk is migrated again and the migration succeeds, the VM ends up using the old mapped sectors: although the VM is using 2c2aa5b9, it sees the contents of LV 44e5bbc2.
The data of the other disk can easily get corrupted if this VM writes anything to it.
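A quick way to spot such collisions on the VM's host is to group the linear dm mappings of this VG by their start offset; two entries with the same offset point at the same extents of the LUN. This is only a sketch against the `dmsetup table` output format shown above, filtered by the VG UUID from this report:
~~~
# Print "<start offset> <dm name>" for every linear segment of this VG and sort by offset;
# duplicate offsets reveal LVs that overlap on the underlying LUN:
dmsetup table | awk '$4 == "linear" {print $6, $1}' | grep 1bfb1005 | sort -n
~~~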
Version-Release number of selected component (if applicable):
rhvm-4.5.3.7-1.el8ev.noarch
How reproducible:
100%
Steps to Reproduce:
1. Migrate a disk to a block storage domain. Make sure that the VM is _not_ running on the SPM host.
2. To make diskReplicateFinish fail, abort the block job manually during the migration:
~~~
# virsh blockjob <vm-name> <disk-name> --abort
~~~
3. The engine removes the destination LV. Confirm that the dm device of the destination LV is not removed on the host where the VM is running:
~~~
# dmsetup table |grep lv
~~~
4. Create a new disk. Depending on the available segments, LVM may allocate the same segment blocks for the new LV.
5. Migrate the disk from step 1 again. After the migration, the VM ends up seeing the contents of the new disk, as the check below illustrates.
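To confirm the overlap while reproducing, the stale dm node left by the failed migration can be compared with the dm node of the new LV; if both are active on the same host (as in the output above) and were placed on the same extents, they expose the same data. A sketch using the device names from this report:
~~~
# cmp reporting no difference within the first MiB is consistent with both
# names mapping the same physical blocks of the LUN:
cmp -n 1048576 \
    /dev/mapper/1bfb1005--b98d--4592--9c1d--5c04292584ed-2c2aa5b9--1805--4336--9978--52a17aecfcd9 \
    /dev/mapper/1bfb1005--b98d--4592--9c1d--5c04292584ed-44e5bbc2--b1fd--4198--907f--53cde3c5cff8
~~~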
Actual results:
The LV is not deactivated on the VM's host after a failed live storage migration, which may cause data corruption.
Expected results:
Before deleting the LV through the SPM host, the engine should deactivate the LV on the host where the VM was running.
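Until such a fix is in place, a manual cleanup on the host that ran the VM is possible. This is only a sketch using the names from this report, and on a vdsm host `lvchange` may also require an explicit `--config 'devices { filter = ... }'` because of the host's LVM filter. If the LV still exists in the VG, deactivate it normally; if it has already been deleted through the SPM, only the stale dm node remains and has to be removed directly:
~~~
# Deactivate the LV if it is still defined in the VG:
lvchange -an 1bfb1005-b98d-4592-9c1d-5c04292584ed/2c2aa5b9-1805-4336-9978-52a17aecfcd9
# If the LV was already deleted through the SPM, remove the leftover dm node instead
# (dmsetup refuses to remove a device that is still open):
dmsetup remove 1bfb1005--b98d--4592--9c1d--5c04292584ed-2c2aa5b9--1805--4336--9978--52a17aecfcd9
~~~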
Additional info:
I looked at this bug together with Benny today.
It's the same scenario that was verified in https://bugzilla.redhat.com/show_bug.cgi?id=1542423#c4, but back then we concentrated on removing the disk from the destination storage domain in order to allow initiating another live storage migration to the same destination storage domain, and didn't pay attention to the fact that the destination volume on the host is not torn down.
We see how it may cause data corruption if the blocks read on the destination storage domain contain data from the previous live storage migration. The fix seems trivial: call tear-down before returning the ReplicationNotInProgress exception on the host.