Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2193392

Summary:	LV is not deactivated in the VM's host after a failed live storage migration may cause data corruption
Product:	Red Hat Enterprise Virtualization Manager	Reporter:	nijin ashok <nashok>
Component:	vdsm	Assignee:	Arik <ahadas>
Status:	CLOSED ERRATA	QA Contact:	Shir Fishbain <sfishbai>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.5.3	CC:	ahadas, bzlotnik, emarcus, gveitmic, lsurette, michal.skrivanek, srevivo, ycui
Target Milestone:	ovirt-4.5.3-async
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Previously, the LV on the target storage domain was not deactivated when live storage migration failed. In this release, when live storage migration fails, VDSM deactivates the LV that was created on the target storage domain.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2023-09-11 10:15:10 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	Storage	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description nijin ashok 2023-05-05 13:48:00 UTC

Description of problem:

The LSM fails at the final stage:

~~~
2023-05-05 18:06:48,618+05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-61) [58c7e9a0-5da3-4a29-836f-2f4c0ab075b5] Command 'VmReplicateDiskFinishVDSCommand(HostName = dell-r530-3.gsslab.pnq.redhat.com, VmReplicateDiskParameters:{hostId='786d6c9f-afba-4c0a-beb9-5c88b831c029', vmId='13456b7f-5d75-44f0-a15b-8b273c272840', storagePoolId='71cf243c-edec-11eb-aa8c-002b6a01557c', srcStorageDomainId='c044e714-676f-47b3-93ed-0144fff2863c', targetStorageDomainId='c044e714-676f-47b3-93ed-0144fff2863c', imageGroupId='8476b433-2cf5-4eab-8b50-951d9698c249', imageId='09f3de16-2014-4f56-b3d2-7a066d7bfcc4', diskType='null', needExtend='true'})' execution failed: VDSGenericException: VDSErrorException: Failed in vdscommand to VmReplicateDiskFinishVDS, error = Replication not in progress.: {'vmId': '13456b7f-5d75-44f0-a15b-8b273c272840', 'driveName': 'sdb', 'srcDisk': {'imageID': '8476b433-2cf5-4eab-8b50-951d9698c249', 'poolID': '71cf243c-edec-11eb-aa8c-002b6a01557c', 'volumeID': '09f3de16-2014-4f56-b3d2-7a066d7bfcc4', 'device': 'disk', 'domainID': 'c044e714-676f-47b3-93ed-0144fff2863c'}}
~~~

So it deleted the destination volume through SPM:

~~~
2023-05-05 18:06:48,618+05 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-61) [58c7e9a0-5da3-4a29-836f-2f4c0ab075b5] Attempting to delete the destination of disk '8476b433-2cf5-4eab-8b50-951d9698c249' of vm '13456b7f-5d75-44f0-a15b-8b273c272840'

2023-05-05 18:06:48,675+05 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-61) [717dc32e] START, DeleteImageGroupVDSCommand( DeleteImageGroupVDSCommandParameters:{storagePoolId='71cf243c-edec-11eb-aa8c-002b6a01557c', ignoreFailoverLimit='false', storageDomainId='1bfb1005-b98d-4592-9c1d-5c04292584ed', imageGroupId='8476b433-2cf5-4eab-8b50-951d9698c249', postZeros='false', discard='false', forceDelete='false'}), log id: 7da8ea9
~~~

However, it is not deactivating the LV in the host where the VM was running before deleting the LV. So it left a dm device in this host mapping to the old blocks:

~~~
dmsetup table|grep 2c2aa5b9
1bfb1005--b98d--4592--9c1d--5c04292584ed-2c2aa5b9--1805--4336--9978--52a17aecfcd9: 0 5242880 linear 253:13 331089920
~~~

Now, if we add a new disk, the LVM will allocate the same segments since LVM point of view, these segments are free:

~~~
offset is the same 331089920 for the new lv 44e5bbc2:
 
dmsetup table|grep 44e5bbc2
1bfb1005--b98d--4592--9c1d--5c04292584ed-44e5bbc2--b1fd--4198--907f--53cde3c5cff8: 0 6291456 linear 253:5 331089920
~~~

So this ends up in two LVs mapping to the same blocks of the underlying LUN.
 
Migrate the first VM disk again and if it's successful, VM will end up using the old mapped sectors. Although the VM will be using 2c2aa5b9, it will see the contents of the LV 44e5bbc2. 

The data of the other disk can easily get corrupted if this VMs writes any data in it.

Version-Release number of selected component (if applicable):

rhvm-4.5.3.7-1.el8ev.noarch

How reproducible:

100%

Steps to Reproduce:

1. Migrate a disk to a block storage domain. Make sure that the VM is _not_ running in the SPM host.

2. To fail the diskreplicatefinish, abort the block job manually during the migration.

~~~
# virsh  blockjob <vm-name> <disk-name> --abort
~~~

3. The engine will remove the destination LV. Confirm that the dm device of destination LV is not removed in the host where the VM is running:

~~~
# dmsetup table |grep lv
~~~

4. Create a new disk. Depending on the available segments, LVM may allocate the same segment blocks for the new LV.

5. Migrate the disk in [1] again. After the migration, the VM will end up seeing the contents from the new disk.


Actual results:

LV is not deactivated in the VM's host after a failed live storage migration may cause data corruption

Expected results:

Before deleting the LV through SPM host, it should deactivate the LV on the host where the VM was running.

Additional info:

Comment 10 Arik 2023-06-26 19:44:11 UTC

I looked at this bug together with Benny today
It's the same scenario that was verified in https://bugzilla.redhat.com/show_bug.cgi?id=1542423#c4 but back then we concentrated on removing the disk from the destination storage domain in order to allow initiating another live storage migration to the same destination storage domain, and didn't pay attention to the fact that the destination volume on the host is not teared down.
We see how it may cause data corruption if the blocks we read on the destination storage domain contain data from the previous live storage migration and the fix seems trivial - to call tear-down before returning the ReplicationNotInProgress exception on the host

Comment 15 Shir Fishbain 2023-09-07 17:25:43 UTC

Verified by automation regression runnings (Tier1, Tier2, and Tier3)

versions:
ovirt-engine-4.5.3.9-1.el8ev.noarch
vdsm-4.50.3.9-1.el8ev.x86_64

Comment 17 errata-xmlrpc 2023-09-11 10:15:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (vdsm bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:5038