Description of problem:
During a live storage migration from one storage domain to another (both backed by FC), several steps take place. One of the first steps is to create a snapshot of the source vdisk, named "auto generated snapshot for migration". If the migration fails (in our case due to broken paths to the destination SD), the "auto generated snapshot" does not get removed.

Version-Release number of selected component (if applicable):

How reproducible:
always

Steps to Reproduce:
1. Trigger live migration of storage
2. Wait until the snapshot is created and you can see disk activity on the destination SD
3. Cut access to the destination SD (e.g. pull the cable)
4. The task in RHV fails
5. The snapshot from step 2 is still there and does not get removed

Actual results:
The snapshot is left untouched and blocks storage.

Expected results:
RHV correctly detects that the migration did not take place and cleans up automatically.

Additional info:
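For reference, step 1 of the reproducer can be driven through the oVirt Python SDK (ovirtsdk4). This is only a minimal sketch; the engine URL, credentials, disk name and destination storage domain name are placeholders for this environment:

# Minimal sketch: trigger a live storage migration by moving a running VM's disk.
# Assumptions: engine URL/credentials, disk name and SD name are placeholders.
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,  # lab setup; use ca_file in production
)
disks_service = connection.system_service().disks_service()
disk = disks_service.list(search='name=my_vdisk')[0]
# Moving the disk of a running VM triggers live storage migration,
# which starts with the auto-generated snapshot described above.
disks_service.disk_service(disk.id).move(
    storage_domain=types.StorageDomain(name='destination_sd'),
)
connection.close()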
Benny, I recall we have an RFE for this issue
Not sure; I think we have an RFE for removing it if the VM was shut down.

I am not entirely sure at which stage the cable is pulled. Live storage migration consists of:
1. Create a snapshot
2. Create image placeholder
3. Start replication
4. Sync
5. Finish replication
6. Live merge

After stage 2 and until the end of stage 5 the "snapshot" is present on both source and destination, and if the destination is blocked we can't really clean it up; it will require manual intervention. Though we can add a best-effort attempt to remove the auto-generated snapshot after failures.
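For the manual intervention mentioned above, one possible approach is to script the cleanup against the REST API. A minimal sketch with the Python SDK, assuming the leftover snapshot can be identified by the standard auto-generated description and that the engine URL, credentials and VM name are placeholders:

# Minimal sketch: manually remove a leftover auto-generated LSM snapshot.
# Assumptions: engine URL/credentials and VM name are placeholders; the
# snapshot description is assumed to match what the engine uses for LSM.
import ovirtsdk4 as sdk

AUTO_DESC = 'Auto-generated for Live Storage Migration'  # assumed description

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)
vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=my_vm')[0]
snapshots_service = vms_service.vm_service(vm.id).snapshots_service()
for snap in snapshots_service.list():
    if snap.description == AUTO_DESC:
        # Removal only succeeds once the failed LSM commands have ended
        # and the image is no longer locked.
        snapshots_service.snapshot_service(snap.id).remove()
connection.close()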
(In reply to Benny Zlotnik from comment #2)
> Not sure, I think we have an RFE for removing if the VM was shutdown
>
> I am not entirely sure at which stage the cable is pulled?

In my case the "pull cable" was caused by the issue described in [1]. The scenario was as follows: the vDisk of the VM should be moved from SD1 -> SD2. The snapshot was created on SD1 and the migration started. The migration failed due to [1] and the snapshot was not deleted. The expectation is that in this case the automatically created snapshot is removed automatically. I don't know in which state the migration failed, but let me know how I can help with additional logs to clarify this.

[1]: https://access.redhat.com/solutions/3086271
I see, I will add a best-effort attempt to remove the auto-generated snapshot.
Due to 1785939 I'm blocked and can't verify the bug yet.

Working version: 4.4.0-0.24.master.el8ev | Red Hat Enterprise Linux release 8.2 Beta (Ootpa)
Bug reproduced: 100%

Steps:
1. Create an nfs \ gluster disk and attach it to a VM
2. Do LSM to a different nfs \ gluster domain

Expected: LSM should work properly and the disk should be migrated.

Actual: An error is raised when trying LSM in any of the following cases:
nfs -> nfs
nfs -> gluster
gluster -> gluster
gluster -> nfs

Engine logs:

2020-03-17 12:38:59,268+02 INFO  [org.ovirt.engine.core.bll.StorageJobCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-6) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Command CopyData id: '19240136-d097-4416-b981-6deaee59c48a': execution was completed, the command status is 'FAILED'
2020-03-17 12:38:59,544+02 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-6) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Command 'LiveMigrateDisk' (id: '4f37c28a-597f-4749-bc0a-a890358b9cf5') waiting on child command id: '0cedd7c9-06cd-417b-8b45-11146bb16134' type:'CopyImageGroupVolumesData' to complete
2020-03-17 12:39:00,557+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CopyDataCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-45) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CopyDataCommand' with failure.
2020-03-17 12:39:00,573+02 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-45) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Command 'CopyImageGroupVolumesData' id: '0cedd7c9-06cd-417b-8b45-11146bb16134' child commands '[19240136-d097-4416-b981-6deaee59c48a]' executions were completed, status 'FAILED'
2020-03-17 12:39:01,587+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CopyImageGroupVolumesDataCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-94) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CopyImageGroupVolumesDataCommand' with failure.
2020-03-17 12:39:01,662+02 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-94) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Command 'LiveMigrateDisk' id: '4f37c28a-597f-4749-bc0a-a890358b9cf5' child commands '[a469ef90-ee0b-46f4-b263-cc6c6ddced92, 2239dbf0-5dff-4384-ae2d-f531b2c9ad72, 0cedd7c9-06cd-417b-8b45-11146bb16134]' executions were completed, status 'FAILED'
2020-03-17 12:39:02,751+02 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand' with failure.
2020-03-17 12:39:02,752+02 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Failed during live storage migration of disk '1bf2203b-1bb6-4d0f-84ac-5cc29fcb0d4c' of vm '0deb41e6-deb6-41c5-9f42-bc3dda1a1238', attempting to end replication before deleting the target disk
2020-03-17 12:39:02,753+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] START, VmReplicateDiskFinishVDSCommand(HostName = host_mixed_1, VmReplicateDiskParameters:{hostId='9804e9ec-b212-40ce-9c8b-934675f1e896', vmId='0deb41e6-deb6-41c5-9f42-bc3dda1a1238', storagePoolId='677451ec-4159-4fdc-8136-dec3fb9d861b', srcStorageDomainId='6543bae2-f959-48a7-8b82-84e2a0a5a1a7', targetStorageDomainId='6543bae2-f959-48a7-8b82-84e2a0a5a1a7', imageGroupId='1bf2203b-1bb6-4d0f-84ac-5cc29fcb0d4c', imageId='af3437c4-9fd3-4ebd-89c3-e3a9014eb206'}), log id: 387ea7c5
2020-03-17 12:39:02,882+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] FINISH, VmReplicateDiskFinishVDSCommand, return: , log id: 387ea7c5
2020-03-17 12:39:02,883+02 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Attempting to delete the target of disk '1bf2203b-1bb6-4d0f-84ac-5cc29fcb0d4c' of vm '0deb41e6-deb6-41c5-9f42-bc3dda1a1238'
Verified according to steps agreed with DEV:

ovirt-engine-4.4.0-0.33.master.el8ev.noarch
vdsm-4.40.13-1.el8ev.x86_64

1. Trigger LSM for a blank VM, iscsi -> iscsi
2. Grep for 'USER_CREATE_SNAPSHOT_FINISHED_SUCCESS' in the engine log
3. Then wait for "Running command: CloneImageGroupVolumesStructureCommand" (it will come very fast)
4. As soon as it starts, block the connection to the storage on the appropriate vdsm host:
   [root@storage-ge5-vdsm1 ~]# iptables -A OUTPUT -d XXXXX -j DROP
5. After a short time, as soon as 'failed' / 'ERROR' appears, release the connection:
   [root@storage-ge5-vdsm1 ~]# iptables -D OUTPUT -d XXXXX -j DROP
6. See that the temporary snapshot was removed from the snapshots of that VM (see the sketch below).
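For step 6, a minimal sketch of checking the snapshot list through the Python SDK; the engine URL, credentials, VM name and the assumed auto-generated snapshot description are placeholders:

# Minimal sketch: poll until the auto-generated LSM snapshot is gone.
# Assumptions: engine URL/credentials and VM name are placeholders; the
# snapshot is identified by its description, assumed to be the standard one.
import time
import ovirtsdk4 as sdk

AUTO_DESC = 'Auto-generated for Live Storage Migration'

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)
vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=blank_vm')[0]
snapshots_service = vms_service.vm_service(vm.id).snapshots_service()

for _ in range(60):  # up to ~10 minutes
    leftovers = [s for s in snapshots_service.list()
                 if s.description == AUTO_DESC]
    if not leftovers:
        print('auto-generated snapshot was cleaned up')
        break
    time.sleep(10)
else:
    print('auto-generated snapshot is still present')
connection.close()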
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:3247