Bug 1690475

Summary: When a live storage migration fails, the auto generated snapshot does not get removed
Product: Red Hat Enterprise Virtualization Manager
Reporter: Steffen Froemer <sfroemer>
Component: ovirt-engine
Assignee: Benny Zlotnik <bzlotnik>
Status: CLOSED ERRATA
QA Contact: Ilan Zuckerman <izuckerm>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 4.2.7
CC: aefrat, bzlotnik, eshames, lsurette, mtessun, srevivo, tnisan, ycui
Target Milestone: ovirt-4.4.0
Keywords: ZStream
Target Release: ---
Flags: lsvaty: testing_plan_complete-
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: rhv-4.4.0-28
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1702597 (view as bug list)
Environment:
Last Closed: 2020-08-04 13:16:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1785939, 1787378
Bug Blocks: 1702597

Description Steffen Froemer 2019-03-19 14:32:07 UTC
Description of problem:
During a live storage migration of a disk from one storage domain to another (both backed by FC), several steps take place.
One of the first steps is to create a snapshot of the source vdisk, named "auto generated snapshot for migration".
If the migration fails (in our case due to broken paths to the destination SD), the "auto generated snapshot" does not get removed.

Version-Release number of selected component (if applicable):


How reproducible:
always

Steps to Reproduce:
(1) Trigger a live storage migration of a disk (see the SDK sketch below)
(2) Wait until the snapshot is created and disk activity is visible on the destination SD
(3) Cut access to the destination SD (e.g. pull the cable)
(4) The task in RHV fails
(5) The snapshot from step (2) is still there and does not get removed
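
For reference, step (1) can also be scripted against the engine REST API with the oVirt Python SDK (ovirtsdk4); this is only a minimal sketch, with placeholder engine URL, credentials, disk name and target storage domain (not the exact setup used here):

# Sketch: trigger live storage migration of an attached disk via ovirtsdk4.
# Connection details and object names below are placeholders.
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    ca_file='ca.pem',
)

disks_service = connection.system_service().disks_service()
disk = disks_service.list(search='name=vm01_Disk1')[0]

# Moving a disk that is attached to a running VM triggers live storage
# migration, which starts by creating the auto-generated snapshot.
disks_service.disk_service(disk.id).move(
    storage_domain=types.StorageDomain(name='SD2'),
)

connection.close()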

Actual results:
The snapshot is left in place and keeps blocking storage.

Expected results:
RHV correctly detects that the migration did not take place and cleans up the auto-generated snapshot automatically.

Additional info:

Comment 1 Tal Nisan 2019-03-21 13:22:17 UTC
Benny, I recall we have an RFE for this issue

Comment 2 Benny Zlotnik 2019-03-21 13:51:29 UTC
Not sure, I think we have an RFE for removing it if the VM was shut down

I am not entirely sure at which stage the cable was pulled.
Live storage migration consists of:
1. Create a snapshot 
2. Create image placeholder
3. Start replication
4. sync
5. finish replication
6. live merge

After stage 2 and until the end of stage 5, the "snapshot" is present on both the source and the destination; if the destination is blocked we can't really clean it up, and it will require manual intervention (see the sketch below).
Though we can add a best-effort attempt to remove the auto-generated snapshot after failures.
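
For illustration, the manual intervention mentioned above can be scripted with the oVirt Python SDK (ovirtsdk4) once the destination is reachable again; a minimal sketch, assuming a placeholder VM name and matching the leftover snapshot by its auto-generated description (the exact description string may differ between versions):

# Sketch: remove a leftover auto-generated LSM snapshot via ovirtsdk4.
# Connection details, the VM name and the description match are assumptions.
import ovirtsdk4 as sdk

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    ca_file='ca.pem',
)

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=vm01')[0]
snapshots_service = vms_service.vm_service(vm.id).snapshots_service()

for snapshot in snapshots_service.list():
    # Identify the auto-generated snapshot only by its description here.
    if snapshot.description and 'Auto-generated' in snapshot.description:
        snapshots_service.snapshot_service(snapshot.id).remove()

connection.close()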

Comment 3 Steffen Froemer 2019-03-22 08:57:30 UTC
(In reply to Benny Zlotnik from comment #2)
> Not sure, I think we have an RFE for removing it if the VM was shut down
> 
> I am not entirely sure at which stage the cable was pulled.

In my case, the "pull cable" was caused by the issue described in [1].
The scenario was as follows:

The vDisk of the VM was to be moved from SD1 to SD2. The snapshot was created on SD1 and the migration started. The migration failed due to [1] and the snapshot was not deleted.

The expectation is that in this case the automatically created snapshot is removed.
I don't know in which stage the migration failed, but let me know how I can help with additional logs to clarify this.


[1]: https://access.redhat.com/solutions/3086271

Comment 4 Benny Zlotnik 2019-03-26 10:26:08 UTC
I see, I will add a best-effort attempt to remove the auto-generated snapshot.

Comment 6 Daniel Gur 2019-08-28 13:15:18 UTC
sync2jira

Comment 7 Daniel Gur 2019-08-28 13:20:21 UTC
sync2jira

Comment 8 RHEL Program Management 2019-12-13 11:36:47 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 9 RHV bug bot 2019-12-13 13:14:24 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 10 RHV bug bot 2019-12-20 17:44:16 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 11 RHV bug bot 2020-01-08 14:48:38 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 12 RHV bug bot 2020-01-08 15:15:16 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 13 RHV bug bot 2020-01-24 19:50:32 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 15 Daniel 2020-03-18 09:41:16 UTC
Due to 1785939 I'm blocked and can't verify the bug yet.

working version:
4.4.0-0.24.master.el8ev | Red Hat Enterprise Linux release 8.2 Beta (Ootpa)

bug reproduced:
100%

steps:
1. Create an NFS / Gluster disk and attach it to a VM
2. Live-migrate (LSM) the disk to a different NFS / Gluster storage domain

Expected:
LSM should work properly and the disk should be migrated.

Actual:
An error is raised when trying to LSM in any of the following cases:
nfs -> nfs
nfs -> gluster
gluster -> gluster
gluster -> nfs

Engine Logs:
2020-03-17 12:38:59,268+02 INFO  [org.ovirt.engine.core.bll.StorageJobCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-6) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Command CopyData id: '19240136-d097-4416-b981-6deaee59c48a': execution was completed, the command status is 'FAILED'
2020-03-17 12:38:59,544+02 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-6) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Command 'LiveMigrateDisk' (id: '4f37c28a-597f-4749-bc0a-a890358b9cf5') waiting on child command id: '0cedd7c9-06cd-417b-8b45-11146bb16134' type:'CopyImageGroupVolumesData' to complete
2020-03-17 12:39:00,557+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CopyDataCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-45) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CopyDataCommand' with failure.
2020-03-17 12:39:00,573+02 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-45) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Command 'CopyImageGroupVolumesData' id: '0cedd7c9-06cd-417b-8b45-11146bb16134' child commands '[19240136-d097-4416-b981-6deaee59c48a]' executions were completed, status 'FAILED'
2020-03-17 12:39:01,587+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CopyImageGroupVolumesDataCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-94) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CopyImageGroupVolumesDataCommand' with failure.
2020-03-17 12:39:01,662+02 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-94) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Command 'LiveMigrateDisk' id: '4f37c28a-597f-4749-bc0a-a890358b9cf5' child commands '[a469ef90-ee0b-46f4-b263-cc6c6ddced92, 2239dbf0-5dff-4384-ae2d-f531b2c9ad72, 0cedd7c9-06cd-417b-8b45-11146bb16134]' executions were completed, status 'FAILED'
2020-03-17 12:39:02,751+02 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand' with failure.
2020-03-17 12:39:02,752+02 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Failed during live storage migration of disk '1bf2203b-1bb6-4d0f-84ac-5cc29fcb0d4c' of vm '0deb41e6-deb6-41c5-9f42-bc3dda1a1238', attempting to end replication before deleting the target disk
2020-03-17 12:39:02,753+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] START, VmReplicateDiskFinishVDSCommand(HostName = host_mixed_1, VmReplicateDiskParameters:{hostId='9804e9ec-b212-40ce-9c8b-934675f1e896', vmId='0deb41e6-deb6-41c5-9f42-bc3dda1a1238', storagePoolId='677451ec-4159-4fdc-8136-dec3fb9d861b', srcStorageDomainId='6543bae2-f959-48a7-8b82-84e2a0a5a1a7', targetStorageDomainId='6543bae2-f959-48a7-8b82-84e2a0a5a1a7', imageGroupId='1bf2203b-1bb6-4d0f-84ac-5cc29fcb0d4c', imageId='af3437c4-9fd3-4ebd-89c3-e3a9014eb206'}), log id: 387ea7c5
2020-03-17 12:39:02,882+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] FINISH, VmReplicateDiskFinishVDSCommand, return: , log id: 387ea7c5
2020-03-17 12:39:02,883+02 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Attempting to delete the target of disk '1bf2203b-1bb6-4d0f-84ac-5cc29fcb0d4c' of vm '0deb41e6-deb6-41c5-9f42-bc3dda1a1238'

Comment 20 Ilan Zuckerman 2020-04-19 14:10:11 UTC
Verified according to the steps that were agreed with DEV:

ovirt-engine-4.4.0-0.33.master.el8ev.noarch
vdsm-4.40.13-1.el8ev.x86_64

1. Trigger LSM for a blank VM, iscsi -> iscsi
2. Grep for 'USER_CREATE_SNAPSHOT_FINISHED_SUCCESS' in the engine log
3. Then wait for "Running command: CloneImageGroupVolumesStructureCommand" (it will come very fast)
4. As soon as it starts, block the connection to the storage on the appropriate vdsm host:
[root@storage-ge5-vdsm1 ~]# iptables -A OUTPUT -d XXXXX -j DROP

5. After a short time, as soon as 'failed' / 'ERROR' appears, release the connection:
[root@storage-ge5-vdsm1 ~]# iptables -D OUTPUT -d XXXXX -j DROP

6. See that the temporary snapshot was removed from the snapshots of that VM (a small SDK check is sketched below).
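
The final check can also be done with the oVirt Python SDK (ovirtsdk4) instead of the UI; a small sketch, with placeholder connection details and VM name:

# Sketch: verify no auto-generated snapshot is left on the VM after the failed LSM.
# Connection details and the VM name are placeholders.
import ovirtsdk4 as sdk

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    ca_file='ca.pem',
)

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=blank_vm')[0]
snapshots = vms_service.vm_service(vm.id).snapshots_service().list()

leftovers = [s.description for s in snapshots
             if s.description and 'Auto-generated' in s.description]
print('Leftover auto-generated snapshots:', leftovers or 'none')

connection.close()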

Comment 24 errata-xmlrpc 2020-08-04 13:16:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247