Bug 1690475 - When a live storage migration fails, the auto generated snapshot does not get removed
Summary: When a live storage migration fails, the auto generated snapshot does not get removed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.2.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ovirt-4.4.0
Assignee: Benny Zlotnik
QA Contact: Ilan Zuckerman
URL:
Whiteboard:
Depends On: 1785939 1787378
Blocks: 1702597
 
Reported: 2019-03-19 14:32 UTC by Steffen Froemer
Modified: 2020-08-04 13:19 UTC
CC List: 8 users

Fixed In Version: rhv-4.4.0-28
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1702597 (view as bug list)
Environment:
Last Closed: 2020-08-04 13:16:55 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-




Links
System ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata RHSA-2020:3247 | 0 | None | None | None | 2020-08-04 13:19:06 UTC
oVirt gerrit 98919 | 0 | None | MERGED | core: attempt to remove the auto-generated snapshot | 2020-08-03 12:39:08 UTC
oVirt gerrit 99481 | 0 | None | MERGED | core: attempt to remove the auto-generated snapshot | 2020-08-03 12:39:08 UTC

Description Steffen Froemer 2019-03-19 14:32:07 UTC
Description of problem:
During a live storage migration of a disk from one storage domain to another (both backed by FC), several steps take place.
One of the first steps is to create a snapshot of the source vdisk, named "auto generated snapshot for migration".
If the migration fails (in our case due to broken paths to the destination SD), the auto-generated snapshot does not get removed.

Version-Release number of selected component (if applicable):


How reproducible:
always

Steps to Reproduce:
(1) Trigger a live storage migration of a disk
(2) Wait until the snapshot is created and disk activity is visible on the destination SD
(3) Cut access to the destination SD (e.g. pull the cable)
(4) The task in RHV fails
(5) The snapshot from step (2) is still there and does not get removed

Actual results:
The snapshot is left in place and continues to block storage.

Expected results:
RHV should detect that the migration did not take place and clean up the auto-generated snapshot automatically.

Additional info:
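For reference, the live storage migration in step (1) can also be triggered through the Python SDK instead of the UI. A minimal sketch, assuming the ovirtsdk4 package; the engine URL, credentials, disk name and target storage domain name below are placeholders, not values from this report:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Connect to the engine API (placeholder URL and credentials).
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)

# Locate the disk attached to the running VM (placeholder disk name).
disks_service = connection.system_service().disks_service()
disk = disks_service.list(search='name=vm_disk1')[0]

# Moving an attached disk while the VM is up performs a live storage
# migration, which creates the auto-generated snapshot described above.
disks_service.disk_service(disk.id).move(
    storage_domain=types.StorageDomain(name='SD2'),
)

connection.close()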

Comment 1 Tal Nisan 2019-03-21 13:22:17 UTC
Benny, I recall we have an RFE for this issue

Comment 2 Benny Zlotnik 2019-03-21 13:51:29 UTC
Not sure, I think we have an RFE for removing it if the VM was shut down.

I am not entirely sure at which stage the cable was pulled.
Live storage migration consists of:
1. Create a snapshot
2. Create an image placeholder
3. Start replication
4. Sync
5. Finish replication
6. Live merge

After stage 2 and until the end of stage 5 the "snapshot" is present on both source and destination, and if the destination is blocked, we can't really clean it up and it will require manual intervention.
Though we can add a best-effort attempt to remove the auto-generated snapshot after failures.
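For reference, until such a best-effort cleanup exists, a leftover auto-generated snapshot can usually be removed manually (via the UI or the SDK) once the destination is reachable again. A minimal sketch, assuming the ovirtsdk4 Python SDK; the VM name is a placeholder and the description substring used to spot the snapshot is an assumption that may differ between versions:

import ovirtsdk4 as sdk

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)

# Find the VM and its snapshots (placeholder VM name).
vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=myvm')[0]
snapshots_service = vms_service.vm_service(vm.id).snapshots_service()

# Best-effort manual cleanup: remove snapshots whose description marks them
# as auto-generated for LSM (assumed description substring).
for snap in snapshots_service.list():
    if 'Auto-generated' in (snap.description or ''):
        snapshots_service.snapshot_service(snap.id).remove()

connection.close()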

Comment 3 Steffen Froemer 2019-03-22 08:57:30 UTC
(In reply to Benny Zlotnik from comment #2)
> Not sure, I think we have an RFE for removing if the VM was shutdown
> 
> I am not entirely sure at which stage the cable is pulled?

In my case, the "pulled cable" was caused by the issue described in [1].
The scenario was as follows:

The vDisk of the VM was to be moved from SD1 to SD2. The snapshot was created on SD1 and the migration started. The migration failed due to [1], and the snapshot was not deleted.

The expectation is that in this case the automatically created snapshot is removed.
I don't know in which state the migration failed, but let me know how I can help with additional logs to clarify this.


[1]: https://access.redhat.com/solutions/3086271

Comment 4 Benny Zlotnik 2019-03-26 10:26:08 UTC
I see, I will add a best-effort attempt to remove the auto-generated snapshot.

Comment 6 Daniel Gur 2019-08-28 13:15:18 UTC
sync2jira

Comment 7 Daniel Gur 2019-08-28 13:20:21 UTC
sync2jira

Comment 8 RHEL Program Management 2019-12-13 11:36:47 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 9 RHV bug bot 2019-12-13 13:14:24 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 10 RHV bug bot 2019-12-20 17:44:16 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 11 RHV bug bot 2020-01-08 14:48:38 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 12 RHV bug bot 2020-01-08 15:15:16 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 13 RHV bug bot 2020-01-24 19:50:32 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 15 Daniel 2020-03-18 09:41:16 UTC
Due to 1785939 I'm blocked and can't verify the bug yet.

working version:
4.4.0-0.24.master.el8ev | Red Hat Enterprise Linux release 8.2 Beta (Ootpa)

bug reproduced:
100%

steps:
1. Create an NFS / Gluster disk and attach it to a VM
2. Perform LSM to a different NFS / Gluster storage domain

Expected:
LSM should work properly and the disk should be migrated.

Actual:
An error is raised when attempting LSM in any of the following cases:
nfs -> nfs
nfs -> gluster
gluster -> gluster
gluster -> nfs

Engine Logs:
2020-03-17 12:38:59,268+02 INFO  [org.ovirt.engine.core.bll.StorageJobCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-6) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Command CopyData id: '19240136-d097-4416-b981-6deaee59c48a': execution was completed, the command status is 'FAILED'
2020-03-17 12:38:59,544+02 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-6) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Command 'LiveMigrateDisk' (id: '4f37c28a-597f-4749-bc0a-a890358b9cf5') waiting on child command id: '0cedd7c9-06cd-417b-8b45-11146bb16134' type:'CopyImageGroupVolumesData' to complete
2020-03-17 12:39:00,557+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CopyDataCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-45) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CopyDataCommand' with failure.
2020-03-17 12:39:00,573+02 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-45) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Command 'CopyImageGroupVolumesData' id: '0cedd7c9-06cd-417b-8b45-11146bb16134' child commands '[19240136-d097-4416-b981-6deaee59c48a]' executions were completed, status 'FAILED'
2020-03-17 12:39:01,587+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.CopyImageGroupVolumesDataCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-94) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Ending command 'org.ovirt.engine.core.bll.storage.disk.image.CopyImageGroupVolumesDataCommand' with failure.
2020-03-17 12:39:01,662+02 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-94) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Command 'LiveMigrateDisk' id: '4f37c28a-597f-4749-bc0a-a890358b9cf5' child commands '[a469ef90-ee0b-46f4-b263-cc6c6ddced92, 2239dbf0-5dff-4384-ae2d-f531b2c9ad72, 0cedd7c9-06cd-417b-8b45-11146bb16134]' executions were completed, status 'FAILED'
2020-03-17 12:39:02,751+02 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand' with failure.
2020-03-17 12:39:02,752+02 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Failed during live storage migration of disk '1bf2203b-1bb6-4d0f-84ac-5cc29fcb0d4c' of vm '0deb41e6-deb6-41c5-9f42-bc3dda1a1238', attempting to end replication before deleting the target disk
2020-03-17 12:39:02,753+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] START, VmReplicateDiskFinishVDSCommand(HostName = host_mixed_1, VmReplicateDiskParameters:{hostId='9804e9ec-b212-40ce-9c8b-934675f1e896', vmId='0deb41e6-deb6-41c5-9f42-bc3dda1a1238', storagePoolId='677451ec-4159-4fdc-8136-dec3fb9d861b', srcStorageDomainId='6543bae2-f959-48a7-8b82-84e2a0a5a1a7', targetStorageDomainId='6543bae2-f959-48a7-8b82-84e2a0a5a1a7', imageGroupId='1bf2203b-1bb6-4d0f-84ac-5cc29fcb0d4c', imageId='af3437c4-9fd3-4ebd-89c3-e3a9014eb206'}), log id: 387ea7c5
2020-03-17 12:39:02,882+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] FINISH, VmReplicateDiskFinishVDSCommand, return: , log id: 387ea7c5
2020-03-17 12:39:02,883+02 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [d3e546d9-6349-4b02-accd-a1d1d1e765b4] Attempting to delete the target of disk '1bf2203b-1bb6-4d0f-84ac-5cc29fcb0d4c' of vm '0deb41e6-deb6-41c5-9f42-bc3dda1a1238'

Comment 20 Ilan Zuckerman 2020-04-19 14:10:11 UTC
Verified according to the steps agreed with DEV:

ovirt-engine-4.4.0-0.33.master.el8ev.noarch
vdsm-4.40.13-1.el8ev.x86_64

1. Trigger LSM for a blank VM, iscsi -> iscsi
2. Grep for 'USER_CREATE_SNAPSHOT_FINISHED_SUCCESS' in the engine log
3. Then wait for "Running command: CloneImageGroupVolumesStructureCommand" (it appears very quickly)
4. As soon as it starts, block the connection to the storage on the appropriate vdsm host:
[root@storage-ge5-vdsm1 ~]# iptables -A OUTPUT -d XXXXX -j DROP

5. After a short time, as soon as 'failed' / 'ERROR' appears, release the connection:
[root@storage-ge5-vdsm1 ~]# iptables -D OUTPUT -d XXXXX -j DROP

6. Verify that the temporary snapshot was removed from the VM's snapshots.
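For reference, the check in step 6 can also be scripted. A minimal sketch, assuming the ovirtsdk4 Python SDK; the VM name and the description substring used to identify the temporary snapshot are assumptions:

import ovirtsdk4 as sdk

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)

# List the VM's snapshots after the failed LSM (placeholder VM name).
vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=blank_vm')[0]
snapshots = vms_service.vm_service(vm.id).snapshots_service().list()

# Only the 'Active VM' entry should remain; any auto-generated LSM snapshot
# left behind means the cleanup did not run.
leftovers = [s for s in snapshots if 'Auto-generated' in (s.description or '')]
print('leftover auto-generated snapshots:', [s.description for s in leftovers])

connection.close()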

Comment 24 errata-xmlrpc 2020-08-04 13:16:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247

