This bug has been migrated to another issue tracking site. It has been closed here and may no longer be monitored.

If you would like to receive updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
Bug 2053183 - [MTV] [RHV] Snapshots are not deleted after a successful migration
Summary: [MTV] [RHV] Snapshots are not deleted after a successful migration
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Migration Toolkit for Virtualization
Classification: Red Hat
Component: Controller
Version: 2.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Benny Zlotnik
QA Contact: Amos Mastbaum
Docs Contact: Richard Hoch
URL:
Whiteboard:
Depends On:
Blocks: 2062570
 
Reported: 2022-02-10 16:23 UTC by Amos Mastbaum
Modified: 2023-09-18 04:31 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned To: 2062570
Environment:
Last Closed: 2023-01-23 11:55:37 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Github konveyor/forklift-controller pull 458 (Status: Draft): WIP ovirt: peform snapshot removal sequentially - Last Updated: 2022-08-17 13:06:17 UTC
Red Hat Issue Tracker MTV-349 - Last Updated: 2023-01-23 11:55:36 UTC

Description Amos Mastbaum 2022-02-10 16:23:57 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:
100%


Steps to Reproduce:
1. Run a warm migration from RHV, wait for 2-3 precopies before cutover, and observe the snapshots on the RHV VM.
2. Run the cutover.

Actual results:
Snapshots are not deleted.

Expected results:
1. Snapshots created by the MTV flow should be deleted on the source provider after a successful, failed, or canceled migration.
2. The failure to delete the snapshot appears in the RHV VM events, but it should also be propagated to the MTV API/UI.

Comment 1 Jeff Ortel 2022-02-10 20:22:48 UTC
Engineering has seen issues with snapshot deletion failing or hanging indefinitely.
Errors noted in the RHV events: "Validation of action 'RemoveSnapshot' failed for user admin@internal-authz. Reasons: VAR__TYPE__SNAPSHOT,VAR__ACTION__REMOVE,ACTION_TYPE_FAILED_VM_IS_LOCKED"

Attaching the forklift controller logs and the RHV event log would be helpful.

Comment 2 Amos Mastbaum 2022-02-10 21:09:39 UTC
We do see the failed attempts in the RHV event logs.
I managed to delete a snapshot through the oVirt UI.
The forklift logs and oVirt events will be attached as soon as possible.

Comment 5 kpunwatk 2022-02-14 13:02:05 UTC
Encountered the same behavior for another VM: v2v-karishma-rhel8-2disks2nics-vm.
In addition, since it has a bunch of snapshots, we tried to delete one, but the task never ends. It looks like every snapshot task can lock the VM.
This VM is also on RHV-Blue.

Comment 6 Jeff Ortel 2022-02-15 14:59:48 UTC
I wonder if this is related to the known issue on RHV 4.4.3 regarding locked snapshots.
Recommend re-testing with RHV 4.4.9.

Comment 7 mlehrer 2022-03-14 12:23:55 UTC
The issue is also seen in 4.4.10: the events show "Failed to delete snapshot 'Forklift Operator warm migration precopy' for VM NAME", viewable by searching events with severity = error.

An additional note: in previous versions of RHV we saw, in more extreme edge cases, that VM disks with multiple snapshots could be problematic for RHV snapshot performance, and for the engine's PostgreSQL performance over time when many VMs carry many snapshots. Again, this is more of an edge case for MTV.

It might be helpful to have a KCS article for customers/GSS with a workaround of adding a post-migration hook that removes the orphaned snapshots, or to provide a utility for removing them (e.g. https://docs.ansible.com/ansible/latest/collections/ovirt/ovirt/ovirt_snapshot_module.html).
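
For illustration only, a minimal sketch of such a cleanup utility using ovirt-engine-sdk-python instead of the Ansible module. The engine URL, credentials, and VM name below are placeholders; the snapshot description is taken from the event text above.

import ovirtsdk4 as sdk

# Placeholder connection details; adjust for the real engine.
connection = sdk.Connection(
    url='https://rhv-engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='redacted',
    ca_file='ca.pem',
)
try:
    vms_service = connection.system_service().vms_service()
    vm = vms_service.list(search='name=my-migrated-vm')[0]  # placeholder VM name
    snaps_service = vms_service.vm_service(vm.id).snapshots_service()
    for snap in snaps_service.list():
        # MTV precopy snapshots carry this description (see the event text above).
        if snap.description == 'Forklift Operator warm migration precopy':
            print('Removing leftover snapshot %s' % snap.id)
            snaps_service.snapshot_service(snap.id).remove()
finally:
    connection.close()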

Comment 9 Arik 2022-06-28 18:36:54 UTC
That's interesting. ACTION_TYPE_FAILED_VM_IS_LOCKED can be returned when:
1. Detecting that the VM is in the ImageLocked state, but as far as I can tell RemoveVm doesn't check that; or
2. Trying to lock the VM (and RemoveVm attempts to lock the VM exclusively) while the VM is already locked by one of a few other commands that use this message (ACTION_TYPE_FAILED_VM_IS_LOCKED).

From a brief look it seems like ReduceImage is the primary (or only) suspect here, since it is called by a command that is triggered by RemoveSnapshot.
Amos, can you tell whether this happened on block storage? Does it reproduce on file storage? Can we get the engine.log from RHV when it reproduces?

Comment 11 Amos Mastbaum 2022-07-07 09:51:07 UTC
Arik, we are testing only with a block-storage RHV source.
To check file-based storage, we will have to make some adjustments.
The engine log from the time of the migration is attached.

Comment 12 Arik 2022-07-10 13:52:03 UTC
(In reply to Amos Mastbaum from comment #11)
> Arik, we are testing only with a block-storage RHV source.
> To check file-based storage, we will have to make some adjustments.

Ack, it would be better to have coverage for both types, but block storage is typically trickier to handle, so it's good that you test with it.

> The engine log from the time of the migration is attached.

OK, so Benny looked at it and discovered that my observation in comment 9 was incorrect, since I was looking at the 'master' branch of oVirt, where the message associated with transfer-image has changed.
The log shows that the lock that prevented the execution of remove-snapshot was acquired by transfer-image.
We suspect that there was an attempt to remove the snapshot while another image-transfer operation, which operated on a different layer of the disk, had not yet completed.

Comment 13 Arik 2022-08-09 19:26:10 UTC
Benny, so what did we say about this one - that we need to ensure remove-snapshot is triggered only after all transfer-image commands have completed?
I can see how it can happen: CDI is responsible for finalizing the image transfer, and that is an asynchronous operation, so if MTV triggers remove-snapshot before the callback of TransferImage is invoked, catches the update, and releases the lock, remove-snapshot will fail to acquire the lock.
But I think we're not really prepared for remove-snapshot operations being invoked in parallel for the same VM - the fact that ReduceImage might be invoked in the process and lock the VM is not great. Maybe we should wait at the beginning for the VM to be available (by retrying remove-snapshot until it starts) and then execute the remove-snapshot operations sequentially, roughly as in the sketch below?
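
An illustrative sketch only of that retry-then-sequential idea, written against ovirt-engine-sdk-python; the actual change belongs in the forklift-controller (see the linked PR), and the function name, retry interval, and timeout here are made up:

import time
import ovirtsdk4 as sdk

def remove_precopy_snapshots_sequentially(connection, vm_id,
        description='Forklift Operator warm migration precopy',
        retry_interval=15, timeout=1800):
    # Hypothetical helper: remove the VM's precopy snapshots one at a time,
    # retrying while the VM is still locked (e.g. by an unfinished
    # transfer-image), and waiting for each removal to finish before
    # starting the next, so removals never run in parallel on the same VM.
    vm_service = connection.system_service().vms_service().vm_service(vm_id)
    snaps_service = vm_service.snapshots_service()
    targets = [s.id for s in snaps_service.list() if s.description == description]
    for snap_id in targets:
        snap_service = snaps_service.snapshot_service(snap_id)
        deadline = time.time() + timeout
        # Retry until the engine accepts the removal (i.e. the VM lock is free).
        while True:
            try:
                snap_service.remove()
                break
            except sdk.Error:
                if time.time() > deadline:
                    raise
                time.sleep(retry_interval)
        # Wait until this snapshot is gone before touching the next one.
        while any(s.id == snap_id for s in snaps_service.list()):
            if time.time() > deadline:
                raise TimeoutError('removal of snapshot %s did not complete' % snap_id)
            time.sleep(retry_interval)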

Comment 14 Red Hat Bugzilla 2023-09-18 04:31:51 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

