Bug 2053183
| Field | Value |
|---|---|
| Summary | [MTV] [RHV] Snapshots are not deleted after a successful migration |
| Product | Migration Toolkit for Virtualization |
| Component | Controller |
| Status | CLOSED MIGRATED |
| Severity | high |
| Priority | high |
| Version | 2.3.0 |
| Reporter | Amos Mastbaum <amastbau> |
| Assignee | Benny Zlotnik <bzlotnik> |
| QA Contact | Amos Mastbaum <amastbau> |
| Docs Contact | Richard Hoch <rhoch> |
| CC | ahadas, bzlotnik, jortel, kpunwatk, mguetta, mlehrer |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | If docs needed, set a value |
| Cloned As | 2062570 (view as bug list) |
| Last Closed | 2023-01-23 11:55:37 UTC |
| Type | Bug |
| Bug Blocks | 2062570 |
Description
Amos Mastbaum
2022-02-10 16:23:57 UTC
Eng has seen issues with snapshot deletion failing or hanging indefinitely. Errors were noted in the RHV events: "Validation of action 'RemoveSnapshot' failed for user admin@internal-authz. Reasons: VAR__TYPE__SNAPSHOT,VAR__ACTION__REMOVE,ACTION_TYPE_FAILED_VM_IS_LOCKED". Attaching the forklift controller logs and the RHV event log would be helpful. We do see the failed attempts in the RHV event log. I managed to delete a snapshot through the oVirt UI; the forklift logs and oVirt events will be attached as soon as possible.

The same behavior was encountered for another VM: v2v-karishma-rhel8-2disks2nics-vm. In addition, since it has a bunch of snapshots, we tried to delete one, but the task never ends. It looks like any snapshot task can lock the VM. This VM is also on RHV-Blue.

I wonder if this is related to the known issue on RHV 4.4.3 regarding locked snapshots. Recommend re-testing with RHV 4.4.9.

The issue is also seen in 4.4.10: the event "Failed to delete snapshot 'Forklift Operator warm migration precopy' for VM NAME" is viewable by searching events -> severity = error.

An additional note: in previous RHV versions we saw, in more extreme edge cases, that VM disks with multiple snapshots could hurt RHV snapshot performance, and the engine's Postgres performance could degrade over time when many VMs have many snapshots; again, this is more of an edge case for MTV. It might be helpful to have a KCS article for customers/GSS with a workaround of adding a post-migration hook that removes the orphaned snapshots, or to provide a utility for removing them (e.g. https://docs.ansible.com/ansible/latest/collections/ovirt/ovirt/ovirt_snapshot_module.html).

That's interesting: ACTION_TYPE_FAILED_VM_IS_LOCKED can be returned when (1) the engine detects that the VM is in the ImageLocked state, though as far as I can tell RemoveVm doesn't check that; or (2)
trying to lock the VM (and RemoveVm attempts to lock the VM exclusively) while the VM is already locked by one of the few other commands that use this message (ACTION_TYPE_FAILED_VM_IS_LOCKED).

From a brief look, ReduceImage seems to be the primary (or only) suspect here, since it is called by a command that is triggered by RemoveSnapshot.

Amos, can you tell whether it happened on block storage? Does it reproduce on file storage? Can we get engine.log from RHV when it reproduces?

Arik, we are testing only with RHV block storage as the source; to check file storage we will have to make some adjustments. I attached the engine log from the time of the migration.

(In reply to Amos Mastbaum from comment #11)
> Arik, we are testing only with RHV block storage as the source; to check file storage we will have to make some adjustments.

Ack. It would be better to have coverage for both types, but block storage is typically trickier to handle, so it's good that you test with it.

> I attached the engine log from the time of the migration.

OK, so Benny looked at it and discovered that my observation in comment 9 was incorrect: I had looked at the 'master' branch of oVirt, where the message associated with transfer-image has changed. The log shows that the lock that prevented the execution of remove-snapshot was acquired by transfer-image. We suspect there was an attempt to remove the snapshot while another image-transfer operation, operating on a different layer of the disk, had not yet completed.

Benny, so what did we say about this one: that we need to ensure remove-snapshot is triggered only after all transfer-image commands are completed? I can see how it can happen: CDI is responsible for finalizing the image transfer, which is an asynchronous operation, so if MTV triggers remove-snapshot before the callback of TransferImage is invoked, catches the update, and releases the lock, remove-snapshot will fail to acquire the lock.
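The ordering fix implied above, triggering remove-snapshot only after every transfer-image command has completed, can be sketched as a simple polling loop. This is an illustrative sketch, not MTV's actual controller code; `list_active_transfers` and `remove_snapshot` are hypothetical callables standing in for the corresponding oVirt SDK calls:

```python
import time

def remove_snapshots_after_transfers(list_active_transfers, remove_snapshot,
                                     snapshot_ids, poll_interval=5, timeout=600):
    """Wait until no image transfers are active on the VM, then remove
    the given snapshots one by one (sequentially, never in parallel)."""
    deadline = time.monotonic() + timeout
    # Phase 1: wait for all transfer-image commands to complete, since an
    # active transfer holds the VM lock that RemoveSnapshot needs.
    while list_active_transfers():
        if time.monotonic() > deadline:
            raise TimeoutError("image transfers still active; not removing snapshots")
        time.sleep(poll_interval)
    # Phase 2: remove snapshots sequentially; parallel RemoveSnapshot calls
    # can collide on the VM lock (e.g. via ReduceImage).
    for snap_id in snapshot_ids:
        remove_snapshot(snap_id)
```

With the SDK calls stubbed out, the sequencing is easy to exercise: the loop keeps polling while transfers are reported active and only then performs the removals in order.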
But I think we are not really prepared for parallel invocations of remove-snapshot operations on the same VM; the fact that ReduceImage might be invoked in the process and lock the VM is not great. Maybe we should wait at the beginning for the VM to be available (by invoking remove-snapshot until it starts) and then execute the remove-snapshot operations sequentially?

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.
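The closing suggestion in the thread, keep invoking remove-snapshot until the VM lock is released and then run the remaining removals sequentially, could look roughly like the retry loop below. Detecting the lock conflict via the fault message string and the `remove_snapshot` callable are assumptions for illustration; a real client would inspect the typed fault returned by the oVirt SDK:

```python
import time

def remove_snapshot_with_retry(remove_snapshot, snapshot_id,
                               retries=10, delay=5,
                               lock_marker="ACTION_TYPE_FAILED_VM_IS_LOCKED"):
    """Invoke remove-snapshot, retrying while the VM is locked by another
    command (e.g. an in-flight transfer-image or ReduceImage)."""
    for _attempt in range(retries):
        try:
            return remove_snapshot(snapshot_id)
        except Exception as err:  # a real client would catch the SDK's fault type
            if lock_marker not in str(err):
                raise  # not a lock conflict; don't mask real failures
            time.sleep(delay)
    raise RuntimeError(f"VM still locked after {retries} attempts; "
                       f"snapshot {snapshot_id} not removed")
```

Calling this helper once per snapshot, in order, gives the sequential behavior proposed above: the first call absorbs the lock contention, and subsequent removals never run concurrently.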