Bug 2053183
| Field | Value |
|---|---|
| Summary | [MTV] [RHV] Snapshots are not deleted after a successful migration |
| Product | Migration Toolkit for Virtualization |
| Component | Controller |
| Status | CLOSED MIGRATED |
| Severity | high |
| Priority | high |
| Version | 2.3.0 |
| Reporter | Amos Mastbaum <amastbau> |
| Assignee | Benny Zlotnik <bzlotnik> |
| QA Contact | Amos Mastbaum <amastbau> |
| Docs Contact | Richard Hoch <rhoch> |
| CC | ahadas, bzlotnik, jortel, kpunwatk, mguetta, mlehrer |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | If docs needed, set a value |
| Cloned As | 2062570 (view as bug list) |
| Last Closed | 2023-01-23 11:55:37 UTC |
| Type | Bug |
| Bug Blocks | 2062570 |
Description
Amos Mastbaum
2022-02-10 16:23:57 UTC
Eng has seen issues with snapshot deletion failing or hanging indefinitely. Errors were noted in the RHV events: "Validation of action 'RemoveSnapshot' failed for user admin@internal-authz. Reasons: VAR__TYPE__SNAPSHOT,VAR__ACTION__REMOVE,ACTION_TYPE_FAILED_VM_IS_LOCKED". Attaching the forklift controller logs and the RHV event log would be helpful. We do see the failed attempts in the RHV event log. I managed to delete a snapshot through the oVirt UI; the forklift logs and oVirt events will be attached as soon as possible.

The same behavior was encountered for another VM: v2v-karishma-rhel8-2disks2nics-vm. In addition, since it has a bunch of snapshots, we tried to delete one, but the task never ends. It looks like any snapshot task can lock the VM. This VM is also on RHV-Blue.

I wonder if this is related to the known issue on RHV 4.4.3 regarding locked snapshots. Recommend re-testing with RHV 4.4.9.

The issue is also seen in 4.4.10: the event "Failed to delete snapshot 'Forklift Operator warm migration precopy' for VM NAME" is viewable by searching events -> severity = error.

An additional note: in previous RHV versions we saw, in more extreme edge cases, that VM disks with multiple snapshots could hurt RHV snapshot performance, and the engine's Postgres performance could degrade over time when many VMs have many snapshots; again, this is more of an edge case for MTV. It might be helpful to have a KCS article for customers/GSS with a workaround of adding a post-migration hook that removes the orphaned snapshots, or to provide a utility for removing them (e.g. https://docs.ansible.com/ansible/latest/collections/ovirt/ovirt/ovirt_snapshot_module.html).

That's interesting: ACTION_TYPE_FAILED_VM_IS_LOCKED can be returned when (1) the engine detects that the VM is in the ImageLocked state, though as far as I can tell RemoveVm doesn't check that; or (2)
trying to lock the VM (and RemoveVm attempts to lock the VM exclusively) while the VM is already locked by one of the few other commands that use this message (ACTION_TYPE_FAILED_VM_IS_LOCKED).

From a brief look, ReduceImage seems to be the primary (or only) suspect here, since it is called by a command that is triggered by RemoveSnapshot.

Amos, can you tell whether it happened on block storage? Does it reproduce on file storage? Can we get engine.log from RHV when it reproduces?

Arik, we are testing only with RHV block storage as the source; to check file storage we will have to make some adjustments. I attached the engine log from the time of the migration.

(In reply to Amos Mastbaum from comment #11)
> Arik, we are testing only with RHV block storage as the source; to check file storage we will have to make some adjustments.

Ack. It would be better to have coverage for both types, but block storage is typically trickier to handle, so it's good that you test with it.

> I attached the engine log from the time of the migration.

OK, so Benny looked at it and discovered that my observation in comment 9 was incorrect: I had looked at the 'master' branch of oVirt, where the message associated with transfer-image has changed. The log shows that the lock that prevented the execution of remove-snapshot was acquired by transfer-image. We suspect there was an attempt to remove the snapshot while another image-transfer operation, operating on a different layer of the disk, had not yet completed.

Benny, so what did we say about this one: that we need to ensure remove-snapshot is triggered only after all transfer-image commands are completed? I can see how it can happen: CDI is responsible for finalizing the image transfer, which is an asynchronous operation, so if MTV triggers remove-snapshot before the callback of TransferImage is invoked, catches the update, and releases the lock, remove-snapshot will fail to acquire the lock.
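The ordering fix implied above, triggering remove-snapshot only after every transfer-image command has completed, can be sketched as a simple polling loop. This is an illustrative sketch, not MTV's actual controller code; `list_active_transfers` and `remove_snapshot` are hypothetical callables standing in for the corresponding oVirt SDK calls:

```python
import time

def remove_snapshots_after_transfers(list_active_transfers, remove_snapshot,
                                     snapshot_ids, poll_interval=5, timeout=600):
    """Wait until no image transfers are active on the VM, then remove
    the given snapshots one by one (sequentially, never in parallel)."""
    deadline = time.monotonic() + timeout
    # Phase 1: wait for all transfer-image commands to complete, since an
    # active transfer holds the VM lock that RemoveSnapshot needs.
    while list_active_transfers():
        if time.monotonic() > deadline:
            raise TimeoutError("image transfers still active; not removing snapshots")
        time.sleep(poll_interval)
    # Phase 2: remove snapshots sequentially; parallel RemoveSnapshot calls
    # can collide on the VM lock (e.g. via ReduceImage).
    for snap_id in snapshot_ids:
        remove_snapshot(snap_id)
```

With the SDK calls stubbed out, the sequencing is easy to exercise: the loop keeps polling while transfers are reported active and only then performs the removals in order.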
But I think we are not really prepared for parallel invocations of remove-snapshot operations on the same VM; the fact that ReduceImage might be invoked in the process and lock the VM is not great. Maybe we should wait at the beginning for the VM to be available (by invoking remove-snapshot until it starts) and then execute the remove-snapshot operations sequentially?

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.
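The closing suggestion in the thread, keep invoking remove-snapshot until the VM lock is released and then run the remaining removals sequentially, could look roughly like the retry loop below. Detecting the lock conflict via the fault message string and the `remove_snapshot` callable are assumptions for illustration; a real client would inspect the typed fault returned by the oVirt SDK:

```python
import time

def remove_snapshot_with_retry(remove_snapshot, snapshot_id,
                               retries=10, delay=5,
                               lock_marker="ACTION_TYPE_FAILED_VM_IS_LOCKED"):
    """Invoke remove-snapshot, retrying while the VM is locked by another
    command (e.g. an in-flight transfer-image or ReduceImage)."""
    for _attempt in range(retries):
        try:
            return remove_snapshot(snapshot_id)
        except Exception as err:  # a real client would catch the SDK's fault type
            if lock_marker not in str(err):
                raise  # not a lock conflict; don't mask real failures
            time.sleep(delay)
    raise RuntimeError(f"VM still locked after {retries} attempts; "
                       f"snapshot {snapshot_id} not removed")
```

Calling this helper once per snapshot, in order, gives the sequential behavior proposed above: the first call absorbs the lock contention, and subsequent removals never run concurrently.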