Bug 1767515 - Failed migrations can leave stale PVCs in target namespace
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Migration Toolkit for Containers
Classification: Red Hat
Component: General
Version: 1.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 1.6.0
Assignee: Derek Whatley
QA Contact: Xin jiang
Docs Contact: Avital Pinnick
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-10-31 15:56 UTC by Dylan Murray
Modified: 2021-08-24 18:52 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-24 18:52:57 UTC
Target Upstream Version:
Embargoed:



Description Dylan Murray 2019-10-31 15:56:53 UTC
Description of problem:
When a migration fails and a user attempts to remigrate to the same namespace, stale PVCs bound to 'Lost' persistent volumes may be left in the target namespace. This prevents Velero from restoring the application properly on a subsequent migration, since Velero does not overwrite resources that already exist in the cluster.
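Velero's skip-on-existing behavior, which is what turns the stale PVC into a blocker, can be illustrated with a minimal Python sketch (the resource names and shapes here are hypothetical, not Velero's actual API):

```python
def restore_resources(backup_items, existing_names):
    """Sketch of a restore loop that, like Velero, skips any resource
    whose name already exists in the target namespace."""
    restored, skipped = [], []
    for item in backup_items:
        if item in existing_names:
            skipped.append(item)   # the stale 'Lost' PVC is never re-restored
        else:
            restored.append(item)
    return restored, skipped

# "app-data" stands in for a stale PVC left behind by the earlier failed migration:
restored, skipped = restore_resources(
    ["app-data", "app-config"], existing_names={"app-data"}
)
```

Because the stale PVC is skipped rather than replaced, it remains bound to a deleted volume and the restored application can never mount it.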

Version-Release number of selected component (if applicable):
4.2.0

How reproducible:
It depends. If one forces specific failures to replicate it, it is 100% reproducible.

How likely an actual customer is to hit this depends on the steps they have taken with previous migrations and on how those migrations failed.

Steps to Reproduce:
1. Run a migration with `copy` selected for a PV
2. Have the migration fail on stage pod restore (to force this, reference an invalid image)
3. Delete migplan and migmigration CRs
4. Do not clean up target namespace (aside from controller automatically deleting existing stage pods)
5. Run a new migration with same config

Actual results:
Migration will stall at the `CreateStageRestore` phase. Inspecting the PVCs in the target namespace, you will see the PVC in the 'Lost' state because the bound volume was deleted (the reclaim policy was not 'Retain'). This is caused by the deletion of the pod that was consuming the PVC.

Expected results:
Target namespace is properly cleaned up on failure so that Velero does not skip restore of needed resources on subsequent migrations.

Additional info:

Comment 1 Scott Seago 2019-11-07 15:15:23 UTC
More generally, failed restores can leave other resources around too, any of which could cause subsequent failures. Should we add a label to *everything* we restore in the general plugin and then delete all resources in the target namespaces (and all cluster-scoped resources) carrying this label, either automatically when there is an explicit failure, or manually if the migration stalls?
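The label-based cleanup described above could be sketched as follows (the label key and resource dictionaries are hypothetical, for illustration only):

```python
MIGRATION_LABEL = "migration.openshift.io/migrated-by"  # hypothetical label key

def select_for_cleanup(resources, migration_uid):
    """Return every resource carrying the migration label, i.e. everything
    the failed migration restored and that should now be deleted."""
    return [
        r for r in resources
        if r.get("labels", {}).get(MIGRATION_LABEL) == migration_uid
    ]

resources = [
    {"name": "app-data",      "labels": {MIGRATION_LABEL: "mig-123"}},
    {"name": "unrelated-pvc", "labels": {}},
]
to_delete = select_for_cleanup(resources, "mig-123")
```

Selecting by a label the migration itself applied ensures pre-existing user resources in the namespace are never touched by the cleanup.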

Comment 3 Erik Nelson 2021-06-29 17:59:09 UTC
Let's verify this as of 1.6.0; we have cleanup steps that run and should ensure these resources are not leaked.

Comment 4 Derek Whatley 2021-08-24 18:23:33 UTC
I am not able to reproduce this failure with the given reproducer steps.

I tried:

---

1. Adjusting the source cluster stage pod image. This caused the migration to fail before anything was created on the target cluster, so it doesn't count as reproduced.

migrationcontroller.spec.migration_stage_image_fqin: quay.io/djwhatle/fake:latest

---

2. Adjusting the target cluster stage pod image. This had no effect on migration success in my test, presumably because the working stage pod image from the source cluster is used. Is there a way to provide a bad stage pod image to the target cluster?

migrationcontroller.spec.migration_stage_image_fqin: quay.io/djwhatle/fake:latest

Comment 5 Derek Whatley 2021-08-24 18:52:57 UTC
This appears to have been fixed by the stage pod cleanup code. I was not able to reproduce the issue.

https://github.com/konveyor/mig-controller/pull/738

