Bug 1767515

Summary: Failed migrations can leave stale PVCs in target namespace
Product: Migration Toolkit for Containers
Reporter: Dylan Murray <dymurray>
Component: General
Assignee: Derek Whatley <dwhatley>
Status: CLOSED CURRENTRELEASE
QA Contact: Xin jiang <xjiang>
Severity: medium
Docs Contact: Avital Pinnick <apinnick>
Priority: medium
Version: 1.3.0
CC: ernelson, jmatthew, sseago
Target Milestone: ---   
Target Release: 1.6.0   
Last Closed: 2021-08-24 18:52:57 UTC
Type: Bug

Description Dylan Murray 2019-10-31 15:56:53 UTC
Description of problem:
When a migration fails and a user attempts to remigrate to the same namespace, stale PVCs with 'Lost' persistent volumes can remain in the target namespace. Velero then does not restore the application properly on the subsequent migration, because Velero will not overwrite resources that already exist in the cluster.

Version-Release number of selected component (if applicable):
4.2.0

How reproducible:
It depends. If one replicates it by forcing the specific failures, 100%.

How often an actual customer will hit this really depends on the steps they have taken with previous migrations and how those migrations failed.

Steps to Reproduce:
1. Run a migration with `copy` selected for a PV
2. Have migration fail on stage pod restore (to force this we could reference an invalid image)
3. Delete migplan and migmigration CRs
4. Do not clean up target namespace (aside from controller automatically deleting existing stage pods)
5. Run a new migration with same config

Actual results:
Migration will stall at the `CreateStageRestore` phase. Inspecting the PVCs in this namespace, you will see the 'Lost' state for the PVC because the bound volume was deleted (this requires a reclaim policy other than 'Retain'). This is caused by the deletion of the pod that was consuming the PVC.
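
One way to confirm this state before rerunning a migration is to filter the namespace's PVC list for the 'Lost' phase. The following is a minimal sketch, assuming the input is the parsed output of `kubectl get pvc -n <namespace> -o json`. The function name `find_lost_pvcs` and the sample data are illustrative only and are not part of the migration tooling.

```python
import json


def find_lost_pvcs(pvc_list):
    """Return the names of PVCs whose status.phase is 'Lost'."""
    lost = []
    for item in pvc_list.get("items", []):
        phase = item.get("status", {}).get("phase")
        if phase == "Lost":
            lost.append(item["metadata"]["name"])
    return lost


# Made-up sample shaped like `kubectl get pvc -o json` output.
SAMPLE = json.loads("""
{
  "kind": "List",
  "items": [
    {"kind": "PersistentVolumeClaim",
     "metadata": {"name": "data-stale", "namespace": "demo"},
     "status": {"phase": "Lost"}},
    {"kind": "PersistentVolumeClaim",
     "metadata": {"name": "data-ok", "namespace": "demo"},
     "status": {"phase": "Bound"}}
  ]
}
""")

if __name__ == "__main__":
    for name in find_lost_pvcs(SAMPLE):
        print("stale PVC:", name)
```

Any PVC reported this way would have to be removed from the target namespace before Velero could restore a fresh copy.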

Expected results:
Target namespace is properly cleaned up on failures so that Velero does not skip restore of needed resources on subsequent migrations.

Additional info:

Comment 1 Scott Seago 2019-11-07 15:15:23 UTC
More generally, failed restores can leave other resources around too, any of which could cause subsequent failures. Should we be adding a label to *everything* we restore in the general plugin and then, either automatically when there's an explicit failure, or manually if the migration stalls, delete all resources in the target namespaces (and all cluster resources) with this label?
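
The label-based cleanup suggested here could be sketched roughly as follows. This is only an illustration of the selection step, assuming every restored resource carries a marker label. The label key `example.com/restored-by-migration` and the helper `select_labeled` are hypothetical, since no label name is given in this bug.

```python
# Hypothetical marker label; the bug report does not name one.
LABEL_KEY = "example.com/restored-by-migration"


def select_labeled(resources, label_key, label_value):
    """Return (kind, namespace, name) for each resource labeled label_key=label_value.

    Cluster-scoped resources have no namespace, so that field is None for them.
    """
    selected = []
    for res in resources:
        meta = res.get("metadata", {})
        labels = meta.get("labels") or {}
        if labels.get(label_key) == label_value:
            selected.append((res.get("kind"), meta.get("namespace"), meta["name"]))
    return selected


# Made-up resources: two restored by a migration, one that predates it.
RESOURCES = [
    {"kind": "PersistentVolumeClaim",
     "metadata": {"name": "data", "namespace": "demo",
                  "labels": {LABEL_KEY: "migration-1"}}},
    {"kind": "ConfigMap",
     "metadata": {"name": "settings", "namespace": "demo"}},
    {"kind": "ClusterRole",
     "metadata": {"name": "demo-reader",
                  "labels": {LABEL_KEY: "migration-1"}}},
]

if __name__ == "__main__":
    for kind, namespace, name in select_labeled(RESOURCES, LABEL_KEY, "migration-1"):
        print("would delete:", kind, namespace, name)
```

Selecting on the label value rather than on the namespace keeps pre-existing resources in the target namespace out of the delete set, and it also covers the cluster-scoped resources mentioned above.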

Comment 3 Erik Nelson 2021-06-29 17:59:09 UTC
Let's verify this as of 1.6.0. We now run cleanup steps that should ensure these resources are not leaked.

Comment 4 Derek Whatley 2021-08-24 18:23:33 UTC
I am not able to reproduce this failure with the given reproducer steps.

I tried:

---

1. Adjusting source cluster stage pod image. This caused migration to fail before anything was created on the target cluster, so doesn't count as reproduced. 

migrationcontroller.spec.migration_stage_image_fqin: migration_stage_image_fqin: quay.io/djwhatle/fake:latest

---

2. Adjusting target cluster stage pod image. This had no effect on migration success in my test, presumably because the working stage pod image from the source cluster is used. Is there a way to provide a bad stage pod image to the target cluster?

migrationcontroller.spec.migration_stage_image_fqin: migration_stage_image_fqin: quay.io/djwhatle/fake:latest
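
For anyone repeating these two attempts, the setting above can be expressed as a JSON merge patch body. This is a small sketch that only builds the patch document. It assumes the field sits directly under `spec`, as the path above indicates, and the helper name `stage_image_patch` is illustrative.

```python
import json


def stage_image_patch(image):
    """Build a JSON merge patch that sets spec.migration_stage_image_fqin."""
    return json.dumps({"spec": {"migration_stage_image_fqin": image}})


if __name__ == "__main__":
    # Deliberately invalid image, as used in the attempts above.
    print(stage_image_patch("quay.io/djwhatle/fake:latest"))
```

The resulting string could be passed to `kubectl patch --type merge -p` against the MigrationController resource on whichever cluster is being tested.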

Comment 5 Derek Whatley 2021-08-24 18:52:57 UTC
This appears to have been fixed by the stage pod cleanup code. Was not able to reproduce.

https://github.com/konveyor/mig-controller/pull/738