Bug 1920911

Summary: Migration fails when Indirect Image Migration and Direct Volume Migration are configured at the same time
Product: Migration Toolkit for Containers
Component: General
Version: 1.4.0
Target Release: 1.4.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: medium
Status: CLOSED ERRATA
Reporter: Sergio <sregidor>
Assignee: Dylan Murray <dymurray>
QA Contact: Xin jiang <xjiang>
Docs Contact: Avital Pinnick <apinnick>
CC: chezhang, ernelson, rjohnson, rpattath, whu, xjiang
Type: Bug
Last Closed: 2021-02-11 12:55:27 UTC

Description Sergio 2021-01-27 09:16:00 UTC
Description of problem:
When a migration involving internal images and PVCs is executed using Indirect Image Migration together with Direct Volume Migration, the migration fails in the StageBackup phase.

Version-Release number of selected component (if applicable):
MTC 1.4.0
SOURCE CLUSTER: AWS OCP3.11 (storage class gp2)
TARGET CLUSTER: AWS OCP4.5 (storage class gp2)
REPLICATION REPOSITORY: AWS S3

How reproducible:
Always

Steps to Reproduce:
1. In the source cluster, deploy an application that uses internal images and PVCs:

oc new-project bztest
oc -n bztest new-app --template django-psql-persistent
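
As a quick sanity check (not part of the original report), something like the following confirms that the application actually provisioned a PVC and pushed images to the internal registry before the plan is created:

oc -n bztest get pvc
oc -n bztest get imagestreams
oc -n bztest get pods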

2. Create a migration plan for this namespace.

Select:
- Indirect Image Migration
- Direct Volume Migration
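
For reference, these UI selections correspond to boolean fields on the MigPlan resource. A minimal sketch, assuming the spec field names indirectImageMigration and indirectVolumeMigration and using placeholder plan/cluster/storage names:

apiVersion: migration.openshift.io/v1alpha1
kind: MigPlan
metadata:
  name: bztest-plan                  # hypothetical name
  namespace: openshift-migration
spec:
  srcMigClusterRef:
    name: ocp3-source                # placeholder source MigCluster
    namespace: openshift-migration
  destMigClusterRef:
    name: host
    namespace: openshift-migration
  migStorageRef:
    name: aws-s3                     # placeholder MigStorage (replication repository)
    namespace: openshift-migration
  namespaces:
    - bztest
  indirectImageMigration: true       # "Indirect Image Migration" selected
  indirectVolumeMigration: false     # false => Direct Volume Migration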

3. Execute the migration plan. Do not quiesce the pods.
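
The migration itself can be triggered from the UI or by creating a MigMigration that references the plan. A rough sketch with quiescing disabled (hypothetical names, assuming the migPlanRef, stage, and quiescePods spec fields):

apiVersion: migration.openshift.io/v1alpha1
kind: MigMigration
metadata:
  name: bztest-migration             # hypothetical name
  namespace: openshift-migration
spec:
  migPlanRef:
    name: bztest-plan
    namespace: openshift-migration
  stage: false                       # final (cutover) migration, not a stage-only run
  quiescePods: false                 # "Do not quiesce the pods"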

Actual results:
The StageBackup phase fails, and an error is reported in the backup.

In the backup logs we can see this error:

time="2021-01-26T14:07:18Z" level=error msg="Error backing up item" backup=openshift-migration/8bbf2e00-5fdf-11eb-938b-f5eff88f2b85-5qp9d error="error getting volume info: rpc error: code = Unknown desc = InvalidVolume.NotFound: The volume 'vol-0d3f70c4c7ee100c9' does not exist.\n\tstatus code: 400, request id: 6a072926-0c82-4718-ad4a-dfcfd4bc56d2" logSource="pkg/backup/backup.go:455" name=postgresql


Expected results:
There should be no errors.

Additional info:
If MCG (NooBaa) is used as the replication repository instead of AWS S3, a different error is reported:

time="2021-01-26T14:00:45Z" level=error msg="Error backing up item" backup=openshift-migration/a39dfa70-5fde-11eb-938b-f5eff88f2b85-q7gz5 error="error getting volume info: rpc error: code = Unknown desc = AuthFailure: AWS was not able to validate the provided access credentials\n\tstatus code: 401, request id: 1b434c64-0846-4003-956b-3a8c832b0cff" logSource="pkg/backup/backup.go:455" name=postgresql

Comment 1 Erik Nelson 2021-01-27 14:27:13 UTC
In 4.x -> 4.x it does not seem to happen, though; @xjiang cannot reproduce it, at least in 4.4 -> 4.7.

Comment 2 Dylan Murray 2021-01-27 18:29:53 UTC
This issue occurs when a user has any volumes backed by a cloud provider with a registered Velero snapshot plugin, regardless of whether `snapshot` was actually selected.

I am updating the code so that the stage backup only includes PVCs that actually requested a snapshot.
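
For context, the per-PVC choice is recorded in the MigPlan's persistentVolumes list: entries whose selected action is `snapshot` are the ones that should land in the stage backup, while `copy` volumes are handled by Direct Volume Migration. An illustrative fragment (field names as I understand the MigPlan API; the second entry and all values are placeholders, not taken from this bug):

spec:
  persistentVolumes:
    - name: pvc-example-1            # placeholder PV name
      pvc:
        name: postgresql
        namespace: bztest
      selection:
        action: copy                 # handled by Direct Volume Migration; should not be in the stage backup
        copyMethod: filesystem
    - name: pvc-example-2            # hypothetical second volume
      pvc:
        name: other-claim
        namespace: bztest
      selection:
        action: snapshot             # only volumes like this one should be included in the stage backup
        copyMethod: snapshot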

Comment 6 Sergio 2021-02-02 13:02:47 UTC
Verified using MTC 1.4.0, 3.11 -> 4.3 AWS, AWS S3

openshift-migration-rhel7-operator@sha256:622e42cef37e3e445d04c0c7f28455b322ed5ddb11b0063c2af9950de09121ab
    - name: MIG_CONTROLLER_REPO
      value: openshift-migration-controller-rhel8@sha256
    - name: MIG_CONTROLLER_TAG
      value: 5590dc251338f1d909bb6c76722d251c5de114c272d6425455549623a5472c4d
    - name: VELERO_TAG
      value: 8f2737eb2a9245b945f08007459c3fb7cd304901cadaaff3a673d88e5980c6b5
    - name: VELERO_PLUGIN_REPO
      value: openshift-velero-plugin-rhel8@sha256
    - name: VELERO_PLUGIN_TAG
      value: 2398f40ec877039f3216702c31ea2881f5618f0580df0adcdee2b79e0d99ee57
    - name: VELERO_AWS_PLUGIN_REPO
      value: openshift-migration-velero-plugin-for-aws-rhel8@sha256
    - name: VELERO_AWS_PLUGIN_TAG
      value: df442c91afdda47807f61a5749c4e83b7bdafba107b831c86f28c21ae74f281f
    - name: VELERO_GCP_PLUGIN_REPO
      value: openshift-migration-velero-plugin-for-gcp-rhel8@sha256
    - name: VELERO_GCP_PLUGIN_TAG
      value: 2ec9701726854f62c7acea2059492f1343ee8579aa5721e751593ea953b91dc5
    - name: VELERO_AZURE_PLUGIN_REPO
      value: openshift-migration-velero-plugin-for-microsoft-azure-rhel8@sha256
    - name: VELERO_AZURE_PLUGIN_TAG
      value: b8db59eb4b9a2d4748142e6e435dcfbf3187032b64302b88affbff98cb728e3c


Moved to VERIFIED.

Comment 8 errata-xmlrpc 2021-02-11 12:55:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Migration Toolkit for Containers (MTC) tool image release advisory 1.4.0), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5329