Bug 2090595 - [MTC] Migration fails because of backup not found
Summary: [MTC] Migration fails because of backup not found
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Migration Toolkit for Containers
Classification: Red Hat
Component: Velero
Version: 1.6.5
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Jason Montleon
QA Contact: ssingla
Docs Contact: Richard Hoch
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-05-26 05:44 UTC by Prasad Joshi
Modified: 2022-07-01 15:42 UTC (History)
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-07-01 15:42:21 UTC
Target Upstream Version:
Embargoed:



Description Prasad Joshi 2022-05-26 05:44:10 UTC
Description of problem: Migration fails at the restore step because the initial backup is not found. We have only rarely seen this issue in the QE environment.


Version-Release number of selected component (if applicable):
Source GCP 3.11 MTC 1.5.5
Target GCP 4.10 MTC 1.6.5


How reproducible:
Rarely


Steps to Reproduce:
1. Create a new project in the source cluster
$ oc new-project test
2. Execute a migration of the project (a sketch of the MigMigration CR is shown below)
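For step 2, the migration can be started from the MTC web console or by creating a MigMigration CR directly. The following is only a rough sketch: the apiVersion, kind, and spec fields mirror the status output shown further below, while the metadata name is an illustrative placeholder rather than the exact object from this run.

$ cat <<EOF | oc create -f -
apiVersion: migration.openshift.io/v1alpha1
kind: MigMigration
metadata:
  # illustrative name; the web console generates its own
  name: ocp-24686-project-mig-example
  namespace: openshift-migration
spec:
  migPlanRef:
    name: ocp-24686-project-migplan-1653497851
    namespace: openshift-migration
  stage: false
EOF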


Actual results: Migration fails with a "Backup not found" error.


$ oc get migmigration -n openshift-migration -o yaml
spec:
  migPlanRef:
    name: ocp-24686-project-migplan-1653497851
    namespace: openshift-migration
  stage: false
status:
  conditions:
  - category: Advisory
    durable: true
    lastTransitionTime: "2022-05-25T16:59:50Z"
    message: 'The migration has failed.  See: Errors.'
    reason: EnsureInitialBackupReplicated
    status: "True"
    type: Failed
  errors:
  - Backup not found
  itinerary: Failed
  observedDigest: 45590e63300f11509a03d0cc341ede281c293b6522cb3c5c031315303610d70a
  phase: Completed
  pipeline:
  - completed: "2022-05-25T16:59:12Z"
    message: Completed
    name: Prepare
    started: "2022-05-25T16:58:05Z"
  - completed: "2022-05-25T16:59:16Z"
    message: Completed
    name: Backup
    progress:
    - 'Backup openshift-migration/ocp-24686-project-mig-1653497851-initial-s5cxn: 16 out of estimated total of 16 objects backed up (2s)'
    started: "2022-05-25T16:59:12Z"
  - completed: "2022-05-25T16:59:16Z"
    message: Completed
    name: DirectImage
    progress:
    - 0 total ImageStreams; 0 running; 0 successful; 0 failed
    started: "2022-05-25T16:59:16Z"
  - completed: "2022-05-25T16:59:51Z"
    failed: true
    message: Failed
    name: Restore
    started: "2022-05-25T16:59:16Z"
  - completed: "2022-05-25T16:59:51Z"
    message: Completed
    name: Cleanup
    started: "2022-05-25T16:59:51Z"
  - completed: "2022-05-25T16:59:51Z"
    message: Completed
    name: CleanupHelpers
    started: "2022-05-25T16:59:51Z"
  startTimestamp: "2022-05-25T16:58:04Z"

$ oc -n openshift-migration logs migration-log-reader-6bbc78f6fd-g5j5h -c plain | grep '"migMigration":"ocp-24686-project-mig-1653497851"'

openshift-migration migration-controller-6dc67fbbd5-z68zb mtc {"level":"info","ts":1653497987.857594,"logger":"migration","msg":"Checking if Velero Backup has been replicated to destination cluster","migMigration":"ocp-24686-project-mig-1653497851","phase":"EnsureInitialBackupReplicated","backup":"openshift-migration/ocp-24686-project-mig-1653497851-initial-s5cxn"}
openshift-migration migration-controller-6dc67fbbd5-z68zb mtc {"level":"info","ts":1653497987.8629704,"logger":"migration","msg":"Initial Velero Backup has not yet been replicated to target cluster by Velero. Waiting","migMigration":"ocp-24686-project-mig-1653497851","phase":"EnsureInitialBackupReplicated","backup":"openshift-migration/ocp-24686-project-mig-1653497851-initial-s5cxn"}
openshift-migration migration-controller-6dc67fbbd5-z68zb mtc {"level":"info","ts":1653497990.875281,"logger":"migration","msg":"[RUN] (Step 41/48) Waiting for initial Velero backup replication to target cluster.","migMigration":"ocp-24686-project-mig-1653497851","phase":"EnsureInitialBackupReplicated"}
openshift-migration migration-controller-6dc67fbbd5-z68zb mtc {"level":"info","ts":1653497990.9641583,"logger":"migration","msg":"Phase execution failed.","migMigration":"ocp-24686-project-mig-1653497851","phase":"EnsureInitialBackupReplicated","phaseDescription":"Waiting for initial Velero backup replication to target cluster.","error":"Backup not found"}
openshift-migration migration-controller-6dc67fbbd5-z68zb mtc {"level":"error","ts":1653497990.9641836,"logger":"migration","msg":"","migMigration":"ocp-24686-project-mig-1653497851","error":"Backup not found","errorVerbose":"Backup not found\ngithub.com/konveyor/mig-controller/pkg/controller/migmigration.(*Task).Run\n\t/remote-source/app/pkg/controller/migmigration/task.go:917\ngithub.com/konveyor/mig-controller/pkg/controller/migmigration.(*ReconcileMigMigration).migrate\n\t/remote-source/app/pkg/controller/migmigration/migrate.go:70\ngithub.com/konveyor/mig-controller/pkg/controller/migmigration.(*ReconcileMigMigration).Reconcile\n\t/remote-source/app/pkg/controller/migmigration/migmigration_controller.go:264\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1371","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/app/vendor/github.com/go-logr/zapr/zapr.go:132\ngithub.com/konveyor/controller/pkg/logging.(*Logger).Error\n\t/remote-source/app/vendor/github.com/konveyor/controller/pkg/logging/logger.go:109\ngithub.com/konveyor/controller/pkg/logging.(*Logger).Trace\n\t/remote-source/app/vendor/github.com/konveyor/controller/pkg/logging/logger.go:115\ngithub.com/konveyor/mig-controller/pkg/controller/migmigration.(*ReconcileMigMigration).migrate\n\t/remote-source/app/pkg/controller/migmigration/migrate.go:80\ngithub.com/konveyor/mig-controller/pkg/controller/migmigration.(*ReconcileMigMigration).Reconcile\n\t/remote-source/app/pkg/controller/migmigration/migmigration_controller.go:264\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/remote-source/app/vendor/k8s.io/apimach
inery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99"}
openshift-migration migration-controller-6dc67fbbd5-z68zb mtc {"level":"info","ts":1653497990.9643703,"logger":"migration","msg":"Marking migration as FAILED. See Status.Errors","migMigration":"ocp-24686-project-mig-1653497851","phase":"EnsureInitialBackupReplicated","migrationErrors":["Backup not found"]}
openshift-migration migration-controller-6dc67fbbd5-z68zb mtc {"level":"info","ts":1653497991.0802553,"logger":"migration","msg":"[RUN] (Step 1/4) Migration failed.","migMigration":"ocp-24686-project-mig-1653497851","phase":"EnsureInitialBackupReplicated"}
openshift-migration migration-controller-6dc67fbbd5-z68zb mtc {"level":"info","ts":1653497991.0802739,"logger":"migration","msg":"[COMPLETED]","migMigration":"ocp-24686-project-mig-1653497851","phase":"EnsureInitialBackupReplicated"}
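Whether Velero has synced the backup over to the target cluster can be checked directly on the target with a command along these lines (the backup name is taken from the log above; the default openshift-migration namespace used by MTC is assumed):

$ # run against the target cluster; openshift-migration namespace assumed
$ oc get backups.velero.io -n openshift-migration | grep ocp-24686-project-mig-1653497851-initial

Until the sync from the bucket has completed, this returns nothing, which matches the "Backup not found" error in the log.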

Expected results: Migration should not fail.


Additional info:

Comment 1 Prasad Joshi 2022-05-31 11:41:50 UTC
We are able to reproduce this bug reliably when the backup location contains a large number of backup resources. In our case, the S3 bucket had 500+ backups.
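As a rough check of how far the sync has progressed, the number of Backup resources Velero has synced from the storage location can be counted with something like the following (the default openshift-migration namespace used by MTC is assumed):

$ # count Backup resources synced into the cluster so far
$ oc get backups.velero.io -n openshift-migration --no-headers | wc -l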

Comment 2 Jason Montleon 2022-05-31 14:20:14 UTC
QE's reproducer works something like this:
- set up two new clusters
- use an existing (GCP in this case) bucket with 700+ backups in it
- run a small/quick migration, which causes the BSL to be set up for the first time
- the backup finishes before all of the existing backups are synced from the bucket, and migration-controller cannot find the new one
- if you rerun after everything is synced, it seems to work (see the commands below)
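A rough way to tell when the sync has caught up before re-running (example commands, assuming the default openshift-migration namespace):

$ # the backup storage location should report Available
$ oc get backupstoragelocations.velero.io -n openshift-migration
$ # check the synced Backup count; when it stops growing, the sync has caught up
$ oc get backups.velero.io -n openshift-migration --no-headers | wc -l

Once the count stops growing, re-running the migration should no longer hit this race.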

Currently it's not clear to us why the backup that is created locally apparently cannot be found until the backup sync is complete.

A customer could reproduce this in a scenario where:
- they perform many migrations from source cluster 1 to destination cluster 1
- they then start migrations from source cluster 2 to the same or another destination cluster, reusing the same bucket
- if the sync of the existing backups takes longer than the first migration, it's likely they'll encounter this

One possibility is that we ignore the not-found error and continue polling; eventually the backup should be returned. This does open us up to the possibility of hanging if the backup is somehow really missing, but we can't think of a situation where that would occur at present.

Comment 3 Erik Nelson 2022-07-01 15:42:21 UTC
We'll be deprecating 1.6/1.5 shortly; please open a distinct bug against 1.7 if this continues to be seen.

