Description of problem:
Migration fails at the Restore step because the initial backup is not found. We have rarely seen this issue in the QE environment.

Version-Release number of selected component (if applicable):
Source: GCP 3.11, MTC 1.5.5
Target: GCP 4.10, MTC 1.6.5

How reproducible:
Rarely

Steps to Reproduce:
1. Create a new project in the source cluster
   $ oc new-project test
2. Execute the migration

Actual results:
Migration fails with a "Backup not found" error.

$ oc get migmigration -n openshift-migration
spec:
  migPlanRef:
    name: ocp-24686-project-migplan-1653497851
    namespace: openshift-migration
  stage: false
status:
  conditions:
  - category: Advisory
    durable: true
    lastTransitionTime: "2022-05-25T16:59:50Z"
    message: 'The migration has failed. See: Errors.'
    reason: EnsureInitialBackupReplicated
    status: "True"
    type: Failed
  errors:
  - Backup not found
  itinerary: Failed
  observedDigest: 45590e63300f11509a03d0cc341ede281c293b6522cb3c5c031315303610d70a
  phase: Completed
  pipeline:
  - completed: "2022-05-25T16:59:12Z"
    message: Completed
    name: Prepare
    started: "2022-05-25T16:58:05Z"
  - completed: "2022-05-25T16:59:16Z"
    message: Completed
    name: Backup
    progress:
    - 'Backup openshift-migration/ocp-24686-project-mig-1653497851-initial-s5cxn: 16 out of estimated total of 16 objects backed up (2s)'
    started: "2022-05-25T16:59:12Z"
  - completed: "2022-05-25T16:59:16Z"
    message: Completed
    name: DirectImage
    progress:
    - 0 total ImageStreams; 0 running; 0 successful; 0 failed
    started: "2022-05-25T16:59:16Z"
  - completed: "2022-05-25T16:59:51Z"
    failed: true
    message: Failed
    name: Restore
    started: "2022-05-25T16:59:16Z"
  - completed: "2022-05-25T16:59:51Z"
    message: Completed
    name: Cleanup
    started: "2022-05-25T16:59:51Z"
  - completed: "2022-05-25T16:59:51Z"
    message: Completed
    name: CleanupHelpers
    started: "2022-05-25T16:59:51Z"
  startTimestamp: "2022-05-25T16:58:04Z"

$ oc -n openshift-migration logs migration-log-reader-6bbc78f6fd-g5j5h -c plain | grep '"migMigration":"ocp-24686-project-mig-1653497851"'
openshift-migration migration-controller-6dc67fbbd5-z68zb mtc {"level":"info","ts":1653497987.857594,"logger":"migration","msg":"Checking if Velero Backup has been replicated to destination cluster","migMigration":"ocp-24686-project-mig-1653497851","phase":"EnsureInitialBackupReplicated","backup":"openshift-migration/ocp-24686-project-mig-1653497851-initial-s5cxn"}
openshift-migration migration-controller-6dc67fbbd5-z68zb mtc {"level":"info","ts":1653497987.8629704,"logger":"migration","msg":"Initial Velero Backup has not yet been replicated to target cluster by Velero. Waiting","migMigration":"ocp-24686-project-mig-1653497851","phase":"EnsureInitialBackupReplicated","backup":"openshift-migration/ocp-24686-project-mig-1653497851-initial-s5cxn"}
openshift-migration migration-controller-6dc67fbbd5-z68zb mtc {"level":"info","ts":1653497990.875281,"logger":"migration","msg":"[RUN] (Step 41/48) Waiting for initial Velero backup replication to target cluster.","migMigration":"ocp-24686-project-mig-1653497851","phase":"EnsureInitialBackupReplicated"}
openshift-migration migration-controller-6dc67fbbd5-z68zb mtc {"level":"info","ts":1653497990.9641583,"logger":"migration","msg":"Phase execution failed.","migMigration":"ocp-24686-project-mig-1653497851","phase":"EnsureInitialBackupReplicated","phaseDescription":"Waiting for initial Velero backup replication to target cluster.","error":"Backup not found"}
openshift-migration migration-controller-6dc67fbbd5-z68zb mtc {"level":"error","ts":1653497990.9641836,"logger":"migration","msg":"","migMigration":"ocp-24686-project-mig-1653497851","error":"Backup not found","errorVerbose":"Backup not found\ngithub.com/konveyor/mig-controller/pkg/controller/migmigration.(*Task).Run\n\t/remote-source/app/pkg/controller/migmigration/task.go:917\ngithub.com/konveyor/mig-controller/pkg/controller/migmigration.(*ReconcileMigMigration).migrate\n\t/remote-source/app/pkg/controller/migmigration/migrate.go:70\ngithub.com/konveyor/mig-controller/pkg/controller/migmigration.(*ReconcileMigMigration).Reconcile\n\t/remote-source/app/pkg/controller/migmigration/migmigration_controller.go:264\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1371","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/app/vendor/github.com/go-logr/zapr/zapr.go:132\ngithub.com/konveyor/controller/pkg/logging.(*Logger).Error\n\t/remote-source/app/vendor/github.com/konveyor/controller/pkg/logging/logger.go:109\ngithub.com/konveyor/controller/pkg/logging.(*Logger).Trace\n\t/remote-source/app/vendor/github.com/konveyor/controller/pkg/logging/logger.go:115\ngithub.com/konveyor/mig-controller/pkg/controller/migmigration.(*ReconcileMigMigration).migrate\n\t/remote-source/app/pkg/controller/migmigration/migrate.go:80\ngithub.com/konveyor/mig-controller/pkg/controller/migmigration.(*ReconcileMigMigration).Reconcile\n\t/remote-source/app/pkg/controller/migmigration/migmigration_controller.go:264\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99"}
openshift-migration migration-controller-6dc67fbbd5-z68zb mtc {"level":"info","ts":1653497990.9643703,"logger":"migration","msg":"Marking migration as FAILED. See Status.Errors","migMigration":"ocp-24686-project-mig-1653497851","phase":"EnsureInitialBackupReplicated","migrationErrors":["Backup not found"]}
openshift-migration migration-controller-6dc67fbbd5-z68zb mtc {"level":"info","ts":1653497991.0802553,"logger":"migration","msg":"[RUN] (Step 1/4) Migration failed.","migMigration":"ocp-24686-project-mig-1653497851","phase":"EnsureInitialBackupReplicated"}
openshift-migration migration-controller-6dc67fbbd5-z68zb mtc {"level":"info","ts":1653497991.0802739,"logger":"migration","msg":"[COMPLETED]","migMigration":"ocp-24686-project-mig-1653497851","phase":"EnsureInitialBackupReplicated"}

Expected results:
Migration should not fail.

Additional info:
We are able to reproduce this bug reliably when the backup location has too many backup resources. In our case, the S3 bucket had 500+ backups.
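One quick way to see how far behind the backup sync is (assuming the default openshift-migration namespace on the target cluster; the exact bucket path depends on the BackupStorageLocation prefix, so the paths below are placeholders) is to compare the Backup resources Velero has already synced with the contents of the bucket:

$ oc -n openshift-migration get backups.velero.io --no-headers | wc -l
$ gsutil ls gs://<bucket>/<prefix>/backups/ | wc -l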
QE's reproducer works something like this:
- Set up two new clusters.
- Use an existing (GCP in this case) bucket with 700+ backups on it.
- Run a small/quick migration, which causes the BSL to be set up for the first time.
- The backup finishes before all of the existing backups have been synced from the bucket, and migration-controller cannot find the new one.
- If you rerun after everything is synced, it seems to work.

Currently it's not clear to us why the backup that is created locally apparently cannot be found until the backup sync is complete.

A customer could hit this in a scenario where:
- They perform many migrations from source cluster 1 to destination cluster 1.
- Moving on, they start migrations from source cluster 2 to the same or another destination cluster, reusing the same bucket.
- If the sync of backups takes longer than the first migration, it's likely they'll encounter this.

One possibility is to ignore the NotFound error and continue polling; eventually the backup should be returned (a sketch of this approach is below). This does open us up to the possibility of hanging if the backup is somehow really missing, but we can't think of a situation where that would occur at present.
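To make that idea concrete, here is a minimal, illustrative Go sketch. This is not the actual mig-controller implementation; the function name, destClient, poll interval, and timeout are assumptions. It shows the replication check treating NotFound as "backup sync has not caught up yet" and continuing to poll instead of failing the phase immediately:

// Illustrative sketch only; not mig-controller code.
package backupsync

import (
	"context"
	"time"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForReplicatedBackup polls the destination cluster for the initial Velero
// Backup. NotFound is treated as "not synced yet" rather than a terminal error;
// the timeout still bounds the wait so a backup that is genuinely missing
// eventually fails the migration instead of hanging forever.
func waitForReplicatedBackup(ctx context.Context, destClient client.Client, namespace, name string) (*velerov1.Backup, error) {
	backup := &velerov1.Backup{}
	err := wait.PollImmediate(5*time.Second, 30*time.Minute, func() (bool, error) {
		getErr := destClient.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, backup)
		switch {
		case k8serrors.IsNotFound(getErr):
			// Backup sync from the object store has not caught up yet; keep polling.
			return false, nil
		case getErr != nil:
			// Any other error is unexpected; stop polling and surface it.
			return false, getErr
		default:
			// Found on the destination cluster; replication is done.
			return true, nil
		}
	})
	if err != nil {
		return nil, err
	}
	return backup, nil
}

Bounding the poll with a timeout is what addresses the hang concern above: if the backup never appears on the destination cluster, the phase still fails after the deadline rather than waiting indefinitely.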
We'll be deprecating 1.6/1.5 shortly; please open a distinct bug against 1.7 if this continues to be seen.