Bug 1970338 - Parallel migrations fail because the initial backup is missing
Summary: Parallel migrations fail because the initial backup is missing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Migration Toolkit for Containers
Classification: Red Hat
Component: General
Version: 1.4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 1.6.0
Assignee: Dylan Murray
QA Contact: Xin jiang
Docs Contact: Avital Pinnick
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-10 09:48 UTC by Sergio
Modified: 2021-09-29 14:34 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-29 14:34:47 UTC
Target Upstream Version:
Embargoed:




Links
System / ID                                  Last Updated
Github konveyor mig-controller pull 1190     2021-09-02 02:17:09 UTC
Github konveyor mig-controller pull 1191     2021-09-02 18:30:02 UTC
Github konveyor mig-controller pull 1192     2021-09-02 20:01:21 UTC
Red Hat Product Errata RHSA-2021:3694        2021-09-29 14:34:55 UTC

Description Sergio 2021-06-10 09:48:20 UTC
Description of problem:
When several migrations are executed in parallel, some of them can report a failure because they cannot create the initial backup.


Version-Release number of selected component (if applicable):
MTC 1.4.5
TARGET CLUSTER: 4.8 AWS
SOURCE CLUSTER: 3.11 AWS
REPLICATION REPOSITORY: AWS S3


We have seen this error in 1.4.5, but it is probably present in previous versions too.


How reproducible:
Intermittent.

Steps to Reproduce:
1. Execute 3 or more migrations in parallel
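
One possible way to start several migrations at once is sketched below. This is a rough sketch only: the referenced MigPlan names (parallel-plan-1..3) are hypothetical placeholders for plans that already exist and are Ready.

# Rough sketch only: start three MigMigrations at once.
# The referenced MigPlans (parallel-plan-1..3) are hypothetical placeholders.
for i in 1 2 3; do
  cat <<EOF | oc create -f -
apiVersion: migration.openshift.io/v1alpha1
kind: MigMigration
metadata:
  generateName: parallel-mig-${i}-
  namespace: openshift-migration
spec:
  migPlanRef:
    name: parallel-plan-${i}
    namespace: openshift-migration
  stage: false
  quiescePods: true
EOF
done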

Actual results:
Eventually, one of them will report an error creating the InitialBackup, like this:

$ oc get migmigration -o yaml 

  status:
    conditions:
    - category: Advisory
      durable: true
      lastTransitionTime: "2021-06-09T13:31:46Z"
      message: 'The migration has failed.  See: Errors.'
      reason: InitialBackupCreated
      status: "True"
      type: Failed
    errors:
    - Backup not found
    itinerary: Failed
    observedDigest: b922c7ac32f32e776c564ebcfdbee4ebf2a7319cab64d6583c5b6744f9a5e9a9
    phase: Completed
    pipeline:
    - completed: "2021-06-09T13:31:24Z"
      message: Completed
      name: Prepare
      started: "2021-06-09T13:30:33Z"
    - completed: "2021-06-09T13:31:48Z"
      failed: true
      message: Failed
      name: Backup
      progress:
      - 'Backup openshift-migration/ocp-25000-sets-mig-1623245405-zqcsr: 64 out of estimated total of 90 objects backed up (14s)'
      started: "2021-06-09T13:31:24Z"
    - message: Skipped
      name: DirectImage
      skipped: true
    - message: Skipped
      name: Restore
      skipped: true
    - completed: "2021-06-09T13:31:48Z"
      message: Completed
      name: Cleanup
      started: "2021-06-09T13:31:48Z"
    - completed: "2021-06-09T13:31:48Z"
      message: Completed
      name: CleanupHelpers
      started: "2021-06-09T13:31:48Z"
    startTimestamp: "2021-06-09T13:30:28Z"
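
To pick out only the failed migrations among the parallel runs, something like the following works (a sketch; it assumes jq is installed and that the MigMigrations live in the default openshift-migration namespace):

# List the names of MigMigrations whose Failed condition is True
# (assumes jq and the default openshift-migration namespace).
$ oc get migmigration -n openshift-migration -o json \
    | jq -r '.items[]
        | select(any(.status.conditions[]?; .type == "Failed" and .status == "True"))
        | .metadata.name'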



In the source cluster, we can find this error message too:

$ oc logs $(oc get pods -l logreader -o name) -c plain

openshift-migration velero-7979c7d6b4-jk6t6 velero time="2021-06-09T11:54:48Z" level=warning msg="Got error trying to update backup's status.progress" backup=openshift-migration/ocp-25000-sets-mig-1623239611-znzld error="backups.velero.io \"ocp-25000-sets-mig-1623239611-znzld\" not found" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/backup/backup.go:361" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*kubernetesBackupper).Backup.func1" logSource="pkg/backup/backup.go:361"
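
To confirm that the Backup CR from that warning really is missing, the Velero backups on the source cluster can be checked directly (the backup name below is copied from the log line above; the namespace is the default openshift-migration):

# Does the backup that Velero complained about still exist on the source cluster?
$ oc get backups.velero.io -n openshift-migration
$ oc get backups.velero.io ocp-25000-sets-mig-1623239611-znzld -n openshift-migration -o yaml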


And we can see this error in the migration controller pod:

{"level":"error","ts":1623239690076,"logger":"migration|8ff9p","msg":"","migMigration":"ocp-25000-sets-mig-1623239611","error":"Backup not found","errorVerbose":"Backup not found\ngithub.com/konveyor/mig-controller/pkg/controller/migmigration.(*Task).Run\n\t/remote-source/app/pkg/controller/migmigration/task.go:537\ngithub.com/konveyor/mig-controller/pkg/controller/migmigration.(*ReconcileMigMigration).migrate\n\t/remote-source/app/pkg/controller/migmigration/migrate.go:71\ngithub.com/konveyor/mig-controller/pkg/controller/migmigration.(*ReconcileMigMigration).Reconcile\n\t/remote-source/app/pkg/controller/migmigration/migmigration_controller.go:241\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88\nruntime.goexit\n\t/opt/rh/go-toolset-1.16/root/usr/lib/go-toolset-1.16-golang/src/runtime/asm_amd64.s:1371","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/app/vendor/github.com/go-logr/zapr/zapr.go:132\ngithub.com/konveyor/controller/pkg/logging.Logger.Error\n\t/remote-source/app/vendor/github.com/konveyor/controller/pkg/logging/logger.go:92\ngithub.com/konveyor/controller/pkg/logging.Logger.Trace\n\t/remote-source/app/vendor/github.com/konveyor/controller/pkg/logging/logger.go:98\ngithub.com/konveyor/mig-controller/pkg/controller/migmigration.(*ReconcileMigMigration).migrate\n\t/remote-source/app/pkg/controller/migmigration/migrate.go:81\ngithub.com/konveyor/mig-controller/pkg/controller/migmigration.(*ReconcileMigMigration).Reconcile\n\t/remote-source/app/pkg/controller/migmigration/migmigration_controller.go:241\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}


Expected results:

The migration should not fail.


Additional info:

Comment 1 Dylan Murray 2021-09-01 14:51:26 UTC
I can confirm I am able to reproduce this. I have not determined a root cause yet, but it is fairly easy to reproduce.

Comment 8 errata-xmlrpc 2021-09-29 14:34:47 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Migration Toolkit for Containers (MTC) 1.6.0 security & bugfix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3694

