Bug 2000644

Summary: Invalid migration plan causes "controller" pod to crash
Product: Migration Toolkit for Containers
Reporter: Sergio <sregidor>
Component: Controller
Assignee: Jason Montleon <jmontleo>
Status: CLOSED ERRATA
QA Contact: Xin jiang <xjiang>
Severity: urgent
Docs Contact: Avital Pinnick <apinnick>
Priority: urgent
Version: 1.6.0
CC: ernelson, jmontleo, prajoshi, rjohnson, ssingla, whu, xjiang
Target Milestone: ---   
Target Release: 1.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-09-29 14:35:53 UTC
Type: Bug

Description Sergio 2021-09-02 15:07:39 UTC
Description of problem:
When a migration plan is malformed and its 'destMigClusterRef' points to a non-existent migcluster resource, the migration controller pod starts crashing.


Version-Release number of selected component (if applicable):
SOURCE CLUSTER: AWS 3.11 (MTC 1.5.1)
TARGET CLUSTER: AWS 4.9 (MTC 1.6.0)

How reproducible:
Always

Steps to Reproduce:
1. Create an empty namespace in the source cluster

$ oc new-project empty-project

2. Create a migration plan to migrate this namespace, using the name "migplan-ocp-40171-malformed-crds"

3. Patch the migplan so that its destination cluster reference becomes invalid

$ oc -n openshift-migration patch migplans migplan-ocp-40171-malformed-crds -p '{"spec":{"destMigClusterRef": {"name": "foo"}}}' --type='merge'


Actual results:
The migration controller pod starts crashing

$ oc get pods
NAME                                    READY   STATUS             RESTARTS       AGE
migration-controller-6d854fd675-ntczb   1/2     CrashLoopBackOff   5 (103s ago)   6m32s


Expected results:

The migration plan should become "Not Ready" and report a Critical condition describing the problem.
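The expected behavior can be sketched as follows. This is a minimal, self-contained Go sketch of the validation pattern (the type and function names here are simplified stand-ins, not the actual mig-controller API): a lookup that finds no migcluster for the referenced name should produce a Critical `NotFound` condition on the plan rather than proceeding with a nil cluster object.

```go
package main

import "fmt"

// Condition is a trimmed stand-in for the real migplan status condition.
type Condition struct {
	Type     string
	Category string
	Reason   string
	Status   string
	Message  string
}

// MigCluster is a trimmed stand-in for v1alpha1.MigCluster.
type MigCluster struct{ Name string }

// clusters simulates the API server: only the "host" migcluster exists.
var clusters = map[string]*MigCluster{"host": {Name: "host"}}

// getCluster mirrors the common controller pattern of returning
// (nil, nil) when the referenced object simply does not exist.
func getCluster(name string) (*MigCluster, error) {
	return clusters[name], nil
}

// validateDestCluster shows the expected behavior: a dangling
// destMigClusterRef yields a Critical condition instead of a crash.
func validateDestCluster(refName string) (*Condition, error) {
	cluster, err := getCluster(refName)
	if err != nil {
		return nil, err
	}
	if cluster == nil {
		return &Condition{
			Type:     "InvalidDestinationClusterRef",
			Category: "Critical",
			Reason:   "NotFound",
			Status:   "True",
			Message:  fmt.Sprintf("The `dstMigClusterRef` must reference a valid `migcluster`, subject: openshift-migration/%s.", refName),
		}, nil
	}
	return nil, nil // reference is valid; no condition is raised
}

func main() {
	cond, _ := validateDestCluster("foo")
	fmt.Println(cond.Type, cond.Reason) // InvalidDestinationClusterRef NotFound
}
```

The key point is that the nil result of the lookup is handled at the validation layer, before any method is ever invoked on the (possibly nil) cluster object.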


Additional info:

This is the crash in the migration controller pod:

{"level":"info","ts":1630591811310,"logger":"controller-runtime.manager.controller.migcluster-controller","msg":"Starting workers","worker count":1}
{"level":"info","ts":1630591811310,"logger":"controller-runtime.manager.controller.migplan-controller","msg":"Starting workers","worker count":1}
{"level":"info","ts":1630591811311,"logger":"controller-runtime.manager.controller.directimagemigration-controller","msg":"Starting workers","worker count":1}
{"level":"info","ts":1630591811312,"logger":"controller-runtime.manager.controller.migstorage-controller","msg":"Starting Controller"}
{"level":"info","ts":1630591811312,"logger":"controller-runtime.manager.controller.migstorage-controller","msg":"Starting workers","worker count":1}
{"level":"info","ts":1630591811313,"logger":"controller-runtime.manager.controller.miganalytic-controller","msg":"Starting workers","worker count":2}
{"level":"info","ts":1630591811313,"logger":"controller-runtime.manager.controller.directvolumemigration-controller","msg":"Starting workers","worker count":1}
{"level":"info","ts":1630591811313,"logger":"controller-runtime.manager.controller.mighook-controller","msg":"Starting workers","worker count":1}
{"level":"info","ts":1630591811313,"logger":"controller-runtime.manager.controller.directvolumemigrationprogress-controller","msg":"Starting workers","worker count":5}
{"level":"info","ts":1630591811314,"logger":"controller-runtime.manager.controller.migmigration-controller","msg":"Starting workers","worker count":1}
E0902 14:10:11.378266       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 556 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x230f860, 0x3b865b0)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x86
panic(0x230f860, 0x3b865b0)
	/opt/rh/go-toolset-1.16/root/usr/lib/go-toolset-1.16-golang/src/runtime/panic.go:965 +0x1b9
github.com/konveyor/mig-controller/pkg/apis/migration/v1alpha1.(*MigCluster).BuildRestConfig(0x0, 0x2a4ab40, 0xc00139bdc0, 0x25e3ce0, 0x8, 0xc00487e720)
	/remote-source/app/pkg/apis/migration/v1alpha1/migcluster_types.go:423 +0x3a
github.com/konveyor/mig-controller/pkg/apis/migration/v1alpha1.(*MigCluster).GetClient(0x0, 0x2a4ab40, 0xc00139bdc0, 0xc00139bdc0, 0x0, 0x0, 0x0)
	/remote-source/app/pkg/apis/migration/v1alpha1/migcluster_types.go:209 +0x5a
github.com/konveyor/mig-controller/pkg/controller/migplan.ReconcileMigPlan.getPotentialFilePermissionConflictNamespaces(0x2a4a868, 0xc000bb4aa0, 0x2a1fb70, 0xc00054df40, 0xc0002a8230, 0x0, 0x0, 0xc001d10500, 0x3c7c9c0, 0x0, ...)
	/remote-source/app/pkg/controller/migplan/validation.go:307 +0x2c5
github.com/konveyor/mig-controller/pkg/controller/migplan.ReconcileMigPlan.validateNamespaces(0x2a4a868, 0xc000bb4aa0, 0x2a1fb70, 0xc00054df40, 0xc0002a8230, 0x0, 0x0, 0x2a2a020, 0xc002026e70, 0xc001d10500, ...)
	/remote-source/app/pkg/controller/migplan/validation.go:451 +0x389
github.com/konveyor/mig-controller/pkg/controller/migplan.ReconcileMigPlan.validate(0x2a4a868, 0xc000bb4aa0, 0x2a1fb70, 0xc00054df40, 0xc0002a8230, 0x0, 0x0, 0x2a2a020, 0xc002026e70, 0xc001d10500, ...)
	/remote-source/app/pkg/controller/migplan/validation.go:158 +0x39c
github.com/konveyor/mig-controller/pkg/controller/migplan.(*ReconcileMigPlan).Reconcile(0xc00054df80, 0x2a2a020, 0xc002026e70, 0xc000557590, 0x13, 0xc000c09cc0, 0x20, 0xc002026e00, 0x0, 0x0, ...)
	/remote-source/app/pkg/controller/migplan/migplan_controller.go:261 +0x553
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0008c52c0, 0x2a29f78, 0xc000702000, 0x23a4e20, 0xc001e9e240)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263 +0x30d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0008c52c0, 0x2a29f78, 0xc000702000, 0xc00011de00)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1(0x2a29f78, 0xc000702000)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198 +0x4a
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185 +0x37
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00011df50)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc003e15f50, 0x29dbba0, 0xc002026db0, 0xc000702001, 0xc000bca120)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00011df50, 0x3b9aca00, 0x0, 0xc80a01, 0xc000bca120)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x2a29f78, 0xc000702000, 0xc001f75de0, 0x3b9aca00, 0x0, 0x1)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185 +0xa6
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x2a29f78, 0xc000702000, 0xc001f75de0, 0x3b9aca00)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99 +0x57
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:195 +0x497
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x118 pc=0x196c9da]

goroutine 556 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x109
panic(0x230f860, 0x3b865b0)
	/opt/rh/go-toolset-1.16/root/usr/lib/go-toolset-1.16-golang/src/runtime/panic.go:965 +0x1b9
github.com/konveyor/mig-controller/pkg/apis/migration/v1alpha1.(*MigCluster).BuildRestConfig(0x0, 0x2a4ab40, 0xc00139bdc0, 0x25e3ce0, 0x8, 0xc00487e720)
	/remote-source/app/pkg/apis/migration/v1alpha1/migcluster_types.go:423 +0x3a
github.com/konveyor/mig-controller/pkg/apis/migration/v1alpha1.(*MigCluster).GetClient(0x0, 0x2a4ab40, 0xc00139bdc0, 0xc00139bdc0, 0x0, 0x0, 0x0)
	/remote-source/app/pkg/apis/migration/v1alpha1/migcluster_types.go:209 +0x5a
github.com/konveyor/mig-controller/pkg/controller/migplan.ReconcileMigPlan.getPotentialFilePermissionConflictNamespaces(0x2a4a868, 0xc000bb4aa0, 0x2a1fb70, 0xc00054df40, 0xc0002a8230, 0x0, 0x0, 0xc001d10500, 0x3c7c9c0, 0x0, ...)
	/remote-source/app/pkg/controller/migplan/validation.go:307 +0x2c5
github.com/konveyor/mig-controller/pkg/controller/migplan.ReconcileMigPlan.validateNamespaces(0x2a4a868, 0xc000bb4aa0, 0x2a1fb70, 0xc00054df40, 0xc0002a8230, 0x0, 0x0, 0x2a2a020, 0xc002026e70, 0xc001d10500, ...)
	/remote-source/app/pkg/controller/migplan/validation.go:451 +0x389
github.com/konveyor/mig-controller/pkg/controller/migplan.ReconcileMigPlan.validate(0x2a4a868, 0xc000bb4aa0, 0x2a1fb70, 0xc00054df40, 0xc0002a8230, 0x0, 0x0, 0x2a2a020, 0xc002026e70, 0xc001d10500, ...)
	/remote-source/app/pkg/controller/migplan/validation.go:158 +0x39c
github.com/konveyor/mig-controller/pkg/controller/migplan.(*ReconcileMigPlan).Reconcile(0xc00054df80, 0x2a2a020, 0xc002026e70, 0xc000557590, 0x13, 0xc000c09cc0, 0x20, 0xc002026e00, 0x0, 0x0, ...)
	/remote-source/app/pkg/controller/migplan/migplan_controller.go:261 +0x553
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0008c52c0, 0x2a29f78, 0xc000702000, 0x23a4e20, 0xc001e9e240)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263 +0x30d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0008c52c0, 0x2a29f78, 0xc000702000, 0xc00011de00)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1(0x2a29f78, 0xc000702000)
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198 +0x4a
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185 +0x37
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00011df50)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc003e15f50, 0x29dbba0, 0xc002026db0, 0xc000702001, 0xc000bca120)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00011df50, 0x3b9aca00, 0x0, 0xc80a01, 0xc000bca120)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x2a29f78, 0xc000702000, 0xc001f75de0, 0x3b9aca00, 0x0, 0x1)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185 +0xa6
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x2a29f78, 0xc000702000, 0xc001f75de0, 0x3b9aca00)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99 +0x57
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:195 +0x497
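The trace shows `(*MigCluster).BuildRestConfig` and `(*MigCluster).GetClient` being invoked with a `0x0` receiver, i.e. the method is called on a nil `*MigCluster` because the dangling ref resolved to nothing. In Go, calling a method on a nil pointer receiver is legal; the panic only happens once a field is dereferenced, so a guard at the top of the method converts the crash into an ordinary error. A minimal sketch of that pattern, with simplified stand-in types (this is not the actual fix from the project):

```go
package main

import (
	"errors"
	"fmt"
)

// MigCluster is a trimmed stand-in for v1alpha1.MigCluster;
// the field is assumed for illustration.
type MigCluster struct {
	Name string
}

// BuildRestConfig guards against a nil receiver before touching any
// field, returning an error instead of panicking with SIGSEGV.
func (c *MigCluster) BuildRestConfig() (string, error) {
	if c == nil {
		return "", errors.New("migcluster reference not found")
	}
	return "rest-config-for-" + c.Name, nil
}

func main() {
	var missing *MigCluster // simulates a dangling destMigClusterRef
	if _, err := missing.BuildRestConfig(); err != nil {
		fmt.Println(err) // migcluster reference not found
	}
}
```

With the guard in place the reconcile loop receives an error it can surface as a condition, rather than the pod entering CrashLoopBackOff.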

Comment 1 Sergio 2021-09-02 15:10:12 UTC
The same crash also occurs if the malformed field is 'srcMigClusterRef'.

Comment 2 Sergio 2021-09-02 15:26:09 UTC
Probably a duplicate of this other BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1951869

Comment 3 Erik Nelson 2021-09-03 13:17:17 UTC
We believe this to be resolved as of: https://github.com/konveyor/mig-controller/pull/1186

Comment 7 Sergio 2021-09-08 11:07:57 UTC
Verified using
SOURCE CLUSTER: AWS OCP 3.11 (MTC 1.5.1) NFS
TARGET CLUSTER: AWS OCP 4.9 (MTC 1.6.0) OCS4

openshift-migration-rhel8-operator@sha256:ef00e934ed578a4acb429f8710284d10acf2cf98f38a2b2268bbea8b5fd7139c
    - name: MIG_CONTROLLER_REPO
      value: openshift-migration-controller-rhel8@sha256
    - name: MIG_CONTROLLER_TAG
      value: 27f465b2cd38cee37af5c3d0fd745676086fe0391e3c459d4df18dd3a12e7051
    - name: MIG_UI_REPO
      value: openshift-migration-ui-rhel8@sha256
    - name: MIG_UI_TAG


Now we get two Critical conditions and the migration controller no longer crashes.

For destination cluster:

status:
  conditions:
  - category: Critical
    lastTransitionTime: "2021-09-08T11:04:13Z"
    message: 'The `dstMigClusterRef` must reference a valid `migcluster`, subject: openshift-migration/foo.'
    reason: NotFound
    status: "True"
    type: InvalidDestinationClusterRef
  - category: Critical
    lastTransitionTime: "2021-09-08T11:04:13Z"
    message: 'Reconcile failed: [destination cluster not found]. See controller logs for details.'
    status: "True"
    type: ReconcileFailed


For source cluster:

status:
  conditions:
  - category: Critical
    lastTransitionTime: "2021-09-08T11:07:07Z"
    message: 'The `srcMigClusterRef` must reference a valid `migcluster`, subject: openshift-migration/foo.'
    reason: NotFound
    status: "True"
    type: InvalidSourceClusterRef
  - category: Critical
    lastTransitionTime: "2021-09-08T11:07:07Z"
    message: 'Reconcile failed: [source cluster not found]. See controller logs for details.'
    status: "True"
    type: ReconcileFailed


Moved to VERIFIED status

Comment 9 errata-xmlrpc 2021-09-29 14:35:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Migration Toolkit for Containers (MTC) 1.6.0 security & bugfix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3694