Description of problem:
After the SA token of a source cluster changes (for example because the mig operator and controller were reinstalled on it), the mig-controller keeps using the old token for its remote watch and logs "Unauthorized" errors.

Version-Release number of selected component (if applicable):

OCP4
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0     True        False         5h1m    Cluster version is 4.1.0

OCP3
$ oc version
oc v3.11.126
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://
openshift v3.11.104
kubernetes v1.11.0+d4cacc0

velero
  image: quay.io/ocpmigrate/velero:fusor-dev
  imageID: quay.io/ocpmigrate/velero@sha256:e4e19be179221bf8a298cb7282f5890099633194dbc0c698c813e07b40b29302
  image: quay.io/ocpmigrate/migration-plugin:latest
  imageID: quay.io/ocpmigrate/migration-plugin@sha256:d34af290b3c6d808ad360a1f2d41d91e06bff5aa912f9a5a78fed3ea2f0f8f71

controller
  image: quay.io/ocpmigrate/mig-controller:latest
  imageID: quay.io/ocpmigrate/mig-controller@sha256:24e1dad428ca878d4b19f73148f485785c96a91d9aa9f738e7ee1b4b40726682

How reproducible:

Steps to Reproduce:
1. Set up a normal environment with one OCP4 target cluster and one OCP3 source cluster.
2. Verify that both clusters appear in the UI and both are online.
3. Uninstall the mig operator and controller from the OCP3 source cluster.
4. Install the mig operator and controller in the OCP3 source cluster again. This changes the SA token of this cluster.
5. In the UI, update the token so that the cluster is online again.

Actual results:
1. The source cluster is online again in the UI.
2. There are errors in the controller's logs:

$ oc logs $(oc get pod -l control-plane=controller-manager -o NAME)
E0814 16:06:10.902643       1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to list *v1.Secret: Unauthorized
E0814 16:06:10.964137       1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to list *v1.PersistentVolumeClaim: Unauthorized
E0814 16:06:10.989906       1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to list *v1.StorageClass: Unauthorized
E0814 16:06:10.993861       1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to list *v1.PersistentVolume: Unauthorized
E0814 16:06:10.995861       1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to list *v1.BackupStorageLocation: Unauthorized

Expected results:
The controller should notice the change in the token.

Additional info:
Once the failures show up in the logs, deleting the controller pod makes the controller pick up the new token and work fine again:

$ oc delete pod $(oc get pod -l control-plane=controller-manager -o jsonpath='{.items[].metadata.name}')
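For illustration only, a minimal Go sketch (not mig-controller code) of why the reflector calls above keep failing: a client-go client holds on to the bearer token its rest.Config was built with, so once that ServiceAccount token is revoked every list returns 401 Unauthorized until a new client is built with the new token. The host and token values below are placeholders.

// Sketch: a client built from a now-revoked SA token fails every list call.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Placeholder values; the token is the one captured when the watch was started.
	cfg := &rest.Config{
		Host:            "https://source-cluster.example.com:8443",
		BearerToken:     "STALE-SA-TOKEN",
		TLSClientConfig: rest.TLSClientConfig{Insecure: true},
	}

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// After the mig operator is reinstalled on the source cluster, the old SA
	// token is invalid, so this list returns "Unauthorized" -- the same error
	// the informer cache logs above.
	_, err = client.CoreV1().Secrets("").List(context.TODO(), metav1.ListOptions{})
	fmt.Println("list secrets:", err)
}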
This is happening because the remote watch system only starts a new remote watch if one isn't already running for a MigCluster. The check simply looks at whether a remote watch was started for a particular MigCluster ns/name. Changing the SA token doesn't change this ns/name, so the old remote watch keeps running with a stale SA token.

https://github.com/fusor/mig-controller/blob/master/pkg/remote/watch.go#L69
https://github.com/fusor/mig-controller/blob/master/pkg/controller/migcluster/migcluster_controller.go#L204-L207

We are currently missing support for stopping remote watches, which is required to handle changes to SA tokens.

https://github.com/fusor/mig-controller/blob/master/pkg/remote/watch.go#L51

@Jeff, we should do some thinking on what kinds of situations should lead to shutdown / restart of a remote watch.
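A simplified Go sketch of the behavior described above, not the actual pkg/remote/watch.go code: the watch bookkeeping is keyed only by the MigCluster's ns/name, so a changed SA token never triggers a restart. EnsureRemoteWatch and StopRemoteWatch are hypothetical names; the stop path is the piece that is currently missing.

// Sketch of remote-watch bookkeeping keyed by MigCluster ns/name.
package remote

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

var (
	mu      sync.Mutex
	watches = map[types.NamespacedName]chan struct{}{} // MigCluster ns/name -> stop channel
)

// EnsureRemoteWatch mirrors today's check: if an entry already exists for this
// MigCluster, nothing is (re)started -- even if the SA token stored for the
// cluster has changed, the old watch keeps its stale credentials.
func EnsureRemoteWatch(key types.NamespacedName, start func(stop <-chan struct{}) error) error {
	mu.Lock()
	defer mu.Unlock()
	if _, ok := watches[key]; ok {
		return nil
	}
	stop := make(chan struct{})
	if err := start(stop); err != nil {
		return err
	}
	watches[key] = stop
	return nil
}

// StopRemoteWatch is the missing capability: it would be called when the
// MigCluster's referenced SA token secret changes, so the next reconcile
// starts a fresh remote watch with the new token.
func StopRemoteWatch(key types.NamespacedName) {
	mu.Lock()
	defer mu.Unlock()
	if stop, ok := watches[key]; ok {
		close(stop)
		delete(watches, key)
	}
}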
Possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=1945251, although in 1945251 I didn't see recovery happen at all.
Closing as stale, please re-open if the issue persists with the current release.