Description of problem:

When migrating a namespace from on-prem OpenShift 3.11 to 4.2, staging hangs on "EnsureCloudSecretPropagated".

Version-Release number of selected component (if applicable):

1.0.0 Cluster Application Migration Operator

How reproducible:

Hangs every time

Steps to Reproduce:
1. Set up the cluster migration operator on the 3.11 cluster and the 4.2 cluster per the documentation here: https://docs.openshift.com/container-platform/4.2/migration/migrating-openshift-3-to-4.html
2. Set up a minio deployment for the replication repository.
3. In the CAM tool frontend, set up a plan from the source cluster to the destination cluster.
4. Click "Stage" to begin staging the migration.

Actual results:

Migration hangs on "EnsureCloudSecretPropagated"

Expected results:

Successful stage

Additional info:

I've seen this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1757571
In my case, the migration controller is running on the destination cluster, so this bug doesn't apply.

I looked through the code, and it looks like this step is looking for a secret called 'cloud-credentials' in the 'openshift-migration' namespace on each cluster. This secret exists on both clusters.

The controller logs have a lot of these lines:

{"level":"info","ts":1573069999.4995205,"logger":"migration|qm784","msg":"[RUN]","migration":"c83f2ed0-00c6-11ea-84ea-c33e5d2a6e30","stage":true,"phase":"EnsureCloudSecretPropagated"}
{"level":"info","ts":1573070002.7821653,"logger":"migration|2f9cz","msg":"[RUN]","migration":"c83f2ed0-00c6-11ea-84ea-c33e5d2a6e30","stage":true,"phase":"EnsureCloudSecretPropagated"}
{"level":"info","ts":1573070006.1517553,"logger":"migration|xqqm5","msg":"[RUN]","migration":"c83f2ed0-00c6-11ea-84ea-c33e5d2a6e30","stage":true,"phase":"EnsureCloudSecretPropagated"}
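For reference, this is roughly how I confirmed the secret exists on both clusters (run against each cluster's kubeconfig context; the context names below are placeholders):

# Source (3.11) cluster
$ oc --context=source get secret cloud-credentials -n openshift-migration
# Destination (4.2) cluster
$ oc --context=destination get secret cloud-credentials -n openshift-migration

Both commands return the secret.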
I set up the cluster migration operator on a 3.11 cluster and a 4.2 cluster per the same documentation as yours, and I cannot reproduce the issue.

OCP 3.11:

$ oc version
oc v3.11.156
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-0-23.ec2.internal:8443
openshift v3.11.157
kubernetes v1.11.0+d4cacc0

OCP 4.2:

$ oc get clusterversions
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-12-03-084420   True        False         24h     Cluster version is 4.2.0-0.nightly-2019-12-03-084420

MigMigration:

$ oc get migmigration 869e9920-171e-11ea-a545-f14ba749a1fa -n openshift-migration -o yaml
apiVersion: migration.openshift.io/v1alpha1
kind: MigMigration
metadata:
  annotations:
    touch: b1c21c5b-281c-48d4-8c29-ebb76eb0aa47
  creationTimestamp: "2019-12-05T05:17:26Z"
  generation: 17
  name: 869e9920-171e-11ea-a545-f14ba749a1fa
  namespace: openshift-migration
  ownerReferences:
  - apiVersion: migration.openshift.io/v1alpha1
    kind: MigPlan
    name: test
    uid: 3bc76fe6-171e-11ea-a0a8-0a6117712c96
  resourceVersion: "422944"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migmigrations/869e9920-171e-11ea-a545-f14ba749a1fa
  uid: 86ddec55-171e-11ea-a0a8-0a6117712c96
spec:
  migPlanRef:
    name: test
    namespace: openshift-migration
  stage: true
status:
  conditions:
  - category: Advisory
    durable: true
    lastTransitionTime: "2019-12-05T05:19:10Z"
    message: The migration has completed successfully.
    reason: Completed
    status: "True"
    type: Succeeded
  phase: Completed
  startTimestamp: "2019-12-05T05:17:26Z"
It is also worth noting that our 3.11 cluster is running OKD. Could that cause an issue?

OKD 3.11:

oc v3.11.0+62803d0-1
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server <redacted>
openshift v3.11.0+06cfa24-67
kubernetes v1.11.0+d4cacc0

OCP 4.2:

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.9     True        False         22h     Cluster version is 4.2.9

I have tried uninstalling and re-installing the migration controller on both clusters. Are there any other logs I can provide or curl commands I can run from one cluster to the other to help debug the issue?
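In the meantime, here is what I can readily pull, in case it helps (the pod name, service account token, and API URL below are placeholders):

# Controller logs on the destination (host) cluster
$ oc -n openshift-migration get pods
$ oc -n openshift-migration logs <migration-controller-pod> --tail=200
# Connectivity/auth check against the source cluster API, using the same
# service account token that was supplied to the CAM tool for the source cluster
$ curl -k -H "Authorization: Bearer <source-sa-token>" https://<source-api-url>:8443/version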
I've been trying to look through the code a bit to see what the EnsureCloudSecretPropagated phase actually does. It looks like it execs into the velero pod on the other cluster and checks that the cloud credentials inside the pod match the cloud-credentials secret in the pod's namespace. I checked it manually and they do match; maybe the path to that secret in the pod isn't being set correctly?

The controller logs don't actually have anything about running commands on the other cluster, and unfortunately it doesn't look like there's any debug logging in that function. I'd like to see what API calls are actually coming out of that function so that I can test them manually and see if there are any issues with RBAC or with authenticating with the service account token. I think that checking the connection to the remote cluster authenticates with the provided service account token, so I don't think that should be a problem.
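For the record, this is roughly the comparison I did by hand; the mount path /credentials/cloud and the secret key name 'cloud' are my assumptions from looking at the velero pod, so adjust if yours differ:

# What the velero pod actually sees (mount path is an assumption)
$ oc -n openshift-migration exec <velero-pod> -- cat /credentials/cloud
# What the cloud-credentials secret in the same namespace contains (key name 'cloud' is an assumption)
$ oc -n openshift-migration get secret cloud-credentials -o jsonpath='{.data.cloud}' | base64 -d

The two outputs matched on both clusters.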
I suspect the problem is described in this issue: https://github.com/fusor/mig-controller/issues/377. Does the customer have velero running on either of the clusters in a namespace other than openshift-migration?
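A quick way to check, assuming the velero pods carry the usual component=velero label:

$ oc get pods -l component=velero --all-namespaces

Any velero pod outside openshift-migration would point to the issue above.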
(In reply to Jeff Ortel from comment #6)
> I suspect the problem is described in this issue:
> https://github.com/fusor/mig-controller/issues/377. Does the customer have
> velero running on either of the clusters in a namespace other than
> openshift-migration?

Hello,

Sorry for the delay. We had velero running in another namespace because we were evaluating it before the CAM tool existed. After deleting those old velero instances, the migration plan continued.

Thank you for your help!

- Jeremy
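For anyone else who hits this, the cleanup on our side amounted to removing the leftover evaluation install, along these lines (our old instance lived in a namespace literally named velero; yours may differ):

$ oc delete namespace velero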
Fixed by: https://github.com/fusor/mig-controller/pull/378
Jeff,

What's the expected behavior? I don't know how to verify it.
To test, you need to run a migration with velero running in the "velero" namespace in addition to the openshift-migration namespace on either the source or destination cluster. Without the fix, the migration would get stuck at phase=EnsureCloudSecretPropagated because it was ensuring that the secret mounted in ALL velero pods matched the cloud-credentials secret in the openshift-migration namespace. The velero pod running in the "velero" namespace will never mount our cloud-credentials secret. The result you want is that the migration does not get stuck at phase=EnsureCloudSecretPropagated even with velero running in additional namespaces.
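Once the extra velero instance is up, something like this is enough to watch the migration progress (the PHASE column reads .status.phase, as in the MigMigration yaml earlier in this bug); it should move past EnsureCloudSecretPropagated rather than looping there:

$ oc -n openshift-migration get migmigration -w -o custom-columns=NAME:.metadata.name,PHASE:.status.phase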
Verified in CAM 1.1 stage

Controller:
image: registry.stage.redhat.io/rhcam-1-1/openshift-migration-controller-rhel8@sha256:ec015e65da93e800a25e522473793c277cd0d86436ab4637d73e2673a5f0083d

UI:
image: registry.stage.redhat.io/rhcam-1-1/openshift-migration-ui-rhel8@sha256:ecd81e11af0a17bfdd4e6afce1bf60f422115ed3545ad2b0d055f0fded87e422

Velero:
imageID: registry.stage.redhat.io/rhcam-1-1/openshift-migration-velero-rhel8@sha256:89e56f7f08802e92a763ca3c7336209e58849b9ac9ea90ddc76d9b94d981b8b9
imageID: registry.stage.redhat.io/rhcam-1-1/openshift-migration-plugin-rhel8@sha256:9c6eceba0c422b9f375c3ab785ff392093493ce33def7c761d7cedc51cde775d
imageID: registry.stage.redhat.io/rhcam-1-1/openshift-migration-velero-plugin-for-aws-rhel8@sha256:5235eeeee330165eef77ac8d823eed384c9108884f6be49c9ab47944051af91e
imageID: registry.stage.redhat.io/rhcam-1-1/openshift-migration-velero-plugin-for-gcp-rhel8@sha256:789b12ff351d3edde735b9f5eebe494a8ac5a94604b419dfd84e87d073b04e9e
imageID: registry.stage.redhat.io/rhcam-1-1/openshift-migration-velero-plugin-for-microsoft-azure-rhel8@sha256:b98f1c61ba347aaa0c8dac5c34b6be4b8cce20c8ff462f476a3347d767ad0a93

I deployed another velero instance in the target cluster, so there were 2 velero instances running in that cluster, and then I executed a migration.

$ oc get pods -l component=velero --all-namespaces
NAMESPACE             NAME                      READY   STATUS    RESTARTS   AGE
openshift-migration   velero-6b49fd8dc5-ldzrd   1/1     Running   0          5h31m
velero                velero-b4cf75794-dt96k    1/1     Running   0          44m

The migration ran properly: it didn't hang and no other problems were found.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:0440