Bug 1769535 - Cluster Migration Plan hung on EnsureCloudSecretPropagated
Summary: Cluster Migration Plan hung on EnsureCloudSecretPropagated
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Migration Tooling
Version: 4.2.z
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Jeff Ortel
QA Contact: Xin jiang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-11-06 19:56 UTC by Jeremy Rogers
Modified: 2023-03-24 15:55 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-06 20:20:44 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Product Errata RHEA-2020:0440 (last updated 2020-02-06 20:21:01 UTC)

Description Jeremy Rogers 2019-11-06 19:56:08 UTC
Description of problem:
When migrating a namespace from on-prem OpenShift 3.11 to 4.2, staging hangs on "EnsureCloudSecretPropagated".

Version-Release number of selected component (if applicable):
Cluster Application Migration Operator 1.0.0

How reproducible:
Hangs every time

Steps to Reproduce:
1. Set up cluster migration operator on 3.11 cluster and 4.2 cluster per the documentation here: https://docs.openshift.com/container-platform/4.2/migration/migrating-openshift-3-to-4.html
2. Set up a minio deployment for the replication repository
3. In the CAM tool frontend, set up a plan from the source cluster to the destination cluster.
4. Click "Stage" to begin staging the migration.

Actual results:
Migration hangs on "EnsureCloudSecretPropagated"

Expected results:
Successful stage



Additional info:

I've seen this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1757571
In my case, the migration controller is running on the destination cluster, so this bug doesn't apply.

I looked through the code, and it looks like this step is looking for a secret called 'cloud-credentials' in the 'openshift-migration' namespace on each cluster. This secret exists on both hosts.
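
For reference, this is the check I ran on both clusters (secret name and namespace taken from the controller code):

$ oc get secret cloud-credentials -n openshift-migration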

The controller logs have a lot of these lines:

{"level":"info","ts":1573069999.4995205,"logger":"migration|qm784","msg":"[RUN]","migration":"c83f2ed0-00c6-11ea-84ea-c33e5d2a6e30","stage":true,"phase":"EnsureCloudSecretPropagated"}
{"level":"info","ts":1573070002.7821653,"logger":"migration|2f9cz","msg":"[RUN]","migration":"c83f2ed0-00c6-11ea-84ea-c33e5d2a6e30","stage":true,"phase":"EnsureCloudSecretPropagated"}
{"level":"info","ts":1573070006.1517553,"logger":"migration|xqqm5","msg":"[RUN]","migration":"c83f2ed0-00c6-11ea-84ea-c33e5d2a6e30","stage":true,"phase":"EnsureCloudSecretPropagated"}

Comment 3 Xin jiang 2019-12-05 05:27:15 UTC
I set up the cluster migration operator on a 3.11 cluster and a 4.2 cluster per the same documentation as yours. I cannot reproduce the issue.

OCP 3.11
$ oc version
oc v3.11.156
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-0-23.ec2.internal:8443
openshift v3.11.157
kubernetes v1.11.0+d4cacc0

OCP4.2
$ oc get clusterversions                                                                                                                                                           
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-12-03-084420   True        False         24h     Cluster version is 4.2.0-0.nightly-2019-12-03-084420


MigMigration:
$ oc get migmigration 869e9920-171e-11ea-a545-f14ba749a1fa -n openshift-migration -o yaml
apiVersion: migration.openshift.io/v1alpha1
kind: MigMigration
metadata:
  annotations:
    touch: b1c21c5b-281c-48d4-8c29-ebb76eb0aa47
  creationTimestamp: "2019-12-05T05:17:26Z"
  generation: 17
  name: 869e9920-171e-11ea-a545-f14ba749a1fa
  namespace: openshift-migration
  ownerReferences:
  - apiVersion: migration.openshift.io/v1alpha1
    kind: MigPlan
    name: test
    uid: 3bc76fe6-171e-11ea-a0a8-0a6117712c96
  resourceVersion: "422944"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migmigrations/869e9920-171e-11ea-a545-f14ba749a1fa
  uid: 86ddec55-171e-11ea-a0a8-0a6117712c96
spec:
  migPlanRef:
    name: test
    namespace: openshift-migration
  stage: true
status:
  conditions:
  - category: Advisory
    durable: true
    lastTransitionTime: "2019-12-05T05:19:10Z"
    message: The migration has completed successfully.
    reason: Completed
    status: "True"
    type: Succeeded
  phase: Completed
  startTimestamp: "2019-12-05T05:17:26Z"

Comment 4 Jeremy Rogers 2019-12-05 13:25:36 UTC
It is also worth noting that our 3.11 cluster is running on OKD. Could that cause an issue?

OKD 3.11
oc v3.11.0+62803d0-1
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server <redacted>
openshift v3.11.0+06cfa24-67
kubernetes v1.11.0+d4cacc0

OCP 4.2
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.9     True        False         22h     Cluster version is 4.2.9


I have tried uninstalling and re-installing the migration controller on both clusters. Are there any other logs I can give or curl commands I can run from one cluster to another to help debug the issue?

Comment 5 Jeremy Rogers 2019-12-05 14:04:26 UTC
I've been trying to look through the code a bit to see what the EnsureCloudSecretPropagated phase actually does. It looks like it execs into the velero pod on the other cluster and checks that the cloud credentials inside the pod match the cloud-credentials secret in the pod's namespace. 
I checked manually and they do match, so maybe the path to that secret in the pod isn't being set correctly? The controller logs don't have anything about running commands on the other cluster, and unfortunately there doesn't seem to be any debug logging in that function. I'd like to see what API calls are actually coming out of it so that I can run them manually and check for issues with RBAC or with authenticating with the service account token.
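
For anyone following along, the manual comparison looked roughly like this (assuming the secret key is "cloud" and the default Velero mount path of /credentials/cloud; adjust if your install differs):

$ oc -n openshift-migration get secret cloud-credentials -o jsonpath='{.data.cloud}' | base64 -d
$ VELERO_POD=$(oc -n openshift-migration get pods -l component=velero -o name | head -n1)
$ oc -n openshift-migration exec $VELERO_POD -- cat /credentials/cloud

The two outputs should be identical, and in my case they were.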

I think that checking the connection to the remote cluster authenticates with the provided service account token, so I don't think that should be a problem.

Comment 6 Jeff Ortel 2019-12-11 14:07:06 UTC
I suspect the problem is described in this issue: https://github.com/fusor/mig-controller/issues/377.  Does the customer have velero running on either of the clusters in a namespace other than openshift-migration?
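
A quick way to check is to list velero pods across all namespaces:

$ oc get pods -l component=velero --all-namespaces

Any velero pod outside the openshift-migration namespace would hit that issue.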

Comment 7 Jeremy Rogers 2019-12-17 20:51:06 UTC
(In reply to Jeff Ortel from comment #6)
> I suspect the problem is described in this issue:
> https://github.com/fusor/mig-controller/issues/377.  Does the customer have
> velero running on either of the clusters in a namespace other than
> openshift-migration?

Hello, 

Sorry for the delay. We had velero running in another namespace because we were evaluating it before the CAM tool existed. After deleting those old velero instances, the migration plan continued. Thank you for your help!

- Jeremy

Comment 8 Jeff Ortel 2020-01-10 22:26:23 UTC
Fixed by: https://github.com/fusor/mig-controller/pull/378

Comment 9 Xin jiang 2020-01-21 02:50:39 UTC
Jeff, 

What's the expected behavior?  I don't know how to verify it.

Comment 11 Jeff Ortel 2020-01-21 19:36:59 UTC
To test, you need to run a migration with velero running in the "velero" namespace in addition to the openshift-migration namespace on either the source or destination cluster.
Without the fix, the migration would get stuck at phase=EnsureCloudSecretPropagated because it was ensuring that the secret mounted in ALL velero pods matched the cloud-credentials secret in the openshift-migration namespace. The velero pod running in the "velero" namespace will never mount our cloud-credentials secret.

The result you want is that the migration does not get stuck at phase=EnsureCloudSecretPropagated even with velero running in additional namespaces.
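
One way to watch for this during the test (the migmigration name below is a placeholder for whatever your run creates):

$ oc -n openshift-migration get migmigration <name> -o jsonpath='{.status.phase}'

Without the fix this stays at EnsureCloudSecretPropagated; with the fix it should progress to Completed.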

Comment 12 Sergio 2020-01-22 17:00:40 UTC
Verified in CAM 1.1 stage

Controller:
    image: registry.stage.redhat.io/rhcam-1-1/openshift-migration-controller-rhel8@sha256:ec015e65da93e800a25e522473793c277cd0d86436ab4637d73e2673a5f0083d
UI:
    image: registry.stage.redhat.io/rhcam-1-1/openshift-migration-ui-rhel8@sha256:ecd81e11af0a17bfdd4e6afce1bf60f422115ed3545ad2b0d055f0fded87e422
Velero:
    imageID: registry.stage.redhat.io/rhcam-1-1/openshift-migration-velero-rhel8@sha256:89e56f7f08802e92a763ca3c7336209e58849b9ac9ea90ddc76d9b94d981b8b9
    imageID: registry.stage.redhat.io/rhcam-1-1/openshift-migration-plugin-rhel8@sha256:9c6eceba0c422b9f375c3ab785ff392093493ce33def7c761d7cedc51cde775d
    imageID: registry.stage.redhat.io/rhcam-1-1/openshift-migration-velero-plugin-for-aws-rhel8@sha256:5235eeeee330165eef77ac8d823eed384c9108884f6be49c9ab47944051af91e
    imageID: registry.stage.redhat.io/rhcam-1-1/openshift-migration-velero-plugin-for-gcp-rhel8@sha256:789b12ff351d3edde735b9f5eebe494a8ac5a94604b419dfd84e87d073b04e9e
    imageID: registry.stage.redhat.io/rhcam-1-1/openshift-migration-velero-plugin-for-microsoft-azure-rhel8@sha256:b98f1c61ba347aaa0c8dac5c34b6be4b8cce20c8ff462f476a3347d767ad0a93


I deployed another velero in the target cluster, so there were two velero instances running in that cluster, and then I executed a migration.

$ oc get pods -l component=velero --all-namespaces
NAMESPACE             NAME                      READY   STATUS    RESTARTS   AGE
openshift-migration   velero-6b49fd8dc5-ldzrd   1/1     Running   0          5h31m
velero                velero-b4cf75794-dt96k    1/1     Running   0          44m


The migration ran properly; it didn't hang and no other problems were found.

Comment 14 errata-xmlrpc 2020-02-06 20:20:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:0440

