Description of problem:

Migration is stuck in the EnsureQuiesced phase when the migrated namespace contains pods in different states such as "Completed", "Error", and "CrashLoopBackOff" alongside a statefulset.

Version-Release number of selected component (if applicable):

OCP3:
$ oc version
oc v3.11.141
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://XXXXXXXXXXX
openshift v3.11.141
kubernetes v1.11.0+d4cacc0

OCP4:
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-09-02-172410   True        False         6h49m   Cluster version is 4.2.0-0.nightly-2019-09-02-172410

Controller:
    image: quay.io/ocpmigrate/mig-controller:stable
    imageID: quay.io/ocpmigrate/mig-controller@sha256:7ec48a557240f1d2fa6ee6cd62234b0e75f178eca2a0cc5b95124e01bcd2c114

Velero:
    image: quay.io/ocpmigrate/velero:stable
    imageID: quay.io/ocpmigrate/velero@sha256:957725dec5f0fb6a46dee78bd49de9ec4ab66903eabb4561b62ad8f4ad9e6f05
    image: quay.io/ocpmigrate/migration-plugin:stable
    imageID: quay.io/ocpmigrate/migration-plugin@sha256:b4493d826260eb1e3e02ba935aaedfd5310fefefb461ca7dcd9a5d55d4aa8f35

How reproducible:
Always

Steps to Reproduce:
1. Deploy a cronjob to generate pods:
   $ oc process -f https://raw.githubusercontent.com/sergiordlr/temp-testfiles/master/app_migration/cronjob/hello_cron_template.yml | oc create -f -
2. Deploy a statefulset:
   $ oc new-app datagrid-service -p APPLICATION_USER=test -p APPLICATION_PASSWORD=changeme -p NUMBER_OF_INSTANCES=1
3. Wait until the pods have been deployed:
   $ oc get pods
   NAME                          READY   STATUS      RESTARTS   AGE
   datagrid-service-0            1/1     Running     0          41s
   hello-cron-1567527540-4r6h8   0/1     Completed   0          7s
4. Migrate the namespace.

Actual results:
The migration is stuck in the EnsureQuiesced phase forever.

Expected results:
The migration should finish without any error.
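The hang is consistent with the quiesce step waiting for every pod in the namespace to disappear, including pods that are already terminal. Completed/Error pods left behind by the cronjob are never removed by scaling workloads down, so a wait that counts them can never finish. A minimal sketch of the filtering that is needed, assuming the controller keys off the reported pod STATUS; `count_blocking_pods` is a hypothetical helper, not mig-controller code:

```shell
# Hypothetical helper: given the STATUS column values of the pods in a
# namespace, count only the pods that can still be quiesced away.
# Terminal pods (Completed/Succeeded, Error/Failed) are never deleted by
# scaling down, so waiting for them would block EnsureQuiesced forever.
count_blocking_pods() {
  count=0
  for status in "$@"; do
    case "$status" in
      Completed|Succeeded|Error|Failed)
        ;;                        # terminal - ignore
      *)
        count=$((count + 1)) ;;   # Running, Pending, CrashLoopBackOff, ...
    esac
  done
  echo "$count"
}

count_blocking_pods Running Completed Error CrashLoopBackOff   # prints 2
```

CrashLoopBackOff pods still count as blocking here because they are owned by a live controller and do get removed once the workload is scaled down, so waiting on them is legitimate.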
Additional info:

The migplan is:

apiVersion: migration.openshift.io/v1alpha1
kind: MigPlan
metadata:
  annotations:
    touch: 1c3f5bbd-00a2-4428-b2e1-fab64b0c19d8
  creationTimestamp: "2019-09-04T07:55:36Z"
  generation: 8
  name: testquiesced
  namespace: openshift-migration
  resourceVersion: "869782"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migplans/testquiesced
  uid: 60ea0908-cee9-11e9-bec4-064e40a8ff9a
spec:
  destMigClusterRef:
    name: host
    namespace: openshift-migration
  migStorageRef:
    name: awstorage
    namespace: openshift-migration
  namespaces:
  - testquiesced
  persistentVolumes:
  - capacity: 1Gi
    name: pvc-35872bef-cee9-11e9-adda-0e0962963924
    pvc:
      accessModes:
      - ReadWriteOnce
      name: srv-data-datagrid-service-0
      namespace: testquiesced
    selection:
      action: copy
      storageClass: csi-rbd
    storageClass: glusterfs-storage
    supported:
      actions:
      - copy
      - move
  srcMigClusterRef:
    name: ocp3
    namespace: openshift-migration
status:
  conditions:
  - category: Required
    lastTransitionTime: "2019-09-04T07:55:52Z"
    message: The `persistentVolumes` list has been updated with discovered PVs.
    reason: Done
    status: "True"
    type: PvsDiscovered
  - category: Required
    lastTransitionTime: "2019-09-04T07:55:52Z"
    message: The storage resources have been created.
    status: "True"
    type: StorageEnsured
  - category: Required
    lastTransitionTime: "2019-09-04T07:55:53Z"
    message: The migration registry resources have been created.
    status: "True"
    type: RegistriesEnsured
  - category: Required
    lastTransitionTime: "2019-09-04T07:55:53Z"
    message: The migration plan is ready.
    status: "True"
    type: Ready
  - category: Advisory
    lastTransitionTime: "2019-09-04T07:56:09Z"
    message: Limited validation; PV discovery and resource reconciliation suspended.
    status: "True"
    type: Suspended

And the migmigration object:

apiVersion: migration.openshift.io/v1alpha1
kind: MigMigration
metadata:
  annotations:
    touch: ca927231-b476-4dca-876d-73fc0b9f177e
  creationTimestamp: "2019-09-04T07:56:05Z"
  generation: 12
  name: 72803f20-cee9-11e9-a387-c394301cd762
  namespace: openshift-migration
  ownerReferences:
  - apiVersion: migration.openshift.io/v1alpha1
    kind: MigPlan
    name: testquiesced
    uid: 60ea0908-cee9-11e9-bec4-064e40a8ff9a
  resourceVersion: "871445"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migmigrations/72803f20-cee9-11e9-a387-c394301cd762
  uid: 7292f075-cee9-11e9-bec4-064e40a8ff9a
spec:
  migPlanRef:
    name: testquiesced
    namespace: openshift-migration
  quiescePods: true
  stage: false
status:
  conditions:
  - category: Advisory
    lastTransitionTime: "2019-09-04T07:56:54Z"
    message: 'Step: 13/27'
    reason: EnsureQuiesced
    status: "True"
    type: Running
  - category: Required
    lastTransitionTime: "2019-09-04T07:56:05Z"
    message: The migration is ready.
    status: "True"
    type: Ready
  phase: EnsureQuiesced
  startTimestamp: "2019-09-04T07:56:05Z"
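To confirm where a migration is stuck, the current phase can be read straight from the MigMigration status. On a live cluster the usual check is the `oc` jsonpath query shown in the comment; the `get_phase` helper below is a hypothetical equivalent that extracts the same field from a saved YAML dump like the one above:

```shell
# On a live cluster:
#   oc get migmigration <name> -n openshift-migration -o jsonpath='{.status.phase}'
# Offline, against a saved MigMigration dump:
get_phase() {
  awk '$1 == "phase:" { print $2 }' "$1"
}
```

For the dump above this yields EnsureQuiesced, matching the Running condition's reason.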
Fixed by: https://github.com/fusor/mig-controller/pull/307
Verified in:

Controller:
    image: quay.io/ocpmigrate/mig-controller:latest
    imageID: quay.io/ocpmigrate/mig-controller@sha256:259b08d197940932c616dd45f7cfd9799aca6823e83a510f85c83c0c5368496c

Velero:
    image: quay.io/ocpmigrate/velero:latest
    imageID: quay.io/ocpmigrate/velero@sha256:33d0e627aea00d0896a25d0acae6d4aa7deaaf86ddd28c29f8a6020dc16a97fc
    image: quay.io/ocpmigrate/migration-plugin:latest
    imageID: quay.io/ocpmigrate/migration-plugin@sha256:68f0791ce3d51e16e9759465064067d90daba396339ad83aa7aa6eba5a3bd4cf

OCP4:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-09-10-014843   True        False         45m     Cluster version is 4.2.0-0.nightly-2019-09-10-014843

OCP3:
oc v3.11.144
kubernetes v1.11.0+d4cacc0

The migration was executed against pods with these statuses:

NAME                          READY   STATUS             RESTARTS   AGE
datagrid-service-0            1/1     Running            0          4m
hello-cron-1568115720-s9pz4   0/1     Completed          0          6m
hello-cron-1568115780-8tz7r   0/1     Completed          0          5m
hello-cron-1568115840-wx9v4   0/1     Completed          0          4m
hello-cron-1568115900-d8m4l   0/1     Error              5          3m
hello-cron-1568115960-mx9pj   0/1     ImagePullBackOff   0          2m
hello-cron-1568116020-6xg7z   0/1     ErrImagePull       0          1m
hello-cron-1568116080-48r9r   0/1     CrashLoopBackOff   1          10s

The migration completed properly, and quiescing worked as intended.
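The "quiescing worked as intended" result can be spot-checked on the source side: after quiescing, the only pods left in the namespace should be terminal ones. A hypothetical filter over an `oc get pods --no-headers` listing like the one above (not part of the product's tooling):

```shell
# Hypothetical check: pipe `oc get pods -n <ns> --no-headers` through
# this to list pods that quiescing should have removed.  STATUS is the
# third column; terminal Completed/Error pods legitimately remain.
still_running() {
  awk 'NF >= 3 && $3 != "Completed" && $3 != "Error" { print $1 }'
}
```

An empty result after quiescing means every scalable workload has been drained; any pod it prints is still pending shutdown.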
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922