Description of problem:

Migration is stuck in the EnsureQuiesced phase when the migrated namespace contains pods in different states such as "Completed", "Error", and "CrashLoopBackOff" alongside a statefulset.

Version-Release number of selected component (if applicable):

OCP3:
$ oc version
oc v3.11.141
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://XXXXXXXXXXX
openshift v3.11.141
kubernetes v1.11.0+d4cacc0

OCP4:
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-09-02-172410   True        False         6h49m   Cluster version is 4.2.0-0.nightly-2019-09-02-172410

Controller:
    image: quay.io/ocpmigrate/mig-controller:stable
    imageID: quay.io/ocpmigrate/mig-controller@sha256:7ec48a557240f1d2fa6ee6cd62234b0e75f178eca2a0cc5b95124e01bcd2c114

Velero:
    image: quay.io/ocpmigrate/velero:stable
    imageID: quay.io/ocpmigrate/velero@sha256:957725dec5f0fb6a46dee78bd49de9ec4ab66903eabb4561b62ad8f4ad9e6f05
    image: quay.io/ocpmigrate/migration-plugin:stable
    imageID: quay.io/ocpmigrate/migration-plugin@sha256:b4493d826260eb1e3e02ba935aaedfd5310fefefb461ca7dcd9a5d55d4aa8f35

How reproducible:
Always

Steps to Reproduce:
1. Deploy a cronjob to generate pods:
   $ oc process -f https://raw.githubusercontent.com/sergiordlr/temp-testfiles/master/app_migration/cronjob/hello_cron_template.yml | oc create -f -
2. Deploy a statefulset:
   $ oc new-app datagrid-service -p APPLICATION_USER=test -p APPLICATION_PASSWORD=changeme -p NUMBER_OF_INSTANCES=1
3. Wait until the pods have been deployed:
   $ oc get pods
   NAME                          READY   STATUS      RESTARTS   AGE
   datagrid-service-0            1/1     Running     0          41s
   hello-cron-1567527540-4r6h8   0/1     Completed   0          7s
4. Migrate the namespace.

Actual results:
The migration is stuck in the EnsureQuiesced phase forever.

Expected results:
The migration should finish without any error.
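The hang is consistent with the quiesce step waiting for every pod in the namespace to disappear, including pods that are already terminal. Completed/Error pods left behind by the cronjob are never removed by scaling workloads down, so a wait that counts them can never finish. A minimal sketch of the filtering that is needed, assuming the controller keys off the reported pod STATUS; `count_blocking_pods` is a hypothetical helper, not mig-controller code:

```shell
# Hypothetical helper: given the STATUS column values of the pods in a
# namespace, count only the pods that can still be quiesced away.
# Terminal pods (Completed/Succeeded, Error/Failed) are never deleted by
# scaling down, so waiting for them would block EnsureQuiesced forever.
count_blocking_pods() {
  count=0
  for status in "$@"; do
    case "$status" in
      Completed|Succeeded|Error|Failed)
        ;;                        # terminal - ignore
      *)
        count=$((count + 1)) ;;   # Running, Pending, CrashLoopBackOff, ...
    esac
  done
  echo "$count"
}

count_blocking_pods Running Completed Error CrashLoopBackOff   # prints 2
```

CrashLoopBackOff pods still count as blocking here because they are owned by a live controller and do get removed once the workload is scaled down, so waiting on them is legitimate.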
Additional info:

The migplan is:

apiVersion: migration.openshift.io/v1alpha1
kind: MigPlan
metadata:
  annotations:
    touch: 1c3f5bbd-00a2-4428-b2e1-fab64b0c19d8
  creationTimestamp: "2019-09-04T07:55:36Z"
  generation: 8
  name: testquiesced
  namespace: openshift-migration
  resourceVersion: "869782"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migplans/testquiesced
  uid: 60ea0908-cee9-11e9-bec4-064e40a8ff9a
spec:
  destMigClusterRef:
    name: host
    namespace: openshift-migration
  migStorageRef:
    name: awstorage
    namespace: openshift-migration
  namespaces:
  - testquiesced
  persistentVolumes:
  - capacity: 1Gi
    name: pvc-35872bef-cee9-11e9-adda-0e0962963924
    pvc:
      accessModes:
      - ReadWriteOnce
      name: srv-data-datagrid-service-0
      namespace: testquiesced
    selection:
      action: copy
      storageClass: csi-rbd
    storageClass: glusterfs-storage
    supported:
      actions:
      - copy
      - move
  srcMigClusterRef:
    name: ocp3
    namespace: openshift-migration
status:
  conditions:
  - category: Required
    lastTransitionTime: "2019-09-04T07:55:52Z"
    message: The `persistentVolumes` list has been updated with discovered PVs.
    reason: Done
    status: "True"
    type: PvsDiscovered
  - category: Required
    lastTransitionTime: "2019-09-04T07:55:52Z"
    message: The storage resources have been created.
    status: "True"
    type: StorageEnsured
  - category: Required
    lastTransitionTime: "2019-09-04T07:55:53Z"
    message: The migration registry resources have been created.
    status: "True"
    type: RegistriesEnsured
  - category: Required
    lastTransitionTime: "2019-09-04T07:55:53Z"
    message: The migration plan is ready.
    status: "True"
    type: Ready
  - category: Advisory
    lastTransitionTime: "2019-09-04T07:56:09Z"
    message: Limited validation; PV discovery and resource reconciliation suspended.
    status: "True"
    type: Suspended

And the migmigration object:

apiVersion: migration.openshift.io/v1alpha1
kind: MigMigration
metadata:
  annotations:
    touch: ca927231-b476-4dca-876d-73fc0b9f177e
  creationTimestamp: "2019-09-04T07:56:05Z"
  generation: 12
  name: 72803f20-cee9-11e9-a387-c394301cd762
  namespace: openshift-migration
  ownerReferences:
  - apiVersion: migration.openshift.io/v1alpha1
    kind: MigPlan
    name: testquiesced
    uid: 60ea0908-cee9-11e9-bec4-064e40a8ff9a
  resourceVersion: "871445"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migmigrations/72803f20-cee9-11e9-a387-c394301cd762
  uid: 7292f075-cee9-11e9-bec4-064e40a8ff9a
spec:
  migPlanRef:
    name: testquiesced
    namespace: openshift-migration
  quiescePods: true
  stage: false
status:
  conditions:
  - category: Advisory
    lastTransitionTime: "2019-09-04T07:56:54Z"
    message: 'Step: 13/27'
    reason: EnsureQuiesced
    status: "True"
    type: Running
  - category: Required
    lastTransitionTime: "2019-09-04T07:56:05Z"
    message: The migration is ready.
    status: "True"
    type: Ready
  phase: EnsureQuiesced
  startTimestamp: "2019-09-04T07:56:05Z"
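To confirm where a migration is stuck, the current phase can be read straight from the MigMigration status. On a live cluster the usual check is the `oc` jsonpath query shown in the comment; the `get_phase` helper below is a hypothetical equivalent that extracts the same field from a saved YAML dump like the one above:

```shell
# On a live cluster:
#   oc get migmigration <name> -n openshift-migration -o jsonpath='{.status.phase}'
# Offline, against a saved MigMigration dump:
get_phase() {
  awk '$1 == "phase:" { print $2 }' "$1"
}
```

For the dump above this yields EnsureQuiesced, matching the Running condition's reason.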
Fixed by: https://github.com/fusor/mig-controller/pull/307
Verified in:

Controller:
    image: quay.io/ocpmigrate/mig-controller:latest
    imageID: quay.io/ocpmigrate/mig-controller@sha256:259b08d197940932c616dd45f7cfd9799aca6823e83a510f85c83c0c5368496c

Velero:
    image: quay.io/ocpmigrate/velero:latest
    imageID: quay.io/ocpmigrate/velero@sha256:33d0e627aea00d0896a25d0acae6d4aa7deaaf86ddd28c29f8a6020dc16a97fc
    image: quay.io/ocpmigrate/migration-plugin:latest
    imageID: quay.io/ocpmigrate/migration-plugin@sha256:68f0791ce3d51e16e9759465064067d90daba396339ad83aa7aa6eba5a3bd4cf

OCP4:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-09-10-014843   True        False         45m     Cluster version is 4.2.0-0.nightly-2019-09-10-014843

OCP3:
oc v3.11.144
kubernetes v1.11.0+d4cacc0

The migration was executed against pods with these statuses:

NAME                          READY   STATUS             RESTARTS   AGE
datagrid-service-0            1/1     Running            0          4m
hello-cron-1568115720-s9pz4   0/1     Completed          0          6m
hello-cron-1568115780-8tz7r   0/1     Completed          0          5m
hello-cron-1568115840-wx9v4   0/1     Completed          0          4m
hello-cron-1568115900-d8m4l   0/1     Error              5          3m
hello-cron-1568115960-mx9pj   0/1     ImagePullBackOff   0          2m
hello-cron-1568116020-6xg7z   0/1     ErrImagePull       0          1m
hello-cron-1568116080-48r9r   0/1     CrashLoopBackOff   1          10s

The migration completed properly, and quiescing worked as intended.
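The "quiescing worked as intended" result can be spot-checked on the source side: after quiescing, the only pods left in the namespace should be terminal ones. A hypothetical filter over an `oc get pods --no-headers` listing like the one above (not part of the product's tooling):

```shell
# Hypothetical check: pipe `oc get pods -n <ns> --no-headers` through
# this to list pods that quiescing should have removed.  STATUS is the
# third column; terminal Completed/Error pods legitimately remain.
still_running() {
  awk 'NF >= 3 && $3 != "Completed" && $3 != "Error" { print $1 }'
}
```

An empty result after quiescing means every scalable workload has been drained; any pod it prints is still pending shutdown.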
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922