This bug was initially created as a copy of Bug #1861259.

I am copying this bug because:

Description of problem:
Before enabling "disable_pv_migration=true", the user created a migplan with PVs. The user then applied "disable_pv_migration=true" on the MigrationController, the migration-controller pod restarted, and the user tried to migrate the migplan (I am not sure how long I waited, maybe 1 or 2 minutes). "disable_pv_migration=true" does not take effect on existing plans: the migplan still has PVs, the migration gets stuck in the StageRestoreCreated phase, and the plan reports "Limited validation; PV discovery and resource reconciliation suspended".

Version-Release number of selected component (if applicable):
CAM 1.2.4

How reproducible:
Often

Steps to Reproduce:
1. Apply "disable_pv_migration": true on the MigrationController

$ oc patch migrationcontroller migration-controller -p '{"spec":{"disable_pv_migration": true } }' --type='merge' -n openshift-migration

2. Check that the migration-controller pod is restarted

$ oc get pod -n openshift-migration --watch
NAME                                                           READY   STATUS              RESTARTS   AGE
migration-controller-586c9688b8-6wftl                          2/2     Running             0          3h59m
migration-operator-68bdbf56f7-p67zv                            2/2     Running             0          4h1m
migration-ui-65f66946c4-tlt62                                  1/1     Running             0          3h59m
registry-13a03a04-a1a2-4ec8-ae26-61670668c1b8-bxxnt-1-deploy   0/1     Completed           0          37m
registry-13a03a04-a1a2-4ec8-ae26-61670668c1b8-bxxnt-2-d5g94    1/1     Running             0          35m
registry-13a03a04-a1a2-4ec8-ae26-61670668c1b8-bxxnt-2-deploy   0/1     Completed           0          35m
registry-8e89ec15-a682-4e8e-a522-c97c66c413bd-zjf4g-1-deploy   0/1     Completed           0          6m17s
registry-8e89ec15-a682-4e8e-a522-c97c66c413bd-zjf4g-1-djt55    1/1     Running             0          6m14s
restic-fhq4n                                                   1/1     Running             0          4h
restic-hqcgt                                                   1/1     Running             0          4h
restic-pgb7l                                                   1/1     Running             0          4h
velero-5dfcd8d7c9-pcdc6                                        1/1     Running             0          4h
migration-controller-5546545568-ptmh2                          0/2     Pending             0          0s
migration-controller-5546545568-ptmh2                          0/2     Pending             0          0s
migration-controller-5546545568-ptmh2                          0/2     ContainerCreating   0          0s
migration-controller-5546545568-ptmh2                          0/2     ContainerCreating   0          2s
migration-controller-5546545568-ptmh2                          2/2     Running             0          6s
migration-controller-586c9688b8-6wftl                          2/2     Terminating         0          4h
migration-controller-586c9688b8-6wftl                          0/2     Terminating         0          4h
migration-controller-586c9688b8-6wftl                          0/2     Terminating         0          4h
migration-controller-586c9688b8-6wftl                          0/2     Terminating         0          4h

3. Check that "disable_pv_migration=true" is applied on the MigrationController

$ oc get migrationcontrollers -n openshift-migration -o yaml | grep disable
        f:disable_pv_migration: {}
    disable_pv_migration: true

$ oc get pod -n openshift-migration migration-controller-5546545568-ptmh2 -o yaml | grep EXCLUDED_RESOURCES -A1
          k:{"name":"EXCLUDED_RESOURCES"}:
            .: {}
--
    - name: EXCLUDED_RESOURCES
      value: imagetags,templateinstances,clusterserviceversions,packagemanifests,subscriptions,servicebrokers,servicebindings,serviceclasses,serviceinstances,serviceplans,persistentvolumes,persistentvolumeclaims

4. Execute the migplan before the migplan is updated to remove PVs

5. The migration is stuck in the StageRestoreCreated phase

$ oc get migplan -n openshift-migration mysql -o yaml
apiVersion: migration.openshift.io/v1alpha1
kind: MigPlan
metadata:
  annotations:
    openshift.io/touch: 0047956e-d0a3-11ea-8b60-0a580a830033
  creationTimestamp: "2020-07-28T07:00:19Z"
  generation: 6
  managedFields:
  - apiVersion: migration.openshift.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        .: {}
        f:destMigClusterRef:
          .: {}
          f:name: {}
          f:namespace: {}
        f:migStorageRef:
          .: {}
          f:name: {}
          f:namespace: {}
        f:namespaces: {}
        f:srcMigClusterRef:
          .: {}
          f:name: {}
          f:namespace: {}
    manager: Mozilla
    operation: Update
    time: "2020-07-28T07:01:06Z"
  - apiVersion: migration.openshift.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:openshift.io/touch: {}
      f:spec:
        f:persistentVolumes: {}
      f:status:
        .: {}
        f:conditions: {}
        f:excludedResources: {}
        f:observedDigest: {}
    manager: manager
    operation: Update
    time: "2020-07-28T07:21:49Z"
  name: mysql
  namespace: openshift-migration
  resourceVersion: "159947"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migplans/mysql
  uid: 8e89ec15-a682-4e8e-a522-c97c66c413bd
spec:
  destMigClusterRef:
    name: host
    namespace: openshift-migration
  migStorageRef:
    name: automatic
    namespace: openshift-migration
  namespaces:
  - mysql
  persistentVolumes:
  - capacity: 1Gi
    name: pvc-247ac434-d09a-11ea-bd2a-fa163e9c1ec4
    pvc:
      accessModes:
      - ReadWriteOnce
      hasReference: true
      name: mysql
      namespace: mysql
    selection:
      action: copy
      copyMethod: filesystem
      storageClass: standard
    storageClass: standard
    supported:
      actions:
      - copy
      copyMethods:
      - filesystem
      - snapshot
  srcMigClusterRef:
    name: source-cluster
    namespace: openshift-migration
status:
  conditions:
  - category: Required
    lastTransitionTime: "2020-07-28T07:00:24Z"
    message: The `persistentVolumes` list has been updated with discovered PVs.
    reason: Done
    status: "True"
    type: PvsDiscovered
  - category: Required
    lastTransitionTime: "2020-07-28T07:00:26Z"
    message: The storage resources have been created.
    status: "True"
    type: StorageEnsured
  - category: Required
    lastTransitionTime: "2020-07-28T07:00:28Z"
    message: The migration registry resources have been created.
    status: "True"
    type: RegistriesEnsured
  - category: Required
    lastTransitionTime: "2020-07-28T07:00:28Z"
    message: The migration plan is ready.
    status: "True"
    type: Ready
  - category: Advisory
    lastTransitionTime: "2020-07-28T07:01:49Z"
    message: Limited validation; PV discovery and resource reconciliation suspended.
    status: "True"
    type: Suspended
  excludedResources:
  - imagetags
  - templateinstances
  - clusterserviceversions
  - packagemanifests
  - subscriptions
  - servicebrokers
  - servicebindings
  - serviceclasses
  - serviceinstances
  - serviceplans
  - persistentvolumes
  - persistentvolumeclaims
  observedDigest: 605f525329aa790bd0099ed481f27526ddd0f3aae7a906e823766663fafae209

6. During the migration (almost 13 minutes), the migplan is still not updated. You can also see this in the output from step 5.

Actual results:
The migration is stuck at StageRestoreCreated.

Expected results:
I am not sure what the expected results are. Maybe the migplan should be re-checked before the migration is executed.

Additional info:
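As a side note, the check in step 3 can be scripted. The snippet below is a sketch, not part of the original report: the sample EXCLUDED_RESOURCES value is copied from step 3, and on a live cluster you would populate the variable from the running pod spec instead (e.g. via the `oc get pod ... | grep EXCLUDED_RESOURCES -A1` command shown above).

```shell
# Sample EXCLUDED_RESOURCES value copied from step 3 of the report.
# On a live cluster, fetch this from the running migration-controller
# pod spec instead of hard-coding it.
excluded="imagetags,templateinstances,clusterserviceversions,packagemanifests,subscriptions,servicebrokers,servicebindings,serviceclasses,serviceinstances,serviceplans,persistentvolumes,persistentvolumeclaims"

# Wrap the list in commas so we match whole resource names only.
case ",$excluded," in
  *,persistentvolumeclaims,*) echo "disable_pv_migration is active" ;;
  *) echo "controller has not picked up the setting yet" ;;
esac
```

If the second branch prints, the controller pod is still running with the old environment and starting a migration would reproduce the stuck state described above.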
We may have a bug in https://github.com/konveyor/mig-controller/pull/592.

@Scott, what is the intended behavior in this case? I'm not sure we considered the case where pre-existing plans already have PVCs discovered. I don't see any notes on #592 regarding expected behavior, although it looks like we set PvsDiscovered to true to make the plan "Ready" when PVCs are excluded from the MigPlan via the mig-operator option.

@Xin, can you check whether your Velero Backup has completed or is still in progress? A must-gather dump would also be helpful here: https://github.com/konveyor/must-gather
Regarding the "resource reconciliation suspended" message, this is normal and expected during a migration. We don't run reconciliation on the MigPlan while the migration runs so that the plan isn't changed mid-flight.
@Derek,

1. When CAM enables "disable_pv_migration=true", it does remove spec.persistentVolumes for pre-existing plans that already have PVCs discovered. We did see this behavior during the 1.2.4 test, so none of the PVs are migrated.

2. Regarding "check whether your Velero Backup has completed or if it's in progress": I don't understand what you mean. Would you please clarify why we should check it?
@Xin @Derek

We know that after setting "disable_pv_migration=true", the migration-controller pod needs ~2 minutes to restart and pick up the new setting from the MigrationController. I ran the scenarios below.

[scenario 1]
1. Start with the original MigrationController, without the disable_pv_migration setting.
2. Create migration plan "A" for an nginx project that has 2 PVCs.
3. Set "disable_pv_migration=true".
4. Sleep 3 minutes.
5. Trigger migration plan A to migrate the nginx project.

After the migration finishes successfully, the nginx pod on the target cluster stays Pending because the PVCs are missing.

[scenario 2]
1. Start with the original MigrationController, without the disable_pv_migration setting.
2. Create migration plan "A" for an nginx project that has 2 PVCs.
3. Set "disable_pv_migration=true".
4. Trigger migration plan A to migrate the nginx project 10 seconds later.
5. The migration-controller pod restarts while migration A is running.

This time migration A hangs at the "StageRestoreCreated" phase, and the related migplan keeps its persistentVolumes definition forever.

$ oc get migplan mig-plan-33054-ocp-33054-skip-pv-with-data -o yaml
apiVersion: migration.openshift.io/v1alpha1
kind: MigPlan
metadata:
  annotations:
........
  persistentVolumes:
  - capacity: 1Gi
    name: pvc-3a9f4c0f-77b5-48c2-9712-7a67317f905f
    pvc:
      accessModes:
      - ReadWriteOnce
      hasReference: true
      name: nginx-html
      namespace: ocp-33054-skip-pv-with-data
    selection:
      action: copy
      copyMethod: filesystem
      storageClass: managed-premium
    storageClass: managed-premium
    supported:
      actions:
      - copy
      copyMethods:
      - filesystem
      - snapshot
  - capacity: 1Gi
    name: pvc-4fd1bc04-f7d6-4677-bc82-b039d7c33498
    pvc:
      accessModes:
      - ReadWriteOnce
      hasReference: true
      name: nginx-logs
      namespace: ocp-33054-skip-pv-with-data
    selection:
      action: copy
      copyMethod: filesystem
      storageClass: managed-premium
    storageClass: managed-premium
    supported:
      actions:
      - copy
      copyMethods:
      - filesystem
      - snapshot
  srcMigClusterRef:
    name: source-cluster
    namespace: openshift-migration
.....
$ oc get migmigration mig-migration-33054-ocp-33054-skip-pv-with-data -o yaml
apiVersion: migration.openshift.io/v1alpha1
kind: MigMigration
metadata:
  annotations:
    openshift.io/touch: 6e4bb522-d161-11ea-8f10-0a580a80023b
  creationTimestamp: "2020-07-29T05:35:39Z"
.......
status:
  conditions:
  - category: Advisory
    lastTransitionTime: "2020-07-29T05:39:43Z"
    message: 'Step: 21/33'
    reason: StageRestoreCreated
    status: "True"
    type: Running
  - category: Required
    lastTransitionTime: "2020-07-29T05:35:39Z"
    message: The migration is ready.
    status: "True"
    type: Ready
  - category: Advisory
    durable: true
    lastTransitionTime: "2020-07-29T05:37:19Z"
    message: '[1] Stage pods created.'
    status: "True"
    type: StagePodsCreated
  itenerary: Final
  observedDigest: 8094a5bdf42f562ec74d69fa785895db2d23a3f0710f7da0d458818fc9123e5c
  phase: StageRestoreCreated
  startTimestamp: "2020-07-29T05:35:39Z"

I am NOT SURE how to treat this situation: as a bug, or as an invalid user operation. In a Kubernetes cluster, a new API setting may take ~10s to take effect, but migration-controller takes ~2 minutes to pick up a new setting (which may be a little long). If the restart time is reasonable, or cannot be avoided, what is the expected CAM behavior? Do we have any official documentation telling the customer that the migration-controller pod needs ~2 minutes to pick up a new setting? I did not find such a note in the official docs, and this brought me a lot of trouble during my test.
> [scenario 1]
> 1. Start with the original MigrationController, without the disable_pv_migration setting.
> 2. Create migration plan "A" for an nginx project that has 2 PVCs.
> 3. Set "disable_pv_migration=true".
> 4. Sleep 3 minutes.
> 5. Trigger migration plan A to migrate the nginx project.
> After the migration finishes successfully, the nginx pod on the target cluster stays Pending because the PVCs are missing.

This sounds reasonable: you are required to migrate PVCs ahead of time when using disable_pv_migration=true. Not a bug.

> [scenario 2]
> 1. Start with the original MigrationController, without the disable_pv_migration setting.
> 2. Create migration plan "A" for an nginx project that has 2 PVCs.
> 3. Set "disable_pv_migration=true".
> 4. Trigger migration plan A to migrate the nginx project 10 seconds later.
> 5. The migration-controller pod restarts while migration A is running.

This also sounds reasonable. The mig-operator reconcile is long-running, since it runs through hundreds of Ansible tasks. I would say this is not a bug either: waiting ~2 minutes for a settings change after modifying the MigrationController resource is universal, not specific to this new setting.

I think we should treat this as a docs enhancement: tell the user to wait several minutes for the mig-controller pod definition to reflect the new noPV mode. In the long term, if mig-operator reconcile times need performance improvement, we should file an RFE issue against the mig-operator GitHub repo.
Hey Avital,

We need to document that the user must wait for the disable-PV-migration mode to be applied to the mig-controller Pod after the setting is changed on the MigrationController CR. mig-operator takes ~2 minutes to apply this change.
@Derek, please provide a summary of the information Avital will need to document this as a known issue, in the format:

Cause:
Consequence:
Workaround (if any):
Result:
Cause:
1. The user wants to disable PV migration steps in mig-controller and applies "disable_pv_migration": true on the MigrationController CR:

$ oc patch migrationcontroller migration-controller -p '{"spec":{"disable_pv_migration": true } }' --type='merge' -n openshift-migration

2. The user does NOT wait for the mig-controller pod to restart with the new environment variables in place (~1-2 minute wait).
3. The user starts a migration before the mig-controller pod restarts.
4. The mig-controller Pod restarts in the middle of the migration procedure and is flipped over to the new "no PV migration" mode.

Consequence:
The in-progress migration may stall due to the discrepancy between the newly active controller settings and the running migration.

Workaround (if any):
1. Watch the mig-controller pod after setting "disable_pv_migration": true and monitor it for a restart.
2. The mig-controller pod definition should show that persistentvolumeclaims is included in the EXCLUDED_RESOURCES environment variable under .spec.containers[].env.

Result:
Assuming the user waits for the pod restart before beginning a migration, the PVC migration steps are skipped and the migration runs to completion.
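The workaround can be sketched as a small wait loop. This is illustrative only: `get_excluded` is a hypothetical helper, stubbed here with the EXCLUDED_RESOURCES value from this report; on a real cluster you would replace its body with an `oc` query against the migration-controller pod spec.

```shell
# Hypothetical helper: returns the controller's EXCLUDED_RESOURCES value.
# Stubbed with the value observed in this report; on a real cluster,
# replace the echo with an oc query against the migration-controller pod.
get_excluded() {
  echo "imagetags,templateinstances,clusterserviceversions,packagemanifests,subscriptions,servicebrokers,servicebindings,serviceclasses,serviceinstances,serviceplans,persistentvolumes,persistentvolumeclaims"
}

# Poll until the pod reflects the new setting. mig-operator may take
# ~2 minutes to roll the pod, so allow up to ~5 minutes here.
ready=no
for i in $(seq 1 30); do
  if get_excluded | grep -q 'persistentvolumeclaims'; then
    ready=yes
    echo "mig-controller picked up disable_pv_migration; safe to start migrations"
    break
  fi
  sleep 10
done
[ "$ready" = yes ] || echo "timed out waiting for disable_pv_migration to apply"
```

Only start migrations once the loop reports success; starting earlier reproduces the mid-flight restart described in scenario 2.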
LGTM