Bug 1861267 - Docs, known issue: migration stuck at StageRestoreCreated status due to PV discovery and resource reconciliation suspended
Summary: Docs, known issue: migration stuck at StageRestoreCreated status due to PV discovery and resource reconciliation suspended
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Migration Tooling
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.4.z
Assignee: Avital Pinnick
QA Contact: Xin jiang
URL:
Whiteboard:
Depends On: 1861259
Blocks:
 
Reported: 2020-07-28 08:02 UTC by Xin jiang
Modified: 2020-08-04 15:29 UTC
CC: 12 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-04 15:29:05 UTC
Target Upstream Version:
Embargoed:



Description Xin jiang 2020-07-28 08:02:54 UTC
This bug was initially created as a copy of Bug #1861259

I am copying this bug because: 



Description of problem:

Before "disable_pv_migration=true" was enabled, the user created a migplan with PVs. The user then applied "disable_pv_migration=true" on the MigrationController, the migration-controller pod was restarted, and the user tried to migrate the migplan (I am not sure how long I waited, maybe 1 or 2 minutes). The "disable_pv_migration=true" setting does not take effect on existing plans: the migplan still has PVs. The migration gets stuck at StageRestoreCreated status and reports "Limited validation; PV discovery and resource reconciliation suspended".

Version-Release number of selected component (if applicable):
CAM 1.2.4

How reproducible:
Often

Steps to Reproduce:
1. Apply "disable_pv_migration": true on the MigrationController
$ oc patch migrationcontroller migration-controller -p '{"spec":{"disable_pv_migration": true } }' --type='merge' -n openshift-migration

2. Check if the migration-controller is restarted
$ oc get pod -n openshift-migration --watch
NAME                                                           READY   STATUS      RESTARTS   AGE
migration-controller-586c9688b8-6wftl                          2/2     Running     0          3h59m
migration-operator-68bdbf56f7-p67zv                            2/2     Running     0          4h1m
migration-ui-65f66946c4-tlt62                                  1/1     Running     0          3h59m
registry-13a03a04-a1a2-4ec8-ae26-61670668c1b8-bxxnt-1-deploy   0/1     Completed   0          37m
registry-13a03a04-a1a2-4ec8-ae26-61670668c1b8-bxxnt-2-d5g94    1/1     Running     0          35m
registry-13a03a04-a1a2-4ec8-ae26-61670668c1b8-bxxnt-2-deploy   0/1     Completed   0          35m
registry-8e89ec15-a682-4e8e-a522-c97c66c413bd-zjf4g-1-deploy   0/1     Completed   0          6m17s
registry-8e89ec15-a682-4e8e-a522-c97c66c413bd-zjf4g-1-djt55    1/1     Running     0          6m14s
restic-fhq4n                                                   1/1     Running     0          4h
restic-hqcgt                                                   1/1     Running     0          4h
restic-pgb7l                                                   1/1     Running     0          4h
velero-5dfcd8d7c9-pcdc6                                        1/1     Running     0          4h
migration-controller-5546545568-ptmh2                          0/2     Pending     0          0s
migration-controller-5546545568-ptmh2                          0/2     Pending     0          0s
migration-controller-5546545568-ptmh2                          0/2     ContainerCreating   0          0s
migration-controller-5546545568-ptmh2                          0/2     ContainerCreating   0          2s
migration-controller-5546545568-ptmh2                          2/2     Running             0          6s
migration-controller-586c9688b8-6wftl                          2/2     Terminating         0          4h
migration-controller-586c9688b8-6wftl                          0/2     Terminating         0          4h
migration-controller-586c9688b8-6wftl                          0/2     Terminating         0          4h
migration-controller-586c9688b8-6wftl                          0/2     Terminating         0          4h

3. Check that "disable_pv_migration=true" is applied on the MigrationController
$ oc get migrationcontrollers -n openshift-migration -o yaml | grep disable
          f:disable_pv_migration: {}
    disable_pv_migration: true

$ oc get pod -n openshift-migration migration-controller-5546545568-ptmh2 -o yaml | grep EXCLUDED_RESOURCES -A1
              k:{"name":"EXCLUDED_RESOURCES"}:
                .: {}
--
    - name: EXCLUDED_RESOURCES
      value: imagetags,templateinstances,clusterserviceversions,packagemanifests,subscriptions,servicebrokers,servicebindings,serviceclasses,serviceinstances,serviceplans,persistentvolumes,persistentvolumeclaims

4. Execute the migplan before the controller updates it to remove the PVs

5. The migration is stuck at StageRestoreCreated status
$ oc get migplan -n openshift-migration mysql -o yaml
apiVersion: migration.openshift.io/v1alpha1
kind: MigPlan
metadata:
  annotations:
    openshift.io/touch: 0047956e-d0a3-11ea-8b60-0a580a830033
  creationTimestamp: "2020-07-28T07:00:19Z"
  generation: 6
  managedFields:
  - apiVersion: migration.openshift.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        .: {}
        f:destMigClusterRef:
          .: {}
          f:name: {}
          f:namespace: {}
        f:migStorageRef:
          .: {}
          f:name: {}
          f:namespace: {}
        f:namespaces: {}
        f:srcMigClusterRef:
          .: {}
          f:name: {}
          f:namespace: {}
    manager: Mozilla
    operation: Update
    time: "2020-07-28T07:01:06Z"
  - apiVersion: migration.openshift.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:openshift.io/touch: {}
      f:spec:
        f:persistentVolumes: {}
      f:status:
        .: {}
        f:conditions: {}
        f:excludedResources: {}
        f:observedDigest: {}
    manager: manager
    operation: Update
    time: "2020-07-28T07:21:49Z"
  name: mysql
  namespace: openshift-migration
  resourceVersion: "159947"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migplans/mysql
  uid: 8e89ec15-a682-4e8e-a522-c97c66c413bd
spec:
  destMigClusterRef:
    name: host
    namespace: openshift-migration
  migStorageRef:
    name: automatic
    namespace: openshift-migration
  namespaces:
  - mysql
  persistentVolumes:
  - capacity: 1Gi
    name: pvc-247ac434-d09a-11ea-bd2a-fa163e9c1ec4
    pvc:
      accessModes:
      - ReadWriteOnce
      hasReference: true
      name: mysql
      namespace: mysql
    selection:
      action: copy
      copyMethod: filesystem
      storageClass: standard
    storageClass: standard
    supported:
      actions:
      - copy
      copyMethods:
      - filesystem
      - snapshot
  srcMigClusterRef:
    name: source-cluster
    namespace: openshift-migration
status:
  conditions:
  - category: Required
    lastTransitionTime: "2020-07-28T07:00:24Z"
    message: The `persistentVolumes` list has been updated with discovered PVs.
    reason: Done
    status: "True"
    type: PvsDiscovered
  - category: Required
    lastTransitionTime: "2020-07-28T07:00:26Z"
    message: The storage resources have been created.
    status: "True"
    type: StorageEnsured
  - category: Required
    lastTransitionTime: "2020-07-28T07:00:28Z"
    message: The migration registry resources have been created.
    status: "True"
    type: RegistriesEnsured
  - category: Required
    lastTransitionTime: "2020-07-28T07:00:28Z"
    message: The migration plan is ready.
    status: "True"
    type: Ready
  - category: Advisory
    lastTransitionTime: "2020-07-28T07:01:49Z"
    message: Limited validation; PV discovery and resource reconciliation suspended.
    status: "True"
    type: Suspended
  excludedResources:
  - imagetags
  - templateinstances
  - clusterserviceversions
  - packagemanifests
  - subscriptions
  - servicebrokers
  - servicebindings
  - serviceclasses
  - serviceinstances
  - serviceplans
  - persistentvolumes
  - persistentvolumeclaims
  observedDigest: 605f525329aa790bd0099ed481f27526ddd0f3aae7a906e823766663fafae209

6. During the migration (almost 13 minutes), the migplan is still not updated; you can also see this in the step 5 output above.
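For reference, one quick way to spot-check whether the controller cleared the plan's PV list is a jsonpath query (a sketch; the plan name and jsonpath expression are just the ones from this reproducer):

$ oc get migplan mysql -n openshift-migration -o jsonpath='{.spec.persistentVolumes[*].name}'
pvc-247ac434-d09a-11ea-bd2a-fa163e9c1ec4

Here it still prints the PV discovered earlier, so the plan was not updated.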

Actual results:
The migplan is stuck at StageRestoreCreated

Expected results:
I am not sure what the expected results are. Maybe the controller should re-validate the migplan before executing the migration.

Additional info:

Comment 2 Derek Whatley 2020-07-28 20:47:02 UTC
We may have a bug in https://github.com/konveyor/mig-controller/pull/592 . 

@Scott, what is the intended behavior in this case? I'm not sure we considered the case where pre-existing plans already have PVCs discovered.

I don't see any notes on #592 regarding expected behavior, although it looks like we set PvsDiscovered to true to make the plan "Ready" when PVCs are excluded from the MigPlan via the mig-operator option.

@Xin, can you check whether your Velero Backup has completed or if it's in progress? A must-gather dump could also be helpful here. https://github.com/konveyor/must-gather
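
(For reference, a rough sketch of those checks; the backup name is a placeholder, and the must-gather image tag should be whatever the konveyor/must-gather README currently lists:)

$ oc get backups.velero.io -n openshift-migration                                     # list Velero Backups created by the migration
$ oc get backups.velero.io <backup-name> -n openshift-migration -o jsonpath='{.status.phase}'   # e.g. InProgress / Completed
$ oc adm must-gather --image quay.io/konveyor/must-gather:latest                      # collect a must-gather dump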

Comment 3 Derek Whatley 2020-07-28 20:47:55 UTC
Regarding the "resource reconciliation suspended" message, this is normal and expected during a migration. We don't run reconciliation on the MigPlan while the migration runs so that the plan isn't changed mid-flight.
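
(If it helps, a quick way to see just that condition on the plan; the jsonpath filter is only an example:)

$ oc get migplan mysql -n openshift-migration -o jsonpath='{.status.conditions[?(@.type=="Suspended")].message}'
Limited validation; PV discovery and resource reconciliation suspended.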

Comment 4 Xin jiang 2020-07-29 01:52:39 UTC
@Derek,

1. When CAM enables "disable_pv_migration=true", it does remove spec.persistentVolumes for pre-existing plans that already have PVCs discovered. We saw this behavior during 1.2.4 testing, so it will not migrate any of the PVs.
2. Regarding "check whether your Velero Backup has completed or if it's in progress": I don't understand what you mean. Would you please clarify why I should check it?

Comment 5 whu 2020-07-29 06:37:01 UTC
@xin @Derek

We know that after we set "disable_pv_migration=true", the migration-controller pod needs about 2 minutes to restart and pick up the new setting from the "migrationcontroller" resource.

I have run the scenario tests below.

[scenario 1]

1. Start with the original MigrationController, without the disable_pv_migration setting.
2. Create migration plan "A" for an nginx project that has 2 PVCs.
3. Set disable_pv_migration=true.
4. Sleep 3 minutes.
5. Trigger migration plan A to migrate the nginx project.
After the migration finishes successfully, the nginx pod on the target cluster stays Pending because the PVCs are missing.



[scenario 2]

1. Start with the original MigrationController, without the disable_pv_migration setting.
2. Create migration plan "A" for an nginx project that has 2 PVCs.
3. Set disable_pv_migration=true.
4. Trigger migration plan A to migrate the nginx project 10 seconds later.
5. The migration-controller pod restarts while migration A is running.

This time migration A hangs at the "StageRestoreCreated" phase.
The related migplan keeps its persistentVolumes definition forever.
$ oc get migplan  mig-plan-33054-ocp-33054-skip-pv-with-data  -o yaml
apiVersion: migration.openshift.io/v1alpha1
kind: MigPlan
metadata:
  annotations:
   ........
  persistentVolumes:
  - capacity: 1Gi
    name: pvc-3a9f4c0f-77b5-48c2-9712-7a67317f905f
    pvc:
      accessModes:
      - ReadWriteOnce
      hasReference: true
      name: nginx-html
      namespace: ocp-33054-skip-pv-with-data
    selection:
      action: copy
      copyMethod: filesystem
      storageClass: managed-premium
    storageClass: managed-premium
    supported:
      actions:
      - copy
      copyMethods:
      - filesystem
      - snapshot
  - capacity: 1Gi
    name: pvc-4fd1bc04-f7d6-4677-bc82-b039d7c33498
    pvc:
      accessModes:
      - ReadWriteOnce
      hasReference: true
      name: nginx-logs
      namespace: ocp-33054-skip-pv-with-data
    selection:
      action: copy
      copyMethod: filesystem
      storageClass: managed-premium
    storageClass: managed-premium
    supported:
      actions:
      - copy
      copyMethods:
      - filesystem
      - snapshot
  srcMigClusterRef:
    name: source-cluster
    namespace: openshift-migration
   .....


$ oc get migmigration mig-migration-33054-ocp-33054-skip-pv-with-data -o yaml
apiVersion: migration.openshift.io/v1alpha1
kind: MigMigration
metadata:
  annotations:
    openshift.io/touch: 6e4bb522-d161-11ea-8f10-0a580a80023b
  creationTimestamp: "2020-07-29T05:35:39Z"
 .......
status:
  conditions:
  - category: Advisory
    lastTransitionTime: "2020-07-29T05:39:43Z"
    message: 'Step: 21/33'
    reason: StageRestoreCreated
    status: "True"
    type: Running
  - category: Required
    lastTransitionTime: "2020-07-29T05:35:39Z"
    message: The migration is ready.
    status: "True"
    type: Ready
  - category: Advisory
    durable: true
    lastTransitionTime: "2020-07-29T05:37:19Z"
    message: '[1] Stage pods created.'
    status: "True"
    type: StagePodsCreated
  itenerary: Final
  observedDigest: 8094a5bdf42f562ec74d69fa785895db2d23a3f0710f7da0d458818fc9123e5c
  phase: StageRestoreCreated
  startTimestamp: "2020-07-29T05:35:39Z"


NOT SURE how to treat this situation: should it be treated as a bug, or as an invalid user operation?

In a k8s cluster, a new API setting usually takes effect within about 10 seconds, but "migration-controller" takes about 2 minutes to pick up a new setting (maybe that is a little long).
If the restart time is reasonable, or cannot be avoided, what is the expected CAM behavior?

Do we have any official documentation telling customers that the migration-controller pod needs about 2 minutes to pick up a new setting? I did not find such a note in the official docs. This brought me a lot of trouble during my testing.
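
(A possible way to wait for the new settings instead of sleeping a fixed time, assuming migration-controller is an ordinary Deployment, as the ReplicaSet-style pod names above suggest; note the operator itself may still take ~2 minutes before it updates the Deployment:)

$ oc rollout status deployment/migration-controller -n openshift-migration
$ oc get pod -n openshift-migration <new-migration-controller-pod> -o yaml | grep EXCLUDED_RESOURCES -A1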

Comment 6 Derek Whatley 2020-07-30 15:16:25 UTC
> [scenario 1]
> 
> 1. Start with the original MigrationController, without the disable_pv_migration setting.
> 2. Create migration plan "A" for an nginx project that has 2 PVCs.
> 3. Set disable_pv_migration=true.
> 4. Sleep 3 minutes.
> 5. Trigger migration plan A to migrate the nginx project.
> After the migration finishes successfully, the nginx pod on the target cluster stays Pending because the PVCs are missing.

This sounds reasonable, as you are required to migrate PVCs ahead of time when using disable_pv_migration=true. Not a bug.


> [scenario 2]

> 1. Start with the original MigrationController, without the disable_pv_migration setting.
> 2. Create migration plan "A" for an nginx project that has 2 PVCs.
> 3. Set disable_pv_migration=true.
> 4. Trigger migration plan A to migrate the nginx project 10 seconds later.
> 5. The migration-controller pod restarts while migration A is running.

This also sounds reasonable. The mig-operator reconcile is long-running, as it works through hundreds of Ansible tasks on each pass. I would also say this is not a bug. Waiting ~2 minutes for a settings change to take effect after modifying the MigrationController resource is universal, not just related to this new setting.

I think we should treat this as requiring a docs enhancement that tells the user to wait several minutes for the mig-controller pod definition to reflect the new no-PV mode. In the long term, if mig-operator reconcile times need performance work, we should file an RFE issue against the mig-operator GitHub repo.

Comment 7 Derek Whatley 2020-07-30 15:23:36 UTC
Hey Avital,

We need to document that the user must wait for the disable-PV-migration mode to be applied to the mig-controller pod after the setting change is made to the MigrationController CR. mig-operator takes ~2 minutes to apply this change.

Comment 9 John Matthews 2020-07-31 16:55:40 UTC
@derek please provide a summary of information Avital will need to document as a known issue in format of:



Cause: 

Consequence: 

Workaround (if any): 

Result:

Comment 10 Derek Whatley 2020-07-31 20:09:34 UTC
Cause: 

1. User wants to disable PV migration steps in mig-controller and applies "disable_pv_migration": true on the MigrationController CR
$ oc patch migrationcontroller migration-controller -p '{"spec":{"disable_pv_migration": true } }' --type='merge' -n openshift-migration

2. User does NOT wait for the mig-controller pod to restart with new environment variables in place (~1-2 minute wait)

3. User starts a migration before mig-controller pod restarts

4. The mig-controller pod restarts in the middle of the migration procedure and flips over to the new "no PV migration" mode

Consequence: 

5. The in-progress migration may stall due to the discrepancy between the newly active controller settings and the running migration

Workaround (if any): 

6. Watch the mig-controller pod after setting "disable_pv_migration": true and monitor it for a restart (see the example commands after the Result section below).

7. The mig-controller pod definition should show that .spec.containers.env.EXCLUDED_RESOURCES includes persistentvolumeclaims  

Result:

8. Assuming the user waits for the pod restart before beginning a migration, PVC migration steps should be skipped and the migration should run to completion.
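
For the docs, the verification could look something like this (a sketch reusing the commands from the reproducer above; the pod name is a placeholder and the output is abbreviated):

$ oc get pods -n openshift-migration --watch
$ oc get pod <migration-controller-pod> -n openshift-migration -o yaml | grep EXCLUDED_RESOURCES -A1
    - name: EXCLUDED_RESOURCES
      value: ...,persistentvolumes,persistentvolumeclaims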

Comment 13 Xin jiang 2020-08-03 09:30:48 UTC
LGTM

