Bug 1861267
| Summary: | Docs, known issue: migration stuck at StageRestoreCreated status due to PV discovery and resource reconciliation suspended | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Xin jiang <xjiang> |
| Component: | Migration Tooling | Assignee: | Avital Pinnick <apinnick> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | Xin jiang <xjiang> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.4 | CC: | alpatel, apinnick, dwhatley, ernelson, jmatthew, jmontleo, jortel, mberube, pgaikwad, rjohnson, sseago, whu |
| Target Milestone: | --- | ||
| Target Release: | 4.4.z | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Known Issue | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-08-04 15:29:05 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1861259 | ||
| Bug Blocks: | |||
Description
Xin jiang
2020-07-28 08:02:54 UTC
We may have a bug in https://github.com/konveyor/mig-controller/pull/592 . @Scott, what is the intended behavior in this case? I'm not sure we considered the case where pre-existing plans already have PVCs discovered. I don't see any notes on #592 regarding expected behavior, although it looks like we set PvsDiscovered to true to make the plan "Ready" when PVCs are excluded from the MigPlan via the mig-operator option.

@Xin, can you check whether your Velero Backup has completed or whether it is still in progress? A must-gather dump could also be helpful here: https://github.com/konveyor/must-gather

Regarding the "resource reconciliation suspended" message: this is normal and expected during a migration. We do not reconcile the MigPlan while the migration runs, so that the plan is not changed mid-flight.

@Derek,
1. When CAM is configured with "disable_pv_migration=true", it does remove spec.persistentVolumes from pre-existing plans that already have PVCs discovered. We saw this behavior during the 1.2.4 test, so none of the PVs are migrated.
2. Regarding "check whether your Velero Backup has completed or if it's in progress": I don't understand what you mean. Would you please clarify why I should check it?

@xin @Derek
We know that after setting "disable_pv_migration=true", the migration-controller pod needs ~2 minutes to restart and pick up the new setting from the "migrationcontroller" resource.
I ran the following test scenarios.
[scenario 1]
1. Start with the original migrationcontroller, without the disable_pv_migration setting.
2. Create migration plan "A" for an nginx project that has 2 PVCs.
3. Set "disable_pv_migration=true".
4. Sleep 3 minutes.
5. Trigger migration plan A to migrate the nginx project.
After the migration finishes successfully, the nginx pod on the target cluster stays Pending because its PVCs are missing.
[scenario 2]
1. Start with the original migrationcontroller, without the disable_pv_migration setting.
2. Create migration plan "A" for an nginx project that has 2 PVCs.
3. Set "disable_pv_migration=true".
4. Trigger migration plan A to migrate the nginx project 10 seconds later.
5. The migration-controller pod restarts while migration A is running.
This time migration A hangs at the "StageRestoreCreated" phase,
and the related MigPlan keeps its persistentVolumes definition forever.
$ oc get migplan mig-plan-33054-ocp-33054-skip-pv-with-data -o yaml
apiVersion: migration.openshift.io/v1alpha1
kind: MigPlan
metadata:
  annotations:
    ........
persistentVolumes:
- capacity: 1Gi
  name: pvc-3a9f4c0f-77b5-48c2-9712-7a67317f905f
  pvc:
    accessModes:
    - ReadWriteOnce
    hasReference: true
    name: nginx-html
    namespace: ocp-33054-skip-pv-with-data
  selection:
    action: copy
    copyMethod: filesystem
    storageClass: managed-premium
  storageClass: managed-premium
  supported:
    actions:
    - copy
    copyMethods:
    - filesystem
    - snapshot
- capacity: 1Gi
  name: pvc-4fd1bc04-f7d6-4677-bc82-b039d7c33498
  pvc:
    accessModes:
    - ReadWriteOnce
    hasReference: true
    name: nginx-logs
    namespace: ocp-33054-skip-pv-with-data
  selection:
    action: copy
    copyMethod: filesystem
    storageClass: managed-premium
  storageClass: managed-premium
  supported:
    actions:
    - copy
    copyMethods:
    - filesystem
    - snapshot
srcMigClusterRef:
  name: source-cluster
  namespace: openshift-migration
.....
$ oc get migmigration mig-migration-33054-ocp-33054-skip-pv-with-data -o yaml
apiVersion: migration.openshift.io/v1alpha1
kind: MigMigration
metadata:
  annotations:
    openshift.io/touch: 6e4bb522-d161-11ea-8f10-0a580a80023b
  creationTimestamp: "2020-07-29T05:35:39Z"
  .......
status:
  conditions:
  - category: Advisory
    lastTransitionTime: "2020-07-29T05:39:43Z"
    message: 'Step: 21/33'
    reason: StageRestoreCreated
    status: "True"
    type: Running
  - category: Required
    lastTransitionTime: "2020-07-29T05:35:39Z"
    message: The migration is ready.
    status: "True"
    type: Ready
  - category: Advisory
    durable: true
    lastTransitionTime: "2020-07-29T05:37:19Z"
    message: '[1] Stage pods created.'
    status: "True"
    type: StagePodsCreated
  itenerary: Final
  observedDigest: 8094a5bdf42f562ec74d69fa785895db2d23a3f0710f7da0d458818fc9123e5c
  phase: StageRestoreCreated
  startTimestamp: "2020-07-29T05:35:39Z"
I am NOT SURE how to treat this situation: should we treat it as a bug, or as an invalid user operation?
In a k8s cluster, a new API setting may take ~10 seconds to take effect, but the migration-controller takes ~2 minutes to pick up a new setting (maybe that is a little long).
If the restart time is reasonable, or cannot be avoided, what is the expected CAM behavior?
Do we have any official documentation telling customers that the migration-controller pod needs ~2 minutes to pick up a new setting? I did not find such a note in the official docs, and this brought me a lot of trouble during my testing.
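Rather than sleeping a fixed 2-3 minutes, the restart can be detected from a script. A minimal sketch, assuming the mig-controller pod carries the label selector `control-plane=controller-manager` and that a ~5 minute budget is enough (both are assumptions, not from this report; verify against your install):

```shell
# Sketch: wait for mig-operator to replace the mig-controller pod after a
# MigrationController settings change, instead of sleeping a fixed interval.
NS=openshift-migration
SELECTOR=control-plane=controller-manager   # assumed label; verify on your cluster

# Remember which pod is currently serving.
OLD=$(oc get pod -n "$NS" -l "$SELECTOR" -o name)

oc patch migrationcontroller migration-controller -n "$NS" --type=merge \
  -p '{"spec":{"disable_pv_migration": true}}'

# Poll until the operator has replaced the pod (its name changed).
for i in $(seq 1 30); do
  NEW=$(oc get pod -n "$NS" -l "$SELECTOR" -o name)
  [ -n "$NEW" ] && [ "$NEW" != "$OLD" ] && break
  sleep 10
done

# Then wait for the new pod to become Ready before starting a migration.
oc wait -n "$NS" "$NEW" --for=condition=Ready --timeout=120s
```

The pod-name comparison is the key step: the operator recreates the pod rather than updating it in place, so a changed name signals that the new environment variables are in effect.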
> [scenario 1]
>
> 1. Start with the original migrationcontroller, without the disable_pv_migration setting.
> 2. Create migration plan "A" for an nginx project that has 2 PVCs.
> 3. Set "disable_pv_migration=true".
> 4. Sleep 3 minutes.
> 5. Trigger migration plan A to migrate the nginx project.
> After the migration finishes successfully, the nginx pod on the target cluster stays Pending because its PVCs are missing.

This sounds reasonable: when using disable_pv_migration=true, you are required to migrate the PVCs ahead of time. Not a bug.

> [scenario 2]
>
> 1. Start with the original migrationcontroller, without the disable_pv_migration setting.
> 2. Create migration plan "A" for an nginx project that has 2 PVCs.
> 3. Set "disable_pv_migration=true".
> 4. Trigger migration plan A to migrate the nginx project 10 seconds later.
> 5. The migration-controller pod restarts while migration A is running.

This also sounds reasonable. The reconcile for mig-operator is long-running, as it runs through hundreds of Ansible tasks. I would say this is not a bug either. Waiting ~2 minutes for a settings change after modifying the MigrationController resource is universal, not specific to this new setting.

I think we should treat this as a docs enhancement: tell the user to wait several minutes for the mig-controller pod definition to reflect the new noPV mode. In the long term, if mig-operator reconcile times need performance enhancement, we should file an RFE issue against the mig-operator GitHub repo.

Hey Avital,

We need to document that the user must wait for the disable-PV-migration mode to be applied to the mig-controller pod after the setting change is made to the MigrationController CR. mig-operator takes ~2 minutes to apply this change.

@derek, please provide a summary of the information Avital will need to document this as a known issue, in the format:
Cause:
Consequence:
Workaround (if any):
Result:

Cause:
1. The user wants to disable the PV migration steps in mig-controller and applies "disable_pv_migration": true to the MigrationController CR:
$ oc patch migrationcontroller migration-controller -p '{"spec":{"disable_pv_migration": true } }' --type='merge' -n openshift-migration
2. The user does NOT wait for the mig-controller pod to restart with the new environment variables in place (~1-2 minute wait).
3. The user starts a migration before the mig-controller pod restarts.
4. The mig-controller pod restarts in the middle of the migration procedure and flips over to the new "no PV migration" mode.
Consequence:
5. The in-progress migration may stall due to the discrepancy between the newly active controller settings and the running migration.
Workaround (if any):
6. Watch the mig-controller pod after setting "disable_pv_migration": true and monitor it for a restart.
7. The mig-controller pod definition should show that .spec.containers.env.EXCLUDED_RESOURCES includes persistentvolumeclaims.
Result:
8. Assuming the user waits for the pod restart before beginning a migration, the PVC migration steps are skipped and the migration runs to completion.
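The EXCLUDED_RESOURCES check in workaround step 7 above can be scripted. A sketch, assuming the controller is the first container in the pod, the pod carries the label `control-plane=controller-manager`, and the variable is a comma-separated list (all three are assumptions; verify against your deployment):

```shell
# Sketch: confirm the restarted mig-controller pod now excludes PVCs.
NS=openshift-migration
POD=$(oc get pod -n "$NS" -l control-plane=controller-manager -o name | head -n1)  # assumed label

# Read the EXCLUDED_RESOURCES env var from the controller container.
EXCLUDED=$(oc get -n "$NS" "$POD" \
  -o jsonpath='{.spec.containers[0].env[?(@.name=="EXCLUDED_RESOURCES")].value}')

# The "no PV migration" mode is active once persistentvolumeclaims appears.
case "$EXCLUDED" in
  *persistentvolumeclaims*) echo "no-PV mode active; safe to start a migration" ;;
  *)                        echo "setting not picked up yet; keep waiting" ;;
esac
```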
LGTM