Bug 1784899 - MigPlan data lost after failed migration
Summary: MigPlan data lost after failed migration
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Migration Tooling
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.4.z
Assignee: Jeff Ortel
QA Contact: Sergio
Docs Contact: Avital Pinnick
URL:
Whiteboard:
Depends On: 1831616
Blocks:
Reported: 2019-12-18 15:53 UTC by Sergio
Modified: 2020-07-27 16:03 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned As: 1831616 (view as bug list)
Environment:
Last Closed: 2020-07-27 16:03:30 UTC
Target Upstream Version:
Embargoed:



Description Sergio 2019-12-18 15:53:17 UTC
Description of problem:
When a migration fails, the persistent volume selections made when the migration plan was created are erased. As a result, the next time the migration is run (once the problem that caused the failure has been fixed), it uses the default values instead of the information the user provided when creating the plan.

Version-Release number of selected component (if applicable):
TARGET:
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-12-14-230621   True        False         22h     Error while reconciling 4.2.0-0.nightly-2019-12-14-230621: an unknown error has occurre

SOURCE:
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-12-14-230621   True        False         3h40m   Cluster version is 4.2.0-0.nightly-2019-12-14-23062

Controller version 1.0.1 in osbs registry:
    image: image-registry.openshift-image-registry.svc:5000/rhcam-1-0/openshift-migration-controller-rhel8@sha256:e64551a1dd77021ce9c6bf7c01cd184edd7f163c5c9489bb615845eadb867dc7

Migration UI version 1.0.1 in osbs registry:
    image: image-registry.openshift-image-registry.svc:5000/rhcam-1-0/openshift-migration-ui-rhel8@sha256:3b84f2053fb58d42771e1a8aece8037ed39f863c5349b89d313000ba7d905641


How reproducible:
Always

Steps to Reproduce:
1. Create a migration plan with PVCs, changing the default values in the migration plan creation screens (for instance, select "snapshot" or a different destination storage class; just change something from the defaults).
2. Check the migration plan; the "persistentVolumes" section shows the values you selected:

oc get migplan -o yaml -n openshift-migration

3. Run the migration plan and force it to fail.
4. Check the migration plan after the failure: the "persistentVolumes" section has disappeared (a jsonpath one-liner for this check is sketched after this list).
5. Run the migration plan again.
6. Check the migration plan again: the values in the "persistentVolumes" section used by this second migration are the defaults, not the ones you selected when creating the plan.
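
For reference, a hedged sketch of a one-liner to inspect just the PV selections in the plan before and after the failure (the plan name and namespace here are taken from the example further down; substitute your own):

$ oc get migplan tobefailed-noui -n openshift-migration \
    -o jsonpath='{.spec.persistentVolumes[*].selection}'

Before the failure this prints the selected action/copyMethod/storageClass for each PV; after the failure it prints nothing, because the section is gone.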

Actual results:
The migration executed after the first migration failed uses the default data, not the data selected when the migration plan was created.


Expected results:
When a failed migration plan is run again, the second execution should use the same data that was used when the plan was created.


Additional info:

With the UI disabled (in order to rule out a UI problem):
$ oc get deployments
NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
migration-controller   1/1     1            1           3h7m
migration-operator     1/1     1            1           3h21m
migration-ui           0/0     0            0           3h7m
velero                 1/1     1            1           3h7m
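
(How the UI was disabled is not shown above; as a hedged sketch, one way is to scale the UI deployment down, keeping in mind that the migration-operator may reconcile it back up depending on the MigrationController settings:)

$ oc scale deployment migration-ui -n openshift-migration --replicas=0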

This is the original data of the migration plan before the failure:

apiVersion: migration.openshift.io/v1alpha1
kind: MigPlan
metadata:
  annotations:
    touch: 806e8b50-40eb-4ebe-9ef6-72fd855d2578
  creationTimestamp: "2019-12-18T13:41:10Z"
  generation: 4
  name: tobefailed-noui
  namespace: openshift-migration
  resourceVersion: "581523"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migplans/tobefailed-noui
  uid: 0ccb7eae-219c-11ea-8ff9-42010a000004
spec:
  destMigClusterRef:
    name: host
    namespace: openshift-migration
  migStorageRef:
    name: gcp
    namespace: openshift-migration
  namespaces:
  - tobefailed-noui
  persistentVolumes:
  - capacity: 1Gi
    name: pvc-df9a78ae-219b-11ea-9836-42010a000005
    pvc:
      accessModes:
      - ReadWriteOnce
      name: nginx-logs
      namespace: tobefailed-noui
    selection:
      action: copy
      copyMethod: snapshot
      storageClass: standard
    storageClass: standard
    supported:
      actions:
      - copy
      copyMethods:
      - filesystem
      - snapshot
  - capacity: 1Gi
    name: pvc-dfab0dc3-219b-11ea-9836-42010a000005
    pvc:
      accessModes:
      - ReadWriteOnce
      name: nginx-html
      namespace: tobefailed-noui
    selection:
      action: copy
      copyMethod: snapshot
      storageClass: standard
    storageClass: standard
    supported:
      actions:
      - copy
      copyMethods:
      - filesystem
      - snapshot
  srcMigClusterRef:
    name: gcp42
    namespace: openshift-migration
status:
  conditions:
  - category: Required
    lastTransitionTime: "2019-12-18T13:41:14Z"
    message: The `persistentVolumes` list has been updated with discovered PVs.
    reason: Done
    status: "True"
    type: PvsDiscovered
  - category: Required
    lastTransitionTime: "2019-12-18T13:41:15Z"
    message: The storage resources have been created.
    status: "True"
    type: StorageEnsured
  - category: Required
    lastTransitionTime: "2019-12-18T13:41:16Z"
    message: The migration registry resources have been created.
    status: "True"
    type: RegistriesEnsured
  - category: Required
    lastTransitionTime: "2019-12-18T13:41:16Z"
    message: The migration plan is ready.
    status: "True"
    type: Ready
  - category: Warn
    lastTransitionTime: "2019-12-18T13:41:39Z"
    message: CopyMethod for PV in `persistentVolumes` [pvc-df9a78ae-219b-11ea-9836-42010a000005,pvc-dfab0dc3-219b-11ea-9836-42010a000005]
      is set to `snapshot`. Make sure that the chosen storage class is compatible
      with the source volume's storage type for Snapshot support.
    status: "True"
    type: PvWarnCopyMethodSnapshot



This is the state of the plan after the migration failure (the persistentVolumes section has been erased):

apiVersion: migration.openshift.io/v1alpha1
kind: MigPlan
metadata:
  annotations:
    touch: b2459512-dc4b-42ee-ad10-79b889d657d5
  creationTimestamp: "2019-12-18T13:41:10Z"
  generation: 6
  name: tobefailed-noui
  namespace: openshift-migration
  resourceVersion: "582937"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migplans/tobefailed-noui
  uid: 0ccb7eae-219c-11ea-8ff9-42010a000004
spec:
  destMigClusterRef:
    name: host
    namespace: openshift-migration
  migStorageRef:
    name: gcp
    namespace: openshift-migration
  namespaces:
  - tobefailed-noui
  srcMigClusterRef:
    name: gcp42
    namespace: openshift-migration
status:
  conditions:
  - category: Required
    lastTransitionTime: "2019-12-18T13:41:14Z"
    message: The `persistentVolumes` list has been updated with discovered PVs.
    reason: Done
    status: "True"
    type: PvsDiscovered
  - category: Required
    lastTransitionTime: "2019-12-18T13:41:15Z"
    message: The storage resources have been created.
    status: "True"
    type: StorageEnsured
  - category: Required
    lastTransitionTime: "2019-12-18T13:41:16Z"
    message: The migration registry resources have been created.
    status: "True"
    type: RegistriesEnsured
  - category: Required
    lastTransitionTime: "2019-12-18T13:41:16Z"
    message: The migration plan is ready.
    status: "True"
    type: Ready

Comment 1 Sergio 2019-12-19 13:28:28 UTC
This is the MigMigration that failed, so that we can track the phase where it failed (StageBackupFailed).


apiVersion: migration.openshift.io/v1alpha1
kind: MigMigration
metadata:
  annotations:
    touch: a2b7832b-c2c0-4fe6-a09e-0292c6cfae0c
  creationTimestamp: "2019-12-18T13:44:26Z"
  generation: 14
  name: trying-without-using-ui
  namespace: openshift-migration
  ownerReferences:
  - apiVersion: migration.openshift.io/v1alpha1
    kind: MigPlan
    name: tobefailed-noui
    uid: 0ccb7eae-219c-11ea-8ff9-42010a000004
  resourceVersion: "582790"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migmigrations/trying-without-using-ui
  uid: 81a80b72-219c-11ea-9c4c-42010a000006
spec:
  migPlanRef:
    name: tobefailed-noui
    namespace: openshift-migration
  quiescePods: true
  stage: false
status:
  conditions:
  - category: Advisory
    durable: true
    lastTransitionTime: "2019-12-18T13:45:24Z"
    message: 'The migration has failed.  See: Errors.'
    reason: StageBackupFailed
    status: "True"
    type: Failed
  errors:
  - 'Backup: openshift-migration/trying-without-using-ui-wwb9h partially failed.'
  phase: Completed
  startTimestamp: "2019-12-18T13:44:26Z"
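
A hedged sketch for pulling out just the failure reason without dumping the whole resource (the migration name is taken from the YAML above):

$ oc get migmigration trying-without-using-ui -n openshift-migration \
    -o jsonpath='{.status.conditions[?(@.type=="Failed")].reason}'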

Comment 2 Jeff Ortel 2020-01-13 21:52:00 UTC
The failed migration has quiescePods: true. When the migration fails, the plan is un-suspended, which resumes PV discovery. Since the pods have been scaled down, the PVs are no longer found during discovery and are removed from the list.
This is working as designed.
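
As a quick way to confirm the two facts this explanation relies on (sketch only; names are taken from the examples above), something like:

$ oc get migmigration trying-without-using-ui -n openshift-migration \
    -o jsonpath='{.spec.quiescePods}'
$ oc get deployment -n tobefailed-noui    # quiesced deployments report 0 ready replicas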

Comment 3 Jeff Ortel 2020-01-13 23:18:19 UTC
To remedy this, the user will need to scale the pods back up. Once the application pods are up and running, the PV list will be repopulated by the controller's discovery. Unfortunately, the user's original choices will be gone.
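
A hedged workaround sketch (the plan and namespace names are taken from the examples above; the application deployment name is a placeholder): keep a copy of the plan before migrating so the original selections are not lost, and scale the quiesced application back up after the failure so discovery can repopulate the list:

$ oc get migplan tobefailed-noui -n openshift-migration -o yaml > migplan-backup.yaml
$ oc scale deployment <application-deployment> -n tobefailed-noui --replicas=1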

For now, let's document this as a known issue.

Comment 7 John Matthews 2020-07-27 16:03:30 UTC
The behavior noted will remain for CAM 1.2.x, i.e., if a migration fails and is restarted, the previous selections for PV migration are lost.
We do not intend to fix this in the 1.2 z-stream.
We have a known issue documented here: https://github.com/openshift/openshift-docs/pull/19021/files

We will keep a clone of this BZ open against a future release to consider modifying the behavior.
Future work is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1831616

