Bug 1784899

Summary: MigPlan data lost after failed migration
Product: OpenShift Container Platform
Component: Migration Tooling
Version: 4.2.0
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Reporter: Sergio <sregidor>
Assignee: Jeff Ortel <jortel>
QA Contact: Sergio <sregidor>
Docs Contact: Avital Pinnick <apinnick>
CC: apinnick, chezhang, jmatthew, jortel, rpattath, xjiang
Target Release: 4.4.z   
Hardware: Unspecified   
OS: Unspecified   
Clones: 1831616
Last Closed: 2020-07-27 16:03:30 UTC
Type: Bug
Bug Depends On: 1831616    

Description Sergio 2019-12-18 15:53:17 UTC
Description of problem:
When a migration fails, the data used to create the migration plan is erased and lost. The next time the migration is run (once the problem that caused the failure has been fixed), it is executed with the default data rather than the information the user provided when creating the migration plan.

Version-Release number of selected component (if applicable):
TARGET:
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-12-14-230621   True        False         22h     Error while reconciling 4.2.0-0.nightly-2019-12-14-230621: an unknown error has occurre

SOURCE:
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-12-14-230621   True        False         3h40m   Cluster version is 4.2.0-0.nightly-2019-12-14-23062

Controller version 1.0.1 in osbs registry:
    image: image-registry.openshift-image-registry.svc:5000/rhcam-1-0/openshift-migration-controller-rhel8@sha256:e64551a1dd77021ce9c6bf7c01cd184edd7f163c5c9489bb615845eadb867dc7

Migration UI version 1.0.1 in osbs registry:
    image: image-registry.openshift-image-registry.svc:5000/rhcam-1-0/openshift-migration-ui-rhel8@sha256:3b84f2053fb58d42771e1a8aece8037ed39f863c5349b89d313000ba7d905641


How reproducible:
Always

Steps to Reproduce:
1. Create a migration plan with PVCs, changing the default values in the migration plan creation screens (for instance, select "snapshot" as the copy method or select a different destination storage class; any change from the defaults will do).
2. Check the migration plan; the "persistentVolumes" section shows the values you selected (a one-liner for inspecting just that section is sketched after these steps):

$ oc get migplan -o yaml -n openshift-migration

3. Migrate the plan and force it to fail.
4. Check the migration plan after the failure: the "persistentVolumes" section has disappeared.
5. Run the migration plan again.
6. Check the migration plan again: the values in the "persistentVolumes" section used in this second migration are the default ones, not the ones you selected when creating the migration plan.
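
A quick way to inspect just that section between checks (a sketch; the plan name tobefailed-noui matches the example dumps in the additional info below):

$ oc get migplan tobefailed-noui -n openshift-migration \
    -o jsonpath='{.spec.persistentVolumes}'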

Actual results:
The migration executed after the first one fails uses the default data, not the data selected when the migration plan was created.


Expected results:
When a migration plan that has failed is run again, the second execution should use the same data that was used when the plan was created.


Additional info:

With the UI disabled (in order to rule out a UI problem):
$ oc get deployments
NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
migration-controller   1/1     1            1           3h7m
migration-operator     1/1     1            1           3h21m
migration-ui           0/0     0            0           3h7m
velero                 1/1     1            1           3h7m
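
For reference, one way to scale the UI down to zero replicas (an assumption about how it was disabled here):

$ oc scale deployment migration-ui --replicas=0 -n openshift-migration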

This is the original data of the migration plan before the failure:

apiVersion: migration.openshift.io/v1alpha1
kind: MigPlan
metadata:
  annotations:
    touch: 806e8b50-40eb-4ebe-9ef6-72fd855d2578
  creationTimestamp: "2019-12-18T13:41:10Z"
  generation: 4
  name: tobefailed-noui
  namespace: openshift-migration
  resourceVersion: "581523"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migplans/tobefailed-noui
  uid: 0ccb7eae-219c-11ea-8ff9-42010a000004
spec:
  destMigClusterRef:
    name: host
    namespace: openshift-migration
  migStorageRef:
    name: gcp
    namespace: openshift-migration
  namespaces:
  - tobefailed-noui
  persistentVolumes:
  - capacity: 1Gi
    name: pvc-df9a78ae-219b-11ea-9836-42010a000005
    pvc:
      accessModes:
      - ReadWriteOnce
      name: nginx-logs
      namespace: tobefailed-noui
    selection:
      action: copy
      copyMethod: snapshot
      storageClass: standard
    storageClass: standard
    supported:
      actions:
      - copy
      copyMethods:
      - filesystem
      - snapshot
  - capacity: 1Gi
    name: pvc-dfab0dc3-219b-11ea-9836-42010a000005
    pvc:
      accessModes:
      - ReadWriteOnce
      name: nginx-html
      namespace: tobefailed-noui
    selection:
      action: copy
      copyMethod: snapshot
      storageClass: standard
    storageClass: standard
    supported:
      actions:
      - copy
      copyMethods:
      - filesystem
      - snapshot
  srcMigClusterRef:
    name: gcp42
    namespace: openshift-migration
status:
  conditions:
  - category: Required
    lastTransitionTime: "2019-12-18T13:41:14Z"
    message: The `persistentVolumes` list has been updated with discovered PVs.
    reason: Done
    status: "True"
    type: PvsDiscovered
  - category: Required
    lastTransitionTime: "2019-12-18T13:41:15Z"
    message: The storage resources have been created.
    status: "True"
    type: StorageEnsured
  - category: Required
    lastTransitionTime: "2019-12-18T13:41:16Z"
    message: The migration registry resources have been created.
    status: "True"
    type: RegistriesEnsured
  - category: Required
    lastTransitionTime: "2019-12-18T13:41:16Z"
    message: The migration plan is ready.
    status: "True"
    type: Ready
  - category: Warn
    lastTransitionTime: "2019-12-18T13:41:39Z"
    message: CopyMethod for PV in `persistentVolumes` [pvc-df9a78ae-219b-11ea-9836-42010a000005,pvc-dfab0dc3-219b-11ea-9836-42010a000005]
      is set to `snapshot`. Make sure that the chosen storage class is compatible
      with the source volume's storage type for Snapshot support.
    status: "True"
    type: PvWarnCopyMethodSnapshot
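
The user's selections live under spec.persistentVolumes[].selection and can be listed compactly with JSONPath (a sketch):

$ oc get migplan tobefailed-noui -n openshift-migration \
    -o jsonpath='{range .spec.persistentVolumes[*]}{.name}{": "}{.selection.copyMethod}{"\n"}{end}'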



This is the state of the plan after the migration failure (the persistentVolumes section has been erased):

apiVersion: migration.openshift.io/v1alpha1
kind: MigPlan
metadata:
  annotations:
    touch: b2459512-dc4b-42ee-ad10-79b889d657d5
  creationTimestamp: "2019-12-18T13:41:10Z"
  generation: 6
  name: tobefailed-noui
  namespace: openshift-migration
  resourceVersion: "582937"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migplans/tobefailed-noui
  uid: 0ccb7eae-219c-11ea-8ff9-42010a000004
spec:
  destMigClusterRef:
    name: host
    namespace: openshift-migration
  migStorageRef:
    name: gcp
    namespace: openshift-migration
  namespaces:
  - tobefailed-noui
  srcMigClusterRef:
    name: gcp42
    namespace: openshift-migration
status:
  conditions:
  - category: Required
    lastTransitionTime: "2019-12-18T13:41:14Z"
    message: The `persistentVolumes` list has been updated with discovered PVs.
    reason: Done
    status: "True"
    type: PvsDiscovered
  - category: Required
    lastTransitionTime: "2019-12-18T13:41:15Z"
    message: The storage resources have been created.
    status: "True"
    type: StorageEnsured
  - category: Required
    lastTransitionTime: "2019-12-18T13:41:16Z"
    message: The migration registry resources have been created.
    status: "True"
    type: RegistriesEnsured
  - category: Required
    lastTransitionTime: "2019-12-18T13:41:16Z"
    message: The migration plan is ready.
    status: "True"
    type: Ready

Comment 1 Sergio 2019-12-19 13:28:28 UTC
This is the MigMigration that failed, so that we can track the phase where it failed (StageBackupFailed).


apiVersion: migration.openshift.io/v1alpha1
kind: MigMigration
metadata:
  annotations:
    touch: a2b7832b-c2c0-4fe6-a09e-0292c6cfae0c
  creationTimestamp: "2019-12-18T13:44:26Z"
  generation: 14
  name: trying-without-using-ui
  namespace: openshift-migration
  ownerReferences:
  - apiVersion: migration.openshift.io/v1alpha1
    kind: MigPlan
    name: tobefailed-noui
    uid: 0ccb7eae-219c-11ea-8ff9-42010a000004
  resourceVersion: "582790"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migmigrations/trying-without-using-ui
  uid: 81a80b72-219c-11ea-9c4c-42010a000006
spec:
  migPlanRef:
    name: tobefailed-noui
    namespace: openshift-migration
  quiescePods: true
  stage: false
status:
  conditions:
  - category: Advisory
    durable: true
    lastTransitionTime: "2019-12-18T13:45:24Z"
    message: 'The migration has failed.  See: Errors.'
    reason: StageBackupFailed
    status: "True"
    type: Failed
  errors:
  - 'Backup: openshift-migration/trying-without-using-ui-wwb9h partially failed.'
  phase: Completed
  startTimestamp: "2019-12-18T13:44:26Z"
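
The failing phase can be pulled straight from the resource with a JSONPath filter (a sketch):

$ oc get migmigration trying-without-using-ui -n openshift-migration \
    -o jsonpath='{.status.conditions[?(@.type=="Failed")].reason}'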

Comment 2 Jeff Ortel 2020-01-13 21:52:00 UTC
The failed migration has quiescePods: true. When the migration fails, the plan is un-suspended, which resumes PV discovery. Since the pods have been scaled down, the PVs are no longer found during PV discovery and are removed from the list.
This is working as designed.

Comment 3 Jeff Ortel 2020-01-13 23:18:19 UTC
To remedy this, the user will need to scale the pod back up. Once the application pod is up and running, the PV list will be repopulated by controller discovery. Unfortunately, the user's choices will be gone.

For now, let's document this as a known issue.
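
For reference, scaling the quiesced application back up would look something like this (a sketch; the deployment name is hypothetical):

$ oc scale deployment nginx --replicas=1 -n tobefailed-noui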

Comment 7 John Matthews 2020-07-27 16:03:30 UTC
The behavior noted will remain for CAM 1.2.x, i.e., if a migration fails and is restarted, the previous selections for PV migration are lost.
We do not intend to fix this in the 1.2 z-stream.
We have a known issue documented here: https://github.com/openshift/openshift-docs/pull/19021/files

We will keep a clone of this BZ open against a future release to consider modifying the behavior.
Future tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1831616