Bug 1831605

Summary: Migration fails due to restic timeout when migrating 50 projects.
Product: OpenShift Container Platform
Reporter: John Matthews <jmatthew>
Component: Migration Tooling
Assignee: Derek Whatley <dwhatley>
Status: CLOSED DUPLICATE
QA Contact: Xin jiang <xjiang>
Severity: medium
Priority: unspecified
Version: 4.4
CC: chezhang, dwhatley, dymurray, jmatthew, rpattath, sregidor, whu, xjiang
Target Milestone: ---
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Clone Of: 1813025
Last Closed: 2020-06-16 13:04:42 UTC

Description John Matthews 2020-05-05 11:20:04 UTC
+++ This bug was initially created as a clone of Bug #1813025 +++

Description of problem:
Migration fails due to restic timeout when migrating 50 projects.

Version-Release number of selected component (if applicable):
# oc describe pod/velero-658c4d8945-2mppg | grep Image
    Image:          quay.io/konveyor/migration-plugin:latest
    Image ID:       quay.io/konveyor/migration-plugin@sha256:fd6617aa9f86e4760cc076c25f973152ce9c85f83ac1de2cdaafdab860f69d5c
    Image:          quay.io/konveyor/velero-plugin-for-aws:latest
    Image ID:       quay.io/konveyor/velero-plugin-for-aws@sha256:b9867c14816ce3c6797c676988192df771fa54503596931b138aafad91af36a5
    Image:          quay.io/konveyor/velero-plugin-for-gcp:latest
    Image ID:       quay.io/konveyor/velero-plugin-for-gcp@sha256:a641d610403dbbd3a83f2bcb1f46d91fe9b79563c27bea744b3c60147be93cd5
    Image:          quay.io/konveyor/velero-plugin-for-microsoft-azure:latest
    Image ID:       quay.io/konveyor/velero-plugin-for-microsoft-azure@sha256:a57c97de744d967d591023e1847507689b66a312ccac709e6c5e5468d865c3d3
    Image:         quay.io/konveyor/velero:latest
    Image ID:      quay.io/konveyor/velero@sha256:e96d4ba17adfbe4032bd850f3c2b268d87d91422e4f03c97ed816148970b3e9e
    Image:         quay.io/konveyor/velero:latest
    Image ID:      quay.io/konveyor/velero@sha256:e96d4ba17adfbe4032bd850f3c2b268d87d91422e4f03c97ed816148970b3e9e
[root@rpattath ~]# oc describe pod/restic-7j88n | grep Image
    Image:         quay.io/konveyor/velero:latest
    Image ID:      quay.io/konveyor/velero@sha256:e96d4ba17adfbe4032bd850f3c2b268d87d91422e4f03c97ed816148970b3e9e
    Image:         quay.io/konveyor/velero:latest
    Image ID:      quay.io/konveyor/velero@sha256:e96d4ba17adfbe4032bd850f3c2b268d87d91422e4f03c97ed816148970b3e9e

# oc get clusterversions.config.openshift.io 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-03-09-033343   True        False         3d3h    Cluster version is 4.4.0-0.nightly-2020-03-09-033343


How reproducible:
always

Steps to Reproduce:
1. Do the following to load the source cluster with the workload:
# cat svt/openshift_scalability/mig-test-project-scale.yaml 
projects:
  - num: 50
    basename: migtest-
    templates:
      -
        num: 5
        file: ./content/build-template.json
      -
        num: 1
        file: ./content/quickstarts/django/django-postgresql-pv.json
      -
        num: 1
        file: ./content/deployment-config-2rep-template.json
        parameters:
          -
            ENV_VALUE: "asodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij12"
      -
        num: 20
        file: ./content/ssh-secret-template.json
      -
        num: 2
        file: ./content/configmap-template.json
      # rcs and services are implemented in deployments.
quotas:
  - name: default

Run the above YAML using the cluster-loader tooling from https://github.com/openshift/svt/tree/master/openshift_scalability
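For reference, step 1 can be driven directly from a checkout of the svt repo. This is a hedged sketch, assuming the standard cluster-loader invocation described in that repo's README (the exact entry point and flags may differ by branch):

```shell
# Fetch the scalability tooling and run the workload config from above.
# cluster-loader.py and the -f flag are assumed from the svt repo docs.
git clone https://github.com/openshift/svt
cd svt/openshift_scalability
python cluster-loader.py -f mig-test-project-scale.yaml
```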

2. This is the migration controller I am using:
# oc get migrationcontroller migration-controller -o yaml
apiVersion: migration.openshift.io/v1alpha1
kind: MigrationController
metadata:
  creationTimestamp: "2020-03-12T15:06:13Z"
  generation: 2
  name: migration-controller
  namespace: openshift-migration
  resourceVersion: "1689757"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migrationcontrollers/migration-controller
  uid: 450de38c-0b04-4359-92d5-bb255f1ac69e
spec:
  azure_resource_group: ""
  cluster_name: host
  mig_controller_image: quay.io/jortel/mig-controller
  mig_controller_version: ocp4.4-compat
  mig_namespace_limit: "60"
  mig_pod_limit: "500"
  mig_pv_limit: "500"
  migration_controller: true
  migration_ui: true
  migration_velero: true
  olm_managed: true
  restic_timeout: 10h
  version: 1.0 (OLM)
status:
  phase: Reconciled
3. Migrate the 50 projects.
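The `restic_timeout` shown in the CR above can also be adjusted without editing the full YAML. A minimal sketch using a standard merge patch against the CR from the output above (value shown is illustrative):

```shell
# Raise the restic timeout on the MigrationController CR (merge patch)
oc patch migrationcontroller migration-controller \
  -n openshift-migration --type=merge \
  -p '{"spec":{"restic_timeout":"10h"}}'
```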

Actual results:
Migration fails during backup restore.

Expected results:
Migration should be successful.

Additional info:
Attaching the complete velero debug log from the destination cluster.
I had previously filed https://bugzilla.redhat.com/show_bug.cgi?id=1749831; I am not sure whether the root cause of this issue is the same.

--- Additional comment from Dylan Murray on 2020-03-16 20:24:13 UTC ---

This likely has nothing to do with migrating 50 namespaces, but more specifically that the restic restores were never acted upon properly. It would be helpful to get the output of `oc get podvolumerestores -n openshift-migration -o yaml` as this will tell us what exactly happened to the restic restores being run.

The timeout generally means something went wrong with restic and Velero couldn't recover.

Please paste this output if you still have this environment available.
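On a 50-project migration that command can return a lot of YAML; counting restores per phase gives a quick overview. A minimal sketch (the sample document below is illustrative, not from this environment; on a live cluster the input would come from `oc get podvolumerestores -n openshift-migration -o yaml`):

```shell
# Illustrative sample of the list YAML; on a real cluster, generate it with:
#   oc get podvolumerestores -n openshift-migration -o yaml > /tmp/pvr.yaml
cat > /tmp/pvr.yaml <<'EOF'
items:
- metadata:
    name: migtest-1-pvr
  status:
    phase: Completed
- metadata:
    name: migtest-2-pvr
  status:
    phase: Failed
EOF

# Count PodVolumeRestores per phase to spot failed or stuck restores at a glance
grep -E '^[[:space:]]*phase:' /tmp/pvr.yaml | awk '{print $2}' | sort | uniq -c
```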

--- Additional comment from Roshni on 2020-03-25 17:41:30 UTC ---

Attaching the output of 

# oc get podvolumerestores -n openshift-migration -o yaml

Comment 1 John Matthews 2020-06-16 13:04:42 UTC

*** This bug has been marked as a duplicate of bug 1813025 ***