Description of problem:
Backup creation fails for a project with a 1000Gi PVC.

Version-Release number of selected component (if applicable):

# oc describe pod/controller-manager-78d9589445-5xztn | grep Image
    Image:          quay.io/ocpmigrate/mig-controller:release-1.0
    Image ID:       quay.io/ocpmigrate/mig-controller@sha256:0f74db7171712ffc440b3d7b0f02a775ccd71238827ec856b7d090f90f2feffb

# oc describe pod/velero-58f7447985-d9hzf | grep Image
    Image:          quay.io/ocpmigrate/migration-plugin:release-1.0
    Image ID:       quay.io/ocpmigrate/migration-plugin@sha256:eb9b82c3f26bcd876bc501e18dde7cffe7e451c8c8a231959ed4d9f1127b91a6
    Image:          quay.io/ocpmigrate/velero:fusor-1.1
    Image ID:       quay.io/ocpmigrate/velero@sha256:6c16a1288bf6aca74afbb0184fa987506839c5193ae8bb2be05cb6aa0a9f3dc5

# oc describe pod/restic-9hst9 | grep Image
    Image:          quay.io/ocpmigrate/velero:fusor-1.1
    Image ID:       quay.io/ocpmigrate/velero@sha256:6c16a1288bf6aca74afbb0184fa987506839c5193ae8bb2be05cb6aa0a9f3dc5

# oc describe pod/migration-operator-5cb94b46fb-vgs5k | grep Image
    Image:          quay.io/ocpmigrate/mig-operator:release-1.0
    Image ID:       quay.io/ocpmigrate/mig-operator@sha256:c5e3a0c4ca4ec954f0c6552b367bc7b3baafa5acea833496147d0b6611bef241
    Image:          quay.io/ocpmigrate/mig-operator:release-1.0
    Image ID:       quay.io/ocpmigrate/mig-operator@sha256:c5e3a0c4ca4ec954f0c6552b367bc7b3baafa5acea833496147d0b6611bef241

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-09-17-001320   True        False         176m    Cluster version is 4.2.0-0.nightly-2019-09-17-001320

How reproducible:
always

Steps to Reproduce:
1. On a 3.11 cluster, create a project with the following resources:

# oc get all -n big-pvc
NAME                                   READY     STATUS      RESTARTS   AGE
pod/postgresql-1-65lbd                 1/1       Running     0          1d
pod/rails-postgresql-example-1-build   0/1       Completed   0          1d
pod/rails-postgresql-example-1-swmkz   1/1       Running     0          1d

NAME                                               DESIRED   CURRENT   READY     AGE
replicationcontroller/postgresql-1                 1         1         1         1d
replicationcontroller/rails-postgresql-example-1   1         1         1         1d

NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/postgresql                 ClusterIP   172.27.110.229   <none>        5432/TCP   1d
service/rails-postgresql-example   ClusterIP   172.26.253.220   <none>        8080/TCP   1d

NAME                                                          REVISION   DESIRED   CURRENT   TRIGGERED BY
deploymentconfig.apps.openshift.io/postgresql                 1          1         1         config,image(postgresql:9.5)
deploymentconfig.apps.openshift.io/rails-postgresql-example   1          1         1         config,image(rails-postgresql-example:latest)

NAME                                                      TYPE      FROM      LATEST
buildconfig.build.openshift.io/rails-postgresql-example   Source    Git       1

NAME                                                  TYPE      FROM          STATUS     STARTED        DURATION
build.build.openshift.io/rails-postgresql-example-1   Source    Git@67d882b   Complete   25 hours ago   1m40s

NAME                                                      DOCKER REPO                                                          TAGS     UPDATED
imagestream.image.openshift.io/rails-postgresql-example   docker-registry.default.svc:5000/big-pvc/rails-postgresql-example   latest   25 hours ago

NAME                                                HOST/PORT                                                       PATH   SERVICES                   PORT    TERMINATION   WILDCARD
route.route.openshift.io/rails-postgresql-example   rails-postgresql-example-big-pvc.apps.0906-5ce.qe.rhcloud.com          rails-postgresql-example   <all>   None

# oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM                STORAGECLASS   REASON    AGE
pvc-b1e6180f-d8a1-11e9-b34f-029519c5614c   1000Gi     RWO            Delete           Bound     big-pvc/postgresql   gp2                      1d

2. Configure the migration CRs with the namespace to be migrated and the PV information (a sketch of such a CR follows these steps).
3. Start the migration.
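A minimal sketch of the migration plan CR referenced in step 2, assuming the mig-controller v1alpha1 API; all reference names here are placeholders, not taken from this environment:

apiVersion: migration.openshift.io/v1alpha1
kind: MigPlan
metadata:
  name: migplan-sample
  namespace: openshift-migration
spec:
  srcMigClusterRef:
    name: src-cluster          # placeholder: MigCluster for the 3.11 source
    namespace: openshift-migration
  destMigClusterRef:
    name: host-cluster         # placeholder: MigCluster for the 4.2 destination
    namespace: openshift-migration
  migStorageRef:
    name: migstorage-sample    # placeholder: MigStorage backed by the S3 bucket
    namespace: openshift-migration
  namespaces:
    - big-pvc                  # the namespace holding the 1000Gi PVC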
Actual results:
Backup creation fails.

# oc logs velero-c9b9cd88f-rgj7w | grep restic | grep error
time="2019-09-17T16:15:29Z" level=error msg="Error backing up item" backup=openshift-migration/migmigration-sample-pbx7w error="timed out waiting for all PodVolumeBackups to complete" error.file="/go/src/github.com/heptio/velero/pkg/restic/backupper.go:165" error.function="github.com/heptio/velero/pkg/restic.(*backupper).BackupPodVolumes" group=v1 logSource="pkg/backup/resource_backupper.go:264" name=postgresql-1-65lbd-stage namespace=big-pvc resource=pods

Expected results:
Backup and migration should be successful.

Additional info:
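A sketch of how the restic side can be checked as well, assuming the default openshift-migration namespace (the pod name is from the version output above):

# oc -n openshift-migration get pods | grep restic
# oc -n openshift-migration logs restic-9hst9 | grep -i error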
Can we get any more logs to learn what went wrong?
Any errors from Restic?

How did you populate the data in the PV?
Was this just a 1000GB PV/PVC that was mostly empty, or did you have data in there up to 1000GB?

What did you use for the object storage?
Did you have sufficient room in object storage?
(In reply to John Matthews from comment #1)

> Can we get any more logs to learn what went wrong?
> Any errors from Restic?

No errors in restic.

> How did you populate the data in the PV?

https://gist.github.com/mffiedler/21e751f99945646998a3e42092af4da8

> Was this just a 1000GB PV/PVC that was mostly empty, or did you have data in
> there up to 1000GB?

I think I answered this above. I tried migrating with only 25 files (instead of 110) and the migration was successful. I could see the PV migrated to the destination.

> What did you use for the object storage?

AWS S3 bucket.

> Did you have sufficient room in object storage?

Since it is an Amazon S3 bucket, I believe there is no restriction on how much we can store. I am attaching a screenshot of the storage from when I tried migrating 25 8.8Gi files.
Created attachment 1617236 [details]
AWS S3 storage screenshot
Migration was successful for 400Gi and below; when I tested with 600Gi, the failure happened. I am creating the workload following these steps: https://gist.github.com/mffiedler/21e751f99945646998a3e42092af4da8 (a rough sketch of that kind of loop is below).
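For illustration only, a hypothetical sketch of populating the PVC with large files; the authoritative steps are in the gist above, and the mount path is assumed from the standard postgresql template:

# Write N files of roughly 8.8Gi each into the PV mounted in the
# postgresql pod (110 files roughly fills the 1000Gi volume; 25 is ~220Gi).
for i in $(seq 1 110); do
  oc exec -n big-pvc postgresql-1-65lbd -- \
    dd if=/dev/urandom of=/var/lib/pgsql/data/bigfile-$i bs=1M count=9011
done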
Seems like this may be a velero issue. Looks similar to: https://github.com/vmware-tanzu/velero/issues/1868
velero server has a timeout setting for restic. It's an hour by default. I propose we make this customizable from the MigrationController CR for the operator. I'll work on a PR.

/velero server --help | grep restic-timeout
      --restic-timeout duration    how long backups/restores of pod volumes should be allowed to run before timing out (default 1h0m0s)
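Until that lands, a possible manual workaround (a sketch, untested on these builds; note the operator may revert direct edits on its next reconcile) is to append the flag to the velero deployment's server args:

# oc -n openshift-migration patch deployment velero --type=json \
    -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--restic-timeout=3h"}]'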
https://github.com/fusor/mig-operator/pull/110
restic_timeout: 1h can be modified in the MigrationController CR to allow for larger backups. Change it to 2h, 3h, ..., 24h, ..., 48h, etc. as necessary (see the sketch below).
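A minimal sketch of where the setting lives, assuming the usual mig-operator MigrationController CR metadata (the name and namespace may differ per install):

apiVersion: migration.openshift.io/v1alpha1
kind: MigrationController
metadata:
  name: migration-controller
  namespace: openshift-migration
spec:
  # Raise from the 1h default so restic has time to back up the 1000Gi PV
  restic_timeout: 3h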
Issue in the bug description cannot be reproduced using the following builds. Migration was successful when restic_timeout: 3h was set in the yaml for the controller CR.

# oc describe pod/migration-operator-66495ccf7c-kckt9 -n openshift-migration | grep Image
      containerImage: image-registry.openshift-image-registry.svc:5000/rhcam/openshift-migration-operator:v1.0
    Image:          image-registry.openshift-image-registry.svc:5000/rhcam/openshift-migration-operator:v1.0
    Image ID:       image-registry.openshift-image-registry.svc:5000/rhcam/openshift-migration-operator@sha256:db6350c9386343fef1c27b88b3fefc31fc692e97049469564bdc21dbf465454b
    Image:          image-registry.openshift-image-registry.svc:5000/rhcam/openshift-migration-operator:v1.0
    Image ID:       image-registry.openshift-image-registry.svc:5000/rhcam/openshift-migration-operator@sha256:db6350c9386343fef1c27b88b3fefc31fc692e97049469564bdc21dbf465454b

[root@rpattath ~]# oc get pods
NAME                               READY   STATUS      RESTARTS   AGE
mongodb-1-deploy                   0/1     Completed   0          7h44m
mongodb-1-hzn7p                    1/1     Running     0          7h44m
nodejs-mongo-persistent-1-build    0/1     Completed   0          7h44m
nodejs-mongo-persistent-1-deploy   0/1     Completed   0          7h44m
nodejs-mongo-persistent-2-deploy   0/1     Completed   0          7h43m
nodejs-mongo-persistent-2-r2lq5    1/1     Running     0          7h42m

[root@rpattath ~]# oc get pods -n openshift-migration
NAME                                  READY   STATUS    RESTARTS   AGE
controller-manager-56d558c5f-l498l    1/1     Running   0          17h
migration-operator-66495ccf7c-kckt9   2/2     Running   0          17h
migration-ui-6f7df75875-rkg2k         1/1     Running   0          17h
restic-c8x6d                          1/1     Running   0          17h
restic-cz7rf                          1/1     Running   0          17h
restic-n2l5r                          1/1     Running   0          17h
velero-bdbd6cc56-gfq8p                1/1     Running   0          17h

[root@rpattath ~]# oc describe pod/controller-manager-56d558c5f-l498l -n openshift-migration | grep Image
    Image:          image-registry.openshift-image-registry.svc:5000/rhcam/openshift-migration-controller:v1.0
    Image ID:       image-registry.openshift-image-registry.svc:5000/rhcam/openshift-migration-controller@sha256:46c622e0fbe64165b09930738a7d111c875976b54c8236ddce328cb5470d60ab

[root@rpattath ~]# oc describe pod/migration-ui-6f7df75875-rkg2k -n openshift-migration | grep Image
    Image:          image-registry.openshift-image-registry.svc:5000/rhcam/openshift-migration-ui:v1.0
    Image ID:       image-registry.openshift-image-registry.svc:5000/rhcam/openshift-migration-ui@sha256:59e60d7036ebdc5b7d29895104e7b459a53a1c004e876f50b3e79cdc2b78941c
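For completeness, one way to confirm the raised timeout actually reached the velero deployment (a sketch; the flag name comes from the velero --help output above):

# oc -n openshift-migration get deployment velero -o jsonpath='{.spec.template.spec.containers[0].args}'

The printed args list should include --restic-timeout=3h.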