Bug 1752985 - Backup creation fails for a project with 1000Gi pvc
Summary: Backup creation fails for a project with 1000Gi pvc
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Migration Tooling
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Jason Montleon
QA Contact: Roshni
URL: https://github.com/fusor/mig-operator...
Whiteboard:
Depends On:
Blocks: 1757487
Reported: 2019-09-17 17:44 UTC by Roshni
Modified: 2020-01-08 11:03 UTC
CC List: 3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1757487 (view as bug list)
Environment:
Last Closed: 2019-11-21 18:38:14 UTC
Target Upstream Version:
Embargoed:


Attachments:
AWS S3 storage screenshot (40.40 KB, image/png), 2019-09-20 14:58 UTC, Roshni

Description Roshni 2019-09-17 17:44:41 UTC
Description of problem:
Backup creation fails for a project with 1000Gi pvc

Version-Release number of selected component (if applicable):
# oc describe pod/controller-manager-78d9589445-5xztn | grep Image
    Image:         quay.io/ocpmigrate/mig-controller:release-1.0
    Image ID:      quay.io/ocpmigrate/mig-controller@sha256:0f74db7171712ffc440b3d7b0f02a775ccd71238827ec856b7d090f90f2feffb
# oc describe pod/velero-58f7447985-d9hzf | grep Image
    Image:          quay.io/ocpmigrate/migration-plugin:release-1.0
    Image ID:       quay.io/ocpmigrate/migration-plugin@sha256:eb9b82c3f26bcd876bc501e18dde7cffe7e451c8c8a231959ed4d9f1127b91a6
    Image:         quay.io/ocpmigrate/velero:fusor-1.1
    Image ID:      quay.io/ocpmigrate/velero@sha256:6c16a1288bf6aca74afbb0184fa987506839c5193ae8bb2be05cb6aa0a9f3dc5
# oc describe pod/restic-9hst9 | grep Image
    Image:         quay.io/ocpmigrate/velero:fusor-1.1
    Image ID:      quay.io/ocpmigrate/velero@sha256:6c16a1288bf6aca74afbb0184fa987506839c5193ae8bb2be05cb6aa0a9f3dc5
# oc describe pod/migration-operator-5cb94b46fb-vgs5k | grep Image
    Image:         quay.io/ocpmigrate/mig-operator:release-1.0
    Image ID:      quay.io/ocpmigrate/mig-operator@sha256:c5e3a0c4ca4ec954f0c6552b367bc7b3baafa5acea833496147d0b6611bef241
    Image:          quay.io/ocpmigrate/mig-operator:release-1.0
    Image ID:       quay.io/ocpmigrate/mig-operator@sha256:c5e3a0c4ca4ec954f0c6552b367bc7b3baafa5acea833496147d0b6611bef241
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-09-17-001320   True        False         176m    Cluster version is 4.2.0-0.nightly-2019-09-17-001320

How reproducible:
always

Steps to Reproduce:
1. On a 3.11 cluster create a project with the following CRs
# oc get all -n big-pvc 
NAME                                   READY     STATUS      RESTARTS   AGE
pod/postgresql-1-65lbd                 1/1       Running     0          1d
pod/rails-postgresql-example-1-build   0/1       Completed   0          1d
pod/rails-postgresql-example-1-swmkz   1/1       Running     0          1d

NAME                                               DESIRED   CURRENT   READY     AGE
replicationcontroller/postgresql-1                 1         1         1         1d
replicationcontroller/rails-postgresql-example-1   1         1         1         1d

NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/postgresql                 ClusterIP   172.27.110.229   <none>        5432/TCP   1d
service/rails-postgresql-example   ClusterIP   172.26.253.220   <none>        8080/TCP   1d

NAME                                                          REVISION   DESIRED   CURRENT   TRIGGERED BY
deploymentconfig.apps.openshift.io/postgresql                 1          1         1         config,image(postgresql:9.5)
deploymentconfig.apps.openshift.io/rails-postgresql-example   1          1         1         config,image(rails-postgresql-example:latest)

NAME                                                      TYPE      FROM      LATEST
buildconfig.build.openshift.io/rails-postgresql-example   Source    Git       1

NAME                                                  TYPE      FROM          STATUS     STARTED        DURATION
build.build.openshift.io/rails-postgresql-example-1   Source    Git@67d882b   Complete   25 hours ago   1m40s

NAME                                                      DOCKER REPO                                                         TAGS      UPDATED
imagestream.image.openshift.io/rails-postgresql-example   docker-registry.default.svc:5000/big-pvc/rails-postgresql-example   latest    25 hours ago

NAME                                                HOST/PORT                                                       PATH      SERVICES                   PORT      TERMINATION   WILDCARD
route.route.openshift.io/rails-postgresql-example   rails-postgresql-example-big-pvc.apps.0906-5ce.qe.rhcloud.com             rails-postgresql-example   <all>                   None
root@ip-172-31-43-162: /tmp/AWS/3.11 GLUSTERFS # oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM                STORAGECLASS   REASON    AGE
pvc-b1e6180f-d8a1-11e9-b34f-029519c5614c   1000Gi     RWO            Delete           Bound     big-pvc/postgresql   gp2                      1d

2. Configure migration CRs with the namespace to be migrated and the PV information (a rough sketch of these CRs follows the steps below).
3. Start migration
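
To illustrate steps 2 and 3, the migration CRs look something like the minimal sketch below. The cluster, storage, and plan names are assumptions, and the field names follow upstream mig-controller examples rather than this exact build.

# cat <<EOF | oc create -f -
apiVersion: migration.openshift.io/v1alpha1
kind: MigPlan
metadata:
  name: big-pvc-plan                # assumed name
  namespace: openshift-migration
spec:
  srcMigClusterRef:
    name: source-cluster            # assumed name
    namespace: openshift-migration
  destMigClusterRef:
    name: host                      # assumed name
    namespace: openshift-migration
  migStorageRef:
    name: migstorage                # assumed name
    namespace: openshift-migration
  namespaces:
  - big-pvc
---
apiVersion: migration.openshift.io/v1alpha1
kind: MigMigration
metadata:
  name: migmigration-sample
  namespace: openshift-migration
spec:
  migPlanRef:
    name: big-pvc-plan
    namespace: openshift-migration
  stage: false
  quiescePods: true
EOF

Creating the MigMigration is what starts the migration (step 3).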

Actual results:
Backup creation fails

# oc logs velero-c9b9cd88f-rgj7w | grep restic | grep error
time="2019-09-17T16:15:29Z" level=error msg="Error backing up item" backup=openshift-migration/migmigration-sample-pbx7w error="timed out waiting for all PodVolumeBackups to complete" error.file="/go/src/github.com/heptio/velero/pkg/restic/backupper.go:165" error.function="github.com/heptio/velero/pkg/restic.(*backupper).BackupPodVolumes" group=v1 logSource="pkg/backup/resource_backupper.go:264" name=postgresql-1-65lbd-stage namespace=big-pvc resource=pods

Expected results:
Backup and migration should be successful.

Additional info:

Comment 1 John Matthews 2019-09-17 17:58:14 UTC
Can we get any more logs to learn what went wrong?
Any errors from Restic?

How did you populate the data in the PV?
Was this just a 1000GB PV/PVC that was mostly empty, or did you have data in there up to 1000GB?

What did you use for the object storage?
Did you have sufficient room in object storage?

Comment 2 Roshni 2019-09-20 14:57:19 UTC
(In reply to John Matthews from comment #1)
> Can we get any more logs to learn what went wrong?
> Any errors from Restic?
No errors in restic
> 
> How did you populate the data in the PV?
https://gist.github.com/mffiedler/21e751f99945646998a3e42092af4da8

> Was this just a 1000GB PV/PVC that was mostly empty, or did you have data in
> there up to 1000GB?
I think I answered this above. I tried migrating with only 25 files (instead of 110) and migration was successful. I could see the pv migrated to the destination.
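
For context, the linked gist populates the PV with a set of large files. A hypothetical sketch of that kind of loop, assuming dd-generated ~8.8Gi files written into the postgresql pod's data directory (the actual file count, size, and path come from the gist, not from these commands):

# oc rsh -n big-pvc postgresql-1-65lbd
sh-4.2$ for i in $(seq 1 110); do dd if=/dev/zero of=/var/lib/pgsql/data/bigfile-$i bs=1M count=9000; done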

> 
> What did you use for the object storage?
AWS S3 bucket
> Did you have sufficient room in object storage?
Since it is an Amazon S3 bucket, I believe there is no restriction on how much we can store. I am attaching a screenshot of the storage from when I tried migrating 25 8.8Gi files.
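
If it helps, total bucket usage can also be summarized from the CLI; the bucket name below is a placeholder:

# aws s3 ls s3://<bucket-name> --recursive --summarize | tail -2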

Comment 3 Roshni 2019-09-20 14:58:05 UTC
Created attachment 1617236 [details]
AWS S3 storage screenshot

Comment 4 Roshni 2019-09-27 00:37:20 UTC
Migration was successful for 400Gi and below. When I tested with 600Gi, the failure happened. I am creating the workload following these steps: https://gist.github.com/mffiedler/21e751f99945646998a3e42092af4da8

Comment 5 Jason Montleon 2019-10-07 15:50:45 UTC
Seems like this may be a velero issue. Looks similar to: https://github.com/vmware-tanzu/velero/issues/1868

Comment 6 Jason Montleon 2019-10-07 16:21:45 UTC
velero server has a timeout setting for restic. It's an hour by default. I propose we make this customizable from the MigrationController CR for the operator. I'll work on a PR.

/velero server --help | grep restic-timeout
      --restic-timeout duration                             how long backups/restores of pod volumes should be allowed to run before timing out (default 1h0m0s)
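
To confirm what a given install is actually running with, the current velero Deployment args can be checked directly (deployment name and namespace assumed from the default install):

# oc -n openshift-migration get deployment velero -o jsonpath='{.spec.template.spec.containers[0].args}'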

Comment 7 Jason Montleon 2019-10-07 17:13:55 UTC
https://github.com/fusor/mig-operator/pull/110

Comment 9 Jason Montleon 2019-10-07 17:55:18 UTC
restic_timeout: 1h can be modified in the MigrationController CR to allow for larger backups. Change it to 2h, 3h, ..., 24h, ..., 48h, etc. as necessary.
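
As a usage sketch, assuming the MigrationController instance is named migration-controller (the name used in the mig-operator examples) and lives in openshift-migration, the timeout can be raised with something like:

# oc -n openshift-migration patch migrationcontroller migration-controller --type merge -p '{"spec":{"restic_timeout":"3h"}}'

or by editing restic_timeout in the CR yaml and re-applying it; the operator then passes the value through to the velero server --restic-timeout flag.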

Comment 10 Roshni 2019-10-09 11:16:41 UTC
The issue in the bug description can no longer be reproduced using the following builds. Migration was successful when restic_timeout: 3h was set in the yaml for the MigrationController CR.

# oc describe pod/migration-operator-66495ccf7c-kckt9 -n openshift-migration | grep Image
                    containerImage: image-registry.openshift-image-registry.svc:5000/rhcam/openshift-migration-operator:v1.0
    Image:         image-registry.openshift-image-registry.svc:5000/rhcam/openshift-migration-operator:v1.0
    Image ID:      image-registry.openshift-image-registry.svc:5000/rhcam/openshift-migration-operator@sha256:db6350c9386343fef1c27b88b3fefc31fc692e97049469564bdc21dbf465454b
    Image:          image-registry.openshift-image-registry.svc:5000/rhcam/openshift-migration-operator:v1.0
    Image ID:       image-registry.openshift-image-registry.svc:5000/rhcam/openshift-migration-operator@sha256:db6350c9386343fef1c27b88b3fefc31fc692e97049469564bdc21dbf465454b
[root@rpattath ~]# oc get pods
NAME                               READY   STATUS      RESTARTS   AGE
mongodb-1-deploy                   0/1     Completed   0          7h44m
mongodb-1-hzn7p                    1/1     Running     0          7h44m
nodejs-mongo-persistent-1-build    0/1     Completed   0          7h44m
nodejs-mongo-persistent-1-deploy   0/1     Completed   0          7h44m
nodejs-mongo-persistent-2-deploy   0/1     Completed   0          7h43m
nodejs-mongo-persistent-2-r2lq5    1/1     Running     0          7h42m
[root@rpattath ~]# oc get pods -n openshift-migration
NAME                                  READY   STATUS    RESTARTS   AGE
controller-manager-56d558c5f-l498l    1/1     Running   0          17h
migration-operator-66495ccf7c-kckt9   2/2     Running   0          17h
migration-ui-6f7df75875-rkg2k         1/1     Running   0          17h
restic-c8x6d                          1/1     Running   0          17h
restic-cz7rf                          1/1     Running   0          17h
restic-n2l5r                          1/1     Running   0          17h
velero-bdbd6cc56-gfq8p                1/1     Running   0          17h
[root@rpattath ~]# oc describe pod/controller-manager-56d558c5f-l498l -n openshift-migration | grep Image
    Image:         image-registry.openshift-image-registry.svc:5000/rhcam/openshift-migration-controller:v1.0
    Image ID:      image-registry.openshift-image-registry.svc:5000/rhcam/openshift-migration-controller@sha256:46c622e0fbe64165b09930738a7d111c875976b54c8236ddce328cb5470d60ab
[root@rpattath ~]# oc describe pod/migration-ui-6f7df75875-rkg2k -n openshift-migration | grep Image
    Image:          image-registry.openshift-image-registry.svc:5000/rhcam/openshift-migration-ui:v1.0
    Image ID:       image-registry.openshift-image-registry.svc:5000/rhcam/openshift-migration-ui@sha256:59e60d7036ebdc5b7d29895104e7b459a53a1c004e876f50b3e79cdc2b78941c

