Bug 1757487 - [Docs] Backup creation fails for a project with 1000Gi pvc
Summary: [Docs] Backup creation fails for a project with 1000Gi pvc
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Migration Tooling
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Avital Pinnick
QA Contact: Roshni
URL:
Whiteboard:
Depends On: 1752985
Blocks:
 
Reported: 2019-10-01 16:42 UTC by John Matthews
Modified: 2019-10-16 06:42 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1752985
Environment:
Last Closed: 2019-10-16 06:41:54 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Red Hat Product Errata RHBA-2019:2922 - Last Updated: 2019-10-16 06:42:08 UTC

Description John Matthews 2019-10-01 16:42:22 UTC
We'd like to ensure this BZ is captured in the Release Notes/Known Issues (it won't be fixed in the 4.2.0 release).


+++ This bug was initially created as a clone of Bug #1752985 +++

Description of problem:
Backup creation fails for a project with 1000Gi pvc

Version-Release number of selected component (if applicable):
# oc describe pod/controller-manager-78d9589445-5xztn | grep Image
    Image:         quay.io/ocpmigrate/mig-controller:release-1.0
    Image ID:      quay.io/ocpmigrate/mig-controller@sha256:0f74db7171712ffc440b3d7b0f02a775ccd71238827ec856b7d090f90f2feffb
# oc describe pod/velero-58f7447985-d9hzf | grep Image
    Image:          quay.io/ocpmigrate/migration-plugin:release-1.0
    Image ID:       quay.io/ocpmigrate/migration-plugin@sha256:eb9b82c3f26bcd876bc501e18dde7cffe7e451c8c8a231959ed4d9f1127b91a6
    Image:         quay.io/ocpmigrate/velero:fusor-1.1
    Image ID:      quay.io/ocpmigrate/velero@sha256:6c16a1288bf6aca74afbb0184fa987506839c5193ae8bb2be05cb6aa0a9f3dc5
# oc describe pod/restic-9hst9 | grep Image
    Image:         quay.io/ocpmigrate/velero:fusor-1.1
    Image ID:      quay.io/ocpmigrate/velero@sha256:6c16a1288bf6aca74afbb0184fa987506839c5193ae8bb2be05cb6aa0a9f3dc5
# oc describe pod/migration-operator-5cb94b46fb-vgs5k | grep Image
    Image:         quay.io/ocpmigrate/mig-operator:release-1.0
    Image ID:      quay.io/ocpmigrate/mig-operator@sha256:c5e3a0c4ca4ec954f0c6552b367bc7b3baafa5acea833496147d0b6611bef241
    Image:          quay.io/ocpmigrate/mig-operator:release-1.0
    Image ID:       quay.io/ocpmigrate/mig-operator@sha256:c5e3a0c4ca4ec954f0c6552b367bc7b3baafa5acea833496147d0b6611bef241
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-09-17-001320   True        False         176m    Cluster version is 4.2.0-0.nightly-2019-09-17-001320

How reproducible:
always

Steps to Reproduce:
1. On a 3.11 cluster, create a project with the following resources:
# oc get all -n big-pvc 
NAME                                   READY     STATUS      RESTARTS   AGE
pod/postgresql-1-65lbd                 1/1       Running     0          1d
pod/rails-postgresql-example-1-build   0/1       Completed   0          1d
pod/rails-postgresql-example-1-swmkz   1/1       Running     0          1d

NAME                                               DESIRED   CURRENT   READY     AGE
replicationcontroller/postgresql-1                 1         1         1         1d
replicationcontroller/rails-postgresql-example-1   1         1         1         1d

NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/postgresql                 ClusterIP   172.27.110.229   <none>        5432/TCP   1d
service/rails-postgresql-example   ClusterIP   172.26.253.220   <none>        8080/TCP   1d

NAME                                                          REVISION   DESIRED   CURRENT   TRIGGERED BY
deploymentconfig.apps.openshift.io/postgresql                 1          1         1         config,image(postgresql:9.5)
deploymentconfig.apps.openshift.io/rails-postgresql-example   1          1         1         config,image(rails-postgresql-example:latest)

NAME                                                      TYPE      FROM      LATEST
buildconfig.build.openshift.io/rails-postgresql-example   Source    Git       1

NAME                                                  TYPE      FROM          STATUS     STARTED        DURATION
build.build.openshift.io/rails-postgresql-example-1   Source    Git@67d882b   Complete   25 hours ago   1m40s

NAME                                                      DOCKER REPO                                                         TAGS      UPDATED
imagestream.image.openshift.io/rails-postgresql-example   docker-registry.default.svc:5000/big-pvc/rails-postgresql-example   latest    25 hours ago

NAME                                                HOST/PORT                                                       PATH      SERVICES                   PORT      TERMINATION   WILDCARD
route.route.openshift.io/rails-postgresql-example   rails-postgresql-example-big-pvc.apps.0906-5ce.qe.rhcloud.com             rails-postgresql-example   <all>                   None
root@ip-172-31-43-162: /tmp/AWS/3.11 GLUSTERFS # oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM                STORAGECLASS   REASON    AGE
pvc-b1e6180f-d8a1-11e9-b34f-029519c5614c   1000Gi     RWO            Delete           Bound     big-pvc/postgresql   gp2                      1d

2. Configure migration CRs with the namespace to be migrated and the pv information.
3. Start migration
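
For reference, a minimal sketch of the migration CRs used in steps 2 and 3 is shown below. Only the openshift-migration namespace and the big-pvc source namespace come from this report; the CR names and the cluster/storage references are illustrative, and the field names follow the upstream mig-controller v1alpha1 API, so they should be verified against the release-1.0 images listed above.

# cat << 'EOF' | oc apply -f -
apiVersion: migration.openshift.io/v1alpha1
kind: MigPlan
metadata:
  name: migplan-sample              # illustrative name
  namespace: openshift-migration
spec:
  srcMigClusterRef:                 # the registered 3.11 source cluster (name is illustrative)
    name: source-cluster
    namespace: openshift-migration
  destMigClusterRef:                # the 4.2 destination cluster (name is illustrative)
    name: host
    namespace: openshift-migration
  migStorageRef:                    # replication repository, e.g. the S3 bucket (name is illustrative)
    name: migstorage-sample
    namespace: openshift-migration
  namespaces:
    - big-pvc
---
apiVersion: migration.openshift.io/v1alpha1
kind: MigMigration                  # creating this CR is what "Start migration" does
metadata:
  name: migmigration-sample         # matches the backup name prefix in the velero log below
  namespace: openshift-migration
spec:
  migPlanRef:
    name: migplan-sample
    namespace: openshift-migration
  stage: false                      # final migration rather than a stage run
  quiescePods: true
EOF

Once the plan is reconciled, the controller populates its persistentVolumes section with the discovered claims (here the 1000Gi big-pvc/postgresql PVC), which is where the copy method for each claim is chosen.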

Actual results:
Backup creation fails

# oc logs velero-c9b9cd88f-rgj7w | grep restic | grep error
time="2019-09-17T16:15:29Z" level=error msg="Error backing up item" backup=openshift-migration/migmigration-sample-pbx7w error="timed out waiting for all PodVolumeBackups to complete" error.file="/go/src/github.com/heptio/velero/pkg/restic/backupper.go:165" error.function="github.com/heptio/velero/pkg/restic.(*backupper).BackupPodVolumes" group=v1 logSource="pkg/backup/resource_backupper.go:264" name=postgresql-1-65lbd-stage namespace=big-pvc resource=pods

Expected results:
Backup and migration should be successful.

Additional info:

--- Additional comment from John Matthews on 2019-09-17 17:58:14 UTC ---

Can we get any more logs to learn what went wrong?
Any errors from Restic?

How did you populate the data in the PV?
Was this just a 1000GB PV/PVC that was mostly empty, or did you have data in there up to 1000GB?

What did you use for the object storage?
Did you have sufficient room in object storage?

--- Additional comment from Roshni on 2019-09-20 14:57:19 UTC ---

(In reply to John Matthews from comment #1)
> Can we get any more logs to learn what went wrong?
> Any errors from Restic?
No errors in restic
> 
> How did you populate the data in the PV?
https://gist.github.com/mffiedler/21e751f99945646998a3e42092af4da8

> Was this just a 1000GB PV/PVC that was mostly empty, or did you have data in
> there up to 1000GB?
I think I answered this above. I tried migrating with only 25 files (instead of 110) and migration was successful. I could see the pv migrated to the destination.

> 
> What did you use for the object storage?
AWS S3 bucket
> Did you have sufficient room in object storage?
Since it is an Amazon S3 bucket, I believe there is no restriction on how much we can store. I am attaching a screenshot of the bucket storage from when I tried migrating 25 8.8Gi files.
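
The linked gist is not reproduced here; based on the file counts and sizes mentioned in this comment (8.8Gi files, 25 vs. the original 110), a rough equivalent of that fill workload, assuming the 1000Gi PVC is mounted at /var/lib/pgsql/data inside the postgresql pod, would be something like:

# oc -n big-pvc rsh postgresql-1-65lbd
sh-4.2$ for i in $(seq 1 25); do dd if=/dev/urandom of=/var/lib/pgsql/data/filler-$i bs=1M count=9000; done   # ~8.8Gi per file

The mount path and the exact dd parameters are assumptions; the gist remains the authoritative procedure.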

--- Additional comment from Roshni on 2019-09-20 14:58:05 UTC ---



--- Additional comment from Roshni on 2019-09-27 00:37:20 UTC ---

Migration was successful for 400Gi and below. When I tested with 600Gi, the failure happened. I am creating the workload following these steps: https://gist.github.com/mffiedler/21e751f99945646998a3e42092af4da8

Comment 1 Avital Pinnick 2019-10-02 09:32:58 UTC
John, 

Roshni commented:

> --- Additional comment from Roshni on 2019-09-27 00:37:20 UTC ---

> Migration was successful for 400Gi and below. When I tested with 600Gi, the failure happened.

Are you sure you want to document that the failure occurs at 1000Gi if it occurred at 600Gi?

Comment 14 errata-xmlrpc 2019-10-16 06:41:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

