+++ This bug was initially created as a clone of Bug #1776397 +++

Description of problem:

There are restic failures that do not cause the migration to fail. For instance, if restic hits a "no space left on device" error, the migration still finishes with an OK status.

Version-Release number of selected component (if applicable):

TARGET CLUSTER
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0     True        False         3h18m   Cluster version is 4.2.0

SOURCE CLUSTER
$ oc version
oc v3.7.126
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://
openshift v3.7.23
kubernetes v1.7.6+a08f5eeb62

Controller:
imagePullSecrets:
image: quay.io/ocpmigrate/mig-controller:latest
imageID: quay.io/ocpmigrate/mig-controller@sha256:40c4b4f149569354c2a023c44f1c0d119c78f1d7baa87c795cdfe9626b979148

Velero:
image: quay.io/ocpmigrate/velero:latest
imageID: quay.io/ocpmigrate/velero@sha256:ca97a8f80566038563e1c00bfc22303033047d32208065597477b94a34460c0f
image: quay.io/ocpmigrate/migration-plugin:latest
imageID: quay.io/ocpmigrate/migration-plugin@sha256:96e956dd650b72dfd1db4f951b4ecb545e94c6253968d83ef33de559d83ece85

How reproducible:
Always

Steps to Reproduce:
Migration from glusterfs-storage to ceph rbd.

1. Deploy a mysql deployment with a 1Mi PVC. Take the following template, change the requested storage to 1Mi, set the storage class to glusterfs-storage, and deploy it:
https://raw.githubusercontent.com/fusor/mig-controller/master/docs/scenarios/nfs-pv/mysql-persistent-template.yaml

Verify that the requested size is not honored and that the PVC you receive is BIGGER than the 1Mi requested. For instance, I got 1Gi instead of 1Mi:

NAME    STATUS   VOLUME                                     CAPACITY   ACCESSMODES   STORAGECLASS        AGE
mysql   Bound    pvc-35d1678e-0f85-11ea-8100-064d5e4c320a   1Gi        RWO           glusterfs-storage   1h

2. Migrate the application to the ceph rbd storage class. The rbd provisioner will honor the requested size and give you only 1Mi of space.
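For reference, the PVC shape that triggers the mismatch looks roughly like the fragment below. This is only an illustration of the reproduction step, not the actual template (the real application is defined in the linked mysql-persistent-template.yaml):

```yaml
# Illustrative fragment only -- deploy the linked template for the real app.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql
spec:
  storageClassName: glusterfs-storage   # source cluster storage class
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Mi   # glusterfs rounds this up (1Gi observed); ceph rbd honors 1Mi
```

The size mismatch matters because the restored volume on the target is smaller than the data restic needs to write into it.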
Actual results:

The migration ends OK, but the pods are crashing:

NAME             READY   STATUS             RESTARTS   AGE
mysql-1-54pr7    0/1     CrashLoopBackOff   12         41m
mysql-1-deploy   0/1     Completed          0          41m

Expected results:

The migration should fail.

Additional info:

All the logs regarding this migration are attached to this issue. An important error is this one in a restic pod:

time="2019-11-25T13:35:34Z" level=info msg="Restore starting" controller=pod-volume-restore logSource="pkg/controller/pod_volume_restore_controller.go:262" name=2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7-lkwn2 namespace=openshift-migration restore=openshift-migration/2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7
time="2019-11-25T13:35:36Z" level=error msg="Error restoring volume" controller=pod-volume-restore error="error creating .velero directory for done file: mkdir /host_pods/67c54608-0f88-11ea-9715-06daf3f1257c/volumes/kubernetes.io~csi/pvc-67acec75-0f88-11ea-9715-06daf3f1257c/mount/.velero: no space left on device" error.file="/go/src/github.com/heptio/velero/pkg/controller/pod_volume_restore_controller.go:366" error.function="github.com/heptio/velero/pkg/controller.(*podVolumeRestoreController).restorePodVolume" logSource="pkg/controller/pod_volume_restore_controller.go:298" name=2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7-lkwn2 namespace=openshift-migration restore=openshift-migration/2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7

And the status (Failed) of the restore:

$ velero describe restore 2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7 -n openshift-migration
......
Restore PVs:  true

Restic Restores (specify --details for more information):
  Failed:  1
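The fix this bug asks for amounts to propagating restic failures up to the migration result: each restic restore is tracked by a Velero PodVolumeRestore object whose status.phase ends in Completed or Failed, so a migration should fail when any of its PodVolumeRestores did. A minimal sketch of that check, assuming the objects have already been fetched and decoded into dicts (the helper function itself is hypothetical, not mig-controller's actual code):

```python
# Sketch: fail the migration when any restic pod-volume restore failed.
# PodVolumeRestore.status.phase values follow Velero's restic integration
# (New / InProgress / Completed / Failed); the helper is illustrative only.

def migration_should_fail(pod_volume_restores):
    """Return True if any restic pod-volume restore ended in the Failed phase."""
    return any(
        pvr.get("status", {}).get("phase") == "Failed"
        for pvr in pod_volume_restores
    )

# Example mirroring the restore described in this bug:
pvrs = [
    {"metadata": {"name": "2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7-lkwn2"},
     "status": {"phase": "Failed",
                "message": "no space left on device"}},
]
print(migration_should_fail(pvrs))  # -> True
```

Before the fix, the controller effectively ignored these Failed phases and reported the migration as successful.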
There is a PR upstream that should resolve this: https://github.com/vmware-tanzu/velero/pull/2201 Will update the bug once we bring this into our fork.
The upstream PR was included in Velero 1.3.1
Verified in CAM 1.2 stage (4.1 -> 4.4)

Following the steps to reproduce the problem, we got this error in the velero Restore logs:

time="2020-05-07T16:31:08Z" level=error msg="unable to successfully complete restic restores of pod's volumes" error="pod volume restore failed: error restoring volume: error creating .velero directory for done file: mkdir /host_pods/e6803556-0d07-4094-b345-a99ba8b63e79/volumes/kubernetes.io~csi/pvc-ae1b6167-d9eb-4279-bfbc-3e42e8681c40/mount/.velero: no space left on device" logSource="pkg/restore/restore.go:1287" restore=openshift-migration/db76daa0-907e-11ea-83e3-37726d4749fc-rr4bt

And the migmigration resource failed with this status:

status:
  conditions:
  - category: Advisory
    durable: true
    lastTransitionTime: "2020-05-07T16:22:41Z"
    message: '[1] Stage pods created.'
    status: "True"
    type: StagePodsCreated
  - category: Warn
    durable: true
    lastTransitionTime: "2020-05-07T16:31:08Z"
    message: There were errors found in 1 Restic volume restores. See restore `db76daa0-907e-11ea-83e3-37726d4749fc-rr4bt` for details
    status: "True"
    type: ResticErrors
  - category: Advisory
    durable: true
    lastTransitionTime: "2020-05-07T16:31:08Z"
    message: 'The migration has failed. See: Errors.'
    reason: StageRestoreFailed
    status: "True"
    type: Failed
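The verification step can be checked mechanically by reading the conditions off the migmigration status (fetched with e.g. `oc get migmigration <name> -o json -n openshift-migration`). A small sketch, using a trimmed copy of the conditions shown above; the helper itself is hypothetical, not part of mig-controller:

```python
# Sketch: confirm the migration reached the Failed terminal condition.
# The condition shapes mirror the migmigration status dump above (trimmed).

def find_condition(status, cond_type):
    """Return the first condition of the given type, or None if absent."""
    for cond in status.get("conditions", []):
        if cond.get("type") == cond_type:
            return cond
    return None

status = {
    "conditions": [
        {"category": "Advisory", "type": "StagePodsCreated", "status": "True"},
        {"category": "Warn", "type": "ResticErrors", "status": "True"},
        {"category": "Advisory", "type": "Failed", "status": "True",
         "reason": "StageRestoreFailed"},
    ]
}

failed = find_condition(status, "Failed")
print(failed is not None and failed["status"] == "True")  # -> True
```

This is exactly the behavior the bug asked for: the restic "no space left on device" error now surfaces as a ResticErrors warning and a Failed condition instead of a successful migration.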
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:2326