Created attachment 1639538 [details]
All logs

Description of problem:
There are failures in restic that do not fail the migration execution. For instance, if restic hits a "no space left on device" error, the migration still ends with an OK status.

Version-Release number of selected component (if applicable):

TARGET CLUSTER
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE    STATUS
version   4.2.0     True        False         3h18m    Cluster version is 4.2.0

SOURCE CLUSTER
$ oc version
oc v3.7.126
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://
openshift v3.7.23
kubernetes v1.7.6+a08f5eeb62

Controller:
  imagePullSecrets:
  image: quay.io/ocpmigrate/mig-controller:latest
  imageID: quay.io/ocpmigrate/mig-controller@sha256:40c4b4f149569354c2a023c44f1c0d119c78f1d7baa87c795cdfe9626b979148

Velero:
  image: quay.io/ocpmigrate/velero:latest
  imageID: quay.io/ocpmigrate/velero@sha256:ca97a8f80566038563e1c00bfc22303033047d32208065597477b94a34460c0f
  image: quay.io/ocpmigrate/migration-plugin:latest
  imageID: quay.io/ocpmigrate/migration-plugin@sha256:96e956dd650b72dfd1db4f951b4ecb545e94c6253968d83ef33de559d83ece85

How reproducible:
Always

Steps to Reproduce:
Migration from glusterfs-storage to ceph rbd.

1. Deploy a mysql deployment with a 1Mi PVC. Take the following template, change the requested storage to 1Mi, configure the storage class as glusterfs-storage, and deploy it. Make sure that the PVC you receive is BIGGER than the 1Mi requested.

https://raw.githubusercontent.com/fusor/mig-controller/master/docs/scenarios/nfs-pv/mysql-persistent-template.yaml

Verify that the requested size is not honored and that you received more than 1Mi. For instance, I got 1Gi instead of 1Mi.

NAME      STATUS    VOLUME                                     CAPACITY   ACCESSMODES   STORAGECLASS        AGE
mysql     Bound     pvc-35d1678e-0f85-11ea-8100-064d5e4c320a   1Gi        RWO           glusterfs-storage   1h

2. Migrate the application to the ceph rbd storage class. The rbd provisioner will honor the requested size and give you only 1Mi of space.

Actual results:
The migration ends OK, but the pods are crashing.

NAME             READY     STATUS             RESTARTS   AGE
mysql-1-54pr7    0/1       CrashLoopBackOff   12         41m
mysql-1-deploy   0/1       Completed          0          41m

Expected results:
The migration should fail.

Additional info:
All the logs regarding this migration are attached to this issue.

An important error is this one, in a restic pod:

time="2019-11-25T13:35:34Z" level=info msg="Restore starting" controller=pod-volume-restore logSource="pkg/controller/pod_volume_restore_controller.go:262" name=2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7-lkwn2 namespace=openshift-migration restore=openshift-migration/2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7

time="2019-11-25T13:35:36Z" level=error msg="Error restoring volume" controller=pod-volume-restore error="error creating .velero directory for done file: mkdir /host_pods/67c54608-0f88-11ea-9715-06daf3f1257c/volumes/kubernetes.io~csi/pvc-67acec75-0f88-11ea-9715-06daf3f1257c/mount/.velero: no space left on device" error.file="/go/src/github.com/heptio/velero/pkg/controller/pod_volume_restore_controller.go:366" error.function="github.com/heptio/velero/pkg/controller.(*podVolumeRestoreController).restorePodVolume" logSource="pkg/controller/pod_volume_restore_controller.go:298" name=2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7-lkwn2 namespace=openshift-migration restore=openshift-migration/2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7

And the status (failed) of the restore:

$ velero describe restore 2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7 -n openshift-migration
......
Restore PVs:  true

Restic Restores (specify --details for more information):
  Failed:  1
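For anyone trying to reproduce, a couple of commands that make the failure visible. This is a sketch: the PVC name comes from the template above, so adjust the name and namespace if yours differ.

# Compare requested vs. actually provisioned capacity on the PVC.
# On glusterfs-storage this should show 1Mi requested / 1Gi provisioned;
# on ceph rbd it should show 1Mi / 1Mi:
$ oc get pvc mysql -o jsonpath='requested={.spec.resources.requests.storage} provisioned={.status.capacity.storage}{"\n"}'

# The per-volume restic restore errors only show up with --details:
$ velero describe restore 2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7 -n openshift-migration --details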
Hey Sergio, I am trying desperately to reproduce this and falling short. Would you be able to confirm for me whether, in the above scenario, the restore itself actually failed? On the `velero describe restore` command I'm looking for the `Phase` of the object (velero describe restore 2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7 -n openshift-migration | grep Phase). That would be a huge help in determining whether what I'm seeing on my machine makes sense. Thanks in advance.
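If it is easier, the same value should be readable straight off the restore CR on the target cluster; a sketch using the restore name from the description:

$ oc get restore 2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7 -n openshift-migration -o jsonpath='{.status.phase}{"\n"}'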
Thanks to Sergio's help I have been able to reproduce this. The PodVolumeRestore resource is in a `Failed` state, yet the associated Velero restore resource is in a `Completed` state. Spoke with the upstream folks and this is definitely a bug, though I'm not certain why this specific restic failure is not failing the restore. Will create an issue upstream and follow back.
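For reference, a sketch of how to observe the mismatch on the target cluster (the Failed/Completed phases are the ones described above):

# The PodVolumeRestore for the volume is Failed...
$ oc get podvolumerestores -n openshift-migration -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'

# ...while the owning Velero restore still reports Completed:
$ oc get restore 2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7 -n openshift-migration -o jsonpath='{.status.phase}{"\n"}'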
Upstream issue: https://github.com/vmware-tanzu/velero/issues/2121

Speaking with upstream to determine how this could happen. Based on the code, this should have failed the restore.
We will work to fix this issue in the 1.1.0 release, tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1778918. I am removing this from the 1.0.1 release by closing as WONTFIX.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days