Bug 1776397 - A failure in restic does not fail the migration
Summary: A failure in restic does not fail the migration
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Migration Tooling
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.2.z
Assignee: Dylan Murray
QA Contact: Sergio
URL:
Whiteboard:
Depends On: 1778918
Blocks:
 
Reported: 2019-11-25 15:38 UTC by Sergio
Modified: 2023-09-14 05:47 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1778918 (view as bug list)
Environment:
Last Closed: 2019-12-12 14:23:41 UTC
Target Upstream Version:
Embargoed:


Attachments
All logs (1.04 MB, application/zip)
2019-11-25 15:38 UTC, Sergio

Description Sergio 2019-11-25 15:38:28 UTC
Created attachment 1639538 [details]
All logs

Description of problem:
There are restic failures that do not cause the migration to fail. For instance, if restic hits a "no space left on device" error, the migration still ends with an OK status.


Version-Release number of selected component (if applicable):
TARGET CLUSTER
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0     True        False         3h18m   Cluster version is 4.2.0

SOURCE CLUSTER
$ oc version
oc v3.7.126
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://
openshift v3.7.23
kubernetes v1.7.6+a08f5eeb62

Controller:
  imagePullSecrets:
    image: quay.io/ocpmigrate/mig-controller:latest
    imageID: quay.io/ocpmigrate/mig-controller@sha256:40c4b4f149569354c2a023c44f1c0d119c78f1d7baa87c795cdfe9626b979148
Velero:
    image: quay.io/ocpmigrate/velero:latest
    imageID: quay.io/ocpmigrate/velero@sha256:ca97a8f80566038563e1c00bfc22303033047d32208065597477b94a34460c0f
    image: quay.io/ocpmigrate/migration-plugin:latest
    imageID: quay.io/ocpmigrate/migration-plugin@sha256:96e956dd650b72dfd1db4f951b4ecb545e94c6253968d83ef33de559d83ece85


How reproducible:
Always


Steps to Reproduce:
Migration from glusterfs-storage to ceph rbd.

1. Deploy a mysql application with a 1Mi PVC.

Take the following template, change the requested storage size to 1Mi, set the storage class to glusterfs-storage, and deploy it (a rough shell sketch follows the link below). Make sure that the PVC you get is BIGGER than the 1Mi requested.

https://raw.githubusercontent.com/fusor/mig-controller/master/docs/scenarios/nfs-pv/mysql-persistent-template.yaml
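
As a rough sketch, run on the 3.7 source cluster (the sed edit and the project name are assumptions about the template contents; adjust them to what the template actually requests):

$ curl -sLO https://raw.githubusercontent.com/fusor/mig-controller/master/docs/scenarios/nfs-pv/mysql-persistent-template.yaml
# Assumption: the template requests "1Gi"; shrink it to 1Mi. If the template sets no
# storageClassName, either make glusterfs-storage the default storage class or add
# storageClassName: glusterfs-storage to the PVC before creating it.
$ sed -i 's/1Gi/1Mi/' mysql-persistent-template.yaml
$ oc new-project mysql-pvc-test
$ oc process -f mysql-persistent-template.yaml | oc create -f -
$ oc get pvc mysql    # the bound capacity should come back larger than the 1Mi requested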

Verify that the requested size is not honored and that you received more than 1Mi. For instance, I got 1Gi instead of 1Mi.

NAME      STATUS    VOLUME                                     CAPACITY   ACCESSMODES   STORAGECLASS        AGE
mysql     Bound     pvc-35d1678e-0f85-11ea-8100-064d5e4c320a   1Gi        RWO           glusterfs-storage   1h


2. Migrate the application to the ceph rbd storage class (a rough sketch of triggering the migration follows). The RBD provisioner will honor the requested size and will give you only 1Mi of space.
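
For reference, a minimal sketch of kicking off the migration with a MigMigration CR; the MigPlan name "mysql-migplan" is a placeholder, and the field names are my reading of the mig-controller v1alpha1 API (the CAM UI does the same thing under the covers):

$ cat <<EOF | oc create -f -
apiVersion: migration.openshift.io/v1alpha1
kind: MigMigration
metadata:
  generateName: mysql-migration-
  namespace: openshift-migration
spec:
  migPlanRef:
    name: mysql-migplan        # placeholder: the MigPlan covering the mysql namespace
    namespace: openshift-migration
  stage: false                 # full migration, not a stage run
  quiescePods: true
EOF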


Actual results:
The migration ends OK, but the pods are crashing.

NAME             READY   STATUS             RESTARTS   AGE
mysql-1-54pr7    0/1     CrashLoopBackOff   12         41m
mysql-1-deploy   0/1     Completed          0          41m


Expected results:
The migration should fail.


Additional info:
All the logs regarding this migration are attached to this issue.

An important error is this one, from a restic pod:

time="2019-11-25T13:35:34Z" level=info msg="Restore starting" controller=pod-volume-restore logSource="pkg/controller/pod_volume_restore_controller.go:262" name=2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7-lkwn2 namespace=openshift-migration restore=openshift-migration/2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7
time="2019-11-25T13:35:36Z" level=error msg="Error restoring volume" controller=pod-volume-restore error="error creating .velero directory for done file: mkdir /host_pods/67c54608-0f88-11ea-9715-06daf3f1257c/volumes/kubernetes.io~csi/pvc-67acec75-0f88-11ea-9715-06daf3f1257c/mount/.velero: no space left on device" error.file="/go/src/github.com/heptio/velero/pkg/controller/pod_volume_restore_controller.go:366" error.function="github.com/heptio/velero/pkg/controller.(*podVolumeRestoreController).restorePodVolume" logSource="pkg/controller/pod_volume_restore_controller.go:298" name=2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7-lkwn2 namespace=openshift-migration restore=openshift-migration/2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7


And the restore shows a failed restic restore:
$ velero describe restore 2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7 -n openshift-migration
......
Restore PVs:  true
 
Restic Restores (specify --details for more information):
  Failed:  1
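
The per-volume restic result is also visible directly on the PodVolumeRestore custom resources; a sketch, assuming the velero CRDs are readable with oc and that .status.phase is the field to check:

$ oc -n openshift-migration get podvolumerestores \
    -o custom-columns=NAME:.metadata.name,PHASE:.status.phase
# one of them should report Failed for this restore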

Comment 1 Dylan Murray 2019-12-10 14:31:27 UTC
Hey Sergio,

I am trying desperately to reproduce this and falling short. Would you be able to confirm whether, in the above scenario, the restore itself actually failed? In the `velero describe restore` output I'm looking for the `Phase` of the object (velero describe restore 2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7 -n openshift-migration | grep Phase). That would be a huge help in determining whether what I'm seeing on my machine makes sense.

Thanks in advance.

Comment 2 Dylan Murray 2019-12-11 17:52:17 UTC
Thanks to Sergio's help I have been able to reproduce this. The PodVolumeRestore resource is in a `Failed` state, yet the associated Velero Restore resource is in a `Completed` state. I spoke with the upstream folks and this is definitely a bug, though I'm not certain why this specific restic failure is not causing the restore to fail. Will create an issue upstream and follow up here.
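
A quick way to see the mismatch described above (the velero.io/restore-name label selector and the field paths are assumptions about the velero CRDs):

$ oc -n openshift-migration get restore 2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7 \
    -o jsonpath='{.status.phase}{"\n"}'                         # reports Completed
$ oc -n openshift-migration get podvolumerestores \
    -l velero.io/restore-name=2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7 \
    -o custom-columns=NAME:.metadata.name,PHASE:.status.phase   # the volume restore reports Failed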

Comment 3 Dylan Murray 2019-12-11 19:18:33 UTC
https://github.com/vmware-tanzu/velero/issues/2121

Speaking with upstream to determine how this could happen. Based on the code, this should have failed the restore.

Comment 4 John Matthews 2019-12-12 14:23:41 UTC
We will work to fix this issue in the 1.1.0 release, tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1778918

I am removing this from the 1.0.1 release by closing this bug as WONTFIX.

Comment 5 Red Hat Bugzilla 2023-09-14 05:47:32 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

