Bug 1778918 - A failure in restic does not fail the migration
Summary: A failure in restic does not fail the migration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Migration Tooling
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.4.0
Assignee: Scott Seago
QA Contact: Xin jiang
URL:
Whiteboard:
Depends On:
Blocks: 1776397
 
Reported: 2019-12-02 19:21 UTC by Erik Nelson
Modified: 2020-05-28 11:10 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1776397
Environment:
Last Closed: 2020-05-28 11:09:55 UTC
Target Upstream Version:
Embargoed:




Links
System: Red Hat Product Errata   ID: RHEA-2020:2326   Last Updated: 2020-05-28 11:10:20 UTC

Description Erik Nelson 2019-12-02 19:21:00 UTC
+++ This bug was initially created as a clone of Bug #1776397 +++

Description of problem:
There are restic failures that do not cause the migration to fail. For instance, if restic hits a "no space left on device" error, the migration still ends with an OK status.


Version-Release number of selected component (if applicable):
TARGET CLUSTER
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0     True        False         3h18m   Cluster version is 4.2.0

SOURCE CLUSTER
$ oc version
oc v3.7.126
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://
openshift v3.7.23
kubernetes v1.7.6+a08f5eeb62

Controller:
  imagePullSecrets:
    image: quay.io/ocpmigrate/mig-controller:latest
    imageID: quay.io/ocpmigrate/mig-controller@sha256:40c4b4f149569354c2a023c44f1c0d119c78f1d7baa87c795cdfe9626b979148
Velero:
    image: quay.io/ocpmigrate/velero:latest
    imageID: quay.io/ocpmigrate/velero@sha256:ca97a8f80566038563e1c00bfc22303033047d32208065597477b94a34460c0f
    image: quay.io/ocpmigrate/migration-plugin:latest
    imageID: quay.io/ocpmigrate/migration-plugin@sha256:96e956dd650b72dfd1db4f951b4ecb545e94c6253968d83ef33de559d83ece85


How reproducible:
Always


Steps to Reproduce:
Migration from glusterfs-storage to ceph rbd.

1. Deploy a mysql deployment with 1Mi pvc. 

Take the following template, change the requested storage to 1Mi, set the storage class to glusterfs-storage, and deploy it. Make sure that the PVC you receive is BIGGER than the 1Mi requested (see the CLI sketch after the PVC listing below).

https://raw.githubusercontent.com/fusor/mig-controller/master/docs/scenarios/nfs-pv/mysql-persistent-template.yaml

Verify that the requested size is not honored and that you received more than 1Mi. For instance, I got 1Gi instead of 1Mi.

NAME      STATUS    VOLUME                                     CAPACITY   ACCESSMODES   STORAGECLASS        AGE
mysql     Bound     pvc-35d1678e-0f85-11ea-8100-064d5e4c320a   1Gi        RWO           glusterfs-storage   1h
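
As a minimal CLI sketch of this step, assuming the template exposes the standard VOLUME_CAPACITY parameter and that the PVC it creates is named mysql (the project name below is arbitrary); if the template does not parameterize the storage class, set storageClassName: glusterfs-storage on the PVC definition before deploying:

$ oc new-project pvc-resize-test
$ oc new-app -f https://raw.githubusercontent.com/fusor/mig-controller/master/docs/scenarios/nfs-pv/mysql-persistent-template.yaml \
    -p VOLUME_CAPACITY=1Mi
# Compare the 1Mi request with what glusterfs actually provisioned (1Gi in the run above)
$ oc get pvc mysql -o jsonpath='{.spec.resources.requests.storage}{" requested / "}{.status.capacity.storage}{" provisioned"}{"\n"}'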


2. Migrate the application to the ceph rbd storage class. The RBD provisioner will honor the requested size and give you only 1Mi of space. (A hedged CLI sketch follows.)
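
This is not the reporter's exact procedure, but one way to drive the migration from the CLI instead of the web console is to create a MigMigration that points at an existing MigPlan whose destination storage class is ceph rbd. The plan name migplan-glusterfs-to-rbd is hypothetical, and the spec fields follow the mig-controller v1alpha1 API as I understand it:

$ cat <<EOF | oc create -f -
apiVersion: migration.openshift.io/v1alpha1
kind: MigMigration
metadata:
  generateName: mysql-migration-
  namespace: openshift-migration
spec:
  migPlanRef:
    name: migplan-glusterfs-to-rbd   # hypothetical plan mapping glusterfs-storage -> ceph rbd
    namespace: openshift-migration
  stage: false                       # final migration, not a stage run
  quiescePods: true                  # scale source pods down before copying volume data
EOF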


Actual results:
The migration ends with an OK status, but the migrated pods are crashing.

NAME             READY   STATUS             RESTARTS   AGE
mysql-1-54pr7    0/1     CrashLoopBackOff   12         41m
mysql-1-deploy   0/1     Completed          0          41m


Expected results:
The migration should fail.


Additional info:
All the logs regarding this migration are attached to this issue.

An important error is this one, from a restic pod:

time="2019-11-25T13:35:34Z" level=info msg="Restore starting" controller=pod-volume-restore logSource="pkg/controller/pod_volume_restore_controller.go:262" name=2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7-lkwn2 namespace=openshift-migration restore=openshift-migration/2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7
time="2019-11-25T13:35:36Z" level=error msg="Error restoring volume" controller=pod-volume-restore error="error creating .velero directory for done file: mkdir /host_pods/67c54608-0f88-11ea-9715-06daf3f1257c/volumes/kubernetes.io~csi/pvc-67acec75-0f88-11ea-9715-06daf3f1257c/mount/.velero: no space left on device" error.file="/go/src/github.com/heptio/velero/pkg/controller/pod_volume_restore_controller.go:366" error.function="github.com/heptio/velero/pkg/controller.(*podVolumeRestoreController).restorePodVolume" logSource="pkg/controller/pod_volume_restore_controller.go:298" name=2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7-lkwn2 namespace=openshift-migration restore=openshift-migration/2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7
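
A hedged way to find this error directly on the cluster, assuming the restic daemonset pods carry velero's default name=restic label in the openshift-migration namespace:

$ oc -n openshift-migration get pods -l name=restic
$ for p in $(oc -n openshift-migration get pods -l name=restic -o name); do
    oc -n openshift-migration logs "$p" | grep -i "no space left on device"
  done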


And the status (failed) of the restore:
$ velero describe restore 2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7 -n openshift-migration
......
Restore PVs:  true
 
Restic Restores (specify --details for more information):
  Failed:  1
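
To see which pod volume failed rather than just the failure count, the same restore (name taken from the command above) can be inspected with --details, and its logs filtered for the error:

$ velero restore describe 2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7 --details -n openshift-migration
$ velero restore logs 2ada4d20-0f88-11ea-9e67-d3954c7d7472-mxsr7 -n openshift-migration | grep -i "no space"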

Comment 2 Dylan Murray 2020-01-21 19:23:49 UTC
There is a PR upstream that should resolve this: https://github.com/vmware-tanzu/velero/pull/2201

Will update the bug once we bring this into our fork.

Comment 3 Scott Seago 2020-03-30 23:42:19 UTC
The upstream PR was included in Velero 1.3.1

Comment 7 Sergio 2020-05-07 16:54:16 UTC
Verified in CAM 1.2 stage (4.1 -> 4.4)

Following the steps to reproduce the problem, we got this error in the velero Restore logs:


time="2020-05-07T16:31:08Z" level=error msg="unable to successfully complete restic restores of pod's volumes" error="pod volume restore failed: error restoring volume: error creating .velero directory for done file: mkdir /host_pods/e6803556-0d07-4094-b345-a99ba8b63e79/volumes/kubernetes.io~csi/pvc-ae1b6167-d9eb-4279-bfbc-3e42e8681c40/mount/.velero: no space left on device" logSource="pkg/restore/restore.go:1287" restore=openshift-migration/db76daa0-907e-11ea-83e3-37726d4749fc-rr4bt


And the migmigration resource failed with this status:
status:
  conditions:
  - category: Advisory
    durable: true
    lastTransitionTime: "2020-05-07T16:22:41Z"
    message: '[1] Stage pods created.'
    status: "True"
    type: StagePodsCreated
  - category: Warn
    durable: true
    lastTransitionTime: "2020-05-07T16:31:08Z"
    message: There were errors found in 1 Restic volume restores. See restore `db76daa0-907e-11ea-83e3-37726d4749fc-rr4bt`
      for details
    status: "True"
    type: ResticErrors
  - category: Advisory
    durable: true
    lastTransitionTime: "2020-05-07T16:31:08Z"
    message: 'The migration has failed.  See: Errors.'
    reason: StageRestoreFailed
    status: "True"
    type: Failed
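
For a scripted check of the same outcome, the Failed condition can be pulled out of the MigMigration with a jsonpath query; <name> is a placeholder for whatever the first command returns:

$ oc -n openshift-migration get migmigration
$ oc -n openshift-migration get migmigration <name> \
    -o jsonpath='{range .status.conditions[?(@.type=="Failed")]}{.reason}{": "}{.message}{"\n"}{end}'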

Comment 9 errata-xmlrpc 2020-05-28 11:09:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:2326

