Bug 1871059 - Migration stuck when restic restore helper pod image cannot be pulled

Summary: Migration stuck when restic restore helper pod image cannot be pulled

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Migration Toolkit for Containers
Classification:	Red Hat
Component:	General
Sub Component:
Version:	1.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	1.4.0
Assignee:	Shawn Hurley
QA Contact:	Xin jiang
Docs Contact:	Avital Pinnick
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-08-21 09:01 UTC by Sergio
Modified:	2021-02-11 12:55 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-02-11 12:54:46 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	konveyor mig-controller pull 747	0	None	closed	Bug 1871059: Adding ability to wait for init containers to start and finish	2021-01-11 15:08:04 UTC
Red Hat Product Errata	RHBA-2020:5329	0	None	None	None	2021-02-11 12:55:07 UTC

Description Sergio 2020-08-21 09:01:55 UTC

Description of problem:
If the restic restore helper image cannot be pulled, the migration remains stuck forever instead of failing.

Version-Release number of selected component (if applicable):
CAM 1.2.5

How reproducible:
Always

Steps to Reproduce:
1. In the source cluster, create a namespace:
oc new-project bztest

2. In this namespace, deploy an application:
oc new-app cakephp-mysql-persistent

3. In the target cluster, set an invalid value for velero_restic_restore_helper_version:

oc edit migrationcontroller
....
    restic_timeout: 1h
    velero_restic_restore_helper_version: THISISAFAKEVALUETHATCANNOTBEPULLED

4. Create a migration plan and migrate the namespace created in step 1


Actual results:

The migration remains stuck forever in the StageRestoreCreated state.

In the target cluster, the stage pod never starts because its init container image cannot be pulled:

$ oc get pods
NAME                        READY   STATUS                  RESTARTS   AGE
stage-mysql-1-dmgvm-2flgs   0/1     Init:ImagePullBackOff   0          15m


Expected results:
When CAM detects that the stage pod cannot start, the migration should fail instead of remaining stuck.
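A minimal manual sketch of that check, assuming the stage pods are listed with oc get pods in the migrated namespace using the default column layout (NAME READY STATUS RESTARTS AGE), as in the output above. The function name find_pull_failures is hypothetical and is not part of CAM or mig-controller; it only inspects the STATUS text, so it is a diagnostic aid rather than the controller's own logic. It matches ErrImagePull, ImagePullBackOff and InvalidImageName, the kubelet waiting reasons for image pull problems.

```shell
# find_pull_failures (hypothetical helper): reads "oc get pods" table output on
# stdin and prints the NAME of every pod whose STATUS column ends in an image
# pull failure reason, including init-container forms such as
# "Init:ImagePullBackOff".
find_pull_failures() {
  # NR > 1 skips the header row; $3 is the STATUS column, $1 is NAME.
  awk 'NR > 1 && $3 ~ /(ErrImagePull|ImagePullBackOff|InvalidImageName)$/ { print $1 }'
}

# Usage (namespace is an example; use the one your migration plan targets):
#   oc get pods -n bztest | find_pull_failures
```

A pod in ImagePullBackOff is retried by the kubelet indefinitely and never fails on its own, so something outside the pod has to decide to give up.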

Additional info:
The problem also happens with this configuration:
    migration_stage_image: mybadregistry.com/bad
    migration_stage_repo: mybadrepo
    migration_stage_version: badversion

In that case, the migration is stuck in the StagePodsCreated state instead.

Comment 2 Erik Nelson 2020-10-05 17:43:09 UTC

Alay, this is probably related to the registry health check work. I think the expectation here is a failure, which the dependency checks should satisfy.

Comment 7 Sergio 2021-01-11 15:16:18 UTC

Verified using MTC 1.4.0

In 1.4.0 the error is visible in the UI, like this:

Container restic-wait Failed to apply default image tag "registry.stage.redhat.io/rhmtc/openshift-migration-velero-restic-restore-helper-rhel8@sha256:THISISAFAKEVALUETHATCANNOTBEPULLED": couldn't parse image reference "registry.stage.redhat.io/rhmtc/openshift-migration-velero-restic-restore-helper-rhel8@sha256:THISISAFAKEVALUETHATCANNOTBEPULLED": invalid reference format


The migration is aborted and a warning is reported once the restic timeout is reached. This also happened before 1.4.0, but the cause of the timeout was hidden.
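The timeout behaviour described here is a general bounded-wait pattern. The shell sketch below illustrates that pattern only; it is not how mig-controller implements restic_timeout, and the helper name wait_until and its one-second poll interval are assumptions for illustration.

```shell
# wait_until (hypothetical helper): evaluates the command string in $1 once per
# second until it succeeds, giving up after $2 failed polls (roughly $2
# seconds, ignoring the time each check takes).
# Returns 0 if the command succeeded, 1 if the timeout was reached, so the
# caller can report a failure instead of waiting forever.
wait_until() {
  wu_check=$1
  wu_timeout=$2
  wu_elapsed=0
  while ! eval "$wu_check"; do
    if [ "$wu_elapsed" -ge "$wu_timeout" ]; then
      return 1
    fi
    sleep 1
    wu_elapsed=$((wu_elapsed + 1))
  done
  return 0
}

# Usage (pod name and namespace are examples; 3600 matches restic_timeout: 1h):
#   wait_until "oc get pod stage-mysql-1-dmgvm-2flgs -n bztest | grep -q Running" 3600 \
#     || echo "stage pod did not start before the timeout" >&2
```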


Given that the error is now reported to the user, and that the restic timeout prevents the migration from waiting forever, we can consider this BZ verified.

Comment 9 errata-xmlrpc 2021-02-11 12:54:46 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Migration Toolkit for Containers (MTC) tool image release advisory 1.4.0), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5329
