Bug 1871059 - Migration stuck when restic restore helper pod image cannot be pulled
Summary: Migration stuck when restic restore helper pod image cannot be pulled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Migration Toolkit for Containers
Classification: Red Hat
Component: General
Version: 1.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 1.4.0
Assignee: Shawn Hurley
QA Contact: Xin jiang
Docs Contact: Avital Pinnick
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-08-21 09:01 UTC by Sergio
Modified: 2021-02-11 12:55 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-11 12:54:46 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | konveyor mig-controller pull 747 | 0 | None | closed | Bug 1871059: Adding ability to wait for init containers to start and finish | 2021-01-11 15:08:04 UTC
Red Hat Product Errata | RHBA-2020:5329 | 0 | None | None | None | 2021-02-11 12:55:07 UTC

Description Sergio 2020-08-21 09:01:55 UTC
Description of problem:
If there is a problem pulling the restic restore helper pod image, the migration gets stuck forever instead of failing.

Version-Release number of selected component (if applicable):
CAM 1.2.5

How reproducible:
Always

Steps to Reproduce:
1. In source cluster, create a namespace 
oc new-project bztest

2. In this namespace, deploy an application
oc new-app cakephp-mysql-persistent

3. In target cluster, configure a wrong value for velero_restic_restore_helper_version

oc edit migrationcontroller
....
    restic_timeout: 1h
    velero_restic_restore_helper_version: THISISAFAKEVALUETHATCANNOTBEPULLED

4. Create a migration plan and migrate the namespace created in step 1
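
Step 3 can also be applied non-interactively with a merge patch. The sketch below only builds the patch body; the helper name is hypothetical, and the CR name (migration-controller), the namespace (openshift-migration), and the placement of the field under spec are assumptions based on a default install, so adjust them for your cluster.

```shell
# Hypothetical helper: print a JSON merge patch that sets
# spec.velero_restic_restore_helper_version to the given value.
# Assumption: the field lives under .spec of the MigrationController CR.
helper_version_patch() {
  printf '{"spec":{"velero_restic_restore_helper_version":"%s"}}' "$1"
}

# Usage (assumed default CR name and namespace; requires a logged-in oc session):
#   oc -n openshift-migration patch migrationcontroller migration-controller \
#     --type merge -p "$(helper_version_patch THISISAFAKEVALUETHATCANNOTBEPULLED)"
```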


Actual results:

The migration is stuck forever in the StageRestoreCreated state.

In target cluster we can see that the stage pod cannot be created

$ oc get pods
NAME                        READY   STATUS                  RESTARTS   AGE
stage-mysql-1-dmgvm-2flgs   0/1     Init:ImagePullBackOff   0          15m


Expected results:
When CAM can see that the stage pod cannot be created, the migration should fail instead of remaining stuck.
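
As a rough illustration of the condition a check could look for, the following sketch scans `oc get pods` table output for pods whose STATUS column ends in an image pull failure. The function name is hypothetical, and this is not how mig-controller detects the condition; it is only a way to spot the stuck stage pod from the command line.

```shell
# Hypothetical helper: read `oc get pods` table output on stdin and print
# the names of pods whose STATUS column ends in an image pull failure
# (for example Init:ImagePullBackOff or ErrImagePull).
find_image_pull_failures() {
  # NR > 1 skips the header row; $1 is NAME and $3 is STATUS.
  awk 'NR > 1 && $3 ~ /(ImagePullBackOff|ErrImagePull)$/ { print $1 }'
}

# Usage (requires a logged-in oc session in the target namespace):
#   oc get pods | find_image_pull_failures
```

Run against the output shown under "Actual results", this would print the name of the stage pod.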

Additional info:
If we use this configuration
    migration_stage_image: mybadregistry.com/bad
    migration_stage_repo: mybadrepo
    migration_stage_version: badversion

The same problem occurs, but the migration is stuck in the StagePodsCreated state instead.

Comment 2 Erik Nelson 2020-10-05 17:43:09 UTC
Alay, this is probably related to the registry health check work. Think the expectation here is a failure, which the dependency checks should satisfy.

Comment 7 Sergio 2021-01-11 15:16:18 UTC
Verified using MTC 1.4.0

In 1.4.0 the error is visible in the UI, like this:

Container restic-wait Failed to apply default image tag "registry.stage.redhat.io/rhmtc/openshift-migration-velero-restic-restore-helper-rhel8@sha256:THISISAFAKEVALUETHATCANNOTBEPULLED": couldn't parse image reference "registry.stage.redhat.io/rhmtc/openshift-migration-velero-restic-restore-helper-rhel8@sha256:THISISAFAKEVALUETHATCANNOTBEPULLED": invalid reference format


The migration will be aborted and a warning will be reported once the restic timeout is reached. It happened before 1.4.0 too, but the cause of this timeout was hidden.


Given that the error is now reported to the user, and that the restic timeout prevents the migration from waiting forever, we can consider this BZ verified.
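
The bounded-wait behaviour described in this comment can be pictured with a generic polling loop. This is a minimal sketch under stated assumptions, not MTC's implementation: the function name, its arguments, and the `pod_running` check in the usage note are all hypothetical.

```shell
# Hypothetical helper: run a check command every INTERVAL seconds until it
# succeeds or TIMEOUT seconds have elapsed.
# Returns 0 if the check succeeded, 1 if the timeout was reached.
wait_with_timeout() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  while ! "$@"; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      # Report why we gave up instead of waiting silently.
      echo "timed out after ${elapsed}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  return 0
}

# Usage (hypothetical check; requires a logged-in oc session):
#   pod_running() { [ "$(oc get pod "$1" -o jsonpath='{.status.phase}')" = "Running" ]; }
#   wait_with_timeout 3600 10 pod_running stage-mysql-1-dmgvm-2flgs
```

The point of the sketch is the one made above: a wait that has a deadline and reports its cause fails visibly instead of hanging.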

Comment 9 errata-xmlrpc 2021-02-11 12:54:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Migration Toolkit for Containers (MTC) tool image release advisory 1.4.0), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5329

