Bug 1871059

Summary: Migration stuck when restic restore helper pod image cannot be pulled
Product: Migration Toolkit for Containers
Reporter: Sergio <sregidor>
Component: General
Assignee: Shawn Hurley <shurley>
Status: CLOSED ERRATA
QA Contact: Xin jiang <xjiang>
Severity: medium
Docs Contact: Avital Pinnick <apinnick>
Priority: medium
Version: 1.3.0
CC: alpatel, chezhang, dymurray, ernelson, jmatthew, jmontleo, rjohnson, rpattath, whu, xjiang
Target Release: 1.4.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2021-02-11 12:54:46 UTC
Type: Bug

Description Sergio 2020-08-21 09:01:55 UTC
Description of problem:
If the image for the restic restore helper pod cannot be pulled, the migration stays stuck forever instead of failing.

Version-Release number of selected component (if applicable):
CAM 1.2.5

How reproducible:
Always

Steps to Reproduce:
1. In the source cluster, create a namespace:
oc new-project bztest

2. In this namespace, deploy an application
oc new-app cakephp-mysql-persistent

3. In the target cluster, set an invalid value for velero_restic_restore_helper_version:

oc edit migrationcontroller
....
    restic_timeout: 1h
    velero_restic_restore_helper_version: THISISAFAKEVALUETHATCANNOTBEPULLED

4. Create a migration plan and migrate the namespace created in step 1


Actual results:

The migration is stuck forever in the StageRestoreCreated state.

In the target cluster, we can see that the stage pod cannot be created:

$ oc get pods
NAME                        READY   STATUS                  RESTARTS   AGE
stage-mysql-1-dmgvm-2flgs   0/1     Init:ImagePullBackOff   0          15m
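
As a rough way to spot this condition from the command line, the pod listing above can be filtered for image pull failures. This is only a sketch: the function name `image_pull_failures` is made up for illustration, and it assumes the default `oc get pods` column layout shown above, where STATUS is the third column.

```shell
# Print the names of pods whose STATUS column ends with an image pull
# failure reason. Reads `oc get pods` table output on stdin.
# (image_pull_failures is a hypothetical helper, not part of oc or MTC.)
image_pull_failures() {
    # The header row is skipped implicitly: "STATUS" does not match.
    awk '$3 ~ /(ImagePullBackOff|ErrImagePull|InvalidImageName)$/ { print $1 }'
}

# Example use against a live cluster:
#   oc get pods | image_pull_failures
```

With the listing from this report, the filter would print the stuck stage pod, because its status `Init:ImagePullBackOff` ends with one of the matched reasons.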


Expected results:
When CAM can see that the stage pod cannot be created, the migration should fail instead of remaining stuck.

Additional info:
If we use this configuration
    migration_stage_image: mybadregistry.com/bad
    migration_stage_repo: mybadrepo
    migration_stage_version: badversion

The problem also occurs with this configuration, but the migration is stuck in the StagePodsCreated state instead.

Comment 2 Erik Nelson 2020-10-05 17:43:09 UTC
Alay, this is probably related to the registry health check work. I think the expectation here is a failure, which the dependency checks should satisfy.

Comment 7 Sergio 2021-01-11 15:16:18 UTC
Verified using MTC 1.4.0

In 1.4.0 the error is visible in the UI, like this:

Container restic-wait Failed to apply default image tag "registry.stage.redhat.io/rhmtc/openshift-migration-velero-restic-restore-helper-rhel8@sha256:THISISAFAKEVALUETHATCANNOTBEPULLED": couldn't parse image reference "registry.stage.redhat.io/rhmtc/openshift-migration-velero-restic-restore-helper-rhel8@sha256:THISISAFAKEVALUETHATCANNOTBEPULLED": invalid reference format
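
The "invalid reference format" part of this message is consistent with the value being used as a sha256 digest: such a digest is 64 hexadecimal digits, and the fake value is neither. A minimal pre-check along those lines could look like the sketch below. The function name `is_sha256_hex` is hypothetical, and treating the setting as a bare sha256 digest is an assumption based on the image reference in the error above.

```shell
# Succeed only if the argument looks like the hex part of a sha256
# digest: exactly 64 lowercase hexadecimal digits.
# (is_sha256_hex is a hypothetical helper, not part of oc or MTC.)
is_sha256_hex() {
    printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{64}$'
}

# Example use before editing the MigrationController:
#   is_sha256_hex "$value" || echo "not a valid sha256 digest: $value" >&2
```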


The migration will be aborted and a warning will be reported once the restic timeout is reached. This also happened before 1.4.0, but the cause of the timeout was hidden.


Given that the error is now reported to the user, and that the restic timeout prevents the migration from waiting forever, we can consider this BZ verified.
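
For illustration only, the two behaviours discussed in this report (aborting when a timeout is reached, and failing as soon as an image pull error is visible) can be combined in a small polling loop. This is a sketch under stated assumptions, not MTC's controller logic: `wait_for_stage_pod`, its arguments, and its return codes are all hypothetical.

```shell
# wait_for_stage_pod STATUS_CMD TIMEOUT_SECONDS [INTERVAL_SECONDS]
# STATUS_CMD is the name of a command or function that prints the pod's
# STATUS column (for example a wrapper around `oc get pod`).
# Returns 0 when the pod is Running or Completed, 2 as soon as an image
# pull failure is reported, and 1 if the timeout is reached first.
# (Hypothetical helper; names and return codes are assumptions.)
wait_for_stage_pod() {
    status_cmd=$1
    timeout=$2
    interval=${3:-5}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        status=$($status_cmd)
        case $status in
            Running|Completed)
                return 0 ;;
            *ImagePullBackOff|*ErrImagePull|*InvalidImageName)
                # Fail fast instead of waiting for the timeout.
                echo "image pull failed: $status" >&2
                return 2 ;;
        esac
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "timed out after ${timeout}s" >&2
    return 1
}
```

The point of the separate return code is that a caller can report the real cause (an image that cannot be pulled) instead of a generic timeout.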

Comment 9 errata-xmlrpc 2021-02-11 12:54:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Migration Toolkit for Containers (MTC) tool image release advisory 1.4.0), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5329