Bug 1862543

Summary: Failed to upgrade CAM from 1.2.3 to 1.2.4 on OCP 3.7 because the Velero and Restic pods always pull the old images
Product: Migration Toolkit for Containers
Reporter: John Matthews <jmatthew>
Component: General
Assignee: John Matthews <jmatthew>
Status: CLOSED ERRATA
QA Contact: Xin jiang <xjiang>
Severity: medium
Docs Contact: Avital Pinnick <apinnick>
Priority: unspecified
Version: 1.3.0
CC: chezhang, ernelson, jmontleo, pvauter, rjohnson, sregidor, whu, xjiang
Target Milestone: ---   
Target Release: 1.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1861971
Clones: 1862544 (view as bug list)
Environment:
Last Closed: 2020-09-30 18:42:39 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1861971, 1862544    
Bug Blocks:    

Description John Matthews 2020-07-31 16:32:46 UTC
+++ This bug was initially created as a clone of Bug #1861971 +++

Description of problem:
This issue occurs only on OCP 3.7 (I verified the CAM upgrade on OCP 3.11 and it works there). After the latest operator.yml is applied on OCP 3.7, the restic and velero pods keep reporting Init:ImagePullBackOff because they do not pick up the latest images.


Version-Release number of selected component (if applicable):
CAM 1.2.4

How reproducible:
always

Steps to Reproduce:
1. To test the CAM upgrade, set up a Quay registry as a replacement for the stage registry

2. Mirror CAM 1.2.3 to the Quay registry

3. Install CAM 1.2.3 on OCP 3.7 and OCP 4.5 from the Quay
OCP 4.5
$ oc get csv -n openshift-migration
NAME                  DISPLAY                                  VERSION   REPLACES   PHASE
cam-operator.v1.2.3   Cluster Application Migration Operator   1.2.3                Succeeded

4. Execute a migration; it completes successfully
$ oc get migmigration -n openshift-migration 5791b9b0-d20c-11ea-9ffe-f1003b8734ce -o yaml
......
  name: 5791b9b0-d20c-11ea-9ffe-f1003b8734ce
  namespace: openshift-migration
  ownerReferences:
  - apiVersion: migration.openshift.io/v1alpha1
    kind: MigPlan
    name: test1
    uid: 21b710e8-1151-4a1a-8913-8e45a015a687
  resourceVersion: "68219"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migmigrations/5791b9b0-d20c-11ea-9ffe-f1003b8734ce
  uid: 9254d163-9a2b-47a3-997e-be0317a7c8c4
spec:
  migPlanRef:
    name: test1
    namespace: openshift-migration
  stage: false
status:
  conditions:
  - category: Advisory
    durable: true
    lastTransitionTime: "2020-07-30T02:32:41Z"
    message: The migration has completed successfully.
    reason: Completed
    status: "True"
    type: Succeeded
  itenerary: Final
  observedDigest: cef11161d78ec93695f08cffaa75b06081667e259cd823ce8c76bce9269082ed
  phase: Completed
  startTimestamp: "2020-07-30T02:28:24Z"

5. Mirror the latest CAM (1.2.4) to the Quay registry

6. CAM is automatically updated to the latest version (1.2.4) on OCP 4.5
$ oc get csv -n openshift-migration
NAME                  DISPLAY                                  VERSION   REPLACES              PHASE
cam-operator.v1.2.4   Cluster Application Migration Operator   1.2.4     cam-operator.v1.2.3   Succeeded

$ oc get pod -n openshift-migration
NAME                                                           READY   STATUS      RESTARTS   AGE
migration-controller-5fc8cf748d-ghwb8                          2/2     Running     0          67m
migration-operator-8657c8878d-9qgdz                            2/2     Running     0          69m
migration-ui-7b47c9c9d6-4nkv7                                  1/1     Running     0          171m
registry-21b710e8-1151-4a1a-8913-8e45a015a687-t5tn7-1-cdjp7    1/1     Running     0          97m
registry-21b710e8-1151-4a1a-8913-8e45a015a687-t5tn7-1-deploy   0/1     Completed   0          97m
restic-cpzxq                                                   1/1     Running     0          67m
restic-njxt2                                                   1/1     Running     0          66m
restic-vcqsw                                                   1/1     Running     0          67m
velero-676884c78c-xfrg4                                        1/1     Running     0          67m

7. Download the latest operator.yml and update the image paths defined in it
$ podman cp $(podman create quay-enterprise-quay-enterprise.apps.cam-tgt-7410.qe.devcluster.openshift.com/admin/openshift-migration-rhel7-operator:v1.2):/operator.yml ./ 

$ sed -i 's/rhcam-1-2/admin/g' operator.yml
$ sed -i 's/registry.redhat.io/quay-enterprise-quay-enterprise.apps.cam-tgt-7410.qe.devcluster.openshift.com/g' operator.yml
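
For illustration, the two sed commands rewrite every image reference in operator.yml from the Red Hat registry to the Quay mirror while leaving the digests untouched, e.g. (digest shown as a placeholder):

before: registry.redhat.io/rhcam-1-2/openshift-migration-rhel7-operator@sha256:<digest>
after:  quay-enterprise-quay-enterprise.apps.cam-tgt-7410.qe.devcluster.openshift.com/admin/openshift-migration-rhel7-operator@sha256:<digest>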


8. Apply the operator.yml to OCP 3.7
$ oc replace -f operator.yml

Actual results:
1. The Restic and Velero pods fail with "Init:ImagePullBackOff"
$ oc get pod -n openshift-migration --watch
NAME                                                          READY     STATUS              RESTARTS   AGE
migration-operator-1643177695-9rn89                           2/2       Running             0          1h
migration-operator-345024850-xdxpv                            0/2       ContainerCreating   0          <invalid>
registry-21b710e8-1151-4a1a-8913-8e45a015a687-cpg4f-1-9p2t6   1/1       Running             0          34m
restic-7mkvp                                                  1/1       Running             0          32m
restic-9ck2b                                                  1/1       Running             0          32m
restic-kgqj9                                                  1/1       Running             0          32m
restic-wh7dx                                                  1/1       Running             0          32m
velero-2797337289-8tjwd                                       1/1       Running             0          1h
migration-operator-345024850-xdxpv   2/2       Running   0         <invalid>
migration-operator-1643177695-9rn89   2/2       Terminating   0         1h
migration-operator-1643177695-9rn89   0/2       Terminating   0         1h
migration-operator-1643177695-9rn89   0/2       Terminating   0         1h
migration-operator-1643177695-9rn89   0/2       Terminating   0         1h
migration-operator-1643177695-9rn89   0/2       Terminating   0         1h
velero-3550465265-4jcsd   0/1       Pending   0         <invalid>
velero-3550465265-4jcsd   0/1       Pending   0         <invalid>
restic-wh7dx   1/1       Terminating   0         34m
velero-3550465265-4jcsd   0/1       Init:0/5   0         <invalid>
restic-wh7dx   0/1       Terminating   0         34m
velero-3550465265-4jcsd   0/1       Init:ErrImagePull   0         <invalid>
velero-3550465265-4jcsd   0/1       Init:ImagePullBackOff   0         <invalid>
restic-wh7dx   0/1       Terminating   0         34m
restic-wh7dx   0/1       Terminating   0         34m
restic-vzcgn   0/1       Pending   0         <invalid>
restic-vzcgn   0/1       Init:0/1   0         <invalid>
restic-vzcgn   0/1       Init:ErrImagePull   0         <invalid>
velero-3550465265-4jcsd   0/1       Init:ErrImagePull   0         <invalid>
velero-3550465265-4jcsd   0/1       Init:ImagePullBackOff   0         <invalid>
restic-vzcgn   0/1       Init:ImagePullBackOff   0         <invalid>
restic-vzcgn   0/1       Init:ImagePullBackOff   0         <invalid>
velero-3550465265-4jcsd   0/1       Init:ErrImagePull   0         <invalid>
velero-3550465265-4jcsd   0/1       Init:ImagePullBackOff   0         <invalid>
restic-vzcgn   0/1       Init:ErrImagePull   0         <invalid>
restic-vzcgn   0/1       Init:ImagePullBackOff   0         <invalid>
restic-vzcgn   0/1       Init:ErrImagePull   0         <invalid>
restic-vzcgn   0/1       Init:ImagePullBackOff   0         <invalid>
velero-3550465265-4jcsd   0/1       Init:ErrImagePull   0         <invalid>
velero-3550465265-4jcsd   0/1       Init:ImagePullBackOff   0         <invalid>
restic-vzcgn   0/1       Init:ErrImagePull   0         <invalid>
restic-vzcgn   0/1       Init:ImagePullBackOff   0         <invalid>
velero-3550465265-4jcsd   0/1       Init:ErrImagePull   0         <invalid>
velero-3550465265-4jcsd   0/1       Init:ImagePullBackOff   0         <invalid>
restic-vzcgn   0/1       Init:ErrImagePull   0         <invalid>
restic-vzcgn   0/1       Init:ImagePullBackOff   0         <invalid>

$ oc describe pod velero-3550465265-4jcsd -n openshift-migration
......
Events:
  FirstSeen	LastSeen	Count	From					SubObjectPath					Type		Reason			Message
  ---------	--------	-----	----					-------------					--------	------			-------
  45m		45m		1	kubelet, ip-172-18-0-75.ec2.internal							Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "certs"
  45m		45m		1	kubelet, ip-172-18-0-75.ec2.internal							Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "scratch"
  45m		45m		1	kubelet, ip-172-18-0-75.ec2.internal							Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "host-pods"
  45m		45m		1	kubelet, ip-172-18-0-75.ec2.internal							Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "velero-token-nn78h"
  45m		45m		1	kubelet, ip-172-18-0-75.ec2.internal							Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "cloud-credentials"
  45m		45m		1	kubelet, ip-172-18-0-75.ec2.internal							Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "gcp-cloud-credentials"
  45m		45m		1	kubelet, ip-172-18-0-75.ec2.internal							Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "azure-cloud-credentials"
  45m		45m		2	kubelet, ip-172-18-0-75.ec2.internal	spec.initContainers{setup-certificate-secret}	Normal		Pulling			pulling image "quay-enterprise-quay-enterprise.apps.cam-tgt-7410.qe.devcluster.openshift.com/admin/openshift-migration-velero-rhel8@sha256:1a33e327dd610f0eebaaeae5b3c9b4170ab5db572b01a170be35b9ce946c0281"
  45m		45m		2	kubelet, ip-172-18-0-75.ec2.internal	spec.initContainers{setup-certificate-secret}	Warning		Failed			Failed to pull image "quay-enterprise-quay-enterprise.apps.cam-tgt-7410.qe.devcluster.openshift.com/admin/openshift-migration-velero-rhel8@sha256:1a33e327dd610f0eebaaeae5b3c9b4170ab5db572b01a170be35b9ce946c0281": rpc error: code = 2 desc = manifest unknown: manifest unknown
  45m		45m		2	kubelet, ip-172-18-0-75.ec2.internal	spec.initContainers{setup-certificate-secret}	Warning		Failed			Error: ErrImagePull
  45m		35m		51	kubelet, ip-172-18-0-75.ec2.internal	spec.initContainers{setup-certificate-secret}	Normal		BackOff			Back-off pulling image "quay-enterprise-quay-enterprise.apps.cam-tgt-7410.qe.devcluster.openshift.com/admin/openshift-migration-velero-rhel8@sha256:1a33e327dd610f0eebaaeae5b3c9b4170ab5db572b01a170be35b9ce946c0281"
  45m		<invalid>	264	kubelet, ip-172-18-0-75.ec2.internal	spec.initContainers{setup-certificate-secret}	Warning		Failed			Error: ImagePullBackOff


$ oc describe pod velero-3550465265-4jcsd -n openshift-migration
Events:
  FirstSeen	LastSeen	Count	From					SubObjectPath				Type		Reason			Message
  ---------	--------	-----	----					-------------				--------	------			-------
  46m		46m		1	kubelet, ip-172-18-0-75.ec2.internal						Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "cloud-credentials"
  46m		46m		1	kubelet, ip-172-18-0-75.ec2.internal						Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "scratch"
  46m		46m		1	kubelet, ip-172-18-0-75.ec2.internal						Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "certs"
  46m		46m		1	kubelet, ip-172-18-0-75.ec2.internal						Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "plugins"
  46m		46m		1	kubelet, ip-172-18-0-75.ec2.internal						Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "velero-token-nn78h"
  46m		46m		1	kubelet, ip-172-18-0-75.ec2.internal						Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "gcp-cloud-credentials"
  46m		46m		1	kubelet, ip-172-18-0-75.ec2.internal						Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "azure-cloud-credentials"
  46m		46m		1	default-scheduler								Normal		Scheduled		Successfully assigned velero-3550465265-4jcsd to ip-172-18-0-75.ec2.internal
  46m		46m		2	kubelet, ip-172-18-0-75.ec2.internal	spec.initContainers{velero-plugin}	Normal		Pulling			pulling image "quay-enterprise-quay-enterprise.apps.cam-tgt-7410.qe.devcluster.openshift.com/admin/openshift-migration-plugin-rhel8@sha256:d9e2c4a9db9a88c68d3f6b18927c7f00d50a172a9a721ea6add0855e4db1fda0"
  46m		46m		2	kubelet, ip-172-18-0-75.ec2.internal	spec.initContainers{velero-plugin}	Warning		Failed			Failed to pull image "quay-enterprise-quay-enterprise.apps.cam-tgt-7410.qe.devcluster.openshift.com/admin/openshift-migration-plugin-rhel8@sha256:d9e2c4a9db9a88c68d3f6b18927c7f00d50a172a9a721ea6add0855e4db1fda0": rpc error: code = 2 desc = manifest unknown: manifest unknown
  46m		46m		2	kubelet, ip-172-18-0-75.ec2.internal	spec.initContainers{velero-plugin}	Warning		Failed			Error: ErrImagePull
  46m		45m		6	kubelet, ip-172-18-0-75.ec2.internal	spec.initContainers{velero-plugin}	Normal		BackOff			Back-off pulling image "quay-enterprise-quay-enterprise.apps.cam-tgt-7410.qe.devcluster.openshift.com/admin/openshift-migration-plugin-rhel8@sha256:d9e2c4a9db9a88c68d3f6b18927c7f00d50a172a9a721ea6add0855e4db1fda0"
  46m		<invalid>	265	kubelet, ip-172-18-0-75.ec2.internal	spec.initContainers{velero-plugin}	Warning		Failed			Error: ImagePullBackOff


CAM 1.2.4 images:
TASK [operator-mirror : Get images in /tmp/ansible.9UTiOY_cam-operator/manifests/cam-operator/v1.2.4/konveyor-operator.v1.2.4.clusterserviceversion.yaml] ************************************************************************************
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-controller-rhel8@sha256:4c58451f338eeb20e9bade9e5c61fd3ca64b469de96af77487e334dd8c9fc0e6)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-hook-runner-rhel7@sha256:86a048f0ee9726b4331d10190dc5851330b66c0326d94652ac07f33a501ae323)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-plugin-rhel8@sha256:40fee0819d750149b282b58019f4a118e296a754414fceaa4a1162deebee4898)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-registry-rhel8@sha256:37536b4487d3668a7105737695a0651e6be64720bc72a69da74153a8443ac9e1)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-rhel7-operator@sha256:a8d31fdb96e9d5e3fe42e928d0862141b7e39780e52121a995aeeb34270dd894)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-ui-rhel8@sha256:6abfaea8ac04e3b5bbf9648a3479b420b4baec35201033471020c9cae1fe1e11)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-velero-plugin-for-aws-rhel8@sha256:bfda4f3c7f95993b5f9dace49856b124505e72bd87d42a50918f4194b7e6d7f0)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-velero-plugin-for-gcp-rhel8@sha256:fa6c5c8dc38b8965dd9eedb9c2a86dc9a8441cb280392961a1b8b42379648014)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-velero-plugin-for-microsoft-azure-rhel8@sha256:c8b0fb034244ef9598703ec9534ecfb5c97cff42157d2571eab382bdb1aeb5a2)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-velero-restic-restore-helper-rhel8@sha256:356e8d9dede186325e3e4f8700cbde7121b6c4dc35c0099b8337c6cfb83049d8)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-velero-rhel8@sha256:461ea0c165ed525d4276056f6aab879dcf011facb00e94acc88ae6e9f33f1637)


CAM 1.2.3 images:
TASK [operator-mirror : Get images in /tmp/ansible.9UTiOY_cam-operator/manifests/cam-operator/v1.2.3/konveyor-operator.v1.2.3.clusterserviceversion.yaml] ************************************************************************************
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-controller-rhel8@sha256:f3de5a7b0e6eeee722da155622a9f20425696bd25f833519b7aec320a7b64659)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-hook-runner-rhel7@sha256:86a048f0ee9726b4331d10190dc5851330b66c0326d94652ac07f33a501ae323)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-plugin-rhel8@sha256:d9e2c4a9db9a88c68d3f6b18927c7f00d50a172a9a721ea6add0855e4db1fda0)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-registry-rhel8@sha256:ea6301a15277d448c8756881c7e2e712893ca8041c913476640f52da9e76cad9)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-rhel7-operator@sha256:cb509b4cf5566088a81cfbc17918aeae00fefd2bfcc4bef33cded372836e3d59)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-ui-rhel8@sha256:6abfaea8ac04e3b5bbf9648a3479b420b4baec35201033471020c9cae1fe1e11)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-velero-plugin-for-aws-rhel8@sha256:22c58f575ce2f54bf995fced82f89ba173329d9b88409cf371122f9ae8cabda1)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-velero-plugin-for-gcp-rhel8@sha256:37c0b170d168fcebb104e465621e4ce97515d82549cd37cb42be94e3e55a4271)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-velero-plugin-for-microsoft-azure-rhel8@sha256:dd92ad748a84754e5d78287e29576a5b95448e929824e86e80c60857d0c7aff9)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-velero-restic-restore-helper-rhel8@sha256:e9459138ec3531eefbefa181dae3fd93fe5cf210b2a0bd3bca7ba38fbec97f60)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-velero-rhel8@sha256:1a33e327dd610f0eebaaeae5b3c9b4170ab5db572b01a170be35b9ce946c0281)

You can see that Restic and Velero are still using the old (1.2.3) image digests.

Expected results:
The upgrade should complete successfully on OCP 3.7.


Additional info:

--- Additional comment from Jason Montleon on 2020-07-30 12:19:16 UTC ---

So far it looks like this is specific to just OCP 3.7. There is no issue with 3.11 at least. 

Looking at the logs, the operator tasks do not fail when trying to patch the velero deployment and restic daemonset, even though the image references end up unchanged. This may be an issue with python-openshift that we need to take to them for investigation after doing some of our own, and I'm not sure we're going to get much help with 3.7 support.

This has likely always been an issue and simply went undetected: the old images remain on the production registry, so the pods would not fail to pull them even when the patch command fails to update the image reference.
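
One way to see where the update is getting lost (a diagnostic sketch, not from the original report) is to compare the init container image recorded in the velero deployment with the image the pod is actually trying to pull:

$ oc get deployment velero -n openshift-migration -o jsonpath='{.spec.template.spec.initContainers[*].image}'
$ oc get pod velero-3550465265-4jcsd -n openshift-migration -o jsonpath='{.spec.initContainers[*].image}'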

For now the workaround is:
oc delete --ignore-not-found=true deployment migration-controller migration-ui velero && oc delete --ignore-not-found=true daemonset restic
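
After the delete, the operator should recreate the deployments and daemonset on its next reconcile; the result can be spot-checked with something like:

$ oc get daemonset restic -n openshift-migration -o jsonpath='{.spec.template.spec.containers[0].image}'
$ oc get pod -n openshift-migration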

Comment 2 John Matthews 2020-08-19 21:28:05 UTC
*** Bug 1862544 has been marked as a duplicate of this bug. ***

Comment 3 Jason Montleon 2020-09-23 15:18:05 UTC
Looking further, it seems to affect just the initContainers.

Comment 4 Jason Montleon 2020-09-23 15:23:17 UTC
This might be an OpenShift bug: https://github.com/kubernetes/kubernetes/issues/47264

Comment 5 Jason Montleon 2020-09-23 15:26:01 UTC
https://github.com/kubernetes/kubernetes/issues/47264
"Correct. Pre-1.8, the information is duplicated in two places in the object, with the annotation taking precedence. 1.8+, only the field is honored. You should set the field for forward compatibility, and set the annotation if you want your changes to be effective against a pre-1.8 server."

We may be able to template in the relevant information for 3.7 the same way we set v1beta1 / v1 for the Deployment depending on the version.
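
A rough sketch of how that could look in the velero deployment template (Jinja-style; k8s_minor_version and velero_plugin_image are placeholder variable names here, and pod.beta.kubernetes.io/init-containers is the pre-1.8 beta annotation, which would need to be confirmed against the 3.7 API):

spec:
  template:
    metadata:
{% if k8s_minor_version | int < 8 %}
      annotations:
        pod.beta.kubernetes.io/init-containers: '[{"name": "velero-plugin", "image": "{{ velero_plugin_image }}"}]'
{% endif %}
    spec:
      initContainers:
      - name: velero-plugin
        image: "{{ velero_plugin_image }}"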

Comment 6 Jason Montleon 2020-09-23 16:12:25 UTC
https://github.com/konveyor/mig-operator/pull/457

Comment 10 Xin jiang 2020-09-24 07:58:53 UTC
Verified with MTC 1.3.0.

Comment 14 errata-xmlrpc 2020-09-30 18:42:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Migration Toolkit for Containers (MTC) Tool image release advisory 1.3.0), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4148