Bug 1861971 - Docs, known issue: Failed to upgrade CAM from 1.2.3 to 1.2.4 on OCP 3.7 as Velero and Restic pods always take the old images
Summary: Docs, known issue: Failed to upgrade CAM from 1.2.3 to 1.2.4 on OCP 3.7 as Velero and Restic pods always take the old images
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Migration Tooling
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.4.z
Assignee: Avital Pinnick
QA Contact: Xin jiang
URL:
Whiteboard:
Depends On: 1862544
Blocks: 1862543
 
Reported: 2020-07-30 04:41 UTC by Xin jiang
Modified: 2020-08-04 07:35 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: The CAM Operator relies on the 'patch' operation to update resources, and patch is not available in OCP 3.7. Upgrades of the CAM Operator on OCP 3.7 clusters therefore require the manual workaround below to ensure that deployment resources are correctly updated.
Consequence: Deployment resources are not updated, so the Velero and Restic pods continue to reference the old images.
Workaround (if any): Delete the deployment resources for the controller, UI, and Velero, and the daemonset for Restic, where they exist, and rely on the running operator to recreate them. Example command: "oc delete --ignore-not-found=true deployment migration-controller migration-ui velero && oc delete --ignore-not-found=true daemonset restic"
Result: After the older resources are deleted, the operator recreates them with the updated images.
Clone Of:
Clones: 1862543
Environment:
Last Closed: 2020-08-04 07:35:56 UTC
Target Upstream Version:
Embargoed:



Description Xin jiang 2020-07-30 04:41:02 UTC
Description of problem:
This issue occurs only on OCP 3.7 (I verified the CAM upgrade on OCP 3.11 and it works there). After the latest operator.yml is applied on OCP 3.7, the Restic and Velero pods always report an Init:ImagePullBackOff error because they do not take the latest images.


Version-Release number of selected component (if applicable):
CAM 1.2.4

How reproducible:
always

Steps to Reproduce:
1. To test the CAM upgrade, we set up a Quay registry as a replacement for the stage registry

2. Mirror CAM 1.2.3 to the Quay registry
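For reference, a minimal, hypothetical sketch of mirroring a single image by digest (the actual test run uses the operator-mirror Ansible role shown in the logs below; the Quay hostname is this test environment's):

$ oc image mirror \
    registry.stage.redhat.io/rhcam-1-2/openshift-migration-velero-rhel8@sha256:1a33e327dd610f0eebaaeae5b3c9b4170ab5db572b01a170be35b9ce946c0281 \
    quay-enterprise-quay-enterprise.apps.cam-tgt-7410.qe.devcluster.openshift.com/admin/openshift-migration-velero-rhel8:v1.2.3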

3. Install CAM 1.2.3 on OCP 3.7 and OCP 4.5 from the Quay
OCP 4.5
$ oc get csv -n openshift-migration
NAME                  DISPLAY                                  VERSION   REPLACES   PHASE
cam-operator.v1.2.3   Cluster Application Migration Operator   1.2.3                Succeeded
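On the OCP 3.7 cluster there is no OLM, so the install is manifest-based (which is also why step 8 uses oc replace). A sketch, assuming the legacy manifest names extracted from the operator image:

$ oc create -f operator.yml
$ oc create -f controller-3.yml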

4. Execute a migration and it works well
$ oc get migmigration -n openshift-migration 5791b9b0-d20c-11ea-9ffe-f1003b8734ce -o yaml
......
  name: 5791b9b0-d20c-11ea-9ffe-f1003b8734ce
  namespace: openshift-migration
  ownerReferences:
  - apiVersion: migration.openshift.io/v1alpha1
    kind: MigPlan
    name: test1
    uid: 21b710e8-1151-4a1a-8913-8e45a015a687
  resourceVersion: "68219"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migmigrations/5791b9b0-d20c-11ea-9ffe-f1003b8734ce
  uid: 9254d163-9a2b-47a3-997e-be0317a7c8c4
spec:
  migPlanRef:
    name: test1
    namespace: openshift-migration
  stage: false
status:
  conditions:
  - category: Advisory
    durable: true
    lastTransitionTime: "2020-07-30T02:32:41Z"
    message: The migration has completed successfully.
    reason: Completed
    status: "True"
    type: Succeeded
  itinerary: Final
  observedDigest: cef11161d78ec93695f08cffaa75b06081667e259cd823ce8c76bce9269082ed
  phase: Completed
  startTimestamp: "2020-07-30T02:28:24Z"

5. Mirror the latest CAM (1.2.4) to the Quay registry

6. CAM is automatically updated to the latest version (1.2.4) on OCP 4.5
$ oc get csv -n openshift-migration
NAME                  DISPLAY                                  VERSION   REPLACES              PHASE
cam-operator.v1.2.4   Cluster Application Migration Operator   1.2.4     cam-operator.v1.2.3   Succeeded

$ oc get pod -n openshift-migration
NAME                                                           READY   STATUS      RESTARTS   AGE
migration-controller-5fc8cf748d-ghwb8                          2/2     Running     0          67m
migration-operator-8657c8878d-9qgdz                            2/2     Running     0          69m
migration-ui-7b47c9c9d6-4nkv7                                  1/1     Running     0          171m
registry-21b710e8-1151-4a1a-8913-8e45a015a687-t5tn7-1-cdjp7    1/1     Running     0          97m
registry-21b710e8-1151-4a1a-8913-8e45a015a687-t5tn7-1-deploy   0/1     Completed   0          97m
restic-cpzxq                                                   1/1     Running     0          67m
restic-njxt2                                                   1/1     Running     0          66m
restic-vcqsw                                                   1/1     Running     0          67m
velero-676884c78c-xfrg4                                        1/1     Running     0          67m

7. Download the latest operator.yml and update the image paths defined in it (the podman command creates a stopped container from the operator image and copies /operator.yml out of it)
$ podman cp $(podman create quay-enterprise-quay-enterprise.apps.cam-tgt-7410.qe.devcluster.openshift.com/admin/openshift-migration-rhel7-operator:v1.2):/operator.yml ./ 

$ sed -i 's/rhcam-1-2/admin/g' operator.yml
$ sed -i 's/registry.redhat.io/quay-enterprise-quay-enterprise.apps.cam-tgt-7410.qe.devcluster.openshift.com/g' operator.yml
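A quick sanity check (hypothetical) that the rewrite caught every image reference:

$ grep 'image:' operator.yml
$ grep -c 'registry.redhat.io' operator.yml   # expect 0 after the rewrite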


8. Apply the updated operator.yml to OCP 3.7
$ oc replace -f operator.yml
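To see which images the resources still reference after the replace (a diagnostic sketch, assuming the openshift-migration namespace):

$ oc get deployment velero -n openshift-migration -o jsonpath='{.spec.template.spec.containers[*].image}'
$ oc get daemonset restic -n openshift-migration -o jsonpath='{.spec.template.spec.containers[*].image}'

On the OCP 3.7 cluster these still print the 1.2.3 digests.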

Actual results:
1. The Restic and Velero pods fail with "Init:ImagePullBackOff"
$ oc get pod -n openshift-migration --watch
NAME                                                          READY     STATUS              RESTARTS   AGE
migration-operator-1643177695-9rn89                           2/2       Running             0          1h
migration-operator-345024850-xdxpv                            0/2       ContainerCreating   0          <invalid>
registry-21b710e8-1151-4a1a-8913-8e45a015a687-cpg4f-1-9p2t6   1/1       Running             0          34m
restic-7mkvp                                                  1/1       Running             0          32m
restic-9ck2b                                                  1/1       Running             0          32m
restic-kgqj9                                                  1/1       Running             0          32m
restic-wh7dx                                                  1/1       Running             0          32m
velero-2797337289-8tjwd                                       1/1       Running             0          1h
migration-operator-345024850-xdxpv   2/2       Running   0         <invalid>
migration-operator-1643177695-9rn89   2/2       Terminating   0         1h
migration-operator-1643177695-9rn89   0/2       Terminating   0         1h
migration-operator-1643177695-9rn89   0/2       Terminating   0         1h
migration-operator-1643177695-9rn89   0/2       Terminating   0         1h
migration-operator-1643177695-9rn89   0/2       Terminating   0         1h
velero-3550465265-4jcsd   0/1       Pending   0         <invalid>
velero-3550465265-4jcsd   0/1       Pending   0         <invalid>
restic-wh7dx   1/1       Terminating   0         34m
velero-3550465265-4jcsd   0/1       Init:0/5   0         <invalid>
restic-wh7dx   0/1       Terminating   0         34m
velero-3550465265-4jcsd   0/1       Init:ErrImagePull   0         <invalid>
velero-3550465265-4jcsd   0/1       Init:ImagePullBackOff   0         <invalid>
restic-wh7dx   0/1       Terminating   0         34m
restic-wh7dx   0/1       Terminating   0         34m
restic-vzcgn   0/1       Pending   0         <invalid>
restic-vzcgn   0/1       Init:0/1   0         <invalid>
restic-vzcgn   0/1       Init:ErrImagePull   0         <invalid>
velero-3550465265-4jcsd   0/1       Init:ErrImagePull   0         <invalid>
velero-3550465265-4jcsd   0/1       Init:ImagePullBackOff   0         <invalid>
restic-vzcgn   0/1       Init:ImagePullBackOff   0         <invalid>
restic-vzcgn   0/1       Init:ImagePullBackOff   0         <invalid>
velero-3550465265-4jcsd   0/1       Init:ErrImagePull   0         <invalid>
velero-3550465265-4jcsd   0/1       Init:ImagePullBackOff   0         <invalid>
restic-vzcgn   0/1       Init:ErrImagePull   0         <invalid>
restic-vzcgn   0/1       Init:ImagePullBackOff   0         <invalid>
restic-vzcgn   0/1       Init:ErrImagePull   0         <invalid>
restic-vzcgn   0/1       Init:ImagePullBackOff   0         <invalid>
velero-3550465265-4jcsd   0/1       Init:ErrImagePull   0         <invalid>
velero-3550465265-4jcsd   0/1       Init:ImagePullBackOff   0         <invalid>
restic-vzcgn   0/1       Init:ErrImagePull   0         <invalid>
restic-vzcgn   0/1       Init:ImagePullBackOff   0         <invalid>
velero-3550465265-4jcsd   0/1       Init:ErrImagePull   0         <invalid>
velero-3550465265-4jcsd   0/1       Init:ImagePullBackOff   0         <invalid>
restic-vzcgn   0/1       Init:ErrImagePull   0         <invalid>
restic-vzcgn   0/1       Init:ImagePullBackOff   0         <invalid>

$ oc describe pod velero-3550465265-4jcsd -n openshift-migration
......
Events:
  FirstSeen	LastSeen	Count	From					SubObjectPath					Type		Reason			Message
  ---------	--------	-----	----					-------------					--------	------			-------
  45m		45m		1	kubelet, ip-172-18-0-75.ec2.internal							Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "certs"
  45m		45m		1	kubelet, ip-172-18-0-75.ec2.internal							Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "scratch"
  45m		45m		1	kubelet, ip-172-18-0-75.ec2.internal							Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "host-pods"
  45m		45m		1	kubelet, ip-172-18-0-75.ec2.internal							Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "velero-token-nn78h"
  45m		45m		1	kubelet, ip-172-18-0-75.ec2.internal							Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "cloud-credentials"
  45m		45m		1	kubelet, ip-172-18-0-75.ec2.internal							Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "gcp-cloud-credentials"
  45m		45m		1	kubelet, ip-172-18-0-75.ec2.internal							Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "azure-cloud-credentials"
  45m		45m		2	kubelet, ip-172-18-0-75.ec2.internal	spec.initContainers{setup-certificate-secret}	Normal		Pulling			pulling image "quay-enterprise-quay-enterprise.apps.cam-tgt-7410.qe.devcluster.openshift.com/admin/openshift-migration-velero-rhel8@sha256:1a33e327dd610f0eebaaeae5b3c9b4170ab5db572b01a170be35b9ce946c0281"
  45m		45m		2	kubelet, ip-172-18-0-75.ec2.internal	spec.initContainers{setup-certificate-secret}	Warning		Failed			Failed to pull image "quay-enterprise-quay-enterprise.apps.cam-tgt-7410.qe.devcluster.openshift.com/admin/openshift-migration-velero-rhel8@sha256:1a33e327dd610f0eebaaeae5b3c9b4170ab5db572b01a170be35b9ce946c0281": rpc error: code = 2 desc = manifest unknown: manifest unknown
  45m		45m		2	kubelet, ip-172-18-0-75.ec2.internal	spec.initContainers{setup-certificate-secret}	Warning		Failed			Error: ErrImagePull
  45m		35m		51	kubelet, ip-172-18-0-75.ec2.internal	spec.initContainers{setup-certificate-secret}	Normal		BackOff			Back-off pulling image "quay-enterprise-quay-enterprise.apps.cam-tgt-7410.qe.devcluster.openshift.com/admin/openshift-migration-velero-rhel8@sha256:1a33e327dd610f0eebaaeae5b3c9b4170ab5db572b01a170be35b9ce946c0281"
  45m		<invalid>	264	kubelet, ip-172-18-0-75.ec2.internal	spec.initContainers{setup-certificate-secret}	Warning		Failed			Error: ImagePullBackOff


$ oc describe pod velero-3550465265-4jcsd -n openshift-migration
Events:
  FirstSeen	LastSeen	Count	From					SubObjectPath				Type		Reason			Message
  ---------	--------	-----	----					-------------				--------	------			-------
  46m		46m		1	kubelet, ip-172-18-0-75.ec2.internal						Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "cloud-credentials"
  46m		46m		1	kubelet, ip-172-18-0-75.ec2.internal						Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "scratch"
  46m		46m		1	kubelet, ip-172-18-0-75.ec2.internal						Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "certs"
  46m		46m		1	kubelet, ip-172-18-0-75.ec2.internal						Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "plugins"
  46m		46m		1	kubelet, ip-172-18-0-75.ec2.internal						Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "velero-token-nn78h"
  46m		46m		1	kubelet, ip-172-18-0-75.ec2.internal						Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "gcp-cloud-credentials"
  46m		46m		1	kubelet, ip-172-18-0-75.ec2.internal						Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "azure-cloud-credentials"
  46m		46m		1	default-scheduler								Normal		Scheduled		Successfully assigned velero-3550465265-4jcsd to ip-172-18-0-75.ec2.internal
  46m		46m		2	kubelet, ip-172-18-0-75.ec2.internal	spec.initContainers{velero-plugin}	Normal		Pulling			pulling image "quay-enterprise-quay-enterprise.apps.cam-tgt-7410.qe.devcluster.openshift.com/admin/openshift-migration-plugin-rhel8@sha256:d9e2c4a9db9a88c68d3f6b18927c7f00d50a172a9a721ea6add0855e4db1fda0"
  46m		46m		2	kubelet, ip-172-18-0-75.ec2.internal	spec.initContainers{velero-plugin}	Warning		Failed			Failed to pull image "quay-enterprise-quay-enterprise.apps.cam-tgt-7410.qe.devcluster.openshift.com/admin/openshift-migration-plugin-rhel8@sha256:d9e2c4a9db9a88c68d3f6b18927c7f00d50a172a9a721ea6add0855e4db1fda0": rpc error: code = 2 desc = manifest unknown: manifest unknown
  46m		46m		2	kubelet, ip-172-18-0-75.ec2.internal	spec.initContainers{velero-plugin}	Warning		Failed			Error: ErrImagePull
  46m		45m		6	kubelet, ip-172-18-0-75.ec2.internal	spec.initContainers{velero-plugin}	Normal		BackOff			Back-off pulling image "quay-enterprise-quay-enterprise.apps.cam-tgt-7410.qe.devcluster.openshift.com/admin/openshift-migration-plugin-rhel8@sha256:d9e2c4a9db9a88c68d3f6b18927c7f00d50a172a9a721ea6add0855e4db1fda0"
  46m		<invalid>	265	kubelet, ip-172-18-0-75.ec2.internal	spec.initContainers{velero-plugin}	Warning		Failed			Error: ImagePullBackOff


CAM 1.2.4 images:
TASK [operator-mirror : Get images in /tmp/ansible.9UTiOY_cam-operator/manifests/cam-operator/v1.2.4/konveyor-operator.v1.2.4.clusterserviceversion.yaml] ************************************************************************************
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-controller-rhel8@sha256:4c58451f338eeb20e9bade9e5c61fd3ca64b469de96af77487e334dd8c9fc0e6)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-hook-runner-rhel7@sha256:86a048f0ee9726b4331d10190dc5851330b66c0326d94652ac07f33a501ae323)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-plugin-rhel8@sha256:40fee0819d750149b282b58019f4a118e296a754414fceaa4a1162deebee4898)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-registry-rhel8@sha256:37536b4487d3668a7105737695a0651e6be64720bc72a69da74153a8443ac9e1)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-rhel7-operator@sha256:a8d31fdb96e9d5e3fe42e928d0862141b7e39780e52121a995aeeb34270dd894)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-ui-rhel8@sha256:6abfaea8ac04e3b5bbf9648a3479b420b4baec35201033471020c9cae1fe1e11)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-velero-plugin-for-aws-rhel8@sha256:bfda4f3c7f95993b5f9dace49856b124505e72bd87d42a50918f4194b7e6d7f0)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-velero-plugin-for-gcp-rhel8@sha256:fa6c5c8dc38b8965dd9eedb9c2a86dc9a8441cb280392961a1b8b42379648014)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-velero-plugin-for-microsoft-azure-rhel8@sha256:c8b0fb034244ef9598703ec9534ecfb5c97cff42157d2571eab382bdb1aeb5a2)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-velero-restic-restore-helper-rhel8@sha256:356e8d9dede186325e3e4f8700cbde7121b6c4dc35c0099b8337c6cfb83049d8)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-velero-rhel8@sha256:461ea0c165ed525d4276056f6aab879dcf011facb00e94acc88ae6e9f33f1637)


CAM 1.2.3 images:
TASK [operator-mirror : Get images in /tmp/ansible.9UTiOY_cam-operator/manifests/cam-operator/v1.2.3/konveyor-operator.v1.2.3.clusterserviceversion.yaml] ************************************************************************************
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-controller-rhel8@sha256:f3de5a7b0e6eeee722da155622a9f20425696bd25f833519b7aec320a7b64659)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-hook-runner-rhel7@sha256:86a048f0ee9726b4331d10190dc5851330b66c0326d94652ac07f33a501ae323)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-plugin-rhel8@sha256:d9e2c4a9db9a88c68d3f6b18927c7f00d50a172a9a721ea6add0855e4db1fda0)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-registry-rhel8@sha256:ea6301a15277d448c8756881c7e2e712893ca8041c913476640f52da9e76cad9)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-rhel7-operator@sha256:cb509b4cf5566088a81cfbc17918aeae00fefd2bfcc4bef33cded372836e3d59)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-ui-rhel8@sha256:6abfaea8ac04e3b5bbf9648a3479b420b4baec35201033471020c9cae1fe1e11)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-velero-plugin-for-aws-rhel8@sha256:22c58f575ce2f54bf995fced82f89ba173329d9b88409cf371122f9ae8cabda1)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-velero-plugin-for-gcp-rhel8@sha256:37c0b170d168fcebb104e465621e4ce97515d82549cd37cb42be94e3e55a4271)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-velero-plugin-for-microsoft-azure-rhel8@sha256:dd92ad748a84754e5d78287e29576a5b95448e929824e86e80c60857d0c7aff9)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-velero-restic-restore-helper-rhel8@sha256:e9459138ec3531eefbefa181dae3fd93fe5cf210b2a0bd3bca7ba38fbec97f60)
ok: [localhost] => (item=registry.stage.redhat.io/rhcam-1-2/openshift-migration-velero-rhel8@sha256:1a33e327dd610f0eebaaeae5b3c9b4170ab5db572b01a170be35b9ce946c0281)

You can see that Restic and Velero are still requesting the old images: the digests in the pull errors above (sha256:1a33e3... for openshift-migration-velero-rhel8, sha256:d9e2c4... for openshift-migration-plugin-rhel8) match the CAM 1.2.3 list, not the 1.2.4 list.

Expected results:
The upgrade should complete successfully on OCP 3.7


Additional info:

Comment 1 Jason Montleon 2020-07-30 12:19:16 UTC
So far it looks like this is specific to just OCP 3.7. There is no issue with 3.11 at least. 

Looking at the logs, it seems the operator tasks are not failing when they try to patch the velero deployment and restic daemonset, even though the resources are never updated. This may be an issue with python-openshift that we need to take to them for investigation after doing some of our own, and I'm not sure we're going to get much help with 3.7 support.

It's likely always been an issue and hasn't been detected because the old images remain on the production registry, so they wouldn't fail to pull even if the patch command fails to update the image reference.
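A hypothetical way to confirm whether a given digest is still present in a registry (assuming skopeo is available; the digest below is the 1.2.3 velero digest from the pull errors above):

$ skopeo inspect --tls-verify=false \
    docker://quay-enterprise-quay-enterprise.apps.cam-tgt-7410.qe.devcluster.openshift.com/admin/openshift-migration-velero-rhel8@sha256:1a33e327dd610f0eebaaeae5b3c9b4170ab5db572b01a170be35b9ce946c0281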

For now the workaround is:
oc delete --ignore-not-found=true deployment migration-controller migration-ui velero && oc delete --ignore-not-found=true daemonset restic
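The commands assume the openshift-migration project is selected; to be explicit, and to watch the operator recreate the pods afterwards:

$ oc delete --ignore-not-found=true deployment migration-controller migration-ui velero -n openshift-migration
$ oc delete --ignore-not-found=true daemonset restic -n openshift-migration
$ oc get pods -n openshift-migration --watch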

Comment 2 John Matthews 2020-07-31 16:48:40 UTC
Avital,

Please add a known issue to docs

Issue is:
  Upgrades of the CAM Operator on OCP 3.7 clusters require the manual workaround below to ensure that deployment resources are correctly updated. The issue is that the CAM Operator relies on the 'patch' operation to update resources, and patch is not available in OCP 3.7.

Workaround is: 
  On the OCP 3.7 cluster
  "oc delete --ignore-not-found=true deployment migration-controller migration-ui velero && oc delete --ignore-not-found=true daemonset restic"
  After the resources are deleted, the operator will recreate them with the updated images.

We are tracking a fix for this to align with CAM 1.3.0 (Target Release 4.5.0) via https://bugzilla.redhat.com/show_bug.cgi?id=1862543

Comment 4 Xin jiang 2020-08-03 03:08:24 UTC
Verified.

