Bug 1866931 - Upgrade from 1.2.0 to 1.2.4 fails in OCP 4.5 clusters
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Migration Tooling
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Jason Montleon
QA Contact: Xin jiang
URL:
Whiteboard:
Depends On:
Blocks: 1866867
 
Reported: 2020-08-06 20:49 UTC by John Matthews
Modified: 2020-09-30 18:43 UTC
CC List: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 1866867
Environment:
Last Closed: 2020-09-30 18:42:39 UTC
Target Upstream Version:
Embargoed:


Attachments: None


Links:
Red Hat Product Errata RHBA-2020:4148 (Last Updated: 2020-09-30 18:42:59 UTC)

Description John Matthews 2020-08-06 20:49:19 UTC
+++ This bug was initially created as a clone of Bug #1866867 +++

Description of problem:
When we upgrade CAM from 1.2.0 to 1.2.4 in an OCP 4.5 cluster, the migration-operator pod reports a CrashLoopBackOff status. After deleting the pod, the upgrade finishes correctly.

Version-Release number of selected component (if applicable):
CAM 1.2.4
OCP 4.5

How reproducible:
Always

Steps to Reproduce:
1. Install CAM 1.2.0 in an OCP 4.5 cluster
2. Upgrade it to CAM 1.2.4 (one way to trigger the upgrade through OLM is sketched below)
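
If the operator was installed through OLM with manual approval, the upgrade is typically triggered by approving the pending InstallPlan. A minimal sketch, not from the original report; the install plan name below is illustrative, so list the real resources first:

$ oc get subscriptions,installplans -n openshift-migration
# Approve the pending install plan (name is illustrative):
$ oc -n openshift-migration patch installplan install-abc12 \
    --type merge -p '{"spec":{"approved":true}}'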


Actual results:
After the upgrade, the migration-operator pod status is CrashLoopBackOff:

NAME                                    READY   STATUS             RESTARTS   AGE
migration-controller-7d68b9f9cd-x9q8j   2/2     Running            0          12m
migration-operator-547d96f4d4-wxkxq     1/2     CrashLoopBackOff   6          7m6s
migration-ui-d998d7bb9-862vl            1/1     Running            0          12m
restic-6nphn                            1/1     Running            0          13m
restic-h7c5x                            1/1     Running            0          13m
restic-zcbk7                            1/1     Running            0          13m
velero-8446f669f-7kq7m                  1/1     Running            0          13m


If we look at the pod's YAML, the mounted service account token secret is the old one, not the new one:
$ oc get pods migration-operator-547d96f4d4-wxkxq  -o yaml | grep -A1 serviceaccount
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: migration-operator-token-pxthl
--
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: migration-operator-token-pxthl

These were the secrets after the upgrade:
$ oc get secret | grep migration-oper
migration-operator-dockercfg-kfnxn     kubernetes.io/dockercfg               1      39s
migration-operator-token-fxtv4         kubernetes.io/service-account-token   4      39s
migration-operator-token-pgp5p         kubernetes.io/service-account-token   4      39s


These were the secrets before the upgrade:
$ oc get secret | grep migration-oper
migration-operator-dockercfg-s9xkz     kubernetes.io/dockercfg               1      4m2s
migration-operator-token-22xwv         kubernetes.io/service-account-token   4      4m2s
migration-operator-token-pxthl         kubernetes.io/service-account-token   4      4m2s

As we can see, the new pod still references the old token secret (migration-operator-token-pxthl), which no longer exists after the upgrade.
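
A quick way to confirm the mismatch is to pull the token secret name out of the pod spec and check whether that secret still exists. A sketch, reusing the pod name from above:

# For injected service account tokens the volume name matches the secret name,
# so the mount name from the pod spec can be looked up directly as a secret.
$ MOUNTED=$(oc get pod migration-operator-547d96f4d4-wxkxq -n openshift-migration \
    -o jsonpath='{.spec.containers[0].volumeMounts[?(@.mountPath=="/var/run/secrets/kubernetes.io/serviceaccount")].name}')
$ oc get secret "$MOUNTED" -n openshift-migration  # NotFound here confirms the stale reference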


Expected results:
The new pod should reach Running status, and all pods should be updated to the new versions.

Additional info:
If we delete the crashed pod, it is recreated and the upgrade then completes properly.
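
Concretely (pod name from above):

# The ReplicaSet recreates the pod; because the token volume is injected at
# admission time, the fresh pod mounts a token secret that actually exists.
$ oc delete pod migration-operator-547d96f4d4-wxkxq -n openshift-migration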

This only happened in OCP 4.5 clusters; it did not happen in 4.2, for instance. I'm not sure about other versions.


We can see this in the namespace's events:
16s         Warning   FailedMount           pod/migration-operator-547d96f4d4-wxkxq      MountVolume.SetUp failed for volume "migration-operator-token-pxthl" : secret "migration-operator-token-pxthl" not found
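
Such events can be filtered directly, for example:

$ oc get events -n openshift-migration --field-selector reason=FailedMount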



This is the log from the crashed operator container:

Setting up watches.  Beware: since -r was given, this may take a while!
Watches established.
(python2_virtual_env) [fedora@preserve-appmigration-workmachine ~]$ oc logs migration-operator-6bc6c59b84-mjrhd -c operator
{"level":"info","ts":1596722990.8317368,"logger":"cmd","msg":"Go Version: go1.13.4"}
{"level":"info","ts":1596722990.8317828,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1596722990.8317914,"logger":"cmd","msg":"Version of operator-sdk: v0.12.0+git"}
{"level":"info","ts":1596722990.8318172,"logger":"cmd","msg":"Watching namespace.","Namespace":"openshift-migration"}
{"level":"error","ts":1596722990.8651814,"logger":"controller-runtime.manager","msg":"Failed to get API Group-Resources","error":"Unauthorized","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\tsrc/github.com/operator-framework/operator-sdk/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/manager.New\n\tsrc/github.com/operator-framework/operator-sdk/vendor/sigs.k8s.io/controller-runtime/pkg/manager/manager.go:220\ngithub.com/operator-framework/operator-sdk/pkg/ansible.Run\n\tsrc/github.com/operator-framework/operator-sdk/pkg/ansible/run.go:80\ngithub.com/operator-framework/operator-sdk/cmd/operator-sdk/run.newRunAnsibleCmd.func1\n\tsrc/github.com/operator-framework/operator-sdk/cmd/operator-sdk/run/ansible.go:38\ngithub.com/spf13/cobra.(*Command).execute\n\tsrc/github.com/operator-framework/operator-sdk/vendor/github.com/spf13/cobra/command.go:826\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tsrc/github.com/operator-framework/operator-sdk/vendor/github.com/spf13/cobra/command.go:914\ngithub.com/spf13/cobra.(*Command).Execute\n\tsrc/github.com/operator-framework/operator-sdk/vendor/github.com/spf13/cobra/command.go:864\nmain.main\n\tsrc/github.com/operator-framework/operator-sdk/cmd/operator-sdk/main.go:84\nruntime.main\n\t/opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/proc.go:203"}
{"level":"error","ts":1596722990.865287,"logger":"cmd","msg":"Failed to create a new manager.","Namespace":"openshift-migration","error":"Unauthorized","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\tsrc/github.com/operator-framework/operator-sdk/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/operator-framework/operator-sdk/pkg/ansible.Run\n\tsrc/github.com/operator-framework/operator-sdk/pkg/ansible/run.go:86\ngithub.com/operator-framework/operator-sdk/cmd/operator-sdk/run.newRunAnsibleCmd.func1\n\tsrc/github.com/operator-framework/operator-sdk/cmd/operator-sdk/run/ansible.go:38\ngithub.com/spf13/cobra.(*Command).execute\n\tsrc/github.com/operator-framework/operator-sdk/vendor/github.com/spf13/cobra/command.go:826\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tsrc/github.com/operator-framework/operator-sdk/vendor/github.com/spf13/cobra/command.go:914\ngithub.com/spf13/cobra.(*Command).Execute\n\tsrc/github.com/operator-framework/operator-sdk/vendor/github.com/spf13/cobra/command.go:864\nmain.main\n\tsrc/github.com/operator-framework/operator-sdk/cmd/operator-sdk/main.go:84\nruntime.main\n\t/opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/proc.go:203"}
Error: Unauthorized
Usage:
  operator-sdk run ansible [flags]

Flags:
      --ansible-verbosity int            Ansible verbosity. Overridden by environment variable. (default 2)
  -h, --help                             help for ansible
      --inject-owner-ref                 The ansible operator will inject owner references unless this flag is false (default true)
      --max-workers int                  Maximum number of workers to use. Overridden by environment variable. (default 1)
      --reconcile-period duration        Default reconcile period for controllers (default 1m0s)
      --watches-file string              Path to the watches file to use (default "./watches.yaml")
      --zap-devel                        Enable zap development mode (changes defaults to console encoder, debug log level, and disables sampling)
      --zap-encoder encoder              Zap log encoding ('json' or 'console')
      --zap-level level                  Zap log level (one of 'debug', 'info', 'error' or any integer value > 0) (default info)
      --zap-sample sample                Enable zap log sampling. Sampling will be disabled for integer log levels > 1
      --zap-time-encoding timeEncoding   Sets the zap time format ('epoch', 'millis', 'nano', or 'iso8601') (default )

Global Flags:
      --verbose   Enable verbose logging
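
The Unauthorized error is consistent with the stale mount: the container presents a token whose backing secret has been deleted, so the API server rejects it. One way to confirm this, assuming the pod's other, still-running container can be exec'd into (the container name and this step are assumptions, not from the original report):

# Read the token the pod actually mounted and ask the API server who it
# authenticates as; a stale token fails with an Unauthorized error.
$ TOKEN=$(oc exec migration-operator-547d96f4d4-wxkxq -n openshift-migration -c ansible -- \
    cat /var/run/secrets/kubernetes.io/serviceaccount/token)
$ oc whoami --token="$TOKEN"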

Comment 1 Jason Montleon 2020-08-20 00:38:27 UTC
Can this still be reproduced? I got the impression from Slack conversations that it started working, possibly with a 4.5 z-stream update.

Comment 6 Xin jiang 2020-09-22 09:37:26 UTC
Verified with MTC 1.3.0; the mounted token secret now matches an existing secret:


$ oc get pods -n openshift-migration migration-operator-cb65d55b4-5nv66 -o yaml | grep -A1 serviceaccount
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: migration-operator-token-xjz6z

$ oc get secret -n openshift-migration | grep migration-oper
migration-operator-dockercfg-vm7nz                    kubernetes.io/dockercfg               1      55m
migration-operator-token-mghqs                        kubernetes.io/service-account-token   4      55m
migration-operator-token-xjz6z                        kubernetes.io/service-account-token   4      55m


Operator image:
$ oc get pods -n openshift-migration migration-operator-cb65d55b4-5nv66 -o yaml | grep image
            f:image: {}
            f:imagePullPolicy: {}
    image: quay-enterprise-quay-enterprise.apps.cam-tgt-8790.qe.devcluster.openshift.com/admin/openshift-migration-rhel7-operator@sha256:66efea27fa3d6498ef8c722ef9dec45ceba2a9db695b8092e0e65b5070c94d87
    imagePullPolicy: Always
  imagePullSecrets:
    image: quay-enterprise-quay-enterprise.apps.cam-tgt-8790.qe.devcluster.openshift.com/admin/openshift-migration-rhel7-operator@sha256:66efea27fa3d6498ef8c722ef9dec45ceba2a9db695b8092e0e65b5070c94d87
    imageID: quay-enterprise-quay-enterprise.apps.cam-tgt-8790.qe.devcluster.openshift.com/admin/openshift-migration-rhel7-operator@sha256:66efea27fa3d6498ef8c722ef9dec45ceba2a9db695b8092e0e65b5070c94d87

Comment 10 errata-xmlrpc 2020-09-30 18:42:39 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Migration Toolkit for Containers (MTC) Tool image release advisory 1.3.0), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4148

