Bug 1866931

Summary: Upgrade from 1.2.0 to 1.2.4 fails in OCP 4.5 clusters
Product: OpenShift Container Platform
Component: Migration Tooling
Version: 4.5
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Reporter: John Matthews <jmatthew>
Assignee: Jason Montleon <jmontleo>
QA Contact: Xin jiang <xjiang>
CC: ernelson, jmontleo, mberube, pgaikwad, rjohnson, sregidor, xjiang
Doc Type: No Doc Update
Clone Of: 1866867
Bug Blocks: 1866867
Last Closed: 2020-09-30 18:42:39 UTC

Description John Matthews 2020-08-06 20:49:19 UTC
+++ This bug was initially created as a clone of Bug #1866867 +++

Description of problem:
When we upgrade CAM from 1.2.0 to 1.2.4 in an OCP 4.5 cluster, the migration-operator pod reports a CrashLoopBackOff status. After deleting the pod, the upgrade finishes correctly.

Version-Release number of selected component (if applicable):
CAM 1.2.4
OCP 4.5

How reproducible:
Always

Steps to Reproduce:
1. Install CAM 1.2.0 in an OCP 4.5 cluster
2. Upgrade it to CAM 1.2.4 (the rollout can be followed as sketched below)
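
When reproducing this through OLM, the installed version and the upgrade rollout can be followed from the CSVs in the migration namespace. A minimal sketch, assuming the operator is installed into the default openshift-migration namespace:

$ oc get csv -n openshift-migration      # confirm the installed version before upgrading
$ oc get csv -n openshift-migration -w   # after triggering the upgrade, watch the new CSV appear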


Actual results:
After the upgrade, the migration-operator pod status is CrashLoopBackOff:

NAME                                    READY   STATUS             RESTARTS   AGE
migration-controller-7d68b9f9cd-x9q8j   2/2     Running            0          12m
migration-operator-547d96f4d4-wxkxq     1/2     CrashLoopBackOff   6          7m6s
migration-ui-d998d7bb9-862vl            1/1     Running            0          12m
restic-6nphn                            1/1     Running            0          13m
restic-h7c5x                            1/1     Running            0          13m
restic-zcbk7                            1/1     Running            0          13m
velero-8446f669f-7kq7m                  1/1     Running            0          13m
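
For reference, the listing above can be reproduced with the following, assuming the default migration namespace:

$ oc get pods -n openshift-migration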


If we have a look at the pod's YAML, the mounted service account token secret is the old one, not the new one.
$ oc get pods migration-operator-547d96f4d4-wxkxq  -o yaml | grep -A1 serviceaccount
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: migration-operator-token-pxthl
--
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: migration-operator-token-pxthl

THESE WERE THE SECRETS AFTER THE UPGRADE:
$ oc get secret | grep migration-oper
migration-operator-dockercfg-kfnxn     kubernetes.io/dockercfg               1      39s
migration-operator-token-fxtv4         kubernetes.io/service-account-token   4      39s
migration-operator-token-pgp5p         kubernetes.io/service-account-token   4      39s


THESE WERE THE SECRETS BEFORE THE UPGRADE:
$ oc get secret | grep migration-oper
migration-operator-dockercfg-s9xkz     kubernetes.io/dockercfg               1      4m2s
migration-operator-token-22xwv         kubernetes.io/service-account-token   4      4m2s
migration-operator-token-pxthl         kubernetes.io/service-account-token   4      4m2s

As we can see, the new pod is still referencing the old token secret (migration-operator-token-pxthl), which no longer exists after the upgrade, instead of one of the new ones.
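
A quick way to spot this mismatch is to compare the secret-backed volumes in the pod spec against the token secrets that actually exist. A sketch, using the pod name from the output above:

$ oc get pod migration-operator-547d96f4d4-wxkxq -n openshift-migration \
    -o jsonpath='{.spec.volumes[*].secret.secretName}{"\n"}'
$ oc get secrets -n openshift-migration \
    --field-selector type=kubernetes.io/service-account-token

If a secretName printed by the first command does not appear in the second listing, the pod is holding a reference to a deleted token secret.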


Expected results:
The new pod should be in Running status, and all pods should be updated to the new versions.

Additional info:
If we delete the crashed pod, it is recreated and the upgrade completes properly.
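
Concretely, the workaround is just deleting the pod so its Deployment recreates it against the current service account secrets:

$ oc delete pod migration-operator-547d96f4d4-wxkxq -n openshift-migration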

This only happened in OCP 4.5 clusters; it was not happening in 4.2, for instance. I'm not sure about other versions.


We can see this in the namespace's events:
16s         Warning   FailedMount           pod/migration-operator-547d96f4d4-wxkxq      MountVolume.SetUp failed for volume "migration-operator-token-pxthl" : secret "migration-operator-token-pxthl" not found
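
Events like this can be filtered directly, for example:

$ oc get events -n openshift-migration --field-selector reason=FailedMount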



This is the log from inside the crashed operator:

Setting up watches.  Beware: since -r was given, this may take a while!
Watches established.
(python2_virtual_env) [fedora@preserve-appmigration-workmachine ~]$ oc logs migration-operator-6bc6c59b84-mjrhd -c operator
{"level":"info","ts":1596722990.8317368,"logger":"cmd","msg":"Go Version: go1.13.4"}
{"level":"info","ts":1596722990.8317828,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1596722990.8317914,"logger":"cmd","msg":"Version of operator-sdk: v0.12.0+git"}
{"level":"info","ts":1596722990.8318172,"logger":"cmd","msg":"Watching namespace.","Namespace":"openshift-migration"}
{"level":"error","ts":1596722990.8651814,"logger":"controller-runtime.manager","msg":"Failed to get API Group-Resources","error":"Unauthorized","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\tsrc/github.com/operator-framework/operator-sdk/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/manager.New\n\tsrc/github.com/operator-framework/operator-sdk/vendor/sigs.k8s.io/controller-runtime/pkg/manager/manager.go:220\ngithub.com/operator-framework/operator-sdk/pkg/ansible.Run\n\tsrc/github.com/operator-framework/operator-sdk/pkg/ansible/run.go:80\ngithub.com/operator-framework/operator-sdk/cmd/operator-sdk/run.newRunAnsibleCmd.func1\n\tsrc/github.com/operator-framework/operator-sdk/cmd/operator-sdk/run/ansible.go:38\ngithub.com/spf13/cobra.(*Command).execute\n\tsrc/github.com/operator-framework/operator-sdk/vendor/github.com/spf13/cobra/command.go:826\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tsrc/github.com/operator-framework/operator-sdk/vendor/github.com/spf13/cobra/command.go:914\ngithub.com/spf13/cobra.(*Command).Execute\n\tsrc/github.com/operator-framework/operator-sdk/vendor/github.com/spf13/cobra/command.go:864\nmain.main\n\tsrc/github.com/operator-framework/operator-sdk/cmd/operator-sdk/main.go:84\nruntime.main\n\t/opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/proc.go:203"}
{"level":"error","ts":1596722990.865287,"logger":"cmd","msg":"Failed to create a new manager.","Namespace":"openshift-migration","error":"Unauthorized","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\tsrc/github.com/operator-framework/operator-sdk/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/operator-framework/operator-sdk/pkg/ansible.Run\n\tsrc/github.com/operator-framework/operator-sdk/pkg/ansible/run.go:86\ngithub.com/operator-framework/operator-sdk/cmd/operator-sdk/run.newRunAnsibleCmd.func1\n\tsrc/github.com/operator-framework/operator-sdk/cmd/operator-sdk/run/ansible.go:38\ngithub.com/spf13/cobra.(*Command).execute\n\tsrc/github.com/operator-framework/operator-sdk/vendor/github.com/spf13/cobra/command.go:826\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tsrc/github.com/operator-framework/operator-sdk/vendor/github.com/spf13/cobra/command.go:914\ngithub.com/spf13/cobra.(*Command).Execute\n\tsrc/github.com/operator-framework/operator-sdk/vendor/github.com/spf13/cobra/command.go:864\nmain.main\n\tsrc/github.com/operator-framework/operator-sdk/cmd/operator-sdk/main.go:84\nruntime.main\n\t/opt/rh/go-toolset-1.13/root/usr/lib/go-toolset-1.13-golang/src/runtime/proc.go:203"}
Error: Unauthorized
Usage:
  operator-sdk run ansible [flags]

Flags:
      --ansible-verbosity int            Ansible verbosity. Overridden by environment variable. (default 2)
  -h, --help                             help for ansible
      --inject-owner-ref                 The ansible operator will inject owner references unless this flag is false (default true)
      --max-workers int                  Maximum number of workers to use. Overridden by environment variable. (default 1)
      --reconcile-period duration        Default reconcile period for controllers (default 1m0s)
      --watches-file string              Path to the watches file to use (default "./watches.yaml")
      --zap-devel                        Enable zap development mode (changes defaults to console encoder, debug log level, and disables sampling)
      --zap-encoder encoder              Zap log encoding ('json' or 'console')
      --zap-level level                  Zap log level (one of 'debug', 'info', 'error' or any integer value > 0) (default info)
      --zap-sample sample                Enable zap log sampling. Sampling will be disabled for integer log levels > 1
      --zap-time-encoding timeEncoding   Sets the zap time format ('epoch', 'millis', 'nano', or 'iso8601') (default )

Global Flags:
      --verbose   Enable verbose logging
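
The "Unauthorized" error is consistent with the operator still presenting a token from the deleted secret. One way to confirm this from inside the operator container (a sketch, assuming the stale token is mounted at the usual path and curl is available in the image; run via oc rsh while the container is up between restarts):

$ TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
$ curl -sk -o /dev/null -w '%{http_code}\n' \
    -H "Authorization: Bearer $TOKEN" https://kubernetes.default.svc/apis

An HTTP 401 here means the API server no longer accepts the mounted token, matching the error in the log above.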

Comment 1 Jason Montleon 2020-08-20 00:38:27 UTC
Can this still be reproduced? I got the impression from Slack conversations that it started working, possibly with a 4.5 z-stream update.

Comment 6 Xin jiang 2020-09-22 09:37:26 UTC
Verified with MTC 1.3.0:


$ oc get pods -n openshift-migration migration-operator-cb65d55b4-5nv66 -o yaml | grep -A1 serviceaccount
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: migration-operator-token-xjz6z

$ oc get secret -n openshift-migration | grep migration-oper
migration-operator-dockercfg-vm7nz                    kubernetes.io/dockercfg               1      55m
migration-operator-token-mghqs                        kubernetes.io/service-account-token   4      55m
migration-operator-token-xjz6z                        kubernetes.io/service-account-token   4      55m


Operator image:
$ oc get pods -n openshift-migration migration-operator-cb65d55b4-5nv66 -o yaml | grep image
            f:image: {}
            f:imagePullPolicy: {}
    image: quay-enterprise-quay-enterprise.apps.cam-tgt-8790.qe.devcluster.openshift.com/admin/openshift-migration-rhel7-operator@sha256:66efea27fa3d6498ef8c722ef9dec45ceba2a9db695b8092e0e65b5070c94d87
    imagePullPolicy: Always
  imagePullSecrets:
    image: quay-enterprise-quay-enterprise.apps.cam-tgt-8790.qe.devcluster.openshift.com/admin/openshift-migration-rhel7-operator@sha256:66efea27fa3d6498ef8c722ef9dec45ceba2a9db695b8092e0e65b5070c94d87
    imageID: quay-enterprise-quay-enterprise.apps.cam-tgt-8790.qe.devcluster.openshift.com/admin/openshift-migration-rhel7-operator@sha256:66efea27fa3d6498ef8c722ef9dec45ceba2a9db695b8092e0e65b5070c94d87

Comment 10 errata-xmlrpc 2020-09-30 18:42:39 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Migration Toolkit for Containers (MTC) Tool image release advisory 1.3.0), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4148