Bug 1868376 - cloud-credential operator pod is in CrashLoopBackOff and blocking downgrade from 4.6 to 4.5
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Credential Operator
Version: 4.5
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Joel Diaz
QA Contact: wang lin
URL:
Whiteboard:
Depends On:
Blocks: 1860922 1873345
Reported: 2020-08-12 12:50 UTC by pmali
Modified: 2020-11-10 05:07 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Moving from 4.5 to 4.6, some fields that were left at their defaults in 4.5 are explicitly specified in 4.6.
Consequence: This breaks the ability to downgrade from 4.6 to 4.5.
Fix: Rather than leave the fields unspecified in 4.5, explicitly specify the default values so that on a downgrade attempt those fields are restored to what they should be for 4.5.
Result: Downgrading from 4.6 to 4.5 can succeed.
Clone Of:
Cloned to: 1873345 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:28:06 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:28:24 UTC

Description pmali 2020-08-12 12:50:37 UTC
Description of problem:

While downgrading OCP 4.6 to 4.5, the cloud-credential operator pod goes into CrashLoopBackOff and blocks the downgrade.

$ oc get pods -n openshift-cloud-credential-operator 
NAME                                         READY   STATUS             RESTARTS   AGE
cloud-credential-operator-5898d86997-zqh5s   0/1     CrashLoopBackOff   76         6h8m
pod-identity-webhook-596ff668d-sc96x         1/1     Running            0          6h55m

$ oc get pods cloud-credential-operator-5898d86997-zqh5s -n openshift-cloud-credential-operator -oyaml
...
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-08-12T06:35:34Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-08-12T12:44:02Z"
    message: 'containers with unready status: [cloud-credential-operator]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-08-12T12:44:02Z"
    message: 'containers with unready status: [cloud-credential-operator]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-08-12T06:35:34Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://f7eb2f196fc3b720b47c84afc94af5397f6cd8d2b77b680e37ea9ecc3a270b24
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:534642e97f55406840394474970a39f2828732c6b2d98870da8734d7aadca2a4
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:534642e97f55406840394474970a39f2828732c6b2d98870da8734d7aadca2a4
    lastState:
      terminated:
        containerID: cri-o://f7eb2f196fc3b720b47c84afc94af5397f6cd8d2b77b680e37ea9ecc3a270b24
        exitCode: 1
        finishedAt: "2020-08-12T12:44:01Z"
        message: |
          Copying system trust bundle
          time="2020-08-12T12:43:59Z" level=debug msg="debug logging enabled"
          time="2020-08-12T12:43:59Z" level=info msg="setting up client for manager"
          time="2020-08-12T12:43:59Z" level=info msg="setting up manager"
          time="2020-08-12T12:44:01Z" level=info msg="registering components"
          time="2020-08-12T12:44:01Z" level=info msg="setting up scheme"
          time="2020-08-12T12:44:01Z" level=info msg="setting up controller"
          time="2020-08-12T12:44:01Z" level=fatal msg="infrastructures.config.openshift.io \"cluster\" is forbidden: User \"system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator\" cannot get resource \"infrastructures\" in API group \"config.openshift.io\" at the cluster scope"
        reason: Error
        startedAt: "2020-08-12T12:43:59Z"
    name: cloud-credential-operator
    ready: false
    restartCount: 76
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=cloud-credential-operator
          pod=cloud-credential-operator-5898d86997-zqh5s_openshift-cloud-credential-operator(04e00acd-c083-4b9c-9c70-a159fb05851e)
        reason: CrashLoopBackOff
  hostIP: 10.x.x.x
  phase: Running
  podIP: 10.x.x.x
  podIPs:
  - ip: 10.x.x.x
  qosClass: Burstable
  startTime: "2020-08-12T06:35:34Z"


Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-12-003456
4.5.0-0.nightly-2020-08-08-162221

How reproducible:
Always

Steps to Reproduce:
1. Downgrade from 4.6.0-0.nightly-2020-08-12-003456 to 4.5.0-0.nightly-2020-08-08-162221

Actual results:
cloud-credential operator pod is in CrashLoopBackOff

Expected results:
The downgrade completes and the cloud-credential operator pod does not crash-loop.

Additional info:

Comment 2 Greg Sheremeta 2020-08-22 11:55:34 UTC
will investigate next sprint

Comment 3 Joel Diaz 2020-08-27 18:19:23 UTC
Moving to 4.5. The issue appears to be that the cloud-cred-operator Deployment in the release-4.5 branch doesn't specify a ServiceAccount, while the one in 4.6 does (a ServiceAccount that is new to 4.6).

After the downgrade to 4.5, the cloud-cred-operator Deployment still references the orphaned ServiceAccount (named "cloud-credential-operator") instead of the ServiceAccount named "default" (the one we actually use in 4.5).
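The Doc Text's fix ("explicitly specify the default values" in the 4.5 manifest) amounts to pinning the previously-defaulted field in the Deployment's pod template. A minimal sketch of what that manifest fragment would look like, assuming the standard Kubernetes field name:

```yaml
# Sketch of the 4.5-side fix: explicitly set the previously-defaulted
# ServiceAccount in the Deployment pod template, so that applying the
# 4.5 manifest during a 4.6 -> 4.5 downgrade stamps the field back to
# the 4.5 value instead of leaving the 4.6 reference in place.
spec:
  template:
    spec:
      serviceAccountName: default  # 4.5 uses the "default" ServiceAccount
```

A one-off manual workaround of the same shape would be an `oc patch` of the Deployment to set `serviceAccountName: default`, though the cluster-version operator normally manages this manifest.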

Comment 5 Scott Dodson 2020-08-27 23:04:36 UTC
Can you please confirm that this is a test case where you started with 4.5, upgraded to 4.6, then downgraded back to 4.5?

Comment 6 W. Trevor King 2020-08-27 23:28:33 UTC
To keep Eric's bot happy, we'll probably want to move this bug to MODIFIED so we can VERIFY that a "4.6->4.6" downgrade does not crash-loop the cred operator. Then we can clone back to a bug targeting 4.5.z and actually fix it.

Comment 9 wang lin 2020-08-28 04:36:42 UTC
When downgrading 4.6 -> 4.6, CCO doesn't crash.

downgrade from 4.6.0-0.nightly-2020-08-26-215737 to 4.6.0-0.nightly-2020-08-21-084833
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-26-215737   True        True          9s      Working towards registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-21-084833: downloading update
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-21-084833   True        False         6m23s   Cluster version is 4.6.0-0.nightly-2020-08-21-084833
$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-08-21-084833   True        False         False      8m56s
cloud-credential                           4.6.0-0.nightly-2020-08-21-084833   True        False         False      139m
cluster-autoscaler                         4.6.0-0.nightly-2020-08-21-084833   True        False         False      127m
config-operator                            4.6.0-0.nightly-2020-08-21-084833   True        False         False      131m
console                                    4.6.0-0.nightly-2020-08-21-084833   True        False         False      11m
csi-snapshot-controller                    4.6.0-0.nightly-2020-08-21-084833   True        False         False      11m
dns                                        4.6.0-0.nightly-2020-08-21-084833   True        False         False      22m
etcd                                       4.6.0-0.nightly-2020-08-21-084833   True        False         False      129m
image-registry                             4.6.0-0.nightly-2020-08-21-084833   True        False         False      122m
ingress                                    4.6.0-0.nightly-2020-08-21-084833   True        False         False      122m
insights                                   4.6.0-0.nightly-2020-08-21-084833   True        False         False      127m
kube-apiserver                             4.6.0-0.nightly-2020-08-21-084833   True        False         False      129m
kube-controller-manager                    4.6.0-0.nightly-2020-08-21-084833   True        False         False      128m
kube-scheduler                             4.6.0-0.nightly-2020-08-21-084833   True        False         False      128m
kube-storage-version-migrator              4.6.0-0.nightly-2020-08-21-084833   True        False         False      11m
machine-api                                4.6.0-0.nightly-2020-08-21-084833   True        False         False      123m
machine-approver                           4.6.0-0.nightly-2020-08-21-084833   True        False         False      127m
machine-config                             4.6.0-0.nightly-2020-08-21-084833   True        False         False      8m16s
marketplace                                4.6.0-0.nightly-2020-08-21-084833   True        False         False      12m
monitoring                                 4.6.0-0.nightly-2020-08-21-084833   True        False         False      121m
network                                    4.6.0-0.nightly-2020-08-21-084833   True        False         False      122m
node-tuning                                4.6.0-0.nightly-2020-08-21-084833   True        False         False      32m
openshift-apiserver                        4.6.0-0.nightly-2020-08-21-084833   True        False         False      9m5s
openshift-controller-manager               4.6.0-0.nightly-2020-08-21-084833   True        False         False      125m
openshift-samples                          4.6.0-0.nightly-2020-08-21-084833   True        False         False      32m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-08-21-084833   True        False         False      130m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-08-21-084833   True        False         False      130m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-08-21-084833   True        False         False      9m7s
service-ca                                 4.6.0-0.nightly-2020-08-21-084833   True        False         False      131m
storage                                    4.6.0-0.nightly-2020-08-21-084833   True        False         False      11m
$ oc get pods -n openshift-cloud-credential-operator
NAME                                         READY   STATUS    RESTARTS   AGE
cloud-credential-operator-869b565fc5-gcws4   2/2     Running   0          14m
pod-identity-webhook-7f99757f4c-nj7tq        1/1     Running   0          14m

Comment 10 Xingxing Xia 2020-09-01 11:10:42 UTC
(In reply to Scott Dodson from comment #5)
> Can you please confirm that this is a test case where you started with 4.5, upgraded to 4.6, then downgraded back to 4.5?
Yes. I tried today and still hit it. I launched a 4.5.0-0.nightly-2020-08-31-101523 IPI GCP env, upgraded successfully to 4.6.0-0.nightly-2020-09-01-042030, then tried the downgrade back to 4.5.0-0.nightly-2020-08-31-101523 and hit it.

Comment 11 wang lin 2020-09-01 11:40:12 UTC
The fix hasn't been backported to 4.5, so downgrading from 4.6 to 4.5 still hits this issue.
Isn't this bug meant to verify the 4.6 -> 4.6 downgrade? Refer to https://bugzilla.redhat.com/show_bug.cgi?id=1868376#c6

And this one (https://bugzilla.redhat.com/show_bug.cgi?id=1873345) is the actual fix for the downgrade issue.

Is my understanding wrong?

Comment 12 Xingxing Xia 2020-09-02 02:20:58 UTC
Sorry I didn't notice carefully there was already a 4.5 clone.

Comment 14 errata-xmlrpc 2020-10-27 16:28:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

