Bug 1868376

Summary: cloud-credential operator pod is in CrashLoopBackOff and blocking downgrade from 4.6 to 4.5
Product: OpenShift Container Platform
Reporter: pmali
Component: Cloud Credential Operator
Assignee: Joel Diaz <jdiaz>
Status: CLOSED ERRATA
QA Contact: wang lin <lwan>
Severity: high
Priority: high
Version: 4.5
CC: gshereme, jdiaz, lwan, sandeepredhat, sdodson, wking, xxia
Target Milestone: ---
Keywords: TestBlocker
Target Release: 4.6.0
Hardware: All
OS: All
Doc Type: Bug Fix
Doc Text:
Cause: When moving from 4.5 to 4.6, some fields that were left at their defaults in 4.5 are explicitly specified in 4.6.
Consequence: This breaks the ability to downgrade from 4.6 to 4.5.
Fix: Rather than leaving the fields unspecified in 4.5, explicitly specify the default values so that on a downgrade attempt those fields are restored to what they should be for 4.5.
Result: Downgrading from 4.6 to 4.5 can succeed.
Story Points: ---
Clone Of:
Clones: 1873345
Last Closed: 2020-10-27 16:28:06 UTC
Type: Bug
Bug Blocks: 1860922, 1873345    

Description pmali 2020-08-12 12:50:37 UTC
Description of problem:

While downgrading OCP from 4.6 to 4.5, the cloud-credential-operator pod goes into CrashLoopBackOff and blocks the downgrade.

$ oc get pods -n openshift-cloud-credential-operator 
NAME                                         READY   STATUS             RESTARTS   AGE
cloud-credential-operator-5898d86997-zqh5s   0/1     CrashLoopBackOff   76         6h8m
pod-identity-webhook-596ff668d-sc96x         1/1     Running            0          6h55m

$ oc get pods cloud-credential-operator-5898d86997-zqh5s -n openshift-cloud-credential-operator -oyaml
...
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-08-12T06:35:34Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-08-12T12:44:02Z"
    message: 'containers with unready status: [cloud-credential-operator]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-08-12T12:44:02Z"
    message: 'containers with unready status: [cloud-credential-operator]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-08-12T06:35:34Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://f7eb2f196fc3b720b47c84afc94af5397f6cd8d2b77b680e37ea9ecc3a270b24
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:534642e97f55406840394474970a39f2828732c6b2d98870da8734d7aadca2a4
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:534642e97f55406840394474970a39f2828732c6b2d98870da8734d7aadca2a4
    lastState:
      terminated:
        containerID: cri-o://f7eb2f196fc3b720b47c84afc94af5397f6cd8d2b77b680e37ea9ecc3a270b24
        exitCode: 1
        finishedAt: "2020-08-12T12:44:01Z"
        message: |
          Copying system trust bundle
          time="2020-08-12T12:43:59Z" level=debug msg="debug logging enabled"
          time="2020-08-12T12:43:59Z" level=info msg="setting up client for manager"
          time="2020-08-12T12:43:59Z" level=info msg="setting up manager"
          time="2020-08-12T12:44:01Z" level=info msg="registering components"
          time="2020-08-12T12:44:01Z" level=info msg="setting up scheme"
          time="2020-08-12T12:44:01Z" level=info msg="setting up controller"
          time="2020-08-12T12:44:01Z" level=fatal msg="infrastructures.config.openshift.io \"cluster\" is forbidden: User \"system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator\" cannot get resource \"infrastructures\" in API group \"config.openshift.io\" at the cluster scope"
        reason: Error
        startedAt: "2020-08-12T12:43:59Z"
    name: cloud-credential-operator
    ready: false
    restartCount: 76
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=cloud-credential-operator
          pod=cloud-credential-operator-5898d86997-zqh5s_openshift-cloud-credential-operator(04e00acd-c083-4b9c-9c70-a159fb05851e)
        reason: CrashLoopBackOff
  hostIP: 10.x.x.x
  phase: Running
  podIP: 10.x.x.x
  podIPs:
  - ip: 10.x.x.x
  qosClass: Burstable
  startTime: "2020-08-12T06:35:34Z"
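
The fatal message in the last container state shows that, after the downgrade, the operator's ServiceAccount no longer has RBAC to read the infrastructures resource. As an illustrative check (not captured in the original report), impersonating that ServiceAccount should confirm the missing permission:

$ oc auth can-i get infrastructures.config.openshift.io \
    --as=system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator
no

The "no" answer is what the forbidden error above implies, not output collected from the affected cluster.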


Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-12-003456
4.5.0-0.nightly-2020-08-08-162221

How reproducible:
Always

Steps to Reproduce:
1. Downgrade from 4.6.0-0.nightly-2020-08-12-003456 to 4.5.0-0.nightly-2020-08-08-162221 (see the command sketch below).
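
For reference, a downgrade to an explicit nightly is typically triggered by pointing the CVO at the target release image. The pullspec below reuses the CI registry seen later in this bug and is illustrative only:

$ oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-08-08-162221 \
    --allow-explicit-upgrade --force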

Actual results:
The cloud-credential-operator pod is stuck in CrashLoopBackOff, blocking the downgrade.

Expected results:
The downgrade should complete without the cloud-credential-operator pod crash-looping.

Additional info:

Comment 2 Greg Sheremeta 2020-08-22 11:55:34 UTC
will investigate next sprint

Comment 3 Joel Diaz 2020-08-27 18:19:23 UTC
Moving to 4.5. The issue appears to be that the Deployment in the release-4.5 branch of cloud-cred-operator doesn't specify a ServiceAccount, while the one in 4.6 does specify one (a ServiceAccount that is new to 4.6).

After downgrading to 4.5, the cloud-cred-operator Deployment keeps its reference to the now-orphaned ServiceAccount (named "cloud-credential-operator") instead of referencing the ServiceAccount named "default" (the one we actually use in 4.5).
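
In other words, the 4.5 manifest leaves spec.template.spec.serviceAccountName unset (so it defaults to "default"), while the 4.6 manifest points it at the new "cloud-credential-operator" ServiceAccount. A minimal sketch of the fix direction, explicitly pinning the default in the release-4.5 Deployment manifest (standard Kubernetes Deployment fields; the actual manifest lives in the cloud-credential-operator repo and is not reproduced here):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloud-credential-operator
  namespace: openshift-cloud-credential-operator
spec:
  template:
    spec:
      # Explicitly state the 4.5 default so that applying this manifest
      # during a 4.6 -> 4.5 downgrade overwrites the 4.6-only ServiceAccount reference.
      serviceAccountName: default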

Comment 5 Scott Dodson 2020-08-27 23:04:36 UTC
Can you please confirm that this is a test case where you started with 4.5, upgraded to 4.6, then downgraded back to 4.5?

Comment 6 W. Trevor King 2020-08-27 23:28:33 UTC
To keep Eric's bot happy, we'll probably want to move this bug to MODIFIED so we can VERIFY that a "4.6 -> 4.6" downgrade does not crash-loop the cred operator. Then we can clone back to a bug targeting 4.5.z and actually fix it there.

Comment 9 wang lin 2020-08-28 04:36:42 UTC
When downgrading 4.6 -> 4.6, the CCO does not crash.

downgrade from 4.6.0-0.nightly-2020-08-26-215737 to 4.6.0-0.nightly-2020-08-21-084833
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-26-215737   True        True          9s      Working towards registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-21-084833: downloading update
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-21-084833   True        False         6m23s   Cluster version is 4.6.0-0.nightly-2020-08-21-084833
$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-08-21-084833   True        False         False      8m56s
cloud-credential                           4.6.0-0.nightly-2020-08-21-084833   True        False         False      139m
cluster-autoscaler                         4.6.0-0.nightly-2020-08-21-084833   True        False         False      127m
config-operator                            4.6.0-0.nightly-2020-08-21-084833   True        False         False      131m
console                                    4.6.0-0.nightly-2020-08-21-084833   True        False         False      11m
csi-snapshot-controller                    4.6.0-0.nightly-2020-08-21-084833   True        False         False      11m
dns                                        4.6.0-0.nightly-2020-08-21-084833   True        False         False      22m
etcd                                       4.6.0-0.nightly-2020-08-21-084833   True        False         False      129m
image-registry                             4.6.0-0.nightly-2020-08-21-084833   True        False         False      122m
ingress                                    4.6.0-0.nightly-2020-08-21-084833   True        False         False      122m
insights                                   4.6.0-0.nightly-2020-08-21-084833   True        False         False      127m
kube-apiserver                             4.6.0-0.nightly-2020-08-21-084833   True        False         False      129m
kube-controller-manager                    4.6.0-0.nightly-2020-08-21-084833   True        False         False      128m
kube-scheduler                             4.6.0-0.nightly-2020-08-21-084833   True        False         False      128m
kube-storage-version-migrator              4.6.0-0.nightly-2020-08-21-084833   True        False         False      11m
machine-api                                4.6.0-0.nightly-2020-08-21-084833   True        False         False      123m
machine-approver                           4.6.0-0.nightly-2020-08-21-084833   True        False         False      127m
machine-config                             4.6.0-0.nightly-2020-08-21-084833   True        False         False      8m16s
marketplace                                4.6.0-0.nightly-2020-08-21-084833   True        False         False      12m
monitoring                                 4.6.0-0.nightly-2020-08-21-084833   True        False         False      121m
network                                    4.6.0-0.nightly-2020-08-21-084833   True        False         False      122m
node-tuning                                4.6.0-0.nightly-2020-08-21-084833   True        False         False      32m
openshift-apiserver                        4.6.0-0.nightly-2020-08-21-084833   True        False         False      9m5s
openshift-controller-manager               4.6.0-0.nightly-2020-08-21-084833   True        False         False      125m
openshift-samples                          4.6.0-0.nightly-2020-08-21-084833   True        False         False      32m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-08-21-084833   True        False         False      130m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-08-21-084833   True        False         False      130m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-08-21-084833   True        False         False      9m7s
service-ca                                 4.6.0-0.nightly-2020-08-21-084833   True        False         False      131m
storage                                    4.6.0-0.nightly-2020-08-21-084833   True        False         False      11m
$ oc get pods -n openshift-cloud-credential-operator
NAME                                         READY   STATUS    RESTARTS   AGE
cloud-credential-operator-869b565fc5-gcws4   2/2     Running   0          14m
pod-identity-webhook-7f99757f4c-nj7tq        1/1     Running   0          14m

Comment 10 Xingxing Xia 2020-09-01 11:10:42 UTC
(In reply to Scott Dodson from comment #5)
> Can you please confirm that this is a test case where you started with 4.5, upgraded to 4.6, then downgraded back to 4.5?
Yes, I tried today and still hit it. I launched a 4.5.0-0.nightly-2020-08-31-101523 IPI GCP env, upgraded successfully to 4.6.0-0.nightly-2020-09-01-042030, then tried the downgrade back to 4.5.0-0.nightly-2020-08-31-101523 and hit the issue.

Comment 11 wang lin 2020-09-01 11:40:12 UTC
The fix hasn't been backported to 4.5, so downgrading from 4.6 to 4.5 still hits this issue.
Isn't this bug verified against a 4.6 -> 4.6 downgrade? Refer to https://bugzilla.redhat.com/show_bug.cgi?id=1868376#c6

And this one (https://bugzilla.redhat.com/show_bug.cgi?id=1873345) is the actual fix for the downgrade issue.

Is my understanding wrong?

Comment 12 Xingxing Xia 2020-09-02 02:20:58 UTC
Sorry, I didn't notice that there was already a 4.5 clone.

Comment 14 errata-xmlrpc 2020-10-27 16:28:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196