Bug 1707827

Summary: cloud-credential-operator did not upgrade
Product: OpenShift Container Platform
Component: Cloud Credential Operator
Version: 4.1.0
Target Release: 4.1.z
Target Milestone: ---
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: Reopened
Reporter: Alex Krzos <akrzos>
Assignee: Devan Goodwin <dgoodwin>
QA Contact: Oleg Nesterov <olnester>
CC: aos-bugs, bleanhar, jokerman, mmccomas, pruan, sponnaga, wjiang, wking, xtian
Flags: akrzos: needinfo-
Whiteboard: aos-scalability-41
Last Closed: 2019-08-28 19:54:45 UTC
Type: Bug

Description Alex Krzos 2019-05-08 13:36:55 UTC
Description of problem:
Upgraded a large-scale cluster (250 worker nodes) from 4.1.0-0.nightly-2019-04-22-005054 to 4.1.0-0.nightly-2019-04-22-192604, and the cloud-credential-operator did not upgrade according to the clusteroperators object.

Captured clusterversion / clusteroperator status at completion of upgrade:

https://gist.github.com/akrzos/6c8252da9bb5d18e889907d00504d7cd
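
For reference, the status in the gist corresponds to the output of commands along these lines:

$ oc get clusterversion
$ oc get clusteroperators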

Even after waiting 20 minutes, the cloud-credential operator never upgraded its version.

Eventually the clusterversion object complains that it has an error reconciling the upgraded version:

Fri May  3 23:47:11 UTC 2019  version   4.1.0-0.nightly-2019-04-22-192604   True        False         26m     Error while reconciling 4.1.0-0.nightly-2019-04-22-192604: the update could not be applied

The clusterversion operator is complaining about the cloud-credential operator in its status:

Message:               Could not update deployment "openshift-cloud-credential-operator/cloud-credential-operator" (93 of 333)
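
Not part of the original capture, but one way to inspect the failing condition and the deployment the CVO could not update:

$ oc get clusterversion version -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'
$ oc -n openshift-cloud-credential-operator get deployment cloud-credential-operator -o yaml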

Captured cloud-credential-operator status:
https://gist.github.com/akrzos/d37edc8c97b588652e61cf8378ec4740

Comment 1 Abhinav Dahiya 2019-05-08 14:02:27 UTC
Can you include logs from the cluster-version-operator?

`oc logs deploy/cluster-version-operator -n openshift-cluster-version`

Comment 3 W. Trevor King 2019-05-08 14:20:26 UTC
$ oc adm must-gather --dest-dir ...

Comment 8 Scott Dodson 2019-05-08 17:34:54 UTC
Closing until this is reproduced; please re-open if this happens again so we can capture logs.

Comment 10 Peter Ruan 2019-05-31 20:29:33 UTC
Saw it again by going through the following upgrade path (a consolidated command sketch follows the step-5 output below):

1. installed 4.1.0-0.nightly-2019-05-29-220142
2. upgrade to 4.1.0-0.nightly-2019-05-31-174150 using --force=true
3. downgrade to image 4.1.0-rc.7 by using `oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:7e1e73c66702daa39223b3e6dd2cf5e15c057ef30c988256f55fae27448c3b01`

4. The cloud-credential operator still shows the pre-downgrade version:
pruan@fedora-vm ~/workspace/sandbox/bash_stuff $ oc get clusteroperators 
NAME                                 VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                       4.1.0-rc.7                          True        False         False      52m
cloud-credential                     4.1.0-0.nightly-2019-05-31-174150   True        False         False      73m
cluster-autoscaler                   4.1.0-rc.7                          True        False         False      73m
console                              4.1.0-rc.7                          True        False         False      61m
dns                                  4.1.0-rc.7                          True        False         False      72m
image-registry                       4.1.0-rc.7                          True        False         False      64m
ingress                              4.1.0-rc.7                          True        False         False      64m
kube-apiserver                       4.1.0-rc.7                          True        False         False      70m
kube-controller-manager              4.1.0-rc.7                          True        False         False      70m
kube-scheduler                       4.1.0-rc.7                          True        False         False      70m
machine-api                          4.1.0-rc.7                          True        False         False      73m
machine-config                       4.1.0-rc.7                          True        False         False      72m
marketplace                          4.1.0-rc.7                          True        False         False      9m41s
monitoring                           4.1.0-rc.7                          True        False         False      62m
network                              4.1.0-rc.7                          True        False         False      72m
node-tuning                          4.1.0-rc.7                          True        False         False      10m
openshift-apiserver                  4.1.0-rc.7                          True        False         False      69m
openshift-controller-manager         4.1.0-rc.7                          True        False         False      72m
openshift-samples                    4.1.0-rc.7                          True        False         False      9m57s
operator-lifecycle-manager           4.1.0-rc.7                          True        False         False      72m
operator-lifecycle-manager-catalog   4.1.0-rc.7                          True        False         False      72m
service-ca                           4.1.0-rc.7                          True        False         False      72m
service-catalog-apiserver            4.1.0-rc.7                          True        False         False      68m
service-catalog-controller-manager   4.1.0-rc.7                          True        False         False      69m
storage                              4.1.0-rc.7                          True        False         False      10m

5. Delete the cloud-credential-operator pod; after it is recreated, `oc get clusteroperators` shows the correct version (see the command sketch after this output):

NAME                                 VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                       4.1.0-rc.7   True        False         False      63m
cloud-credential                     4.1.0-rc.7   True        False         False      83m
cluster-autoscaler                   4.1.0-rc.7   True        False         False      84m
console                              4.1.0-rc.7   True        False         False      72m
dns                                  4.1.0-rc.7   True        False         False      83m
image-registry                       4.1.0-rc.7   True        False         False      75m
ingress                              4.1.0-rc.7   True        False         False      75m
kube-apiserver                       4.1.0-rc.7   True        False         False      81m
kube-controller-manager              4.1.0-rc.7   True        False         False      81m
kube-scheduler                       4.1.0-rc.7   True        False         False      80m
machine-api                          4.1.0-rc.7   True        False         False      83m
machine-config                       4.1.0-rc.7   True        False         False      83m
marketplace                          4.1.0-rc.7   True        False         False      20m
monitoring                           4.1.0-rc.7   True        False         False      73m
network                              4.1.0-rc.7   True        False         False      83m
node-tuning                          4.1.0-rc.7   True        False         False      21m
openshift-apiserver                  4.1.0-rc.7   True        False         False      80m
openshift-controller-manager         4.1.0-rc.7   True        False         False      82m
openshift-samples                    4.1.0-rc.7   True        False         False      20m
operator-lifecycle-manager           4.1.0-rc.7   True        False         False      82m
operator-lifecycle-manager-catalog   4.1.0-rc.7   True        False         False      83m
service-ca                           4.1.0-rc.7   True        False         False      83m
service-catalog-apiserver            4.1.0-rc.7   True        False         False      79m
service-catalog-controller-manager   4.1.0-rc.7   True        False         False      79m
storage                              4.1.0-rc.7   True        False         False      20m
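
A minimal sketch of the commands for steps 2, 3, and 5 above (the step-2 release image is elided here, and the pod name in step 5 is taken from the `get pods` output; the deployment recreates the pod automatically):

# Step 2: force the upgrade to the newer nightly
$ oc adm upgrade --to-image <release image for 4.1.0-0.nightly-2019-05-31-174150> --force=true
# Step 3: downgrade to 4.1.0-rc.7 (command as quoted in step 3)
$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:7e1e73c66702daa39223b3e6dd2cf5e15c057ef30c988256f55fae27448c3b01
# Step 5 workaround: delete the stale operator pod and re-check the reported version
$ oc -n openshift-cloud-credential-operator get pods
$ oc -n openshift-cloud-credential-operator delete pod <cloud-credential-operator pod from the previous command>
$ oc get clusteroperator cloud-credential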

Comment 13 Devan Goodwin 2019-06-27 15:01:25 UTC
*** Bug 1712775 has been marked as a duplicate of this bug. ***

Comment 14 Devan Goodwin 2019-08-08 16:57:19 UTC
The fix has been live in 4.2 for some time and appears to have corrected the problem. We were never able to reproduce it, but we discovered we were not using leader election and developed the theory that this was causing the problem: the new pod was coming up and setting the new version, but the old pod was still running for a bit and, on some clusters, if the timing was right, would reset the status to the old version. This corresponds with some of the logs we have seen. After several weeks in 4.2 the problem appears to be fixed.
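
Not from the original comment, but one way to observe the overlap described above is to watch the operator's pods during an update and confirm that the old and new pods briefly coexist:

$ oc -n openshift-cloud-credential-operator get pods -w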

4.1 backport in: https://github.com/openshift/cloud-credential-operator/pull/98
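
The basic check (the same comparison shown in comment 10) is whether the cloud-credential ClusterOperator reports the target release version once the update settles, for example:

$ oc get clusterversion
$ oc get clusteroperator cloud-credential -o jsonpath='{.status.versions[?(@.name=="operator")].version}{"\n"}'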

Comment 16 Oleg Nesterov 2019-08-13 06:41:02 UTC
I am not sure that I am able to create a cluster with 250 worker nodes from my AWS account to test this.
Alex, could you check it from your side?

Comment 21 errata-xmlrpc 2019-08-28 19:54:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2547