Description of problem:
Upgraded a large-scale cluster (250 worker nodes) from 4.1.0-0.nightly-2019-04-22-005054 to 4.1.0-0.nightly-2019-04-22-192604, and the cloud-credential operator never reported the upgraded version in its clusteroperator object.

Captured clusterversion / clusteroperator status at completion of the upgrade:
https://gist.github.com/akrzos/6c8252da9bb5d18e889907d00504d7cd

Even after waiting 20 minutes, the cloud-credential operator never picked up the new version, and eventually the clusterversion object reports an error reconciling the upgraded version:

Fri May  3 23:47:11 UTC 2019
version   4.1.0-0.nightly-2019-04-22-192604   True   False   26m   Error while reconciling 4.1.0-0.nightly-2019-04-22-192604: the update could not be applied

The cluster-version operator's status blames the cloud-credential operator:

Message: Could not update deployment "openshift-cloud-credential-operator/cloud-credential-operator" (93 of 333)

Cloud credential operator details:
https://gist.github.com/akrzos/d37edc8c97b588652e61cf8378ec4740

Version-Release number of the following components:
4.1.0-0.nightly-2019-04-22-005054 upgraded to 4.1.0-0.nightly-2019-04-22-192604

How reproducible:
Observed once so far, during an upgrade of a 250 worker node cluster.

Steps to Reproduce:
1. Install 4.1.0-0.nightly-2019-04-22-005054 on a cluster with 250 worker nodes.
2. Upgrade to 4.1.0-0.nightly-2019-04-22-192604.
3. Check `oc get clusterversion` and `oc get clusteroperators` after the upgrade completes.

Actual results:
cloud-credential still reports the pre-upgrade version, and the clusterversion object reports "the update could not be applied".

Expected results:
All cluster operators, including cloud-credential, report 4.1.0-0.nightly-2019-04-22-192604 and the upgrade reconciles cleanly.
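For anyone triaging a similar hang, the comparison captured in the gists can also be scripted. Below is a minimal sketch, assuming a recent github.com/openshift/client-go and a kubeconfig with access to the cluster; the program itself is illustrative, not tooling that was used for this report. It reads the ClusterVersion's desired version and prints every ClusterOperator whose reported "operator" version still lags it.

package main

import (
	"context"
	"fmt"
	"log"

	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig (the same credentials `oc` uses).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	// The CVO records the target release in the "version" ClusterVersion object.
	cv, err := client.ConfigV1().ClusterVersions().Get(ctx, "version", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	desired := cv.Status.Desired.Version

	// Flag every operator whose reported "operator" version differs from the target.
	cos, err := client.ConfigV1().ClusterOperators().List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, co := range cos.Items {
		for _, v := range co.Status.Versions {
			if v.Name == "operator" && v.Version != desired {
				fmt.Printf("%s still reports %s (want %s)\n", co.Name, v.Version, desired)
			}
		}
	}
}

Against the cluster above this would have printed only the cloud-credential line, matching what the gists show.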
Can you include logs from the cluster-version-operator? `oc logs deploy/cluster-version-operator -n openshift-cluster-version`
$ oc adm must-gather --dest-dir ...
Closing until this is reproduced; please re-open if this happens again so we can capture logs.
Saw it again by going through the following upgrade path:

1. Installed 4.1.0-0.nightly-2019-05-29-220142.
2. Upgraded to 4.1.0-0.nightly-2019-05-31-174150 using --force=true.
3. Downgraded to image 4.1.0-rc.7 using `oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:7e1e73c66702daa39223b3e6dd2cf5e15c057ef30c988256f55fae27448c3b01`.
4. The cloud-credential operator is still showing the pre-downgrade version:

pruan@fedora-vm ~/workspace/sandbox/bash_stuff $ oc get clusteroperators
NAME                                 VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                       4.1.0-rc.7                          True        False         False      52m
cloud-credential                     4.1.0-0.nightly-2019-05-31-174150   True        False         False      73m
cluster-autoscaler                   4.1.0-rc.7                          True        False         False      73m
console                              4.1.0-rc.7                          True        False         False      61m
dns                                  4.1.0-rc.7                          True        False         False      72m
image-registry                       4.1.0-rc.7                          True        False         False      64m
ingress                              4.1.0-rc.7                          True        False         False      64m
kube-apiserver                       4.1.0-rc.7                          True        False         False      70m
kube-controller-manager              4.1.0-rc.7                          True        False         False      70m
kube-scheduler                       4.1.0-rc.7                          True        False         False      70m
machine-api                          4.1.0-rc.7                          True        False         False      73m
machine-config                       4.1.0-rc.7                          True        False         False      72m
marketplace                          4.1.0-rc.7                          True        False         False      9m41s
monitoring                           4.1.0-rc.7                          True        False         False      62m
network                              4.1.0-rc.7                          True        False         False      72m
node-tuning                          4.1.0-rc.7                          True        False         False      10m
openshift-apiserver                  4.1.0-rc.7                          True        False         False      69m
openshift-controller-manager         4.1.0-rc.7                          True        False         False      72m
openshift-samples                    4.1.0-rc.7                          True        False         False      9m57s
operator-lifecycle-manager           4.1.0-rc.7                          True        False         False      72m
operator-lifecycle-manager-catalog   4.1.0-rc.7                          True        False         False      72m
service-ca                           4.1.0-rc.7                          True        False         False      72m
service-catalog-apiserver            4.1.0-rc.7                          True        False         False      68m
service-catalog-controller-manager   4.1.0-rc.7                          True        False         False      69m
storage                              4.1.0-rc.7                          True        False         False      10m

5. Delete the cloud-credential pod; after it comes back up, `oc get clusteroperator` shows the correct version:

NAME                                 VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                       4.1.0-rc.7   True        False         False      63m
cloud-credential                     4.1.0-rc.7   True        False         False      83m
cluster-autoscaler                   4.1.0-rc.7   True        False         False      84m
console                              4.1.0-rc.7   True        False         False      72m
dns                                  4.1.0-rc.7   True        False         False      83m
image-registry                       4.1.0-rc.7   True        False         False      75m
ingress                              4.1.0-rc.7   True        False         False      75m
kube-apiserver                       4.1.0-rc.7   True        False         False      81m
kube-controller-manager              4.1.0-rc.7   True        False         False      81m
kube-scheduler                       4.1.0-rc.7   True        False         False      80m
machine-api                          4.1.0-rc.7   True        False         False      83m
machine-config                       4.1.0-rc.7   True        False         False      83m
marketplace                          4.1.0-rc.7   True        False         False      20m
monitoring                           4.1.0-rc.7   True        False         False      73m
network                              4.1.0-rc.7   True        False         False      83m
node-tuning                          4.1.0-rc.7   True        False         False      21m
openshift-apiserver                  4.1.0-rc.7   True        False         False      80m
openshift-controller-manager         4.1.0-rc.7   True        False         False      82m
openshift-samples                    4.1.0-rc.7   True        False         False      20m
operator-lifecycle-manager           4.1.0-rc.7   True        False         False      82m
operator-lifecycle-manager-catalog   4.1.0-rc.7   True        False         False      83m
service-ca                           4.1.0-rc.7   True        False         False      83m
service-catalog-apiserver            4.1.0-rc.7   True        False         False      79m
service-catalog-controller-manager   4.1.0-rc.7   True        False         False      79m
storage                              4.1.0-rc.7   True        False         False      20m
*** Bug 1712775 has been marked as a duplicate of this bug. ***
The fix has been live in 4.2 for some time and appears to have corrected the problem. We were never able to reproduce it; however, we discovered we were not using leader election and developed the theory that this was the cause. We believe the new pod was coming up and setting the new version, but the old pod was still running for a bit and, on some clusters where the timing was right, would reset the status to the old version. This corresponds with some of the logs we've seen. After several weeks in 4.2 the problem appears to be fixed. 4.1 backport: https://github.com/openshift/cloud-credential-operator/pull/98
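For context, the leader-election pattern referred to above looks roughly like the following with client-go's leaderelection package. This is a generic sketch, not the cloud-credential-operator's actual code: the lease namespace/name and the runOperator stand-in are illustrative, and the linked PR contains the real change.

package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Each pod identifies itself; only the current lease holder runs the sync
	// loop that writes the operator's ClusterOperator status/version.
	id, _ := os.Hostname()
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Namespace: "openshift-cloud-credential-operator", // illustrative
			Name:      "cloud-credential-operator-leader",    // illustrative
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Only the leader reaches this point, so an old pod that is
				// shutting down cannot race with its replacement over status.
				runOperator(ctx)
			},
			OnStoppedLeading: func() {
				log.Fatal("lost leadership, exiting")
			},
		},
	})
}

// runOperator is a hypothetical stand-in for the operator's controller loop.
func runOperator(ctx context.Context) {
	<-ctx.Done()
}

With the lease in place, only one pod at a time can update the reported version, which closes exactly the race described above.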
I am not sure I am able to create a cluster with 250 worker nodes from my AWS account to test this. Alex, could you check it from your side?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2547