Bug 1707827
Summary: | cloud-credential-operator did not upgrade | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Alex Krzos <akrzos>
Component: | Cloud Credential Operator | Assignee: | Devan Goodwin <dgoodwin>
Status: | CLOSED ERRATA | QA Contact: | Oleg Nesterov <olnester>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.1.0 | CC: | aos-bugs, bleanhar, jokerman, mmccomas, pruan, sponnaga, wjiang, wking, xtian
Target Milestone: | --- | Keywords: | Reopened
Target Release: | 4.1.z | Flags: | akrzos: needinfo-
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | aos-scalability-41 | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2019-08-28 19:54:45 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Alex Krzos
2019-05-08 13:36:55 UTC
Can you include logs from the cluster-version-operator?

    oc logs deploy/cluster-version-operator -n openshift-cluster-version
    oc adm must-gather --dest-dir ...

Closing until this is reproduced; please re-open if this happens again so we can capture logs.

Saw it again by going through the following upgrade path:

1. Installed 4.1.0-0.nightly-2019-05-29-220142.
2. Upgraded to 4.1.0-0.nightly-2019-05-31-174150 using `--force=true`.
3. Downgraded to image 4.1.0-rc.7 with `oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:7e1e73c66702daa39223b3e6dd2cf5e15c057ef30c988256f55fae27448c3b01`.
4. The cloud-credential operator still shows the pre-downgrade image ID:

        pruan@fedora-vm ~/workspace/sandbox/bash_stuff $ oc get clusteroperators
        NAME                                 VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
        authentication                       4.1.0-rc.7                          True        False         False      52m
        cloud-credential                     4.1.0-0.nightly-2019-05-31-174150   True        False         False      73m
        cluster-autoscaler                   4.1.0-rc.7                          True        False         False      73m
        console                              4.1.0-rc.7                          True        False         False      61m
        dns                                  4.1.0-rc.7                          True        False         False      72m
        image-registry                       4.1.0-rc.7                          True        False         False      64m
        ingress                              4.1.0-rc.7                          True        False         False      64m
        kube-apiserver                       4.1.0-rc.7                          True        False         False      70m
        kube-controller-manager              4.1.0-rc.7                          True        False         False      70m
        kube-scheduler                       4.1.0-rc.7                          True        False         False      70m
        machine-api                          4.1.0-rc.7                          True        False         False      73m
        machine-config                       4.1.0-rc.7                          True        False         False      72m
        marketplace                          4.1.0-rc.7                          True        False         False      9m41s
        monitoring                           4.1.0-rc.7                          True        False         False      62m
        network                              4.1.0-rc.7                          True        False         False      72m
        node-tuning                          4.1.0-rc.7                          True        False         False      10m
        openshift-apiserver                  4.1.0-rc.7                          True        False         False      69m
        openshift-controller-manager         4.1.0-rc.7                          True        False         False      72m
        openshift-samples                    4.1.0-rc.7                          True        False         False      9m57s
        operator-lifecycle-manager           4.1.0-rc.7                          True        False         False      72m
        operator-lifecycle-manager-catalog   4.1.0-rc.7                          True        False         False      72m
        service-ca                           4.1.0-rc.7                          True        False         False      72m
        service-catalog-apiserver            4.1.0-rc.7                          True        False         False      68m
        service-catalog-controller-manager   4.1.0-rc.7                          True        False         False      69m
        storage                              4.1.0-rc.7                          True        False         False      10m

5. Delete the cloud-credential pod; after its resurrection, `oc get clusteroperator` shows the correct image:

        NAME                                 VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
        authentication                       4.1.0-rc.7   True        False         False      63m
        cloud-credential                     4.1.0-rc.7   True        False         False      83m
        cluster-autoscaler                   4.1.0-rc.7   True        False         False      84m
        console                              4.1.0-rc.7   True        False         False      72m
        dns                                  4.1.0-rc.7   True        False         False      83m
        image-registry                       4.1.0-rc.7   True        False         False      75m
        ingress                              4.1.0-rc.7   True        False         False      75m
        kube-apiserver                       4.1.0-rc.7   True        False         False      81m
        kube-controller-manager              4.1.0-rc.7   True        False         False      81m
        kube-scheduler                       4.1.0-rc.7   True        False         False      80m
        machine-api                          4.1.0-rc.7   True        False         False      83m
        machine-config                       4.1.0-rc.7   True        False         False      83m
        marketplace                          4.1.0-rc.7   True        False         False      20m
        monitoring                           4.1.0-rc.7   True        False         False      73m
        network                              4.1.0-rc.7   True        False         False      83m
        node-tuning                          4.1.0-rc.7   True        False         False      21m
        openshift-apiserver                  4.1.0-rc.7   True        False         False      80m
        openshift-controller-manager         4.1.0-rc.7   True        False         False      82m
        openshift-samples                    4.1.0-rc.7   True        False         False      20m
        operator-lifecycle-manager           4.1.0-rc.7   True        False         False      82m
        operator-lifecycle-manager-catalog   4.1.0-rc.7   True        False         False      83m
        service-ca                           4.1.0-rc.7   True        False         False      83m
        service-catalog-apiserver            4.1.0-rc.7   True        False         False      79m
        service-catalog-controller-manager   4.1.0-rc.7   True        False         False      79m
        storage                              4.1.0-rc.7   True        False         False      20m

*** Bug 1712775 has been marked as a duplicate of this bug. ***

Fix has been live in 4.2 for some time and appears to have corrected the problem. We were never able to reproduce it; however, we discovered we were not using leader election and developed the theory that this was causing the problem. We believe the new pod was coming up and setting the new version, but the old pod was still running for a bit and, on some clusters, if the timing was right, would reset it to the old version. This corresponds with some of the logs we've seen.
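The timing theory described in this report can be illustrated with a toy simulation. All names here are hypothetical; this is not the operator's actual code, just a sketch of a last-writer-wins race when two pods both sync status unconditionally:

```python
# Toy model of the race: without leader election, every operator pod
# writes the ClusterOperator version unconditionally, so an old pod that
# has not yet terminated can overwrite what the new pod just reported.

status = {"version": "4.1.0-0.nightly-2019-05-31-174150"}

def report_version(pod_version, status):
    # Every pod syncs status unconditionally -- last writer wins.
    status["version"] = pod_version

# The replacement pod (4.1.0-rc.7) comes up and reports its version...
report_version("4.1.0-rc.7", status)

# ...but the old pod is still running for a bit and, with unlucky timing,
# performs one last sync, resetting the reported version.
report_version("4.1.0-0.nightly-2019-05-31-174150", status)

print(status["version"])  # the pre-downgrade version sticks
```

This matches the symptom above: the stale pod's final write leaves `cloud-credential` reporting the pre-downgrade version until the pod is deleted and only one writer remains.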
After several weeks in 4.2 it appears the problem has been fixed. 4.1 backport in: https://github.com/openshift/cloud-credential-operator/pull/98

I am not sure that I am able to create a cluster with 250 worker nodes from my account on AWS to test it. Alex, could you check it from your side?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2547
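The 4.1 backport linked above addresses the missing leader election noted earlier in this report. A toy sketch (hypothetical names, not the operator's real code) of why a single leadership lease prevents a superseded pod from clobbering the status:

```python
# Toy model of the fix: only the pod currently holding the leadership
# lease may sync status, so a superseded pod's late write is refused.

status = {"version": "4.1.0-0.nightly-2019-05-31-174150"}
leader = "new-pod"  # the replacement pod has acquired the lease

def report_version(pod, pod_version, status):
    if pod != leader:
        return False  # a non-leader skips the status sync entirely
    status["version"] = pod_version
    return True

report_version("new-pod", "4.1.0-rc.7", status)  # leader writes the new version
report_version("old-pod", "4.1.0-0.nightly-2019-05-31-174150", status)  # stale pod refused

print(status["version"])  # the new version survives the old pod's attempt
```

In the real operator the lease would be a Kubernetes resource lock with renewal and fail-over, but the mutual-exclusion idea is the same: at most one pod writes at a time.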