Bug 2059716
Summary: | cloud-controller-manager flaps operator version during 4.9 -> 4.10 update | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> | |
Component: | Cloud Compute | Assignee: | Joel Speed <jspeed> | |
Cloud Compute sub component: | Cloud Controller Manager | QA Contact: | sunzhaohua <zhsun> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | urgent | |||
Priority: | urgent | CC: | aos-bugs, mimccune | |
Version: | 4.10 | Keywords: | Upgrades | |
Target Milestone: | --- | |||
Target Release: | 4.11.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: |
Cause: Multiple controllers reconcile the status of the cluster operator, and not all of them were properly reading the cluster version
Consequence: As the different controllers reconciled the status, each set the version to what it observed, which in some cases was unknown
Fix: Ensure all controllers have a consistent view of the release version
Result: The release version is now stable on the cluster operator status
|
Story Points: | --- | |
Clone Of: | ||||
: | 2079791 (view as bug list) | Environment: | ||
Last Closed: | 2022-08-10 10:51:27 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 2079791 |
Description
W. Trevor King
2022-03-01 19:52:19 UTC
hey Trevor, just curious, did the cluster have the feature gate enabled on the 4.9 cluster to enable the CCMs or is this an infrastructure platform where it is enabled by default? i'm wondering because, i /thought/ we did not support upgrades for FG enabled CCMs at the moment.

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1498127962409537536/artifacts/launch/gather-must-gather/artifacts/must-gather.tar | tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-df3b0ec40395ea460fbf2728ca7adff79dbaebddcffce003e88b3fb9cb2c9759/cluster-scoped-resources/config.openshift.io/featuregates.yaml
---
apiVersion: config.openshift.io/v1
items:
- apiVersion: config.openshift.io/v1
  kind: FeatureGate
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
      release.openshift.io/create-only: "true"
    creationTimestamp: "2022-02-28T02:56:37Z"
    generation: 1
    name: cluster
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 27b8563a-05bb-4eb4-a324-e4500f56ca68
    resourceVersion: "1665"
    uid: eeb46f8a-785c-437b-81d4-9ed0e0c331c1
  spec: {}
kind: FeatureGateList
metadata:
  continue: ""
  resourceVersion: "69957"

No FeatureGate spec value set, creationTimestamp at 2:56, and generation still 1, so I don't think anything was twiddling with this during the CI run. It's possible the controller had a hiccuped Kube-API connection while following its informer or something, though?

So the issue is that we have multiple controllers managing the clusteroperator, and, they aren't configured identically.
The main operator has the release version injected:
https://github.com/openshift/cluster-cloud-controller-manager-operator/blob/1ee6073d7e06fa30b4bab4090c1fc1f164cde8db/cmd/cluster-cloud-controller-manager-operator/main.go#L136-L141

Whereas the config sync controller does not:
https://github.com/openshift/cluster-cloud-controller-manager-operator/blob/1ee6073d7e06fa30b4bab4090c1fc1f164cde8db/cmd/config-sync-controllers/main.go#L120-L124

When the config sync controller runs, it clears the release version; when the main operator runs, it sets it back again.

Setting this to blocker+ as it will produce unstable clusteroperator version reporting, which will interfere with upgrades. We will need to backport this to 4.10.0.

Fix landed in 4.11.0-0.ci-2022-03-02-205446 [1]. Looking at a 4.11.0-0.ci-2022-03-02-205446 -> 4.11.0-0.ci-2022-03-03-031508 job [2]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1499224549961502720/artifacts/e2e-gcp-upgrade/openshift-e2e-test/build-log.txt | grep 'clusteroperator/.*versions.*operator' | tail
Mar 03 05:41:34.584 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci-2022-03-03-031508 ->
Mar 03 05:41:34.612 I clusteroperator/cloud-controller-manager versions: operator -> 4.11.0-0.ci-2022-03-03-031508
Mar 03 05:41:34.783 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci-2022-03-03-031508 ->
Mar 03 05:41:34.809 I clusteroperator/cloud-controller-manager versions: operator -> 4.11.0-0.ci-2022-03-03-031508
Mar 03 05:41:34.983 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci-2022-03-03-031508 ->
Mar 03 05:41:35.003 I clusteroperator/cloud-controller-manager versions: operator -> 4.11.0-0.ci-2022-03-03-031508
Mar 03 05:41:35.233 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci-2022-03-03-031508 ->
Mar 03 05:41:35.255 I clusteroperator/cloud-controller-manager versions: operator -> 4.11.0-0.ci-2022-03-03-031508
Mar 03 05:44:39.635 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci-2022-03-03-031508 ->
Mar 03 05:44:39.788 I clusteroperator/cloud-controller-manager versions: operator -> 4.11.0-0.ci-2022-03-03-031508

Hrm. Confirming the commits in each of those releases just in case:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1499224549961502720/artifacts/release/artifacts/release-images-initial | jq -r '.spec.tags[] | select(.name == "cluster-cloud-controller-manager-operator").annotations["io.openshift.build.commit.id"]'
14b9550e346f3786d380dd19bebd4c8ccbd6f885
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1499224549961502720/artifacts/release/artifacts/release-images-latest | jq -r '.spec.tags[] | select(.name == "cluster-cloud-controller-manager-operator").annotations["io.openshift.build.commit.id"]'
14b9550e346f3786d380dd19bebd4c8ccbd6f885

which is the PR landing [3]. So I'm punting this back to ASSIGNED, but feel free to reverse my change if I'm not making a coherent argument.

[1]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.11.0-0.ci/release/4.11.0-0.ci-2022-03-02-205446
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1499224549961502720
[3]: https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/175

Verified.

Check CI: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_clus[…]-azure-upgrade/openshift-e2e-test/build-log.txt
The flapping no longer keeps appearing.

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-cloud-controller-manager-operator/176/pull-ci-openshift-cluster-cloud-controller-manager-operator-master-e2e-azure-upgrade/1499615350587658240/artifacts/e2e-azure-upgrade/openshift-e2e-test/build-log.txt | grep clusteroperator/cloud-controller-manager
Mar 04 06:17:55.460 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci.test-2022-03-04-051909-ci-op-gdjqfjpj-initial ->
Mar 04 06:17:55.472 I clusteroperator/cloud-controller-manager versions: operator -> 4.11.0-0.ci.test-2022-03-04-051909-ci-op-gdjqfjpj-initial
Mar 04 06:18:45.426 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci.test-2022-03-04-051909-ci-op-gdjqfjpj-initial ->
Mar 04 06:18:45.440 I clusteroperator/cloud-controller-manager versions: operator -> 4.11.0-0.ci.test-2022-03-04-051909-ci-op-gdjqfjpj-initial
Mar 04 06:22:18.886 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci.test-2022-03-04-051909-ci-op-gdjqfjpj-initial -> 4.11.0-0.ci.test-2022-03-04-052257-ci-op-gdjqfjpj-latest

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069 |