In 4.9.23 -> 4.10.0-fc.6 CI [1]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1498127962409537536/artifacts/launch/openshift-e2e-test/build-log.txt | grep 'clusteroperator/.*versions.*operator' | tail Feb 28 04:00:39.231 I clusteroperator/cloud-controller-manager versions: operator -> 4.10.0-rc.6 Feb 28 04:00:39.294 I clusteroperator/cloud-controller-manager versions: operator 4.10.0-rc.6 -> Feb 28 04:00:39.313 I clusteroperator/cloud-controller-manager versions: operator -> 4.10.0-rc.6 Feb 28 04:00:39.341 I clusteroperator/cloud-controller-manager versions: operator 4.10.0-rc.6 -> Feb 28 04:00:39.387 I clusteroperator/cloud-controller-manager versions: operator -> 4.10.0-rc.6 Feb 28 04:01:01.795 I clusteroperator/cloud-controller-manager versions: operator 4.10.0-rc.6 -> Feb 28 04:01:01.821 I clusteroperator/cloud-controller-manager versions: operator -> 4.10.0-rc.6 Feb 28 04:05:40.389 I clusteroperator/cloud-controller-manager versions: operator 4.10.0-rc.6 -> Feb 28 04:06:04.340 I clusteroperator/cloud-controller-manager versions: operator -> 4.10.0-rc.6 Feb 28 04:15:24.654 I clusteroperator/machine-config versions: operator 4.9.23 -> 4.10.0-rc.6 Context on the broader OpenShift update: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1498127962409537536/artifacts/launch/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .state + " " + .version' 2022-02-28T03:15:47Z 2022-02-28T04:16:26Z Completed 4.10.0-rc.6 2022-02-28T02:56:11Z 2022-02-28T03:12:38Z Completed 4.9.23 Looking at the beginning of the controller update: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1498127962409537536/artifacts/launch/openshift-e2e-test/build-log.txt | grep clusteroperator/cloud-controller-manager Feb 28 03:32:53.400 W clusteroperator/cloud-controller-manager condition/CloudConfigControllerAvailable status/True reason/AsExpected changed: Cloud Config Controller works as expected Feb 28 03:32:53.400 W clusteroperator/cloud-controller-manager condition/CloudConfigControllerDegraded status/False reason/AsExpected changed: Cloud Config Controller works as expected Feb 28 03:32:53.400 I clusteroperator/cloud-controller-manager versions: operator 4.9.23 -> Feb 28 03:32:53.435 W clusteroperator/cloud-controller-manager condition/TrustedCABundleControllerControllerAvailable status/True reason/AsExpected changed: Trusted CA Bundle Controller works as expected Feb 28 03:32:53.435 W clusteroperator/cloud-controller-manager condition/TrustedCABundleControllerControllerDegraded status/False reason/AsExpected changed: Trusted CA Bundle Controller works as expected Feb 28 03:33:26.372 I clusteroperator/cloud-controller-manager versions: operator -> 4.10.0-rc.6 Feb 28 03:50:59.845 I clusteroperator/cloud-controller-manager versions: operator 4.10.0-rc.6 -> Feb 28 03:50:59.859 I clusteroperator/cloud-controller-manager versions: operator -> 4.10.0-rc.6 ... So there's some excitement around 3:33 that looks like "the actual update" (although no Progressing=True [2] to inform the admins that we're rolling out new code?), and then there's the flapping between empty and rc.6 from 3:50 through 4:06. Let's check out events: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1498127962409537536/artifacts/launch/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-cloud-controller-manager-operator")] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + (.involvedObject | .kind + " " + .name) + " " + .reason + ": " + .message' | grep 'ReplicaSet\|LeaderElection' 2022-02-28T02:56:58Z ReplicaSet cluster-cloud-controller-manager-operator-8484f6cbd5 SuccessfulCreate: Created pod: cluster-cloud-controller-manager-operator-8484f6cbd5-zw6lr 2022-02-28T02:56:58Z Deployment cluster-cloud-controller-manager-operator ScalingReplicaSet: Scaled up replica set cluster-cloud-controller-manager-operator-8484f6cbd5 to 1 2022-02-28T02:58:54Z ConfigMap cluster-cloud-controller-manager-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-1_b9568517-a208-4929-9a27-ee3fe431fdfc became leader 2022-02-28T02:58:54Z Lease cluster-cloud-controller-manager-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-1_b9568517-a208-4929-9a27-ee3fe431fdfc became leader 2022-02-28T02:58:55Z ConfigMap cluster-cloud-config-sync-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-1_abc9f3c1-d040-4adb-a6a6-8432a12e4a61 became leader 2022-02-28T02:58:55Z Lease cluster-cloud-config-sync-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-1_abc9f3c1-d040-4adb-a6a6-8432a12e4a61 became leader 2022-02-28T03:30:21Z ReplicaSet cluster-cloud-controller-manager-operator-8484f6cbd5 SuccessfulDelete: Deleted pod: cluster-cloud-controller-manager-operator-8484f6cbd5-zw6lr 2022-02-28T03:30:21Z Deployment cluster-cloud-controller-manager-operator ScalingReplicaSet: Scaled down replica set cluster-cloud-controller-manager-operator-8484f6cbd5 to 0 2022-02-28T03:30:22Z ReplicaSet cluster-cloud-controller-manager-operator-c94fbddfd SuccessfulCreate: Created pod: cluster-cloud-controller-manager-operator-c94fbddfd-hdssn 2022-02-28T03:30:22Z Deployment cluster-cloud-controller-manager-operator ScalingReplicaSet: Scaled up replica set cluster-cloud-controller-manager-operator-c94fbddfd to 1 2022-02-28T03:32:53Z ConfigMap cluster-cloud-config-sync-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-1_d69454eb-d4a6-4d97-b788-45f1472e8497 became leader 2022-02-28T03:32:53Z Lease cluster-cloud-config-sync-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-1_d69454eb-d4a6-4d97-b788-45f1472e8497 became leader 2022-02-28T03:33:26Z ConfigMap cluster-cloud-controller-manager-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-1_f4a204d9-cfbb-407a-a16f-0c259f2171c6 became leader 2022-02-28T03:33:26Z Lease cluster-cloud-controller-manager-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-1_f4a204d9-cfbb-407a-a16f-0c259f2171c6 became leader 2022-02-28T04:02:31Z ReplicaSet cluster-cloud-controller-manager-operator-c94fbddfd SuccessfulCreate: Created pod: cluster-cloud-controller-manager-operator-c94fbddfd-w8588 2022-02-28T04:05:40Z ConfigMap cluster-cloud-config-sync-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-0_78cf1c6d-26f3-48c0-9020-213dddbf97ef became leader 2022-02-28T04:05:40Z Lease cluster-cloud-config-sync-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-0_78cf1c6d-26f3-48c0-9020-213dddbf97ef became leader 2022-02-28T04:06:04Z ConfigMap cluster-cloud-controller-manager-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-0_f87ece62-4c52-4da6-ad5e-33d79dfcc2a9 became leader 2022-02-28T04:06:04Z Lease cluster-cloud-controller-manager-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-0_f87ece62-4c52-4da6-ad5e-33d79dfcc2a9 became leader That's: * 02:58, install stuff; ci-ln-g8fqpq2-72292-rrqpw-master-1_abc9f3c1-d040-4adb-a6a6-8432a12e4a61 is the leader * 03:30, scaled down the original leader; scaled up an updated replacement * 03:32, replacement becomes leader ci-ln-g8fqpq2-72292-rrqpw-master-1_d69454eb-d4a6-4d97-b788-45f1472e8497 * 03:33, ci-ln-g8fqpq2-72292-rrqpw-master-1_f4a204d9-cfbb-407a-a16f-0c259f2171c6 becomes the leader * 04:0*, similar new pod, two sets of leader leases, presumably as a result of the machine-config node rolling. But nothing in there that looks like a 3:50 instability trigger. Dropping into Loki with: {invoker="openshift-internal-ci/release-openshift-origin-installer-launch-gcp-modern/1498127962409537536"} | unpack | namespace="openshift-cloud-controller-manager-operator" |= "leader" includes: I0228 04:02:31.954483 1 logr.go:249] CCCMOConfigSyncControllers "msg"="Stopping and waiting for non leader election runnables" Show context I0228 04:02:31.954608 1 logr.go:249] CCCMOConfigSyncControllers "msg"="Stopping and waiting for leader election runnables" I0228 04:02:31.893391 1 logr.go:249] CCMOperator "msg"="Stopping and waiting for non leader election runnables" I0228 04:02:31.893572 1 logr.go:249] CCMOperator "msg"="Stopping and waiting for leader election runnables" I0228 04:03:16.008611 1 leaderelection.go:248] attempting to acquire leader lease openshift-cloud-controller-manager-operator/cluster-cloud-config-sync-leader... I0228 04:03:13.641800 1 leaderelection.go:248] attempting to acquire leader lease openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-leader... I0228 04:05:40.018756 1 leaderelection.go:258] successfully acquired lease openshift-cloud-controller-manager-operator/cluster-cloud-config-sync-leader I0228 04:06:04.206329 1 leaderelection.go:258] successfully acquired lease openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-leader that explains the apparent lease doublings in the event output (cluster-cloud-config-sync-leader and cluster-cloud-controller-manager-leader). Still hunting for a 3:50 instability trigger: {invoker="openshift-internal-ci/release-openshift-origin-installer-launch-gcp-modern/1498127962409537536"} | unpack | namespace="openshift-cloud-controller-manager-operator" |~ " (03:4 [0-9]|03:50):" gives: I0228 03:50:59.811374 1 cloud_config_sync_controller.go:37] /cluster emitted event, syncing cloud-conf ConfigMap I0228 03:50:59.811525 1 cloud_config_sync_controller.go:112] source and target cloud-config content are equal, no sync needed I0228 03:50:59.811587 1 cloud_config_sync_controller.go:229] Cloud Config Controller is available I0228 03:50:59.829192 1 clusteroperator_controller.go:253] FeatureGate cluster is not specifying external cloud provider requirement. Skipping... I0228 03:50:59.843663 1 clusteroperator_controller.go:253] FeatureGate cluster is not specifying external cloud provider requirement. Skipping... That doesn't sound too terrible to me, but it's definitely the same time that the flapping starts. I dunno... [1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1498127962409537536 [2]: https://github.com/openshift/api/blob/cc0db1116639638254e87a564902833f1ee006d5/config/v1/types_cluster_operator.go#L152-L159
hey Trevor, just curious, did the cluster have the feature gate enabled on the 4.9 cluster to enable the CCMs or is this an infrastructure platform where it is enabled by default? i'm wondering because, i /thought/ we did not support upgrades for FG enabled CCMs at the moment.
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1498127962409537536/artifacts/launch/gather-must-gather/artifacts/must-gather.tar | tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-df3b0ec40395ea460fbf2728ca7adff79dbaebddcffce003e88b3fb9cb2c9759/cluster-scoped-resources/config.openshift.io/featuregates.yaml --- apiVersion: config.openshift.io/v1 items: - apiVersion: config.openshift.io/v1 kind: FeatureGate metadata: annotations: include.release.openshift.io/ibm-cloud-managed: "true" include.release.openshift.io/self-managed-high-availability: "true" include.release.openshift.io/single-node-developer: "true" release.openshift.io/create-only: "true" creationTimestamp: "2022-02-28T02:56:37Z" generation: 1 name: cluster ownerReferences: - apiVersion: config.openshift.io/v1 kind: ClusterVersion name: version uid: 27b8563a-05bb-4eb4-a324-e4500f56ca68 resourceVersion: "1665" uid: eeb46f8a-785c-437b-81d4-9ed0e0c331c1 spec: {} kind: FeatureGateList metadata: continue: "" resourceVersion: "69957" No FeatureGate spec value set, creationTimestamp at 2:56, and generation still 1, so I don't think anything was twiddling with this during the CI run. It's possible the controller had a hiccuped Kube-API connection while following its informer or something, though?
So the issue is that we have multiple controllers managing the clusteroperator, and, they aren't configured identically. The main operator has the release version injected: https://github.com/openshift/cluster-cloud-controller-manager-operator/blob/1ee6073d7e06fa30b4bab4090c1fc1f164cde8db/cmd/cluster-cloud-controller-manager-operator/main.go#L136-L141 Where the config sync controller does not: https://github.com/openshift/cluster-cloud-controller-manager-operator/blob/1ee6073d7e06fa30b4bab4090c1fc1f164cde8db/cmd/config-sync-controllers/main.go#L120-L124 When the config sync controller runs, it clears the release version, when the main operator runs, it sets it back again.
Setting this to blocker+ as it will produce unstable clusteroperator version reporting which will interfere with upgrades, we will need to backport this to 4.10.0
Fix landed in 4.11.0-0.ci-2022-03-02-205446 [1]. Looking at a 4.11.0-0.ci-2022-03-02-205446 -> 4.11.0-0.ci-2022-03-03-031508 job [2]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1499224549961502720/artifacts/e2e-gcp-upgrade/openshift-e2e-test/build-log.txt | grep 'clusteroperator/.*versions.*operator' | tail Mar 03 05:41:34.584 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci-2022-03-03-031508 -> Mar 03 05:41:34.612 I clusteroperator/cloud-controller-manager versions: operator -> 4.11.0-0.ci-2022-03-03-031508 Mar 03 05:41:34.783 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci-2022-03-03-031508 -> Mar 03 05:41:34.809 I clusteroperator/cloud-controller-manager versions: operator -> 4.11.0-0.ci-2022-03-03-031508 Mar 03 05:41:34.983 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci-2022-03-03-031508 -> Mar 03 05:41:35.003 I clusteroperator/cloud-controller-manager versions: operator -> 4.11.0-0.ci-2022-03-03-031508 Mar 03 05:41:35.233 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci-2022-03-03-031508 -> Mar 03 05:41:35.255 I clusteroperator/cloud-controller-manager versions: operator -> 4.11.0-0.ci-2022-03-03-031508 Mar 03 05:44:39.635 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci-2022-03-03-031508 -> Mar 03 05:44:39.788 I clusteroperator/cloud-controller-manager versions: operator -> 4.11.0-0.ci-2022-03-03-031508 Hrm. Confirming the commits in each of those releases just in case: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1499224549961502720/artifacts/release/artifacts/release-images-initial | jq -r '.spec.tags[] | select(.name == "cluster-cloud-controller-manager-operator").annotations["io.openshift.build.commit.id"]' 14b9550e346f3786d380dd19bebd4c8ccbd6f885 $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1499224549961502720/artifacts/release/artifacts/release-images-latest | jq -r '.spec.tags[] | select(.name == "cluster-cloud-controller-manager-operator").annotations["io.openshift.build.commit.id"]' 14b9550e346f3786d380dd19bebd4c8ccbd6f885 which is the PR landing [3]. So I'm punting this back to ASSIGNED, but feel free to reverse my change if I'm not making a coherent argument. [1]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.11.0-0.ci/release/4.11.0-0.ci-2022-03-02-205446 [2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1499224549961502720 [3]: https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/175
Verfied Check ci https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_clus[…]-azure-upgrade/openshift-e2e-test/build-log.txt the flapping no longer keeps appearing $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-cloud-controller-manager-operator/176/pull-ci-openshift-cluster-cloud-controller-manager-operator-master-e2e-azure-upgrade/1499615350587658240/artifacts/e2e-azure-upgrade/openshift-e2e-test/build-log.txt | grep clusteroperator/cloud-controller-manager Mar 04 06:17:55.460 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci.test-2022-03-04-051909-ci-op-gdjqfjpj-initial -> Mar 04 06:17:55.472 I clusteroperator/cloud-controller-manager versions: operator -> 4.11.0-0.ci.test-2022-03-04-051909-ci-op-gdjqfjpj-initial Mar 04 06:18:45.426 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci.test-2022-03-04-051909-ci-op-gdjqfjpj-initial -> Mar 04 06:18:45.440 I clusteroperator/cloud-controller-manager versions: operator -> 4.11.0-0.ci.test-2022-03-04-051909-ci-op-gdjqfjpj-initial Mar 04 06:22:18.886 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci.test-2022-03-04-051909-ci-op-gdjqfjpj-initial -> 4.11.0-0.ci.test-2022-03-04-052257-ci-op-gdjqfjpj-latest
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069