2059716 – cloud-controller-manager flaps operator version during 4.9 -> 4.10 update

Bug 2059716 - cloud-controller-manager flaps operator version during 4.9 -> 4.10 update

Summary: cloud-controller-manager flaps operator version during 4.9 -> 4.10 update

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.11.0
Assignee:	Joel Speed
QA Contact:	sunzhaohua
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2079791
TreeView+	depends on / blocked

Reported:	2022-03-01 19:52 UTC by W. Trevor King
Modified:	2022-08-10 10:51 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: Multiple controllers reconcile the status of the cluster operator, not all of these were properly reading the cluster version Consequence: As different controller reconcile the status, they set the version to what they observe, which in some cases, was unknown Fix: Ensure all controllers have a consistent view of the release version Result: The release version is now stable on the cluster operator status
Clone Of:
Clones:	2079791 (view as bug list)
Environment:
Last Closed:	2022-08-10 10:51:27 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-cloud-controller-manager-operator pull 175	None	open	Bug 2059716: Ensure release version is injected into all controller status clients	2022-03-02 10:45:28 UTC
Github	openshift cluster-cloud-controller-manager-operator pull 176	None	open	Bug 2059716: Ensure release version is set on config sync controller	2022-03-03 08:46:27 UTC
Red Hat Product Errata	RHSA-2022:5069	None	None	None	2022-08-10 10:51:50 UTC

Description W. Trevor King 2022-03-01 19:52:19 UTC

In 4.9.23 -> 4.10.0-fc.6 CI [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1498127962409537536/artifacts/launch/openshift-e2e-test/build-log.txt | grep 'clusteroperator/.*versions.*operator' | tail
Feb 28 04:00:39.231 I clusteroperator/cloud-controller-manager versions: operator  -> 4.10.0-rc.6
Feb 28 04:00:39.294 I clusteroperator/cloud-controller-manager versions: operator 4.10.0-rc.6 -> 
Feb 28 04:00:39.313 I clusteroperator/cloud-controller-manager versions: operator  -> 4.10.0-rc.6
Feb 28 04:00:39.341 I clusteroperator/cloud-controller-manager versions: operator 4.10.0-rc.6 -> 
Feb 28 04:00:39.387 I clusteroperator/cloud-controller-manager versions: operator  -> 4.10.0-rc.6
Feb 28 04:01:01.795 I clusteroperator/cloud-controller-manager versions: operator 4.10.0-rc.6 -> 
Feb 28 04:01:01.821 I clusteroperator/cloud-controller-manager versions: operator  -> 4.10.0-rc.6
Feb 28 04:05:40.389 I clusteroperator/cloud-controller-manager versions: operator 4.10.0-rc.6 -> 
Feb 28 04:06:04.340 I clusteroperator/cloud-controller-manager versions: operator  -> 4.10.0-rc.6
Feb 28 04:15:24.654 I clusteroperator/machine-config versions: operator 4.9.23 -> 4.10.0-rc.6

Context on the broader OpenShift update:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1498127962409537536/artifacts/launch/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .state + " " + .version'
2022-02-28T03:15:47Z 2022-02-28T04:16:26Z Completed 4.10.0-rc.6
2022-02-28T02:56:11Z 2022-02-28T03:12:38Z Completed 4.9.23

Looking at the beginning of the controller update:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1498127962409537536/artifacts/launch/openshift-e2e-test/build-log.txt | grep clusteroperator/cloud-controller-manager
Feb 28 03:32:53.400 W clusteroperator/cloud-controller-manager condition/CloudConfigControllerAvailable status/True reason/AsExpected changed: Cloud Config Controller works as expected
Feb 28 03:32:53.400 W clusteroperator/cloud-controller-manager condition/CloudConfigControllerDegraded status/False reason/AsExpected changed: Cloud Config Controller works as expected
Feb 28 03:32:53.400 I clusteroperator/cloud-controller-manager versions: operator 4.9.23 -> 
Feb 28 03:32:53.435 W clusteroperator/cloud-controller-manager condition/TrustedCABundleControllerControllerAvailable status/True reason/AsExpected changed: Trusted CA Bundle Controller works as expected
Feb 28 03:32:53.435 W clusteroperator/cloud-controller-manager condition/TrustedCABundleControllerControllerDegraded status/False reason/AsExpected changed: Trusted CA Bundle Controller works as expected
Feb 28 03:33:26.372 I clusteroperator/cloud-controller-manager versions: operator  -> 4.10.0-rc.6
Feb 28 03:50:59.845 I clusteroperator/cloud-controller-manager versions: operator 4.10.0-rc.6 -> 
Feb 28 03:50:59.859 I clusteroperator/cloud-controller-manager versions: operator  -> 4.10.0-rc.6
...

So there's some excitement around 3:33 that looks like "the actual update" (although no Progressing=True [2] to inform the admins that we're rolling out new code?), and then there's the flapping between empty and rc.6 from 3:50 through 4:06.  Let's check out events:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1498127962409537536/artifacts/launch/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-cloud-controller-manager-operator")] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + (.involvedObject | .kind + " " + .name) + " " + .reason + ": " + .message' | grep 'ReplicaSet\|LeaderElection'
2022-02-28T02:56:58Z ReplicaSet cluster-cloud-controller-manager-operator-8484f6cbd5 SuccessfulCreate: Created pod: cluster-cloud-controller-manager-operator-8484f6cbd5-zw6lr
2022-02-28T02:56:58Z Deployment cluster-cloud-controller-manager-operator ScalingReplicaSet: Scaled up replica set cluster-cloud-controller-manager-operator-8484f6cbd5 to 1
2022-02-28T02:58:54Z ConfigMap cluster-cloud-controller-manager-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-1_b9568517-a208-4929-9a27-ee3fe431fdfc became leader
2022-02-28T02:58:54Z Lease cluster-cloud-controller-manager-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-1_b9568517-a208-4929-9a27-ee3fe431fdfc became leader
2022-02-28T02:58:55Z ConfigMap cluster-cloud-config-sync-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-1_abc9f3c1-d040-4adb-a6a6-8432a12e4a61 became leader
2022-02-28T02:58:55Z Lease cluster-cloud-config-sync-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-1_abc9f3c1-d040-4adb-a6a6-8432a12e4a61 became leader
2022-02-28T03:30:21Z ReplicaSet cluster-cloud-controller-manager-operator-8484f6cbd5 SuccessfulDelete: Deleted pod: cluster-cloud-controller-manager-operator-8484f6cbd5-zw6lr
2022-02-28T03:30:21Z Deployment cluster-cloud-controller-manager-operator ScalingReplicaSet: Scaled down replica set cluster-cloud-controller-manager-operator-8484f6cbd5 to 0
2022-02-28T03:30:22Z ReplicaSet cluster-cloud-controller-manager-operator-c94fbddfd SuccessfulCreate: Created pod: cluster-cloud-controller-manager-operator-c94fbddfd-hdssn
2022-02-28T03:30:22Z Deployment cluster-cloud-controller-manager-operator ScalingReplicaSet: Scaled up replica set cluster-cloud-controller-manager-operator-c94fbddfd to 1
2022-02-28T03:32:53Z ConfigMap cluster-cloud-config-sync-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-1_d69454eb-d4a6-4d97-b788-45f1472e8497 became leader
2022-02-28T03:32:53Z Lease cluster-cloud-config-sync-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-1_d69454eb-d4a6-4d97-b788-45f1472e8497 became leader
2022-02-28T03:33:26Z ConfigMap cluster-cloud-controller-manager-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-1_f4a204d9-cfbb-407a-a16f-0c259f2171c6 became leader
2022-02-28T03:33:26Z Lease cluster-cloud-controller-manager-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-1_f4a204d9-cfbb-407a-a16f-0c259f2171c6 became leader
2022-02-28T04:02:31Z ReplicaSet cluster-cloud-controller-manager-operator-c94fbddfd SuccessfulCreate: Created pod: cluster-cloud-controller-manager-operator-c94fbddfd-w8588
2022-02-28T04:05:40Z ConfigMap cluster-cloud-config-sync-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-0_78cf1c6d-26f3-48c0-9020-213dddbf97ef became leader
2022-02-28T04:05:40Z Lease cluster-cloud-config-sync-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-0_78cf1c6d-26f3-48c0-9020-213dddbf97ef became leader
2022-02-28T04:06:04Z ConfigMap cluster-cloud-controller-manager-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-0_f87ece62-4c52-4da6-ad5e-33d79dfcc2a9 became leader
2022-02-28T04:06:04Z Lease cluster-cloud-controller-manager-leader LeaderElection: ci-ln-g8fqpq2-72292-rrqpw-master-0_f87ece62-4c52-4da6-ad5e-33d79dfcc2a9 became leader

That's:

* 02:58, install stuff; ci-ln-g8fqpq2-72292-rrqpw-master-1_abc9f3c1-d040-4adb-a6a6-8432a12e4a61 is the leader
* 03:30, scaled down the original leader; scaled up an updated replacement
* 03:32, replacement becomes leader ci-ln-g8fqpq2-72292-rrqpw-master-1_d69454eb-d4a6-4d97-b788-45f1472e8497
* 03:33, ci-ln-g8fqpq2-72292-rrqpw-master-1_f4a204d9-cfbb-407a-a16f-0c259f2171c6 becomes the leader
* 04:0*, similar new pod, two sets of leader leases, presumably as a result of the machine-config node rolling.

But nothing in there that looks like a 3:50 instability trigger.  Dropping into Loki with:

  {invoker="openshift-internal-ci/release-openshift-origin-installer-launch-gcp-modern/1498127962409537536"} | unpack | namespace="openshift-cloud-controller-manager-operator" |= "leader"

includes:

  I0228 04:02:31.954483       1 logr.go:249] CCCMOConfigSyncControllers "msg"="Stopping and waiting for non leader election runnables"  
Show context
  I0228 04:02:31.954608       1 logr.go:249] CCCMOConfigSyncControllers "msg"="Stopping and waiting for leader election runnables"  
  I0228 04:02:31.893391       1 logr.go:249] CCMOperator "msg"="Stopping and waiting for non leader election runnables"  
  I0228 04:02:31.893572       1 logr.go:249] CCMOperator "msg"="Stopping and waiting for leader election runnables"  
  I0228 04:03:16.008611       1 leaderelection.go:248] attempting to acquire leader lease openshift-cloud-controller-manager-operator/cluster-cloud-config-sync-leader...
  I0228 04:03:13.641800       1 leaderelection.go:248] attempting to acquire leader lease openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-leader...
  I0228 04:05:40.018756       1 leaderelection.go:258] successfully acquired lease openshift-cloud-controller-manager-operator/cluster-cloud-config-sync-leader
  I0228 04:06:04.206329       1 leaderelection.go:258] successfully acquired lease openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-leader

that explains the apparent lease doublings in the event output (cluster-cloud-config-sync-leader and cluster-cloud-controller-manager-leader).  Still hunting for a 3:50 instability trigger:

  {invoker="openshift-internal-ci/release-openshift-origin-installer-launch-gcp-modern/1498127962409537536"} | unpack | namespace="openshift-cloud-controller-manager-operator" |~ " (03:4
[0-9]|03:50):"

gives:

  I0228 03:50:59.811374       1 cloud_config_sync_controller.go:37] /cluster emitted event, syncing cloud-conf ConfigMap
  I0228 03:50:59.811525       1 cloud_config_sync_controller.go:112] source and target cloud-config content are equal, no sync needed
  I0228 03:50:59.811587       1 cloud_config_sync_controller.go:229] Cloud Config Controller is available
  I0228 03:50:59.829192       1 clusteroperator_controller.go:253] FeatureGate cluster is not specifying external cloud provider requirement. Skipping...
  I0228 03:50:59.843663       1 clusteroperator_controller.go:253] FeatureGate cluster is not specifying external cloud provider requirement. Skipping...

That doesn't sound too terrible to me, but it's definitely the same time that the flapping starts.  I dunno...

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1498127962409537536
[2]: https://github.com/openshift/api/blob/cc0db1116639638254e87a564902833f1ee006d5/config/v1/types_cluster_operator.go#L152-L159

Comment 1 Michael McCune 2022-03-01 20:06:00 UTC

hey Trevor, just curious, did the cluster have the feature gate enabled on the 4.9 cluster to enable the CCMs or is this an infrastructure platform where it is enabled by default?

i'm wondering because, i /thought/ we did not support upgrades for FG enabled CCMs at the moment.

Comment 2 W. Trevor King 2022-03-01 20:10:47 UTC

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1498127962409537536/artifacts/launch/gather-must-gather/artifacts/must-gather.tar | tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-df3b0ec40395ea460fbf2728ca7adff79dbaebddcffce003e88b3fb9cb2c9759/cluster-scoped-resources/config.openshift.io/featuregates.yaml
---
apiVersion: config.openshift.io/v1
items:
- apiVersion: config.openshift.io/v1
  kind: FeatureGate
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
      release.openshift.io/create-only: "true"
    creationTimestamp: "2022-02-28T02:56:37Z"
    generation: 1
    name: cluster
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 27b8563a-05bb-4eb4-a324-e4500f56ca68
    resourceVersion: "1665"
    uid: eeb46f8a-785c-437b-81d4-9ed0e0c331c1
  spec: {}
kind: FeatureGateList
metadata:
  continue: ""
  resourceVersion: "69957"

No FeatureGate spec value set, creationTimestamp at 2:56, and generation still 1, so I don't think anything was twiddling with this during the CI run.  It's possible the controller had a hiccuped Kube-API connection while following its informer or something, though?

Comment 3 Joel Speed 2022-03-02 10:34:45 UTC

So the issue is that we have multiple controllers managing the clusteroperator, and, they aren't configured identically. 

The main operator has the release version injected: https://github.com/openshift/cluster-cloud-controller-manager-operator/blob/1ee6073d7e06fa30b4bab4090c1fc1f164cde8db/cmd/cluster-cloud-controller-manager-operator/main.go#L136-L141

Where the config sync controller does not: https://github.com/openshift/cluster-cloud-controller-manager-operator/blob/1ee6073d7e06fa30b4bab4090c1fc1f164cde8db/cmd/config-sync-controllers/main.go#L120-L124

When the config sync controller runs, it clears the release version, when the main operator runs, it sets it back again.

Comment 4 Joel Speed 2022-03-02 10:44:53 UTC

Setting this to blocker+ as it will produce unstable clusteroperator version reporting which will interfere with upgrades, we will need to backport this to 4.10.0

Comment 6 W. Trevor King 2022-03-03 07:37:25 UTC

Fix landed in 4.11.0-0.ci-2022-03-02-205446 [1].  Looking at a 4.11.0-0.ci-2022-03-02-205446 -> 4.11.0-0.ci-2022-03-03-031508 job [2]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1499224549961502720/artifacts/e2e-gcp-upgrade/openshift-e2e-test/build-log.txt | grep 'clusteroperator/.*versions.*operator' | tail
Mar 03 05:41:34.584 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci-2022-03-03-031508 -> 
Mar 03 05:41:34.612 I clusteroperator/cloud-controller-manager versions: operator  -> 4.11.0-0.ci-2022-03-03-031508
Mar 03 05:41:34.783 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci-2022-03-03-031508 -> 
Mar 03 05:41:34.809 I clusteroperator/cloud-controller-manager versions: operator  -> 4.11.0-0.ci-2022-03-03-031508
Mar 03 05:41:34.983 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci-2022-03-03-031508 -> 
Mar 03 05:41:35.003 I clusteroperator/cloud-controller-manager versions: operator  -> 4.11.0-0.ci-2022-03-03-031508
Mar 03 05:41:35.233 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci-2022-03-03-031508 -> 
Mar 03 05:41:35.255 I clusteroperator/cloud-controller-manager versions: operator  -> 4.11.0-0.ci-2022-03-03-031508
Mar 03 05:44:39.635 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci-2022-03-03-031508 -> 
Mar 03 05:44:39.788 I clusteroperator/cloud-controller-manager versions: operator  -> 4.11.0-0.ci-2022-03-03-031508

Hrm.  Confirming the commits in each of those releases just in case:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1499224549961502720/artifacts/release/artifacts/release-images-initial | jq -r '.spec.tags[] | select(.name == "cluster-cloud-controller-manager-operator").annotations["io.openshift.build.commit.id"]'
14b9550e346f3786d380dd19bebd4c8ccbd6f885
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1499224549961502720/artifacts/release/artifacts/release-images-latest | jq -r '.spec.tags[] | select(.name == "cluster-cloud-controller-manager-operator").annotations["io.openshift.build.commit.id"]'
14b9550e346f3786d380dd19bebd4c8ccbd6f885

which is the PR landing [3].  So I'm punting this back to ASSIGNED, but feel free to reverse my change if I'm not making a coherent argument.

[1]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.11.0-0.ci/release/4.11.0-0.ci-2022-03-02-205446
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1499224549961502720
[3]: https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/175

Comment 9 sunzhaohua 2022-03-07 09:50:54 UTC

Verfied
Check ci https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_clus[…]-azure-upgrade/openshift-e2e-test/build-log.txt
the flapping no longer keeps appearing

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-cloud-controller-manager-operator/176/pull-ci-openshift-cluster-cloud-controller-manager-operator-master-e2e-azure-upgrade/1499615350587658240/artifacts/e2e-azure-upgrade/openshift-e2e-test/build-log.txt | grep clusteroperator/cloud-controller-manager 
Mar 04 06:17:55.460 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci.test-2022-03-04-051909-ci-op-gdjqfjpj-initial -> 
Mar 04 06:17:55.472 I clusteroperator/cloud-controller-manager versions: operator  -> 4.11.0-0.ci.test-2022-03-04-051909-ci-op-gdjqfjpj-initial
Mar 04 06:18:45.426 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci.test-2022-03-04-051909-ci-op-gdjqfjpj-initial -> 
Mar 04 06:18:45.440 I clusteroperator/cloud-controller-manager versions: operator  -> 4.11.0-0.ci.test-2022-03-04-051909-ci-op-gdjqfjpj-initial
Mar 04 06:22:18.886 I clusteroperator/cloud-controller-manager versions: operator 4.11.0-0.ci.test-2022-03-04-051909-ci-op-gdjqfjpj-initial -> 4.11.0-0.ci.test-2022-03-04-052257-ci-op-gdjqfjpj-latest

Comment 11 errata-xmlrpc 2022-08-10 10:51:27 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Note You need to log in before you can comment on or make changes to this bug.