Bug 2117033 - Cluster-version operator ClusterOperator checks are unnecessarily slow on update
Summary: Cluster-version operator ClusterOperator checks are unnecessarily slow on update
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.12
Hardware: Unspecified
OS: Unspecified
medium
low
Target Milestone: ---
Target Release: 4.12.0
Assignee: W. Trevor King
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-08-09 19:15 UTC by Stephen Benjamin
Modified: 2023-01-17 19:54 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-17 19:54:37 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 818 0 None open Bug 2117033: pkg/cvo/sync_worker: Trigger new sync round on ClusterOperator versions[name=operator] changes 2022-08-09 22:27:18 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:54:54 UTC

Description Stephen Benjamin 2022-08-09 19:15:19 UTC
We're seeing a slight uptick in how long upgrades are taking [1][2]. We are not 100% sure of the cause, but it looks like it started with 4.11 rc.7. There are no obvious culprits in the diff [3].

Looking at some of the jobs, we are seeing that the gap between kube-scheduler being updated and machine-api being updated appears to take longer. An example job run [4] shows 10+ minutes of waiting.

TRT had a debugging session, and we have two suggestions:

- Add logging around when the CVO sees an operator's version change.

- Instead of a fixed 5-minute polling interval (which is what we think the CVO is doing), would it be possible to trigger on ClusterOperator changes to know when to look again? We think there could be some substantial savings on upgrade time by doing this. (A minimal sketch of this event-triggered approach follows the links below.)


[1] https://search.ci.openshift.org/graph/metrics?metric=job%3Aduration%3Atotal%3Aseconds&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-aws-sdn-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-ovn-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-sdn-upgrade
[2] https://sippy.dptools.openshift.org/sippy-ng/tests/4.12/analysis?test=Cluster%20upgrade.%5Bsig-cluster-lifecycle%5D%20cluster%20upgrade%20should%20complete%20in%2075.00%20minutes
[3] https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.11.0-rc.7
[4] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-azure-sdn-upgrade/1556865989923049472
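
The second suggestion above amounts to replacing the fixed poll with an event-driven wake-up. Below is a minimal sketch of that idea; it is not the actual cluster-version-operator code, and the kubeconfig wiring, the notifyCh channel, and the operatorVersion helper are illustrative assumptions. It watches ClusterOperator objects with a dynamic informer and nudges a stubbed sync loop whenever status.versions[name=operator] changes:

  // Minimal sketch, not the actual CVO code: watch ClusterOperator objects with a
  // dynamic informer and notify a sync loop when status.versions[name=operator]
  // changes.  The notifyCh channel and operatorVersion helper are hypothetical.
  package main

  import (
      "context"
      "fmt"
      "time"

      "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
      "k8s.io/apimachinery/pkg/runtime/schema"
      "k8s.io/client-go/dynamic"
      "k8s.io/client-go/dynamic/dynamicinformer"
      "k8s.io/client-go/tools/cache"
      "k8s.io/client-go/tools/clientcmd"
  )

  // operatorVersion extracts status.versions[name=operator] from a ClusterOperator.
  func operatorVersion(co *unstructured.Unstructured) string {
      versions, _, _ := unstructured.NestedSlice(co.Object, "status", "versions")
      for _, v := range versions {
          if m, ok := v.(map[string]interface{}); ok && m["name"] == "operator" {
              version, _ := m["version"].(string)
              return version
          }
      }
      return ""
  }

  func main() {
      cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
      if err != nil {
          panic(err)
      }
      client := dynamic.NewForConfigOrDie(cfg)
      gvr := schema.GroupVersionResource{Group: "config.openshift.io", Version: "v1", Resource: "clusteroperators"}
      factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 10*time.Minute)
      notifyCh := make(chan string, 1)

      factory.ForResource(gvr).Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
          UpdateFunc: func(oldObj, newObj interface{}) {
              oldCO, newCO := oldObj.(*unstructured.Unstructured), newObj.(*unstructured.Unstructured)
              before, after := operatorVersion(oldCO), operatorVersion(newCO)
              if before != after {
                  msg := fmt.Sprintf("Cluster operator %s changed versions[name=%q] from %q to %q",
                      newCO.GetName(), "operator", before, after)
                  select { // non-blocking send: one pending wake-up is enough
                  case notifyCh <- msg:
                  default:
                  }
              }
          },
      })

      ctx := context.Background()
      factory.Start(ctx.Done())
      factory.WaitForCacheSync(ctx.Done())

      for msg := range notifyCh {
          fmt.Println("Notify the sync worker:", msg) // the real CVO would kick off a new sync round here
      }
  }

The cluster-version-operator PR 818 linked above wires an equivalent trigger into the CVO's existing sync worker rather than standing up a separate watcher like this.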

Comment 1 W. Trevor King 2022-08-09 21:39:10 UTC
Talking over the details of [1] with David, here's a relevant chunk of timeline:

* 9:42:00, CVO gives up on Kube API server ClusterOperator [2]
* 9:42:47, Kube API server operator achieves 4.12 [3]
* 9:46:22, after a cool-off sleep, the CVO starts in on a new manifest graph-walk attempt [4]
* 9:46:34, CVO notices that the Kube API server ClusterOperator is happy [5]

So we could save that 3+ minute delay from 9:42:47 to 9:46:22 by introducing a "ClusterOperator bumped versions[name=operator]" trigger to break out of the cool-off sleep [6] that we set up in [7]. (A minimal sketch of such a trigger follows the links below.)

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-sdn-upgrade/1556564581915037696
[2]: From Loki: E0808 09:42:00.022500       1 task.go:117] error running apply for clusteroperator "kube-apiserver" (107 of 806): Cluster operator kube-apiserver is updating versions
[3]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-sdn-upgrade/1556564581915037696/artifacts/e2e-gcp-sdn-upgrade/openshift-e2e-test/build-log.txt | grep 'clusteroperator/kube-apiserver versions:'
     Aug 08 09:33:48.603 I clusteroperator/kube-apiserver versions: raw-internal 4.11.0-rc.7 -> 4.12.0-0.ci-2022-08-07-192220
     Aug 08 09:42:47.917 I clusteroperator/kube-apiserver versions: operator 4.11.0-rc.7 -> 4.12.0-0.ci-2022-08-07-192220
[4]: From Loki: I0808 09:46:22.998344       1 sync_worker.go:850] apply: 4.12.0-0.ci-2022-08-07-192220 on generation 3 in state Updating at attempt 5
[5]: From Loki: I0808 09:46:34.556374       1 sync_worker.go:973] Done syncing for clusteroperator "kube-apiserver" (107 of 806)
[6]: https://github.com/openshift/cluster-version-operator/blob/02838c74a98ae2a36c1584f28063bb35d03bc3cc/pkg/cvo/sync_worker.go#L546-L548
[7]: https://github.com/openshift/cluster-version-operator/blob/02838c74a98ae2a36c1584f28063bb35d03bc3cc/pkg/cvo/sync_worker.go#L605-L612
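
To make the cool-off interruptible, the sleep in [6] can become a select over a timer and a notification channel. A minimal sketch follows, assuming hypothetical coolOff and notifyCh names; this is not the actual sync_worker.go change:

  // Minimal sketch, not the actual sync_worker.go change: sleep for the
  // cool-off between sync attempts, but return early when a ClusterOperator
  // change notification arrives.  coolOff and notifyCh are hypothetical names.
  package main

  import (
      "context"
      "fmt"
      "time"
  )

  // waitForNextAttempt blocks until the cool-off elapses, a notification
  // arrives, or the context is cancelled.
  func waitForNextAttempt(ctx context.Context, coolOff time.Duration, notifyCh <-chan string) error {
      timer := time.NewTimer(coolOff)
      defer timer.Stop()
      select {
      case <-timer.C:
          return nil // cool-off elapsed; start the next manifest graph-walk attempt
      case reason := <-notifyCh:
          fmt.Println("Notify the sync worker:", reason) // retry immediately
          return nil
      case <-ctx.Done():
          return ctx.Err()
      }
  }

  func main() {
      notifyCh := make(chan string, 1)
      // Simulate an informer noticing a versions[name=operator] bump after 2s,
      // well before the 5-minute cool-off would have expired on its own.
      go func() {
          time.Sleep(2 * time.Second)
          notifyCh <- `Cluster operator kube-apiserver changed versions[name="operator"]`
      }()
      if err := waitForNextAttempt(context.Background(), 5*time.Minute, notifyCh); err != nil {
          fmt.Println("wait aborted:", err)
      }
  }

The default behavior is unchanged (wait out the cool-off), but a versions[name=operator] notification lets the worker start the next manifest graph-walk attempt immediately instead of spending the rest of the interval idle.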

Comment 3 W. Trevor King 2022-08-15 16:37:34 UTC
Looking at [1], a recent passing run is [2].  Checking Loki with:

  {invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade/1559113299906269184"} | unpack | namespace="openshift-cluster-version" |~ "clusteroperator \"kube-apiserver|in state|[Nn]otif"

includes:

  E0815 10:39:34.058968       1 task.go:117] error running apply for clusteroperator "kube-apiserver" (107 of 817): Cluster operator kube-apiserver is updating versions
  I0815 10:42:39.423392       1 sync_worker.go:235] Notify the sync worker: Cluster operator kube-apiserver changed versions[name="operator"] from "4.12.0-0.ci-2022-08-15-021818" to "4.12.0-0.ci-2022-08-15-094116"
  I0815 10:42:39.423618       1 sync_worker.go:864] apply: 4.12.0-0.ci-2022-08-15-094116 on generation 3 in state Updating at attempt 6
  I0815 10:42:44.579862       1 sync_worker.go:972] Running sync for customresourcedefinition "consolenotifications.console.openshift.io" (60 of 817)
  I0815 10:42:44.629617       1 sync_worker.go:987] Done syncing for customresourcedefinition "consolenotifications.console.openshift.io" (60 of 817)
  I0815 10:42:44.983003       1 sync_worker.go:962] Precreated resource clusteroperator "kube-apiserver" (107 of 817)
  I0815 10:42:46.979315       1 sync_worker.go:972] Running sync for clusteroperator "kube-apiserver" (107 of 817)
  I0815 10:42:46.979498       1 sync_worker.go:987] Done syncing for clusteroperator "kube-apiserver" (107 of 817)

and from the ClusterOperator monitors in the test step:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade/1559113299906269184/artifacts/e2e-aws-ovn-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'clusteroperator/kube-apiserver versions:'
  Aug 15 10:30:34.290 I clusteroperator/kube-apiserver versions: raw-internal 4.12.0-0.ci-2022-08-15-021818 -> 4.12.0-0.ci-2022-08-15-094116
  Aug 15 10:42:39.452 I clusteroperator/kube-apiserver versions: operator 4.12.0-0.ci-2022-08-15-021818 -> 4.12.0-0.ci-2022-08-15-094116

So at 10:42:39, both the CVO and the test suite notice versions[name=operator] bumping, and seconds later at 10:42:47, the CVO has passed that ClusterOperator in the manifest graph.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.12-informing#periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade/1559113299906269184

Comment 4 W. Trevor King 2022-08-15 16:46:24 UTC
I guess that was a CI build.  Looking at [1] for nightlies, a recent passing run is [2].  Checking Loki with:

  {invoker="openshift-internal-ci/periodic-ci-openshift-release-master-nightly-4.12-upgrade-from-stable-4.11-e2e-aws-sdn-upgrade/1559110968141680640"} | unpack | namespace="openshift-cluster-version" |~ "clusteroperator \"kube-apiserver|in state|[Nn]otif"

includes:

  E0815 10:27:37.372767       1 task.go:117] error running apply for clusteroperator "kube-apiserver" (107 of 806): Cluster operator kube-apiserver is updating versions
  I0815 10:27:57.511316       1 sync_worker.go:235] Notify the sync worker: Cluster operator kube-apiserver changed versions[name="operator"] from "4.11.0" to "4.12.0-0.nightly-2022-08-15-092951"
  I0815 10:27:57.511405       1 sync_worker.go:864] apply: 4.12.0-0.nightly-2022-08-15-092951 on generation 3 in state Updating at attempt 7
  I0815 10:28:02.630744       1 sync_worker.go:972] Running sync for customresourcedefinition "consolenotifications.console.openshift.io" (60 of 806)
  I0815 10:28:02.667184       1 sync_worker.go:987] Done syncing for customresourcedefinition "consolenotifications.console.openshift.io" (60 of 806)
  I0815 10:28:02.972678       1 sync_worker.go:962] Precreated resource clusteroperator "kube-apiserver" (107 of 806)
  I0815 10:28:05.026631       1 sync_worker.go:972] Running sync for clusteroperator "kube-apiserver" (107 of 806)
  I0815 10:28:05.026822       1 sync_worker.go:987] Done syncing for clusteroperator "kube-apiserver" (107 of 806)

and from the ClusterOperator monitors in the test step:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-upgrade-from-stable-4.11-e2e-aws-sdn-upgrade/1559110968141680640/artifacts/e2e-aws-sdn-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'clusteroperator/kube-apiserver versions:'
  Aug 15 10:15:55.976 I clusteroperator/kube-apiserver versions: raw-internal 4.11.0 -> 4.12.0-0.nightly-2022-08-15-092951
  Aug 15 10:27:57.521 I clusteroperator/kube-apiserver versions: operator 4.11.0 -> 4.12.0-0.nightly-2022-08-15-092951

So at 10:27:57, both the CVO and the test suite notice versions[name=operator] bumping, and seconds later at 10:28:05, the CVO has passed that ClusterOperator in the manifest graph.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.12-informing#periodic-ci-openshift-release-master-nightly-4.12-upgrade-from-stable-4.11-e2e-aws-sdn-upgrade
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-upgrade-from-stable-4.11-e2e-aws-sdn-upgrade/1559110968141680640

Comment 5 liujia 2022-08-16 05:12:28 UTC
The `cluster upgrade should complete in 75.00 minutes` failure happens not only on the 4.11-to-4.12 path, but also on older upgrade paths, per the CI search results:
# w3m -dump -cols 200 'https://search.ci.openshift.org/?search=cluster+upgrade+should+complete+in+75.00+minutes&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job'|grep 'failures match'
periodic-ci-openshift-multiarch-master-nightly-4.10-upgrade-from-nightly-4.9-ocp-remote-libvirt-ppc64le (all) - 7 runs, 100% failed, 29% of failures match = 29% impact
periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade (all) - 370 runs, 80% failed, 32% of failures match = 26% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-upgrade-from-stable-4.10-e2e-openstack-kuryr-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-upgrade-from-stable-4.10-e2e-openstack-upgrade (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade (all) - 184 runs, 71% failed, 2% of failures match = 1% impact
periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-ovirt-upgrade (all) - 28 runs, 43% failed, 17% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.12-e2e-azure-sdn-upgrade (all) - 42 runs, 48% failed, 20% of failures match = 10% impact
aggregated-azure-sdn-upgrade-4.12-minor-release-openshift-release-analysis-aggregator (all) - 37 runs, 24% failed, 22% of failures match = 5% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-upgrade-from-stable-4.9-e2e-openstack-upgrade (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-upgrade-from-nightly-4.9-ocp-remote-libvirt-s390x (all) - 7 runs, 100% failed, 29% of failures match = 29% impact
openshift-cluster-policy-controller-84-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade (all) - 10 runs, 50% failed, 120% of failures match = 60% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.12-upgrade-from-stable-4.11-e2e-openstack-upgrade (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-vsphere-upgrade (all) - 7 runs, 29% failed, 50% of failures match = 14% impact

Since the fix went into 4.12, we can verify the bug through an update path to the latest 4.12. I tried to find some nightly-build upgrade jobs with the fixed version, but there are no Loki logs for them, for example:
{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-sdn-upgrade/1557964664065757184"} | unpack | namespace="openshift-cluster-version"
{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-upgrade-ovn-ipv6/1557964692498944000"} | unpack | namespace="openshift-cluster-version"

Thanks to Trevor for providing the check and job in comment 4, from which we can say:
1) the delay between "Kube API server operator achieves target version" and "CVO notices the version update" has been dropped.
2) the new CVO log "Cluster operator %s changed versions[name=%q] from %q to %q" has been added.

Searching CI over the recent 2 days, there is still a 4.11-to-4.12 (4.11.0 to 4.12.0-0.ci-2022-08-14-220458) upgrade that failed "cluster upgrade should complete in 75.00 minutes":
# w3m -dump -cols 200 'https://search.ci.openshift.org/?search=cluster+upgrade+should+complete+in+75.00+minutes&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job'|grep -A3 '4.12'
periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade (all) - 88 runs, 78% failed, 1% of failures match = 1% impact
#1558939619842592768                                                                         junit                   27 hours ago
# [sig-cluster-lifecycle] cluster upgrade should complete in 75.00 minutes
upgrade to registry.build03.ci.openshift.org/ci-op-zxgig7fi/release@sha256:117ace5ae3d9833219219d650e5becf25b5a82e8e7b3ad38a271ae752fab0703 took too long: 76.17 minutes

The job is https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade/1558939619842592768.

From the above job, we can see "Kube API server operator achieves target version" at 23:39:26.
# curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade/1558939619842592768/artifacts/e2e-azure-sdn-upgrade/openshift-e2e-test/artifacts/e2e.log| grep 'clusteroperator/kube-apiserver versions:'
Aug 14 23:29:59.147 I clusteroperator/kube-apiserver versions: raw-internal 4.11.0 -> 4.12.0-0.ci-2022-08-14-220458
Aug 14 23:39:26.838 I clusteroperator/kube-apiserver versions: operator 4.11.0 -> 4.12.0-0.ci-2022-08-14-220458

But with no Loki logs available, there is no exact time for "CVO notices the version update". Furthermore, I checked the must-gather logs and saw that the new CVO log line had already been added (for the machine-config operator only):

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade/1558939619842592768"} | unpack | namespace="openshift-cluster-version" |~ "clusteroperator \"kube-apiserver|in state|[Nn]otif"

2022-08-15T00:35:16.840640486Z I0815 00:35:16.839633       1 sync_worker.go:235] Notify the sync worker: Cluster operator machine-config changed versions[name="operator"] from "4.11.0" to "4.12.0-0.ci-2022-08-14-220458"

@Trevor
I'm not sure whether we can ignore the above failure, since without the Loki logs there is no better way to check whether the delay has been dropped in that job. Do you have any other way to check?

Comment 6 liujia 2022-08-17 03:01:59 UTC
About the failure in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade/1558939619842592768.

Confirmed with Trevor: the Loki logs were available after changing the time window.

From the above job, we can see "Kube API server operator achieves target version" at 23:39:26.
# curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade/1558939619842592768/artifacts/e2e-azure-sdn-upgrade/openshift-e2e-test/artifacts/e2e.log| grep 'clusteroperator/kube-apiserver versions:'
Aug 14 23:29:59.147 I clusteroperator/kube-apiserver versions: raw-internal 4.11.0 -> 4.12.0-0.ci-2022-08-14-220458
Aug 14 23:39:26.838 I clusteroperator/kube-apiserver versions: operator 4.11.0 -> 4.12.0-0.ci-2022-08-14-220458

From the Loki logs, we can see "CVO notices the version update" at 23:39:26:

I0814 23:39:26.722746       1 sync_worker.go:235] Notify the sync worker: Cluster operator kube-apiserver changed versions[name="operator"] from "4.11.0" to "4.12.0-0.ci-2022-08-14-220458"
I0814 23:39:26.723810       1 sync_worker.go:864] apply: 4.12.0-0.ci-2022-08-14-220458 on generation 3 in state Updating at attempt 6

So this failure should not be related to the scope of the PR, which is:
1) the delay between "Kube API server operator achieves target version" and "CVO notices the version update" has been dropped.
2) the new CVO log "Cluster operator %s changed versions[name=%q] from %q to %q" has been added.

Moving the bug status. If we have other follow-up debugging/fixes for other causes of the CI failure `upgrade should complete in 75.00 minutes`, a new bug/Jira ticket may be needed.

Comment 7 W. Trevor King 2022-08-18 20:46:02 UTC
We're considering a 4.11 backport in [1].

[1]: https://issues.redhat.com/browse/OCPBUGS-306

Comment 11 errata-xmlrpc 2023-01-17 19:54:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

