We're seeing a slight uptick in how long upgrades are taking [1][2]. We are not 100% sure of the cause, but it looks like it started with 4.11 rc.7, and there are no obvious culprits in the diff [3]. Looking at some of the jobs, the gaps between kube-scheduler being updated and machine-api starting appear to be taking longer; an example job run [4] shows 10+ minutes of waiting.

TRT had a debugging session, and we have two suggestions:

- Add logging around when the CVO sees an operator version change.
- Instead of a fixed polling interval of 5 minutes (which is what we think the CVO is doing), would it be possible to trigger on the ClusterOperator to know when to look again? We think there could be substantial savings in upgrade time by doing this.

[1] https://search.ci.openshift.org/graph/metrics?metric=job%3Aduration%3Atotal%3Aseconds&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-aws-sdn-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-ovn-upgrade&job=periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-sdn-upgrade
[2] https://sippy.dptools.openshift.org/sippy-ng/tests/4.12/analysis?test=Cluster%20upgrade.%5Bsig-cluster-lifecycle%5D%20cluster%20upgrade%20should%20complete%20in%2075.00%20minutes
[3] https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.11.0-rc.7
[4] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-azure-sdn-upgrade/1556865989923049472
Talking over the details of [1] with David, here's a relevant chunk of timeline:

* 9:42:00, the CVO gives up on the Kube API server ClusterOperator [2]
* 9:42:47, the Kube API server operator achieves 4.12 [3]
* 9:46:22, after a cool-off sleep, the CVO starts in on a new manifest graph-walk attempt [4]
* 9:46:34, the CVO notices that the Kube API server ClusterOperator is happy [5]

So we could save that 3+ minute delay from 9:42:47 to 9:46:22 by introducing a "ClusterOperator bumped versions[name=operator]" trigger to break out of the cool-off sleep [6] that we set up in [7].

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-sdn-upgrade/1556564581915037696
[2]: From Loki:
     E0808 09:42:00.022500 1 task.go:117] error running apply for clusteroperator "kube-apiserver" (107 of 806): Cluster operator kube-apiserver is updating versions
[3]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-sdn-upgrade/1556564581915037696/artifacts/e2e-gcp-sdn-upgrade/openshift-e2e-test/build-log.txt | grep 'clusteroperator/kube-apiserver versions:'
     Aug 08 09:33:48.603 I clusteroperator/kube-apiserver versions: raw-internal 4.11.0-rc.7 -> 4.12.0-0.ci-2022-08-07-192220
     Aug 08 09:42:47.917 I clusteroperator/kube-apiserver versions: operator 4.11.0-rc.7 -> 4.12.0-0.ci-2022-08-07-192220
[4]: From Loki:
     I0808 09:46:22.998344 1 sync_worker.go:850] apply: 4.12.0-0.ci-2022-08-07-192220 on generation 3 in state Updating at attempt 5
[5]: From Loki:
     I0808 09:46:34.556374 1 sync_worker.go:973] Done syncing for clusteroperator "kube-apiserver" (107 of 806)
[6]: https://github.com/openshift/cluster-version-operator/blob/02838c74a98ae2a36c1584f28063bb35d03bc3cc/pkg/cvo/sync_worker.go#L546-L548
[7]: https://github.com/openshift/cluster-version-operator/blob/02838c74a98ae2a36c1584f28063bb35d03bc3cc/pkg/cvo/sync_worker.go#L605-L612
Looking at [1], a recent passing run is [2]. Checking Loki with:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade/1559113299906269184"} | unpack | namespace="openshift-cluster-version" |~ "clusteroperator \"kube-apiserver|in state|[Nn]otif"

includes:

E0815 10:39:34.058968 1 task.go:117] error running apply for clusteroperator "kube-apiserver" (107 of 817): Cluster operator kube-apiserver is updating versions
I0815 10:42:39.423392 1 sync_worker.go:235] Notify the sync worker: Cluster operator kube-apiserver changed versions[name="operator"] from "4.12.0-0.ci-2022-08-15-021818" to "4.12.0-0.ci-2022-08-15-094116"
I0815 10:42:39.423618 1 sync_worker.go:864] apply: 4.12.0-0.ci-2022-08-15-094116 on generation 3 in state Updating at attempt 6
I0815 10:42:44.579862 1 sync_worker.go:972] Running sync for customresourcedefinition "consolenotifications.console.openshift.io" (60 of 817)
I0815 10:42:44.629617 1 sync_worker.go:987] Done syncing for customresourcedefinition "consolenotifications.console.openshift.io" (60 of 817)
I0815 10:42:44.983003 1 sync_worker.go:962] Precreated resource clusteroperator "kube-apiserver" (107 of 817)
I0815 10:42:46.979315 1 sync_worker.go:972] Running sync for clusteroperator "kube-apiserver" (107 of 817)
I0815 10:42:46.979498 1 sync_worker.go:987] Done syncing for clusteroperator "kube-apiserver" (107 of 817)

and from the ClusterOperator monitors in the test step:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade/1559113299906269184/artifacts/e2e-aws-ovn-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'clusteroperator/kube-apiserver versions:'
Aug 15 10:30:34.290 I clusteroperator/kube-apiserver versions: raw-internal 4.12.0-0.ci-2022-08-15-021818 -> 4.12.0-0.ci-2022-08-15-094116
Aug 15 10:42:39.452 I clusteroperator/kube-apiserver versions: operator 4.12.0-0.ci-2022-08-15-021818 -> 4.12.0-0.ci-2022-08-15-094116

So at 10:42:39, both the CVO and the test suite notice versions[name=operator] bumping, and seconds later at 10:42:47, the CVO has passed that ClusterOperator in the manifest graph.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.12-informing#periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade/1559113299906269184
I guess that was a CI build. Looking at [1] for nightlies, a recent passing run is [2]. Checking Loki with:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-nightly-4.12-upgrade-from-stable-4.11-e2e-aws-sdn-upgrade/1559110968141680640"} | unpack | namespace="openshift-cluster-version" |~ "clusteroperator \"kube-apiserver|in state|[Nn]otif"

includes:

E0815 10:27:37.372767 1 task.go:117] error running apply for clusteroperator "kube-apiserver" (107 of 806): Cluster operator kube-apiserver is updating versions
I0815 10:27:57.511316 1 sync_worker.go:235] Notify the sync worker: Cluster operator kube-apiserver changed versions[name="operator"] from "4.11.0" to "4.12.0-0.nightly-2022-08-15-092951"
I0815 10:27:57.511405 1 sync_worker.go:864] apply: 4.12.0-0.nightly-2022-08-15-092951 on generation 3 in state Updating at attempt 7
I0815 10:28:02.630744 1 sync_worker.go:972] Running sync for customresourcedefinition "consolenotifications.console.openshift.io" (60 of 806)
I0815 10:28:02.667184 1 sync_worker.go:987] Done syncing for customresourcedefinition "consolenotifications.console.openshift.io" (60 of 806)
I0815 10:28:02.972678 1 sync_worker.go:962] Precreated resource clusteroperator "kube-apiserver" (107 of 806)
I0815 10:28:05.026631 1 sync_worker.go:972] Running sync for clusteroperator "kube-apiserver" (107 of 806)
I0815 10:28:05.026822 1 sync_worker.go:987] Done syncing for clusteroperator "kube-apiserver" (107 of 806)

and from the ClusterOperator monitors in the test step:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-upgrade-from-stable-4.11-e2e-aws-sdn-upgrade/1559110968141680640/artifacts/e2e-aws-sdn-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'clusteroperator/kube-apiserver versions:'
Aug 15 10:15:55.976 I clusteroperator/kube-apiserver versions: raw-internal 4.11.0 -> 4.12.0-0.nightly-2022-08-15-092951
Aug 15 10:27:57.521 I clusteroperator/kube-apiserver versions: operator 4.11.0 -> 4.12.0-0.nightly-2022-08-15-092951

So at 10:27:57, both the CVO and the test suite notice versions[name=operator] bumping, and seconds later at 10:28:05, the CVO has passed that ClusterOperator in the manifest graph.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.12-informing#periodic-ci-openshift-release-master-nightly-4.12-upgrade-from-stable-4.11-e2e-aws-sdn-upgrade
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-upgrade-from-stable-4.11-e2e-aws-sdn-upgrade/1559110968141680640
The `cluster upgrade should complete in 75.00 minutes` failure happened not only on the 4.11 to 4.12 path, but also on older upgrade paths, per the ci-search results:

# w3m -dump -cols 200 'https://search.ci.openshift.org/?search=cluster+upgrade+should+complete+in+75.00+minutes&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job'|grep 'failures match'
periodic-ci-openshift-multiarch-master-nightly-4.10-upgrade-from-nightly-4.9-ocp-remote-libvirt-ppc64le (all) - 7 runs, 100% failed, 29% of failures match = 29% impact
periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade (all) - 370 runs, 80% failed, 32% of failures match = 26% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-upgrade-from-stable-4.10-e2e-openstack-kuryr-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-upgrade-from-stable-4.10-e2e-openstack-upgrade (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade (all) - 184 runs, 71% failed, 2% of failures match = 1% impact
periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-ovirt-upgrade (all) - 28 runs, 43% failed, 17% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.12-e2e-azure-sdn-upgrade (all) - 42 runs, 48% failed, 20% of failures match = 10% impact
aggregated-azure-sdn-upgrade-4.12-minor-release-openshift-release-analysis-aggregator (all) - 37 runs, 24% failed, 22% of failures match = 5% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.10-upgrade-from-stable-4.9-e2e-openstack-upgrade (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-multiarch-master-nightly-4.10-upgrade-from-nightly-4.9-ocp-remote-libvirt-s390x (all) - 7 runs, 100% failed, 29% of failures match = 29% impact
openshift-cluster-policy-controller-84-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade (all) - 10 runs, 50% failed, 120% of failures match = 60% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.12-upgrade-from-stable-4.11-e2e-openstack-upgrade (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-vsphere-upgrade (all) - 7 runs, 29% failed, 50% of failures match = 14% impact

Since the fix went into 4.12, we can verify the bug through an upgrade path to the latest 4.12. I tried to find some nightly-build upgrade jobs with the fixed version, but there are no Loki logs for them, for example:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-sdn-upgrade/1557964664065757184"} | unpack | namespace="openshift-cluster-version"
{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-upgrade-ovn-ipv6/1557964692498944000"} | unpack | namespace="openshift-cluster-version"

Thanks to Trevor for providing the check and job in comment 4, from which we can say:

1) The delay between "Kube API server operator achieves target version" and "CVO notices the version update" has been dropped.
2) The new CVO log "Cluster operator %s changed versions[name=%q] from %q to %q" has been added.

Searching CI over the last 2 days, there is still a 4.11 to 4.12 (4.11.0 to 4.12.0-0.ci-2022-08-14-220458) upgrade that failed on "cluster upgrade should complete in 75.00 minutes":

# w3m -dump -cols 200 'https://search.ci.openshift.org/?search=cluster+upgrade+should+complete+in+75.00+minutes&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job'|grep -A3 '4.12'
periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade (all) - 88 runs, 78% failed, 1% of failures match = 1% impact
#1558939619842592768 junit 27 hours ago
# [sig-cluster-lifecycle] cluster upgrade should complete in 75.00 minutes
upgrade to registry.build03.ci.openshift.org/ci-op-zxgig7fi/release@sha256:117ace5ae3d9833219219d650e5becf25b5a82e8e7b3ad38a271ae752fab0703 took too long: 76.17 minutes

The job is https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade/1558939619842592768. From that job, we can see "Kube API server operator achieves target version" at 23:39:26:

# curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade/1558939619842592768/artifacts/e2e-azure-sdn-upgrade/openshift-e2e-test/artifacts/e2e.log| grep 'clusteroperator/kube-apiserver versions:'
Aug 14 23:29:59.147 I clusteroperator/kube-apiserver versions: raw-internal 4.11.0 -> 4.12.0-0.ci-2022-08-14-220458
Aug 14 23:39:26.838 I clusteroperator/kube-apiserver versions: operator 4.11.0 -> 4.12.0-0.ci-2022-08-14-220458

But with no Loki logs available, there is no way to get the exact time of "CVO notices the version update".
Furthermore, I checked the must-gather logs to confirm the new CVO log had already been added (seen for the machine-config operator only):

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade/1558939619842592768"} | unpack | namespace="openshift-cluster-version" |~ "clusteroperator \"kube-apiserver|in state|[Nn]otif"

2022-08-15T00:35:16.840640486Z I0815 00:35:16.839633 1 sync_worker.go:235] Notify the sync worker: Cluster operator machine-config changed versions[name="operator"] from "4.11.0" to "4.12.0-0.ci-2022-08-14-220458"

@Trevor: I'm not sure whether we can ignore the above failure, since without Loki logs there is no better way to check whether the delay has been dropped in that job. Do you have any other way to check?
About the failure in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade/1558939619842592768: I confirmed with Trevor that the Loki logs were available after all, by changing the time window.

From that job, we can see "Kube API server operator achieves target version" at 23:39:26:

# curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade/1558939619842592768/artifacts/e2e-azure-sdn-upgrade/openshift-e2e-test/artifacts/e2e.log| grep 'clusteroperator/kube-apiserver versions:'
Aug 14 23:29:59.147 I clusteroperator/kube-apiserver versions: raw-internal 4.11.0 -> 4.12.0-0.ci-2022-08-14-220458
Aug 14 23:39:26.838 I clusteroperator/kube-apiserver versions: operator 4.11.0 -> 4.12.0-0.ci-2022-08-14-220458

From the Loki log, we can see "CVO notices the version update" at 23:39:26 as well:

I0814 23:39:26.722746 1 sync_worker.go:235] Notify the sync worker: Cluster operator kube-apiserver changed versions[name="operator"] from "4.11.0" to "4.12.0-0.ci-2022-08-14-220458"
I0814 23:39:26.723810 1 sync_worker.go:864] apply: 4.12.0-0.ci-2022-08-14-220458 on generation 3 in state Updating at attempt 6

So this failure is not related to the scope of the PR, which covered:

1) The delay between "Kube API server operator achieves target version" and "CVO notices the version update" has been dropped.
2) The new CVO log "Cluster operator %s changed versions[name=%q] from %q to %q" has been added.

Moving the bug status. If we have other follow-up debugging/fixes for other causes of the CI failure `upgrade should complete in 75.00 minutes`, a new bug/Jira ticket may be needed.
We're considering a 4.11 backport in [1].

[1]: https://issues.redhat.com/browse/OCPBUGS-306
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399