Bug 1942164
| Field | Value |
|---|---|
| Summary | [sig-cluster-lifecycle] cluster upgrade should be fast |
| Product | OpenShift Container Platform |
| Component | kube-apiserver |
| Version | 4.8 |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Reporter | Nalin Dahyabhai <nalin> |
| Assignee | Lukasz Szaszkiewicz <lszaszki> |
| QA Contact | Rahul Gangwar <rgangwar> |
| CC | aos-bugs, cholman, gzaidman, jluhrsen, jokerman, mfojtik, mgugino, mimccune, stbenjam, travi, wking, xxia |
| Target Milestone | --- |
| Target Release | 4.9.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | LifecycleReset tag-ci |
| Doc Type | No Doc Update |
| Story Points | --- |
| Environment | [sig-cluster-lifecycle] cluster upgrade should be fast<br>job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-ovn-upgrade=all<br>job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade=all<br>job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade=all |
| Last Closed | 2021-10-18 17:29:21 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |
| Bug Blocks | 1929650, 1943566 |
Description (Nalin Dahyabhai, 2021-03-23 19:15:13 UTC)
I'm not sure how to triage which components took the most time to upgrade, but this isn't a machine-api or related problem as far as I can tell. Ideally, we'd use something other than sig-cluster-lifecycle on this test.

This also seems like the same (or similar) issue as these two:

* https://bugzilla.redhat.com/show_bug.cgi?id=1885322
* https://bugzilla.redhat.com/show_bug.cgi?id=1929650

Are these all different, or should we de-duplicate here?

I've shuffled the deps around just now to get:

* 4.8 (this bug)
* 4.7 bug 1929650
* 4.6 bug 1885322

As mentioned in [1], we have an internal ticket around tooling to make this more accessible, but it's also possible to audit manually today. I'm agnostic about the assigned team; ideally, once we get helpful tooling in place, it will be easier for any reviewer to hop straight to the component that started slowing things down, instead of needing a delegation step to a limited pool of diagnosing devs.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1885322#c4

I'm not sure how to audit the process manually today. Any pointers to docs would be appreciated.

There are no docs for this today; it's mostly reviewing assets out of the slow run and looking for fat, possibly by comparing against an earlier, faster run.
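That "compare against a faster run" step can be mechanized once per-operator durations are extracted. A hypothetical sketch (the operator names are real, but the per-operator minutes here are invented for illustration):

```python
# Hypothetical per-operator upgrade durations in minutes for a fast run and a
# slow run of the same job (values invented for illustration).
fast = {"kube-apiserver": 12, "network": 8, "machine-config": 15}
slow = {"kube-apiserver": 18, "network": 15, "machine-config": 25}

# Rank operators by how much more time they took in the slow run.
regressions = sorted(
    ((name, slow[name] - fast.get(name, 0)) for name in slow),
    key=lambda item: -item[1],
)
for name, extra in regressions:
    print(f"{name}: +{extra}m")
```

Sorting by the delta, rather than by absolute duration, points straight at the component that regressed instead of the one that is merely slowest.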
For example, timelining [1]:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1374364130718257152/artifacts/e2e-aws-ovn-upgrade/openshift-e2e-test/build-log.txt | grep 'clusteroperator/.*versions:.*operator' | sed 's/^/ /'
Mar 23 15:08:36.315 I clusteroperator/etcd versions: raw-internal 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431, operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431
Mar 23 15:26:18.758 I clusteroperator/kube-apiserver versions: operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431
Mar 23 15:29:57.347 I clusteroperator/kube-controller-manager versions: operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431
Mar 23 15:30:29.537 I clusteroperator/kube-scheduler versions: operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431
Mar 23 15:30:58.275 I clusteroperator/openshift-apiserver versions: operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431
Mar 23 15:31:05.083 I clusteroperator/config-operator versions: operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431
Mar 23 15:33:28.636 I clusteroperator/machine-api versions: operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431
Mar 23 15:37:10.915 I clusteroperator/baremetal versions: operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431
...
Mar 23 15:42:31.138 I clusteroperator/monitoring versions: operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431
Mar 23 15:57:33.713 I clusteroperator/network versions: operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431
Mar 23 15:57:44.117 I clusteroperator/dns versions: operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431, coredns quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8ff781a1580e39d92d8d6f47adfe8067145b9303f4c47f874a8962e2086dd9a5 -> registry.ci.openshift.org/ocp/4.8-2021-03-23-141431@sha256:57d17c3cee833b864a6c580a5a279ec2dc425343905ae7d7f137420c9e67ea7a, openshift-cli quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a1548c618dfbfeeaa18d02ecb6f2cffbac0217e04b15729cf42bb14b6c56d595 -> registry.ci.openshift.org/ocp/4.8-2021-03-23-141431@sha256:e685efa430b027a7abd9824af58192620e0596fde83929e2250d9f630440c01f, kube-rbac-proxy quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:513cca8279a0d9c1cb5155699dafbea416714cc3aa90043f10db71f481f64140 -> registry.ci.openshift.org/ocp/4.8-2021-03-23-141431@sha256:c1397cf1367d4f5554fdb5f6208cfed0a348b5d8fc22244746fbbb1dafb9f184
Mar 23 16:24:01.197 I clusteroperator/machine-config versions: operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431
```

So of the 77m that job took to update [1], 18m was kube-apiserver, 15m was network, 25m was machine-config, and that leaves ~19m for everyone else. Drilling in on the network delay:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1374364130718257152/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace | test("openshift-network-operator|openshift-network-diagnostics")) | .firstTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | sort
...
2021-03-23T15:36:43Z 1 ConnectivityOutageDetected: Connectivity outage detected: openshift-apiserver-endpoint-ip-10-0-197-105: failed to establish a TCP connection to 10.128.0.49:8443: dial tcp 10.128.0.49:8443: connect: connection refused
2021-03-23T15:36:43Z 1 ConnectivityRestored: Connectivity restored after 1m59.999901774s: openshift-apiserver-endpoint-ip-10-0-197-105: tcp connection to 10.128.0.67:8443 succeeded
2021-03-23T15:36:43Z 1 ConnectivityRestored: Connectivity restored after 59.999565993s: openshift-apiserver-endpoint-ip-10-0-184-232: tcp connection to 10.130.0.68:8443 succeeded
2021-03-23T15:36:43Z 1 ConnectivityRestored: Connectivity restored after 59.999963901s: openshift-apiserver-endpoint-ip-10-0-167-149: tcp connection to 10.129.0.79:8443 succeeded
2021-03-23T15:43:43Z 1 ConnectivityOutageDetected: Connectivity outage detected: kubernetes-apiserver-endpoint-ip-10-0-167-149: failed to establish a TCP connection to 10.0.167.149:6443: dial tcp 10.0.167.149:6443: connect: connection refused
2021-03-23T15:43:43Z 1 ConnectivityOutageDetected: Connectivity outage detected: kubernetes-apiserver-service-cluster: failed to establish a TCP connection to 172.30.214.213:443: dial tcp 172.30.214.213:443: connect: connection refused
2021-03-23T15:43:43Z 1 ConnectivityRestored: Connectivity restored after 2m0.000208651s: kubernetes-apiserver-endpoint-ip-10-0-167-149: tcp connection to 10.0.167.149:6443 succeeded
2021-03-23T15:43:43Z 1 ConnectivityRestored: Connectivity restored after 59.99993054s: kubernetes-apiserver-service-cluster: tcp connection to 172.30.214.213:443 succeeded
2021-03-23T15:46:01Z 1 ScalingReplicaSet: Scaled up replica set network-operator-586f7fbbf to 1
2021-03-23T15:46:01Z 1 SuccessfulCreate: Created pod: network-operator-586f7fbbf-w8r84
...
```

With lots of that sort of thing. So probably the networking folks and/or the API-server folks can shed some light on what's going on there. Maybe bug 1928946 is mixed in with the initial kube-apiserver delays? Part of the machine-config delay is probably bug 1817075. I dunno, but at least we're down to three teams instead of a blanket "OCP updates are currently too slow" :p

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1374364130718257152

Moving to the API server team to investigate.

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason, or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen to Keywords if you think this bug should never be marked as stale. Please consult with the bug assignee before you do that.
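The manual timelining above can be scripted. A minimal sketch, not official tooling, that computes the gap between successive `clusteroperator/... versions:` completion lines (sample lines abbreviated from the log excerpt above; `completion_gaps` is a name invented here):

```python
from datetime import datetime

# Abbreviated sample of "clusteroperator/... versions:" lines from a build log.
LOG = """\
Mar 23 15:08:36.315 I clusteroperator/etcd versions: raw-internal 4.7.3 -> 4.8.0
Mar 23 15:26:18.758 I clusteroperator/kube-apiserver versions: operator 4.7.3 -> 4.8.0
Mar 23 15:42:31.138 I clusteroperator/monitoring versions: operator 4.7.3 -> 4.8.0
Mar 23 15:57:33.713 I clusteroperator/network versions: operator 4.7.3 -> 4.8.0
"""

def completion_gaps(log):
    """Return (operator, minutes since previous completion) pairs."""
    prev = None
    gaps = []
    for line in log.splitlines():
        stamp, _, rest = line.partition(" I clusteroperator/")
        if not rest:
            continue
        name = rest.split()[0]
        # The log omits the year; the default (1900) is fine for same-day deltas.
        t = datetime.strptime(stamp, "%b %d %H:%M:%S.%f")
        if prev is not None:
            gaps.append((name, (t - prev).total_seconds() / 60))
        prev = t
    return gaps

for name, minutes in completion_gaps(LOG):
    print(f"{name}: {minutes:.1f}m")
```

The gap attributed to each operator is only an upper bound, since operators overlap during a rollout, but it is the same heuristic the grep-based timelining uses.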
Raising to medium, based on CI prevalence:

```
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=336h&type=junit&search=cluster+upgrade+should+be+fast' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.6-e2e-aws-upgrade-rollback (all) - 14 runs, 100% failed, 21% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-ovn-upgrade (all) - 158 runs, 100% failed, 7% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade (all) - 163 runs, 87% failed, 4% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-ovn-upgrade (all) - 56 runs, 98% failed, 29% of failures match = 29% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade (all) - 56 runs, 93% failed, 4% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-ovn-upgrade (all) - 28 runs, 100% failed, 4% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.7-e2e-aws-upgrade-rollback (all) - 14 runs, 7% failed, 100% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-ovn-upgrade (all) - 155 runs, 32% failed, 20% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade (all) - 89 runs, 25% failed, 5% of failures match = 1% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-upgrade (all) - 56 runs, 27% failed, 13% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-gcp-ovn-upgrade (all) - 56 runs, 64% failed, 28% of failures match = 18% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-gcp-upgrade (all) - 58 runs, 33% failed, 47% of failures match = 16% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-ovirt-upgrade (all) - 28 runs, 54% failed, 113% of failures match = 61% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-vsphere-upgrade (all) - 14 runs, 71% failed, 10% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 244 runs, 100% failed, 54% of failures match = 54% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 215 runs, 87% failed, 4% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 55 runs, 100% failed, 47% of failures match = 47% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 56 runs, 100% failed, 82% of failures match = 82% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 14 runs, 93% failed, 31% of failures match = 29% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 32 runs, 100% failed, 38% of failures match = 38% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 14 runs, 93% failed, 15% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 225 runs, 98% failed, 0% of failures match = 0% impact
periodic-ci-openshift-release-master-nightly-4.6-e2e-metal-ipi-upgrade (all) - 19 runs, 95% failed, 6% of failures match = 5% impact
periodic-ci-openshift-release-master-nightly-4.6-upgrade-from-stable-4.5-e2e-metal-ipi-upgrade (all) - 19 runs, 95% failed, 28% of failures match = 26% impact
periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade (all) - 22 runs, 9% failed, 50% of failures match = 5% impact
periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-metal-ipi-upgrade (all) - 22 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 87 runs, 93% failed, 2% of failures match = 2% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 85 runs, 99% failed, 10% of failures match = 9% impact
pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp-upgrade-five-control-plane-replicas (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
...
pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade (all) - 41 runs, 85% failed, 31% of failures match = 27% impact
rehearse-17519-periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
...
rehearse-17928-periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-ppc64le-4.7-to-4.8 (all) - 28 runs, 100% failed, 25% of failures match = 25% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-s390x-4.7-to-4.8 (all) - 28 runs, 100% failed, 25% of failures match = 25% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 202 runs, 73% failed, 1% of failures match = 1% impact
release-openshift-origin-installer-e2e-aws-upgrade (all) - 137 runs, 37% failed, 16% of failures match = 6% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.5-to-4.6-ci (all) - 14 runs, 36% failed, 20% of failures match = 7% impact
release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.5-stable-to-4.6-ci (all) - 56 runs, 32% failed, 17% of failures match = 5% impact
release-openshift-origin-installer-launch-aws (all) - 393 runs, 45% failed, 2% of failures match = 1% impact
release-openshift-origin-installer-launch-azure (all) - 135 runs, 49% failed, 2% of failures match = 1% impact
release-openshift-origin-installer-launch-gcp (all) - 985 runs, 58% failed, 0% of failures match = 0% impact
```

The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.

@wking, I noticed that this 'cluster upgrade should be fast' test case is passing periodically now in the ovn-upgrade jobs. Looking at two recent passing runs with your bash-fu, I see the network took ~11m both times:

```
❯ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1394608429129011200/artifacts/e2e-aws-ovn-upgrade/openshift-e2e-test/build-log.txt | rg 'clusteroperator/.*versions:.*operator' | sed 's/^/ /' | rg 'network versions|autoscaler versions'
May 18 12:20:15.304 I clusteroperator/cluster-autoscaler versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-104756
May 18 12:31:56.582 I clusteroperator/network versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-104756
❯ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1394628146623418368/artifacts/e2e-aws-ovn-upgrade/openshift-e2e-test/build-log.txt | rg 'clusteroperator/.*versions:.*operator' | sed 's/^/ /' | rg 'network versions|autoscaler versions'
May 18 13:44:18.063 I clusteroperator/cluster-autoscaler versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-121200
May 18 13:55:31.023 I clusteroperator/network versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-121200
```

The two most recently failing jobs were ~16m and ~14m:

```
❯ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1394587518900375552/artifacts/e2e-aws-ovn-upgrade/openshift-e2e-test/build-log.txt | rg 'clusteroperator/.*versions:.*operator' | sed 's/^/ /' | rg 'network versions|autoscaler versions'
May 18 11:02:14.270 I clusteroperator/cluster-autoscaler versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-092904
May 18 11:18:49.986 I clusteroperator/network versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-092904
❯ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1394548277868761088/artifacts/e2e-aws-ovn-upgrade/openshift-e2e-test/build-log.txt | rg 'clusteroperator/.*versions:.*operator' | sed 's/^/ /' | rg 'network versions|autoscaler versions'
May 18 08:14:33.545 I clusteroperator/cluster-autoscaler versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-065758
May 18 08:28:49.018 I clusteroperator/network versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-065758
```

Now, comparing to non-OVN jobs, which fail this test periodically but with much less frequency.
The network took ~7m and ~8m for the last two failing cases:

```
❯ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1394646605461721088/artifacts/e2e-aws-upgrade/openshift-e2e-test/build-log.txt | rg 'clusteroperator/.*versions:.*operator' | sed 's/^/ /' | rg 'network versions|autoscaler versions'
May 18 16:49:27.029 I clusteroperator/cluster-autoscaler versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-132557
May 18 16:56:34.779 I clusteroperator/network versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-132557
❯ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1394608429045125120/artifacts/e2e-aws-upgrade/openshift-e2e-test/build-log.txt | rg 'clusteroperator/.*versions:.*operator' | sed 's/^/ /' | rg 'network versions|autoscaler versions'
May 18 12:09:06.342 I clusteroperator/cluster-autoscaler versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-104756
May 18 12:17:26.701 I clusteroperator/network versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-104756
```

and ~5m and ~12m in the last two passing cases:

```
❯ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1394587518837460992/artifacts/e2e-aws-upgrade/openshift-e2e-test/build-log.txt | rg 'clusteroperator/.*versions:.*operator' | sed 's/^/ /' | rg 'network versions|autoscaler versions'
May 18 10:49:26.040 I clusteroperator/cluster-autoscaler versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-092904
May 18 11:04:44.419 I clusteroperator/network versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-092904
❯ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1394274953615904768/artifacts/e2e-aws-upgrade/openshift-e2e-test/build-log.txt | rg 'clusteroperator/.*versions:.*operator' | sed 's/^/ /' | rg 'network versions|autoscaler versions'
May 17 15:05:10.484 I clusteroperator/cluster-autoscaler versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-17-125107
May 17 15:17:22.899 I clusteroperator/network versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-17-125107
```

If it's worth collecting more data so we can be more confident, we can, but my hunch is that OVN networking is taking some 50-100% longer to come up compared to openshift-sdn. I'll bring this up with the team, and maybe we need a separate BZ just for this case, as I know there are other operators you called out (e.g., kube-apiserver) as taking a while. The upgrade timeout is 75m before this test gets marked as a failure. I don't know the history of that number, but if it's semi-arbitrary and we decide that it's acceptable that OVN networking should take longer, then we can bump that 75m by a handful of minutes, I guess.

Yeah, my impression is that 75m is fairly arbitrary. But we should make a solid case before bumping it, to avoid getting slower and slower out of laziness ;). Bug 1817075 and bug 1949677 will both also squeeze out a few minutes each.

Created attachment 1784787 [details]
kube-apis-termination-events
Created attachment 1784788 [details]
kube-apis-termination-events-2
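Re-ranking the CI-search prevalence list from the earlier comment by hand is tedious; a hypothetical sketch that parses lines of the `<job> (all) - N runs, X% failed, Y% of failures match = Z% impact` form and sorts them by impact (job names shortened here, numbers taken from a few of the real lines):

```python
import re

# Sample lines in the format produced by the CI search above
# (job names invented; figures borrowed from the real output).
LINES = [
    "periodic-a (all) - 56 runs, 100% failed, 82% of failures match = 82% impact",
    "periodic-b (all) - 244 runs, 100% failed, 54% of failures match = 54% impact",
    "periodic-c (all) - 14 runs, 7% failed, 100% of failures match = 7% impact",
]

PATTERN = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, "
    r"(?P<failed>\d+)% failed, \d+% of failures match = (?P<impact>\d+)% impact$"
)

def by_impact(lines):
    """Parse impact lines; return (job, runs, impact%) sorted by impact, descending."""
    rows = []
    for line in lines:
        m = PATTERN.match(line)
        if m:
            rows.append((m["job"], int(m["runs"]), int(m["impact"])))
    return sorted(rows, key=lambda row: -row[2])

for job, runs, impact in by_impact(LINES):
    print(f"{impact:3d}% {job} ({runs} runs)")
```

Sorting by impact, with run counts alongside, makes it easy to discount high-impact jobs that only ran once or twice (such as the rehearse-* entries).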
For me, the two most recently failing jobs were ~13m and ~14m:

1. https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1394587518900375552/build-log.txt

   From the build-log.txt:

   ```
   10:33:23.000 I ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator reason/OperatorVersionChanged clusteroperator/kube-apiserver version "raw-internal" changed from "4.7.11" to "4.8.0-0.ci-2021-05-18-092904"
   10:46:34.037 I clusteroperator/kube-apiserver versions: kube-apiserver 1.20.0-beta.2 -> 1.21.0-rc.0, operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-092904
   ```

   From the events:

   ```
   10:33:23 openshift-kube-apiserver-operator kube-apiserver-operator-status-controller-statussyncer_kube-apiserver kube-apiserver-operator OperatorVersionChanged clusteroperator/kube-apiserver version "raw-internal" changed from "4.7.11" to "4.8.0-0.ci-2021-05-18-092904"
   10:46:33 openshift-kube-apiserver-operator kube-apiserver-operator-status-controller-statussyncer_kube-apiserver kube-apiserver-operator OperatorVersionChanged clusteroperator/kube-apiserver version "operator" changed from "4.7.11" to "4.8.0-0.ci-2021-05-18-092904"
   10:46:33 openshift-kube-apiserver-operator kube-apiserver-operator-status-controller-statussyncer_kube-apiserver kube-apiserver-operator OperatorVersionChanged clusteroperator/kube-apiserver version "kube-apiserver" changed from "1.20.0-beta.2" to "1.21.0-rc.0"
   ```

2. https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1394548277868761088/artifacts/e2e-aws-ovn-upgrade/openshift-e2e-test/build-log.txt

   From the build-log.txt:

   ```
   May 18 07:44:41.000 I ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator reason/OperatorVersionChanged clusteroperator/kube-apiserver version "raw-internal" changed from "4.7.11" to "4.8.0-0.ci-2021-05-18-065758"
   May 18 07:58:39.483 I clusteroperator/kube-apiserver versions: kube-apiserver 1.20.0-beta.2 -> 1.21.0-rc.0, operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-065758
   ```

   From the events:

   ```
   07:44:41 openshift-kube-apiserver-operator kube-apiserver-operator-status-controller-statussyncer_kube-apiserver kube-apiserver-operator OperatorVersionChanged clusteroperator/kube-apiserver version "raw-internal" changed from "4.7.11" to "4.8.0-0.ci-2021-05-18-065758"
   07:58:39 openshift-kube-apiserver-operator kube-apiserver-operator-status-controller-statussyncer_kube-apiserver kube-apiserver-operator OperatorVersionChanged clusteroperator/kube-apiserver version "operator" changed from "4.7.11" to "4.8.0-0.ci-2021-05-18-065758"
   07:58:39 openshift-kube-apiserver-operator kube-apiserver-operator-status-controller-statussyncer_kube-apiserver kube-apiserver-operator OperatorVersionChanged clusteroperator/kube-apiserver version "kube-apiserver" changed from "1.20.0-beta.2" to "1.21.0-rc.0"
   ```

What is concerning is that processing pending requests always takes 60 seconds. I have attached two screenshots in which this can be seen. This is something I would like to investigate; I think I will create a separate BZ for that. Also, the time difference between closing one server and starting termination of the next one is significant, around 2 minutes. That deserves further investigation as well.
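The rollout span above (raw-internal at 10:33:23, operator at 10:46:33, roughly 13m) can also be computed from the gathered events rather than eyeballed. A sketch assuming events shaped like the gather-extra events.json items (timestamps copied from the excerpt above; `version_change_span` is a name invented here):

```python
from datetime import datetime

# A trimmed events.json-style payload; shape follows the gather-extra artifacts,
# timestamps and messages copied from the excerpt above, everything else omitted.
EVENTS = {"items": [
    {"firstTimestamp": "2021-05-18T10:33:23Z",
     "reason": "OperatorVersionChanged",
     "message": 'clusteroperator/kube-apiserver version "raw-internal" changed '
                'from "4.7.11" to "4.8.0-0.ci-2021-05-18-092904"'},
    {"firstTimestamp": "2021-05-18T10:46:33Z",
     "reason": "OperatorVersionChanged",
     "message": 'clusteroperator/kube-apiserver version "operator" changed '
                'from "4.7.11" to "4.8.0-0.ci-2021-05-18-092904"'},
]}

def version_change_span(events):
    """Minutes from the first to the last OperatorVersionChanged event."""
    stamps = [
        datetime.strptime(e["firstTimestamp"], "%Y-%m-%dT%H:%M:%SZ")
        for e in events["items"]
        if e["reason"] == "OperatorVersionChanged"
    ]
    return (max(stamps) - min(stamps)).total_seconds() / 60

print(f"kube-apiserver rollout: {version_change_span(EVENTS):.0f}m")
```

The events route avoids grepping the build log, and the same filter-by-reason pattern works for any other event type of interest.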
Also, once https://github.com/openshift/cluster-kube-apiserver-operator/pull/1079 merges, the time needed to install a new version of the Kube API servers will be significantly increased on AWS.

@wking, I opened a ticket [0] to investigate the OVN network upgrade taking so long. I'm not sure it's as big of a deal as I originally thought. You can read the comments in the ticket, and maybe I can dig further from there.

[0] https://issues.redhat.com/browse/SDN-1867

According to @trozet, an OVN upgrade should take a bit longer than an SDN upgrade. It looks like, on average, OVN is taking ~13m longer. I know MCO has some ideas about decreasing their upgrade duration (see bug 1817075), but I'm not sure that is coming any time soon. The current limit is 75m; +13m is 88m. Here is a PR to make it 90m. Someone originally thought 75m was safe for an SDN upgrade scenario, so I'm adding 15m to be "safe" for an OVN upgrade scenario. Although, in reality, the OVN upgrade jobs that I look at are finishing just a few (maybe 5-ish) minutes past the 75m limit. But, to be safe...

https://github.com/openshift/origin/pull/26202

https://github.com/openshift/kubernetes/pull/799 will speed up the upgrading of the API servers by an additional 3 minutes per revision.

The original PR did not work like I thought. An updated PR is here: https://github.com/openshift/origin/pull/26324

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759