Bug 1942164 - [sig-cluster-lifecycle] cluster upgrade should be fast
Summary: [sig-cluster-lifecycle] cluster upgrade should be fast
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Lukasz Szaszkiewicz
QA Contact: Rahul Gangwar
URL:
Whiteboard: LifecycleReset tag-ci
Depends On:
Blocks: 1929650 1943566
 
Reported: 2021-03-23 19:15 UTC by Nalin Dahyabhai
Modified: 2021-10-18 17:29 UTC (History)
12 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
[sig-cluster-lifecycle] cluster upgrade should be fast job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-ovn-upgrade=all job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade=all job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade=all
Last Closed: 2021-10-18 17:29:21 UTC
Target Upstream Version:


Attachments
kube-apis-termination-events (245.60 KB, image/png)
2021-05-19 10:46 UTC, Lukasz Szaszkiewicz
kube-apis-termination-events-2 (242.18 KB, image/png)
2021-05-19 10:47 UTC, Lukasz Szaszkiewicz


Links
System ID Private Priority Status Summary Last Updated
Github openshift origin pull 26202 0 None open Bug 1942164: Increase upgrade timeout to 90m from 75m 2021-06-30 18:07:01 UTC
Github openshift origin pull 26324 0 None open Bug 1942164: Fix time calc ordering for upgrades 2021-07-13 21:50:07 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:29:59 UTC

Description Nalin Dahyabhai 2021-03-23 19:15:13 UTC
test:
[sig-cluster-lifecycle] cluster upgrade should be fast 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-cluster-lifecycle%5C%5D+cluster+upgrade+should+be+fast

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1374364130718257152

Cluster upgrade should be fast: upgrade to registry.build02.ci.openshift.org/**************/release@sha256:10a122c88469bc686a95147a647481043923417831c7ffab0dc01846a012d1dc took too long: 77.33724422838333

The set of other tests that are failing for this run are different from the set we see in bug #1885322, so I don't know if it's the same root cause.

Comment 1 Michael Gugino 2021-03-23 20:06:00 UTC
I'm not sure how to triage which components took the most time to upgrade, but this isn't a machine-api or related problem as far as I can tell.  Ideally, we'd use something other than sig-cluster-lifecycle on this test.

Comment 2 Michael McCune 2021-03-23 21:11:12 UTC
this also seems like the same (or similar) as these two:

* https://bugzilla.redhat.com/show_bug.cgi?id=1885322

* https://bugzilla.redhat.com/show_bug.cgi?id=1929650

are these all different or should we de-duplicate here?

Comment 3 W. Trevor King 2021-03-23 21:28:21 UTC
I've shuffled the deps around just now to get:

* 4.8 (this bug)
* 4.7 bug 1929650
* 4.6 bug 1885322

As mentioned in [1], we have an internal ticket around tooling to make this more accessible.  But it's also possible to audit manually today.  I'm agnostic about the assigned team; ideally once we get helpful tooling in place, it will be easier for any reviewer to hop straight towards the component that started slowing things down, instead of needing a delegation step to a limited pool of diagnosing devs.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1885322#c4

Comment 4 Michael Gugino 2021-03-24 00:48:14 UTC
I'm not sure how to audit the process manually today.  Any pointers to docs would be appreciated.

Comment 5 W. Trevor King 2021-03-24 04:27:03 UTC
There are not docs for this today; it's mostly reviewing assets out of the slow run and looking for fat, possibly by comparing against an earlier, faster run.  For example, timelining [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1374364130718257152/artifacts/e2e-aws-ovn-upgrade/openshift-e2e-test/build-log.txt | grep 'clusteroperator/.*versions:.*operator' | sed 's/^/  /'
  Mar 23 15:08:36.315 I clusteroperator/etcd versions: raw-internal 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431, operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431
  Mar 23 15:26:18.758 I clusteroperator/kube-apiserver versions: operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431
  Mar 23 15:29:57.347 I clusteroperator/kube-controller-manager versions: operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431
  Mar 23 15:30:29.537 I clusteroperator/kube-scheduler versions: operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431
  Mar 23 15:30:58.275 I clusteroperator/openshift-apiserver versions: operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431
  Mar 23 15:31:05.083 I clusteroperator/config-operator versions: operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431
  Mar 23 15:33:28.636 I clusteroperator/machine-api versions: operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431
  Mar 23 15:37:10.915 I clusteroperator/baremetal versions: operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431
  ...
  Mar 23 15:42:31.138 I clusteroperator/monitoring versions: operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431
  Mar 23 15:57:33.713 I clusteroperator/network versions: operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431
  Mar 23 15:57:44.117 I clusteroperator/dns versions: operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431, coredns quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8ff781a1580e39d92d8d6f47adfe8067145b9303f4c47f874a8962e2086dd9a5 -> registry.ci.openshift.org/ocp/4.8-2021-03-23-141431@sha256:57d17c3cee833b864a6c580a5a279ec2dc425343905ae7d7f137420c9e67ea7a, openshift-cli quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a1548c618dfbfeeaa18d02ecb6f2cffbac0217e04b15729cf42bb14b6c56d595 -> registry.ci.openshift.org/ocp/4.8-2021-03-23-141431@sha256:e685efa430b027a7abd9824af58192620e0596fde83929e2250d9f630440c01f, kube-rbac-proxy quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:513cca8279a0d9c1cb5155699dafbea416714cc3aa90043f10db71f481f64140 -> registry.ci.openshift.org/ocp/4.8-2021-03-23-141431@sha256:c1397cf1367d4f5554fdb5f6208cfed0a348b5d8fc22244746fbbb1dafb9f184
  Mar 23 16:24:01.197 I clusteroperator/machine-config versions: operator 4.7.3 -> 4.8.0-0.ci-2021-03-23-141431

So of the 77m that job took to update [1], 18m was kube-apiserver, 15m was network, 25m was machine-config, and that leaves ~19m for everyone else.  Drilling in on the network delay:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1374364130718257152/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace | test("openshift-network-operator|openshift-network-diagnostics")) | .firstTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | sort
  ...
  2021-03-23T15:36:43Z 1 ConnectivityOutageDetected: Connectivity outage detected: openshift-apiserver-endpoint-ip-10-0-197-105: failed to establish a TCP connection to 10.128.0.49:8443: dial tcp 10.128.0.49:8443: connect: connection refused
  2021-03-23T15:36:43Z 1 ConnectivityRestored: Connectivity restored after 1m59.999901774s: openshift-apiserver-endpoint-ip-10-0-197-105: tcp connection to 10.128.0.67:8443 succeeded
  2021-03-23T15:36:43Z 1 ConnectivityRestored: Connectivity restored after 59.999565993s: openshift-apiserver-endpoint-ip-10-0-184-232: tcp connection to 10.130.0.68:8443 succeeded
  2021-03-23T15:36:43Z 1 ConnectivityRestored: Connectivity restored after 59.999963901s: openshift-apiserver-endpoint-ip-10-0-167-149: tcp connection to 10.129.0.79:8443 succeeded
  2021-03-23T15:43:43Z 1 ConnectivityOutageDetected: Connectivity outage detected: kubernetes-apiserver-endpoint-ip-10-0-167-149: failed to establish a TCP connection to 10.0.167.149:6443: dial tcp 10.0.167.149:6443: connect: connection refused
  2021-03-23T15:43:43Z 1 ConnectivityOutageDetected: Connectivity outage detected: kubernetes-apiserver-service-cluster: failed to establish a TCP connection to 172.30.214.213:443: dial tcp 172.30.214.213:443: connect: connection refused
  2021-03-23T15:43:43Z 1 ConnectivityRestored: Connectivity restored after 2m0.000208651s: kubernetes-apiserver-endpoint-ip-10-0-167-149: tcp connection to 10.0.167.149:6443 succeeded
  2021-03-23T15:43:43Z 1 ConnectivityRestored: Connectivity restored after 59.99993054s: kubernetes-apiserver-service-cluster: tcp connection to 172.30.214.213:443 succeeded
  2021-03-23T15:46:01Z 1 ScalingReplicaSet: Scaled up replica set network-operator-586f7fbbf to 1
  2021-03-23T15:46:01Z 1 SuccessfulCreate: Created pod: network-operator-586f7fbbf-w8r84
  ...

With lots of that sort of thing.  So probably the networking folks and/or the API-server folks can shed some light on what's going on there.

Maybe bug 1928946 is mixed in with the initial kube-apiserver delays?

Part of the machine-config delay is probably bug 1817075.  I dunno, but at least we're down to three teams instead of a blanket "OCP updates are currently too slow" :p

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1374364130718257152
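
The timeline deltas above can also be scripted rather than eyeballed. A minimal sketch (the sample lines are abbreviated copies of the grep output above; in practice you'd pipe the curl | grep command's output in) that prints the gap, in whole minutes, between consecutive operator version bumps:

```shell
# Minimal sketch: minutes between consecutive clusteroperator version bumps.
# Sample lines are abbreviated from the timeline above; pipe real output in.
timeline='Mar 23 15:08:36.315 I clusteroperator/etcd
Mar 23 15:26:18.758 I clusteroperator/kube-apiserver
Mar 23 15:42:31.138 I clusteroperator/monitoring
Mar 23 15:57:33.713 I clusteroperator/network
Mar 23 16:24:01.197 I clusteroperator/machine-config'

gaps=$(echo "$timeline" | awk '{
  split($3, t, ":")                          # HH:MM:SS.mmm
  secs = t[1] * 3600 + t[2] * 60 + t[3]
  if (NR > 1) printf "%s +%dm\n", $5, (secs - prev) / 60
  prev = secs
}')
echo "$gaps"
```

Against this run, that flags machine-config as the largest single consumer (+26m from the previous bump), roughly matching the breakdown above.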

Comment 6 Michael Gugino 2021-03-24 19:31:42 UTC
Moving to API server team to investigate.

Comment 7 Michal Fojtik 2021-04-23 20:06:09 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority.

If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with the bug assignee before you do that.

Comment 8 W. Trevor King 2021-04-23 22:37:46 UTC
Raising to medium, based on CI prevalence:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=336h&type=junit&search=cluster+upgrade+should+be+fast' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.6-e2e-aws-upgrade-rollback (all) - 14 runs, 100% failed, 21% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-ovn-upgrade (all) - 158 runs, 100% failed, 7% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-aws-upgrade (all) - 163 runs, 87% failed, 4% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-ovn-upgrade (all) - 56 runs, 98% failed, 29% of failures match = 29% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-azure-upgrade (all) - 56 runs, 93% failed, 4% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-ovn-upgrade (all) - 28 runs, 100% failed, 4% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.7-e2e-aws-upgrade-rollback (all) - 14 runs, 7% failed, 100% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-ovn-upgrade (all) - 155 runs, 32% failed, 20% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade (all) - 89 runs, 25% failed, 5% of failures match = 1% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-upgrade (all) - 56 runs, 27% failed, 13% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-gcp-ovn-upgrade (all) - 56 runs, 64% failed, 28% of failures match = 18% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-gcp-upgrade (all) - 58 runs, 33% failed, 47% of failures match = 16% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-ovirt-upgrade (all) - 28 runs, 54% failed, 113% of failures match = 61% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-vsphere-upgrade (all) - 14 runs, 71% failed, 10% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 244 runs, 100% failed, 54% of failures match = 54% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 215 runs, 87% failed, 4% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 55 runs, 100% failed, 47% of failures match = 47% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 56 runs, 100% failed, 82% of failures match = 82% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 14 runs, 93% failed, 31% of failures match = 29% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 32 runs, 100% failed, 38% of failures match = 38% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 14 runs, 93% failed, 15% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 225 runs, 98% failed, 0% of failures match = 0% impact
periodic-ci-openshift-release-master-nightly-4.6-e2e-metal-ipi-upgrade (all) - 19 runs, 95% failed, 6% of failures match = 5% impact
periodic-ci-openshift-release-master-nightly-4.6-upgrade-from-stable-4.5-e2e-metal-ipi-upgrade (all) - 19 runs, 95% failed, 28% of failures match = 26% impact
periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade (all) - 22 runs, 9% failed, 50% of failures match = 5% impact
periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-metal-ipi-upgrade (all) - 22 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 87 runs, 93% failed, 2% of failures match = 2% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 85 runs, 99% failed, 10% of failures match = 9% impact
pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp-upgrade-five-control-plane-replicas (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
...
pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade (all) - 41 runs, 85% failed, 31% of failures match = 27% impact
rehearse-17519-periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
...
rehearse-17928-periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-ppc64le-4.7-to-4.8 (all) - 28 runs, 100% failed, 25% of failures match = 25% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-s390x-4.7-to-4.8 (all) - 28 runs, 100% failed, 25% of failures match = 25% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 202 runs, 73% failed, 1% of failures match = 1% impact
release-openshift-origin-installer-e2e-aws-upgrade (all) - 137 runs, 37% failed, 16% of failures match = 6% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.3-to-4.4-to-4.5-to-4.6-ci (all) - 14 runs, 36% failed, 20% of failures match = 7% impact
release-openshift-origin-installer-e2e-gcp-ovn-upgrade-4.5-stable-to-4.6-ci (all) - 56 runs, 32% failed, 17% of failures match = 5% impact
release-openshift-origin-installer-launch-aws (all) - 393 runs, 45% failed, 2% of failures match = 1% impact
release-openshift-origin-installer-launch-azure (all) - 135 runs, 49% failed, 2% of failures match = 1% impact
release-openshift-origin-installer-launch-gcp (all) - 985 runs, 58% failed, 0% of failures match = 0% impact
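
Lists like the one above can be filtered mechanically to the high-impact jobs. A rough sketch (the two sample lines are trimmed copies with shortened, hypothetical job names; pipe the real w3m output in instead) keeping only lines at or above an impact threshold:

```shell
# Rough sketch: keep only search.ci summary lines whose impact is at or
# above a threshold. Sample lines use shortened, hypothetical job names.
threshold=25
hits=$(printf '%s\n' \
  'sample-gcp-ovn-upgrade (all) - 56 runs, 100% failed, 82% of failures match = 82% impact' \
  'sample-aws-upgrade (all) - 215 runs, 87% failed, 4% of failures match = 3% impact' |
awk -v t="$threshold" '{
  impact = $(NF - 1)      # second-to-last field, e.g. "82%"
  sub(/%/, "", impact)
  if (impact + 0 >= t) print impact "%", $1
}')
echo "$hits"
```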

Comment 9 Michal Fojtik 2021-04-23 23:06:15 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 10 jamo luhrsen 2021-05-18 17:58:01 UTC
@wking, I noticed that this 'cluster upgrade should be fast' test case is passing periodically now in the ovn-upgrade jobs. Looking at two recent
passing runs with your bash-fu, I see the network took ~11m both times:

❯ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1394608429129011200/artifacts/e2e-aws-ovn-upgrade/openshift-e2e-test/build-log.txt | rg 'clusteroperator/.*versions:.*operator' | sed 's/^/  /' | rg 'network versions|autoscaler versions'
  May 18 12:20:15.304 I clusteroperator/cluster-autoscaler versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-104756
  May 18 12:31:56.582 I clusteroperator/network versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-104756

❯ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1394628146623418368/artifacts/e2e-aws-ovn-upgrade/openshift-e2e-test/build-log.txt | rg 'clusteroperator/.*versions:.*operator' | sed 's/^/  /' | rg 'network versions|autoscaler versions'
  May 18 13:44:18.063 I clusteroperator/cluster-autoscaler versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-121200
  May 18 13:55:31.023 I clusteroperator/network versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-121200



the two most recently failing jobs were ~16m and ~14m:

❯ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1394587518900375552/artifacts/e2e-aws-ovn-upgrade/openshift-e2e-test/build-log.txt | rg 'clusteroperator/.*versions:.*operator' | sed 's/^/  /' | rg 'network versions|autoscaler versions'
  May 18 11:02:14.270 I clusteroperator/cluster-autoscaler versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-092904
  May 18 11:18:49.986 I clusteroperator/network versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-092904

❯ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1394548277868761088/artifacts/e2e-aws-ovn-upgrade/openshift-e2e-test/build-log.txt | rg 'clusteroperator/.*versions:.*operator' | sed 's/^/  /' | rg 'network versions|autoscaler versions'
  May 18 08:14:33.545 I clusteroperator/cluster-autoscaler versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-065758
  May 18 08:28:49.018 I clusteroperator/network versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-065758



Now, comparing to non-OVN jobs, which fail this test periodically but with much less frequency: network took ~7m and ~8m in the last two failing cases:

❯ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1394646605461721088/artifacts/e2e-aws-upgrade/openshift-e2e-test/build-log.txt | rg 'clusteroperator/.*versions:.*operator' | sed 's/^/  /' | rg 'network versions|autoscaler versions'
  May 18 16:49:27.029 I clusteroperator/cluster-autoscaler versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-132557
  May 18 16:56:34.779 I clusteroperator/network versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-132557
❯ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1394608429045125120/artifacts/e2e-aws-upgrade/openshift-e2e-test/build-log.txt | rg 'clusteroperator/.*versions:.*operator' | sed 's/^/  /' | rg 'network versions|autoscaler versions'
  May 18 12:09:06.342 I clusteroperator/cluster-autoscaler versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-104756
  May 18 12:17:26.701 I clusteroperator/network versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-104756


and ~5m and ~12m in the last two passing cases:

❯ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1394587518837460992/artifacts/e2e-aws-upgrade/openshift-e2e-test/build-log.txt | rg 'clusteroperator/.*versions:.*operator' | sed 's/^/  /' | rg 'network versions|autoscaler versions'
  May 18 10:49:26.040 I clusteroperator/cluster-autoscaler versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-092904
  May 18 11:04:44.419 I clusteroperator/network versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-092904
❯ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1394274953615904768/artifacts/e2e-aws-upgrade/openshift-e2e-test/build-log.txt | rg 'clusteroperator/.*versions:.*operator' | sed 's/^/  /' | rg 'network versions|autoscaler versions'
  May 17 15:05:10.484 I clusteroperator/cluster-autoscaler versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-17-125107
  May 17 15:17:22.899 I clusteroperator/network versions: operator 4.7.11 -> 4.8.0-0.ci-2021-05-17-125107




If it's worth collecting more data so we can be more confident, we can, but my hunch is that OVN networking
is taking some 50-100% longer to come up compared to openshift-sdn.
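
As a rough cross-check on those durations (a sketch assuming GNU date is available), the delta for the first passing run above:

```shell
# Rough cross-check, assuming GNU date: minutes between the
# cluster-autoscaler and network version bumps from the first passing run.
start='May 18 12:20:15'
end='May 18 12:31:56'
mins=$(( ( $(date -d "$end" +%s) - $(date -d "$start" +%s) ) / 60 ))
echo "network finished ~${mins}m after cluster-autoscaler"
```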


I'll bring this up with the team, and maybe we need a separate BZ just for this case as I know there are
other operators you called out (e.g., kube-apiserver) as taking a while.

The upgrade timeout is 75m before this test gets marked as a failure. I don't know the history of that number,
but if it's semi-arbitrary and we decide that it's acceptable for OVN networking to take longer, then we
can bump that 75m by a handful of minutes, I guess.

Comment 11 W. Trevor King 2021-05-18 22:37:26 UTC
yeah, my impression is that 75m is fairly arbitrary.  But we should make a solid case before bumping it, to avoid getting slower and slower out of laziness ;).  Bug 1817075 and bug 1949677 will both also squeeze out a few minutes each.

Comment 12 Lukasz Szaszkiewicz 2021-05-19 10:46:41 UTC
Created attachment 1784787 [details]
kube-apis-termination-events

Comment 13 Lukasz Szaszkiewicz 2021-05-19 10:47:38 UTC
Created attachment 1784788 [details]
kube-apis-termination-events-2

Comment 14 Lukasz Szaszkiewicz 2021-05-19 11:00:01 UTC
For me the two most recently failing jobs were ~13m and ~14m:


1. https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1394587518900375552/build-log.txt


from the build-log.txt:
10:33:23.000 I ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator reason/OperatorVersionChanged clusteroperator/kube-apiserver version "raw-internal" changed from "4.7.11" to "4.8.0-0.ci-2021-05-18-092904"
10:46:34.037 I clusteroperator/kube-apiserver versions: kube-apiserver 1.20.0-beta.2 -> 1.21.0-rc.0, operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-092904

from the events:
10:33:23	openshift-kube-apiserver-operator	kube-apiserver-operator-status-controller-statussyncer_kube-apiserver	kube-apiserver-operator	OperatorVersionChanged	clusteroperator/kube-apiserver version "raw-internal" changed from "4.7.11" to "4.8.0-0.ci-2021-05-18-092904"
10:46:33	openshift-kube-apiserver-operator	kube-apiserver-operator-status-controller-statussyncer_kube-apiserver	kube-apiserver-operator	OperatorVersionChanged	clusteroperator/kube-apiserver version "operator" changed from "4.7.11" to "4.8.0-0.ci-2021-05-18-092904"
10:46:33	openshift-kube-apiserver-operator	kube-apiserver-operator-status-controller-statussyncer_kube-apiserver	kube-apiserver-operator	OperatorVersionChanged	clusteroperator/kube-apiserver version "kube-apiserver" changed from "1.20.0-beta.2" to "1.21.0-rc.0"



2. https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1394548277868761088/artifacts/e2e-aws-ovn-upgrade/openshift-e2e-test/build-log.txt


from the build-log.txt:
May 18 07:44:41.000 I ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator reason/OperatorVersionChanged clusteroperator/kube-apiserver version "raw-internal" changed from "4.7.11" to "4.8.0-0.ci-2021-05-18-065758"
May 18 07:58:39.483 I clusteroperator/kube-apiserver versions: kube-apiserver 1.20.0-beta.2 -> 1.21.0-rc.0, operator 4.7.11 -> 4.8.0-0.ci-2021-05-18-065758

from the events:
07:44:41	openshift-kube-apiserver-operator	kube-apiserver-operator-status-controller-statussyncer_kube-apiserver	kube-apiserver-operator	OperatorVersionChanged	clusteroperator/kube-apiserver version "raw-internal" changed from "4.7.11" to "4.8.0-0.ci-2021-05-18-065758"
07:58:39	openshift-kube-apiserver-operator	kube-apiserver-operator-status-controller-statussyncer_kube-apiserver	kube-apiserver-operator	OperatorVersionChanged	clusteroperator/kube-apiserver version "operator" changed from "4.7.11" to "4.8.0-0.ci-2021-05-18-065758"
07:58:39	openshift-kube-apiserver-operator	kube-apiserver-operator-status-controller-statussyncer_kube-apiserver	kube-apiserver-operator	OperatorVersionChanged	clusteroperator/kube-apiserver version "kube-apiserver" changed from "1.20.0-beta.2" to "1.21.0-rc.0"
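
A quick arithmetic check on the first run's events (a minimal bash sketch; the to_secs helper is hypothetical):

```shell
# Quick check of the ~13m figure: difference between the first and last
# OperatorVersionChanged timestamps of run 1. to_secs is a hypothetical
# helper converting HH:MM:SS to seconds (bash; 10# avoids octal surprises).
to_secs() {
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}
rollout=$(( ( $(to_secs 10:46:33) - $(to_secs 10:33:23) ) / 60 ))
echo "kube-apiserver rollout: ~${rollout}m"
```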



What is concerning is that processing pending requests always takes 60 seconds. I have attached two screenshots in which this can be seen. This is something I would like to investigate. I think I will create a separate BZ for that.

Also, the time difference between closing one server and starting termination of the next one is significant, around 2 minutes. This deserves further investigation as well.

Also, once https://github.com/openshift/cluster-kube-apiserver-operator/pull/1079 merges, the time needed to install a new version of the Kube API servers will be significantly increased on AWS.

Comment 15 jamo luhrsen 2021-05-20 20:54:57 UTC
@wking@redhat.com , I opened a ticket[0] to investigate the OVN network upgrade taking so long. I'm not sure it's as big of a deal
as I originally thought. You can read the comments in the ticket and maybe I can dig further from there.

[0] https://issues.redhat.com/browse/SDN-1867

Comment 16 jamo luhrsen 2021-06-04 18:33:27 UTC
According to @trozet, OVN upgrade should take a bit longer than the SDN upgrade. It looks like, on average, OVN is taking ~13m longer.
I know MCO has some ideas about decreasing their upgrade duration (see bz 1817075), but not sure that is coming any time soon.

The current limit is 75m; +13m is 88m. Here is a PR to make it 90m. Someone originally thought 75m was safe for an SDN upgrade
scenario, so I'm adding 15m to be "safe" for an OVN upgrade scenario. Although, in reality, the OVN upgrade jobs that I look at
are finishing just a few (maybe 5-ish) minutes past the 75m limit. But, to be safe...

  https://github.com/openshift/origin/pull/26202

Comment 17 Lukasz Szaszkiewicz 2021-06-10 08:51:52 UTC
https://github.com/openshift/kubernetes/pull/799 will speed up the upgrading of the API servers by an additional 3 minutes per revision.

Comment 19 jamo luhrsen 2021-07-13 21:49:58 UTC
The original PR did not work like I thought. An updated PR is here: https://github.com/openshift/origin/pull/26324

Comment 25 errata-xmlrpc 2021-10-18 17:29:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

