For example, [1]:

Apr 23 04:46:08.045: API "kubernetes-api-available-new-connections" was unreachable during disruption (AWS has a known issue: https://bugzilla.redhat.com/show_bug.cgi?id=1943804) for at least 1m4s of 1h2m12s (2%):

Apr 23 04:07:20.638 E kube-apiserver-new-connection kube-apiserver-new-connection started failing: Get "https://api.ci-op-rcbxmc0m-25656.ci.azure.devcluster.openshift.com:6443/api/v1/namespaces/default": dial tcp 20.75.119.245:6443: i/o timeout
Apr 23 04:07:21.637 - 13s E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests
...
Apr 23 04:14:20.637 E kube-apiserver-new-connection kube-apiserver-new-connection started failing: Get "https://api.ci-op-rcbxmc0m-25656.ci.azure.devcluster.openshift.com:6443/api/v1/namespaces/default": dial tcp 20.75.119.245:6443: i/o timeout
Apr 23 04:14:20.705 I kube-apiserver-new-connection kube-apiserver-new-connection started responding to GET requests
...

Here's kube-apiserver claiming it was done progressing at 4:05, a bit over a minute before the connectivity issues crop up:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1385428355993571328/build-log.txt | grep 'clusteroperator/kube-apiserver condition/Progressing'
INFO[2021-04-23T04:47:18Z] Apr 23 03:52:14.176 W clusteroperator/kube-apiserver condition/Progressing status/True reason/NodeInstaller changed: NodeInstallerProgressing: 3 nodes are at revision 7; 0 nodes have achieved new revision 8
INFO[2021-04-23T04:47:18Z] Apr 23 03:52:14.176 - 769s W clusteroperator/kube-apiserver condition/Progressing status/True reason/NodeInstallerProgressing: 3 nodes are at revision 7; 0 nodes have achieved new revision 8
INFO[2021-04-23T04:47:18Z] Apr 23 04:05:03.579 W clusteroperator/kube-apiserver condition/Progressing status/False reason/AsExpected changed: NodeInstallerProgressing: 3 nodes are at revision 8

The only other components twiddling Progressing around then are also Kube-control-plane things:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1385428355993571328/build-log.txt | grep '23 04:0[0-9]:.*condition/Progressing'
INFO[2021-04-23T04:47:18Z] Apr 23 04:04:05.006 W clusteroperator/kube-controller-manager condition/Progressing status/True reason/NodeInstaller changed: NodeInstallerProgressing: 3 nodes are at revision 7; 0 nodes have achieved new revision 8
INFO[2021-04-23T04:47:18Z] Apr 23 04:04:05.006 - 164s W clusteroperator/kube-controller-manager condition/Progressing status/True reason/NodeInstallerProgressing: 3 nodes are at revision 7; 0 nodes have achieved new revision 8
INFO[2021-04-23T04:47:18Z] Apr 23 04:04:05.007 W clusteroperator/kube-scheduler condition/Progressing status/True reason/NodeInstaller changed: NodeInstallerProgressing: 3 nodes are at revision 6; 0 nodes have achieved new revision 7
INFO[2021-04-23T04:47:18Z] Apr 23 04:04:05.007 - 229s W clusteroperator/kube-scheduler condition/Progressing status/True reason/NodeInstallerProgressing: 3 nodes are at revision 6; 0 nodes have achieved new revision 7
INFO[2021-04-23T04:47:18Z] Apr 23 04:05:03.579 W clusteroperator/kube-apiserver condition/Progressing status/False reason/AsExpected changed: NodeInstallerProgressing: 3 nodes are at revision 8
INFO[2021-04-23T04:47:18Z] Apr 23 04:06:49.895 W clusteroperator/kube-controller-manager condition/Progressing status/False reason/AsExpected changed: NodeInstallerProgressing: 3 nodes are at revision 9
INFO[2021-04-23T04:47:18Z] Apr 23 04:07:54.170 W clusteroperator/kube-scheduler condition/Progressing status/False reason/AsExpected changed: NodeInstallerProgressing: 3 nodes are at revision 7

There are some related tickets like bug 1928946, but my understanding is that those crop up when the machine-config rolls the nodes, while this issue is happening well before even networking starts to move.

Seems popular [2]:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Kubernetes+APIs+remain+available+for+new+connections&maxAge=96h&type=junit' | grep 'failures match' | sort | grep azure
periodic-ci-openshift-release-master-ci-4.7-e2e-azure-upgrade (all) - 16 runs, 25% failed, 25% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-ovn-upgrade (all) - 16 runs, 19% failed, 500% of failures match = 94% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-upgrade (all) - 16 runs, 56% failed, 22% of failures match = 13% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-compact-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 16 runs, 100% failed, 75% of failures match = 75% impact

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1385428355993571328
[2]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade
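As an aside on reading those search.ci rows: match rates over 100% look odd, but my understanding (an assumption about how search.ci counts, not something stated in its output) is that "N% of failures match" compares matching runs against job-failing runs, and a run can match on a flaked junit entry without the job itself failing, so the ratio can exceed 100%. A quick back-of-the-envelope check of the 94%-impact row, assuming 3 failed and 15 matching runs out of 16:

$ awk 'BEGIN {runs=16; failed=3; matching=15; printf "failed %.0f%%, %.0f%% of failures match = %.0f%% impact\n", 100*failed/runs, 100*matching/failed, 100*matching/runs}'
failed 19%, 500% of failures match = 94% impact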
In [1] at 04:07 all three apiservers are up and running. None shows network errors, which suggests that the request never reached the server.
So we suspect the Azure LB fronting the Kube API?

I linked the wrong TestGrid earlier; it should have been [1]. Checking a few other jobs:

* 4.8.0-0.ci-2021-04-30-171538 -> 4.8.0-0.ci-2021-04-30-212732 [2] has sporadic API connectivity issues throughout. This could be a misbehaving LB or some such.
* 4.8.0-0.ci-2021-04-30-171538 -> 4.8.0-0.ci-2021-04-30-212732 [3] has three rounds of API connectivity issues while kube-apiserver is Progressing=True, and three more while machine-config is rolling nodes. That could be an LB that is not respecting its target-polling rules, but seems more likely to be an API-server/kubelet pod-restart thing to me.
* 4.8.0-0.ci-2021-04-27-095812 -> 4.8.0-0.ci-2021-04-28-221518 [4] has the same pattern as [3].
* 4.8.0-0.ci-2021-04-27-075557 -> 4.8.0-0.ci-2021-04-27-095812 [5] has the same pattern as [3].
* 4.8.0-0.ci-2021-04-26-173158 -> 4.8.0-0.ci-2021-04-27-000602 [6] has the same pattern as [3].

So I dunno what happened with [2], but [3,4,5,6] seem to have a pretty consistent pattern of Kube API-server connectivity issues that are at least correlated with pod restarts. And both modes differ from comment 0's job, which had two blocks of API connectivity issues after kube-apiserver went back to Progressing=False. Let's drill into [3]:

...
May 01 04:00:58.307 E kube-apiserver-new-connection kube-apiserver-new-connection started failing: Get "https://api.ci-op-x2jck503-1158e.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/default": dial tcp 18.233.88.0:6443: connect: connection refused
May 01 04:00:59.300 - 1s E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests
May 01 04:00:59.314 I kube-apiserver-new-connection kube-apiserver-new-connection started responding to GET requests
May 01 04:01:03.307 E kube-apiserver-new-connection kube-apiserver-new-connection started failing: Get "https://api.ci-op-x2jck503-1158e.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/default": dial tcp 18.233.88.0:6443: connect: connection refused
May 01 04:01:04.301 - 1s E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests
May 01 04:01:04.313 I kube-apiserver-new-connection kube-apiserver-new-connection started responding to GET requests
...

So out from 4:00:58.307 to 4:00:59.314, and then again out from 4:01:03.307 to 4:01:04.313 (among other outages I won't dig into). Those are offset by five seconds, but otherwise almost match to millisecond precision (I dunno what the connectivity poll period is, but it's probably partly responsible for that alignment).

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1388328160080171008/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-kube-apiserver") | (.firstTimestamp // .metadata.creationTimestamp) + " " + (.count | tostring) + " " + .involvedObject.name + " " + .reason + ": " + .message' | grep 'T03:59\|T04:0' | sort
...
2021-05-01T03:59:02Z 1 revision-pruner-8-ip-10-0-239-74.ec2.internal Started: Started container pruner
...
2021-05-01T03:59:35Z 1 installer-8-ip-10-0-189-59.ec2.internal Started: Started container installer
2021-05-01T03:59:42Z 1 installer-8-ip-10-0-189-59.ec2.internal StaticPodInstallerCompleted: Successfully installed revision 8
2021-05-01T03:59:42Z 1 kube-apiserver-ip-10-0-189-59.ec2.internal Killing: Stopping container kube-apiserver
2021-05-01T03:59:42Z 1 kube-apiserver-ip-10-0-189-59.ec2.internal Killing: Stopping container kube-apiserver-cert-regeneration-controller
2021-05-01T03:59:42Z 1 kube-apiserver-ip-10-0-189-59.ec2.internal Killing: Stopping container kube-apiserver-cert-syncer
2021-05-01T03:59:42Z 1 kube-apiserver-ip-10-0-189-59.ec2.internal Killing: Stopping container kube-apiserver-check-endpoints
2021-05-01T03:59:42Z 1 kube-apiserver-ip-10-0-189-59.ec2.internal Killing: Stopping container kube-apiserver-insecure-readyz
2021-05-01T03:59:43Z null kube-apiserver-ip-10-0-189-59.ec2.internal TerminationPreShutdownHooksFinished: All pre-shutdown hooks have been finished
2021-05-01T03:59:43Z null kube-apiserver-ip-10-0-189-59.ec2.internal TerminationStart: Received signal to terminate, becoming unready, but keeping serving
2021-05-01T03:59:49Z 1 cert-regeneration-controller-lock LeaderElection: ip-10-0-239-74_02f2b687-97f4-44c4-9516-e3fb364deb85 became leader
2021-05-01T04:00:53Z null kube-apiserver-ip-10-0-189-59.ec2.internal TerminationMinimalShutdownDurationFinished: The minimal shutdown duration of 1m10s finished
2021-05-01T04:00:53Z null kube-apiserver-ip-10-0-189-59.ec2.internal TerminationStoppedServing: Server has stopped listening
2021-05-01T04:01:53Z null kube-apiserver-ip-10-0-189-59.ec2.internal TerminationGracefulTerminationFinished: All pending requests processed
2021-05-01T04:01:55Z 1 kube-apiserver-ip-10-0-189-59.ec2.internal Pulling: Pulling image "registry.ci.openshift.org/ocp/4.8-2021-04-30-212732@sha256:e4c7be2f0e8b1e9ef1ad9161061449ec1bdc6953a58f6d456971ee945a8d3197"
2021-05-01T04:02:05Z 1 kube-apiserver-ip-10-0-189-59.ec2.internal Created: Created container setup
2021-05-01T04:02:05Z 1 kube-apiserver-ip-10-0-189-59.ec2.internal Pulled: Container image "registry.ci.openshift.org/ocp/4.8-2021-04-30-212732@sha256:e4c7be2f0e8b1e9ef1ad9161061449ec1bdc6953a58f6d456971ee945a8d3197" already present on machine
2021-05-01T04:02:05Z 1 kube-apiserver-ip-10-0-189-59.ec2.internal Pulled: Successfully pulled image "registry.ci.openshift.org/ocp/4.8-2021-04-30-212732@sha256:e4c7be2f0e8b1e9ef1ad9161061449ec1bdc6953a58f6d456971ee945a8d3197" in 9.383893506s
2021-05-01T04:02:05Z 1 kube-apiserver-ip-10-0-189-59.ec2.internal Started: Started container setup
2021-05-01T04:02:06Z 1 kube-apiserver-ip-10-0-189-59.ec2.internal Created: Created container kube-apiserver
...

That really looks like kube-apiserver rolling out a new revision, and for some reason the graceful LB handoff we need to avoid connection issues is not happening. Unifying the two timelines:

* 03:59:43Z TerminationPreShutdownHooksFinished
* 03:59:43Z TerminationStart: Received signal to terminate, becoming unready, but keeping serving
* 04:00:53Z TerminationMinimalShutdownDurationFinished: The minimal shutdown duration of 1m10s finished
* 04:00:53Z TerminationStoppedServing: Server has stopped listening
* 04:00:58.307Z kube-apiserver-new-connection started failing... connection refused
* 04:00:59.314Z kube-apiserver-new-connection started responding to GET requests
* 04:01:03.307Z kube-apiserver-new-connection started failing... connection refused
* 04:01:04.313Z kube-apiserver-new-connection started responding to GET requests
* 04:01:53Z TerminationGracefulTerminationFinished: All pending requests processed

So my guesses for this failure mode:

* The LB isn't pinning the e2e polls to one target.
* When the server stops listening, its LB target is still active.
* A handful of subsequent requests get routed to the no-longer-listening server, and some others get routed to its still-listening siblings.
* Within a few seconds, the LB does notice that the server is no longer listening, and removes it from rotation.

Azure LB target rules haven't changed in over a year, and call for 2-poll transitions with polling every five seconds [7], so I dunno how to square that with the apparent behavior.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1388328160222777344
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1388328160080171008
[4]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1387603262152118272
[5]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1387240645646618624
[6]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1386878204156841984
[7]: https://github.com/openshift/installer/blame/dd6b3d59ffb8b437f22605d7ff2140635246c77d/data/data/azure/vnet/public-lb.tf#L179-L180
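As a follow-up to the jq query above: to pull just that termination timeline out of events.json for other runs, a narrower variant of the same query (same artifact layout and fields assumed; only the reason filter changes) would be something like:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1388328160080171008/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-kube-apiserver" and (.reason | startswith("Termination"))) | (.firstTimestamp // .metadata.creationTimestamp) + " " + .involvedObject.name + " " + .reason + ": " + .message' | sort

That skips the container-lifecycle noise and leaves just the TerminationStart / TerminationStoppedServing / TerminationGracefulTerminationFinished sequence to line up against the disruption intervals.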
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
Still popular:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Kubernetes+APIs+remain+available+for+new+connections&maxAge=96h&type=junit' | grep 'azure.*failures match' | sort
periodic-ci-openshift-release-master-ci-4.7-e2e-azure-upgrade (all) - 16 runs, 19% failed, 133% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-ovn-upgrade (all) - 16 runs, 19% failed, 400% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-upgrade (all) - 16 runs, 44% failed, 57% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 16 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-18937-periodic-ci-openshift-release-master-ci-4.9-e2e-azure-compact-upgrade (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
rehearse-18937-periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
rehearse-18937-periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-ovn-upgrade (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
rehearse-18937-periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.
Still a problem for 4.7->4.8 on OVN:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Kubernetes+APIs+remain+available+for+new+connections&maxAge=96h&type=junit' | grep 'failures match' | sort | grep azure
periodic-ci-openshift-release-master-ci-4.7-e2e-azure-upgrade (all) - 16 runs, 13% failed, 50% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-ovn-upgrade (all) - 16 runs, 19% failed, 467% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-upgrade (all) - 16 runs, 19% failed, 133% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 16 runs, 100% failed, 63% of failures match = 63% impact
pull-ci-openshift-kubernetes-release-4.8-e2e-azure-upgrade (all) - 6 runs, 50% failed, 33% of failures match = 17% impact
@wking Please don't add random stats about the test. This test is a general smoke test and can have many reasons to fail (network, MCO, kubelet, apiserver, cloud provider, to name a few). This bug is about "dial tcp 20.75.119.245:6443: i/o timeout" only. We cannot hunt random ghosts; we must be specific. A dial tcp timeout is something on the network layer.
(In reply to Michal Fojtik from comment #6)
> The information can be, for example, that the problem still occurs...

This is what my stats are showing, right? If I've guessed the wrong component, we can change the component, but I don't think "leave the bug to rot out, so further CI (and presumably in-the-wild) failures continue" is what we want. Sounds like you think this belongs to the Networking component?
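On the request to be specific about the dial-timeout signature: a sketch of a narrower CI-search query keyed on the error string itself (this assumes search.ci treats the search parameter as a regex and that the error text lands in junit output, neither of which the results above confirm):

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=dial+tcp+.*:6443:+i/o+timeout&maxAge=96h&type=junit' | grep 'failures match' | sort | grep azure

That would separate the i/o-timeout flavor this comment is about from connection-refused and other disruption modes that the broader "Kubernetes APIs remain available" search lumps together.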
Still happening pretty frequently in affected jobs:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Kubernetes+APIs+remain+available+for+new+connections&maxAge=168h&type=junit' | grep 'failures match' | sort | grep azure
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 7 runs, 57% failed, 25% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 7 runs, 100% failed, 43% of failures match = 43% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-ovn-upgrade (all) - 7 runs, 100% failed, 29% of failures match = 29% impact

Example job [1], with hiccups like:

Jul 29 04:20:46.154 - 106ms E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests

Mostly during the machine-config node-rolling portion of the update, but with some blips earlier in the update as well [2] (see the grep sketch below).

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1420589973467303936
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1420589973467303936/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/junit/e2e-intervals_20210729-041926.html
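For cross-checking where those blips land relative to the machine-config roll, a sketch along the lines of the comment 0 grep (this assumes [1]'s build-log.txt sits at the same storage path layout as the run grepped in comment 0, and that the monitor's clusteroperator and disruption lines both show up there, as the Progressing lines did in comment 0):

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1420589973467303936/build-log.txt | grep 'clusteroperator/machine-config condition/Progressing\|kube-apiserver-new-connection is not responding'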
Still lots of these:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Kubernetes+APIs+remain+available+for+new+connections&maxAge=168h&type=junit' | grep 'failures match' | sort | grep azure
periodic-ci-openshift-release-master-ci-4.10-e2e-azure-upgrade (all) - 7 runs, 86% failed, 17% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-ovn-upgrade (all) - 7 runs, 100% failed, 29% of failures match = 29% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade (all) - 7 runs, 71% failed, 20% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-ovn-upgrade (all) - 2 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 7 runs, 100% failed, 71% of failures match = 71% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-ovn-upgrade (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade (all) - 7 runs, 86% failed, 17% of failures match = 14% impact
pull-ci-openshift-machine-config-operator-release-4.8-e2e-azure-upgrade (all) - 6 runs, 67% failed, 25% of failures match = 17% impact
rehearse-17197-periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-launch-azure (all) - 37 runs, 46% failed, 41% of failures match = 19% impact
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
Closing due to inactivity/time lapsed.