For example, [1]:

Apr 23 04:46:08.045: API "kubernetes-api-available-new-connections" was unreachable during disruption (AWS has a known issue: https://bugzilla.redhat.com/show_bug.cgi?id=1943804) for at least 1m4s of 1h2m12s (2%):

Apr 23 04:07:20.638 E kube-apiserver-new-connection kube-apiserver-new-connection started failing: Get "https://api.ci-op-rcbxmc0m-25656.ci.azure.devcluster.openshift.com:6443/api/v1/namespaces/default": dial tcp 20.75.119.245:6443: i/o timeout
Apr 23 04:07:21.637 - 13s E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests
...
Apr 23 04:14:20.637 E kube-apiserver-new-connection kube-apiserver-new-connection started failing: Get "https://api.ci-op-rcbxmc0m-25656.ci.azure.devcluster.openshift.com:6443/api/v1/namespaces/default": dial tcp 20.75.119.245:6443: i/o timeout
Apr 23 04:14:20.705 I kube-apiserver-new-connection kube-apiserver-new-connection started responding to GET requests
...

Here's kube-apiserver claiming it was done progressing at 4:05, a bit over a minute before the connectivity issues crop up:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1385428355993571328/build-log.txt | grep 'clusteroperator/kube-apiserver condition/Progressing'
INFO[2021-04-23T04:47:18Z] Apr 23 03:52:14.176 W clusteroperator/kube-apiserver condition/Progressing status/True reason/NodeInstaller changed: NodeInstallerProgressing: 3 nodes are at revision 7; 0 nodes have achieved new revision 8
INFO[2021-04-23T04:47:18Z] Apr 23 03:52:14.176 - 769s W clusteroperator/kube-apiserver condition/Progressing status/True reason/NodeInstallerProgressing: 3 nodes are at revision 7; 0 nodes have achieved new revision 8
INFO[2021-04-23T04:47:18Z] Apr 23 04:05:03.579 W clusteroperator/kube-apiserver condition/Progressing status/False reason/AsExpected changed: NodeInstallerProgressing: 3 nodes are at revision 8

The only other components twiddling Progressing around then are also Kube-control-plane things:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1385428355993571328/build-log.txt | grep '23 04:0[0-9]:.*condition/Progressing'
INFO[2021-04-23T04:47:18Z] Apr 23 04:04:05.006 W clusteroperator/kube-controller-manager condition/Progressing status/True reason/NodeInstaller changed: NodeInstallerProgressing: 3 nodes are at revision 7; 0 nodes have achieved new revision 8
INFO[2021-04-23T04:47:18Z] Apr 23 04:04:05.006 - 164s W clusteroperator/kube-controller-manager condition/Progressing status/True reason/NodeInstallerProgressing: 3 nodes are at revision 7; 0 nodes have achieved new revision 8
INFO[2021-04-23T04:47:18Z] Apr 23 04:04:05.007 W clusteroperator/kube-scheduler condition/Progressing status/True reason/NodeInstaller changed: NodeInstallerProgressing: 3 nodes are at revision 6; 0 nodes have achieved new revision 7
INFO[2021-04-23T04:47:18Z] Apr 23 04:04:05.007 - 229s W clusteroperator/kube-scheduler condition/Progressing status/True reason/NodeInstallerProgressing: 3 nodes are at revision 6; 0 nodes have achieved new revision 7
INFO[2021-04-23T04:47:18Z] Apr 23 04:05:03.579 W clusteroperator/kube-apiserver condition/Progressing status/False reason/AsExpected changed: NodeInstallerProgressing: 3 nodes are at revision 8
INFO[2021-04-23T04:47:18Z] Apr 23 04:06:49.895 W clusteroperator/kube-controller-manager condition/Progressing status/False reason/AsExpected changed: NodeInstallerProgressing: 3 nodes are at revision 9
INFO[2021-04-23T04:47:18Z] Apr 23 04:07:54.170 W clusteroperator/kube-scheduler condition/Progressing status/False reason/AsExpected changed: NodeInstallerProgressing: 3 nodes are at revision 7

There are some related tickets like bug 1928946, but my understanding is that those crop up when the machine-config rolls the nodes, while this issue is happening well before even networking starts to move.

Seems popular [2]:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Kubernetes+APIs+remain+available+for+new+connections&maxAge=96h&type=junit' | grep 'failures match' | sort | grep azure
periodic-ci-openshift-release-master-ci-4.7-e2e-azure-upgrade (all) - 16 runs, 25% failed, 25% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-ovn-upgrade (all) - 16 runs, 19% failed, 500% of failures match = 94% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-upgrade (all) - 16 runs, 56% failed, 22% of failures match = 13% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-compact-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 16 runs, 100% failed, 75% of failures match = 75% impact

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1385428355993571328
[2]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade
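As an aside on reading those search.ci rows: match rates over 100% look odd, but my understanding (an assumption about how search.ci counts, not something stated in its output) is that "N% of failures match" compares matching runs against job-failing runs, and a run can match on a flaked junit entry without the job itself failing, so the ratio can exceed 100%. A quick back-of-the-envelope check of the 94%-impact row, assuming 3 failed and 15 matching runs out of 16:

$ awk 'BEGIN {runs=16; failed=3; matching=15; printf "failed %.0f%%, %.0f%% of failures match = %.0f%% impact\n", 100*failed/runs, 100*matching/failed, 100*matching/runs}'
failed 19%, 500% of failures match = 94% impact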
In [1] at 04:07 all three apiservers are up and running. None shows network errors, which suggests that the request never reached the server.
So we suspect the Azure LB fronting the Kube API?

I linked the wrong TestGrid earlier; it should have been [1]. Checking a few other jobs:

* 4.8.0-0.ci-2021-04-30-171538 -> 4.8.0-0.ci-2021-04-30-212732 [2] has sporadic API connectivity issues throughout. This could be a misbehaving LB or some such.
* 4.8.0-0.ci-2021-04-30-171538 -> 4.8.0-0.ci-2021-04-30-212732 [3] has three rounds of API connectivity issues while kube-apiserver is Progressing=True, and three more while machine-config is rolling nodes. That could be an LB that is not respecting its target-polling rules, but seems more likely to be an API-server/kubelet pod-restart thing to me.
* 4.8.0-0.ci-2021-04-27-095812 -> 4.8.0-0.ci-2021-04-28-221518 [4] has the same pattern as [3].
* 4.8.0-0.ci-2021-04-27-075557 -> 4.8.0-0.ci-2021-04-27-095812 [5] has the same pattern as [3].
* 4.8.0-0.ci-2021-04-26-173158 -> 4.8.0-0.ci-2021-04-27-000602 [6] has the same pattern as [3].

So I dunno what happened with [2], but [3,4,5,6] seem to have a pretty consistent pattern of Kube API-server connectivity issues that are at least correlated with pod restarts. And both modes differ from comment 0's job, which had two blocks of API connectivity issues after kube-apiserver went back to Progressing=False. Let's drill into [3]:

...
May 01 04:00:58.307 E kube-apiserver-new-connection kube-apiserver-new-connection started failing: Get "https://api.ci-op-x2jck503-1158e.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/default": dial tcp 18.233.88.0:6443: connect: connection refused
May 01 04:00:59.300 - 1s E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests
May 01 04:00:59.314 I kube-apiserver-new-connection kube-apiserver-new-connection started responding to GET requests
May 01 04:01:03.307 E kube-apiserver-new-connection kube-apiserver-new-connection started failing: Get "https://api.ci-op-x2jck503-1158e.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/default": dial tcp 18.233.88.0:6443: connect: connection refused
May 01 04:01:04.301 - 1s E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests
May 01 04:01:04.313 I kube-apiserver-new-connection kube-apiserver-new-connection started responding to GET requests
...

So out from 4:00:58.307 to 4:00:59.314, and then again out from 4:01:03.307 to 4:01:04.313 (among other outages I won't dig into). Those are offset by five seconds, but otherwise almost match to millisecond precision (I dunno what the connectivity poll period is, but it's probably partly responsible for that alignment).

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1388328160080171008/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-kube-apiserver") | (.firstTimestamp // .metadata.creationTimestamp) + " " + (.count | tostring) + " " + .involvedObject.name + " " + .reason + ": " + .message' | grep 'T03:59\|T04:0' | sort
...
2021-05-01T03:59:02Z 1 revision-pruner-8-ip-10-0-239-74.ec2.internal Started: Started container pruner
...
2021-05-01T03:59:35Z 1 installer-8-ip-10-0-189-59.ec2.internal Started: Started container installer
2021-05-01T03:59:42Z 1 installer-8-ip-10-0-189-59.ec2.internal StaticPodInstallerCompleted: Successfully installed revision 8
2021-05-01T03:59:42Z 1 kube-apiserver-ip-10-0-189-59.ec2.internal Killing: Stopping container kube-apiserver
2021-05-01T03:59:42Z 1 kube-apiserver-ip-10-0-189-59.ec2.internal Killing: Stopping container kube-apiserver-cert-regeneration-controller
2021-05-01T03:59:42Z 1 kube-apiserver-ip-10-0-189-59.ec2.internal Killing: Stopping container kube-apiserver-cert-syncer
2021-05-01T03:59:42Z 1 kube-apiserver-ip-10-0-189-59.ec2.internal Killing: Stopping container kube-apiserver-check-endpoints
2021-05-01T03:59:42Z 1 kube-apiserver-ip-10-0-189-59.ec2.internal Killing: Stopping container kube-apiserver-insecure-readyz
2021-05-01T03:59:43Z null kube-apiserver-ip-10-0-189-59.ec2.internal TerminationPreShutdownHooksFinished: All pre-shutdown hooks have been finished
2021-05-01T03:59:43Z null kube-apiserver-ip-10-0-189-59.ec2.internal TerminationStart: Received signal to terminate, becoming unready, but keeping serving
2021-05-01T03:59:49Z 1 cert-regeneration-controller-lock LeaderElection: ip-10-0-239-74_02f2b687-97f4-44c4-9516-e3fb364deb85 became leader
2021-05-01T04:00:53Z null kube-apiserver-ip-10-0-189-59.ec2.internal TerminationMinimalShutdownDurationFinished: The minimal shutdown duration of 1m10s finished
2021-05-01T04:00:53Z null kube-apiserver-ip-10-0-189-59.ec2.internal TerminationStoppedServing: Server has stopped listening
2021-05-01T04:01:53Z null kube-apiserver-ip-10-0-189-59.ec2.internal TerminationGracefulTerminationFinished: All pending requests processed
2021-05-01T04:01:55Z 1 kube-apiserver-ip-10-0-189-59.ec2.internal Pulling: Pulling image "registry.ci.openshift.org/ocp/4.8-2021-04-30-212732@sha256:e4c7be2f0e8b1e9ef1ad9161061449ec1bdc6953a58f6d456971ee945a8d3197"
2021-05-01T04:02:05Z 1 kube-apiserver-ip-10-0-189-59.ec2.internal Created: Created container setup
2021-05-01T04:02:05Z 1 kube-apiserver-ip-10-0-189-59.ec2.internal Pulled: Container image "registry.ci.openshift.org/ocp/4.8-2021-04-30-212732@sha256:e4c7be2f0e8b1e9ef1ad9161061449ec1bdc6953a58f6d456971ee945a8d3197" already present on machine
2021-05-01T04:02:05Z 1 kube-apiserver-ip-10-0-189-59.ec2.internal Pulled: Successfully pulled image "registry.ci.openshift.org/ocp/4.8-2021-04-30-212732@sha256:e4c7be2f0e8b1e9ef1ad9161061449ec1bdc6953a58f6d456971ee945a8d3197" in 9.383893506s
2021-05-01T04:02:05Z 1 kube-apiserver-ip-10-0-189-59.ec2.internal Started: Started container setup
2021-05-01T04:02:06Z 1 kube-apiserver-ip-10-0-189-59.ec2.internal Created: Created container kube-apiserver
...

That really looks like kube-apiserver rolling out a new revision, and for some reason the graceful LB handoff we need to avoid connection issues is not happening. Unifying the two timelines:

* 03:59:43Z TerminationPreShutdownHooksFinished
* 03:59:43Z TerminationStart: Received signal to terminate, becoming unready, but keeping serving
* 04:00:53Z TerminationMinimalShutdownDurationFinished: The minimal shutdown duration of 1m10s finished
* 04:00:53Z TerminationStoppedServing: Server has stopped listening
* 04:00:58.307Z kube-apiserver-new-connection started failing... connection refused
* 04:00:59.314Z kube-apiserver-new-connection started responding to GET requests
* 04:01:03.307Z kube-apiserver-new-connection started failing... connection refused
* 04:01:04.313Z kube-apiserver-new-connection started responding to GET requests
* 04:01:53Z TerminationGracefulTerminationFinished: All pending requests processed

So my guesses for this failure mode:

* The LB isn't pinning the e2e polls to one target.
* When the server stops listening, its LB target is still active.
* A handful of subsequent requests get routed to the no-longer-listening server, and some others get routed to its still-listening siblings.
* Within a few seconds, the LB does notice that the server is no longer listening, and removes it from rotation.

Azure LB target rules haven't changed in over a year, and call for 2-poll transitions with polling every five seconds [7], so I dunno how to square that with the apparent behavior.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1388328160222777344
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1388328160080171008
[4]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1387603262152118272
[5]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1387240645646618624
[6]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1386878204156841984
[7]: https://github.com/openshift/installer/blame/dd6b3d59ffb8b437f22605d7ff2140635246c77d/data/data/azure/vnet/public-lb.tf#L179-L180
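As a follow-up to the jq query above: to pull just that termination timeline out of events.json for other runs, a narrower variant of the same query (same artifact layout and fields assumed; only the reason filter changes) would be something like:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade/1388328160080171008/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-kube-apiserver" and (.reason | startswith("Termination"))) | (.firstTimestamp // .metadata.creationTimestamp) + " " + .involvedObject.name + " " + .reason + ": " + .message' | sort

That skips the container-lifecycle noise and leaves just the TerminationStart / TerminationStoppedServing / TerminationGracefulTerminationFinished sequence to line up against the disruption intervals.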
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
Still popular:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Kubernetes+APIs+remain+available+for+new+connections&maxAge=96h&type=junit' | grep 'azure.*failures match' | sort
periodic-ci-openshift-release-master-ci-4.7-e2e-azure-upgrade (all) - 16 runs, 19% failed, 133% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-ovn-upgrade (all) - 16 runs, 19% failed, 400% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-upgrade (all) - 16 runs, 44% failed, 57% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 16 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-18937-periodic-ci-openshift-release-master-ci-4.9-e2e-azure-compact-upgrade (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
rehearse-18937-periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
rehearse-18937-periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-ovn-upgrade (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
rehearse-18937-periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.
Still a problem for 4.7->4.8 on OVN:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Kubernetes+APIs+remain+available+for+new+connections&maxAge=96h&type=junit' | grep 'failures match' | sort | grep azure
periodic-ci-openshift-release-master-ci-4.7-e2e-azure-upgrade (all) - 16 runs, 13% failed, 50% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-ovn-upgrade (all) - 16 runs, 19% failed, 467% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-upgrade (all) - 16 runs, 19% failed, 133% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 16 runs, 100% failed, 63% of failures match = 63% impact
pull-ci-openshift-kubernetes-release-4.8-e2e-azure-upgrade (all) - 6 runs, 50% failed, 33% of failures match = 17% impact
@wking Please don't add random stats about the test. This test is a general smoke test and can have many reasons to fail (network, MCO, kubelet, apiserver, cloud provider, to name a few). This bug is about "dial tcp 20.75.119.245:6443: i/o timeout" only. We cannot hunt random ghosts; we must be specific. A dial tcp timeout is something on the network layer.
(In reply to Michal Fojtik from comment #6)
> The information can be, for example, that the problem still occurs...

This is what my stats are showing, right? If I've guessed the wrong component, we can change the component, but I don't think "leave the bug to rot out, so further CI (and presumably in-the-wild) failures continue" is what we want. Sounds like you think this belongs to the Networking component?
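On the request to be specific about the dial-timeout signature: a sketch of a narrower CI-search query keyed on the error string itself (this assumes search.ci treats the search parameter as a regex and that the error text lands in junit output, neither of which the results above confirm):

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=dial+tcp+.*:6443:+i/o+timeout&maxAge=96h&type=junit' | grep 'failures match' | sort | grep azure

That would separate the i/o-timeout flavor this comment is about from connection-refused and other disruption modes that the broader "Kubernetes APIs remain available" search lumps together.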
Still happening pretty frequently in affected jobs:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Kubernetes+APIs+remain+available+for+new+connections&maxAge=168h&type=junit' | grep 'failures match' | sort | grep azure
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 7 runs, 57% failed, 25% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 7 runs, 100% failed, 43% of failures match = 43% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-ovn-upgrade (all) - 7 runs, 100% failed, 29% of failures match = 29% impact

Example job [1], with hiccups like:

Jul 29 04:20:46.154 - 106ms E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests

Mostly during the machine-config node-rolling portion of the update, but with some blips earlier in the update as well [2] (see the grep sketch below).

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1420589973467303936
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1420589973467303936/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/junit/e2e-intervals_20210729-041926.html
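For cross-checking where those blips land relative to the machine-config roll, a sketch along the lines of the comment 0 grep (this assumes [1]'s build-log.txt sits at the same storage path layout as the run grepped in comment 0, and that the monitor's clusteroperator and disruption lines both show up there, as the Progressing lines did in comment 0):

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1420589973467303936/build-log.txt | grep 'clusteroperator/machine-config condition/Progressing\|kube-apiserver-new-connection is not responding'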
Still lots of these:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Kubernetes+APIs+remain+available+for+new+connections&maxAge=168h&type=junit' | grep 'failures match' | sort | grep azure
periodic-ci-openshift-release-master-ci-4.10-e2e-azure-upgrade (all) - 7 runs, 86% failed, 17% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-ovn-upgrade (all) - 7 runs, 100% failed, 29% of failures match = 29% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade (all) - 7 runs, 71% failed, 20% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-azure-ovn-upgrade (all) - 2 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 7 runs, 100% failed, 71% of failures match = 71% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-ovn-upgrade (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade (all) - 7 runs, 86% failed, 17% of failures match = 14% impact
pull-ci-openshift-machine-config-operator-release-4.8-e2e-azure-upgrade (all) - 6 runs, 67% failed, 25% of failures match = 17% impact
rehearse-17197-periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-launch-azure (all) - 37 runs, 46% failed, 41% of failures match = 19% impact
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
Closing due to inactivity/time lapsed.