Description of problem:
Recent e2e-agnostic-upgrade runs in the MCO repo have been failing due to mandatory tests failing, such as:
disruption_tests: [sig-api-machinery] Kubernetes APIs remain available for new connections
Aug 23 16:53:37.649 E kube-apiserver-new-connection kube-apiserver-new-connection started failing: Get "https://api.ci-op-3gxydq9m-57c36.ci.azure.devcluster.openshift.com:6443/api/v1/namespaces/default": dial tcp 22.214.171.124:6443: i/o timeout
Aug 23 16:53:37.649 - 15s E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests
Aug 23 16:53:52.649 I kube-apiserver-new-connection kube-apiserver-new-connection started responding to GET requests
github.com/openshift/origin/test/extended/util/disruption/controlplane.(*availableTest).Test(0xc001d3ad20, 0xc001c28dc0, 0xc002d6d500, 0x2)
created by k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do
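For context, this disruption test polls the apiserver over fresh connections and reports when those GET requests start failing and when they recover. Below is a minimal sketch of that style of check; it is not the actual openshift/origin disruption monitor, and the endpoint URL, timeout, and poll interval are illustrative placeholders (authentication and cluster CA configuration are also omitted).

// Sketch only: poll the apiserver with a brand-new connection per request,
// the way the "remain available for new connections" tests exercise availability.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Placeholder endpoint; a real check would target the cluster's own apiserver URL.
	url := "https://api.example.devcluster.openshift.com:6443/api/v1/namespaces/default"

	// DisableKeepAlives forces a fresh TCP connection for every request, so a
	// failure here means *new* connections are being rejected or timing out
	// (e.g. at the platform load balancer), not that an existing connection went stale.
	client := &http.Client{
		Timeout:   15 * time.Second,
		Transport: &http.Transport{DisableKeepAlives: true},
	}

	available := true
	for {
		resp, err := client.Get(url)
		if err != nil {
			if available {
				fmt.Printf("%s E kube-apiserver-new-connection started failing: %v\n",
					time.Now().UTC().Format(time.RFC3339), err)
				available = false
			}
		} else {
			resp.Body.Close()
			if !available {
				fmt.Printf("%s I kube-apiserver-new-connection started responding to GET requests\n",
					time.Now().UTC().Format(time.RFC3339))
				available = true
			}
		}
		time.Sleep(1 * time.Second)
	}
}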
There are a few other disruption_tests failing as well that could be related:
- disruption_tests: [sig-api-machinery] Kubernetes APIs remain available with reused connections
- disruption_tests: [sig-api-machinery] OpenShift APIs remain available for new connections
- disruption_tests: [sig-api-machinery] OAuth APIs remain available for new connections
A few CI job links from MCO PRs:
Possibly a dup of bug 1955333? Certainly in the same Azure + Kube-reachability space.
e2e-agnostic-* jobs could run on any platform, but for the MCO they're currently Azure. And Kube-reachability issues are often platform-specific, involving pod-restart logic vs. platform-specific load balancer implementation. So tweaking the title here to include "Azure".
Setting priority to high because the upgrade job is blocking MCO PRs, and as a result most of the PRs are not getting merged.
(In reply to W. Trevor King from comment #2)
> e2e-agnostic-* jobs could run on any platform, but for the MCO they're
> currently Azure. And Kube-reachability issues are often
> platform-specific, involving pod-restart logic vs. platform-specific load
> balancer implementation. So tweaking the title here to include "Azure".
Not sure if this is just an Azure-specific issue. I later started an upgrade test on GCP, https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2722/pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade/1430172245858193408, where these tests failed too.
Duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1845414. Until there are very specific new insights about the root cause, there is no value in new BZs. There are a thousand different reasons why the API can become unavailable for some time, in many components: kube-apiserver itself, but also the node, MCO, CRI-O, and the cloud infrastructure. I don't see a triage attempt in this BZ that points to one of those.
*** This bug has been marked as a duplicate of bug 1845414 ***