Bug 1997057

Summary: Azure upgrade test failing due to [sig-api-machinery] Kubernetes APIs remain available for new connections
Product: OpenShift Container Platform
Reporter: Sinny Kumari <skumari>
Component: kube-apiserver
Assignee: Stefan Schimanski <sttts>
Status: CLOSED DUPLICATE
QA Contact: Xingxing Xia <xxia>
Severity: unspecified
Priority: high
Version: 4.9
CC: aos-bugs, mfojtik, wking, xxia
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: tag-ci
Last Closed: 2021-08-25 10:37:54 UTC
Type: Bug

Description Sinny Kumari 2021-08-24 10:33:13 UTC
Description of problem:
Recent e2e-agnostic-upgrade runs in the MCO repo have been failing because mandatory tests are failing, such as

disruption_tests: [sig-api-machinery] Kubernetes APIs remain available for new connections

"""
Aug 23 16:53:37.649 E kube-apiserver-new-connection kube-apiserver-new-connection started failing: Get "https://api.ci-op-3gxydq9m-57c36.ci.azure.devcluster.openshift.com:6443/api/v1/namespaces/default": dial tcp 20.106.10.105:6443: i/o timeout
Aug 23 16:53:37.649 - 15s   E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests
Aug 23 16:53:52.649 I kube-apiserver-new-connection kube-apiserver-new-connection started responding to GET requests

github.com/openshift/origin/test/extended/util/disruption/controlplane.(*availableTest).Test(0xc001d3ad20, 0xc001c28dc0, 0xc002d6d500, 0x2)
	github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:127 +0x528
github.com/openshift/origin/test/extended/util/disruption.(*chaosMonkeyAdapter).Test(0xc001dc11d0, 0xc0018ada28)
	github.com/openshift/origin/test/extended/util/disruption/disruption.go:190 +0x3be
k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do.func1(0xc0018ada28, 0xc00168ef70)
	k8s.io/kubernetes.0-rc.0/test/e2e/chaosmonkey/chaosmonkey.go:90 +0x6d
created by k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do
	k8s.io/kubernetes.0-rc.0/test/e2e/chaosmonkey/chaosmonkey.go:87 +0xc9

"""

There are a few other disruption_tests failing as well which could be related (the new- vs. reused-connection distinction is sketched below):
- disruption_tests: [sig-api-machinery] Kubernetes APIs remain available with reused connections
- disruption_tests: [sig-api-machinery] OpenShift APIs remain available for new connections
- disruption_tests: [sig-api-machinery] OAuth APIs remain available for new connections
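
The difference between the "new connections" and "reused connections" variants, as I understand it (hedged sketch, not the origin implementation): the former forces a fresh TCP+TLS handshake per probe while the latter keeps one long-lived connection, so a load balancer that drains established connections but rejects or drops new ones can fail the two tests independently. The OpenShift and OAuth variants follow the same pattern against different API endpoints. Roughly:

// Illustrative only; the package and function names here are mine, not from openshift/origin.
package disruptionsketch

import (
	"net/http"
	"time"
)

// probeClients returns the two client flavors used by the monitors sketched
// above: one that dials a fresh connection per GET and one that reuses a
// single keep-alive connection across GETs.
func probeClients() (newConn, reusedConn *http.Client) {
	newConn = &http.Client{
		Timeout:   15 * time.Second,
		Transport: &http.Transport{DisableKeepAlives: true}, // new TCP+TLS handshake every probe
	}
	reusedConn = &http.Client{
		Timeout: 15 * time.Second,
		Transport: &http.Transport{
			MaxIdleConnsPerHost: 1, // keep-alives on: later probes ride the same connection
		},
	}
	return newConn, reusedConn
}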

A few CI job links from MCO PRs:
- https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2704/pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade/1429824441595990016
- https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2706/pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade/1429821059154055168

Comment 1 W. Trevor King 2021-08-24 14:02:12 UTC
Possibly a dup of bug 1955333?  Certainly in the same Azure + Kube-reachability space.

Comment 2 W. Trevor King 2021-08-24 14:04:33 UTC
e2e-agnostic-* jobs could run on any platform.  But for the MCO, they're currently Azure [1].  And Kube-reachability issues are often platform-specific, involving pod-restart logic vs. platform-specific load balancer implementation.  So tweaking the title here to include "Azure".

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2704/pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade/1429824441595990016#1:build-log.txt%3A19

Comment 3 Sinny Kumari 2021-08-25 10:02:55 UTC
Setting priority to high because the upgrade job is blocking MCO PRs and, as a result, most of the PRs are not getting merged.

Comment 4 Sinny Kumari 2021-08-25 10:08:40 UTC
(In reply to W. Trevor King from comment #2)
> e2e-agnostic-* jobs could run on any platform.  But for the MCO, they're
> currently Azure [1].  And Kube-reachability issues are often
> platform-specific, involving pod-restart logic vs. platform-specific load
> balancer implementation.  So tweaking the title here to include "Azure".

Not sure if this is just an Azure-specific issue. I later started an upgrade test on GCP, https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2722/pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade/1430172245858193408, where these tests failed too.

Comment 5 Stefan Schimanski 2021-08-25 10:37:54 UTC
Duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1845414. Until there are very specific new insights about the root causes, there is no value in new BZs. There are a thousand different reasons why the API can become unavailable for some time, in many components: kube-apiserver itself, but also the node, the MCO, CRI-O, and the cloud infra. I don't see a triage attempt in this BZ pointing to one of those.

*** This bug has been marked as a duplicate of bug 1845414 ***