Bug 1865857

Summary: e2e tests fail because the API server restarts after installation and does not remain available - release blocker
Product: OpenShift Container Platform Reporter: Jan Safranek <jsafrane>
Component: kube-apiserver    Assignee: Stefan Schimanski <sttts>
Status: CLOSED DUPLICATE QA Contact: Ke Wang <kewang>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 4.6    CC: aos-bugs, ccoleman, mfojtik, xxia
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Last Closed: 2020-08-21 11:48:26 UTC Type: Bug

Description Jan Safranek 2020-08-04 11:52:02 UTC
aws-ebs-csi-driver-operator-master-e2e-operator jobs flake a lot; we have not been able to merge a PR for 5 days:

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-aws-csi-4.6
https://prow.ci.openshift.org/pr-history/?org=openshift&repo=aws-ebs-csi-driver-operator&pr=70

They typically fail because a test gets "connection refused" from the API server. Using https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_aws-ebs-csi-driver-operator/70/pull-ci-openshift-aws-ebs-csi-driver-operator-master-e2e-operator/1290552613552525312 as an example:

fail [@/k8s.io/kubernetes/test/e2e/e2e.go:262]: Unexpected error:
...

    Get https://api.ci-op-5wcn99xm-79224.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: dial tcp 54.236.162.103:6443: connect: connection refused
occurred
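
For context, the request that fails is an ordinary node list. Roughly the following client-go call (a minimal sketch, not the actual openshift-tests code) produces the same URL, field selector and resourceVersion=0 included:

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Build a client from the local kubeconfig; the e2e suite does the
        // equivalent through its own test context.
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client := kubernetes.NewForConfigOrDie(cfg)

        // Issues the same request as the failing URL above:
        // GET /api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0
        nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{
            FieldSelector:   "spec.unschedulable=false",
            ResourceVersion: "0",
        })
        if err != nil {
            // With the apiserver down this surfaces as "connect: connection refused".
            panic(err)
        }
        fmt.Printf("%d schedulable nodes\n", len(nodes.Items))
    }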

From logs and events of this CI job:

1. Installation completes:
time="2020-08-04T06:50:09Z" level=info msg="Install complete!"

2. e2e tests start:
openshift-tests version: v4.1.0-3029-g5279aaf
I0804 06:50:12.796634     235 test_context.go:427] Tolerating taints "node-role.kubernetes.io/master" when considering if nodes are ready
...

3. *Something* restarts the API servers:
Event at 06:50:21
Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready" to "NodeControllerDegraded: All master nodes are ready\nStaticPodsDegraded: pod/kube-apiserver-ip-10-0-192-228.ec2.internal container \"kube-apiserver\" is not ready: PodInitializing: \nStaticPodsDegraded: pod/kube-apiserver-ip-10-0-192-228.ec2.internal container \"kube-apiserver-cert-regeneration-controller\" is not ready: PodInitializing: \nStaticPodsDegraded: pod/kube-apiserver-ip-10-0-192-228.ec2.internal container \"kube-apiserver-cert-syncer\" is not ready: PodInitializing: \nStaticPodsDegraded: pod/kube-apiserver-ip-10-0-192-228.ec2.internal container \"kube-apiserver-check-endpoints\" is not ready: PodInitializing: \nStaticPodsDegraded: pod/kube-apiserver-ip-10-0-192-228.ec2.internal container \"kube-apiserver-insecure-readyz\" is not ready: PodInitializing: "

4. Things start failing:
Aug 04 06:51:21.347 W ns/openshift-apiserver deployment/apiserver reason/ConnectivityOutageDetected Connectivity outage detected: kubernetes-apiserver-endpoint-10-0-192-228-6443: failed to establish a TCP connection to 10.0.192.228:6443: dial tcp 10.0.192.228:6443: connect: connection refused (2 times)

5. And ultimately, the tests fail at 06:51:33.414 too:
Get https://api.ci-op-5wcn99xm-79224.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: dial tcp 54.236.162.103:6443: connect: connection refused
occurred
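
To quantify how long such an outage lasts, a throwaway poller along these lines can run next to the tests (a sketch; the URL is a placeholder for the real api.<cluster> endpoint, and any HTTP response, even 401, counts as "up"):

    package main

    import (
        "crypto/tls"
        "fmt"
        "net/http"
        "time"
    )

    func main() {
        // Hypothetical URL; substitute the cluster's real API endpoint.
        const url = "https://api.example.com:6443/readyz"

        // TLS verification is skipped: we only measure TCP/HTTP reachability,
        // so only a transport error such as "connection refused" counts as down.
        client := &http.Client{
            Timeout:   3 * time.Second,
            Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
        }

        var outageStart time.Time
        for range time.Tick(time.Second) {
            resp, err := client.Get(url)
            if err == nil {
                resp.Body.Close()
            }
            switch {
            case err != nil && outageStart.IsZero():
                outageStart = time.Now()
                fmt.Printf("%s outage begins: %v\n", outageStart.Format(time.RFC3339), err)
            case err == nil && !outageStart.IsZero():
                fmt.Printf("%s outage over, lasted %s\n",
                    time.Now().Format(time.RFC3339), time.Since(outageStart).Round(time.Second))
                outageStart = time.Time{}
            }
        }
    }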

I do not know *why* the API server is restarted; this event looks interesting:

06:51:58	openshift-kube-apiserver-operator	kube-apiserver-operator-installer-controller	
NodeCurrentRevisionChanged   Updated node "ip-10-0-192-228.ec2.internal" from revision 4 to 6 because static pod is ready
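
The revision jump can be inspected directly on the static-pod operator resource. Something like this dynamic-client sketch (reading status.nodeStatuses of kubeapiservers.operator.openshift.io/cluster; field names as I recall them, worth double-checking) shows current vs. target revision per master:

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/dynamic"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client := dynamic.NewForConfigOrDie(cfg)

        // The static-pod operator tracks per-node revisions in its status.
        gvr := schema.GroupVersionResource{
            Group: "operator.openshift.io", Version: "v1", Resource: "kubeapiservers",
        }
        obj, err := client.Resource(gvr).Get(context.TODO(), "cluster", metav1.GetOptions{})
        if err != nil {
            panic(err)
        }

        statuses, _, _ := unstructured.NestedSlice(obj.Object, "status", "nodeStatuses")
        for _, s := range statuses {
            m, _ := s.(map[string]interface{})
            // currentRevision != targetRevision indicates an in-flight rollout.
            fmt.Printf("node=%v current=%v target=%v\n",
                m["nodeName"], m["currentRevision"], m["targetRevision"])
        }
    }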

Was the installation reported as complete too early? Upstream e2e tests (incl. conformance) don't retry on failure and expect the cluster to be stable.
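
If the install really is reported complete mid-rollout, one crude guard would be to wait for the kube-apiserver clusteroperator to settle before starting the suite. A sketch, polling the conditions via the dynamic client (not an existing openshift-tests feature):

    package main

    import (
        "context"
        "fmt"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/dynamic"
        "k8s.io/client-go/tools/clientcmd"
    )

    // settled is true when the operator reports Available=True,
    // Progressing=False and Degraded=False.
    func settled(obj *unstructured.Unstructured) bool {
        want := map[string]string{"Available": "True", "Progressing": "False", "Degraded": "False"}
        conds, _, _ := unstructured.NestedSlice(obj.Object, "status", "conditions")
        for _, c := range conds {
            m, _ := c.(map[string]interface{})
            t, _ := m["type"].(string)
            s, _ := m["status"].(string)
            if expected, ok := want[t]; ok && s != expected {
                return false
            }
        }
        return true
    }

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client := dynamic.NewForConfigOrDie(cfg)
        gvr := schema.GroupVersionResource{
            Group: "config.openshift.io", Version: "v1", Resource: "clusteroperators",
        }

        // Poll until kube-apiserver settles; a real gate would also require it
        // to STAY settled for a few minutes before declaring the install done.
        for {
            obj, err := client.Resource(gvr).Get(context.TODO(), "kube-apiserver", metav1.GetOptions{})
            if err == nil && settled(obj) {
                fmt.Println("kube-apiserver clusteroperator is settled")
                return
            }
            time.Sleep(10 * time.Second)
        }
    }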

Comment 1 Clayton Coleman 2020-08-07 17:37:45 UTC
I'm bumping this to urgent; this is happening in a broad range of 4.6 jobs (FIPS has it particularly bad): https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-aws-fips-4.6

In general, connection refused during a test run is not allowed; it means that graceful rollout of the apiserver isn't functioning as expected.

Comment 2 Stefan Schimanski 2020-08-21 11:48:26 UTC
Marking as a duplicate of our AWS upgrade-disruption BZ. This one does not add anything new.

*** This bug has been marked as a duplicate of bug 1845412 ***

Comment 3 Stefan Schimanski 2020-08-21 11:59:11 UTC
Marking as duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1868750. That one is more specific.

*** This bug has been marked as a duplicate of bug 1868750 ***