aws-ebs-csi-driver-operator-master-e2e-operator flakes a lot; we have not been able to merge a PR for 5 days:

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-aws-csi-4.6
https://prow.ci.openshift.org/pr-history/?org=openshift&repo=aws-ebs-csi-driver-operator&pr=70

The jobs typically fail because a test gets "connection refused" from the API server. Using https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_aws-ebs-csi-driver-operator/70/pull-ci-openshift-aws-ebs-csi-driver-operator-master-e2e-operator/1290552613552525312 as an example:

  fail [@/k8s.io/kubernetes/test/e2e/e2e.go:262]: Unexpected error:
  ...
  Get https://api.ci-op-5wcn99xm-79224.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: dial tcp 54.236.162.103:6443: connect: connection refused
  occurred

From the logs and events of this CI job:

1. Installation completes:

  time="2020-08-04T06:50:09Z" level=info msg="Install complete!"

2. e2e tests start:

  openshift-tests version: v4.1.0-3029-g5279aaf
  I0804 06:50:12.796634     235 test_context.go:427] Tolerating taints "node-role.kubernetes.io/master" when considering if nodes are ready
  ...

3. *something* restarts the API servers. Event at 06:50:21:

  Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready" to "NodeControllerDegraded: All master nodes are ready\nStaticPodsDegraded: pod/kube-apiserver-ip-10-0-192-228.ec2.internal container \"kube-apiserver\" is not ready: PodInitializing: \nStaticPodsDegraded: pod/kube-apiserver-ip-10-0-192-228.ec2.internal container \"kube-apiserver-cert-regeneration-controller\" is not ready: PodInitializing: \nStaticPodsDegraded: pod/kube-apiserver-ip-10-0-192-228.ec2.internal container \"kube-apiserver-cert-syncer\" is not ready: PodInitializing: \nStaticPodsDegraded: pod/kube-apiserver-ip-10-0-192-228.ec2.internal container \"kube-apiserver-check-endpoints\" is not ready: PodInitializing: \nStaticPodsDegraded: pod/kube-apiserver-ip-10-0-192-228.ec2.internal container \"kube-apiserver-insecure-readyz\" is not ready: PodInitializing: "

4. Things start failing:

  Aug 04 06:51:21.347 W ns/openshift-apiserver deployment/apiserver reason/ConnectivityOutageDetected Connectivity outage detected: kubernetes-apiserver-endpoint-10-0-192-228-6443: failed to establish a TCP connection to 10.0.192.228:6443: dial tcp 10.0.192.228:6443: connect: connection refused (2 times)

5. And ultimately, the tests fail at 06:51:33.414 too:

  Get https://api.ci-op-5wcn99xm-79224.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: dial tcp 54.236.162.103:6443: connect: connection refused
  occurred

I do not know *why* the API server is restarted. This event looks interesting:

  06:51:58 openshift-kube-apiserver-operator kube-apiserver-operator-installer-controller NodeCurrentRevisionChanged Updated node "ip-10-0-192-228.ec2.internal" from revision 4 to 6 because static pod is ready

Was the installation reported as complete too early? Upstream e2e tests (incl. conformance) don't retry on failure, and they expect the cluster to be stable.
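The timeline above (install reported complete at 06:50:09, a kube-apiserver static-pod rollout still in progress seconds later) suggests one possible workaround on the test side: require a quiet period of consecutive successful API probes before starting e2e. This is only a sketch of that idea, not the actual openshift-tests behavior; the function name, thresholds, and probe interface are all my own assumptions.

```python
import time

def wait_until_stable(probe, consecutive_ok=30, interval=0.0, max_probes=600):
    """Return True once `probe` succeeds `consecutive_ok` times in a row.

    `probe` is any zero-argument callable returning True when the API
    server answered (e.g. a GET against /readyz). A single failure --
    such as "connection refused" while a static pod restarts -- resets
    the streak, so we only proceed once the rollout has settled.
    """
    streak = 0
    for _ in range(max_probes):
        if probe():
            streak += 1
            if streak >= consecutive_ok:
                return True
        else:
            streak = 0  # rollout still in progress; start the streak over
        if interval:
            time.sleep(interval)
    return False  # never stabilized within the probe budget

# Simulated probe: the apiserver flaps during a restart, then stays up.
results = iter([True, True, False, True, False] + [True] * 40)
print(wait_until_stable(lambda: next(results), consecutive_ok=10))  # True
```

The key property is that any failure resets the counter, so a rollout that is "mostly up" but still dropping connections never satisfies the stability gate.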
I'm bumping this to urgent; it is happening in a broad range of 4.6 jobs (fips has it particularly bad):

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-aws-fips-4.6

In general, a "connection refused" during a test run is not allowed, and it means the graceful rollout of the apiserver isn't functioning as expected.
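The "graceful rollout" expectation can be checked mechanically: probe the endpoint continuously and collapse any run of refused connections into a disruption window; with a graceful rollout the list of windows should be empty. A rough sketch of that reduction (the sample format and function name are mine, not the actual disruption monitor's implementation):

```python
def outage_windows(samples):
    """Collapse (timestamp, ok) probe samples into [(start, end), ...] outages.

    Any contiguous run of failed probes (e.g. "connect: connection
    refused" while a kube-apiserver static pod restarts) becomes one
    window; a graceful apiserver rollout should produce no windows.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                   # outage begins
        elif ok and start is not None:
            windows.append((start, ts))  # outage ends at first success
            start = None
    if start is not None:
        windows.append((start, samples[-1][0]))  # still down at end of run
    return windows

# Probes spanning 06:51:21..06:51:33 (seconds only), failing mid-rollout:
samples = [(21, True), (22, False), (23, False), (24, True), (33, False)]
print(outage_windows(samples))  # [(22, 24), (33, 33)]
```

Asserting `outage_windows(samples) == []` at the end of a job would turn "connection refused is not allowed" from a flake into a first-class test failure with timestamps.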
Marking as a duplicate of our AWS upgrade disruption BZ; this one does not add anything new. *** This bug has been marked as a duplicate of bug 1845412 ***
Marking as duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1868750. That one is more specific. *** This bug has been marked as a duplicate of bug 1868750 ***