aws-ebs-csi-driver-operator-master-e2e-operator flakes a lot; we have not been able to merge a PR for 5 days:

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-aws-csi-4.6
https://prow.ci.openshift.org/pr-history/?org=openshift&repo=aws-ebs-csi-driver-operator&pr=70

The jobs typically fail because a test gets "connection refused" from the API server. Using https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_aws-ebs-csi-driver-operator/70/pull-ci-openshift-aws-ebs-csi-driver-operator-master-e2e-operator/1290552613552525312 as an example:

  fail [@/k8s.io/kubernetes/test/e2e/e2e.go:262]: Unexpected error:
  ...
  Get https://api.ci-op-5wcn99xm-79224.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: dial tcp 54.236.162.103:6443: connect: connection refused
  occurred

From the logs and events of this CI job:

1. Installation completes:

  time="2020-08-04T06:50:09Z" level=info msg="Install complete!"

2. e2e tests start:

  openshift-tests version: v4.1.0-3029-g5279aaf
  I0804 06:50:12.796634     235 test_context.go:427] Tolerating taints "node-role.kubernetes.io/master" when considering if nodes are ready
  ...

3. *something* restarts the API servers. Event at 06:50:21:

  Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready" to "NodeControllerDegraded: All master nodes are ready\nStaticPodsDegraded: pod/kube-apiserver-ip-10-0-192-228.ec2.internal container \"kube-apiserver\" is not ready: PodInitializing: \nStaticPodsDegraded: pod/kube-apiserver-ip-10-0-192-228.ec2.internal container \"kube-apiserver-cert-regeneration-controller\" is not ready: PodInitializing: \nStaticPodsDegraded: pod/kube-apiserver-ip-10-0-192-228.ec2.internal container \"kube-apiserver-cert-syncer\" is not ready: PodInitializing: \nStaticPodsDegraded: pod/kube-apiserver-ip-10-0-192-228.ec2.internal container \"kube-apiserver-check-endpoints\" is not ready: PodInitializing: \nStaticPodsDegraded: pod/kube-apiserver-ip-10-0-192-228.ec2.internal container \"kube-apiserver-insecure-readyz\" is not ready: PodInitializing: "

4. Things start failing:

  Aug 04 06:51:21.347 W ns/openshift-apiserver deployment/apiserver reason/ConnectivityOutageDetected Connectivity outage detected: kubernetes-apiserver-endpoint-10-0-192-228-6443: failed to establish a TCP connection to 10.0.192.228:6443: dial tcp 10.0.192.228:6443: connect: connection refused (2 times)

5. And ultimately, the tests fail at 06:51:33.414 too:

  Get https://api.ci-op-5wcn99xm-79224.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: dial tcp 54.236.162.103:6443: connect: connection refused
  occurred

I do not know *why* the API server is restarted. This event looks interesting:

  06:51:58 openshift-kube-apiserver-operator kube-apiserver-operator-installer-controller NodeCurrentRevisionChanged Updated node "ip-10-0-192-228.ec2.internal" from revision 4 to 6 because static pod is ready

Was the installation reported as complete too early? Upstream e2e tests (incl. conformance) don't retry on failure, and they expect the cluster to be stable.
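The timeline above (install reported complete at 06:50:09, a kube-apiserver static-pod rollout still in progress seconds later) suggests one possible workaround on the test side: require a quiet period of consecutive successful API probes before starting e2e. This is only a sketch of that idea, not the actual openshift-tests behavior; the function name, thresholds, and probe interface are all my own assumptions.

```python
import time

def wait_until_stable(probe, consecutive_ok=30, interval=0.0, max_probes=600):
    """Return True once `probe` succeeds `consecutive_ok` times in a row.

    `probe` is any zero-argument callable returning True when the API
    server answered (e.g. a GET against /readyz). A single failure --
    such as "connection refused" while a static pod restarts -- resets
    the streak, so we only proceed once the rollout has settled.
    """
    streak = 0
    for _ in range(max_probes):
        if probe():
            streak += 1
            if streak >= consecutive_ok:
                return True
        else:
            streak = 0  # rollout still in progress; start the streak over
        if interval:
            time.sleep(interval)
    return False  # never stabilized within the probe budget

# Simulated probe: the apiserver flaps during a restart, then stays up.
results = iter([True, True, False, True, False] + [True] * 40)
print(wait_until_stable(lambda: next(results), consecutive_ok=10))  # True
```

The key property is that any failure resets the counter, so a rollout that is "mostly up" but still dropping connections never satisfies the stability gate.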
I'm bumping this to urgent; it is happening in a broad range of 4.6 jobs (fips has it particularly bad):

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-aws-fips-4.6

In general, a "connection refused" during a test run is not allowed, and it means the graceful rollout of the apiserver isn't functioning as expected.
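The "graceful rollout" expectation can be checked mechanically: probe the endpoint continuously and collapse any run of refused connections into a disruption window; with a graceful rollout the list of windows should be empty. A rough sketch of that reduction (the sample format and function name are mine, not the actual disruption monitor's implementation):

```python
def outage_windows(samples):
    """Collapse (timestamp, ok) probe samples into [(start, end), ...] outages.

    Any contiguous run of failed probes (e.g. "connect: connection
    refused" while a kube-apiserver static pod restarts) becomes one
    window; a graceful apiserver rollout should produce no windows.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                   # outage begins
        elif ok and start is not None:
            windows.append((start, ts))  # outage ends at first success
            start = None
    if start is not None:
        windows.append((start, samples[-1][0]))  # still down at end of run
    return windows

# Probes spanning 06:51:21..06:51:33 (seconds only), failing mid-rollout:
samples = [(21, True), (22, False), (23, False), (24, True), (33, False)]
print(outage_windows(samples))  # [(22, 24), (33, 33)]
```

Asserting `outage_windows(samples) == []` at the end of a job would turn "connection refused is not allowed" from a flake into a first-class test failure with timestamps.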
Marking as a duplicate of our AWS upgrade disruption BZ; this one does not add anything new. *** This bug has been marked as a duplicate of bug 1845412 ***
Marking as duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1868750. That one is more specific. *** This bug has been marked as a duplicate of bug 1868750 ***