The release-openshift-ocp-installer-e2e-aws-csi-4.6 CI job flakes occasionally, often with "connection refused" errors.

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-aws-csi-4.6

Concrete job run: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-csi-4.6/1314588983040151552

Version-Release number of selected component (if applicable): 4.6, 4.7
There are several theories:

1. openshift-tests runs "kubectl exec ..." to inject data into a container. The first such kubectl call discovers the APIs available on the API server, i.e. hundreds of rapid API calls. Something on the way to the API server may not like that (such as the GCE load balancer).
1.1 Depending on the openshift-tests pod and how it is run, kubectl may probe for APIs on *every* single execution. This would make the situation much worse.
1.2 Even if kubectl caches API discovery, there may be several parallel "first" "kubectl exec" calls while the cache is not built yet.
In both cases, not using "kubectl exec" should help (see the sketch below).

2. OCP is not actually ready to serve requests when openshift-tests starts, and it may reconfigure / restart components while the tests are running. Postponing the openshift-tests start by a minute or two should help.
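For illustration of what "not using kubectl exec" could look like: a minimal sketch that execs into a pod with client-go's typed REST client and the remotecommand SPDY executor instead of shelling out to kubectl, so no API discovery burst is triggered. The helper name execInPod, the kubeconfig path and the namespace/pod/container/command values are hypothetical, not taken from openshift-tests.

// execsketch.go: exec a command in a running pod via client-go instead of
// shelling out to "kubectl exec", so no API discovery burst is triggered.
package main

import (
	"bytes"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	restclient "k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/remotecommand"
)

// execInPod runs cmd in the given container and returns its stdout.
// Hypothetical helper, not part of openshift-tests.
func execInPod(config *restclient.Config, client kubernetes.Interface,
	namespace, pod, container string, cmd []string) (string, error) {

	// Build the exec subresource URL with the typed REST client;
	// this is a single, well-known endpoint, no discovery needed.
	req := client.CoreV1().RESTClient().Post().
		Resource("pods").
		Namespace(namespace).
		Name(pod).
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: container,
			Command:   cmd,
			Stdout:    true,
			Stderr:    true,
		}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
	if err != nil {
		return "", err
	}

	var stdout, stderr bytes.Buffer
	err = exec.Stream(remotecommand.StreamOptions{
		Stdout: &stdout,
		Stderr: &stderr,
	})
	if err != nil {
		return "", fmt.Errorf("exec failed: %v, stderr: %s", err, stderr.String())
	}
	return stdout.String(), nil
}

func main() {
	// Load a kubeconfig the same way kubectl would (path is illustrative).
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	out, err := execInPod(config, client, "default", "test-pod", "test-container",
		[]string{"/bin/sh", "-c", "echo hello > /mnt/test/data"})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}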
Update on 1.1: this theory is wrong. $HOME is set correctly in the e2e-aws-csi jobs, so kubectl does cache discovered APIs.
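For reference, the caching kubectl relies on is the on-disk discovery cache that client-go exposes: when $HOME is set, repeated invocations read cached API groups from disk instead of re-discovering them against the API server. A minimal sketch of the same mechanism; the cache paths and TTL here are illustrative and the exact layout varies by kubectl version.

// Sketch of the on-disk discovery cache kubectl uses via client-go.
package main

import (
	"fmt"
	"path/filepath"
	"time"

	diskcached "k8s.io/client-go/discovery/cached/disk"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	home := homedir.HomeDir()
	kubeconfig := filepath.Join(home, ".kube", "config") // illustrative path

	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}

	// Cache directories similar to what kubectl uses under ~/.kube/cache
	// (exact defaults differ between kubectl versions).
	discoveryCacheDir := filepath.Join(home, ".kube", "cache", "discovery")
	httpCacheDir := filepath.Join(home, ".kube", "cache", "http")

	dc, err := diskcached.NewCachedDiscoveryClientForConfig(
		config, discoveryCacheDir, httpCacheDir, 10*time.Minute)
	if err != nil {
		panic(err)
	}

	// The first call populates the cache; subsequent runs hit the disk cache
	// until the TTL expires.
	groups, err := dc.ServerGroups()
	if err != nil {
		panic(err)
	}
	for _, g := range groups.Groups {
		fmt.Println(g.Name)
	}
}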
*** Bug 1915539 has been marked as a duplicate of this bug. ***
*** Bug 1917678 has been marked as a duplicate of this bug. ***
*** Bug 1917570 has been marked as a duplicate of this bug. ***
*** Bug 1927709 has been marked as a duplicate of this bug. ***
Checked recent CI runs:

* release-openshift-ocp-installer-e2e-aws-csi-4.7 is much better than release-openshift-ocp-installer-e2e-aws-csi-4.6 and does not flake on "connection refused" or similar errors.
* duplicate https://bugzilla.redhat.com/show_bug.cgi?id=1927709: "[sig-storage] In-tree Volumes [Driver: cinder] [Testpattern: Dynamic PV (default fs)] fsgroupchangepolicy (OnRootMismatch)[LinuxOnly], pod created with an initial fsgroup, new pod fsgroup applied to volume contents" "matched 0.05% of failing runs"
* duplicate https://bugzilla.redhat.com/show_bug.cgi?id=1917570: "matched 0.05% of failing runs"
* duplicate https://bugzilla.redhat.com/show_bug.cgi?id=1915539: "matched 0.03% of failing runs"

The tests did not flake in any -periodic- jobs except for the ovn, s390x and ppc64le ones.

To sum it up, it looks like this is fixed in 4.7.