Bug 1890131

Summary: e2e-aws-csi job flakes on connection refused
Product: OpenShift Container Platform
Component: Storage
Sub component: Storage
Version: 4.7
Target Release: 4.7.0
Target Milestone: ---
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: unspecified
Reporter: Jan Safranek <jsafrane>
Assignee: Jan Safranek <jsafrane>
QA Contact: Qin Ping <piqin>
CC: aos-bugs, apavel, chaoyang, jack.ottofaro, sttts
Hardware: Unspecified
OS: Unspecified
Type: Bug
Doc Type: If docs needed, set a value
Last Closed: 2021-02-17 17:26:30 UTC

Description Jan Safranek 2020-10-21 13:48:59 UTC
The release-openshift-ocp-installer-e2e-aws-csi-4.6 CI job flakes occasionally, often with "connection refused" errors.

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-aws-csi-4.6

Concrete job run:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-csi-4.6/1314588983040151552

Version-Release number of selected component (if applicable):
4.6, 4.7

Comment 1 Jan Safranek 2020-10-21 13:56:13 UTC
There are several theories:

1. openshift-tests runs "kubectl exec ..." to inject data into a container. The first such kubectl call discovers the APIs available on the API server, i.e. it issues hundreds of rapid API calls. Something on the way to the API server (such as the GCE load balancer) may not like that.

  1.1 Depending on the openshift-tests pod and how it is run, kubectl may probe for the APIs on *every* single invocation (e.g. if its discovery cache is not persisted). That would make the situation much worse.
  1.2 Even if kubectl caches API discovery, several "kubectl exec" calls may run in parallel as the "first" call, before the cache is built.

  In both cases, not using "kubectl exec" should help (see the sketch after this list).

2. OCP is not actually ready to serve requests when openshift-tests starts, and it may reconfigure / restart components while the tests are running. Postponing the openshift-tests start by a minute or two should help (see the readiness-wait sketch below).
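For illustration, the API discovery burst from theory 1 can be observed locally with plain kubectl; this is only a sketch for reproducing the symptom, not part of the CI job, and the namespace/pod names are placeholders:

  # Drop kubectl's local discovery cache, then run a single exec with request
  # logging enabled; every "GET https://..." line at -v=6 is one API call.
  rm -rf ~/.kube/cache/discovery ~/.kube/cache/http
  kubectl -v=6 exec -n <namespace> <pod> -- true 2>&1 | grep -c 'GET https://'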
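Theory 2 could be mitigated by waiting for the API server to report readiness before openshift-tests starts. A minimal sketch, assuming the standard /readyz endpoint and an arbitrary 5-minute budget:

  # Poll the kube-apiserver readiness endpoint before launching the tests.
  for i in $(seq 60); do
      oc get --raw='/readyz' >/dev/null 2>&1 && break
      sleep 5
  done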

Comment 2 Jan Safranek 2020-10-23 15:29:15 UTC
Update on 1.1: this theory is wrong. $HOME is set correctly in e2e-aws-csi jobs. kubectl caches discovered APIs.
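For the record, kubectl keeps its discovery and HTTP caches under $HOME/.kube/cache (unless --cache-dir overrides it), so a quick check that the cache is actually populated after the first call looks like:

  # Non-empty directories here mean discovery results are being reused
  # instead of re-fetched on every kubectl invocation.
  ls ~/.kube/cache/discovery ~/.kube/cache/http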

Comment 3 Jan Safranek 2021-01-15 15:23:14 UTC
*** Bug 1915539 has been marked as a duplicate of this bug. ***

Comment 4 Jan Safranek 2021-01-19 10:05:39 UTC
*** Bug 1917678 has been marked as a duplicate of this bug. ***

Comment 5 Jan Safranek 2021-01-19 15:00:51 UTC
*** Bug 1917570 has been marked as a duplicate of this bug. ***

Comment 6 Jan Safranek 2021-02-11 14:38:06 UTC
*** Bug 1927709 has been marked as a duplicate of this bug. ***

Comment 7 Jan Safranek 2021-02-17 17:26:30 UTC
Checked recent CI runs.

* release-openshift-ocp-installer-e2e-aws-csi-4.7 is much better than release-openshift-ocp-installer-e2e-aws-csi-4.6 and it does not flake on "connection refused" or similar errors.
* duplicate https://bugzilla.redhat.com/show_bug.cgi?id=1927709: "[sig-storage] In-tree Volumes [Driver: cinder] [Testpattern: Dynamic PV (default fs)] fsgroupchangepolicy (OnRootMismatch)[LinuxOnly], pod created with an initial fsgroup, new pod fsgroup applied to volume contents"
  "matched 0.05% of failing runs"
* duplicate https://bugzilla.redhat.com/show_bug.cgi?id=1917570: "matched 0.05% of failing runs"
* duplicate https://bugzilla.redhat.com/show_bug.cgi?id=1915539: "matched 0.03% of failing runs"

The tests did not flake in any -periodic- jobs except for ovn, s390x and ppc64le.

To sum it up, it looks like it's fixed in 4.7.