The release-openshift-ocp-installer-e2e-aws-csi-4.6 CI job flakes occasionally, often with "connection refused" errors.

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-aws-csi-4.6

Concrete job run: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-csi-4.6/1314588983040151552

Version-Release number of selected component (if applicable): 4.6, 4.7
There are several theories:

1. openshift-tests runs "kubectl exec ..." to inject data into a container. The first such kubectl call discovers the APIs available on the API server, i.e. hundreds of rapid API calls. Something on the way to the API server may not like that (such as the GCE load balancer).
1.1 Depending on the openshift-tests pod and how it is run, kubectl may probe for APIs on *every* single execution. This would make the situation much worse.
1.2 Even if kubectl caches API discovery, there may be several parallel "first" "kubectl exec" calls while the cache is not built yet.
In both cases, not using "kubectl exec" should help (see the sketch below).

2. OCP is not actually ready to serve requests when openshift-tests starts, and it may reconfigure / restart components while the tests are running. Postponing the openshift-tests start by a minute or two should help.
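For illustration of what "not using kubectl exec" could look like: a minimal sketch that execs into a pod with client-go's typed REST client and the remotecommand SPDY executor instead of shelling out to kubectl, so no API discovery burst is triggered. The helper name execInPod, the kubeconfig path and the namespace/pod/container/command values are hypothetical, not taken from openshift-tests.

// execsketch.go: exec a command in a running pod via client-go instead of
// shelling out to "kubectl exec", so no API discovery burst is triggered.
package main

import (
	"bytes"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	restclient "k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/remotecommand"
)

// execInPod runs cmd in the given container and returns its stdout.
// Hypothetical helper, not part of openshift-tests.
func execInPod(config *restclient.Config, client kubernetes.Interface,
	namespace, pod, container string, cmd []string) (string, error) {

	// Build the exec subresource URL with the typed REST client;
	// this is a single, well-known endpoint, no discovery needed.
	req := client.CoreV1().RESTClient().Post().
		Resource("pods").
		Namespace(namespace).
		Name(pod).
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: container,
			Command:   cmd,
			Stdout:    true,
			Stderr:    true,
		}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
	if err != nil {
		return "", err
	}

	var stdout, stderr bytes.Buffer
	err = exec.Stream(remotecommand.StreamOptions{
		Stdout: &stdout,
		Stderr: &stderr,
	})
	if err != nil {
		return "", fmt.Errorf("exec failed: %v, stderr: %s", err, stderr.String())
	}
	return stdout.String(), nil
}

func main() {
	// Load a kubeconfig the same way kubectl would (path is illustrative).
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	out, err := execInPod(config, client, "default", "test-pod", "test-container",
		[]string{"/bin/sh", "-c", "echo hello > /mnt/test/data"})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}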
Update on 1.1: this theory is wrong. $HOME is set correctly in the e2e-aws-csi jobs, so kubectl does cache discovered APIs.
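For reference, the caching kubectl relies on is the on-disk discovery cache that client-go exposes: when $HOME is set, repeated invocations read cached API groups from disk instead of re-discovering them against the API server. A minimal sketch of the same mechanism; the cache paths and TTL here are illustrative and the exact layout varies by kubectl version.

// Sketch of the on-disk discovery cache kubectl uses via client-go.
package main

import (
	"fmt"
	"path/filepath"
	"time"

	diskcached "k8s.io/client-go/discovery/cached/disk"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	home := homedir.HomeDir()
	kubeconfig := filepath.Join(home, ".kube", "config") // illustrative path

	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}

	// Cache directories similar to what kubectl uses under ~/.kube/cache
	// (exact defaults differ between kubectl versions).
	discoveryCacheDir := filepath.Join(home, ".kube", "cache", "discovery")
	httpCacheDir := filepath.Join(home, ".kube", "cache", "http")

	dc, err := diskcached.NewCachedDiscoveryClientForConfig(
		config, discoveryCacheDir, httpCacheDir, 10*time.Minute)
	if err != nil {
		panic(err)
	}

	// The first call populates the cache; subsequent runs hit the disk cache
	// until the TTL expires.
	groups, err := dc.ServerGroups()
	if err != nil {
		panic(err)
	}
	for _, g := range groups.Groups {
		fmt.Println(g.Name)
	}
}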
*** Bug 1915539 has been marked as a duplicate of this bug. ***
*** Bug 1917678 has been marked as a duplicate of this bug. ***
*** Bug 1917570 has been marked as a duplicate of this bug. ***
*** Bug 1927709 has been marked as a duplicate of this bug. ***
Checked recent CI runs:

* release-openshift-ocp-installer-e2e-aws-csi-4.7 is much better than release-openshift-ocp-installer-e2e-aws-csi-4.6 and does not flake on "connection refused" or similar errors.
* duplicate https://bugzilla.redhat.com/show_bug.cgi?id=1927709: "[sig-storage] In-tree Volumes [Driver: cinder] [Testpattern: Dynamic PV (default fs)] fsgroupchangepolicy (OnRootMismatch)[LinuxOnly], pod created with an initial fsgroup, new pod fsgroup applied to volume contents" "matched 0.05% of failing runs"
* duplicate https://bugzilla.redhat.com/show_bug.cgi?id=1917570: "matched 0.05% of failing runs"
* duplicate https://bugzilla.redhat.com/show_bug.cgi?id=1915539: "matched 0.03% of failing runs"

The tests did not flake in any -periodic- jobs except for the ovn, s390x and ppc64le ones.

To sum it up, it looks like this is fixed in 4.7.