Bug 1953729

Summary: e2e unidling test is flaking heavily on SNO jobs
Product: OpenShift Container Platform Reporter: Stephen Greene <sgreene>
Component: Networking Assignee: Stephen Greene <sgreene>
Networking sub component: router QA Contact: jechen <jechen>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: aos-bugs, jechen
Version: 4.6   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 23:04:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1955600    

Description Stephen Greene 2021-04-26 18:54:47 UTC
Description of problem:

The following test is flaking heavily in Single Node OpenShift (SNO) CI:

The HAProxy router should be able to connect to a service that is idled because a GET on the route will unidle it

As found in https://github.com/openshift/origin/blob/master/test/extended/router/idle.go


See https://search.ci.openshift.org/?search=The+HAProxy+router+should+be+able+to+connect+to+a+service+that+is+idled+because+a+GET+on+the+route+will+unidle+it&maxAge=48h&context=1&type=bug%2Bjunit&name=%5Epull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node%24&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
This search shows a lot of flakes for the unidling test in SNO CI.


This test also flakes occasionally on non-SNO jobs. A closer look at the test shows that it does not wait until the test workload is completely idled before trying to unidle it. The test's HTTP logic also has a 15-minute HTTP timeout, which renders the HTTP retry logic useless.

This test was introduced in 4.6 so any test improvements and optimizations should be backported accordingly.

Comment 2 jechen 2021-04-30 18:05:53 UTC
Out of the last 11 CI runs, some were flaky for this test (the most recent on April 29) and one failed.

https://testgrid.k8s.io/redhat-single-node#periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-single-node

Will check again next week with more CI results.

Comment 3 jechen 2021-05-03 12:11:02 UTC
Re-checked: the last 4 days of CI runs passed this test. Marking as verified.

https://testgrid.k8s.io/redhat-single-node#periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-single-node

Comment 6 errata-xmlrpc 2021-07-27 23:04:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438