Bug 1780334 - There should be a networking stress test harness that helps catch networking flakes in a more controlled environment
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Casey Callendrello
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On:
Blocks: 1780337
Reported: 2019-12-05 17:29 UTC by Clayton Coleman
Modified: 2020-05-04 11:19 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1780337
Environment:
Last Closed: 2020-05-04 11:18:31 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-04 11:19:07 UTC

Description Clayton Coleman 2019-12-05 17:29:14 UTC
As we have begun rolling out OVN (and over the lifetime of openshift-sdn), we have often hit subtle networking bugs that only surface as generic failures in other test suites and are difficult to root cause. This hampers our detect-triage-fix loop and causes a high amount of team-to-team overhead. It also forces the networking team into many high-overhead interactions that reduce the time they spend actually triaging and fixing their issues (the "it's always networking" mindset).

To better isolate specific testing, we should create a network stress test harness that can evolve to provide more specific detection and resolution of networking issues as we go. This can start simply - a test that runs the networking e2es repeatedly in parallel - and grow over time to offer more sophisticated invariant checking (for example, a long-running test that verifies that all pods can reach all targets).
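The "start simply" idea above can be sketched roughly as follows. This is only an illustration of repeating a set of tests in parallel and counting failures, not the harness that was actually built. The `Test` type, the `run_stress` function, and its parameters are hypothetical names, and each test is assumed to be a plain callable that returns True on success.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a test is a name plus a callable returning True on pass.
Test = Tuple[str, Callable[[], bool]]


def run_stress(tests: List[Test], repeats: int, workers: int = 4) -> Dict[str, int]:
    """Run every test `repeats` times in parallel; return failure counts per test."""

    def run_one(test: Test) -> Tuple[str, bool]:
        name, fn = test
        try:
            return name, bool(fn())
        except Exception:
            # An exception counts as a failure rather than aborting the whole run.
            return name, False

    # One job per (test, repetition); the thread pool interleaves them.
    jobs = [t for t in tests for _ in range(repeats)]
    failures: Counter = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for name, passed in pool.map(run_one, jobs):
            if not passed:
                failures[name] += 1
    return dict(failures)
```

Tests that never fail are left out of the result, so an empty dictionary means a clean run.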

We should add the test harness as an e2e suite and add a release periodic job and a PR job for openshift/sdn and ovn-kube that allow us to trigger the test. We will have follow-up work to make the tests more effective, and we probably want to look at process changes that ensure that when we hit a flake the stress test did not catch, we add a new case.

Things we can do later:

1. run an invariant checker (as a test, as part of the monitor, or as a disruptive test) that verifies the environment is currently working (exec into a pod and make sure it can curl the masters over the network, reach known working pods over a service, reach pods directly over the pod network, and access the host it is running on via the host network)
2. add new types of e2e tests that try disruptive actions (delete the ovn-kube pod on a node and verify no failures are detected)
3. better instrument upgrades to verify that no connections are dropped, like we do for service load balancers
4. ... ?
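As a rough illustration of item 1, an invariant checker could walk every (pod, target) pair and report the ones that are unreachable. This is a minimal sketch under assumptions: `check_reachability` and the `Probe` signature are hypothetical, and the probe itself (for example, something that execs into a pod and runs curl) is left to the caller so that nothing about the real cluster tooling is assumed here.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical probe signature: (source pod, target) -> True if reachable.
Probe = Callable[[str, str], bool]


def check_reachability(
    pods: Iterable[str], targets: Iterable[str], probe: Probe
) -> List[Tuple[str, str]]:
    """Return every (pod, target) pair for which the probe failed."""
    target_list = list(targets)  # reused for each pod, so materialize it once
    broken: List[Tuple[str, str]] = []
    for pod in pods:
        for target in target_list:
            try:
                ok = probe(pod, target)
            except Exception:
                # A probe that raises is treated as "unreachable".
                ok = False
            if not ok:
                broken.append((pod, target))
    return broken
```

An empty result means the invariant held at the time of the check; a long-running test could call this repeatedly and fail on the first non-empty result.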

Comment 1 Clayton Coleman 2019-12-05 17:29:45 UTC
For now, getting an openshift-sdn and ovn-kube stress test release job is the primary goal; then we will assess further.

Comment 3 zhaozhanqi 2019-12-16 03:30:20 UTC
Hi Mike,

A new stress test tool has been added; I'm not sure if you have noticed it. Assigning this bug to you. Thanks.

Comment 4 Mike Fiedler 2020-01-15 15:38:02 UTC
Verified with openshift-tests from registry.svc.ci.openshift.org/ocp/4.4-art-latest-2020-01-15-102334:tests. Running openshift/network/stress runs the network tests 15 times. I saw 45 failures (15 runs x 3 failing tests), which I will file bugs for separately.
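The 45 = 15 x 3 arithmetic suggests the same three tests failed on every run. One way to tell such consistent failures apart from flakes is to count how many runs each test failed in. The sketch below assumes the failed test names of each run are already available as lists; `classify_failures` is a hypothetical helper, and the real test names are not given in this report.

```python
from collections import Counter
from typing import Dict, Iterable, List


def classify_failures(failed_per_run: List[Iterable[str]]) -> Dict[str, List[str]]:
    """Split failing tests into those that failed in every run and those that did not."""
    runs = len(failed_per_run)
    # Count each test at most once per run, even if it is listed twice.
    counts = Counter(name for run in failed_per_run for name in set(run))
    return {
        "always": sorted(n for n, c in counts.items() if c == runs),
        "flaky": sorted(n for n, c in counts.items() if c < runs),
    }
```

Tests under "always" are likely real bugs or broken tests, while tests under "flaky" are the intermittent failures a stress run is meant to surface.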

Comment 6 errata-xmlrpc 2020-05-04 11:18:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

