Bug 1980141 - NetworkPolicy e2e tests are flaky in 4.9, especially in stress
Summary: NetworkPolicy e2e tests are flaky in 4.9, especially in stress
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Andrew Stoycos
QA Contact: zhaozhanqi
URL:
Whiteboard:
Duplicates: 1975476 1975865 1986119 1989395 1990377
Depends On:
Blocks:
Reported: 2021-07-07 21:35 UTC by Clayton Coleman
Modified: 2022-11-08 19:34 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
job=release-openshift-origin-installer-e2e-aws-sdn-network-stress-4.9=all
[sig-network] Netpol [LinuxOnly] NetworkPolicy between server and client should deny egress from all pods in a namespace [Feature:NetworkPolicy] [Skipped:Network/OpenShiftSDN/Multitenant] [Skipped:Network/OpenShiftSDN] [Suite:openshift/conformance/parallel] [Suite:k8s]
job=periodic-ci-openshift-release-master-ci-4.9-e2e-aws-compact=all
job=periodic-ci-openshift-release-master-ci-4.9-e2e-aws-compact-upgrade=all
job=periodic-ci-openshift-release-master-ci-4.9-e2e-azure-compact=all
job=periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-compact-upgrade=all
job=periodic-ci-openshift-release-master-ci-4.9-e2e-azure-compact-upgrade=all
job=periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-compact=all
job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-compact-upgrade=all
job=release-openshift-ocp-installer-e2e-metal-compact-4.9=all
job=periodic-ci-openshift-release-master-nightly-4.9-e2e-openstack-az=all
job=periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-e2e-compact-remote-libvirt-s390x=all
Last Closed: 2022-11-08 19:34:10 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift origin pull 26266 | 0 | None | Merged | Bug 1980141: Skip the new "NetPol" tests for now | 2022-11-08 14:01:34 UTC
Github | openshift origin pull 26316 | 0 | None | Merged | Bug 1980141: Skip new `Netpol` tests for Network Stress Suite | 2022-11-08 14:01:33 UTC
Github | openshift origin pull 26775 | 0 | None | open | Bug 1980141: Reactivate netpol tests | 2022-11-08 14:01:29 UTC

Description Clayton Coleman 2021-07-07 21:35:01 UTC
The network stress job has been failing with significant flakes since the 06/21 code:

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#release-openshift-origin-installer-e2e-aws-sdn-network-stress-4.9

haproxy 2.4 was merged around that time, but the revert PR was failing only on network policy:  

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1412856796430733312

fail [k8s.io/kubernetes.1/test/e2e/network/netpol/network_legacy.go:1908]: Jul  7 20:53:30.316: Pod client-a-2v6bp should be able to connect to service svc-server, but was not able to connect.
Pod logs:
TIMEOUT
TIMEOUT
REFUSED
REFUSED
REFUSED


Looking at jobs that fail with that error: https://search.ci.openshift.org/?search=should+be+able+to+connect+to+service+svc-server%2C+but+was+not+able+to+connect&maxAge=48h&context=1&type=bug%2Bjunit&name=master%7C4.9&excludeName=&maxMatches=1&maxBytes=20971520&groupBy=job

shows about a 6% failure rate. Setting this to high because it is showing up on what looks like all platforms at that rate. It does not seem to happen in stress jobs prior to 06/25, since that is when we turned those tests on.

The last 4.9 stress job pass was https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-sdn-network-stress-4.9/1408434925270470656, but note that it was running code from 06/21 because apparently we didn't promote for 4 days.

Network policy is just heavily flaky. If the set of tests can be made not flaky, we can leave them in stress; otherwise they need to be excluded. If they are excluded, this can drop to medium and remain open, provided it's ONLY the tests that are flaky (not the actual function). Note that NetworkPolicy should have a reasonable SLO, and network stress will push that heavily, so it's possible that instead of bypassing the tests we should optimize network policy.
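
For anyone re-running the CI search above with a different window or branch filter, the query can be split into its parameters. This is only a sketch using the Python standard library: the URL is copied verbatim from this report, the helper name `describe_search` is hypothetical, and the meaning of each parameter is not documented in this bug, so treat any interpretation as an assumption.

```python
from urllib.parse import urlsplit, parse_qs

# Search URL copied verbatim from the bug description.
SEARCH_URL = (
    "https://search.ci.openshift.org/?search=should+be+able+to+connect+to+service"
    "+svc-server%2C+but+was+not+able+to+connect&maxAge=48h&context=1"
    "&type=bug%2Bjunit&name=master%7C4.9&excludeName=&maxMatches=1"
    "&maxBytes=20971520&groupBy=job"
)


def describe_search(url):
    """Return the decoded query parameters of a search URL as a dict.

    Hypothetical helper for illustration; it only decodes the URL and
    does not contact the search service.
    """
    query = urlsplit(url).query
    # keep_blank_values=True preserves empty parameters such as excludeName.
    params = parse_qs(query, keep_blank_values=True)
    # Each parameter appears once in this URL, so take the first value.
    return {key: values[0] for key, values in params.items()}


if __name__ == "__main__":
    for key, value in describe_search(SEARCH_URL).items():
        print(f"{key} = {value!r}")
```

Decoding shows the job-name filter as `master|4.9` and the time window as `48h`, which is why the search only covers recent master and 4.9 jobs.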

Comment 1 Clayton Coleman 2021-07-07 21:37:36 UTC
Broader than https://bugzilla.redhat.com/show_bug.cgi?id=1975865

Comment 2 Andrew McDermott 2021-07-08 10:38:43 UTC
PRs reverting haproxy-2.4 => haproxy-2.2

https://github.com/openshift/images/pull/97
https://github.com/openshift/router/pull/318

Comment 4 Dan Winship 2021-07-19 14:47:46 UTC
The test-disabling doesn't need QA, and we don't want this bug to get closed anyway, because we need to track the fact that we still have to fix the tests so they stop being flaky.

Comment 5 Andrew Stoycos 2021-07-30 15:06:47 UTC
*** Bug 1975865 has been marked as a duplicate of this bug. ***

Comment 6 Dan Winship 2021-08-03 13:59:15 UTC
*** Bug 1986119 has been marked as a duplicate of this bug. ***

Comment 7 Dan Winship 2021-08-03 13:59:57 UTC
*** Bug 1975476 has been marked as a duplicate of this bug. ***

Comment 8 Dan Winship 2021-08-09 12:56:15 UTC
(again moving back to NEW so this stays open to track the fact that we are skipping tests)

Comment 9 Andrew Stoycos 2021-08-10 17:54:00 UTC
*** Bug 1989395 has been marked as a duplicate of this bug. ***

Comment 10 Andrew Stoycos 2021-08-19 20:06:19 UTC
*** Bug 1990377 has been marked as a duplicate of this bug. ***

Comment 13 Andrew Stoycos 2022-11-08 19:34:10 UTC
This isn't really a bug and is more of a tech-debt item, since we're already running these tests in upstream ovn-kubernetes. Closing it in favor of tracking the last bit (re-enabling the netpol test suite in OpenShift) in an issue: https://github.com/openshift/origin/issues/27535

Thanks, 
Andrew

