Bug 1873311

Summary: e2e test fails NetworkPolicy between server and client should stop enforcing policies after they are deleted [Feature:NetworkPolicy]
Product: OpenShift Container Platform Reporter: Tim Rozet <trozet>
Component: Networking Assignee: Tim Rozet <trozet>
Networking sub component: ovn-kubernetes QA Contact: Anurag saxena <anusaxen>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: atheurer, surya
Version: 4.6   
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-27 16:35:33 UTC Type: Bug

Description Tim Rozet 2020-08-27 19:56:20 UTC
Description of problem:
The test fails consistently in upstream and downstream CI.

Comment 1 Tim Rozet 2020-08-27 20:01:09 UTC
We debugged this and found that the issue was introduced by:

https://github.com/ovn-org/ovn-kubernetes/commit/fd4758701cd61e0f69e21ef5a96ab5d91f704ef0

This commit causes the test to fail in deployments with more than one node when the client and server are on different nodes. Consider the following:

client-----nodeA----nodeB---Server

An ingress deny-all policy is applied to the cluster, so the client cannot communicate with the server. This works as expected.

A new policy is then added to allow ingress into the server from the client. Traffic sent from the client to the server arrives at the server. However, return traffic from the server to the client is dropped at nodeA. This is because the ACL we create for a network policy only targets a port group:


[root@ovn-control-plane ~]# ovn-nbctl acl-list f62f4d42-5cfb-45e5-8f7f-4bf3e7c9fbe6
  to-lport  1001 (ip4.src == {$a12672671609520104948} && outport == @a1383251650920656097) allow-related

This port group includes only the destination, which is the server, so the allow-related ACL is placed only on nodeB. The connection is therefore not tracked by conntrack on nodeA, and nodeA drops the return traffic because it has no way to tell that it is return traffic.
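
The failure described above can be reproduced in a toy model. The sketch below is not OVN code: it assumes each node keeps its own conntrack table, that a connection is committed only on a node where an allow-related ACL matches, and that a port with no matching ACL defaults to allow. The pod names, the `POD_NODE` placement, and the deny-all priority of 1000 are illustrative assumptions; only the allow-related priority of 1001 comes from the ACL listing above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical placement, mirroring the diagram: client on nodeA, server on nodeB.
POD_NODE = {"client": "nodeA", "server": "nodeB"}


@dataclass(frozen=True)
class Acl:
    direction: str        # "to-lport" (into a pod) or "from-lport" (out of a pod)
    port: str             # pod port the ACL applies to
    peer: Optional[str]   # remote pod the match is limited to; None matches any
    action: str           # "allow-related" or "drop"
    priority: int


def hop(acls, conntrack, direction, port, peer):
    """Evaluate one ACL stage on the node hosting `port`; True means the packet passes."""
    node = POD_NODE[port]
    flow = frozenset((port, peer))
    if flow in conntrack[node]:
        return True  # connection already tracked on this node: treated as related
    matches = [a for a in acls
               if a.direction == direction and a.port == port and a.peer in (None, peer)]
    if not matches:
        return True  # no ACL for this port and direction: default allow (assumption)
    best = max(matches, key=lambda a: a.priority)
    if best.action == "allow-related":
        conntrack[node].add(flow)  # committed on this node only
        return True
    return False


def send(acls, conntrack, src, dst):
    """A packet passes the sender's from-lport stage, then the receiver's to-lport stage."""
    return (hop(acls, conntrack, "from-lport", src, dst)
            and hop(acls, conntrack, "to-lport", dst, src))


acls = [
    Acl("to-lport", "client", None, "drop", 1000),               # ingress deny-all
    Acl("to-lport", "server", None, "drop", 1000),               # ingress deny-all
    Acl("to-lport", "server", "client", "allow-related", 1001),  # allow client -> server
]
conntrack = {"nodeA": set(), "nodeB": set()}

request_ok = send(acls, conntrack, "client", "server")  # passes; committed on nodeB only
reply_ok = send(acls, conntrack, "server", "client")    # dropped by deny-all at nodeA
print(request_ok, reply_ok, conntrack)
```

In this model the request commits the connection only in nodeB's table, so the reply reaches nodeA as an unknown flow and hits the client's ingress deny-all rule.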

Comment 2 Tim Rozet 2020-08-27 20:01:32 UTC
Adding links to Sippy for OVN:

[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should stop enforcing policies after they are deleted [Feature:NetworkPolicy]

Comment 3 Tim Rozet 2020-08-27 20:03:53 UTC
As a short-term solution, we are reverting the commit identified above. This will temporarily lower performance for clusters without network policy. The correct fix should be to add ACLs for port groups on the client side, as Dumitru mentioned:

"So it should be ok to add a PG, pg_client and an acl (also applied on pg_client) with match inport == @pg_client && ip.dst == <server_ip> action allow-related"

I'll open another bug to track the allow-related performance fix.
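
As a rough illustration of Dumitru's suggestion, the sketch below builds the `ovn-nbctl` invocations that would create a client-side port group and attach an allow-related ACL to it. It only prints the commands and does not run them. The helper name `client_side_acl_cmds`, the port name, and the server IP are hypothetical; the match uses `ip4.dst` to mirror the `ip4.src` form in the ACL listing from comment 1, and the priority of 1001 is copied from that listing as an assumption. This is not the change that was eventually merged.

```python
import shlex


def client_side_acl_cmds(pg_name, client_ports, server_ip, priority=1001):
    """Return ovn-nbctl argument lists for a client-side port group and its ACL."""
    # Match traffic leaving any port in the group toward the server's address.
    match = f"inport == @{pg_name} && ip4.dst == {server_ip}"
    return [
        # Create the port group containing the client's logical port(s).
        ["ovn-nbctl", "pg-add", pg_name, *client_ports],
        # Attach a from-lport allow-related ACL to the port group itself.
        ["ovn-nbctl", "--type=port-group", "acl-add", pg_name,
         "from-lport", str(priority), match, "allow-related"],
    ]


# Dry run with placeholder values: print the commands instead of executing them.
for cmd in client_side_acl_cmds("pg_client", ["client-port"], "10.244.1.5"):
    print(shlex.join(cmd))
```

Because the ACL is applied to the client's port group, the connection would also be committed to conntrack on the client's node, which is what lets that node recognize the return traffic.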

Comment 8 errata-xmlrpc 2020-10-27 16:35:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196