1905761 – NetworkPolicy with Egress policyType is resulting in SDN errors and improper communication within Project

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Victor Pickard
QA Contact:	zhaozhanqi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1969993 1970046

Reported:	2020-12-09 01:39 UTC by Robert Bost
Modified:	2024-10-01 17:11 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Clones:	1969993
Environment:
Last Closed:	2021-02-24 15:41:14 UTC
Target Upstream Version:
Embargoed:

Attachments
YAML containing reproducer project (1.03 KB, text/plain) 2020-12-09 01:47 UTC, Robert Bost	no flags
sdn ovs openflow (22.04 KB, text/plain) 2021-01-08 13:26 UTC, zhaozhanqi	no flags

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift sdn pull 233	None	closed	Bug 1905761: Fix IP list for empty Egress network policy	2021-02-15 16:08:34 UTC
Github	openshift sdn pull 239	None	closed	Bug 1905761: Fix empty egress policy connectivity	2021-02-15 16:08:35 UTC
Red Hat Knowledge Base (Solution)	5696641	None	None	None	2021-01-12 01:17:47 UTC
Red Hat Product Errata	RHSA-2020:5633	None	None	None	2021-02-24 15:41:35 UTC

Description Robert Bost 2020-12-09 01:39:47 UTC

Description of problem:

The attached YAML file (includes namespace, svc, StatefulSet, and NetworkPolicy) results in 3 Pods that cannot always communicate with each other, even though the NetworkPolicy would not apply to the Pods. 

If you add Ingress under policyTypes and an empty Ingress rules section in the NetworkPolicy then the issue DOES NOT occur.

$ oc get pods -o wide 
NAME    READY   STATUS    RESTARTS   AGE   IP            NODE                                             NOMINATED NODE   READINESS GATES
web-0   1/1     Running   0          31s   10.131.1.31   worker-1.jocolema4.lab.upshift.rdu2.redhat.com   <none>           <none>
web-1   1/1     Running   0          25s   10.128.2.27   worker-0.jocolema4.lab.upshift.rdu2.redhat.com   <none>           <none>
web-2   1/1     Running   0          8s    10.131.1.32   worker-1.jocolema4.lab.upshift.rdu2.redhat.com   <none>           <none>

TEST 1: Curl to self succeeds (I expected an HTTP 404 response):
$ oc exec -it web-0 -- timeout 5 curl -s -o /dev/null -w "%{http_code}\n" web-0.httpd:8080/asdf
404

TEST 2: Curl from web-0 -> web-1 (it times out):
$ oc exec -it web-0 -- timeout 5 curl -s -o /dev/null -w "%{http_code}\n" web-1.httpd:8080/asdf
command terminated with exit code 124

TEST 3: Curl from web-0 -> web-2 (WORKS):
$ oc exec -it web-0 -- timeout 5 curl -s -o /dev/null -w "%{http_code}\n" web-2.httpd:8080/asdf
404

TEST 4: Deleting all NetworkPolicies:
$ oc delete networkpolicy --all 
$ oc exec -it web-0 -- timeout 5 curl -s -o /dev/null -w "%{http_code}\n" web-0.httpd:8080/asdf
404
$ oc exec -it web-0 -- timeout 5 curl -s -o /dev/null -w "%{http_code}\n" web-1.httpd:8080/asdf
404
$ oc exec -it web-0 -- timeout 5 curl -s -o /dev/null -w "%{http_code}\n" web-2.httpd:8080/asdf
404


Additionally, the SDN logs show the following repeatedly, which I believe is related:

I1209 01:24:01.738089 4106341 pod.go:508] CNI_ADD hsts/web-0 got IP 10.131.1.97, ofport 1875
I1209 01:24:01.763042 4106341 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I1209 01:24:02.282660 4106341 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I1209 01:24:02.929075 4106341 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I1209 01:24:03.730945 4106341 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I1209 01:24:04.736159 4106341 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I1209 01:24:05.977063 4106341 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I1209 01:24:07.524868 4106341 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I1209 01:24:09.454298 4106341 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I1209 01:24:11.861189 4106341 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I1209 01:24:14.867488 4106341 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
E1209 01:24:14.867721 4106341 networkpolicy.go:311] Error syncing OVS flows for VNID: timed out waiting for the condition
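
For triage, the retry pattern in the excerpt above can be summarized mechanically. The following is a minimal Python sketch, assuming the klog-style line layout seen in that excerpt (severity letter, MMDD date, time, PID, source file and line, message). The function name summarize_sdn_log is made up for illustration and is not part of any OpenShift tooling.

```python
import re
from collections import Counter

# Assumed klog-style layout: <sev><MMDD> <time> <pid> <file>:<line>] <message>
KLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<pid>\d+) (?P<src>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def summarize_sdn_log(text):
    """Count lines per (severity, source file) and collect E-severity messages."""
    counts = Counter()
    errors = []
    for raw in text.splitlines():
        m = KLOG_LINE.match(raw.strip())
        if not m:
            continue  # skip anything that is not a klog-style line
        counts[(m["sev"], m["src"])] += 1
        if m["sev"] == "E":
            errors.append(m["msg"])
    return counts, errors


# Four lines copied from the excerpt in this report.
SAMPLE = """\
I1209 01:24:01.738089 4106341 pod.go:508] CNI_ADD hsts/web-0 got IP 10.131.1.97, ofport 1875
I1209 01:24:01.763042 4106341 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I1209 01:24:02.282660 4106341 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
E1209 01:24:14.867721 4106341 networkpolicy.go:311] Error syncing OVS flows for VNID: timed out waiting for the condition
"""

if __name__ == "__main__":
    counts, errors = summarize_sdn_log(SAMPLE)
    for (sev, src), n in sorted(counts.items()):
        print(sev, src, n)
    for msg in errors:
        print("ERROR:", msg)
```

Note that the ovs-ofctl failures are logged at I (info) severity, so filtering on E lines alone would show only the final sync timeout and miss the retries that precede it.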


Version-Release number of selected component (if applicable):
4.6.4


How reproducible: Always with attached YAML contents.

Comment 1 Robert Bost 2020-12-09 01:44:59 UTC

> Additionally, the SDN logs show the following repeatedly, which I believe is related:
> I1209 01:24:01.738089 4106341 pod.go:508] CNI_ADD hsts/web-0 got IP 10.131.1.97, ofport 1875

To clarify the mismatch between the logs and the oc get pods output in my description: the logs always show the IP address of the Pod. The log excerpt I shared had simply been saved to my clipboard, which is why its IP differs from the oc get pods output.

Comment 2 Robert Bost 2020-12-09 01:47:37 UTC

Created attachment 1737777
YAML containing reproducer project

Comment 4 Victor Pickard 2020-12-15 00:03:49 UTC

I have been able to reproduce the error logs with the attached YAML, and have posted a PR.

Even with the error logs, I did not see a connectivity issue between any of the pods in my local setup with 4.6.4. I consistently got a 404 error (expected) when doing a curl from web-0 pod to the other pods. 

It's possible that the error caused by attempting to program a flow with an empty IP address prevented some other flows from being installed under certain conditions. I'll verify with some other team members.
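
As a rough illustration of why a value like 0/0 gets rejected, here is a small Python sketch using the standard ipaddress module. It is not the openshift-sdn code (which is written in Go) and makes no claim about how the actual fix works. The helper names is_valid_cidr and match_clauses are hypothetical, and nw_dst is used only as an example match field.

```python
import ipaddress


def is_valid_cidr(text):
    """Return True if text parses as an IPv4/IPv6 address or network."""
    try:
        ipaddress.ip_network(text, strict=False)
        return True
    except ValueError:
        return False


def match_clauses(field, cidrs):
    """Hypothetical helper: build one 'field=cidr' clause per entry.

    An empty list yields no clauses at all, and a malformed entry raises
    instead of being passed on to whatever programs the flows.
    """
    clauses = []
    for cidr in cidrs:
        if not is_valid_cidr(cidr):
            raise ValueError(f"invalid IP address or network: {cidr!r}")
        clauses.append(f"{field}={cidr}")
    return clauses


if __name__ == "__main__":
    for candidate in ["0/0", "0.0.0.0/0", "10.128.2.27", ""]:
        print(repr(candidate), is_valid_cidr(candidate))
    print(match_clauses("nw_dst", []))
```

The point of the sketch is only that 0/0 is not a well-formed network (0.0.0.0/0 is), and that an empty peer list has to be handled explicitly rather than rendered into a match.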

Comment 6 Robert Bost 2021-01-06 20:50:29 UTC

> I did not see a connectivity issue between any of the pods in my local setup with 4.6.4.

Apologies for the late reply, but please check that the Pods are running on different nodes when you attempt to reproduce the problem.
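
To make the node placement check concrete, here is a minimal Python sketch that reads the text output of oc get pods -o wide and lists the pod pairs that sit on different nodes. It assumes the column layout shown in the description and single-token values in every column up to NODE. The function names are made up for illustration.

```python
from itertools import combinations


def pods_by_node(wide_output):
    """Map pod name -> node name from `oc get pods -o wide` text output."""
    lines = [ln for ln in wide_output.splitlines() if ln.strip()]
    header = lines[0].split()
    # NODE precedes the multi-word headers (NOMINATED NODE, READINESS GATES),
    # so the first "NODE" token gives the right column index.
    node_idx = header.index("NODE")
    result = {}
    for ln in lines[1:]:
        cols = ln.split()
        result[cols[0]] = cols[node_idx]
    return result


def cross_node_pairs(pods):
    """Return sorted pod-name pairs whose pods run on different nodes."""
    return [(a, b) for a, b in combinations(sorted(pods), 2) if pods[a] != pods[b]]


# Output copied from the description of this bug.
SAMPLE = """\
NAME    READY   STATUS    RESTARTS   AGE   IP            NODE                                             NOMINATED NODE   READINESS GATES
web-0   1/1     Running   0          31s   10.131.1.31   worker-1.jocolema4.lab.upshift.rdu2.redhat.com   <none>           <none>
web-1   1/1     Running   0          25s   10.128.2.27   worker-0.jocolema4.lab.upshift.rdu2.redhat.com   <none>           <none>
web-2   1/1     Running   0          8s    10.131.1.32   worker-1.jocolema4.lab.upshift.rdu2.redhat.com   <none>           <none>
"""

if __name__ == "__main__":
    print(cross_node_pairs(pods_by_node(SAMPLE)))
```

For the original reproducer this reports web-0/web-1 and web-1/web-2 as cross-node pairs, which lines up with TEST 2 (web-0 to web-1) timing out while TEST 3 (web-0 to web-2, same node) succeeds.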

Comment 8 zhaozhanqi 2021-01-08 13:22:53 UTC

Checked this issue on 4.7.0-0.nightly-2021-01-07-181010.

The `invalid IP address` log messages no longer appear. However, after creating a NetworkPolicy with policy type Egress, the pods can no longer reach each other.

Steps:

1. Create namespace z3 and create the test pods.

oc get pod -o wide
NAME            READY   STATUS    RESTARTS   AGE   IP           NODE                       NOMINATED NODE   READINESS GATES
test-rc-7bx7s   1/1     Running   0          15m   10.129.3.3   zzhao108-zk7kr-compute-1   <none>           <none>
test-rc-pktzp   1/1     Running   0          15m   10.129.3.2   zzhao108-zk7kr-compute-1   <none>           <none>

2. Access each pod from test-rc-7bx7s. Both requests succeed and return 'Hello OpenShift!'.

$ oc exec test-rc-7bx7s -- curl 10.129.3.3:8080
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    17  100    17    0     0  17000      0 --:--:-- --:--:-- --:--:-- 17000
Hello OpenShift!
$ oc exec test-rc-7bx7s -- curl 10.129.3.2:8080
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0Hello OpenShift!
100    17  100    17    0     0  17000      0 --:--:-- --:--:-- --:--:-- 17000

3. Create a NetworkPolicy of type Egress whose podSelector does not match any pods.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: bad-np
spec:
  egress:
  - {}
  podSelector:
    matchLabels:
      never-gonna: match
  policyTypes:
  - Egress

4. Access again. This time pod1 cannot reach pod2.

$ oc exec test-rc-7bx7s -- curl --connect-timeout 4 10.129.3.2:8080
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:04 --:--:--     0
curl: (28) Connection timed out after 4001 milliseconds
command terminated with exit code 28

$ oc get netnamespace z3
NAME   NETID     EGRESS IPS
z3     4671645
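
For reference, the expected behaviour of the bad-np policy can be sketched in a few lines of Python: a podSelector with matchLabels selects a pod only if every listed label is present on it, so a selector that matches nothing should isolate nothing. This is a simplified model (matchExpressions is ignored), the function names are hypothetical, and the pod labels below are assumptions because the test pod manifests are not shown in this report.

```python
def selects(match_labels, pod_labels):
    """True if every key/value in matchLabels is present on the pod.

    An empty matchLabels selects every pod.
    """
    return all(pod_labels.get(k) == v for k, v in match_labels.items())


def egress_selected_pods(policy, pods):
    """Names of pods that an Egress-type policy applies to (simplified)."""
    spec = policy["spec"]
    if "Egress" not in spec.get("policyTypes", []):
        return []
    selector = spec.get("podSelector", {}).get("matchLabels", {})
    return [name for name, labels in pods.items() if selects(selector, labels)]


# Mirrors the bad-np manifest from step 3.
BAD_NP = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "bad-np"},
    "spec": {
        "egress": [{}],
        "podSelector": {"matchLabels": {"never-gonna": "match"}},
        "policyTypes": ["Egress"],
    },
}

# Pod names come from step 1; the labels are assumed for illustration.
PODS = {
    "test-rc-7bx7s": {"name": "test-rc"},
    "test-rc-pktzp": {"name": "test-rc"},
}

if __name__ == "__main__":
    print(egress_selected_pods(BAD_NP, PODS))
```

Under this model bad-np selects no pods, so the timeout in step 4 is not something the policy itself should cause.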

Comment 9 zhaozhanqi 2021-01-08 13:26:22 UTC

Created attachment 1745586
sdn ovs openflow

Comment 10 Victor Pickard 2021-01-11 14:44:49 UTC

Hi Zhanqi,
Can you please attach the yaml files that you used to recreate this? I really like the "Hello Openshift" server you have. Thanks

Comment 13 Dan Winship 2021-01-12 17:39:23 UTC

> Additionally, the SDN logs show the following repeatedly, which I believe is related:
> 
> I1209 01:24:01.763042 4106341 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address

Yes, there was a bug recently introduced in the NetworkPolicy code. None of the other specifics here are relevant; the buggy NetworkPolicy code would result in bad rules regardless of what you were doing.

*** This bug has been marked as a duplicate of bug 1914284 ***

Comment 14 Dan Winship 2021-01-12 17:41:42 UTC

Sorry, no, there's another bug here.

Comment 16 zhaozhanqi 2021-01-15 10:22:56 UTC

Verified this bug on 4.7.0-0.nightly-2021-01-14-211319

Comment 17 Victor Pickard 2021-01-20 19:43:03 UTC

@zzhao, this bug still has the FailedQA flag set. Could you clear it when you get a chance?

Thanks!

Comment 18 zhaozhanqi 2021-01-21 07:01:28 UTC

Ah, I cleared the FailedQA flag. Thanks.

Comment 21 errata-xmlrpc 2021-02-24 15:41:14 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
