Bug 1903651 - Network Policies are not working as expected with OVN-Kubernetes when traffic hairpins back to the same source through a service
Summary: Network Policies are not working as expected with OVN-Kubernetes when traffic...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Andrew Stoycos
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks: 1917240
 
Reported: 2020-12-02 15:06 UTC by Swadeep Asthana
Modified: 2021-11-23 16:25 UTC
CC: 13 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:37:21 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift ovn-kubernetes pull 408 0 None closed Bug 1903651: Add clusterIP to ingress policy AS for SNAT-ed hairpin Traffic Cherry Pick 2021-02-15 23:13:44 UTC
Red Hat Knowledge Base (Solution) 5620481 0 None None None 2021-01-06 09:47:16 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:37:55 UTC

Comment 2 Anurag saxena 2020-12-02 15:32:05 UTC
Thanks Swadeep. The Customer Portal shows an error loading the case details. Which platform is this? I'd guess bare metal, since I believe migration was supported only on that platform during 4.5->4.6.

Comment 3 Swadeep Asthana 2020-12-02 15:35:53 UTC
(In reply to Anurag saxena from comment #2)
> Thanks Swadeep. Customer Portal shows error loading the case details. Wanted
> to check which platform is that? Guess it might be Baremetal as i believe
> migration was supported only on that platform during 4.5->4.6

This is bare-metal deployment UPI (on vSphere)


Regards,
Swadeep

Comment 6 zhaozhanqi 2020-12-03 10:37:57 UTC
This issue can be reproduced on a fresh OVN cluster; it is not related to upgrading or migrating from SDN to OVN.

When the network policy is added, a pod can only be accessed from its own worker node.

1. oc get pod -n z3 -o wide
NAME            READY   STATUS    RESTARTS   AGE     IP             NODE                                      NOMINATED NODE   READINESS GATES
test-rc-46h44   1/1     Running   0          5h49m   10.128.3.212   dell-per740-14.rhts.eng.pek2.redhat.com   <none>           <none>
test-rc-qst7b   1/1     Running   0          5h49m   10.131.1.160   dell-per740-35.rhts.eng.pek2.redhat.com   <none>           <none>

#oc get svc -n z3
NAME           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
test-service   ClusterIP   172.30.62.145   <none>        27017/TCP   5h49m

and there is a single policy, 'allow-from-same-namespace', shown below:

oc get networkpolicies.networking.k8s.io -n z3 -o yaml
apiVersion: v1
items:
- apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    creationTimestamp: "2020-12-03T09:46:19Z"
    generation: 1
    managedFields:
    - apiVersion: networking.k8s.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:ingress: {}
          f:policyTypes: {}
      manager: kubectl-create
      operation: Update
      time: "2020-12-03T09:46:19Z"
    name: allow-from-same-namespace
    namespace: z3
    resourceVersion: "6290803"
    selfLink: /apis/networking.k8s.io/v1/namespaces/z3/networkpolicies/allow-from-same-namespace
    uid: 05cd5dfe-8610-48f2-8c9b-fb68d652a9e0
  spec:
    ingress:
    - from:
      - podSelector: {}
    podSelector: {}
    policyTypes:
    - Ingress
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
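Stripped of the server-managed metadata in the dump above, the policy reduces to this minimal manifest (same spec, reproduced here for readability):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-same-namespace
  namespace: z3
spec:
  ingress:
  - from:
    - podSelector: {}
  podSelector: {}
  policyTypes:
  - Ingress
```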


From inside a test pod, both pods can be reached using their pod IPs:

~ $ curl 10.131.1.160:8080
Hello OpenShift!
~ $ curl 10.128.3.212:8080
Hello OpenShift!
~ $ 

but requests via the service IP fail intermittently:

~ $ curl --connect-timeout 4 172.30.62.145:27017
Hello OpenShift!
~ $ curl --connect-timeout 4 172.30.62.145:27017
Hello OpenShift!
~ $ curl --connect-timeout 4 172.30.62.145:27017
Hello OpenShift!
~ $ curl --connect-timeout 4 172.30.62.145:27017
Hello OpenShift!
~ $ curl --connect-timeout 4 172.30.62.145:27017
curl: (28) Connection timed out after 4001 milliseconds
~ $ curl --connect-timeout 4 172.30.62.145:27017
Hello OpenShift!
~ $ curl --connect-timeout 4 172.30.62.145:27017
Hello OpenShift!
~ $ curl --connect-timeout 4 172.30.62.145:27017
curl: (28) Connection timed out after 4001 milliseconds
~ $ 

From the worker node, curling the pod IPs:

[core@dell-per740-35 ~]$ curl --connect-timeout 4 10.131.1.160:8080
Hello OpenShift!
[core@dell-per740-35 ~]$ curl --connect-timeout 4 10.128.3.212:8080
curl: (28) Connection timed out after 4000 milliseconds

[core@dell-per740-35 ~]$ ip route
default via 10.73.117.254 dev br-ex proto dhcp metric 800 
10.73.116.0/23 dev br-ex proto kernel scope link src 10.73.116.54 metric 800 
10.128.0.0/14 via 10.131.0.1 dev ovn-k8s-mp0 
10.131.0.0/23 dev ovn-k8s-mp0 proto kernel scope link src 10.131.0.2 
169.254.0.0/20 dev ovn-k8s-gw0 proto kernel scope link src 169.254.0.1 
172.30.0.0/16 via 10.131.0.1 dev ovn-k8s-mp0 
192.168.222.0/24 dev eno2 proto kernel scope link src 192.168.222.112 metric 100

Comment 10 Andrew Stoycos 2020-12-03 23:34:39 UTC
Hi all, I am actively working on reproducing locally and diagnosing this issue, I will post any findings as I come across them. 

Thanks, Andrew

Comment 13 Andrew Stoycos 2020-12-07 23:00:32 UTC
After investigating today, we believe we have found the root cause.

When a pod sends traffic to a service, it is load-balanced (DNAT-ed), and sometimes the same pod is chosen as the backend for the service. To ensure this traffic travels through OVN-K8s rather than the pod-to-pod network (since srcIP == dstIP), the traffic is SNAT-ed to the VIP for the service, which is not currently included in the address set created for the `allow-from-same-namespace` network policy.
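The drop can be sketched as a toy model (plain Python, NOT OVN code; the IPs are the ones from the example below):

```python
# Toy model of the ingress ACL behaviour described above: the ACL permits
# traffic whose source IP appears in the policy's address set.

POD_IPS = {"10.244.0.4", "10.244.1.5", "10.244.1.6", "10.244.2.5"}
SERVICE_VIP = "10.96.23.185"

def ingress_allowed(src_ip: str, address_set: set) -> bool:
    """allow-from-same-namespace: permit only sources in the address set."""
    return src_ip in address_set

# Direct pod-to-pod traffic keeps the pod source IP and is allowed.
print(ingress_allowed("10.244.1.5", POD_IPS))                 # True

# Hairpin traffic (pod -> service -> same pod) arrives SNAT-ed to the
# service VIP, which the address set lacks, so the ACL drops it.
print(ingress_allowed(SERVICE_VIP, POD_IPS))                  # False

# The fix: include the service VIP in the address set.
print(ingress_allowed(SERVICE_VIP, POD_IPS | {SERVICE_VIP}))  # True
```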

For example:

[astoycos@nfvsdn-03 demo]$ kubectl get pods -n test-network-policy  -o wide
NAME                         READY   STATUS    RESTARTS   AGE     IP           NODE                NOMINATED NODE   READINESS GATES
webserver-79997dfc5d-fgwr6   1/1     Running   0          47h     10.244.1.5   ovn-worker          <none>           <none>
webserver-79997dfc5d-gzxh8   1/1     Running   0          4h46m   10.244.0.4   ovn-control-plane   <none>           <none>
webserver-79997dfc5d-zgrzm   1/1     Running   0          4h46m   10.244.2.5   ovn-worker2         <none>           <none>
webserver-pod-test           1/1     Running   0          5h47m   10.244.1.6   ovn-worker          <none>           <none>


The address set created to enforce the `allow-from-same-namespace` network policy originally only included the addresses of all pods in the namespace:

_uuid               : 93f01457-aa8a-431a-842f-dbbee790706d
addresses           : ["10.244.0.4", "10.244.1.5", "10.244.1.6", "10.244.2.5"]
external_ids        : {name=test-network-policy.allow-from-same-namespace.ingress.0_v4}
name                : a17251283737316303016

BUT the service VIPs for services backed by pods in the test-network-policy namespace also need to be added:

[root@ovn-control-plane ~]# ovn-nbctl lb-list
UUID                                    LB                  PROTO      VIP                  IPs
f205c498-f89a-451c-9f1d-906475d078aa                        udp        10.96.0.10:53        10.244.1.4:53,10.244.2.3:53
cb426972-8344-487b-9232-78da20758fed                        tcp        10.96.0.10:53        10.244.1.4:53,10.244.2.3:53
                                                            tcp        10.96.0.10:9153      10.244.1.4:9153,10.244.2.3:9153
                                                            tcp        10.96.0.1:443        172.18.0.4:6443
                                                            tcp        10.96.23.185:8080    10.244.0.4:8080,10.244.1.5:8080,10.244.2.5:8080
                                                         
To manually add the VIP to the address_set, run:

`ovn-nbctl add address_set a17251283737316303016 addresses 10.96.23.185`

Now the address set contains the VIP:

_uuid               : 93f01457-aa8a-431a-842f-dbbee790706d
addresses           : ["10.244.0.4", "10.244.1.5", "10.244.1.6", "10.244.2.5", "10.96.23.185"]
external_ids        : {name=test-network-policy.allow-from-same-namespace.ingress.0_v4}
name                : a17251283737316303016
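Since the manual workaround has to be repeated for every service VIP backing pods in the namespace, here is a hypothetical helper (not part of ovn-kubernetes) that just emits the commands. The address-set name and VIP are the ones from the output above; look yours up with `ovn-nbctl list address_set` and `ovn-nbctl lb-list`.

```python
# Emit one `ovn-nbctl add address_set` workaround command per service VIP.

def workaround_cmds(address_set_name: str, vips: list) -> list:
    return [f"ovn-nbctl add address_set {address_set_name} addresses {vip}"
            for vip in vips]

for cmd in workaround_cmds("a17251283737316303016", ["10.96.23.185"]):
    print(cmd)
# prints: ovn-nbctl add address_set a17251283737316303016 addresses 10.96.23.185
```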


Then all traffic to the service works as expected, even with the network policies applied (see below):


[astoycos@nfvsdn-03 demo]$ kubectl get networkPolicy -n test-network-policy
NAME                        POD-SELECTOR   AGE
allow-from-ingress          <none>         38m
allow-from-same-namespace   <none>         18h
default-deny-all            <none>         18h
[astoycos@nfvsdn-03 demo]$ ./test2.sh

pod/webserver-79997dfc5d-fgwr6 IP --> 10.244.1.5
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:44 GMT
Connection: keep-alive


pod/webserver-79997dfc5d-gzxh8 IP --> 10.244.0.4
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:44 GMT
Connection: keep-alive

  
pod/webserver-79997dfc5d-zgrzm IP --> 10.244.2.5
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:45 GMT
Connection: keep-alive

  
pod/webserver-pod-test IP --> 10.244.1.6
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:45 GMT
Connection: keep-alive

(Running the test twice for good measure) 


[astoycos@nfvsdn-03 demo]$ ./test2.sh 
  
pod/webserver-79997dfc5d-fgwr6 IP --> 10.244.1.5
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:54 GMT
Connection: keep-alive

  
pod/webserver-79997dfc5d-gzxh8 IP --> 10.244.0.4
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:55 GMT
Connection: keep-alive

  
pod/webserver-79997dfc5d-zgrzm IP --> 10.244.2.5
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:55 GMT
Connection: keep-alive

  
pod/webserver-pod-test IP --> 10.244.1.6
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:55 GMT
Connection: keep-alive



A PR to fix this issue on master will be created shortly, and we will backport accordingly.

Comment 23 Ben Bennett 2021-01-11 16:00:45 UTC
This is not a regression and there is a known workaround.  Unsetting the blocker flag, but we expect this to merge before the 4.7 release anyway.

Comment 27 Andrew Stoycos 2021-01-13 15:40:31 UTC
Status update: Upstream PR has merged -> https://github.com/ovn-org/ovn-kubernetes/pull/1921

Comment 34 Andrew Stoycos 2021-01-18 15:39:00 UTC
Fixes are in cherry-pick state for both downstream 4.7 and 4.6; see https://github.com/openshift/ovn-kubernetes/pull/408 and https://github.com/openshift/ovn-kubernetes/pull/411

Comment 40 Andrew Stoycos 2021-01-19 19:53:31 UTC
Update: the downstream 4.7 master PR has merged; waiting on verification to complete the backport to 4.6.

Comment 60 Andrew Stoycos 2021-02-18 21:56:06 UTC
If any of the attached customer cases involve ingress traffic problems after applying network policies, please see BZ1927841 for a probable explanation.

Thanks, 
Andrew

Comment 62 errata-xmlrpc 2021-02-24 15:37:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

