Bug 1903651
| Summary: | Network Policies are not working as expected with OVN-Kubernetes when traffic hairpins back to the same source through a service | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Swadeep Asthana <swasthan> |
| Component: | Networking | Assignee: | Andrew Stoycos <astoycos> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | abodhe, aconstan, anbhat, astoycos, bbennett, hchatter, huirwang, mifiedle, openshift-bugs-escalate, rbohne, rbost, rjamadar, zzhao |
| Version: | 4.6 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-24 15:37:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1917240 | | |
Comment 2
Anurag saxena
2020-12-02 15:32:05 UTC
(In reply to Anurag saxena from comment #2)
> Thanks Swadeep. Customer Portal shows error loading the case details. Wanted
> to check which platform is that? Guess it might be Baremetal as I believe
> migration was supported only on that platform during 4.5->4.6

This is a bare-metal UPI deployment (on vSphere).

Regards,
Swadeep

This issue can also be reproduced on a freshly installed OVN cluster, so it should not be related to an upgrade or a migration from SDN to OVN.
When the network policy is added, a pod can only be accessed from its own worker node.
```
# oc get pod -n z3 -o wide
NAME            READY   STATUS    RESTARTS   AGE     IP             NODE                                      NOMINATED NODE   READINESS GATES
test-rc-46h44   1/1     Running   0          5h49m   10.128.3.212   dell-per740-14.rhts.eng.pek2.redhat.com   <none>           <none>
test-rc-qst7b   1/1     Running   0          5h49m   10.131.1.160   dell-per740-35.rhts.eng.pek2.redhat.com   <none>           <none>

# oc get svc -n z3
NAME           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
test-service   ClusterIP   172.30.62.145   <none>        27017/TCP   5h49m
```
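For context, an equivalent environment can be stood up roughly as follows (a sketch: the exact manifests used are not attached to the bug and the container image is an assumption; the object names and ports match the listings above):

```bash
oc new-project z3

# Two replicas of a simple "Hello OpenShift" web server listening on 8080.
oc create -f - <<'EOF'
apiVersion: v1
kind: ReplicationController
metadata:
  name: test-rc
spec:
  replicas: 2
  selector:
    name: test-pods
  template:
    metadata:
      labels:
        name: test-pods
    spec:
      containers:
      - name: test-pod
        image: openshift/hello-openshift   # assumed image
        ports:
        - containerPort: 8080
EOF

# ClusterIP service exposing port 27017 and targeting the pods' 8080.
oc create -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: test-service
spec:
  selector:
    name: test-pods
  ports:
  - port: 27017
    targetPort: 8080
    protocol: TCP
EOF
```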
There is only one policy, `allow-from-same-namespace`, shown below:
```yaml
# oc get networkpolicies.networking.k8s.io -n z3 -o yaml
apiVersion: v1
items:
- apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    creationTimestamp: "2020-12-03T09:46:19Z"
    generation: 1
    managedFields:
    - apiVersion: networking.k8s.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:ingress: {}
          f:policyTypes: {}
      manager: kubectl-create
      operation: Update
      time: "2020-12-03T09:46:19Z"
    name: allow-from-same-namespace
    namespace: z3
    resourceVersion: "6290803"
    selfLink: /apis/networking.k8s.io/v1/namespaces/z3/networkpolicies/allow-from-same-namespace
    uid: 05cd5dfe-8610-48f2-8c9b-fb68d652a9e0
  spec:
    ingress:
    - from:
      - podSelector: {}
    podSelector: {}
    policyTypes:
    - Ingress
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```
When I exec into one of the test pods, both pods can be reached by pod IP:
```
~ $ curl 10.131.1.160:8080
Hello OpenShift!
~ $ curl 10.128.3.212:8080
Hello OpenShift!
```
but requests to the service IP only succeed intermittently:
```
~ $ curl --connect-timeout 4 172.30.62.145:27017
Hello OpenShift!
~ $ curl --connect-timeout 4 172.30.62.145:27017
Hello OpenShift!
~ $ curl --connect-timeout 4 172.30.62.145:27017
Hello OpenShift!
~ $ curl --connect-timeout 4 172.30.62.145:27017
Hello OpenShift!
~ $ curl --connect-timeout 4 172.30.62.145:27017
curl: (28) Connection timed out after 4001 milliseconds
~ $ curl --connect-timeout 4 172.30.62.145:27017
Hello OpenShift!
~ $ curl --connect-timeout 4 172.30.62.145:27017
Hello OpenShift!
~ $ curl --connect-timeout 4 172.30.62.145:27017
curl: (28) Connection timed out after 4001 milliseconds
```
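To quantify how often this happens, a loop like the following can be run from inside one of the backend pods (a sketch, not part of the original report; it assumes `curl` and a POSIX shell are available in the pod image). The requests that time out are the ones the load balancer hairpins back to the calling pod itself:

```bash
oc exec -n z3 test-rc-46h44 -- sh -c '
  ok=0; fail=0
  for i in $(seq 1 50); do
    # A timeout here indicates a request that was DNAT-ed back to this pod.
    if curl -s -o /dev/null --connect-timeout 4 172.30.62.145:27017; then
      ok=$((ok+1))
    else
      fail=$((fail+1))
    fi
  done
  echo "ok=$ok fail=$fail"
'
```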
From a worker node, curling the pod IPs:
```
[core@dell-per740-35 ~]$ curl --connect-timeout 4 10.131.1.160:8080
Hello OpenShift!
[core@dell-per740-35 ~]$ curl --connect-timeout 4 10.128.3.212:8080
curl: (28) Connection timed out after 4000 milliseconds
[core@dell-per740-35 ~]$ ip route
default via 10.73.117.254 dev br-ex proto dhcp metric 800
10.73.116.0/23 dev br-ex proto kernel scope link src 10.73.116.54 metric 800
10.128.0.0/14 via 10.131.0.1 dev ovn-k8s-mp0
10.131.0.0/23 dev ovn-k8s-mp0 proto kernel scope link src 10.131.0.2
169.254.0.0/20 dev ovn-k8s-gw0 proto kernel scope link src 169.254.0.1
172.30.0.0/16 via 10.131.0.1 dev ovn-k8s-mp0
192.168.222.0/24 dev eno2 proto kernel scope link src 192.168.222.112 metric 100
```
Hi all,

I am actively working on reproducing this locally and diagnosing the issue; I will post any findings as I come across them.

Thanks,
Andrew

After investigating today, we believe we have found the root cause of the problem...
When a pod sends traffic to a service, the traffic is load balanced (DNAT-ed), and sometimes the calling pod itself is chosen as the backend for the service. To ensure this hairpin traffic travels back through OVN-Kubernetes rather than the pod-to-pod network (since srcIP == dstIP), it is SNAT-ed to the VIP of the service, but that VIP is not currently included in the address set generated for the `allow-from-same-namespace` network policy.
For example:
```
[astoycos@nfvsdn-03 demo]$ kubectl get pods -n test-network-policy -o wide
NAME                         READY   STATUS    RESTARTS   AGE     IP           NODE                NOMINATED NODE   READINESS GATES
webserver-79997dfc5d-fgwr6   1/1     Running   0          47h     10.244.1.5   ovn-worker          <none>           <none>
webserver-79997dfc5d-gzxh8   1/1     Running   0          4h46m   10.244.0.4   ovn-control-plane   <none>           <none>
webserver-79997dfc5d-zgrzm   1/1     Running   0          4h46m   10.244.2.5   ovn-worker2         <none>           <none>
webserver-pod-test           1/1     Running   0          5h47m   10.244.1.6   ovn-worker          <none>           <none>
```
The address set created to enforce the `allow-from-same-namespace` network policy originally only included the addresses of the pods in the namespace:
```
_uuid               : 93f01457-aa8a-431a-842f-dbbee790706d
addresses           : ["10.244.0.4", "10.244.1.5", "10.244.1.6", "10.244.2.5"]
external_ids        : {name=test-network-policy.allow-from-same-namespace.ingress.0_v4}
name                : a17251283737316303016
```
BUT the VIPs of the services backed by pods in the test-network-policy namespace also need to be added:
```
[root@ovn-control-plane ~]# ovn-nbctl lb-list
UUID                                   LB   PROTO   VIP                 IPs
f205c498-f89a-451c-9f1d-906475d078aa        udp     10.96.0.10:53       10.244.1.4:53,10.244.2.3:53
cb426972-8344-487b-9232-78da20758fed        tcp     10.96.0.10:53       10.244.1.4:53,10.244.2.3:53
                                            tcp     10.96.0.10:9153     10.244.1.4:9153,10.244.2.3:9153
                                            tcp     10.96.0.1:443       172.18.0.4:6443
                                            tcp     10.96.23.185:8080   10.244.0.4:8080,10.244.1.5:8080,10.244.2.5:8080
```
To manually add the VIP to the address set, run:

```
ovn-nbctl add address_set a17251283737316303016 addresses 10.96.23.185
```
Now the address set contains the VIP:

```
_uuid               : 93f01457-aa8a-431a-842f-dbbee790706d
addresses           : ["10.244.0.4", "10.244.1.5", "10.244.1.6", "10.244.2.5", "10.96.23.185"]
external_ids        : {name=test-network-policy.allow-from-same-namespace.ingress.0_v4}
name                : a17251283737316303016
```
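If a namespace exposes several services, the same workaround has to be applied to each ClusterIP. A minimal sketch (illustrative only, not part of the original report; the address set name is taken from `ovn-nbctl list address_set`):

```bash
NS=test-network-policy
AS=a17251283737316303016   # address set backing the allow-from-same-namespace policy
# Add every ClusterIP in the namespace to the policy's address set.
for vip in $(kubectl get svc -n "$NS" -o jsonpath='{.items[*].spec.clusterIP}'); do
  ovn-nbctl add address_set "$AS" addresses "$vip"
done
```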
Then all traffic to the service works as expected, even with the network policies applied (see below):
```
[astoycos@nfvsdn-03 demo]$ kubectl get networkPolicy -n test-network-policy
NAME                        POD-SELECTOR   AGE
allow-from-ingress          <none>         38m
allow-from-same-namespace   <none>         18h
default-deny-all            <none>         18h
```
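The test script itself is not attached to the bug; based on its output below, a plausible reconstruction of `test2.sh` looks like this (the namespace and VIP are taken from the listings above):

```bash
#!/bin/bash
# Curl the service VIP from every pod in the namespace and print the result.
NS=test-network-policy
VIP=10.96.23.185:8080
for pod in $(kubectl get pods -n "$NS" -o jsonpath='{.items[*].metadata.name}'); do
  ip=$(kubectl get pod "$pod" -n "$NS" -o jsonpath='{.status.podIP}')
  echo "pod/$pod IP --> $ip"
  echo "CURLing from above Pod with command --> curl -s -I -m 2 $VIP"
  kubectl exec -n "$NS" "$pod" -- curl -s -I -m 2 "$VIP"
done
```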
```
[astoycos@nfvsdn-03 demo]$ ./test2.sh
pod/webserver-79997dfc5d-fgwr6 IP --> 10.244.1.5
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:44 GMT
Connection: keep-alive

pod/webserver-79997dfc5d-gzxh8 IP --> 10.244.0.4
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:44 GMT
Connection: keep-alive

pod/webserver-79997dfc5d-zgrzm IP --> 10.244.2.5
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:45 GMT
Connection: keep-alive

pod/webserver-pod-test IP --> 10.244.1.6
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:45 GMT
Connection: keep-alive
```
(Running the test twice for good measure)
```
[astoycos@nfvsdn-03 demo]$ ./test2.sh
pod/webserver-79997dfc5d-fgwr6 IP --> 10.244.1.5
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:54 GMT
Connection: keep-alive

pod/webserver-79997dfc5d-gzxh8 IP --> 10.244.0.4
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:55 GMT
Connection: keep-alive

pod/webserver-79997dfc5d-zgrzm IP --> 10.244.2.5
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:55 GMT
Connection: keep-alive

pod/webserver-pod-test IP --> 10.244.1.6
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:55 GMT
Connection: keep-alive
```
A PR to fix this issue on master will be created shortly, and we will backport accordingly.

This is not a regression and there is a known workaround. Unsetting the blocker flag, but we expect this to merge before the 4.7 release anyway.

Status update: the upstream PR has merged -> https://github.com/ovn-org/ovn-kubernetes/pull/1921

Fixes are in the cherry-pick state for both downstream 4.7 and 4.6; see -> https://github.com/openshift/ovn-kubernetes/pull/408 and https://github.com/openshift/ovn-kubernetes/pull/411

Update: the downstream 4.7 master PR has merged; waiting on verification to complete the backport to 4.6.

If any of the attached customer cases involve ingress traffic problems after applying network policies, please see BZ1927841 for a probable explanation.

Thanks,
Andrew

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633