Thanks Swadeep. The Customer Portal shows an error loading the case details. Wanted to check which platform this is on? I'm guessing it might be bare metal, as I believe migration was supported only on that platform during the 4.5->4.6 upgrade.
(In reply to Anurag saxena from comment #2)
> Thanks Swadeep. The Customer Portal shows an error loading the case details.
> Wanted to check which platform this is on? I'm guessing it might be bare
> metal, as I believe migration was supported only on that platform during the
> 4.5->4.6 upgrade.

This is a bare-metal UPI deployment (on vSphere).

Regards,
Swadeep
This issue can be reproduced on a new OVN cluster; it should not be related to the upgrade or to migrating from SDN to OVN. When the network policy is added, a pod can only be accessed from its own worker node.

1. oc get pod -n z3 -o wide
NAME            READY   STATUS    RESTARTS   AGE     IP             NODE                                      NOMINATED NODE   READINESS GATES
test-rc-46h44   1/1     Running   0          5h49m   10.128.3.212   dell-per740-14.rhts.eng.pek2.redhat.com   <none>           <none>
test-rc-qst7b   1/1     Running   0          5h49m   10.131.1.160   dell-per740-35.rhts.eng.pek2.redhat.com   <none>           <none>

# oc get svc -n z3
NAME           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
test-service   ClusterIP   172.30.62.145   <none>        27017/TCP   5h49m

and only one policy, 'allow-from-same-namespace', as below:

oc get networkpolicies.networking.k8s.io -n z3 -o yaml
apiVersion: v1
items:
- apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    creationTimestamp: "2020-12-03T09:46:19Z"
    generation: 1
    managedFields:
    - apiVersion: networking.k8s.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:ingress: {}
          f:policyTypes: {}
      manager: kubectl-create
      operation: Update
      time: "2020-12-03T09:46:19Z"
    name: allow-from-same-namespace
    namespace: z3
    resourceVersion: "6290803"
    selfLink: /apis/networking.k8s.io/v1/namespaces/z3/networkpolicies/allow-from-same-namespace
    uid: 05cd5dfe-8610-48f2-8c9b-fb68d652a9e0
  spec:
    ingress:
    - from:
      - podSelector: {}
    podSelector: {}
    policyTypes:
    - Ingress
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

When I enter one test pod, both pods can always be reached using the pod IPs:

~ $ curl 10.131.1.160:8080
Hello OpenShift!
~ $ curl 10.128.3.212:8080
Hello OpenShift!
~ $

but not every request using the service IP works:

~ $ curl --connect-timeout 4 172.30.62.145:27017
Hello OpenShift!
~ $ curl --connect-timeout 4 172.30.62.145:27017
Hello OpenShift!
~ $ curl --connect-timeout 4 172.30.62.145:27017
Hello OpenShift!
~ $ curl --connect-timeout 4 172.30.62.145:27017
Hello OpenShift!
~ $ curl --connect-timeout 4 172.30.62.145:27017
curl: (28) Connection timed out after 4001 milliseconds
~ $ curl --connect-timeout 4 172.30.62.145:27017
Hello OpenShift!
~ $ curl --connect-timeout 4 172.30.62.145:27017
Hello OpenShift!
~ $ curl --connect-timeout 4 172.30.62.145:27017
curl: (28) Connection timed out after 4001 milliseconds
~ $

From the worker node, curl the pod IPs:

[core@dell-per740-35 ~]$ curl --connect-timeout 4 10.131.1.160:8080
Hello OpenShift!
[core@dell-per740-35 ~]$ curl --connect-timeout 4 10.128.3.212:8080
curl: (28) Connection timed out after 4000 milliseconds
[core@dell-per740-35 ~]$ ip route
default via 10.73.117.254 dev br-ex proto dhcp metric 800
10.73.116.0/23 dev br-ex proto kernel scope link src 10.73.116.54 metric 800
10.128.0.0/14 via 10.131.0.1 dev ovn-k8s-mp0
10.131.0.0/23 dev ovn-k8s-mp0 proto kernel scope link src 10.131.0.2
169.254.0.0/20 dev ovn-k8s-gw0 proto kernel scope link src 169.254.0.1
172.30.0.0/16 via 10.131.0.1 dev ovn-k8s-mp0
192.168.222.0/24 dev eno2 proto kernel scope link src 192.168.222.112 metric 100
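For anyone else trying to reproduce this from scratch, a minimal sketch of an equivalent setup is below. The image, the use of a Deployment rather than the original replication controller, and the exec-based curl loop are assumptions on my part (the "Hello OpenShift!" response on port 8080 suggests the openshift/hello-openshift test image); the namespace, service name/port, and policy spec follow the output above.

# Minimal repro sketch; image, Deployment, and exec usage are assumptions, other names follow the output above
oc new-project z3
oc create deployment test-rc --image=openshift/hello-openshift -n z3
oc scale deployment test-rc --replicas=2 -n z3                    # pods should land on different workers
oc expose deployment test-rc --name=test-service --port=27017 --target-port=8080 -n z3

# Apply the same allow-from-same-namespace policy shown above
cat <<'EOF' | oc apply -n z3 -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-same-namespace
spec:
  podSelector: {}
  ingress:
  - from:
    - podSelector: {}
  policyTypes:
  - Ingress
EOF

# Pod IPs answer every time, but the service VIP intermittently times out.
# Assumes the pod image ships curl; otherwise run the curl from a separate client pod in the namespace.
SVC_IP=$(oc get svc test-service -n z3 -o jsonpath='{.spec.clusterIP}')
POD=$(oc get pod -n z3 -o jsonpath='{.items[0].metadata.name}')
for i in $(seq 1 8); do
  oc exec -n z3 "$POD" -- curl -s --connect-timeout 4 "$SVC_IP:27017" || echo "timed out"
done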
Hi all,

I am actively working on reproducing this locally and diagnosing the issue; I will post any findings as I come across them.

Thanks,
Andrew
After investigating today we believe we have found the root cause of the problem...

When a pod sends traffic to a service it is load-balanced (DNAT-ed), and sometimes the same pod is chosen as the backend for the service. To ensure this hairpin traffic travels back through OVN rather than over the pod-to-pod network (since srcIP == dstIP), the traffic is SNAT-ed to the VIP for the service, which is not currently included in the address set that is created for the `allow-from-same-namespace` network policy.

For example:

[astoycos@nfvsdn-03 demo]$ kubectl get pods -n test-network-policy -o wide
NAME                         READY   STATUS    RESTARTS   AGE     IP           NODE                NOMINATED NODE   READINESS GATES
webserver-79997dfc5d-fgwr6   1/1     Running   0          47h     10.244.1.5   ovn-worker          <none>           <none>
webserver-79997dfc5d-gzxh8   1/1     Running   0          4h46m   10.244.0.4   ovn-control-plane   <none>           <none>
webserver-79997dfc5d-zgrzm   1/1     Running   0          4h46m   10.244.2.5   ovn-worker2         <none>           <none>
webserver-pod-test           1/1     Running   0          5h47m   10.244.1.6   ovn-worker          <none>           <none>

The address set made to enforce the `allow-from-same-namespace` network policy originally only included the addresses of all pods in the namespace:

_uuid               : 93f01457-aa8a-431a-842f-dbbee790706d
addresses           : ["10.244.0.4", "10.244.1.5", "10.244.1.6", "10.244.2.5"]
external_ids        : {name=test-network-policy.allow-from-same-namespace.ingress.0_v4}
name                : a17251283737316303016

BUT the service VIPs for the services backed by pods in the test-network-policy namespace also need to be added:

[root@ovn-control-plane ~]# ovn-nbctl lb-list
UUID                                    LB                  PROTO        VIP                  IPs
f205c498-f89a-451c-9f1d-906475d078aa                        udp          10.96.0.10:53        10.244.1.4:53,10.244.2.3:53
cb426972-8344-487b-9232-78da20758fed                        tcp          10.96.0.10:53        10.244.1.4:53,10.244.2.3:53
                                                            tcp          10.96.0.10:9153      10.244.1.4:9153,10.244.2.3:9153
                                                            tcp          10.96.0.1:443        172.18.0.4:6443
                                                            tcp          10.96.23.185:8080    10.244.0.4:8080,10.244.1.5:8080,10.244.2.5:8080

To manually add the VIP to the address_set, run:

`ovn-nbctl add address_set a17251283737316303016 addresses 10.96.23.185`

Now the address set contains the VIP:

_uuid               : 93f01457-aa8a-431a-842f-dbbee790706d
addresses           : ["10.244.0.4", "10.244.1.5", "10.244.1.6", "10.244.2.5", "10.96.23.185"]
external_ids        : {name=test-network-policy.allow-from-same-namespace.ingress.0_v4}
name                : a17251283737316303016

Then all traffic to the service works as expected, even with the network policies applied (see below):

[astoycos@nfvsdn-03 demo]$ kubectl get networkPolicy -n test-network-policy
NAME                        POD-SELECTOR   AGE
allow-from-ingress          <none>         38m
allow-from-same-namespace   <none>         18h
default-deny-all            <none>         18h

[astoycos@nfvsdn-03 demo]$ ./test2.sh
pod/webserver-79997dfc5d-fgwr6 IP --> 10.244.1.5
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:44 GMT
Connection: keep-alive

pod/webserver-79997dfc5d-gzxh8 IP --> 10.244.0.4
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:44 GMT
Connection: keep-alive

pod/webserver-79997dfc5d-zgrzm IP --> 10.244.2.5
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:45 GMT
Connection: keep-alive

pod/webserver-pod-test IP --> 10.244.1.6
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:45 GMT
Connection: keep-alive

(Running the test twice for good measure)

[astoycos@nfvsdn-03 demo]$ ./test2.sh
pod/webserver-79997dfc5d-fgwr6 IP --> 10.244.1.5
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:54 GMT
Connection: keep-alive

pod/webserver-79997dfc5d-gzxh8 IP --> 10.244.0.4
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:55 GMT
Connection: keep-alive

pod/webserver-79997dfc5d-zgrzm IP --> 10.244.2.5
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:55 GMT
Connection: keep-alive

pod/webserver-pod-test IP --> 10.244.1.6
CURLing from above Pod with command --> curl -s -I -m 2 10.96.23.185:8080
HTTP/1.1 200 OK
Date: Mon, 07 Dec 2020 22:47:55 GMT
Connection: keep-alive

A PR to fix this issue on master will be created shortly, and we will backport accordingly.
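For reference, test2.sh itself is not attached to this bug; a rough equivalent is sketched below. The namespace and VIP come from the output above, while the loop structure, the use of kubectl exec, and the assumption that curl is available inside the webserver pods are mine.

#!/bin/bash
# Rough equivalent of ./test2.sh: curl the service VIP from every pod in the namespace.
# Assumptions: namespace and VIP as in the output above, and curl available inside the pods.
NS=test-network-policy
VIP=10.96.23.185:8080
for pod in $(kubectl get pods -n "$NS" -o name); do
    ip=$(kubectl get "$pod" -n "$NS" -o jsonpath='{.status.podIP}')
    echo "$pod IP --> $ip"
    echo "CURLing from above Pod with command --> curl -s -I -m 2 $VIP"
    kubectl exec -n "$NS" "${pod#pod/}" -- curl -s -I -m 2 "$VIP"
done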
This is not a regression and there is a known workaround. Unsetting the blocker flag, but we expect this to merge before the 4.7 release anyway.
Status update: Upstream PR has merged -> https://github.com/ovn-org/ovn-kubernetes/pull/1921
Fixes are in cherry-pick state for both downstream 4.7 and 4.6; see -> https://github.com/openshift/ovn-kubernetes/pull/408 and https://github.com/openshift/ovn-kubernetes/pull/411
Update: The downstream 4.7 (master) PR has merged; waiting on verification before completing the backport to 4.6.
If any of the attached customer cases involve ingress traffic problems after applying network policies, please see BZ1927841 for a probable explanation.

Thanks,
Andrew
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633