Description of problem:

OVN-Kubernetes: egress router pod (redirect mode) cannot be reached from a pod on a different worker node; redirection only works from pods on the same node.

Version-Release number of selected component (if applicable):

OCP 4.9.24, 4.9.26, 4.10.9

How reproducible:

- Installed OCP 4.9.24 UPI bare metal (libvirt) with OVN
  - 3 masters, 3 workers
  - node network: 192.168.60.0/24
- External application at 192.168.60.2/24:
  while true; do cat netcat_curl.txt | nc -l 192.168.60.2 1234; done
- Created project/namespace "rregress"
- Added the label "k8s.ovn.org/egress-assignable" to all worker nodes [1]
- Deployed an EgressRouter [2] (manifest sketch at the end of this comment)
  - "... egress IP address ... must be in the same subnet as the primary IP address of the node ..." [1]
  - "... The additional IP address must not be assigned to any other node in the cluster ..." [1]
  - ip: "192.168.60.81/24", gateway: "192.168.60.1"
  - destinationIP: "192.168.60.2", port: 1234
  - Pod egress-router-cni-deployment... is running on worker3
- Created Service "egress-1" [2]
- Verified that the egress router pod itself reaches the external application:

  # oc rsh egress-router-cni-deployment...
  sh-4.4$ ip a s net1 | grep "inet "
      inet 192.168.60.81/24 brd 192.168.60.255 scope global net1
  sh-4.4$ curl 192.168.60.2:1234
  ok

- Created a deployment of simple "test pods" with replicas=3, so one pod runs on each worker node
- Pod on worker1:

  sh-4.4$ curl --max-time 5 egress-1:1234
  curl: (28) Connection timed out after 5000 milliseconds

- Pod on worker2:

  sh-4.4$ curl --max-time 5 egress-1:1234
  curl: (28) Connection timed out after 5000 milliseconds

- Pod on worker3:

  sh-4.4$ curl --max-time 5 egress-1:1234
  ok

- If both pods (test pod and egress pod) are on the same worker node (in this scenario worker3), it works.
- Using the egress pod's IP instead of the "egress-1" service gives the same results.

Actual results:

- test pod and egress pod on the same worker node:
  test-pod --curl--> service --> egress-pod --> external-app --> successful
- test pod and egress pod on different worker nodes:
  no connection to the external application

Expected results:

Regardless of which worker nodes the test pod and the egress pod are running on, the connection to the external application succeeds.

Additional info:

[1] https://docs.openshift.com/container-platform/4.9/networking/ovn_kubernetes_network_provider/configuring-egress-ips-ovn.html
[2] https://docs.openshift.com/container-platform/4.9/networking/ovn_kubernetes_network_provider/using-an-egress-router-ovn.html
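For reference, a minimal sketch of what the two manifests look like with the values above, following the redirect-mode example layout in [2]. The metadata names and the "app: egress-router-cni" service selector are assumptions, inferred from the generated deployment name:

  apiVersion: network.operator.openshift.io/v1
  kind: EgressRouter
  metadata:
    name: egress-router-redirect    # assumed name
    namespace: rregress
  spec:
    networkInterface:
      macvlan:
        mode: Bridge
    addresses:
    # secondary IP on the node subnet, not assigned to any other node
    - ip: "192.168.60.81/24"
      gateway: "192.168.60.1"
    redirect:
      redirectRules:
      - destinationIP: "192.168.60.2"
        port: 1234
        protocol: TCP
  ---
  apiVersion: v1
  kind: Service
  metadata:
    name: egress-1
    namespace: rregress
  spec:
    type: ClusterIP
    selector:
      app: egress-router-cni        # assumed label on the generated pod
    ports:
    - protocol: TCP
      port: 1234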
This link has more details about this feature, its configuration, and debugging:
https://docs.openshift.com/container-platform/4.8/networking/ovn_kubernetes_network_provider/deploying-egress-router-ovn-redirection.html

I have the following questions:
- What platform is this? Is it bare metal?
- Was it intentional to have the external IP in the same subnet as the nodes' IPs?
- Can you describe the svc? I want to make sure it was labelled correctly.
- Can you connect to the node where the egress router CNI is running and collect the logs (cat /tmp/egress-router-log, ip add)?

The way this works is that the egress router acts as a bridge between pods and the external system. The egress router pod has two interfaces: eth0 for cluster-internal networking, and a macvlan interface that has an IP and gateway from the external physical network.
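To illustrate, both interfaces should be visible from inside the egress router pod. A quick check, assuming the pod shell from comment 0 (where the macvlan interface appears as net1):

  $ oc rsh egress-router-cni-deployment-<hash>
  sh-4.4$ ip addr show eth0   # cluster-internal OVN interface (pod network)
  sh-4.4$ ip addr show net1   # macvlan interface with the external IP 192.168.60.81/24
  sh-4.4$ ip route            # default route should point at the external gateway 192.168.60.1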
(In reply to Mohamed Mahmoud from comment #3)
> - What platform is this? Is it bare metal?

- Yes, I tested it with UPI (libvirt) "bare metal" (comment 0).
- The customer also observes the issue on bare metal.

> - Was it intentional to have the external IP in the same subnet as the nodes' IPs?

- Yes, I chose the same subnet (in my test environment) to keep it simple.

> - Can you describe the svc? I want to make sure it was labelled correctly.

- I'll attach "service_egress-1.txt".

> - Can you connect to the node where the egress router CNI is running and collect the logs (cat /tmp/egress-router-log, ip add)?

- I'll attach "worker1_egress-router-log.txt", "worker1_ip_address.txt", and "egress-router-cni-deployment_ip_address.txt".

P.S. The egress router pod is currently running on worker1 (initially it was worker3).
Created attachment 1873486 [details] service_egress-1.txt
Created attachment 1873487 [details] worker1_egress-router-log.txt
Created attachment 1873490 [details] worker1_ip_address.txt
Created attachment 1873492 [details] egress-router-cni-deployment_ip_address.txt
Just to be sure: did the customer drop into the shell of each of the different test pods and run curl 172.30.138.228:1234?

Have we tried creating the test pods first, scaling them to whatever number, and only then deploying the egress router and creating the ClusterIP svc?

I would also like to collect pcap files for a working and a non-working curl, to make sure the iptables rules took effect and that we see DNAT and SNAT take place.
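A sketch of how that could be collected, using the pod/node names from this report and assuming the egress router pod image ships iptables:

  # DNAT/SNAT rules that the egress-router CNI set up inside the pod's netns:
  $ oc rsh egress-router-cni-deployment-<hash> iptables -t nat -S

  # Packet capture on the node hosting the egress router pod (currently
  # worker1), once during a working curl and once during a non-working one:
  $ oc debug node/worker1
  # chroot /host
  # tcpdump -nn -s 0 -i any -w /var/tmp/egress-curl.pcap port 1234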
QE reproduced this problem on a local testing cluster:

[weliang@weliang tmp]$ oc get pod -o wide
NAME                                            READY   STATUS    RESTARTS   AGE    IP                NODE         NOMINATED NODE   READINESS GATES
egress-router-cni-deployment-5d659496ff-wn4rf   1/1     Running   0          14m    10.128.2.83       worker-0-0   <none>           <none>
test-pod-86879d8c8c-5jh5s                       1/1     Running   0          7m2s   10.131.0.30       worker-0-1   <none>           <none>
test-pod-86879d8c8c-c4cbv                       1/1     Running   0          7m2s   10.128.2.85       worker-0-0   <none>           <none>
test-pod-86879d8c8c-mc9xh                       1/1     Running   0          7m2s   10.128.2.84       worker-0-0   <none>           <none>
test-pod-86879d8c8c-n8dk8                       1/1     Running   0          7m2s   10.131.0.29       worker-0-1   <none>           <none>
test-pod-86879d8c8c-q97pj                       1/1     Running   0          7m2s   10.128.2.86       worker-0-0   <none>           <none>
test-pod-86879d8c8c-tzsqw                       1/1     Running   0          7m2s   10.131.0.28       worker-0-1   <none>           <none>
worker-0-0-debug                                1/1     Running   0          13m    192.168.123.138   worker-0-0   <none>           <none>

[weliang@weliang tmp]$ oc exec $pod -- curl 10.128.2.83
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:19 --:--:--     0^C

[weliang@weliang tmp]$ oc exec test-pod-86879d8c8c-mc9xh -- curl 10.128.2.83
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
100   219  100   219    0     0    793      0 --:--:-- --:--:-- --:--:--   796

[weliang@weliang tmp]$ oc exec test-pod-86879d8c8c-q97pj -- curl 10.128.2.83
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   219  100   219    0     0    820      0 --:--:-- --:--:-- --:--:--   817
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>

[weliang@weliang tmp]$ oc exec test-pod-86879d8c8c-5jh5s -- curl 10.128.2.83
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:08 --:--:--     0
curl: (28) Failed to connect to 10.128.2.83 port 80: Operation timed out
command terminated with exit code 28
Do we have both egressIP and egress-router configs on the same cluster? Can we get a must-gather? I would also like to know whether ovn-kubernetes is running in shared-gateway or local-gateway mode.

In theory, the svc that is tagged with the egress router is backed by the egress-router pod, so traffic from any pod anywhere should reach the egress-router pod and get redirected to the destination.
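A sketch of commands that could gather this information. The routingViaHost jsonpath is an assumption and only exists on releases that expose the gateway config through the Network operator CR (local gateway if true, shared gateway if false/unset):

  # Cluster diagnostics:
  $ oc adm must-gather

  # Confirm the service is actually backed by the egress router pod:
  $ oc get endpoints egress-1 -n <namespace>

  # Gateway mode (field name assumed, may differ between releases):
  $ oc get network.operator cluster \
      -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.gatewayConfig.routingViaHost}'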
Testing also failed on 4.8.33:

[weliang@weliang tmp]$ oc exec test-pod-6686bd4977-z5lmm -- curl 172.30.62.189:80
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   219  100   219    0     0    793      0 --:--:-- --:--:-- --:--:--   793
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>

[weliang@weliang tmp]$ oc exec test-pod-6686bd4977-7kclb -- curl 172.30.62.189:80
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:11 --:--:--     0
curl: (28) Failed to connect to 172.30.62.189 port 80: Operation timed out
command terminated with exit code 28
Tested and verified in 4.11.0-0.nightly-2022-05-05-015322:

[weliang@weliang Test]$ oc get pod -o wide
NAME                                            READY   STATUS    RESTARTS   AGE    IP             NODE                                      NOMINATED NODE   READINESS GATES
dell-per740-14rhtsengpek2redhatcom-debug        1/1     Running   0          4m9s   10.73.116.62   dell-per740-14.rhts.eng.pek2.redhat.com   <none>           <none>
egress-router-cni-deployment-7f89795b59-jvxtb   1/1     Running   0          59s    10.131.0.28    dell-per740-35.rhts.eng.pek2.redhat.com   <none>           <none>
test-pod-86879d8c8c-87pbz                       1/1     Running   0          20s    10.131.0.30    dell-per740-35.rhts.eng.pek2.redhat.com   <none>           <none>
test-pod-86879d8c8c-9rsjl                       1/1     Running   0          20s    10.128.2.30    dell-per740-14.rhts.eng.pek2.redhat.com   <none>           <none>
test-pod-86879d8c8c-gw847                       1/1     Running   0          20s    10.128.2.29    dell-per740-14.rhts.eng.pek2.redhat.com   <none>           <none>
test-pod-86879d8c8c-nmtxd                       1/1     Running   0          20s    10.128.2.28    dell-per740-14.rhts.eng.pek2.redhat.com   <none>           <none>
test-pod-86879d8c8c-q462m                       1/1     Running   0          20s    10.131.0.31    dell-per740-35.rhts.eng.pek2.redhat.com   <none>           <none>
test-pod-86879d8c8c-x6zzk                       1/1     Running   0          20s    10.131.0.29    dell-per740-35.rhts.eng.pek2.redhat.com   <none>           <none>

[weliang@weliang Test]$ oc exec test-pod-86879d8c8c-9rsjl -- curl 10.131.0.28
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   219  100   219    0     0    361      0 --:--:-- --:--:-- --:--:--   360
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>

[weliang@weliang Test]$ oc exec test-pod-86879d8c8c-q462m -- curl 10.131.0.28
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   219  100   219    0     0    347      0 --:--:-- --:--:-- --:--:--   347
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>

[weliang@weliang Test]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-05-05-015322   True        False         33m     Cluster version is 4.11.0-0.nightly-2022-05-05-015322
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069