Bug 2075475
Summary: OVN-Kubernetes: egress router pod (redirect mode), access from pod on different worker-node (redirect) doesn't work

| Field | Value |
|---|---|
| Product | OpenShift Container Platform |
| Component | Networking |
| Networking sub component | ovn-kubernetes |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Version | 4.9 |
| Target Release | 4.11.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Rainer Beyel <rbeyel> |
| Assignee | Andreas Karis <akaris> |
| QA Contact | Weibin Liang <weliang> |
| CC | ableisch, cfields, danw, evadla, ffernand, mmahmoud, weliang, wweber |
| Flags | mmahmoud: needinfo-, mmahmoud: needinfo- |
| Type | Bug |
| Doc Type | Bug Fix |
| Last Closed | 2022-08-10 11:07:06 UTC |
| Bug Blocks | 2083593 |
Doc Text:

Cause: The egress-router-cni plugin relied on the gateway field of the CNI definition to delete the pod's default route and inject its own. However, the CNI standard clearly defines the annotation k8s.v1.cni.cncf.io/networks with { "default-route": ["{{.Gateway}}"] } as the mechanism to use when a default route shall be injected via an additional network.

Consequence: The egress-router-cni pods lacked some cluster-internal routes which are usually injected by the CNI plugin / SDN provider. Thus, the pods could not reach some cluster-internal destinations.

Fix: Use k8s.v1.cni.cncf.io/networks with { "default-route": ["{{.Gateway}}"] } to inject the correct routing information into the egress-router-cni pods (see the annotation sketch after the table below).

Result: Egress-router-cni pods can reach both external and cluster-internal destinations.
Attachments: service_egress-1.txt, worker1_egress-router-log.txt, worker1_ip_address.txt, egress-router-cni-deployment_ip_address.txt (listed in the comments below)
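To make the Doc Text fix concrete: with Multus, a pod asks for its default route to be installed via an additional network by listing that network in the k8s.v1.cni.cncf.io/networks annotation together with a default-route entry. The sketch below is illustrative only; the network name egress-router-net, the gateway IP, and the image are hypothetical and not taken from this bug.

```yaml
# Hedged sketch of the Multus convention the fix adopts: the pod attaches an
# additional network and requests that its default route point at that
# network's gateway. Network name, gateway IP, and image are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: egress-router-example
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [{
        "name": "egress-router-net",
        "default-route": ["192.168.123.1"]
      }]
spec:
  containers:
    - name: egress-router
      image: registry.example.com/egress-router:latest  # hypothetical image
```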
Description (Rainer Beyel, 2022-04-14 10:33:01 UTC)

Comment from Mohamed Mahmoud:

This link has more details about this feature and its configuration and debugging:
https://docs.openshift.com/container-platform/4.8/networking/ovn_kubernetes_network_provider/deploying-egress-router-ovn-redirection.html

I have the following questions:

- What platform is this, is it bare metal?
- Was it intentional to have the external IP in the same subnet as the node IPs?
- Can you describe the svc? I want to make sure it was labelled correctly.
- Can you connect to the node where the egress router CNI is running and collect the logs (cat /tmp/egress-router-log, ip add)?

The way this works is that the egress router acts as a bridge between pods and the external system. The egress router pod has two interfaces: eth0 for cluster-internal networking, and macvlan0 with an IP and gateway from the external physical network (see the manifest sketch after the attachments below).

Reply from Rainer Beyel:

(In reply to Mohamed Mahmoud from comment #3)

> - what is platform is it baremetal ?

Yes, I tested it with UPI (libvirt) "bare metal" (comment 0). The customer also observes the issue on bare metal.

> - hmm was it intentional to have the external ip in the same subnet as the nodes IP ?

Yes, I chose the same subnet (in my test environment) to keep it simple.

> - can u describe the svc wanted to make sure it was labelled correctly

I'll attach "service_egress-1.txt".

> - can u connect the node where egress router CNI is running on and collect the logs cat /tmp/egress-router-log, ip add

I'll attach "worker1_egress-router-log.txt", "worker1_ip_address.txt", and "egress-router-cni-deployment_ip_address.txt".

P.S. The "egress router pod" is currently running on worker1 (initially it was worker3).

Created attachment 1873486 [details]
service_egress-1.txt
Created attachment 1873487 [details]
worker1_egress-router-log.txt
Created attachment 1873490 [details]
worker1_ip_address.txt
Created attachment 1873492 [details]
egress-router-cni-deployment_ip_address.txt
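For reference, a redirect-mode setup of the kind being debugged here is declared with an EgressRouter custom resource and fronted by a ClusterIP Service, per the OpenShift documentation linked above. The following is a hedged sketch modeled on those docs; all IPs, ports, and names are illustrative and not taken from the attachments.

```yaml
# Hedged sketch of a redirect-mode egress router, modeled on the OpenShift
# docs linked in the comments above. All IPs, ports, and names are examples.
apiVersion: network.operator.openshift.io/v1
kind: EgressRouter
metadata:
  name: egress-router-redirect
spec:
  networkInterface:
    macvlan:
      mode: Bridge                  # macvlan0 bridges onto the external physical network
  addresses:
    - ip: "192.168.123.200/24"      # external IP assigned to macvlan0
      gateway: "192.168.123.1"      # next-hop gateway on the external network
  redirect:
    redirectRules:
      - destinationIP: "203.0.113.25"   # external host the router redirects traffic to
        port: 80
        protocol: TCP
---
# ClusterIP Service fronting the generated egress-router-cni-deployment pods;
# cluster pods connect to this Service and traffic is redirected externally.
apiVersion: v1
kind: Service
metadata:
  name: egress-1
spec:
  type: ClusterIP
  selector:
    app: egress-router-cni          # label on the generated egress router pods
  ports:
    - name: http
      protocol: TCP
      port: 80
```

With a setup of this shape, the expectation under test is that a curl to the Service's ClusterIP from any pod on any node reaches the egress router pod and is redirected to the external destination.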
Comment from Mohamed Mahmoud:

Just to be sure: did the customer drop into the shell of a different test pod and run curl 172.30.138.228:1234? Have we tried creating the test pods first, scaling them to whatever number, and only then deploying the egress router and creating the ClusterIP svc? I would like to collect pcap files for a working and a non-working curl, to make sure the iptables rules took effect and we see DNAT and SNAT taking place.

Comment from Weibin Liang:

QE reproduced this problem on a local testing cluster. The egress router pod is on worker-0-0; curls from the test pods on worker-0-0 succeed, while curls from the test pods on worker-0-1 hang or time out:

```
[weliang@weliang tmp]$ oc get pod -o wide
NAME                                            READY   STATUS    RESTARTS   AGE    IP                NODE         NOMINATED NODE   READINESS GATES
egress-router-cni-deployment-5d659496ff-wn4rf   1/1     Running   0          14m    10.128.2.83       worker-0-0   <none>           <none>
test-pod-86879d8c8c-5jh5s                       1/1     Running   0          7m2s   10.131.0.30       worker-0-1   <none>           <none>
test-pod-86879d8c8c-c4cbv                       1/1     Running   0          7m2s   10.128.2.85       worker-0-0   <none>           <none>
test-pod-86879d8c8c-mc9xh                       1/1     Running   0          7m2s   10.128.2.84       worker-0-0   <none>           <none>
test-pod-86879d8c8c-n8dk8                       1/1     Running   0          7m2s   10.131.0.29       worker-0-1   <none>           <none>
test-pod-86879d8c8c-q97pj                       1/1     Running   0          7m2s   10.128.2.86       worker-0-0   <none>           <none>
test-pod-86879d8c8c-tzsqw                       1/1     Running   0          7m2s   10.131.0.28       worker-0-1   <none>           <none>
worker-0-0-debug                                1/1     Running   0          13m    192.168.123.138   worker-0-0   <none>           <none>

[weliang@weliang tmp]$ oc exec $pod -- curl 10.128.2.83
  0     0    0     0    0     0      0      0 --:--:--  0:00:19 --:--:--     0^C

[weliang@weliang tmp]$ oc exec test-pod-86879d8c8c-mc9xh -- curl 10.128.2.83
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>

[weliang@weliang tmp]$ oc exec test-pod-86879d8c8c-q97pj -- curl 10.128.2.83
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>

[weliang@weliang tmp]$ oc exec test-pod-86879d8c8c-5jh5s -- curl 10.128.2.83
curl: (28) Failed to connect to 10.128.2.83 port 80: Operation timed out
command terminated with exit code 28
```

Comment from Mohamed Mahmoud:

Do we have both egressIP and egress-router configs on the same cluster? Can we get a must-gather? I also want to know whether ovnk is running in shared-gateway or local-gateway mode. In theory, a svc that is tagged with the egress router will be backed by the egress-router pod, so traffic from any pod, anywhere, should reach the egress-router pod and be redirected to the destination.

Comment from Weibin Liang:

Testing also failed on 4.8.33:

```
[weliang@weliang tmp]$ oc exec test-pod-6686bd4977-z5lmm -- curl 172.30.62.189:80
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>

[weliang@weliang tmp]$ oc exec test-pod-6686bd4977-7kclb -- curl 172.30.62.189:80
curl: (28) Failed to connect to 172.30.62.189 port 80: Operation timed out
command terminated with exit code 28
```

Comment from Weibin Liang:

Tested and verified in 4.11.0-0.nightly-2022-05-05-015322. The egress router pod runs on dell-per740-35; the curl now succeeds both from test-pod-86879d8c8c-9rsjl on dell-per740-14 (a different node) and from test-pod-86879d8c8c-q462m on dell-per740-35 (the same node):

```
[weliang@weliang Test]$ oc get pod -o wide
NAME                                            READY   STATUS    RESTARTS   AGE    IP             NODE                                      NOMINATED NODE   READINESS GATES
dell-per740-14rhtsengpek2redhatcom-debug        1/1     Running   0          4m9s   10.73.116.62   dell-per740-14.rhts.eng.pek2.redhat.com   <none>           <none>
egress-router-cni-deployment-7f89795b59-jvxtb   1/1     Running   0          59s    10.131.0.28    dell-per740-35.rhts.eng.pek2.redhat.com   <none>           <none>
test-pod-86879d8c8c-87pbz                       1/1     Running   0          20s    10.131.0.30    dell-per740-35.rhts.eng.pek2.redhat.com   <none>           <none>
test-pod-86879d8c8c-9rsjl                       1/1     Running   0          20s    10.128.2.30    dell-per740-14.rhts.eng.pek2.redhat.com   <none>           <none>
test-pod-86879d8c8c-gw847                       1/1     Running   0          20s    10.128.2.29    dell-per740-14.rhts.eng.pek2.redhat.com   <none>           <none>
test-pod-86879d8c8c-nmtxd                       1/1     Running   0          20s    10.128.2.28    dell-per740-14.rhts.eng.pek2.redhat.com   <none>           <none>
test-pod-86879d8c8c-q462m                       1/1     Running   0          20s    10.131.0.31    dell-per740-35.rhts.eng.pek2.redhat.com   <none>           <none>
test-pod-86879d8c8c-x6zzk                       1/1     Running   0          20s    10.131.0.29    dell-per740-35.rhts.eng.pek2.redhat.com   <none>           <none>

[weliang@weliang Test]$ oc exec test-pod-86879d8c8c-9rsjl -- curl 10.131.0.28
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>

[weliang@weliang Test]$ oc exec test-pod-86879d8c8c-q462m -- curl 10.131.0.28
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>

[weliang@weliang Test]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-05-05-015322   True        False         33m     Cluster version is 4.11.0-0.nightly-2022-05-05-015322
```

Closing comment:

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069