Bug 1552738 - Egress Router HTTP Proxy cannot reach the node which router pod runs
Summary: Egress Router HTTP Proxy cannot reach the node which router pod runs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.10.0
Assignee: Dan Winship
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-03-07 16:15 UTC by Birol Bilgin
Modified: 2018-12-29 07:35 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The way that egress routers are set up made it impossible for an egress router pod to connect to the public IP address of the node it was hosted on. Consequence: If an egress pod was configured to use its node as a name server via /etc/resolv.conf, it would be unable to do DNS resolution. Fix: Traffic from an egress router pod to its node is now routed via the SDN tunnel instead of trying to send it via the egress interface. Result: Egress routers can now connect to their node's IP, and egress router DNS should always work, regardless of configuration.
Clone Of:
Environment:
Last Closed: 2018-07-30 19:10:04 UTC


Attachments (Terms of Use)
iptables_filter (9.90 KB, text/plain)
2018-04-04 11:07 UTC, Birol Bilgin
iptables_nat (89.54 KB, text/plain)
2018-04-04 11:08 UTC, Birol Bilgin


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:1816 None None None 2018-07-30 19:10:30 UTC
Origin (Github) 19885 None None None 2018-05-30 16:04:05 UTC

Internal Links: 1595291

Description Birol Bilgin 2018-03-07 16:15:35 UTC
Description of problem: 

The egress router HTTP proxy cannot reach the node on which the router pod runs,
so DNS name resolution does not work.

Version-Release number of selected component (if applicable):

OCP 3.7
It probably applies to all versions.


How reproducible:

Created a network namespace in a VM or on a host and
replicated the macvlan interface creation,
using the steps from the current snapshot of github.com/openshift/origin:

./images/egress/router/egress-router.sh 30:1 
function setup_network()

./pkg/network/node/pod.go 433:8-15 
netlink.LinkAdd(&netlink.Macvlan{
                LinkAttrs: netlink.LinkAttrs{
                        MTU:         iface.Attrs().MTU,
                        Name:        "macvlan0",
                        ParentIndex: iface.Attrs().Index,
                        Namespace:   netlink.NsFd(podNs.Fd()),
                },
                Mode: netlink.MACVLAN_MODE_PRIVATE,
        })


Steps to Reproduce:
1. ip netns add test
2. ip link add macvlan0 link eth0 type macvlan mode private
3. ip link set dev macvlan0 netns test
4. ip netns exec test bash
// from here on, commands run inside the namespace
5. ip addr add <ip_belongs_the_same_subnet_as_host_ip> dev macvlan0
6. ip link set up dev macvlan0
7. ip route add "<host_gateway>"/32 dev macvlan0
8. ip route add default via "<host_gateway>" dev macvlan0
9. I ran a dnsmasq service to test dns, but I would imagine 
   any open port should work as well.
10. ping <host_ip>
11. dig @<host_ip> redhat.com

Actual results:

ping 172.22.2.52
PING 172.22.2.52 (172.22.2.52) 56(84) bytes of data.
From 172.22.2.1 icmp_seq=2 Redirect Host(New nexthop: 172.22.2.52)
From 172.22.2.1: icmp_seq=2 Redirect Host(New nexthop: 172.22.2.52)
From 172.22.2.1 icmp_seq=7 Redirect Host(New nexthop: 172.22.2.52)
From 172.22.2.1: icmp_seq=7 Redirect Host(New nexthop: 172.22.2.52)
^C
--- 172.22.2.52 ping statistics ---
7 packets transmitted, 0 received, +2 errors, 100% packet loss, time 5999ms


dig @172.22.2.52 redhat.com

; <<>> DiG 9.9.4-RedHat-9.9.4-51.el7_4.2 <<>> @172.22.2.52 redhat.com
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached


Expected results:

10. Not sure whether ping should work.

11. Since the pod's nameserver is the node IP, the pod should be
able to reach the node. It cannot, so DNS resolution does not work.

Additional info:

Comment 2 Meng Bo 2018-03-08 10:33:13 UTC
I can reproduce the issue with v3.9.3

The egress-http-proxy and egress-router pods cannot talk to the host IP when dnsmasq is enabled on the node.

# ip neigh
10.66.140.15 dev macvlan0  FAILED
10.66.140.117 dev macvlan0 lladdr 52:54:00:7e:86:4e STALE

# ip route
default via 10.66.141.254 dev macvlan0 
10.66.140.0/23 dev macvlan0 proto kernel scope link src 10.66.140.200 
10.66.141.254 dev macvlan0 scope link 
10.128.0.0/23 dev eth0 proto kernel scope link src 10.128.0.17 
10.128.0.0/14 dev eth0 
224.0.0.0/4 dev eth0 


10.66.140.15 is the other node.
10.66.140.117 is the node where the egress pod landed.
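
A FAILED neighbour entry on macvlan0, as in the "ip neigh" output above, is one visible symptom of this problem. As a rough sketch for spotting it, the helper below filters "ip neigh" output for FAILED entries on a given device. The function name failed_neighbours is hypothetical, and a FAILED entry alone does not prove that the address belongs to the pod's own node.

```shell
#!/bin/sh
# failed_neighbours DEVICE
# Reads `ip neigh` output on stdin and prints the IP address of every
# neighbour entry on DEVICE whose state (the last field) is FAILED.
# (Hypothetical helper name, not part of the egress router image.)
failed_neighbours() {
    awk -v dev="$1" '$2 == "dev" && $3 == dev && $NF == "FAILED" { print $1 }'
}

# Example, run inside the pod's network namespace:
#   ip neigh | failed_neighbours macvlan0
```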

Comment 7 Dan Winship 2018-03-13 16:56:57 UTC
OK, right. This is inherent to the way macvlans work: even if you set them to "bridge" mode (which we don't), they can't send packets directly to their parent device, so even if you set up proper subnet routing, it would only be able to connect to the node's primary IP if the node's upstream router was willing to "hairpin" packets (which it probably isn't).

One possible fix would be to masquerade the packets to the node's internal SDN IP address instead. Eg, if the node has primary IP 172.17.0.3 and tun0 IP 10.129.0.1, then you'd run (in the pod's network namespace):

  iptables -t nat -A OUTPUT -d 172.17.0.3/32 \
      -j DNAT --to-destination 10.129.0.1
  iptables -t nat -I POSTROUTING -d 10.129.0.1/32 \
      -j MASQUERADE

(Note "-I" not "-A" on the second rule, to get it inserted before the default SNAT rule.)

This could be partially automated:

  #!/bin/bash
  
  node_eth0_address=172.22.2.52
  egress_pod_eth0_address=$(ip addr show dev eth0 | \
      sed -ne 's/.*inet \([0-9.]*\)\/.*/\1/p')
  node_tun0_address=$(echo $(egress_pod_eth0_address) | sed -e 's/[0-9]*$/1/')
  
  iptables -t nat -A OUTPUT -d $(node_eth0_address)/32 \
      -j DNAT --to-destination $(node_tun0_address)
  iptables -t nat -I POSTROUTING -d $(node_tun0_address)/32 \
      -j MASQUERADE

"node_eth0_address" needs to be filled in here by hand, but the tun0 address can be figured out from the egress-router's eth0 configuration.

Right now the egress-router.sh script doesn't know the node's primary IP so it wouldn't be able to set this up automatically. I need to think about the best way to do this.

Comment 8 Birol Bilgin 2018-03-14 14:34:19 UTC
When I ran these commands there were some errors: values like $(egress_pod_eth0_address) should be written as $egress_pod_eth0_address, otherwise bash treats them as commands to run.
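
For reference, here is a sketch of the comment 7 workaround with variable expansion in place of command substitution. It is split into small functions that only print the iptables commands rather than run them, so the result can be inspected first. The function names are hypothetical, the node address still has to be supplied by hand, and the tun0 address is derived with the same "replace the last octet with 1" heuristic as the original script, which may not hold for every cluster network layout.

```shell
#!/bin/sh
# parse_inet_address
# Reads `ip addr show dev eth0` output on stdin and prints the first IPv4
# address found (same sed expression as the script in comment 7).
parse_inet_address() {
    sed -ne 's/.*inet \([0-9.]*\)\/.*/\1/p' | head -n 1
}

# derive_tun0_address POD_IP
# Replaces the last octet of POD_IP with 1. This is the heuristic from
# comment 7 and is an assumption about how the node's tun0 IP is assigned.
derive_tun0_address() {
    echo "$1" | sed -e 's/[0-9]*$/1/'
}

# print_workaround_rules NODE_ETH0_IP NODE_TUN0_IP
# Prints (does not execute) the two NAT rules from comment 7.
# Note "-I" on the second rule so it is inserted before any SNAT rule.
print_workaround_rules() {
    echo "iptables -t nat -A OUTPUT -d $1/32 -j DNAT --to-destination $2"
    echo "iptables -t nat -I POSTROUTING -d $2/32 -j MASQUERADE"
}

# Example, run inside the pod's network namespace
# (172.22.2.52 stands in for the node's primary IP):
#   pod_ip=$(ip addr show dev eth0 | parse_inet_address)
#   print_workaround_rules 172.22.2.52 "$(derive_tun0_address "$pod_ip")"
```

Piping the printed lines to "sh" inside the pod's network namespace would apply them.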

After running this, I could ping the node IP.
However, I could not make any DNS queries.

These probes were taken from the pod's namespace:

$ nmap 172.22.2.52 -p 53 -Pn

Starting Nmap 6.40 ( http://nmap.org ) at 2018-03-14 10:27 EDT
Nmap scan report for 10.74.157.166
Host is up.
PORT   STATE    SERVICE
53/tcp filtered domain

$ nmap 172.22.2.52 -p 53 -Pn -sU

Starting Nmap 6.40 ( http://nmap.org ) at 2018-03-14 10:27 EDT
Nmap scan report for 10.74.157.166
Host is up.
PORT   STATE         SERVICE
53/udp open|filtered domain

It seems we need modifications to the host networking as well.

Comment 9 Meng Bo 2018-03-15 06:00:27 UTC
The script in comment #7 works for me, though it has some shell syntax issues.

After fixing the script and running it inside the container's network namespace, the egress pod can access the host's IP address and can resolve domain names normally.

sh-4.2# cat /etc/resolv.conf 
nameserver 10.66.140.15
search default.svc.cluster.local svc.cluster.local cluster.local par.redhat.com bmeng.local
options ndots:5

sh-4.2# ping 10.66.140.15
PING 10.66.140.15 (10.66.140.15) 56(84) bytes of data.
64 bytes from 10.66.140.15: icmp_seq=1 ttl=64 time=0.459 ms
^C
--- 10.66.140.15 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.459/0.459/0.459/0.000 ms

sh-4.2# curl -I www.youdao.com
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 15 Mar 2018 05:58:04 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 0
Connection: keep-alive
Cache-Control: private
Content-Language: en-US
Set-Cookie: DICT_UGC=be3af0da19b5c5e6aa4e17bd8d90b28a|; domain=.youdao.com
Set-Cookie: OUTFOX_SEARCH_USER_ID=542782096@119.254.120.72; domain=.youdao.com; expires=Sat, 07-Mar-2048 05:58:03 GMT
Set-Cookie: JSESSIONID=abc5b_DX9j7OB_70jpOiw; domain=youdao.com; path=/

The iptables rules in the pod look like:
[root@ose-node2 ~]# nsenter -n -t 4291
[root@ose-node2 ~]# iptables -S -t nat 
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-A OUTPUT -d 10.66.140.15/32 -j DNAT --to-destination 10.129.0.1
-A POSTROUTING -d 10.129.0.1/32 -j MASQUERADE

Comment 10 Dan Winship 2018-03-15 12:58:35 UTC
> The iptables rules in the pod look like:

Did you miss a line in the cut+paste? It should end with

  -A POSTROUTING -j SNAT --to-source ${EGRESS_SOURCE}

if you don't see that there then the egress router isn't set up correctly. (Did you accidentally flush the egress router's own rules at some point?)

Comment 11 Birol Bilgin 2018-03-15 13:17:10 UTC
For me the iptables rules are the same:

# iptables -S -t nat
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-A OUTPUT -d 10.74.157.166/32 -j DNAT --to-destination 10.128.0.1
-A POSTROUTING -d 10.128.0.1/32 -j MASQUERADE

I think this is because the router's EGRESS_ROUTER_MODE is http-proxy.
In http-proxy mode setup_iptables does not run; only setup_network runs.

https://github.com/openshift/origin/blob/9d81d1bbb5c512ffcb8cb34b373d93f92ed26628/images/egress/router/egress-router.sh#L153

Comment 12 Birol Bilgin 2018-03-19 12:10:50 UTC
Correction to https://bugzilla.redhat.com/show_bug.cgi?id=1552738#c8

I had previously tested the workaround on 3.6, where it did not work.

I have now tested the workaround on 3.7, and DNS resolution works.

Comment 15 Birol Bilgin 2018-04-04 11:07:52 UTC
Created attachment 1417214 [details]
iptables_filter

Comment 16 Birol Bilgin 2018-04-04 11:08:36 UTC
Created attachment 1417215 [details]
iptables_nat

Comment 20 Meng Bo 2018-04-27 08:44:20 UTC
Any update on this? We still face this problem in 3.10 testing.

Comment 22 Dan Winship 2018-05-30 16:04:06 UTC
https://github.com/openshift/origin/pull/19885

Comment 23 Dan Winship 2018-05-30 16:07:55 UTC
Note to QE: the fix requires both a new origin binary and a new egress-router image. I'm not sure where your images come from when you're testing, but make sure you do get the new one. You can tell by looking at the iptables rules; if you do "oc exec my-egress-router-pod -- iptables-save", it should have:

  -A POSTROUTING -o macvlan0 -j SNAT --to-source ${EGRESS_SOURCE}

not

  -A POSTROUTING -j SNAT --to-source ${EGRESS_SOURCE}
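
As a rough aid for that check, the sketch below classifies "iptables-save" output by which form of the SNAT rule it contains. The function name classify_snat_rule is hypothetical. Per comment 11, a router in http-proxy mode may not install this rule at all, in which case the result would be "unknown" rather than evidence of an old image.

```shell
#!/bin/sh
# classify_snat_rule
# Reads `iptables-save` output on stdin and prints:
#   fixed    if the SNAT rule is restricted to macvlan0 (new image)
#   unfixed  if an unrestricted POSTROUTING SNAT rule is present (old image)
#   unknown  if neither form is found
# (Hypothetical helper name, not part of the egress router image.)
classify_snat_rule() {
    rules=$(cat)
    if printf '%s\n' "$rules" | grep -q -e '^-A POSTROUTING -o macvlan0 -j SNAT --to-source '; then
        echo fixed
    elif printf '%s\n' "$rules" | grep -q -e '^-A POSTROUTING -j SNAT --to-source '; then
        echo unfixed
    else
        echo unknown
    fi
}

# Example:
#   oc exec my-egress-router-pod -- iptables-save | classify_snat_rule
```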

Comment 24 Weibin Liang 2018-06-01 20:24:05 UTC
@Dan, Testing on v3.10.0-0.56.0:

[root@ip-172-18-6-68 ~]# docker ps | grep egress
9d583f4b7e24        e32428b2269e                                                                                                                                       "/usr/bin/pod"           2 minutes ago        Up About a minute                           k8s_egressrouter-redirect_egress-redirect_p1_93d2644f-65d7-11e8-90ea-0e53bd2bbf32_0
55fa51e2bc48        registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.10.0-0.56.0                                                                               "/usr/bin/pod"           2 minutes ago        Up 2 minutes                                k8s_POD_egress-redirect_p1_93d2644f-65d7-11e8-90ea-0e53bd2bbf32_1
[root@ip-172-18-6-68 ~]# docker inspect 9d583f4b7e24  | grep Pid
            "Pid": 10927,
            "PidMode": "",
            "PidsLimit": 0,
[root@ip-172-18-6-68 ~]# nsenter -n -t 10927 bash
[root@ip-172-18-6-68 ~]# iptables-save
# Generated by iptables-save v1.4.21 on Fri Jun  1 16:12:47 2018
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
-A PREROUTING -i eth0 -j DNAT --to-destination 10.240.0.65
-A POSTROUTING -o macvlan0 -j SNAT --to-source 172.18.6.68
COMMIT
# Completed on Fri Jun  1 16:12:47 2018
# Generated by iptables-save v1.4.21 on Fri Jun  1 16:12:47 2018
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
COMMIT
# Completed on Fri Jun  1 16:12:47 2018
[root@ip-172-18-6-68 ~]# 


For the newly fixed iptables rule, should it be "-A POSTROUTING -o macvlan0 -j SNAT --to-source 10.129.0.1" (SDN tunnel) rather than "-A POSTROUTING -o macvlan0 -j SNAT --to-source 172.18.6.68" (egress interface)?

Comment 25 Dan Winship 2018-06-01 20:39:29 UTC
The rule is correct; it's not really a "new" rule, it's just a fix to the old rule. It used to be that *all* outgoing traffic got NATted to the EGRESS_SOURCE, but that ended up meaning that the egress-router couldn't send to addresses on the SDN. The fix was to add "-o macvlan0" to the NAT rule, so that it only applies to traffic that is going out the macvlan interface, which is what we'd intended all along.

Comment 27 Meng Bo 2018-06-05 10:45:38 UTC
Tested on v3.10.0-0.58.0 with the egress router images.

The egress router and egress HTTP proxy features work well with dnsmasq enabled.

The following iptables rule is present in the egress router pod:
-A POSTROUTING -o macvlan0 -j SNAT --to-source 10.66.140.201

Comment 33 errata-xmlrpc 2018-07-30 19:10:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

