Bug 1552869

Summary: Semi-automatic namespace-wide egress IP randomly shows up as node IP
Product: OpenShift Container Platform
Reporter: Taneem Ibrahim <tibrahim>
Component: Networking
Assignee: Dan Winship <danw>
Status: CLOSED ERRATA
QA Contact: Meng Bo <bmeng>
Severity: high
Docs Contact:
Priority: high
Version: 3.7.0
CC: aos-bugs, bbennett, danw, eparis, tibrahim
Target Milestone: ---
Target Release: 3.10.0
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The "kube-proxy" and "kubelet" parts of the OpenShift node process were being given different default values for the config options describing how to interact with iptables. Consequence: OpenShift would periodically add a bogus iptables rule that would cause *some* per-project static egress IPs to not be used for some length of time, until the bogus rule was removed again. (While the bogus rule was present, traffic from those projects would use the node IP address of the node hosting the egress IP, rather than the egress IP itself.) Fix: The inconsistent configuration was resolved, causing the bogus iptables rule to no longer be added. Result: Projects consistently use their static egress IPs.
Story Points: ---
Clone Of:
: 1560584 1560586 1560587 (view as bug list)
Environment:
Last Closed: 2018-07-30 19:10:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1560584, 1560586, 1560587    

Description Taneem Ibrahim 2018-03-07 21:13:20 UTC
Description of problem:

Semi-automatic namespace-wide egress IP randomly shows up as the node IP for external connections.

Version-Release number of selected component (if applicable):

v3.7.9

How reproducible:

If we run the test 10 times, 7 of those times we see the egress IP as the outgoing IP; the other 3 times the node IP gets picked instead.

Steps to Reproduce:
1. Follow steps here to create an egress IP:
https://access.redhat.com/documentation/en-us/openshift_container_platform/3.7/html-single/cluster_administration/#enabling-static-ips-for-external-project-traffic

2. Run a simple curl test to an external site from the pod with the egress IP 10 times in a row.
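
For example, using an external HTTP server that echoes back the client IP, similar to the loops in the comments below (the pod name and server address here are placeholders):

  $ for i in {1..10}; do oc exec <pod-in-project> -- curl -s <external-ip-echo-server> ; echo ; done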


Actual results:

In a random number of the attempts, the node IP gets selected as the outgoing IP instead of the egress IP.

Expected results:

It should always show the egress IP as the outgoing IP for the project.


Additional info:

Comment 1 Meng Bo 2018-03-08 03:08:21 UTC
Cannot reproduce on v3.9.3

[root@ose-master ~]# oc get hostsubnet 
NAME                    HOST                    HOST IP    SUBNET          EGRESS IPS
ose-node1.bmeng.local   ose-node1.bmeng.local   10.1.1.3   10.128.0.0/23   [10.1.1.200]
ose-node2.bmeng.local   ose-node2.bmeng.local   10.1.1.4   10.129.0.0/23   []
[root@ose-master ~]# oc get netnamespace u1p1
NAME      NETID     EGRESS IPS
u1p1      392490    [10.1.1.200]

$ oc get po -o wide 
NAME            READY     STATUS    RESTARTS   AGE       IP            NODE
test-rc-mm99q   1/1       Running   0          1m        10.128.0.35   ose-node1.bmeng.local
test-rc-wg4h4   1/1       Running   0          1m        10.129.0.2    ose-node2.bmeng.local

$ ping -c1 ose-node1.bmeng.local
PING ose-node1.bmeng.local (10.1.1.3) 56(84) bytes of data.
64 bytes from ose-node1.bmeng.local (10.1.1.3): icmp_seq=1 ttl=64 time=0.237 ms

$ ping -c1 ose-node2.bmeng.local
PING ose-node2.bmeng.local (10.1.1.4) 56(84) bytes of data.
64 bytes from ose-node2.bmeng.local (10.1.1.4): icmp_seq=1 ttl=64 time=0.201 ms

[root@ose-node1 ~]# curl 10.1.1.2:8888
10.1.1.3
[root@ose-node2 ~]# curl 10.1.1.2:8888
10.1.1.4

On each pod:
$ for i in {1..20}; do oc exec test-rc-mm99q -- curl -s 10.1.1.2:8888 ; echo ; done 
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200


$ for i in {1..20}; do oc exec test-rc-wg4h4 -- curl -s 10.1.1.2:8888 ; echo ; done 
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200
10.1.1.200

Comment 4 Meng Bo 2018-03-08 07:20:39 UTC
[root@ose-master-37 ~]# oc get hostsubnet 
NAME                       HOST                       HOST IP        SUBNET          EGRESS IPS
ose-node1-37.bmeng.local   ose-node1-37.bmeng.local   10.66.140.1    10.128.0.0/23   []
ose-node2-37.bmeng.local   ose-node2-37.bmeng.local   10.66.140.21   10.129.0.0/23   [10.66.140.100]


[user1@ose-master-37 ~]$ for i in {1..20}; do oc exec test-rc-kv6k7 -- curl -s 10.66.141.175:8888 ; echo ; done
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
[user1@ose-master-37 ~]$ for i in {1..20}; do oc exec test-rc-qk2fp -- curl -s 10.66.141.175:8888 ; echo ; done
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100
10.66.140.100

It also works fine on a v3.7.9 environment.

Comment 6 Dan Winship 2018-03-14 14:22:05 UTC
For one, the fact that they can reproduce this in two different clusters while we cannot reproduce it in any means they are clearly doing something "weird" in those clusters, so having as much information as possible about any non-standard things they are doing would be helpful.

Especially, are they at all touching iptables or OVS, ever, in any way, other than what OpenShift does itself by default?

Have they changed the values of any of the networking-related node-config variables from the defaults?


Further debug info:

1. Mark the node hosting the egress IP unschedulable and flush all existing pods from it

2. Restart atomic-openshift-node on that node, running with --loglevel=5

3. Give the node a minute to get fully up and running, then run "iptables-save --counters" and save the output

4. Run the curl test for a while from another node, printing a timestamp with each attempt, so we can see the exact time of each incorrect IP (a sketch of such a loop follows after this list).

5. Run "iptables-save --counters" again on the egress node

6. Send us the two iptables-save outputs, the curl test log with timestamps, and the atomic-openshift-node logs from the egress node.
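
A minimal sketch of the timestamped loop from step 4 (the pod name and external echo-server address are placeholders):

  $ for i in {1..100}; do echo -n "$(date -u '+%F %T')  "; oc exec <pod-in-project> -- curl -s <external-ip-echo-server> ; echo ; done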

Comment 8 Dan Winship 2018-03-16 15:01:30 UTC
Copying from the support case:

The problem is that upstream Kubernetes specifies the default for the "KUBE-MARK-MASQ" bit (the iptables masquerade mark) in two different places: once for kubelet and once for kube-proxy. We're overriding the kube-proxy default value but not the kubelet default value. No one noticed this before because even though kubelet occasionally tries to resynchronize the iptables rules with its own value of KUBE-MARK-MASQ, kube-proxy ends up overwriting it with its own value. But while kubelet's incorrect rule is there, it can cause egress IPs to fail (by sending the packets through the regular masquerade rule rather than the egress IP masquerade rule).
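
As a rough illustration only (the chain name and exact rule text here are assumptions; the point is the mark bits), the mismatch means two different mark-based MASQUERADE rules can be present in the nat table at the same time:

  # rule using OpenShift's configured masquerade bit 0 (0x1)
  -A KUBE-POSTROUTING -m mark --mark 0x1/0x1 -j MASQUERADE
  # bogus rule using kubelet's upstream default, bit 14 (0x4000), periodically re-added
  -A KUBE-POSTROUTING -m mark --mark 0x4000/0x4000 -j MASQUERADE

While the second rule is present, egress traffic whose packet mark happens to have bit 14 set is masqueraded to the node IP instead of being handled by the egress IP rule.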

Workaround: Add this to your node-config.yaml on the egress node:

  kubeletArguments:
    iptables-masquerade-bit: ["0"]

If you already have a "kubeletArguments" section, just add the "iptables-masquerade-bit" line as another line in that section.
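
For example, a node-config.yaml that already has a kubeletArguments section might end up looking like this (the "max-pods" entry is only a placeholder for whatever arguments are already there):

  kubeletArguments:
    max-pods: ["250"]
    iptables-masquerade-bit: ["0"]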

Comment 9 Dan Winship 2018-03-16 15:08:01 UTC
https://github.com/openshift/origin/pull/19005

Comment 10 Dan Winship 2018-03-21 12:51:38 UTC
bmeng: the missing piece in being able to reproduce this before was that it only happens if the egress IP has bit 14 set; that is, if the 3rd octet of the egress IP address is between 64 and 127, or 192 and 255. (So both of your test cases before, 10.1.1.200 and 10.66.140.100, missed the bug, but the reporter's examples, 192.168.80.66 and 192.168.64.49, hit it.)
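
As a worked illustration of that condition: treating a.b.c.d as a 32-bit value, bit 14 (0x4000) falls inside the third octet, specifically its 0x40 bit, so it is set exactly when that octet is 64-127 or 192-255:

  10.1.1.200     third octet   1 = 0x01, 0x40 bit clear -> misses the bug
  10.66.140.100  third octet 140 = 0x8c, 0x40 bit clear -> misses the bug
  192.168.80.66  third octet  80 = 0x50, 0x40 bit set   -> hits the bug
  192.168.64.49  third octet  64 = 0x40, 0x40 bit set   -> hits the bug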

Comment 12 Meng Bo 2018-05-17 08:57:28 UTC
Checked on v3.10.0-0.47.0

With egressIP set, the OpenFlow rule on the egress node has the same value for the VNID (reg0) and the pkt_mark:

 cookie=0x0, duration=631.651s, table=100, n_packets=19, n_bytes=1467, priority=100,ip,reg0=0x931514 actions=set_field:0e:ca:99:5b:c3:21->eth_dst,set_field:0x931514->pkt_mark,goto_table:101

Comment 13 Meng Bo 2018-05-22 08:31:57 UTC
Additional testing on top of comment #12:

Created a lot of netnamespaces with egress IPs assigned. The rules follow this pattern: if the VNID is an even number, the pkt_mark is the same value as the VNID; if the VNID is an odd number, bit 0 is cleared and an extra high bit (0x1000000) is set instead (see the third and fifth rules below).

 cookie=0x0, duration=727.369s, table=100, n_packets=0, n_bytes=0, priority=100,ip,reg0=0xc25a54 actions=set_field:4e:5f:6b:62:db:60->eth_dst,set_field:0xc25a54->pkt_mark,goto_table:101
 cookie=0x0, duration=19.085s, table=100, n_packets=0, n_bytes=0, priority=100,ip,reg0=0x2ea40 actions=set_field:4e:5f:6b:62:db:60->eth_dst,set_field:0x2ea40->pkt_mark,goto_table:101
 cookie=0x0, duration=14.125s, table=100, n_packets=0, n_bytes=0, priority=100,ip,reg0=0x8a5d5 actions=set_field:4e:5f:6b:62:db:60->eth_dst,set_field:0x108a5d4->pkt_mark,goto_table:101
 cookie=0x0, duration=9.173s, table=100, n_packets=0, n_bytes=0, priority=100,ip,reg0=0xbb016c actions=set_field:4e:5f:6b:62:db:60->eth_dst,set_field:0xbb016c->pkt_mark,goto_table:101
 cookie=0x0, duration=4.710s, table=100, n_packets=0, n_bytes=0, priority=100,ip,reg0=0x7975e1 actions=set_field:4e:5f:6b:62:db:60->eth_dst,set_field:0x17975e0->pkt_mark,goto_table:101
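
Consistent with the rules above (inferred only from these examples, not from the code), the odd-VNID transformation amounts to the following, presumably so the mark never collides with the masquerade bit (bit 0) discussed in comment 8:

  pkt_mark = (vnid & ~0x1) | 0x1000000    # e.g. 0x8a5d5 -> 0x108a5d4, 0x7975e1 -> 0x17975e0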

Comment 15 errata-xmlrpc 2018-07-30 19:10:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816