Description of problem:
Semi-automatic, namespace-wide egress IP randomly shows up as the node IP for external connections.

Version-Release number of selected component (if applicable):
v3.7.9

How reproducible:
Intermittent. If we run the test 10 times, roughly 7 times the egress IP is the outgoing IP; the other 3 times the node IP gets picked.

Steps to Reproduce:
1. Follow the steps here to create an egress IP: https://access.redhat.com/documentation/en-us/openshift_container_platform/3.7/html-single/cluster_administration/#enabling-static-ips-for-external-project-traffic
2. From a pod in the project with the egress IP, run a simple curl test to an external site 10 times in a row (a sketch of such a test follows below).

Actual results:
A random number of the attempts use the node IP as the outgoing IP instead of the egress IP.

Expected results:
The egress IP should always be used as the outgoing IP for the project.

Additional info:
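A minimal sketch of the reproduction loop, assuming an external service that simply echoes the caller's source IP (the later comments use one at 10.1.1.2:8888); the pod name and endpoint below are placeholders:

# POD and ECHO_ENDPOINT are placeholders; the endpoint must return the source IP it sees.
POD=test-pod
ECHO_ENDPOINT=10.1.1.2:8888
for i in {1..10}; do
    # Each line should print the egress IP; any line printing the node IP is a failure.
    oc exec "$POD" -- curl -s "$ECHO_ENDPOINT"
    echo
done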
Cannot reproduce on v3.9.3

[root@ose-master ~]# oc get hostsubnet
NAME                    HOST                    HOST IP    SUBNET          EGRESS IPS
ose-node1.bmeng.local   ose-node1.bmeng.local   10.1.1.3   10.128.0.0/23   [10.1.1.200]
ose-node2.bmeng.local   ose-node2.bmeng.local   10.1.1.4   10.129.0.0/23   []

[root@ose-master ~]# oc get netnamespace u1p1
NAME      NETID    EGRESS IPS
u1p1      392490   [10.1.1.200]

$ oc get po -o wide
NAME            READY     STATUS    RESTARTS   AGE       IP            NODE
test-rc-mm99q   1/1       Running   0          1m        10.128.0.35   ose-node1.bmeng.local
test-rc-wg4h4   1/1       Running   0          1m        10.129.0.2    ose-node2.bmeng.local

$ ping -c1 ose-node1.bmeng.local
PING ose-node1.bmeng.local (10.1.1.3) 56(84) bytes of data.
64 bytes from ose-node1.bmeng.local (10.1.1.3): icmp_seq=1 ttl=64 time=0.237 ms

$ ping -c1 ose-node2.bmeng.local
PING ose-node2.bmeng.local (10.1.1.4) 56(84) bytes of data.
64 bytes from ose-node2.bmeng.local (10.1.1.4): icmp_seq=1 ttl=64 time=0.201 ms

[root@ose-node1 ~]# curl 10.1.1.2:8888
10.1.1.3
[root@ose-node2 ~]# curl 10.1.1.2:8888
10.1.1.4

On each pod:

$ for i in {1..20}; do oc exec test-rc-mm99q -- curl -s 10.1.1.2:8888 ; echo ; done
10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200
10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200

$ for i in {1..20}; do oc exec test-rc-wg4h4 -- curl -s 10.1.1.2:8888 ; echo ; done
10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200
10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200 10.1.1.200
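For reference, the 10.1.1.2:8888 endpoint above is just a small service that replies with the client's source address. A minimal stand-in, assuming nmap-ncat is available on the external host (the HTTP framing is simplified, not the actual service used here):

# Answer each HTTP request with the connecting client's source IP.
ncat --listen --keep-open 8888 --sh-exec 'printf "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\n%s\n" "$NCAT_REMOTE_ADDR"'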
[root@ose-master-37 ~]# oc get hostsubnet
NAME                       HOST                       HOST IP        SUBNET          EGRESS IPS
ose-node1-37.bmeng.local   ose-node1-37.bmeng.local   10.66.140.1    10.128.0.0/23   []
ose-node2-37.bmeng.local   ose-node2-37.bmeng.local   10.66.140.21   10.129.0.0/23   [10.66.140.100]

[user1@ose-master-37 ~]$ for i in {1..20}; do oc exec test-rc-kv6k7 -- curl -s 10.66.141.175:8888 ; echo ; done
10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100
10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100

[user1@ose-master-37 ~]$ for i in {1..20}; do oc exec test-rc-qk2fp -- curl -s 10.66.141.175:8888 ; echo ; done
10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100
10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100 10.66.140.100

Also works fine on a v3.7.9 env.
For one thing, the fact that the reporter can reproduce this in two different clusters while we cannot reproduce it in any means they are very likely doing something "weird" in those clusters, so having as much information as possible about any non-standard configuration would be helpful. In particular: are they ever touching iptables or OVS in any way, other than what OpenShift does itself by default? Have they changed the values of any of the networking-related node-config variables from the defaults?

Further debug info to gather:
1. Mark the node hosting the egress IP unschedulable and flush all existing pods from it.
2. Restart atomic-openshift-node on that node, running with --loglevel=5.
3. Give the node a minute to get fully up and running, then run "iptables-save --counters" and save the output.
4. Run the curl test for a while from another node, printing a timestamp with each attempt, so we can see the exact time of each incorrect IP (a sketch follows below).
5. Run "iptables-save --counters" again on the egress node.
6. Send us the two iptables-save outputs, the curl test log with timestamps, and the atomic-openshift-node logs from the egress node.
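A rough sketch of how steps 3-5 might be scripted; the file names, pod name, and echo endpoint are illustrative placeholders, not part of the requested procedure:

# On the egress node (steps 3 and 5): snapshot iptables state before and after the test.
iptables-save --counters > iptables-before.txt
# ... run the curl test below, then:
iptables-save --counters > iptables-after.txt

# From another node (step 4): timestamped curl loop. POD and ECHO_ENDPOINT are placeholders.
POD=test-rc-mm99q
ECHO_ENDPOINT=10.1.1.2:8888
while true; do
    printf '%s %s\n' "$(date '+%F %T')" "$(oc exec "$POD" -- curl -s "$ECHO_ENDPOINT")"
    sleep 1
done | tee curl-test.log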
Copying from the support case:

The problem is that upstream Kubernetes specifies the "KUBE-MARK-MASQ" bit in two different places: once for kubelet and once for kube-proxy. We're overriding the kube-proxy default value but not the kubelet default value. No one noticed this before because, even though kubelet occasionally tries to resynchronize the iptables rules with its own value of KUBE-MARK-MASQ, kube-proxy ends up overwriting it with its own value. But while kubelet's incorrect rule is in place, it can cause egress IPs to fail, by sending the packets through the regular masquerade rule rather than the egress IP masquerade rule.

Workaround: Add this to node-config.yaml on the egress node:

kubeletArguments:
  iptables-masquerade-bit: ["0"]

If you already have a "kubeletArguments" section, just add the iptables-masquerade-bit line as another line in that section.
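As a quick spot check (not from the support case), you can list the KUBE-MARK-MASQ chain on the egress node and look at which mark it sets. With the workaround (or the fix) in place the mark should correspond to bit 0 (0x1); kubelet's default would correspond to bit 14 (0x4000):

# Show which masquerade mark is currently programmed in the nat table.
iptables -t nat -S KUBE-MARK-MASQ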
https://github.com/openshift/origin/pull/19005
bmeng: the missing piece in being able to reproduce this before was that it only happens if the egress IP has bit 14 set (bit 14 being kubelet's default iptables-masquerade-bit); that is, if the 3rd octet of the egress IP address is between 64 and 127, or between 192 and 255. (So both of your earlier test cases, 10.1.1.200 and 10.66.140.100, missed the bug, but the reporter's examples, 192.168.80.66 and 192.168.64.49, hit it.)
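A quick way to check whether a candidate egress IP lands on that bit (a small helper written only for illustration):

# Prints whether the given IPv4 address has bit 14 (0x4000) set, i.e. whether its
# third octet has the 0x40 bit set -- the condition under which this bug can trigger.
egress_ip_hits_bit14() {
    third=$(echo "$1" | cut -d. -f3)
    if [ $(( third & 0x40 )) -ne 0 ]; then
        echo "$1: bit 14 set (affected)"
    else
        echo "$1: bit 14 clear (not affected)"
    fi
}
egress_ip_hits_bit14 10.1.1.200      # bit 14 clear
egress_ip_hits_bit14 10.66.140.100   # bit 14 clear
egress_ip_hits_bit14 192.168.80.66   # bit 14 set
egress_ip_hits_bit14 192.168.64.49   # bit 14 set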
Checked on v3.10.0-0.47.0.

With an egress IP set, the OpenFlow rule on the egress node uses the same value for the VNID (reg0) and the pkt_mark:

cookie=0x0, duration=631.651s, table=100, n_packets=19, n_bytes=1467, priority=100,ip,reg0=0x931514 actions=set_field:0e:ca:99:5b:c3:21->eth_dst,set_field:0x931514->pkt_mark,goto_table:101
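For reference, the table=100 flows shown here can be inspected on the egress node with something like the following (br0 is the default OpenShift SDN bridge):

# Dump the egress-IP flows (table 100) on the SDN bridge.
ovs-ofctl -O OpenFlow13 dump-flows br0 table=100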
Additional testing on top of comment #12: created a number of netnamespaces with egress IPs assigned. The pattern observed is: if the VNID is an even number, the pkt_mark is the same value as the VNID; if the VNID is odd, the pkt_mark gets a leading '1' added and the last hex digit reduced by one (i.e. bit 0 cleared and bit 24 set), e.g. reg0=0x8a5d5 maps to pkt_mark=0x108a5d4.

cookie=0x0, duration=727.369s, table=100, n_packets=0, n_bytes=0, priority=100,ip,reg0=0xc25a54 actions=set_field:4e:5f:6b:62:db:60->eth_dst,set_field:0xc25a54->pkt_mark,goto_table:101
cookie=0x0, duration=19.085s, table=100, n_packets=0, n_bytes=0, priority=100,ip,reg0=0x2ea40 actions=set_field:4e:5f:6b:62:db:60->eth_dst,set_field:0x2ea40->pkt_mark,goto_table:101
cookie=0x0, duration=14.125s, table=100, n_packets=0, n_bytes=0, priority=100,ip,reg0=0x8a5d5 actions=set_field:4e:5f:6b:62:db:60->eth_dst,set_field:0x108a5d4->pkt_mark,goto_table:101
cookie=0x0, duration=9.173s, table=100, n_packets=0, n_bytes=0, priority=100,ip,reg0=0xbb016c actions=set_field:4e:5f:6b:62:db:60->eth_dst,set_field:0xbb016c->pkt_mark,goto_table:101
cookie=0x0, duration=4.710s, table=100, n_packets=0, n_bytes=0, priority=100,ip,reg0=0x7975e1 actions=set_field:4e:5f:6b:62:db:60->eth_dst,set_field:0x17975e0->pkt_mark,goto_table:101
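The observed mapping can be written as a small helper, given here only to illustrate the pattern seen in the flows above (the bit-24 constant 0x1000000 is inferred from those examples, not taken from the code):

# Compute the pkt_mark the SDN appears to assign for a given VNID, based on the flows above:
# even VNIDs are used as-is; odd VNIDs get bit 0 cleared and bit 24 set.
vnid_to_pkt_mark() {
    vnid=$(( $1 ))
    if [ $(( vnid & 1 )) -eq 0 ]; then
        printf '0x%x\n' "$vnid"
    else
        printf '0x%x\n' $(( (vnid & ~1) | 0x1000000 ))
    fi
}
vnid_to_pkt_mark 0xc25a54   # -> 0xc25a54 (even, unchanged)
vnid_to_pkt_mark 0x8a5d5    # -> 0x108a5d4
vnid_to_pkt_mark 0x7975e1   # -> 0x17975e0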
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816