Bug 1552869
Summary: | Semi-automatic namespace-wide egress IP randomly shows up as node IP | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Taneem Ibrahim <tibrahim>
Component: | Networking | Assignee: | Dan Winship <danw>
Status: | CLOSED ERRATA | QA Contact: | Meng Bo <bmeng>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 3.7.0 | CC: | aos-bugs, bbennett, danw, eparis, tibrahim
Target Milestone: | --- | |
Target Release: | 3.10.0 | |
Hardware: | Unspecified | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
Cause: The "kube-proxy" and "kubelet" parts of the OpenShift node process were being given different default values for the config options describing how to interact with iptables.
Consequence: OpenShift would periodically add a bogus iptables rule that would cause *some* per-project static egress IPs to not be used for some length of time, until the bogus rule was removed again. (While the bogus rule was present, traffic from those projects would use the node IP address of the node hosting the egress IP, rather than the egress IP itself.)
Fix: The inconsistent configuration was resolved, causing the bogus iptables rule to no longer be added.
Result: Projects consistently use their static egress IPs.
|
Story Points: | --- | |
Clone Of: | | |
: | 1560584, 1560586, 1560587 (view as bug list) | Environment: |
Last Closed: | 2018-07-30 19:10:40 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1560584, 1560586, 1560587 | |
Description
Taneem Ibrahim
2018-03-07 21:13:20 UTC
Cannot reproduce on v3.9.3:

    [root@ose-master ~]# oc get hostsubnet
    NAME                    HOST                    HOST IP    SUBNET          EGRESS IPS
    ose-node1.bmeng.local   ose-node1.bmeng.local   10.1.1.3   10.128.0.0/23   [10.1.1.200]
    ose-node2.bmeng.local   ose-node2.bmeng.local   10.1.1.4   10.129.0.0/23   []

    [root@ose-master ~]# oc get netnamespace u1p1
    NAME   NETID    EGRESS IPS
    u1p1   392490   [10.1.1.200]

    $ oc get po -o wide
    NAME            READY   STATUS    RESTARTS   AGE   IP            NODE
    test-rc-mm99q   1/1     Running   0          1m    10.128.0.35   ose-node1.bmeng.local
    test-rc-wg4h4   1/1     Running   0          1m    10.129.0.2    ose-node2.bmeng.local

    $ ping -c1 ose-node1.bmeng.local
    PING ose-node1.bmeng.local (10.1.1.3) 56(84) bytes of data.
    64 bytes from ose-node1.bmeng.local (10.1.1.3): icmp_seq=1 ttl=64 time=0.237 ms

    $ ping -c1 ose-node2.bmeng.local
    PING ose-node2.bmeng.local (10.1.1.4) 56(84) bytes of data.
    64 bytes from ose-node2.bmeng.local (10.1.1.4): icmp_seq=1 ttl=64 time=0.201 ms

    [root@ose-node1 ~]# curl 10.1.1.2:8888
    10.1.1.3
    [root@ose-node2 ~]# curl 10.1.1.2:8888
    10.1.1.4

On each pod:

    $ for i in {1..20}; do oc exec test-rc-mm99q -- curl -s 10.1.1.2:8888 ; echo ; done
    10.1.1.200
    10.1.1.200
    (... all 20 attempts returned 10.1.1.200)

    $ for i in {1..20}; do oc exec test-rc-wg4h4 -- curl -s 10.1.1.2:8888 ; echo ; done
    10.1.1.200
    10.1.1.200
    (... all 20 attempts returned 10.1.1.200)

    [root@ose-master-37 ~]# oc get hostsubnet
    NAME                       HOST                       HOST IP        SUBNET          EGRESS IPS
    ose-node1-37.bmeng.local   ose-node1-37.bmeng.local   10.66.140.1    10.128.0.0/23   []
    ose-node2-37.bmeng.local   ose-node2-37.bmeng.local   10.66.140.21   10.129.0.0/23   [10.66.140.100]

    [user1@ose-master-37 ~]$ for i in {1..20}; do oc exec test-rc-kv6k7 -- curl -s 10.66.141.175:8888 ; echo ; done
    10.66.140.100
    10.66.140.100
    (... all 20 attempts returned 10.66.140.100)

    [user1@ose-master-37 ~]$ for i in {1..20}; do oc exec test-rc-qk2fp -- curl -s 10.66.141.175:8888 ; echo ; done
    10.66.140.100
    10.66.140.100
    (... all 20 attempts returned 10.66.140.100)

Also works fine on a v3.7.9 env.

So for one, the fact that they can reproduce this in two different clusters while we can't reproduce it in any means that they are clearly doing something "weird" in those clusters, so having as much information as possible about any non-standard things they are doing would be helpful. In particular: are they touching iptables or OVS, ever, in any way, other than what OpenShift does itself by default? Have they changed the values of any of the networking-related node-config variables from the defaults?

Further debug info to collect:

1. Mark the node hosting the egress IP unschedulable and flush all existing pods from it.
2. Restart atomic-openshift-node on that node, running with --loglevel=5.
3. Give the node a minute to get fully up and running, then run "iptables-save --counters" and save the output.
4. Run the curl test for a while from another node, printing a timestamp with each attempt, so we can see the exact time of each incorrect IP (a sketch of such a loop follows this list).
5. Run "iptables-save --counters" again on the egress node.
6. Send us the two iptables-save outputs, the curl test log with timestamps, and the atomic-openshift-node logs from the egress node.
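A minimal sketch of the timestamped curl loop requested in step 4, reusing the pod name and echo-server address from the reproduction transcripts above (test-rc-mm99q and 10.1.1.2:8888 are placeholders; substitute your own):

    # Hypothetical helper, not from the bug report: run the curl test with a
    # UTC timestamp per attempt so incorrect IPs can be matched against logs.
    for i in {1..200}; do
        echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') $(oc exec test-rc-mm99q -- curl -s 10.1.1.2:8888)"
        sleep 5
    done | tee curl-test.log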
Copying from the support case:

The problem is that upstream Kubernetes specifies the "KUBE-MARK-MASQ" bit in two different places: once for kubelet and once for kube-proxy. We override the kube-proxy default value but not the kubelet default value. No one noticed this before because, even though kubelet occasionally tries to resynchronize the iptables rules with its own value of KUBE-MARK-MASQ, kube-proxy ends up overwriting it with its own value. But while kubelet's incorrect rule is present, it can cause egress IPs to fail, by sending the packets through the regular masquerade rule rather than the egress IP masquerade rule.

Workaround: add this to your node-config.yaml on the egress node:

    kubeletArguments:
      iptables-masquerade-bit: ["0"]

If you already have a "kubeletArguments" section, just add the iptables-masquerade-bit line as another entry in that section.

bmeng: the missing piece in being able to reproduce this before was that it only happens if the egress IP has bit 14 set; that is, if the 3rd octet of the egress IP address is between 64 and 127, or between 192 and 255. (So both of your earlier test cases, 10.1.1.200 and 10.66.140.100, missed the bug, while the reporter's examples, 192.168.80.66 and 192.168.64.49, hit it.)

Checked on v3.10.0-0.47.0. With an egress IP set, the OpenFlow rule on the egress node has the same value for the VNID and the pkt_mark:

    cookie=0x0, duration=631.651s, table=100, n_packets=19, n_bytes=1467, priority=100,ip,reg0=0x931514 actions=set_field:0e:ca:99:5b:c3:21->eth_dst,set_field:0x931514->pkt_mark,goto_table:101

Additional testing for comment #12: created a number of netnamespaces with egress IPs assigned. The pattern is: if the VNID is an even number, the pkt_mark is the same value as the VNID; if the VNID is odd, the pkt_mark has the lowest bit cleared and bit 24 set (e.g. 0x8a5d5 becomes 0x108a5d4). A sketch of this arithmetic follows at the end of this report.

    cookie=0x0, duration=727.369s, table=100, n_packets=0, n_bytes=0, priority=100,ip,reg0=0xc25a54 actions=set_field:4e:5f:6b:62:db:60->eth_dst,set_field:0xc25a54->pkt_mark,goto_table:101
    cookie=0x0, duration=19.085s, table=100, n_packets=0, n_bytes=0, priority=100,ip,reg0=0x2ea40 actions=set_field:4e:5f:6b:62:db:60->eth_dst,set_field:0x2ea40->pkt_mark,goto_table:101
    cookie=0x0, duration=14.125s, table=100, n_packets=0, n_bytes=0, priority=100,ip,reg0=0x8a5d5 actions=set_field:4e:5f:6b:62:db:60->eth_dst,set_field:0x108a5d4->pkt_mark,goto_table:101
    cookie=0x0, duration=9.173s, table=100, n_packets=0, n_bytes=0, priority=100,ip,reg0=0xbb016c actions=set_field:4e:5f:6b:62:db:60->eth_dst,set_field:0xbb016c->pkt_mark,goto_table:101
    cookie=0x0, duration=4.710s, table=100, n_packets=0, n_bytes=0, priority=100,ip,reg0=0x7975e1 actions=set_field:4e:5f:6b:62:db:60->eth_dst,set_field:0x17975e0->pkt_mark,goto_table:101

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816
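For reference, the even/odd pkt_mark pattern in the flows above is consistent with the masquerade mark having been moved to bit 0: a VNID that would collide with the masquerade bit gets that bit cleared and bit 24 set instead. The following bash sketch illustrates that arithmetic and the bit-14 trigger condition; it is an illustration only, not the actual openshift-sdn code, and the helper name mark_for_vnid is hypothetical:

    # Illustration only: derive a pkt_mark that cannot collide with the
    # masquerade mark (assumed to be bit 0, per the workaround above).
    masq_bit=0
    mark_for_vnid() {
        local vnid=$1
        if (( vnid & (1 << masq_bit) )); then
            # VNID overlaps the masquerade bit: clear it and set bit 24 instead
            printf '0x%x\n' $(( (vnid & ~(1 << masq_bit)) | 0x1000000 ))
        else
            printf '0x%x\n' "$vnid"
        fi
    }
    mark_for_vnid "$((0x8a5d5))"     # odd VNID  -> 0x108a5d4, as in the flows above
    mark_for_vnid "$((0xbb016c))"    # even VNID -> 0xbb016c, unchanged

    # The original trigger condition: with kubelet's old default masquerade bit
    # of 14, an egress IP whose 32-bit value has bit 14 set (i.e. third octet
    # with 0x40 set, so in 64-127 or 192-255) collided with KUBE-MARK-MASQ.
    third_octet=80    # from the reporter's example 192.168.80.66
    (( third_octet & 0x40 )) && echo "bit 14 set: this egress IP hits the bug"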