Description of problem:

A cluster configuration change has gone wrong, leaving OpenFlow rules inconsistent. More specifically, OpenFlow rules were deleted inconsistently on the node hosting the endpoint when a GCE preemptible VM node was destroyed, so the corresponding rules could not be added back correctly when the preemptible VM reappeared.

Version-Release number of selected component (if applicable):

oc v3.2.1.17
kubernetes v1.2.0-36-g4a3f9c5

How reproducible:

Service and its endpoint:

``` shell
[root@ose3-rbox-master ~]# oc describe service kafka-service
Name:             kafka-service
Namespace:        default
Labels:           internal.acs.amadeus.com/component=kafka-zookeeper,internal.acs.amadeus.com/env-name=rbox_kafka_zookeeper,internal.acs.amadeus.com/env-type=DML,internal.acs.amadeus.com/env-version=1.67.8-4-g34683df,internal.acs.amadeus.com/version=4.0.73
Selector:         name=kafka
Type:             NodePort
IP:               172.30.251.81
Port:             <unset>  9092/TCP
NodePort:         <unset>  31005/TCP
Endpoints:        10.128.0.5:9092
Session Affinity: None
No events.
```

On a node that works:

``` shell
[root@ose3-rbox-nd-p-w089 ~]# python
Python 2.7.5 (default, Aug 29 2016, 10:12:21)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> #ip='172.30.251.81'
... ip='10.128.0.5'
>>> port= 9092
>>> #port=31005
... s.connect((ip,port))
>>> s.send('hello')
5
>>> d = s.recv(2048)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
socket.error: [Errno 104] Connection reset by peer
>>> s.close()
```

The connection reset at this point is fine: I'm not sending the data the server expects, so it closes the connection.

On a broken node, try to connect directly to the endpoint:

``` shell
[root@ose3-rbox-nd-p-d0dp ~]# python
Python 2.7.5 (default, Aug 29 2016, 10:12:21)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> ip='10.128.0.5'
>>> port= 9092
>>> s.connect((ip,port))
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 113] No route to host
```

So we fail to connect: "no route to host".
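As an aside, the same two outcomes can be checked without a Python session. A minimal sketch, assuming nmap-ncat is installed on the nodes (it is not part of the reproduction above):

``` shell
# Minimal sketch of the same probe, assuming nmap-ncat is installed.
# On a working node the TCP connect succeeds (the broker resets only once
# we send data it doesn't expect); on a broken node it fails with the same
# "No route to host" error once ARP resolution times out.
nc -v -w 3 10.128.0.5 9092
```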
Following the smoking gun, here's the node hosting the pod with endpoint 10.128.0.5:

``` shell
[root@ose3-rbox-master ~]# oc get pods --selector=name=kafka -owide
NAME                                             READY     STATUS    RESTARTS   AGE       NODE
kafka-zookeeper-replication-controller-1-shjsu   2/2       Running   3          70d       ose3-rbox-i-0db1.node.nh-gcus-acs1-gce-eu-west-1.acs
[root@ose3-rbox-master ~]# oc get pods --selector=name=kafka --template='{{range .items}}HostIP: {{.status.hostIP}} PodIP: {{.status.podIP}}{{end}}{{"\n"}}'
HostIP: 10.44.77.42 PodIP: 10.128.0.5
[root@ose3-rbox-master ~]# oc get hostsubnet
NAME                                                      HOST                                                      HOST IP       SUBNET
ose3-rbox-i-0db1.node.nh-gcus-acs1-gce-eu-west-1.acs      ose3-rbox-i-0db1.node.nh-gcus-acs1-gce-eu-west-1.acs      10.44.77.42   10.128.0.0/23
ose3-rbox-i-jc8q.node.nh-gcus-acs1-gce-eu-west-1.acs      ose3-rbox-i-jc8q.node.nh-gcus-acs1-gce-eu-west-1.acs      10.44.77.45   10.129.0.0/23
ose3-rbox-master.node.nh-gcus-acs1-gce-eu-west-1.acs      ose3-rbox-master.node.nh-gcus-acs1-gce-eu-west-1.acs      10.44.77.29   10.130.0.0/23
ose3-rbox-nd-p-3nnp.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-3nnp.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.3    10.131.0.0/23
ose3-rbox-nd-p-b9v5.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-b9v5.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.12   10.129.4.0/23
ose3-rbox-nd-p-czk3.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-czk3.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.11   10.129.2.0/23
ose3-rbox-nd-p-d0dp.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-d0dp.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.3    10.128.2.0/23
ose3-rbox-nd-p-gl03.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-gl03.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.10   10.130.2.0/23
ose3-rbox-nd-p-jpbw.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-jpbw.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.9    10.130.6.0/23
ose3-rbox-nd-p-n7cz.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-n7cz.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.7    10.128.8.0/23
ose3-rbox-nd-p-nw70.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-nw70.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.2    10.131.6.0/23
ose3-rbox-nd-p-pfkw.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-pfkw.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.8    10.131.2.0/23
ose3-rbox-nd-p-qjfv.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-qjfv.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.6    10.131.4.0/23
ose3-rbox-nd-p-w089.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-w089.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.4    10.128.6.0/23
ose3-rbox-nd-p-w5qf.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-w5qf.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.10   10.129.6.0/23
ose3-rbox-nd-t481.node.nh-gcus-acs1-gce-eu-west-1.acs     ose3-rbox-nd-t481.node.nh-gcus-acs1-gce-eu-west-1.acs     10.44.77.50   10.130.4.0/23
ose3-rbox-sp-n9tv.node.nh-gcus-acs1-gce-eu-west-1.acs     ose3-rbox-sp-n9tv.node.nh-gcus-acs1-gce-eu-west-1.acs     10.44.77.47   10.128.4.0/23
```
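For matching the pod against the hostsubnet rows in a single step, the template query above can also print the scheduled node. A hedged variant (the `.spec.nodeName` field is my addition, not part of the original output):

``` shell
# Hedged variant of the --template query above: also print the node name
# via .spec.nodeName, so the pod maps straight onto `oc get hostsubnet`.
oc get pods --selector=name=kafka \
    --template='{{range .items}}Node: {{.spec.nodeName}} HostIP: {{.status.hostIP}} PodIP: {{.status.podIP}}{{"\n"}}{{end}}'
```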
A few points the SDN troubleshooting document recommends checking:

``` shell
[root@ose3-rbox-nd-p-d0dp ~]# ping 10.44.77.42
PING 10.44.77.42 (10.44.77.42) 56(84) bytes of data.
64 bytes from 10.44.77.42: icmp_seq=1 ttl=64 time=1.34 ms
[root@ose3-rbox-nd-p-d0dp ~]# sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 1
[root@ose3-rbox-nd-p-d0dp ~]# systemctl status firewalld
● firewalld.service
   Loaded: not-found (Reason: No such file or directory)
   Active: inactive (dead)
[root@ose3-rbox-nd-p-d0dp ~]# ip route
default via 10.44.77.1 dev eth0 proto static metric 100
10.44.77.1 dev eth0 scope link
10.44.77.1 dev eth0 proto dhcp scope link metric 100
10.128.0.0/14 dev tun0 proto kernel scope link
169.254.169.254 via 10.44.77.1 dev eth0 proto dhcp metric 100
172.30.0.0/16 dev tun0 scope link
```

Trouble starts here:

``` shell
[root@ose3-rbox-i-0db1 ~]# tcpdump -i eth0 -nn | grep -A 2 VXLAN | grep -A 2 10.44.77.3 | grep 10.128.0.5
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
IP 10.128.0.5.9092 > 10.130.0.1.37006: Flags [P.], seq 1:141, ack 33, win 13580, options [nop,nop,TS val 804239922 ecr 804246130], length 140
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
IP 10.130.0.1.39419 > 10.128.0.5.9092: Flags [.], seq 7353:8711, ack 1, win 1369, options [nop,nop,TS val 42014197 ecr 804242420], length 1358
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
IP 10.130.0.1.53750 > 10.128.0.5.9092: Flags [.], ack 153505, win 32089, options [nop,nop,TS val 804251131 ecr 804244921], length 0
IP 10.130.0.1.53750 > 10.128.0.5.9092: Flags [.], ack 180665, win 32089, options [nop,nop,TS val 804251133 ecr 804244923], length 0
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
IP 10.130.0.1.41416 > 10.128.0.5.9092: Flags [P.], seq 25480:25692, ack 90377, win 1720, options [nop,nop,TS val 804261482 ecr 804255270], length 212
IP 10.128.0.5.9092 > 10.130.0.1.33676: Flags [.], ack 42751, win 48832, options [nop,nop,TS val 804255273 ecr 30982550], length 0
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
```

This shows that some nodes send ARP requests for 10.128.0.5 and get a response (10.130.0.1 = ose3-rbox-master, whose TCP traffic flows in both directions), and others do not (10.128.2.1 = ose3-rbox-nd-p-d0dp, whose ARP requests go unanswered).
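Since the pod-to-pod traffic rides the VXLAN overlay between nodes, the tunnel traffic can also be captured directly. A sketch, assuming openshift-sdn uses the standard VXLAN UDP port 4789 (the host filter is my addition, to narrow the capture to the suspect node):

``` shell
# Sketch: capture only VXLAN tunnel traffic exchanged with the suspect node.
# Assumes the default VXLAN UDP port 4789.
tcpdump -i eth0 -nn 'udp port 4789 and host 10.44.77.3'
```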
Let's take a look at the OpenFlow rules on the infra node:

``` text
[root@ose3-rbox-i-0db1 ~]# ovs-ofctl -O OpenFlow13 dump-flows br0 | less
OFPST_FLOW reply (OF1.3) (xid=0x2):
 cookie=0x0, duration=5097779.115s, table=0, n_packets=21504710, n_bytes=903197820, priority=200,arp,in_port=1,arp_spa=10.128.0.0/14,arp_tpa=10.128.0.0/23 actions=move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[],goto_table:1

==> The ARP rule above sends us to table 1

 cookie=0x0, duration=5097779.112s, table=0, n_packets=16006380614, n_bytes=9046678831184, priority=200,ip,in_port=1,nw_src=10.128.0.0/14,nw_dst=10.128.0.0/23 actions=move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[],goto_table:1
 cookie=0x0, duration=5097779.107s, table=0, n_packets=1599648, n_bytes=67185216, priority=200,arp,in_port=2,arp_spa=10.128.0.1,arp_tpa=10.128.0.0/14 actions=goto_table:5
 cookie=0x0, duration=5097779.104s, table=0, n_packets=14396604107, n_bytes=9358948488855, priority=200,ip,in_port=2 actions=goto_table:5
 cookie=0x0, duration=5097779.099s, table=0, n_packets=1, n_bytes=42, priority=200,arp,in_port=3,arp_spa=10.128.0.0/23 actions=goto_table:5
 cookie=0x0, duration=5097779.097s, table=0, n_packets=0, n_bytes=0, priority=200,ip,in_port=3,nw_src=10.128.0.0/23 actions=goto_table:5
 cookie=0x0, duration=5097779.109s, table=0, n_packets=0, n_bytes=0, priority=150,in_port=1 actions=drop
 cookie=0x0, duration=5097779.102s, table=0, n_packets=19896, n_bytes=835632, priority=150,in_port=2 actions=drop
 cookie=0x0, duration=5097779.095s, table=0, n_packets=20, n_bytes=1788, priority=150,in_port=3 actions=drop
 cookie=0x0, duration=5097779.091s, table=0, n_packets=13107650, n_bytes=550521300, priority=100,arp actions=goto_table:2
 cookie=0x0, duration=5097779.088s, table=0, n_packets=18623864804, n_bytes=24517021768011, priority=100,ip actions=goto_table:2
 cookie=0x0, duration=5097779.086s, table=0, n_packets=106, n_bytes=8412, priority=0 actions=drop

==> Here is table 1:

 cookie=0x0, duration=1514842.089s, table=1, n_packets=5654619263, n_bytes=1364346433795, priority=100,tun_src=10.44.77.45 actions=goto_table:5
 cookie=0x0, duration=1514842.076s, table=1, n_packets=3154109, n_bytes=1829431635, priority=100,tun_src=10.44.77.11 actions=goto_table:5
 cookie=0x0, duration=1514842.051s, table=1, n_packets=290540380, n_bytes=204795887605, priority=100,tun_src=10.44.77.10 actions=goto_table:5
 cookie=0x0, duration=1514842.038s, table=1, n_packets=88937675, n_bytes=62143424775, priority=100,tun_src=10.44.77.8 actions=goto_table:5
 cookie=0x0, duration=1514842.025s, table=1, n_packets=144214921, n_bytes=20184914857, priority=100,tun_src=10.44.77.47 actions=goto_table:5
 cookie=0x0, duration=1514842.013s, table=1, n_packets=3161116, n_bytes=1836648002, priority=100,tun_src=10.44.77.12 actions=goto_table:5
 cookie=0x0, duration=1514841.999s, table=1, n_packets=318005732, n_bytes=220695552970, priority=100,tun_src=10.44.77.50 actions=goto_table:5
 cookie=0x0, duration=1514841.985s, table=1, n_packets=286339191, n_bytes=197815708131, priority=100,tun_src=10.44.77.6 actions=goto_table:5
 cookie=0x0, duration=1514841.971s, table=1, n_packets=85568819, n_bytes=58118122099, priority=100,tun_src=10.44.77.9 actions=goto_table:5
 cookie=0x0, duration=1514841.958s, table=1, n_packets=7932116478, n_bytes=6098819928021, priority=100,tun_src=10.44.77.29 actions=goto_table:5
 cookie=0x0, duration=914202.872s, table=1, n_packets=56349706, n_bytes=40203119777, priority=100,tun_src=10.44.77.2 actions=goto_table:5
 cookie=0x0, duration=868611.664s, table=1, n_packets=53567650, n_bytes=38301174872, priority=100,tun_src=10.44.77.4 actions=goto_table:5
 cookie=0x0, duration=5097779.083s, table=1, n_packets=18442562, n_bytes=1099251485, priority=0 actions=drop

==> Table 1 above has no entry for source node 10.44.77.3, so we hit the drop rule and no ARP response is sent back

...

 cookie=0x0, duration=1516181.377s, table=8, n_packets=16565, n_bytes=695730, priority=100,arp,arp_tpa=10.131.0.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.44.77.3->tun_dst,output:1
 cookie=0x0, duration=1516181.372s, table=8, n_packets=3394265, n_bytes=4223544126, priority=100,ip,nw_dst=10.131.0.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.44.77.3->tun_dst,output:1

==> Yet the table 8 rules are present...

...
```
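A quick way to spot this class of inconsistency is to list the tun_src entries that table 1 accepts and compare them against the cluster's host IPs (a sketch using only commands already shown above; the grep pattern is my addition):

``` shell
# Sketch: list the tun_src addresses accepted by table 1 on this node and
# compare them by eye against the HOST IP column of `oc get hostsubnet`.
# Every *other* node's host IP should appear exactly once.
ovs-ofctl -O OpenFlow13 dump-flows br0 table=1 | grep -o 'tun_src=[0-9.]*' | sort
```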
Let's look at the openshift-sdn code (note: this isn't the code for the precise version of OpenShift we're running; I'm off by a patch level, IIRC). From atomic-openshift-git-0.d84be7f_3.2.1.15-1.git.0.d84be7f.el7/Godeps/_workspace/src/github.com/openshift/openshift-sdn/plugins/osdn/ovs/controller.go:

``` go
func (plugin *ovsPlugin) AddHostSubnetRules(subnet *osapi.HostSubnet) {
	glog.Infof("AddHostSubnetRules for %s", osdn.HostSubnetToString(subnet))

	otx := ovs.NewTransaction(BR)
	otx.AddFlow("table=1, priority=100, tun_src=%s, actions=goto_table:5", subnet.HostIP)
	otx.AddFlow("table=8, priority=100, arp, nw_dst=%s, actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:%s->tun_dst,output:1", subnet.Subnet, subnet.HostIP)
	otx.AddFlow("table=8, priority=100, ip, nw_dst=%s, actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:%s->tun_dst,output:1", subnet.Subnet, subnet.HostIP)

	err := otx.EndTransaction()
	if err != nil {
		glog.Errorf("Error adding OVS flows: %v", err)
	}
}
```

So we see that the rules are added together (and, if we also check DeleteHostSubnetRules(), removed together). Looking at AddFlow():

``` go
// AddFlow adds a flow to the bridge. The arguments are passed to fmt.Sprintf().
func (tx *Transaction) AddFlow(flow string, args ...interface{}) {
	if len(args) > 0 {
		flow = fmt.Sprintf(flow, args...)
	}
	tx.ofctlExec("add-flow", tx.bridge, flow)
}
```

Nothing suspicious. Looking at how the commands are then executed, however:

``` go
func (tx *Transaction) ofctlExec(args ...string) (string, error) {
	args = append([]string{"-O", "OpenFlow13"}, args...)
	return tx.exec("ovs-ofctl", args...)
}
```

and

``` go
func (tx *Transaction) exec(cmd string, args ...string) (string, error) {
	if tx.err != nil {
		return "", tx.err
	}

	cmdpath, err := exec.LookPath(cmd)
	if err != nil {
		tx.err = fmt.Errorf("OVS is not installed")
		return "", tx.err
	}

	var output string
	output, tx.err = exec.Exec(cmdpath, args...)
	return output, tx.err
}
```

The problem becomes a little clearer: the semantics of the Transaction object are that it executes commands until one fails, and the rest are then silently skipped. In our case, we expect to have:

* 1 table 1 rule
* 2 table 8 rules

And we actually have:

* 0 table 1 rules
* 2 table 8 rules

So the problem was on deletion. Either the table 1 rule was deleted but its delete returned an error, causing the two table 8 deletion commands to be ignored; or the table 1 delete worked and returned no error, but the first table 8 delete command failed, leading to the second table 8 delete being ignored.

The problem is that the situation can't be recovered without one of:

* rebooting the infra node that has the corrupt rules
* adding the missing table 1 rule back manually (see the sketch below)
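For the manual recovery path, the missing flow can be reconstructed from the AddHostSubnetRules() format string above. A sketch, not taken from the case (10.44.77.3 is the HostIP whose table 1 entry is missing on this infra node):

``` shell
# Sketch: re-add the missing table 1 rule by hand, using the exact flow
# format from AddHostSubnetRules() above. Run on the node with the corrupt
# rules, substituting the HostIP whose entry is missing.
ovs-ofctl -O OpenFlow13 add-flow br0 "table=1, priority=100, tun_src=10.44.77.3, actions=goto_table:5"
```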
Additional info:

Related case: 01875046
> So the problem was on deletion. Either the table 1 rule was deleted but its
> delete returned an error, causing the two table 8 deletion commands to be
> ignored; or the table 1 delete worked and returned no error, but the first
> table 8 delete command failed, leading to the second table 8 delete being
> ignored.

This can't really happen: add-flow and delete-flows can, for the most part, only fail if the flow is syntactically invalid in some way (in which case *everyone* would see the bug). In any case, if something were going wrong with our manipulation of the OVS flows, there ought to be errors logged about it. Can you get a copy of the atomic-openshift-node logs from one of the nodes that is experiencing problems? (If the logs are huge, then grep for just "controller.go" to start with.)
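For reference, something like the following should pull just the relevant lines (a sketch; it assumes the node service logs to the systemd journal under the atomic-openshift-node unit, per the packaging referenced above):

``` shell
# Sketch: extract only the openshift-sdn flow-manipulation messages from
# the node logs. Assumes the atomic-openshift-node unit logs to the journal.
journalctl -u atomic-openshift-node --no-pager | grep 'controller.go'
```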
We've looked over the code again, and although the flow changes are not made in an explicit transaction, the sequence should never fail, since the commands do not error out as long as the flows are syntactically correct. If a rule already exists, the add is simply a no-op. Given that this has happened only the one time, and given that we are moving to OVN rather than programming OVS directly, I'd rather not take the risk of changing the code now to attempt to address this.