Description of problem:

A cluster configuration change has gone wrong, leaving OpenFlow rules inconsistent. More specifically, OpenFlow rules were deleted inconsistently on the node hosting the endpoint when a GCE preemptible VM node was destroyed, so the corresponding rules could not be added back correctly when the preemptible VM reappeared.

Version-Release number of selected component (if applicable):

oc v3.2.1.17
kubernetes v1.2.0-36-g4a3f9c5

How reproducible:

Service and its endpoint:

``` shell
[root@ose3-rbox-master ~]# oc describe service kafka-service
Name:             kafka-service
Namespace:        default
Labels:           internal.acs.amadeus.com/component=kafka-zookeeper,internal.acs.amadeus.com/env-name=rbox_kafka_zookeeper,internal.acs.amadeus.com/env-type=DML,internal.acs.amadeus.com/env-version=1.67.8-4-g34683df,internal.acs.amadeus.com/version=4.0.73
Selector:         name=kafka
Type:             NodePort
IP:               172.30.251.81
Port:             <unset>  9092/TCP
NodePort:         <unset>  31005/TCP
Endpoints:        10.128.0.5:9092
Session Affinity: None
No events.
```

On a node that works:

``` shell
[root@ose3-rbox-nd-p-w089 ~]# python
Python 2.7.5 (default, Aug 29 2016, 10:12:21)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> #ip='172.30.251.81'
... ip='10.128.0.5'
>>> port= 9092
>>> #port=31005
... s.connect((ip,port))
>>> s.send('hello')
5
>>> d = s.recv(2048)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
socket.error: [Errno 104] Connection reset by peer
>>> s.close()
```

The connection reset at this point is fine: I'm not sending the data the server expects, so it closes the connection.

On a broken node, try to connect directly to the endpoint:

``` shell
[root@ose3-rbox-nd-p-d0dp ~]# python
Python 2.7.5 (default, Aug 29 2016, 10:12:21)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> ip='10.128.0.5'
>>> port= 9092
>>> s.connect((ip,port))
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 113] No route to host
```

So we fail to connect: "no route to host".
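As an aside, the same two outcomes can be checked without a Python session. A minimal sketch, assuming nmap-ncat is installed on the nodes (it is not part of the reproduction above):

``` shell
# Minimal sketch of the same probe, assuming nmap-ncat is installed.
# On a working node the TCP connect succeeds (the broker resets only once
# we send data it doesn't expect); on a broken node it fails with the same
# "No route to host" error once ARP resolution times out.
nc -v -w 3 10.128.0.5 9092
```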
Following the smoking gun, here's the node hosting the pod with endpoint 10.128.0.5:

``` shell
[root@ose3-rbox-master ~]# oc get pods --selector=name=kafka -owide
NAME                                             READY     STATUS    RESTARTS   AGE       NODE
kafka-zookeeper-replication-controller-1-shjsu   2/2       Running   3          70d       ose3-rbox-i-0db1.node.nh-gcus-acs1-gce-eu-west-1.acs
[root@ose3-rbox-master ~]# oc get pods --selector=name=kafka --template='{{range .items}}HostIP: {{.status.hostIP}} PodIP: {{.status.podIP}}{{end}}{{"\n"}}'
HostIP: 10.44.77.42 PodIP: 10.128.0.5
[root@ose3-rbox-master ~]# oc get hostsubnet
NAME                                                      HOST                                                      HOST IP       SUBNET
ose3-rbox-i-0db1.node.nh-gcus-acs1-gce-eu-west-1.acs      ose3-rbox-i-0db1.node.nh-gcus-acs1-gce-eu-west-1.acs      10.44.77.42   10.128.0.0/23
ose3-rbox-i-jc8q.node.nh-gcus-acs1-gce-eu-west-1.acs      ose3-rbox-i-jc8q.node.nh-gcus-acs1-gce-eu-west-1.acs      10.44.77.45   10.129.0.0/23
ose3-rbox-master.node.nh-gcus-acs1-gce-eu-west-1.acs      ose3-rbox-master.node.nh-gcus-acs1-gce-eu-west-1.acs      10.44.77.29   10.130.0.0/23
ose3-rbox-nd-p-3nnp.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-3nnp.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.3    10.131.0.0/23
ose3-rbox-nd-p-b9v5.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-b9v5.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.12   10.129.4.0/23
ose3-rbox-nd-p-czk3.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-czk3.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.11   10.129.2.0/23
ose3-rbox-nd-p-d0dp.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-d0dp.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.3    10.128.2.0/23
ose3-rbox-nd-p-gl03.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-gl03.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.10   10.130.2.0/23
ose3-rbox-nd-p-jpbw.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-jpbw.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.9    10.130.6.0/23
ose3-rbox-nd-p-n7cz.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-n7cz.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.7    10.128.8.0/23
ose3-rbox-nd-p-nw70.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-nw70.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.2    10.131.6.0/23
ose3-rbox-nd-p-pfkw.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-pfkw.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.8    10.131.2.0/23
ose3-rbox-nd-p-qjfv.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-qjfv.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.6    10.131.4.0/23
ose3-rbox-nd-p-w089.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-w089.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.4    10.128.6.0/23
ose3-rbox-nd-p-w5qf.node.nh-gcus-acs1-gce-eu-west-1.acs   ose3-rbox-nd-p-w5qf.node.nh-gcus-acs1-gce-eu-west-1.acs   10.44.77.10   10.129.6.0/23
ose3-rbox-nd-t481.node.nh-gcus-acs1-gce-eu-west-1.acs     ose3-rbox-nd-t481.node.nh-gcus-acs1-gce-eu-west-1.acs     10.44.77.50   10.130.4.0/23
ose3-rbox-sp-n9tv.node.nh-gcus-acs1-gce-eu-west-1.acs     ose3-rbox-sp-n9tv.node.nh-gcus-acs1-gce-eu-west-1.acs     10.44.77.47   10.128.4.0/23
```
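For matching the pod against the hostsubnet rows in a single step, the template query above can also print the scheduled node. A hedged variant (the `.spec.nodeName` field is my addition, not part of the original output):

``` shell
# Hedged variant of the --template query above: also print the node name
# via .spec.nodeName, so the pod maps straight onto `oc get hostsubnet`.
oc get pods --selector=name=kafka \
    --template='{{range .items}}Node: {{.spec.nodeName}} HostIP: {{.status.hostIP}} PodIP: {{.status.podIP}}{{"\n"}}{{end}}'
```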
A few points the SDN troubleshooting document recommends checking:

``` shell
[root@ose3-rbox-nd-p-d0dp ~]# ping 10.44.77.42
PING 10.44.77.42 (10.44.77.42) 56(84) bytes of data.
64 bytes from 10.44.77.42: icmp_seq=1 ttl=64 time=1.34 ms
[root@ose3-rbox-nd-p-d0dp ~]# sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 1
[root@ose3-rbox-nd-p-d0dp ~]# systemctl status firewalld
● firewalld.service
   Loaded: not-found (Reason: No such file or directory)
   Active: inactive (dead)
[root@ose3-rbox-nd-p-d0dp ~]# ip route
default via 10.44.77.1 dev eth0 proto static metric 100
10.44.77.1 dev eth0 scope link
10.44.77.1 dev eth0 proto dhcp scope link metric 100
10.128.0.0/14 dev tun0 proto kernel scope link
169.254.169.254 via 10.44.77.1 dev eth0 proto dhcp metric 100
172.30.0.0/16 dev tun0 scope link
```

Trouble starts here:

``` shell
[root@ose3-rbox-i-0db1 ~]# tcpdump -i eth0 -nn | grep -A 2 VXLAN | grep -A 2 10.44.77.3 | grep 10.128.0.5
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
IP 10.128.0.5.9092 > 10.130.0.1.37006: Flags [P.], seq 1:141, ack 33, win 13580, options [nop,nop,TS val 804239922 ecr 804246130], length 140
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
IP 10.130.0.1.39419 > 10.128.0.5.9092: Flags [.], seq 7353:8711, ack 1, win 1369, options [nop,nop,TS val 42014197 ecr 804242420], length 1358
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
IP 10.130.0.1.53750 > 10.128.0.5.9092: Flags [.], ack 153505, win 32089, options [nop,nop,TS val 804251131 ecr 804244921], length 0
IP 10.130.0.1.53750 > 10.128.0.5.9092: Flags [.], ack 180665, win 32089, options [nop,nop,TS val 804251133 ecr 804244923], length 0
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
IP 10.130.0.1.41416 > 10.128.0.5.9092: Flags [P.], seq 25480:25692, ack 90377, win 1720, options [nop,nop,TS val 804261482 ecr 804255270], length 212
IP 10.128.0.5.9092 > 10.130.0.1.33676: Flags [.], ack 42751, win 48832, options [nop,nop,TS val 804255273 ecr 30982550], length 0
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
ARP, Request who-has 10.128.0.5 tell 10.128.2.1, length 28
```

This shows that some nodes send ARP requests for 10.128.0.5 and get a response (10.130.0.1 = ose3-rbox-master, whose TCP traffic flows in both directions), and others do not (10.128.2.1 = ose3-rbox-nd-p-d0dp, whose ARP requests go unanswered).
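Since the pod-to-pod traffic rides the VXLAN overlay between nodes, the tunnel traffic can also be captured directly. A sketch, assuming openshift-sdn uses the standard VXLAN UDP port 4789 (the host filter is my addition, to narrow the capture to the suspect node):

``` shell
# Sketch: capture only VXLAN tunnel traffic exchanged with the suspect node.
# Assumes the default VXLAN UDP port 4789.
tcpdump -i eth0 -nn 'udp port 4789 and host 10.44.77.3'
```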
Let's take a look at the OpenFlow rules on the infra node:

``` text
[root@ose3-rbox-i-0db1 ~]# ovs-ofctl -O OpenFlow13 dump-flows br0 | less
OFPST_FLOW reply (OF1.3) (xid=0x2):
 cookie=0x0, duration=5097779.115s, table=0, n_packets=21504710, n_bytes=903197820, priority=200,arp,in_port=1,arp_spa=10.128.0.0/14,arp_tpa=10.128.0.0/23 actions=move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[],goto_table:1

==> The ARP rule above sends us to table 1

 cookie=0x0, duration=5097779.112s, table=0, n_packets=16006380614, n_bytes=9046678831184, priority=200,ip,in_port=1,nw_src=10.128.0.0/14,nw_dst=10.128.0.0/23 actions=move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[],goto_table:1
 cookie=0x0, duration=5097779.107s, table=0, n_packets=1599648, n_bytes=67185216, priority=200,arp,in_port=2,arp_spa=10.128.0.1,arp_tpa=10.128.0.0/14 actions=goto_table:5
 cookie=0x0, duration=5097779.104s, table=0, n_packets=14396604107, n_bytes=9358948488855, priority=200,ip,in_port=2 actions=goto_table:5
 cookie=0x0, duration=5097779.099s, table=0, n_packets=1, n_bytes=42, priority=200,arp,in_port=3,arp_spa=10.128.0.0/23 actions=goto_table:5
 cookie=0x0, duration=5097779.097s, table=0, n_packets=0, n_bytes=0, priority=200,ip,in_port=3,nw_src=10.128.0.0/23 actions=goto_table:5
 cookie=0x0, duration=5097779.109s, table=0, n_packets=0, n_bytes=0, priority=150,in_port=1 actions=drop
 cookie=0x0, duration=5097779.102s, table=0, n_packets=19896, n_bytes=835632, priority=150,in_port=2 actions=drop
 cookie=0x0, duration=5097779.095s, table=0, n_packets=20, n_bytes=1788, priority=150,in_port=3 actions=drop
 cookie=0x0, duration=5097779.091s, table=0, n_packets=13107650, n_bytes=550521300, priority=100,arp actions=goto_table:2
 cookie=0x0, duration=5097779.088s, table=0, n_packets=18623864804, n_bytes=24517021768011, priority=100,ip actions=goto_table:2
 cookie=0x0, duration=5097779.086s, table=0, n_packets=106, n_bytes=8412, priority=0 actions=drop

==> Here is table 1:

 cookie=0x0, duration=1514842.089s, table=1, n_packets=5654619263, n_bytes=1364346433795, priority=100,tun_src=10.44.77.45 actions=goto_table:5
 cookie=0x0, duration=1514842.076s, table=1, n_packets=3154109, n_bytes=1829431635, priority=100,tun_src=10.44.77.11 actions=goto_table:5
 cookie=0x0, duration=1514842.051s, table=1, n_packets=290540380, n_bytes=204795887605, priority=100,tun_src=10.44.77.10 actions=goto_table:5
 cookie=0x0, duration=1514842.038s, table=1, n_packets=88937675, n_bytes=62143424775, priority=100,tun_src=10.44.77.8 actions=goto_table:5
 cookie=0x0, duration=1514842.025s, table=1, n_packets=144214921, n_bytes=20184914857, priority=100,tun_src=10.44.77.47 actions=goto_table:5
 cookie=0x0, duration=1514842.013s, table=1, n_packets=3161116, n_bytes=1836648002, priority=100,tun_src=10.44.77.12 actions=goto_table:5
 cookie=0x0, duration=1514841.999s, table=1, n_packets=318005732, n_bytes=220695552970, priority=100,tun_src=10.44.77.50 actions=goto_table:5
 cookie=0x0, duration=1514841.985s, table=1, n_packets=286339191, n_bytes=197815708131, priority=100,tun_src=10.44.77.6 actions=goto_table:5
 cookie=0x0, duration=1514841.971s, table=1, n_packets=85568819, n_bytes=58118122099, priority=100,tun_src=10.44.77.9 actions=goto_table:5
 cookie=0x0, duration=1514841.958s, table=1, n_packets=7932116478, n_bytes=6098819928021, priority=100,tun_src=10.44.77.29 actions=goto_table:5
 cookie=0x0, duration=914202.872s, table=1, n_packets=56349706, n_bytes=40203119777, priority=100,tun_src=10.44.77.2 actions=goto_table:5
 cookie=0x0, duration=868611.664s, table=1, n_packets=53567650, n_bytes=38301174872, priority=100,tun_src=10.44.77.4 actions=goto_table:5
 cookie=0x0, duration=5097779.083s, table=1, n_packets=18442562, n_bytes=1099251485, priority=0 actions=drop

==> Table 1 above has no entry for source node 10.44.77.3, so we hit the drop rule and no ARP response is sent back

...

 cookie=0x0, duration=1516181.377s, table=8, n_packets=16565, n_bytes=695730, priority=100,arp,arp_tpa=10.131.0.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.44.77.3->tun_dst,output:1
 cookie=0x0, duration=1516181.372s, table=8, n_packets=3394265, n_bytes=4223544126, priority=100,ip,nw_dst=10.131.0.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.44.77.3->tun_dst,output:1

==> Yet the table 8 rules are present...

...
```
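A quick way to spot this class of inconsistency is to list the tun_src entries that table 1 accepts and compare them against the cluster's host IPs (a sketch using only commands already shown above; the grep pattern is my addition):

``` shell
# Sketch: list the tun_src addresses accepted by table 1 on this node and
# compare them by eye against the HOST IP column of `oc get hostsubnet`.
# Every *other* node's host IP should appear exactly once.
ovs-ofctl -O OpenFlow13 dump-flows br0 table=1 | grep -o 'tun_src=[0-9.]*' | sort
```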
Let's look at the openshift-sdn code (note: this isn't the code for the precise version of OpenShift we're running; I'm off by a patch level, IIRC). From atomic-openshift-git-0.d84be7f_3.2.1.15-1.git.0.d84be7f.el7/Godeps/_workspace/src/github.com/openshift/openshift-sdn/plugins/osdn/ovs/controller.go:

``` go
func (plugin *ovsPlugin) AddHostSubnetRules(subnet *osapi.HostSubnet) {
	glog.Infof("AddHostSubnetRules for %s", osdn.HostSubnetToString(subnet))

	otx := ovs.NewTransaction(BR)
	otx.AddFlow("table=1, priority=100, tun_src=%s, actions=goto_table:5", subnet.HostIP)
	otx.AddFlow("table=8, priority=100, arp, nw_dst=%s, actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:%s->tun_dst,output:1", subnet.Subnet, subnet.HostIP)
	otx.AddFlow("table=8, priority=100, ip, nw_dst=%s, actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:%s->tun_dst,output:1", subnet.Subnet, subnet.HostIP)

	err := otx.EndTransaction()
	if err != nil {
		glog.Errorf("Error adding OVS flows: %v", err)
	}
}
```

So we see that the rules are added together (and, if we also check DeleteHostSubnetRules(), removed together). Looking at AddFlow():

``` go
// AddFlow adds a flow to the bridge. The arguments are passed to fmt.Sprintf().
func (tx *Transaction) AddFlow(flow string, args ...interface{}) {
	if len(args) > 0 {
		flow = fmt.Sprintf(flow, args...)
	}
	tx.ofctlExec("add-flow", tx.bridge, flow)
}
```

Nothing suspicious. Looking at how the commands are then executed, however:

``` go
func (tx *Transaction) ofctlExec(args ...string) (string, error) {
	args = append([]string{"-O", "OpenFlow13"}, args...)
	return tx.exec("ovs-ofctl", args...)
}
```

and

``` go
func (tx *Transaction) exec(cmd string, args ...string) (string, error) {
	if tx.err != nil {
		return "", tx.err
	}

	cmdpath, err := exec.LookPath(cmd)
	if err != nil {
		tx.err = fmt.Errorf("OVS is not installed")
		return "", tx.err
	}

	var output string
	output, tx.err = exec.Exec(cmdpath, args...)
	return output, tx.err
}
```

The problem becomes a little clearer: the semantics of the Transaction object are that it executes commands until one fails, and the rest are then silently skipped. In our case, we expect to have:

* 1 table 1 rule
* 2 table 8 rules

And we actually have:

* 0 table 1 rules
* 2 table 8 rules

So the problem was on deletion. Either the table 1 rule was deleted but its delete returned an error, causing the two table 8 deletion commands to be ignored; or the table 1 delete worked and returned no error, but the first table 8 delete command failed, leading to the second table 8 delete being ignored.

The problem is that the situation can't be recovered without one of:

* rebooting the infra node that has the corrupt rules
* adding the missing table 1 rule back manually (see the sketch below)
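For the manual recovery path, the missing flow can be reconstructed from the AddHostSubnetRules() format string above. A sketch, not taken from the case (10.44.77.3 is the HostIP whose table 1 entry is missing on this infra node):

``` shell
# Sketch: re-add the missing table 1 rule by hand, using the exact flow
# format from AddHostSubnetRules() above. Run on the node with the corrupt
# rules, substituting the HostIP whose entry is missing.
ovs-ofctl -O OpenFlow13 add-flow br0 "table=1, priority=100, tun_src=10.44.77.3, actions=goto_table:5"
```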
Additional info:

Related case: 01875046
> So the problem was on deletion. Either the table 1 rule was deleted but its
> delete returned an error, causing the two table 8 deletion commands to be
> ignored; or the table 1 delete worked and returned no error, but the first
> table 8 delete command failed, leading to the second table 8 delete being
> ignored.

This can't really happen: add-flow and delete-flows can, for the most part, only fail if the flow is syntactically invalid in some way (in which case *everyone* would see the bug). In any case, if something were going wrong with our manipulation of the OVS flows, there ought to be errors logged about it. Can you get a copy of the atomic-openshift-node logs from one of the nodes that is experiencing problems? (If the logs are huge, then grep for just "controller.go" to start with.)
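For reference, something like the following should pull just the relevant lines (a sketch; it assumes the node service logs to the systemd journal under the atomic-openshift-node unit, per the packaging referenced above):

``` shell
# Sketch: extract only the openshift-sdn flow-manipulation messages from
# the node logs. Assumes the atomic-openshift-node unit logs to the journal.
journalctl -u atomic-openshift-node --no-pager | grep 'controller.go'
```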
We've looked over the code again, and although the flow changes are not made in an explicit transaction, the sequence should never fail, since the commands do not error out as long as the flows are syntactically correct. If a rule already exists, the add is simply a no-op. Given that this has happened only the one time, and given that we are moving to OVN rather than programming OVS directly, I'd rather not take the risk of changing the code now to attempt to address this.