Bug 1299756 - Existing pods lose network connection after removing lbr0 and restarting the node service
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Assigned To: Dan Winship
QA Contact: Meng Bo
Depends On:
Blocks:
 
Reported: 2016-01-19 03:48 EST by Meng Bo
Modified: 2016-05-12 12:26 EDT (History)
8 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-05-12 12:26:53 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
debug_logs (265.55 KB, application/x-gzip)
2016-01-19 03:48 EST, Meng Bo
no flags
node_log_with_PR7310 (127.51 KB, text/x-vhdl)
2016-02-19 01:48 EST, Meng Bo
no flags

Description Meng Bo 2016-01-19 03:48:31 EST
Created attachment 1116087 [details]
debug_logs

Description of problem:
Set up a multi-node env and create some pods in different projects. Switch the networking plugin from the current one to another, e.g. from redhat/openshift-ovs-multitenant to redhat/openshift-ovs-subnet. Check the pods' network connections after the switch.

The pods cannot be reached from nodes or other pods, and cannot access the outside network.

Version-Release number of selected component (if applicable):
openshift v3.1.1.5
kubernetes v1.1.0-origin-1107-g4c8e6f4

How reproducible:
always

Steps to Reproduce:
1. Set up a multi-node env with the multitenant network config
2. Create some projects and some pods in those projects
3. Switch the networking plugin with the following steps:
On master:
systemctl stop atomic-openshift-master
sed -i 's/openshift-ovs-multitenant/openshift-ovs-subnet/g' master-config.yaml
systemctl start atomic-openshift-master

On each node:
systemctl stop atomic-openshift-node
sed -i 's/openshift-ovs-multitenant/openshift-ovs-subnet/g' node-config.yaml
ip link del lbr0
systemctl start atomic-openshift-node

4. Check the network connection for the pods created in step 2
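
The substitution in step 3 can be sanity-checked on a throwaway copy before touching the real config. A minimal sketch (the two-line file below is illustrative, not a full node-config.yaml):

```shell
# Try the same sed substitution on a scratch copy first.
cfg=$(mktemp)
printf 'networkConfig:\n  networkPluginName: redhat/openshift-ovs-multitenant\n' > "$cfg"

# Same substitution used in step 3 above.
sed -i 's/openshift-ovs-multitenant/openshift-ovs-subnet/g' "$cfg"

# Should now show redhat/openshift-ovs-subnet.
grep networkPluginName "$cfg"
rm -f "$cfg"
```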


Actual results:
The pods cannot access other pods, the nodes, or the internet.
The pods cannot be reached from any node.

Expected results:
The pod network should work fine after switching plugins.

Additional info:
The ARP list after trying to access the node and other pods:
bash-4.3$ ip neigh         
10.1.0.1 dev eth0  FAILED
10.1.0.7 dev eth0  FAILED
10.1.0.8 dev eth0  FAILED
10.1.0.6 dev eth0  FAILED
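
Every neighbour entry being FAILED means ARP resolution never completes from inside the pod, which matches the flood rules below never getting a reply. A quick way to flag this from a saved `ip neigh` dump (sample text inlined here; on a live pod you would pipe in `ip neigh` instead):

```shell
# Count unresolved neighbour entries in a saved `ip neigh` dump.
neigh_dump='10.1.0.1 dev eth0  FAILED
10.1.0.7 dev eth0  FAILED
10.1.0.8 dev eth0  FAILED
10.1.0.6 dev eth0  FAILED'

failed=$(printf '%s\n' "$neigh_dump" | grep -c 'FAILED$')
total=$(printf '%s\n' "$neigh_dump" | grep -c '')
echo "$failed of $total neighbour entries unresolved"   # 4 of 4 here
```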


The following OF rules were changed when trying to access the pod from the same node:

 cookie=0x0, duration=649.707s, table=0, n_packets=131, n_bytes=9770, tun_src=0.0.0.0 actions=goto_table:1
 cookie=0x0, duration=649.704s, table=1, n_packets=149, n_bytes=10526, actions=learn(table=9,hard_timeout=900,priority=200,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:NXM_NX_TUN_IPV4_SRC[]->NXM_NX_TUN_IPV4_DST[],output:NXM_OF_IN_PORT[]),goto_table:2
 cookie=0x0, duration=649.701s, table=2, n_packets=40, n_bytes=1680, priority=200,arp actions=goto_table:9
 cookie=0x0, duration=649.468s, table=9, n_packets=37, n_bytes=1554, priority=0,arp actions=FLOOD
Comment 1 Dan Winship 2016-01-19 15:21:12 EST
This appears to be the same underlying bug as #1275904, which is thus also fixed by https://github.com/openshift/openshift-sdn/pull/241, which we had decided against trying to get into 3.1.1.

It is not actually necessary to change networking plugins to cause the problem; just "ip link del lbr0; systemctl restart atomic-openshift-node" will do it. (Any pre-existing pods will no longer have network access.)

(This is not new in 3.1.1; the bug should exist in 3.1 as well.)
Comment 3 Meng Bo 2016-01-26 01:49:23 EST
Checked on OSE puddle 2016-01-25.1

The issue can still be reproduced: after deleting lbr0, restarting the node service does not bring the existing pods' network back, unless the docker service is restarted manually.

# ovs-ofctl show br0 -O openflow13
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:0000428a59474f40
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC reply (OF1.3) (xid=0x3):
 1(vxlan0): addr:c6:63:fc:68:67:81
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:06:3b:ce:ea:17:22
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 3(vovsbr): addr:f2:4d:fd:5e:4a:63
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 10(veth4933e73): addr:ae:79:04:82:28:1c
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 11(veth4219e71): addr:16:7b:bf:7a:32:86
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 12(veth85b2122): addr:6e:54:f4:43:09:a8
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br0): addr:42:8a:59:47:4f:40
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (OF1.3) (xid=0x5): frags=normal miss_send_len=0

# ip link del lbr0
# systemctl restart atomic-openshift-node 

# ovs-ofctl show br0 -O openflow13
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:0000428a59474f40
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC reply (OF1.3) (xid=0x3):
 1(vxlan0): addr:02:f3:84:e1:f0:12
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:ba:eb:74:9d:ec:8f
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 3(vovsbr): addr:be:e4:cf:8a:41:3e
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br0): addr:42:8a:59:47:4f:40
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (OF1.3) (xid=0x5): frags=normal miss_send_len=0

# systemctl restart docker
# ovs-ofctl show br0 -O openflow13
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:0000428a59474f40
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC reply (OF1.3) (xid=0x3):
 1(vxlan0): addr:02:f3:84:e1:f0:12
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:ba:eb:74:9d:ec:8f
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 3(vovsbr): addr:be:e4:cf:8a:41:3e
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 13(veth5e9efe8): addr:8a:97:8c:fa:ef:9d
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 14(veth3455db6): addr:72:16:99:32:30:68
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 15(veth1a7a6a1): addr:da:5a:13:c4:5b:a1
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br0): addr:42:8a:59:47:4f:40
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (OF1.3) (xid=0x5): frags=normal miss_send_len=0
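
The regression is visible from the OFPST_PORT_DESC output alone: before the node restart, br0 carries one veth port per running pod (ports 10-12 above); after the restart only vxlan0, tun0, and vovsbr remain. A small sketch that detects this from saved `ovs-ofctl show` output (helper name is made up; the sample is inlined here, but on a live node you could feed in `ovs-ofctl show br0 -O openflow13` instead):

```shell
# Count pod (veth*) ports in `ovs-ofctl show br0` output.
count_veth_ports() {
    grep -c '(veth' <<< "$1"
}

# Abbreviated port list as seen after `systemctl restart atomic-openshift-node`.
after_restart=' 1(vxlan0): addr:02:f3:84:e1:f0:12
 2(tun0): addr:ba:eb:74:9d:ec:8f
 3(vovsbr): addr:be:e4:cf:8a:41:3e
 LOCAL(br0): addr:42:8a:59:47:4f:40'

if [ "$(count_veth_ports "$after_restart")" -eq 0 ]; then
    echo "no pod veth ports on br0 - existing pods are detached"
fi
```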
Comment 4 Dan Winship 2016-01-26 09:02:46 EST
(In reply to Meng Bo from comment #3)
> Checked on OSE puddle 2016-01-25.1
> 
> The issue still can be reproduced, after delete lbr0, restart node service
> will not bring the existing pods network back.

Ah, right. This bug and bug 1300582 have the same underlying cause (we don't properly reattach pods to OVS if we have to recreate the OVS bridge), but we didn't actually fix that for 1300582; we just made it not recreate the OVS bridge when it doesn't need to. In cases where it does actually need to recreate the bridge (eg, if you deleted one of the network devices, changed the plugin, etc), then the bug still exists.

> Unless restart the docker service manually.

Ah... we'd been thinking this wasn't a regression from 3.1 since the code to handle this didn't exist there either, but it's possible that in 3.1 we ended up restarting docker in this case even though it shouldn't have been necessary.

Of course, the reason why restarting docker "fixes" it is because that destroys all of the existing pods, and then openshift has to recreate them. So there's an outage, and the pods don't even necessarily come back with the same IP addresses.
Comment 5 Dan Winship 2016-02-16 16:07:40 EST
fixed in origin (https://github.com/openshift/origin/pull/7310)
Comment 6 Dan Winship 2016-02-18 10:17:08 EST
This is now in OSE
Comment 7 Meng Bo 2016-02-19 01:48 EST
Created attachment 1128467 [details]
node_log_with_PR7310

It is still not working well.

I have tested with latest origin and OSE build 2016-02-17.3.

The OVS ports were not added back to br0 after restarting the openshift-node service.

Log with loglevel=5 attached. There are some errors around lines 133 to 135; not sure if they are the problem.
Comment 8 Dan Winship 2016-02-19 10:16:02 EST
ah, it apparently works with flat but not multitenant
Comment 9 Eric Paris 2016-02-23 11:25:13 EST
https://github.com/openshift/origin/pull/7560
Comment 10 Meng Bo 2016-02-24 01:54:39 EST
Tested on origin build v1.1.3-245-g806dd7e-dirty with the PR merged.

It is working well now for both multitenant and flat plugins.
Comment 15 errata-xmlrpc 2016-05-12 12:26:53 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064
