Created attachment 1116087 [details]
debug_logs

Description of problem:
Set up a multi-node env and create some pods in different projects. Switch the networking plugin from the current one to another, e.g. from redhat/openshift-ovs-multitenant to redhat/openshift-ovs-subnet. Check the pods' network connectivity after the switch. The pods cannot be reached from nodes or other pods and cannot access the outside network.

Version-Release number of selected component (if applicable):
openshift v3.1.1.5
kubernetes v1.1.0-origin-1107-g4c8e6f4

How reproducible:
Always

Steps to Reproduce:
1. Set up a multi-node env with the multitenant network config
2. Create some projects and some pods in the projects
3. Switch the networking plugin with the following steps:

On master:
systemctl stop atomic-openshift-master
sed -i 's/openshift-ovs-multitenant/openshift-ovs-subnet/g' master-config.yaml
systemctl start atomic-openshift-master

On each node:
systemctl stop atomic-openshift-node
sed -i 's/openshift-ovs-multitenant/openshift-ovs-subnet/g' node-config.yaml
ip link del lbr0
systemctl start atomic-openshift-node

4. Check the network connectivity for the pods created in step 2

Actual results:
The pods cannot access other pods, the nodes, or the internet. The pods cannot be reached from any of the nodes.

Expected results:
The pod network should keep working after the plugin switch.

Additional info:
The ARP table after trying to reach the node and other pods:

bash-4.3$ ip neigh
10.1.0.1 dev eth0  FAILED
10.1.0.7 dev eth0  FAILED
10.1.0.8 dev eth0  FAILED
10.1.0.6 dev eth0  FAILED

The following OF rules changed when trying to access the pod from the same node:

cookie=0x0, duration=649.707s, table=0, n_packets=131, n_bytes=9770, tun_src=0.0.0.0 actions=goto_table:1
cookie=0x0, duration=649.704s, table=1, n_packets=149, n_bytes=10526, actions=learn(table=9,hard_timeout=900,priority=200,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:NXM_NX_TUN_IPV4_SRC[]->NXM_NX_TUN_IPV4_DST[],output:NXM_OF_IN_PORT[]),goto_table:2
cookie=0x0, duration=649.701s, table=2, n_packets=40, n_bytes=1680, priority=200,arp actions=goto_table:9
cookie=0x0, duration=649.468s, table=9, n_packets=37, n_bytes=1554, priority=0,arp actions=FLOOD
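For reference, step 4 can be verified along these lines (the pod names and IPs below are illustrative, and it assumes the test pods have curl available):

# oc get pods -o wide
    (note each pod's IP and host node)
# ping -c 3 10.1.0.6
    (from a node, ping a pod IP)
# oc exec test-pod-1 -- curl -s --connect-timeout 5 10.1.0.7:8080
    (pod-to-pod access)
# oc exec test-pod-1 -- curl -s --connect-timeout 5 www.example.com
    (pod-to-external access)

Before the plugin switch these all succeed; after the switch they fail or time out.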
This appears to be the same underlying bug as #1275904, which is thus also fixed by https://github.com/openshift/openshift-sdn/pull/241, which we had decided against trying to get into 3.1.1. It is not actually necessary to change networking plugins to cause the problem; just "ip link del lbr0; systemctl restart atomic-openshift-node" will do it. (Any pre-existing pods will no longer have network access.) (This is not new in 3.1.1; the bug should exist in 3.1 as well.)
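For the minimal reproduction, a quick way to confirm that a pre-existing pod has lost its network is to ping its default gateway (the tun0 address, the .1 of that node's pod subnet) from inside the pod. The pod name and addresses here are only an example, and it assumes the pod image has ping:

# ip link del lbr0
# systemctl restart atomic-openshift-node
# oc exec test-pod-1 -- ping -c 3 10.1.0.1
    (100% packet loss once the bridge has been recreated)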
Checked on OSE puddle 2016-01-25.1.

The issue can still be reproduced: after deleting lbr0, restarting the node service does not bring the existing pods' network back unless the docker service is restarted manually.

# ovs-ofctl show br0 -O openflow13
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:0000428a59474f40
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC reply (OF1.3) (xid=0x3):
 1(vxlan0): addr:c6:63:fc:68:67:81
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:06:3b:ce:ea:17:22
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 3(vovsbr): addr:f2:4d:fd:5e:4a:63
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 10(veth4933e73): addr:ae:79:04:82:28:1c
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 11(veth4219e71): addr:16:7b:bf:7a:32:86
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 12(veth85b2122): addr:6e:54:f4:43:09:a8
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br0): addr:42:8a:59:47:4f:40
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (OF1.3) (xid=0x5): frags=normal miss_send_len=0

# ip link del lbr0
# systemctl restart atomic-openshift-node

# ovs-ofctl show br0 -O openflow13
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:0000428a59474f40
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC reply (OF1.3) (xid=0x3):
 1(vxlan0): addr:02:f3:84:e1:f0:12
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:ba:eb:74:9d:ec:8f
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 3(vovsbr): addr:be:e4:cf:8a:41:3e
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br0): addr:42:8a:59:47:4f:40
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (OF1.3) (xid=0x5): frags=normal miss_send_len=0

# systemctl restart docker

# ovs-ofctl show br0 -O openflow13
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:0000428a59474f40
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC reply (OF1.3) (xid=0x3):
 1(vxlan0): addr:02:f3:84:e1:f0:12
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:ba:eb:74:9d:ec:8f
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 3(vovsbr): addr:be:e4:cf:8a:41:3e
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 13(veth5e9efe8): addr:8a:97:8c:fa:ef:9d
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 14(veth3455db6): addr:72:16:99:32:30:68
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 15(veth1a7a6a1): addr:da:5a:13:c4:5b:a1
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br0): addr:42:8a:59:47:4f:40
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (OF1.3) (xid=0x5): frags=normal miss_send_len=0
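A quicker way to spot the missing pod ports than the full ovs-ofctl dump (assuming ovs-vsctl is available on the node):

# ovs-vsctl list-ports br0

Before the node restart this lists vxlan0, tun0, vovsbr plus one veth* port per local pod; after "ip link del lbr0; systemctl restart atomic-openshift-node" only vxlan0, tun0 and vovsbr remain until docker is restarted.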
(In reply to Meng Bo from comment #3)
> Checked on OSE puddle 2016-01-25.1.
>
> The issue can still be reproduced: after deleting lbr0, restarting the node
> service does not bring the existing pods' network back.

Ah, right. This bug and bug 1300582 have the same underlying cause (we don't properly reattach pods to OVS if we have to recreate the OVS bridge), but we didn't actually fix that for 1300582; we just made it not recreate the OVS bridge when it doesn't need to. In cases where it does actually need to recreate the bridge (e.g., if you deleted one of the network devices, changed the plugin, etc.), the bug still exists.

> Unless the docker service is restarted manually.

Ah... we'd been thinking this wasn't a regression from 3.1, since the code to handle this didn't exist there either, but it's possible that in 3.1 we ended up restarting docker in this case even though it shouldn't have been necessary.

Of course, the reason why restarting docker "fixes" it is that it destroys all of the existing pods, and then openshift has to recreate them. So there's an outage, and the pods don't even necessarily come back with the same IP addresses.
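To spell out what "reattach pods to OVS" means here (only a rough sketch of the idea, not a supported workaround; the pod and interface names are hypothetical): for each still-running pod, the host-side veth peer of the pod's eth0 would have to be added back to the recreated br0, roughly:

# oc exec test-pod-1 -- cat /sys/class/net/eth0/iflink
    (prints the ifindex of the host-side veth peer)
# ip -o link
    (find the veth entry with that ifindex, e.g. veth4933e73)
# ovs-vsctl add-port br0 veth4933e73
    (re-add that veth to the recreated bridge)

And with the multitenant plugin the per-pod OpenFlow rules (VNID tagging) would also need to be reinstalled, which is why restarting docker and letting the node plugin recreate the pods is what actually restores connectivity.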
fixed in origin (https://github.com/openshift/origin/pull/7310)
This is now in OSE
Created attachment 1128467 [details]
node_log_with_PR7310

It is still not working well. I have tested with the latest origin and OSE build 2016-02-17.3. The OVS ports were not added back to br0 after restarting the openshift-node service. A log with loglevel=5 is attached, and there are some errors around lines 133 to 135 of it; not sure if they are the problem.
Ah, it apparently works with the flat plugin but not with multitenant.
https://github.com/openshift/origin/pull/7560
Tested on origin build v1.1.3-245-g806dd7e-dirty with the PR merged. It is working well now for both multitenant and flat plugins.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064