Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1389212

Summary: all pod veth are removed after restarting openvswitch
Product: OpenShift Container Platform
Reporter: Hongan Li <hongli>
Component: Networking
Assignee: Dan Winship <danw>
Status: CLOSED ERRATA
QA Contact: Meng Bo <bmeng>
Severity: medium
Priority: medium
Version: 3.4.0
CC: anli, aos-bugs, bbennett, hongli, tdawson
Target Milestone: ---
Keywords: Regression
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2017-01-18 12:46:58 UTC
Type: Bug

Description Hongan Li 2016-10-27 08:13:20 UTC
Description of problem:
The pod veths are removed after restarting openvswitch.

Version-Release number of selected component (if applicable):
openshift v3.4.0.16+cc70b72
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0
ovs-vsctl (Open vSwitch) 2.4.0


How reproducible:
always

Steps to Reproduce:
1. Set up a multi-node environment with the multitenant network plugin.
2. Create some pods on the node.
3. SSH to the node and check the pod veths.
4. systemctl restart openvswitch
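Step 3 above amounts to counting the veth ports attached to br0. On a live node that is roughly `ovs-ofctl -O OpenFlow13 show br0 | grep -c '(veth'`; to keep this sketch self-contained and runnable, the same pipeline is applied to a captured sample of the "before" output shown later in this report.

```shell
# Count pod veth ports on br0. The sample below is copied from the
# OFPST_PORT_DESC output in this report (truncated to four ports); on a
# real node, replace it with live `ovs-ofctl -O OpenFlow13 show br0` output.
sample=' 1(vxlan0): addr:d2:cc:76:b6:c7:bf
 2(tun0): addr:1a:89:3b:6e:ec:27
 5(vethfe66b7f6): addr:ce:0a:1d:40:14:c0
 18(vethf897b3ae): addr:62:26:5d:46:77:c2'
veth_count=$(printf '%s\n' "$sample" | grep -c '(veth')
echo "veth ports on br0: $veth_count"
```

After a healthy restart this count should be unchanged; in the failure described here it drops to zero.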

Actual results:
The pod veths are removed after openvswitch restarts, and pods on the node cannot be reached.
Details below:
====== before restarting openvswitch:
[root@ip-172-18-4-193 ~]# ovs-ofctl -O OpenFlow13 show br0
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:000012f02b5a5444
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC reply (OF1.3) (xid=0x3):
 1(vxlan0): addr:d2:cc:76:b6:c7:bf
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:1a:89:3b:6e:ec:27
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 5(vethfe66b7f6): addr:ce:0a:1d:40:14:c0
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 18(vethf897b3ae): addr:62:26:5d:46:77:c2
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 22(veth3f8d481c): addr:82:d5:e2:48:ed:f1
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 23(veth4a0ef39a): addr:0e:50:24:80:5c:70
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 24(veth86b0da34): addr:da:0f:63:c0:42:7f
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br0): addr:12:f0:2b:5a:54:44
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (OF1.3) (xid=0x5): frags=normal miss_send_len=0
[root@ip-172-18-4-193 ~]# 


====== after restarting openvswitch:
[root@ip-172-18-4-193 ~]# ovs-ofctl -O OpenFlow13 show br0
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:000012f02b5a5444
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC reply (OF1.3) (xid=0x3):
 1(vxlan0): addr:5a:b5:7a:ba:53:c5
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:4a:ff:34:22:0c:8c
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(br0): addr:12:f0:2b:5a:54:44
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (OF1.3) (xid=0x5): frags=normal miss_send_len=0
[root@ip-172-18-4-193 ~]# 
[root@ip-172-18-4-193 ~]# 


Expected results:
The pod veths should not be removed after restarting openvswitch.

Additional info:

Comment 1 Hongan Li 2016-10-31 08:57:52 UTC
The issue can still be reproduced in OCP 3.4.0.17 with OVS 2.5.0.

openshift v3.4.0.17+b8a03bc
kubernetes v1.4.0+776c994
ovs-vsctl (Open vSwitch) 2.5.0

Comment 2 Dan Winship 2016-11-01 17:37:32 UTC
Hm... worksforme

[root@openshift-node-2 /]# ovs-ofctl -O OpenFlow13 show br0 | grep veth
 3(veth642b9ac1): addr:fa:9f:07:8b:a8:02
[root@openshift-node-2 /]# systemctl restart openvswitch
[root@openshift-node-2 /]# ovs-ofctl -O OpenFlow13 show br0 | grep veth
 3(veth642b9ac1): addr:fa:9f:07:8b:a8:02

"ps" shows that ovsdb-server and ovs-vswitchd are being restarted

Comment 3 Hongan Li 2016-11-02 02:27:22 UTC
The issue can still be reproduced in OCP 3.4.0.18.
Also note that the node status stays in "NotReady" for about 10 minutes after restarting openvswitch; no veths are shown during this time.

Seems it is related to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1388867

Comment 5 Dan Winship 2016-11-02 16:36:37 UTC
(In reply to hongli from comment #3)
> Seems it is related to this bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=1388867

OK, let's wait for that to get fixed and see if this is still reproducible

Comment 6 Anping Li 2016-11-07 10:16:42 UTC
I hit the same issue during an upgrade; the steps are as follows.
1. Install OCP 3.3 and create pods, etc.
2. Upgrade to OCP 3.4 and openvswitch 2.4.
3. Upgrade openvswitch to 2.5 and restart openvswitch.
4. Check the existing pods: some pods cannot be accessed.

Comment 7 Dan Winship 2016-11-07 13:29:09 UTC
(In reply to Dan Winship from comment #5)
> (In reply to hongli from comment #3)
> > Seems it is related to this bug:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1388867
> 
> OK, let's wait for that to get fixed and see if this is still reproducible

That bug is fixed in the current packages. Is this bug still reproducible *without* upgrading OVS in the middle?

Comment 8 Hongan Li 2016-11-08 09:48:05 UTC
The bug can still be reproduced in OCP 3.4.0.23 *with* or *without* upgrading OVS.

However, it is not 100% reproducible now and I haven't found any pattern; it seems openvswitch must be restarted several times to trigger the issue.

I have reserved the two environments for your debugging.

Comment 10 Dan Winship 2016-11-09 13:32:04 UTC
Ah, it's not actually especially OVS-related at all; it's just a badly-timed crash at startup. (OpenShift is restarted when OVS is restarted, which is correct; it sees that it needs to recreate the SDN and starts doing so. But when it gets to the part where it starts reattaching the pods, it hits a bug and panics. Systemd then restarts it, and this time it sees that the OVS bridge already exists and so skips SDN setup. As a result, the old pods never get reattached.)
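The failure sequence described above can be sketched as a small shell simulation (the function and variable names are illustrative, not the actual openshift-node code paths):

```shell
# Simulated node startup: the first SDN setup creates br0 and then panics
# while reattaching pods; the systemd-driven second run sees br0 already
# exists and skips setup entirely, so the pod veths are never restored.
bridge_exists=false
pods_reattached=false
attempt=0

sdn_setup() {
  attempt=$((attempt + 1))
  bridge_exists=true               # br0 is created early in setup
  if [ "$attempt" -eq 1 ]; then
    return 1                       # badly-timed crash during pod reattach
  fi
  pods_reattached=true             # a completed run would reattach pods
}

sdn_setup || true                  # run 1: crashes after creating br0
if [ "$bridge_exists" = false ]; then
  sdn_setup                        # run 2 never happens: br0 already exists
fi
echo "pods_reattached=$pods_reattached"
```

The fix implied here is for the second run to verify pod attachment rather than treating an existing bridge as proof that setup completed.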

Comment 11 Troy Dawson 2016-11-11 19:32:08 UTC
This should be tested on OSE v3.4.0.25 or newer.

Comment 13 Hongan Li 2016-11-14 10:01:30 UTC
Verified in v3.4.0.25; the bug has been fixed.

Comment 14 Hongan Li 2016-11-15 06:27:27 UTC
Verified as fixed in the v3.4.0.25 containerized environment.

However, the issue can still be reproduced in the v3.4.0.26 RPM installation environment, so I am assigning the bug back.

I will leave the environment available for debugging.

Comment 16 Hongan Li 2016-11-17 10:18:18 UTC
Verified in the 3.4.0.27 RPM installation environment; the issue can no longer be reproduced.

openshift v3.4.0.27+dffa95c
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

Comment 18 errata-xmlrpc 2017-01-18 12:46:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066