Bug 1389212 - all pod veth are removed after restarting openvswitch
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Assigned To: Dan Winship
QA Contact: Meng Bo
Keywords: Regression
Depends On:
Blocks:
Reported: 2016-10-27 04:13 EDT by hongli
Modified: 2017-03-08 13 EST
CC: 5 users

See Also:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-01-18 07:46:58 EST
Type: Bug


External Trackers:
Origin (Github) 11852 - last updated 2016-11-09 08:32 EST
Origin (Github) 11911 - last updated 2016-11-15 07:13 EST
Red Hat Product Errata RHBA-2017:0066 (normal, SHIPPED_LIVE) - Red Hat OpenShift Container Platform 3.4 RPM Release Advisory - last updated 2017-01-18 12:23:26 EST

Description hongli 2016-10-27 04:13:20 EDT
Description of problem:
All pod veth interfaces are removed after restarting openvswitch.

Version-Release number of selected component (if applicable):
openshift v3.4.0.16+cc70b72
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0
ovs-vsctl (Open vSwitch) 2.4.0


How reproducible:
always

Steps to Reproduce:
1. Set up a multi-node env with the multitenant network plugin
2. Create some pods on the node
3. SSH to the node and check the pod veth interfaces
4. systemctl restart openvswitch
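Putting the steps above together, a minimal sketch run on the node looks like this (the bridge name br0 matches the output below; the 30-second pause is an arbitrary choice):

# List the pod veth ports currently attached to the SDN bridge br0
ovs-ofctl -O OpenFlow13 show br0 | grep veth

# Restart Open vSwitch and give the node service a moment to react
systemctl restart openvswitch
sleep 30

# List the ports again; in the failing case all veth ports are gone
ovs-ofctl -O OpenFlow13 show br0 | grep veth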

Actual results:
The pod veth interfaces are removed after openvswitch is restarted, and the pods on the node can no longer be reached.
Details below:
====== before restarting openvswitch:
[root@ip-172-18-4-193 ~]# ovs-ofctl -O OpenFlow13 show br0
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:000012f02b5a5444
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC reply (OF1.3) (xid=0x3):
 1(vxlan0): addr:d2:cc:76:b6:c7:bf
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:1a:89:3b:6e:ec:27
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 5(vethfe66b7f6): addr:ce:0a:1d:40:14:c0
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 18(vethf897b3ae): addr:62:26:5d:46:77:c2
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 22(veth3f8d481c): addr:82:d5:e2:48:ed:f1
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 23(veth4a0ef39a): addr:0e:50:24:80:5c:70
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 24(veth86b0da34): addr:da:0f:63:c0:42:7f
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br0): addr:12:f0:2b:5a:54:44
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (OF1.3) (xid=0x5): frags=normal miss_send_len=0
[root@ip-172-18-4-193 ~]# 


====== after restarting openvswitch:
[root@ip-172-18-4-193 ~]# ovs-ofctl -O OpenFlow13 show br0
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:000012f02b5a5444
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC reply (OF1.3) (xid=0x3):
 1(vxlan0): addr:5a:b5:7a:ba:53:c5
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:4a:ff:34:22:0c:8c
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(br0): addr:12:f0:2b:5a:54:44
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (OF1.3) (xid=0x5): frags=normal miss_send_len=0
[root@ip-172-18-4-193 ~]# 
[root@ip-172-18-4-193 ~]# 


Expected results:
The pod veth interfaces should not be removed after restarting openvswitch.

Additional info:
Comment 1 hongli 2016-10-31 04:57:52 EDT
The issue can still be reproduced in OCP 3.4.0.17 and OVS 2.5.0.

openshift v3.4.0.17+b8a03bc
kubernetes v1.4.0+776c994
ovs-vsctl (Open vSwitch) 2.5.0
Comment 2 Dan Winship 2016-11-01 13:37:32 EDT
Hm... worksforme

[root@openshift-node-2 /]# ovs-ofctl -O OpenFlow13 show br0 | grep veth
 3(veth642b9ac1): addr:fa:9f:07:8b:a8:02
[root@openshift-node-2 /]# systemctl restart openvswitch
[root@openshift-node-2 /]# ovs-ofctl -O OpenFlow13 show br0 | grep veth
 3(veth642b9ac1): addr:fa:9f:07:8b:a8:02

"ps" shows that ovsdb-server and ovs-vswitchd are being restarted
Comment 3 hongli 2016-11-01 22:27:22 EDT
The issue can still be reproduced in OCP 3.4.0.18.
Also note that the node status stays in "NotReady" for about 10 minutes after restarting openvswitch, and no veth interfaces show up during this time.
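A minimal way to watch this (my own sketch, assuming the oc client is configured on the master):

# On the master: watch the node flip to NotReady and eventually back to Ready
oc get nodes -w

# On the node: count the veth ports on br0; it stays at 0 while the node is NotReady
ovs-ofctl -O OpenFlow13 show br0 | grep -c veth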

Seems it is related to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1388867
Comment 5 Dan Winship 2016-11-02 12:36:37 EDT
(In reply to hongli from comment #3)
> Seems it is related to this bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=1388867

OK, let's wait for that to get fixed and see if this is still reproducible
Comment 6 Anping Li 2016-11-07 05:16:42 EST
For the upgrade path, I hit the same issue. The steps are as follows (a quick check for step 4 is sketched after the list):
1. Install OCP 3.3, create pods, etc.
2. Upgrade to OCP 3.4 and openvswitch 2.4
3. Upgrade openvswitch to 2.5 and restart openvswitch
4. Check the existing pods; some pods cannot be accessed.
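Checking step 4 could look roughly like this (my own sketch; <pod-ip> and <port> are placeholders for whatever a given pod actually exposes):

# List the existing pods with their IPs
oc get pods -o wide

# From the node hosting one of the pods, try its IP directly; in the failing
# case the connection times out because the pod's veth is gone from br0
curl --connect-timeout 5 http://<pod-ip>:<port>/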
Comment 7 Dan Winship 2016-11-07 08:29:09 EST
(In reply to Dan Winship from comment #5)
> (In reply to hongli from comment #3)
> > Seems it is related to this bug:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1388867
> 
> OK, let's wait for that to get fixed and see if this is still reproducible

That bug is fixed in the current packages. Is this bug still reproducible *without* upgrading OVS in the middle?
Comment 8 hongli 2016-11-08 04:48:05 EST
The bug can still be reproduced in OCP 3.4.0.23 *with* or *without* upgrading OVS.

But it is not 100% reproducible now and I haven't found any clue. It seems openvswitch needs to be restarted more times to reproduce the issue.
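Since it is no longer 100% reproducible, a loop along these lines may help to trigger it (a sketch; the iteration count and the sleep are arbitrary):

for i in $(seq 1 10); do
    # Restart OVS, wait for the node service to settle, then count the veth ports
    systemctl restart openvswitch
    sleep 60
    echo "attempt $i: $(ovs-ofctl -O OpenFlow13 show br0 | grep -c veth) veth ports left"
done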

I have reserved the two environments for your debugging.
Comment 10 Dan Winship 2016-11-09 08:32:04 EST
Ah, it's not actually especially OVS-related at all; it's just a badly-timed crash at startup. OpenShift gets restarted when OVS gets restarted (which is correct), sees that it needs to recreate the SDN, and starts doing so. But when it gets to the part where it starts reattaching the pods, it hits a bug and panics. Systemd then restarts it, and this time it sees that the OVS bridge already exists and so skips SDN setup. As a result, the old pods never get reattached.
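The badly-timed panic and the subsequent restart should be visible in the node service journal, roughly like this (a sketch, assuming an RPM install where the node service is named atomic-openshift-node):

# Look for the panic and the systemd restart of the node service around the OVS restart
journalctl -u atomic-openshift-node --since "15 minutes ago" | grep -iE 'panic|started'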
Comment 11 Troy Dawson 2016-11-11 14:32:08 EST
This should be tested on OSE v3.4.0.25 or newer.
Comment 13 hongli 2016-11-14 05:01:30 EST
Verified in v3.4.0.25; the bug has been fixed.
Comment 14 hongli 2016-11-15 01:27:27 EST
Verified as fixed in the v3.4.0.25 containerized env.

But the issue can still be reproduced in the v3.4.0.26 rpm installation env, so I am assigning the bug back.

Leaving the env for debugging.
Comment 16 hongli 2016-11-17 05:18:18 EST
Verified in the 3.4.0.27 rpm installation env; the issue cannot be reproduced.

openshift v3.4.0.27+dffa95c
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0
Comment 18 errata-xmlrpc 2017-01-18 07:46:58 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066
