1316202 – Restart of openvswitch service removes pods veth from bridge.

Summary: Restart of openvswitch service removes pods veth from bridge.

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	3.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Ben Bennett
QA Contact:	Meng Bo
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-03-09 16:32 UTC by Ryan Howe
Modified:	2019-10-10 11:29 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-08-12 18:31:01 UTC
Target Upstream Version:
Embargoed:

Description Ryan Howe 2016-03-09 16:32:41 UTC

Description of problem:
After restarting openvswitch, "ovs-ofctl -O OpenFlow13 show br0" no longer shows the pods' veth devices as connected to the bridge. Newly created pods do show up on the bridge but have no access to the external network.
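
A minimal sketch of how the check above could be scripted. The helper name port_on_bridge and the veth name in the usage comment are hypothetical, and the sketch assumes each port appears in the "show" output as "N(name): addr:...".

```shell
# Hypothetical helper: reads `ovs-ofctl show` output on stdin and succeeds
# if a port with the given name is listed.
# Assumes port lines look like " 3(veth1a2b3c): addr:aa:bb:cc:dd:ee:02".
port_on_bridge() {
    grep -q -F "($1):"
}

# Usage on a node (the veth name is a placeholder):
#   ovs-ofctl -O OpenFlow13 show br0 | port_on_bridge veth1a2b3c \
#       || echo "veth1a2b3c is not attached to br0"
```

The function only inspects text, so it can be run against saved output as well as against a live bridge.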


Version-Release number of selected component (if applicable):
3.1.1.6
oadm v3.1.1.6-16-g5327e56
kubernetes v1.1.0-origin-1107-g4c8e6f4
atomic-openshift-master-3.1.1.6-3.git.16.5327e56.el7aos.x86_64

How reproducible:
100%

Steps to Reproduce:
1. # systemctl restart openvswitch
2. Exec into a pod and curl www.redhat.com: unable to connect.
3. Try building cakephp from the quickstart: unable to pull the image.
4. Pull the image to the node with docker pull and then rebuild cakephp, which produces the error:

   Error: build error: timeout while waiting for remote repository "https://github.com/kotarusv/cakephp-ex.git"

5. Node status shows Ready and antagonistic does not show error. 




Workaround: Reboot Node or stop openvswitch and atomic-openshift-node

Comment 1 Dan Winship 2016-03-10 14:13:14 UTC

"Don't do that then" ? Why did you restart Open vSwitch?

OpenShift can't possibly recover from every random thing the admin might do behind its back...

Comment 2 Ryan Howe 2016-03-10 22:45:15 UTC

More information: 


In this case, let's say I did not restart it manually but it restarted itself for an unknown reason. The issue is that the node shows as Ready even though OpenShift is not functional on it.


If this happens on a node running the registry, the registry pod will restart over and over again.

After OpenVswitch goes down and comes back up:

# oc get pods -l docker-registry -o wide -w

docker-registry-5-nirks   1/1       Running   20        13h       cnode2.example.com
docker-registry-5-nirks   1/1       Running   21        13h       cnode2.example.com
docker-registry-5-nirks   1/1       Running   21        13h       cnode2.example.com
docker-registry-5-nirks   1/1       Running   23        13h       cnode2.example.com
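
The climbing restart count in the output above could be picked out with a small filter. This is only a sketch: the helper name max_restarts is hypothetical, and it assumes the restart count is the fourth whitespace-separated column, as in the rows shown.

```shell
# Hypothetical helper: reads `oc get pods` rows on stdin and prints the
# highest restart count seen for each pod name.
# Assumes column 1 is the pod name and column 4 is the restart count;
# rows whose fourth column is not a number (e.g. a header) are skipped.
max_restarts() {
    awk 'NF >= 4 && $4 ~ /^[0-9]+$/ {
             if (!($1 in max) || $4 + 0 > max[$1]) max[$1] = $4 + 0
         }
         END { for (pod in max) print pod, max[pod] }'
}

# Usage (label selector taken from the command above):
#   oc get pods -l docker-registry -o wide | max_restarts
```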


Logging in to the node and restarting atomic-openshift-node does not fix the issue: the pods keep restarting. Instead, all pods on the node have to be redeployed, or the node has to be rebooted.

Comment 3 Dan Winship 2016-06-07 18:11:09 UTC

So the big issue here is that if the ovs package gets updated, it will restart the ovs service and then mess us up. It's possible that we want a solution which is specific to that problem (eg, make sure that openshift-node gets restarted after ovs gets restarted?) rather than a solution in general.

Comment 4 Scott Dodson 2016-08-12 15:54:55 UTC

So, I was going to add PartOf=openvswitch.service to the node unit when sdn-ovs is in use. This would ensure that whenever openvswitch is restarted, the node is as well. However, when I tried to reproduce the issue by restarting openvswitch and using curl inside my registry container, I wasn't able to. Looking at the logs, the node is already restarted whenever openvswitch is, due to the combination of Requires=openvswitch and Restart=always.

systemd[1]: Stopping Atomic OpenShift Node...
systemd[1]: Stopping Open vSwitch...
systemd[1]: Stopping Open vSwitch Internal Unit...
ovs-ctl[57476]: Killing ovs-vswitchd (56059) [  OK  ]
ovs-ctl[57476]: Killing ovsdb-server (56049) [  OK  ]
systemd[1]: Starting Open vSwitch Internal Unit...
ovs-ctl[57546]: Starting ovsdb-server [  OK  ]
ovs-ctl[57546]: Configuring Open vSwitch system IDs [  OK  ]
ovs-ctl[57546]: Starting ovs-vswitchd [  OK  ]
systemd[1]: Started Open vSwitch Internal Unit.
systemd[1]: Starting Open vSwitch...
ovs-ctl[57546]: Enabling remote OVSDB managers [  OK  ]
systemd[1]: Started Open vSwitch.
systemd[1]: Starting Atomic OpenShift Node...
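
For reference, the PartOf= idea mentioned above could be expressed as a systemd drop-in along these lines. This is a sketch only, not what shipped: the helper name, the drop-in file name and the unit name atomic-openshift-node.service are assumptions, and the log above suggests the existing Requires= plus Restart=always already has the same effect.

```shell
# Hypothetical helper: prints a drop-in that ties the node unit's
# stop/restart to openvswitch.service. It writes nothing to disk.
node_partof_dropin() {
    cat <<'EOF'
[Unit]
PartOf=openvswitch.service
EOF
}

# Usage (directory and file name are assumptions):
#   mkdir -p /etc/systemd/system/atomic-openshift-node.service.d
#   node_partof_dropin > /etc/systemd/system/atomic-openshift-node.service.d/ovs-partof.conf
#   systemctl daemon-reload
```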

So I've re-tested with 3.1 and I see the node is restarted when openvswitch is restarted. However, the test fails; if I `oc rsh` into my registry pod I can't curl google.com after having restarted openvswitch.

If I upgrade to 3.2 the test works. So something has changed in either OSE 3.2 or when moving from docker-1.8.2 to docker-1.10.

Assigning back to networking.

Comment 5 Ben Bennett 2016-08-12 18:31:01 UTC

Given that this works on 3.1 and 3.2 I don't think we need to do any more here.
