Bug 1316202 - Restart of openvswitch service removes pods veth from bridge.
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Assigned To: Ben Bennett
QA Contact: Meng Bo
Reported: 2016-03-09 11:32 EST by Ryan Howe
Modified: 2016-08-12 14:31 EDT (History)
8 users

Doc Type: Bug Fix
Last Closed: 2016-08-12 14:31:01 EDT
Type: Bug

Attachments: None
Description Ryan Howe 2016-03-09 11:32:41 EST
Description of problem:
After restarting openvswitch, "ovs-ofctl -O OpenFlow13 show br0" no longer shows the pods' veth devices as connected. New pods that are created do show up, but they have no access to the external network.

Version-Release number of selected component (if applicable):
oadm v3.1.1.6-16-g5327e56
kubernetes v1.1.0-origin-1107-g4c8e6f4

How reproducible:

Steps to Reproduce:
1. # systemctl restart openvswitch
2. Exec into a pod and curl www.redhat.com; unable to connect.
3. Try building cakephp from the quickstart; unable to pull the image.
4. Pull the image to the node with docker pull and then rebuild cakephp; produces the error:

   Error: build error: timeout while waiting for remote repository "https://github.com/kotarusv/cakephp-ex.git"

5. Node status shows Ready and the atomic-openshift-node log does not show an error.

Workaround: Reboot the node, or stop openvswitch and atomic-openshift-node.
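The missing-veth symptom can be confirmed by comparing the ports attached to br0 before and after the restart. A minimal sketch, with abbreviated sample port listings embedded for illustration (on a live node the input would come from `ovs-ofctl -O OpenFlow13 show br0`; the port names below are hypothetical):

```shell
# Count pod veth ports attached to the OVS bridge by parsing
# ovs-ofctl "show" output lines such as " 3(veth1a2b3c4): addr:...".
count_veths() {
  grep -cE '^ *[0-9]+\(veth'
}

# Sample (abbreviated) output before the restart: two pod veths attached.
before=$(count_veths <<'EOF'
 1(vxlan0): addr:aa:bb:cc:dd:ee:01
 2(tun0): addr:aa:bb:cc:dd:ee:02
 3(veth1a2b3c4): addr:aa:bb:cc:dd:ee:03
 4(veth5d6e7f8): addr:aa:bb:cc:dd:ee:04
EOF
)

# Sample output after `systemctl restart openvswitch`: the veth ports
# are gone from the bridge, which is the bug reported here.
after=$(count_veths <<'EOF'
 1(vxlan0): addr:aa:bb:cc:dd:ee:01
 2(tun0): addr:aa:bb:cc:dd:ee:02
EOF
)

echo "veths before=$before after=$after"
```

If `after` drops to zero while pods are still scheduled on the node, the node is in the broken state even though its status reads Ready.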
Comment 1 Dan Winship 2016-03-10 09:13:14 EST
"Don't do that then" ? Why did you restart Open vSwitch?

OpenShift can't possibly recover from every random thing the admin might do behind its back...
Comment 2 Ryan Howe 2016-03-10 17:45:15 EST
More information: 

In this case, let's say I did not restart it manually, but it restarted itself for an unknown reason. The issue is that the node shows a Ready state even though OpenShift is not functional.

If this happens on a node running the registry, the registry pod will restart over and over again.

After Open vSwitch goes down and comes back up:

# oc get pods -l docker-registry -o wide -w

docker-registry-5-nirks   1/1       Running   20        13h       cnode2.example.com
docker-registry-5-nirks   1/1       Running   21        13h       cnode2.example.com
docker-registry-5-nirks   1/1       Running   21        13h       cnode2.example.com
docker-registry-5-nirks   1/1       Running   23        13h       cnode2.example.com

Remoting to the node and trying to fix the issue by restarting atomic-openshift-node does not recover it; the pods keep restarting. Instead you need to redeploy all pods on the node, or reboot the node.
Comment 3 Dan Winship 2016-06-07 14:11:09 EDT
So the big issue here is that if the ovs package gets updated, it will restart the ovs service and then mess us up. It's possible that we want a solution specific to that problem (e.g., make sure that openshift-node gets restarted after ovs is restarted) rather than a general solution.
Comment 4 Scott Dodson 2016-08-12 11:54:55 EDT
So, I was going to add PartOf=openvswitch.service to the node unit when sdn-ovs is in use. This would ensure that whenever openvswitch is restarted, the node is as well. However, when I tried to reproduce the issue by restarting openvswitch and curling from inside my registry container, I wasn't able to. Looking at the logs, the node is already restarted when openvswitch is, due to the combination of Requires=openvswitch and Restart=always.
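For reference, the PartOf dependency described above could be expressed as a systemd drop-in along these lines (the drop-in path and file name are illustrative, not something the installer is confirmed to ship):

```ini
# /etc/systemd/system/atomic-openshift-node.service.d/openvswitch.conf
# Illustrative drop-in: PartOf= propagates stop/restart of
# openvswitch.service to atomic-openshift-node.service, so the node
# re-wires pod veths into br0 after any OVS restart.
[Unit]
PartOf=openvswitch.service
```

As the log below shows, the same effect was already achieved in practice by Requires=openvswitch plus Restart=always on the node unit, so the drop-in turned out to be unnecessary.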

systemd[1]: Stopping Atomic OpenShift Node...
systemd[1]: Stopping Open vSwitch...
systemd[1]: Stopping Open vSwitch Internal Unit...
ovs-ctl[57476]: Killing ovs-vswitchd (56059) [  OK  ]
ovs-ctl[57476]: Killing ovsdb-server (56049) [  OK  ]
systemd[1]: Starting Open vSwitch Internal Unit...
ovs-ctl[57546]: Starting ovsdb-server [  OK  ]
ovs-ctl[57546]: Configuring Open vSwitch system IDs [  OK  ]
ovs-ctl[57546]: Starting ovs-vswitchd [  OK  ]
systemd[1]: Started Open vSwitch Internal Unit.
systemd[1]: Starting Open vSwitch...
ovs-ctl[57546]: Enabling remote OVSDB managers [  OK  ]
systemd[1]: Started Open vSwitch.
systemd[1]: Starting Atomic OpenShift Node...

So I've re-tested with 3.1 and I see the node is restarted when openvswitch is restarted. However, the test fails; if I `oc rsh` into my registry pod I can't curl google.com after having restarted openvswitch.

If I upgrade to 3.2 the test works. So something has changed in either OSE 3.2 or when moving from docker-1.8.2 to docker-1.10.

Assigning back to networking.
Comment 5 Ben Bennett 2016-08-12 14:31:01 EDT
Given that this works on 3.1 and 3.2 I don't think we need to do any more here.
