Bug 1316202

Summary: Restart of openvswitch service removes pod veths from bridge.
Product: OpenShift Container Platform
Component: Networking
Version: 3.1.0
Reporter: Ryan Howe <rhowe>
Assignee: Ben Bennett <bbennett>
QA Contact: Meng Bo <bmeng>
CC: aloughla, aos-bugs, atragler, bbennett, mleitner, rhowe, rkhan, yadu
Status: CLOSED WORKSFORME
Severity: medium
Priority: medium
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Last Closed: 2016-08-12 18:31:01 UTC

Description Ryan Howe 2016-03-09 16:32:41 UTC
Description of problem:
After restarting openvswitch, "ovs-ofctl -O OpenFlow13 show br0" no longer shows the pod's veth devices as connected. New pods that are created do show up, but they do not have access to the external network.
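The attachment check described above can be scripted. A minimal sketch, assuming the usual `ovs-ofctl show` output format in which each port appears as `N(name): addr:...`; the helper name and the sample port names are hypothetical:

```shell
# Hypothetical helper: extract port names from `ovs-ofctl -O OpenFlow13 show br0`
# output, where port lines look like " 3(veth1234abcd): addr:aa:bb:cc:dd:ee:ff".
list_ovs_ports() {
  sed -n 's/^ *[0-9][0-9]*(\([^)]*\)).*/\1/p'
}

# On the node, check whether a given pod veth is still attached to br0:
#   ovs-ofctl -O OpenFlow13 show br0 | list_ovs_ports | grep -qx veth1234abcd
```

After the openvswitch restart, ports that were attached before no longer appear in this list.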

Version-Release number of selected component (if applicable):
oadm v3.1.1.6-16-g5327e56
kubernetes v1.1.0-origin-1107-g4c8e6f4

How reproducible:

Steps to Reproduce:
1. # systemctl restart openvswitch
2. exec into a pod and curl www.redhat.com; unable to connect
3. try building cakephp from the quickstart; unable to pull the image
4. pull the image to the node with docker pull and then rebuild cakephp; produces error:

   Error: build error: timeout while waiting for remote repository "https://github.com/kotarusv/cakephp-ex.git"

5. Node status shows Ready and atomic-openshift-node does not show an error.

Workaround: Reboot the node, or stop openvswitch and atomic-openshift-node.
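The second workaround, sketched as commands (service names as used elsewhere in this report; the start order after the stop is an assumption):

```shell
# Sketch: stop both services, then bring them back up (run on the node as root).
systemctl stop atomic-openshift-node openvswitch
systemctl start openvswitch
systemctl start atomic-openshift-node
```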

Comment 1 Dan Winship 2016-03-10 14:13:14 UTC
"Don't do that then"? Why did you restart Open vSwitch?

OpenShift can't possibly recover from every random thing the admin might do behind its back...

Comment 2 Ryan Howe 2016-03-10 22:45:15 UTC
More information: 

In this case, let's say I did not restart it manually, but it restarted itself for an unknown reason. The issue is that the node shows a Ready state even though OpenShift is not functional.

If this happens on a node running the registry, the registry pod will continue to restart over and over again.

After Open vSwitch goes down and comes back up:

# oc get pods -l docker-registry -o wide -w

docker-registry-5-nirks   1/1       Running   20        13h       cnode2.example.com
docker-registry-5-nirks   1/1       Running   21        13h       cnode2.example.com
docker-registry-5-nirks   1/1       Running   21        13h       cnode2.example.com
docker-registry-5-nirks   1/1       Running   23        13h       cnode2.example.com

Remoting to the node and restarting atomic-openshift-node to try to fix the issue does not recover the pods; they keep restarting. Instead you need to redeploy all pods on the node, or reboot the node.
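Redeploying the pods could look like the following sketch (the pod name is taken from the watch output above; whether `--evacuate` is available on this OSE version is an assumption):

```shell
# Sketch: mark the node unschedulable and move its pods elsewhere.
oadm manage-node cnode2.example.com --schedulable=false
oadm manage-node cnode2.example.com --evacuate

# Or delete an affected pod so its replication controller recreates it:
oc delete pod docker-registry-5-nirks
```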

Comment 3 Dan Winship 2016-06-07 18:11:09 UTC
So the big issue here is that if the ovs package gets updated, it will restart the ovs service and then mess us up. It's possible that we want a solution specific to that problem (e.g., make sure that openshift-node gets restarted after ovs gets restarted) rather than a general solution.

Comment 4 Scott Dodson 2016-08-12 15:54:55 UTC
So, I was going to add PartOf=openvswitch.service to the node when sdn-ovs is in use. This would ensure that whenever openvswitch is restarted, the node is as well. However, when I tried to reproduce the issue by restarting openvswitch and using curl inside my registry container, I wasn't able to. Looking at the logs, the node is already restarted when openvswitch is, due to the combination of Requires=openvswitch and Restart=always.
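The PartOf= idea, sketched as a systemd drop-in (the file path is an assumption for an sdn-ovs install; this is a sketch, not what was actually shipped):

```ini
# /etc/systemd/system/atomic-openshift-node.service.d/openshift-sdn-ovs.conf
# Hypothetical drop-in: stop/restart the node service whenever openvswitch
# is stopped or restarted.
[Unit]
PartOf=openvswitch.service
```

With PartOf=, restarting openvswitch.service propagates a restart to atomic-openshift-node.service, without creating a dependency in the other direction.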

systemd[1]: Stopping Atomic OpenShift Node...
systemd[1]: Stopping Open vSwitch...
systemd[1]: Stopping Open vSwitch Internal Unit...
ovs-ctl[57476]: Killing ovs-vswitchd (56059) [  OK  ]
ovs-ctl[57476]: Killing ovsdb-server (56049) [  OK  ]
systemd[1]: Starting Open vSwitch Internal Unit...
ovs-ctl[57546]: Starting ovsdb-server [  OK  ]
ovs-ctl[57546]: Configuring Open vSwitch system IDs [  OK  ]
ovs-ctl[57546]: Starting ovs-vswitchd [  OK  ]
systemd[1]: Started Open vSwitch Internal Unit.
systemd[1]: Starting Open vSwitch...
ovs-ctl[57546]: Enabling remote OVSDB managers [  OK  ]
systemd[1]: Started Open vSwitch.
systemd[1]: Starting Atomic OpenShift Node...

So I've re-tested with 3.1 and I see the node is restarted when openvswitch is restarted. However, the test fails; if I `oc rsh` into my registry pod I can't curl google.com after having restarted openvswitch.
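The re-test, as commands (the pod name is hypothetical; whether this version of `oc rsh` accepts a trailing command is an assumption):

```shell
# On the node: restart OVS, which also restarts atomic-openshift-node.
systemctl restart openvswitch

# Then check external connectivity from inside the registry pod.
oc rsh docker-registry-5-nirks curl -sS --max-time 10 -o /dev/null http://google.com
```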

If I upgrade to 3.2 the test works. So something has changed in either OSE 3.2 or when moving from docker-1.8.2 to docker-1.10.

Assigning back to networking.

Comment 5 Ben Bennett 2016-08-12 18:31:01 UTC
Given that this works on 3.1 and 3.2 I don't think we need to do any more here.