Description of problem:
After restarting openvswitch, "ovs-ofctl -O OpenFlow13 show br0" no longer shows the pods' veth devices as connected. New pods that are created do show up, but they do not have access to the external network.

Version-Release number of selected component (if applicable):
3.1.1.6
oadm v3.1.1.6-16-g5327e56
kubernetes v1.1.0-origin-1107-g4c8e6f4
atomic-openshift-master-3.1.1.6-3.git.16.5327e56.el7aos.x86_64

How reproducible:
100%

Steps to Reproduce:
1. # systemctl restart openvswitch
2. Exec into a pod and curl www.redhat.com; unable to connect.
3. Try building cakephp from the quickstart; unable to pull the image.
4. Pull the image to the node with docker pull and rebuild cakephp; the build fails with: Error: build error: timeout while waiting for remote repository "https://github.com/kotarusv/cakephp-ex.git"
5. Node status shows Ready and diagnostics do not report an error.

Workaround: Reboot the node, or stop openvswitch and atomic-openshift-node.
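For reference, a quick way to confirm the symptom from the node, assuming the default SDN bridge name br0 (commands from a standard Open vSwitch install; the per-pod veth port names vary):

# Before the restart, list-ports shows tun0, vxlan0 and one vethXXXX port per pod:
# ovs-vsctl list-ports br0
# systemctl restart openvswitch
# After the restart, the pod veth ports are no longer listed:
# ovs-ofctl -O OpenFlow13 show br0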
"Don't do that then" ? Why did you restart Open vSwitch? OpenShift can't possibly recover from every random thing the admin might do behind its back...
More information: In this case, let's say I did not restart it manually but it restarted itself for an unknown reason. The issue is that the node shows a Ready state even though OpenShift is not functional. If this happens on a node running the registry, the registry will keep restarting over and over. After Open vSwitch goes down and comes back up:

# oc get pods -l docker-registry -o wide -w
docker-registry-5-nirks   1/1   Running   20   13h   cnode2.example.com
docker-registry-5-nirks   1/1   Running   21   13h   cnode2.example.com
docker-registry-5-nirks   1/1   Running   21   13h   cnode2.example.com
docker-registry-5-nirks   1/1   Running   23   13h   cnode2.example.com

Remoting into the node and restarting atomic-openshift-node does not recover it; the pods keep restarting. Instead, all pods on the node need to be redeployed, or the node rebooted.
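A rough sketch of that workaround, using the node name from the output above; the exact oadm options may differ between releases, so treat this as an assumption rather than a verified procedure:

# Mark the node unschedulable and evacuate (redeploy) its pods elsewhere:
# oadm manage-node cnode2.example.com --schedulable=false
# oadm manage-node cnode2.example.com --evacuate
# ...or simply reboot the node:
# systemctl reboot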
So the big issue here is that if the ovs package gets updated, it will restart the ovs service and that breaks us. It's possible that we want a solution specific to that problem (e.g., make sure that openshift-node gets restarted after ovs gets restarted) rather than a general solution.
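A minimal sketch of that kind of targeted fix, as a systemd drop-in for the node service; the drop-in path and the use of PartOf= here are assumptions for illustration, not what was actually shipped:

# /etc/systemd/system/atomic-openshift-node.service.d/ovs-restart.conf
# PartOf= propagates a stop or restart of openvswitch.service
# (e.g. from a package update) to atomic-openshift-node.service.
[Unit]
PartOf=openvswitch.service

# Reload systemd so the drop-in takes effect:
# systemctl daemon-reload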
So, I was going to add PartOf=openvswitch.service to the node service when sdn-ovs is in use. This would ensure that whenever openvswitch is restarted, the node is as well. However, when I tried to reproduce the issue by restarting openvswitch and using curl inside my registry container, I wasn't able to. Looking at the logs, the node is already restarted when openvswitch is, due to the combination of Requires=openvswitch and Restart=always.

systemd[1]: Stopping Atomic OpenShift Node...
systemd[1]: Stopping Open vSwitch...
systemd[1]: Stopping Open vSwitch Internal Unit...
ovs-ctl[57476]: Killing ovs-vswitchd (56059) [ OK ]
ovs-ctl[57476]: Killing ovsdb-server (56049) [ OK ]
systemd[1]: Starting Open vSwitch Internal Unit...
ovs-ctl[57546]: Starting ovsdb-server [ OK ]
ovs-ctl[57546]: Configuring Open vSwitch system IDs [ OK ]
ovs-ctl[57546]: Starting ovs-vswitchd [ OK ]
systemd[1]: Started Open vSwitch Internal Unit.
systemd[1]: Starting Open vSwitch...
ovs-ctl[57546]: Enabling remote OVSDB managers [ OK ]
systemd[1]: Started Open vSwitch.
systemd[1]: Starting Atomic OpenShift Node...

So I've re-tested with 3.1, and I see the node is restarted when openvswitch is restarted. However, the test fails: if I `oc rsh` into my registry pod, I can't curl google.com after having restarted openvswitch. If I upgrade to 3.2, the test works. So something has changed either in OSE 3.2 or in moving from docker-1.8.2 to docker-1.10. Assigning back to networking.
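For context, the restart behaviour shown in the log above follows from directives along these lines in the node unit (a paraphrased excerpt based on the directives named above, not the exact shipped unit file):

# atomic-openshift-node.service (relevant excerpt, paraphrased)
[Unit]
Requires=openvswitch.service   # stopping/restarting openvswitch propagates to the node

[Service]
Restart=always                 # the node service is started again after it stops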
Given that this works on 3.1 and 3.2 I don't think we need to do any more here.