When rebooting a master, there appears to be a possible dependency/error state in which the SDN does not get initialized properly. Restarting the node service resolved the issue and re-initialized the SDN.
Rajat, can we get an update on this?
Not been able to reproduce easily. Some error checks and logging have been added as part of https://github.com/openshift/origin/pull/3665, which will help shed more light on the failure.
Do you have any sense of whether it's a regression / true blocker?
Not a regression. Workaround exists:

  ip link set lbr0 down && brctl delbr lbr0 && systemctl restart openshift-node
The SDN startup now includes more checks to determine whether things are set up correctly and, if not, will reconfigure from scratch. If you see this in the future, please grab the output of:

  ip addr
  ip route
  ovs-ofctl -O OpenFlow13 dump-flows br0
  cat /run/openshift-sdn/docker-network
I've not seen it in quite some time. Do we want to close as "NOTABUG" and reopen if I see it again?
Hello, I have an issue with the SDN when docker takes a long time to restart at boot: it blocks SDN initialization, leaving the lock file behind and preventing further initialization:

Nov 02 15:15:15 node1.demo.lan kernel: device tun0 entered promiscuous mode
Nov 02 15:15:15 node1.demo.lan kernel: IPv6: ADDRCONF(NETDEV_UP): vlinuxbr: link is not ready
Nov 02 15:15:15 node1.demo.lan kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vlinuxbr: link becomes ready
Nov 02 15:15:15 node1.demo.lan ovs-vsctl[2408]: ovs|00001|vsctl|INFO|Called as ovs-vsctl del-port br0 vovsbr
Nov 02 15:15:15 node1.demo.lan ovs-vsctl[2408]: ovs|00002|vsctl|ERR|no port named vovsbr
Nov 02 15:15:15 node1.demo.lan ovs-vsctl[2409]: ovs|00001|vsctl|INFO|Called as ovs-vsctl add-port br0 vovsbr -- set Interface vovsbr ofport_request=9
Nov 02 15:15:15 node1.demo.lan kernel: device vovsbr entered promiscuous mode
Nov 02 15:15:15 node1.demo.lan kernel: device vlinuxbr entered promiscuous mode
Nov 02 15:15:15 node1.demo.lan kernel: lbr0: port 1(vlinuxbr) entered forwarding state
Nov 02 15:15:15 node1.demo.lan kernel: lbr0: port 1(vlinuxbr) entered forwarding state
Nov 02 15:15:16 node1.demo.lan systemd[1]: Reloading.
...
Nov 02 15:15:16 node1.demo.lan systemd[1]: Stopping Docker Application Container Engine...
Nov 02 15:15:16 node1.demo.lan docker[1172]: time="2015-11-02T15:15:16.123506169+01:00" level=info msg="Processing signal 'terminated'"
Nov 02 15:15:16 node1.demo.lan openshift-node[2347]: F1102 15:15:16.131295 2347 node.go:85] ERROR: Unable to check for Docker server version.
Nov 02 15:15:16 node1.demo.lan openshift-node[2347]: unexpected EOF
Nov 02 15:15:16 node1.demo.lan systemd[1]: Starting Docker Storage Setup...
Nov 02 15:15:16 node1.demo.lan systemd[1]: openshift-node.service: main process exited, code=exited, status=255/n/a
Nov 02 15:15:16 node1.demo.lan systemd[1]: Failed to start OpenShift Node.
Nov 02 15:15:16 node1.demo.lan systemd[1]: Unit openshift-node.service entered failed state.
Nov 02 15:15:16 node1.demo.lan systemd[1]: Starting Multi-User System.

Is this the same issue as described in this bug? The workaround is to reboot the node completely, or to remove the lock file and restart the services. The upstream bug is https://github.com/openshift/origin/issues/4903 and I want to ensure the fix will make it into 3.1.
Possible cause: SDN node startup is currently done asynchronously in a goroutine, so there is a window in which the OpenShift core can talk to docker while the SDN code is restarting it, the first time openshift-node starts. None of the node startup actually needs to be asynchronous (node startup always eventually returns to the caller in the origin core), though making it synchronous might mean a delay of a few seconds while docker gets restarted and networking is set up. So we could just call Node() in RunSDNController() instead of doing it from a goroutine.
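To illustrate, here is a minimal Go sketch of the proposed change. This is not the actual origin code: the Node type and its Start() method are simplified placeholders I'm assuming for the example; only the Node()/RunSDNController() names come from the comment above.

package main

import "log"

// Node is a simplified stand-in for the SDN node object in origin;
// the real type is richer. Placeholder for illustration only.
type Node struct{}

// Start is assumed to block until docker has been restarted and the
// SDN is fully configured, returning any setup error.
func (n *Node) Start() error { return nil }

// Current behavior (racy): startup runs in a goroutine, so this
// returns immediately while docker may still be mid-restart, letting
// the OpenShift core race against the restart.
func runSDNControllerAsync(n *Node) {
	go func() {
		if err := n.Start(); err != nil {
			log.Fatalf("SDN node startup failed: %v", err)
		}
	}()
}

// Proposed behavior: call Start() synchronously, so this only returns
// once networking is set up, at the cost of a few seconds of delay
// while docker restarts.
func runSDNControllerSync(n *Node) error {
	return n.Start()
}

func main() {
	n := &Node{}
	if err := runSDNControllerSync(n); err != nil {
		log.Fatal(err)
	}
}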
Thanks for the info. I think the related upstream issue is https://github.com/openshift/origin/issues/4903. It was closed in October, so I was wondering if it made it into OSE 3.1.
(In reply to Evgheni Dereveanchin from comment #14)
> Thanks for the info. I think the related upstream issue is
> https://github.com/openshift/origin/issues/4903.
> It was closed in October, so I was wondering if it made it into OSE 3.1.

If you're referring to the Restart=always bits from that issue, then yes, it should be in OSE 3.1; it should go as far back as 3.0.2-something. Does the /lib/systemd/system/atomic-openshift-node.service file not have Restart=always, and if so, what RPM version is that?
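That is, the [Service] section of that unit should contain the directive below (only Restart=always is confirmed by this bug; the elision stands in for the rest of the unit, which I'm not quoting here):

[Service]
...
Restart=always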
Restart=always appears to have been added to atomic-openshift 3.0.2.901.
Verified this bug with atomic-openshift-node-3.1.1.0-1.git.0.8632732.el7aos.x86_64, and it PASSED. Restart=always is now present in /lib/systemd/system/atomic-openshift-node.service, and QE has not encountered this issue during testing.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:0070