1243514 – there is possibly a race / error / startup dependency condition where the master's node/sdn doesn't start up properly on boot

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	3.0.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	low
Target Milestone:	---
Target Release:	---
Assignee:	Dan Williams
QA Contact:	Meng Bo
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1267746

Reported:	2015-07-15 16:25 UTC by Erik M Jacobs
Modified:	2019-10-10 09:57 UTC
CC List:	11 users
Fixed In Version:	atomic-openshift-3.0.2.901
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-01-26 19:16:01 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	2037653	0	None	None	None	Never
Red Hat Product Errata	RHSA-2016:0070	0	normal	SHIPPED_LIVE	Important: Red Hat OpenShift Enterprise 3.1.1 bug fix and enhancement update	2016-01-27 00:12:41 UTC

Description Erik M Jacobs 2015-07-15 16:25:08 UTC

When rebooting a master, the SDN can apparently end up in a dependency / error state where it does not get properly initialized. Restarting the node service appears to have resolved the issue and properly re-initialized the SDN.

Comment 3 Ben Parees 2015-07-23 19:40:57 UTC

Rajat, can we get an update on this?

Comment 4 Rajat Chopra 2015-07-24 17:29:22 UTC

I have not been able to reproduce this easily. Some error checks and logging have been added as part of https://github.com/openshift/origin/pull/3665, which should help shed more light on the failure.

Comment 5 Ben Parees 2015-07-24 19:40:02 UTC

Do you have any sense of whether it's a regression or a true blocker?

Comment 6 Rajat Chopra 2015-07-28 00:32:26 UTC

Not a regression.
A workaround exists:
ip link set lbr0 down && brctl delbr lbr0 && systemctl restart openshift-node

Comment 7 Dan Williams 2015-10-30 15:08:06 UTC

The SDN startup now includes more checks to determine whether things are set up correctly, and if not, will reconfigure from scratch.  If you see this in the future, please grab the output of:

ip addr
ip route
ovs-ofctl -O OpenFlow13 dump-flows br0
cat /run/openshift-sdn/docker-network

Comment 8 Erik M Jacobs 2015-10-30 15:19:24 UTC

I've not seen it in quite some time. Do we want to close as "NOTABUG" and reopen if I see it again?

Comment 9 Evgheni Dereveanchin 2015-11-04 12:36:26 UTC

Hello, I have an issue with the SDN when docker takes longer than usual to restart at boot. This blocks SDN initialization, leaves the lock file behind, and prevents any further initialization:

Nov 02 15:15:15 node1.demo.lan kernel: device tun0 entered promiscuous mode
Nov 02 15:15:15 node1.demo.lan kernel: IPv6: ADDRCONF(NETDEV_UP): vlinuxbr: link is not ready
Nov 02 15:15:15 node1.demo.lan kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vlinuxbr: link becomes ready
Nov 02 15:15:15 node1.demo.lan ovs-vsctl[2408]: ovs|00001|vsctl|INFO|Called as ovs-vsctl del-port br0 vovsbr
Nov 02 15:15:15 node1.demo.lan ovs-vsctl[2408]: ovs|00002|vsctl|ERR|no port named vovsbr
Nov 02 15:15:15 node1.demo.lan ovs-vsctl[2409]: ovs|00001|vsctl|INFO|Called as ovs-vsctl add-port br0 vovsbr -- set Interface vovsbr ofport_request=9
Nov 02 15:15:15 node1.demo.lan kernel: device vovsbr entered promiscuous mode
Nov 02 15:15:15 node1.demo.lan kernel: device vlinuxbr entered promiscuous mode
Nov 02 15:15:15 node1.demo.lan kernel: lbr0: port 1(vlinuxbr) entered forwarding state
Nov 02 15:15:15 node1.demo.lan kernel: lbr0: port 1(vlinuxbr) entered forwarding state
Nov 02 15:15:16 node1.demo.lan systemd[1]: Reloading.
...
Nov 02 15:15:16 node1.demo.lan systemd[1]: Stopping Docker Application Container Engine...
Nov 02 15:15:16 node1.demo.lan docker[1172]: time="2015-11-02T15:15:16.123506169+01:00" level=info msg="Processing signal 'terminated'"
Nov 02 15:15:16 node1.demo.lan openshift-node[2347]: F1102 15:15:16.131295    2347 node.go:85] ERROR: Unable to check for Docker server version.
Nov 02 15:15:16 node1.demo.lan openshift-node[2347]: unexpected EOF
Nov 02 15:15:16 node1.demo.lan systemd[1]: Starting Docker Storage Setup...
Nov 02 15:15:16 node1.demo.lan systemd[1]: openshift-node.service: main process exited, code=exited, status=255/n/a
Nov 02 15:15:16 node1.demo.lan systemd[1]: Failed to start OpenShift Node.
Nov 02 15:15:16 node1.demo.lan systemd[1]: Unit openshift-node.service entered failed state.
Nov 02 15:15:16 node1.demo.lan systemd[1]: Starting Multi-User System.

Is this the same issue as described in this bug?

The workaround is to reboot the node completely or remove the lock and restart services.
The upstream bug is https://github.com/openshift/origin/issues/4903 and I want to make sure the fix makes it into 3.1.

Comment 11 Dan Williams 2015-11-06 17:02:55 UTC

Possible cause:

SDN node startup is currently done asynchronously in a goroutine, so the OpenShift core can end up talking to docker while the node is restarting it, which happens the first time openshift-node is started.  We don't actually need to do any of the node startup asynchronously, though, because node startup always eventually returns to the caller in the origin core. Doing it synchronously might mean a delay of a few seconds while docker gets restarted and networking is set up.

So we could just call Node() directly in RunSDNController() instead of doing it from a goroutine.

Comment 14 Evgheni Dereveanchin 2015-11-30 07:55:00 UTC

Thanks for the info. I think the related upstream issue is:
https://github.com/openshift/origin/issues/4903
It had been closed in October so I was wondering if it made it into OSE 3.1.

Comment 15 Dan Williams 2015-12-16 17:23:08 UTC

(In reply to Evgheni Dereveanchin from comment #14)
> Thanks for the info. I think the related upstream issue is:
> https://github.com/openshift/origin/issues/4903
> It had been closed in October so I was wondering if it made it into OSE 3.1.

If you're referring to the Restart=always bits from that issue, yes, it should be in OSE 3.1.  It should go as far back as 3.0.2-something.  Does the /lib/systemd/system/atomic-openshift-node.service file lack Restart=always, and if so, what RPM version is that?
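
A quick way to answer that question on a given host is to look for the Restart= setting in the [Service] section of the unit file. The Go sketch below does that on a string. The function name `restartPolicy` and the sample unit text are invented for illustration (the sample is not the shipped unit file), and the parser deliberately ignores drop-in directories and backslash line continuations.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// restartPolicy returns the last Restart= value found in the [Service]
// section of the given unit file text, or "" if the setting is absent.
func restartPolicy(unit string) string {
	section, policy := "", ""
	sc := bufio.NewScanner(strings.NewReader(unit))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Skip blank lines and comments ("#" or ";").
		if line == "" || strings.HasPrefix(line, "#") || strings.HasPrefix(line, ";") {
			continue
		}
		// Track which section we are in.
		if strings.HasPrefix(line, "[") && strings.HasSuffix(line, "]") {
			section = line
			continue
		}
		if section != "[Service]" {
			continue
		}
		parts := strings.SplitN(line, "=", 2)
		if len(parts) == 2 && strings.TrimSpace(parts[0]) == "Restart" {
			policy = strings.TrimSpace(parts[1])
		}
	}
	return policy
}

// sampleUnit is a made-up unit file used only to exercise the parser.
const sampleUnit = `[Unit]
Description=Example node service

[Service]
ExecStart=/usr/bin/example-node
Restart=always
`

func main() {
	fmt.Println(restartPolicy(sampleUnit)) // prints "always"
}
```

On a real node, the input would be the contents of /lib/systemd/system/atomic-openshift-node.service, for example read with os.ReadFile, and `rpm -qf` on that path reports the owning RPM version.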

Comment 16 Dan Williams 2015-12-16 17:55:36 UTC

Restart=always appears to have been added to atomic-openshift 3.0.2.901.

Comment 17 Johnny Liu 2015-12-31 07:16:41 UTC

Verified this bug with atomic-openshift-node-3.1.1.0-1.git.0.8632732.el7aos.x86_64: PASS.

Restart=always is already present in /lib/systemd/system/atomic-openshift-node.service, and QE has not encountered this issue during testing.

Comment 19 errata-xmlrpc 2016-01-26 19:16:01 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:0070
