Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1194467

Summary: openshift-sdn-node takes forever to stop/restart
Product: OpenShift Container Platform
Reporter: Erik M Jacobs <ejacobs>
Component: Networking
Assignee: Ravi Sankar <rpenta>
Status: CLOSED CURRENTRELEASE
QA Contact: Meng Bo <bmeng>
Severity: low
Priority: low
Docs Contact:
Version: 3.0.0
CC: aos-bugs, bleanhar, dmcphers, jokerman, libra-onpremise-devel, lmeyer, mmccomas, rchopra, rpenta, xtian
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-11-23 14:44:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Erik M Jacobs 2015-02-19 21:38:32 UTC
Description of problem:
When issuing a systemctl stop or systemctl restart on openshift-sdn-node, it takes a very long time.

Feb 19 16:36:00 ose3-master systemd[1]: Stopping OpenShift SDN Node...
Feb 19 16:36:01 ose3-master openshift-sdn[2468]: E0219 16:36:01.117108 02468 controller.go:206] Could not find an allocated subnet for this minion (ose3-master)(100: Key not found (/registry/sdn/subnets/ose3-master) [19]). Waiting..
...
Feb 19 16:37:29 ose3-master openshift-sdn[2468]: E0219 16:37:29.172607 02468 controller.go:206] Could not find an allocated subnet for this minion (ose3-master)(100: Key not found (/registry/sdn/subnets/ose3-master) [27]). Waiting..
Feb 19 16:37:30 ose3-master systemd[1]: openshift-sdn-node.service stopping timed out. Killing.
Feb 19 16:37:30 ose3-master systemd[1]: openshift-sdn-node.service: main process exited, code=killed, status=9/KILL
Feb 19 16:37:30 ose3-master systemd[1]: Stopped OpenShift SDN Node.
Feb 19 16:37:30 ose3-master systemd[1]: Unit openshift-sdn-node.service entered failed state.

That's 90 seconds to stop (16:36:00 to 16:37:30).

That seems a little long, but I'm not sure what might be causing it.

Comment 2 Luke Meyer 2015-03-19 13:33:49 UTC
This is instantaneous when openshift-sdn-node is working properly. It hangs when the -hostname it's reporting doesn't match any node records the master has (e.g. master has a record for "node1.example.com" and sdn-node is reporting "node1", or there's just no master record at all). 

I *suspect* this is because sdn-node is trying to look up its subnet on the master (perhaps in order to release it or otherwise change its status - I haven't looked at the code). But that lookup is of course hung up on the same problem, so it just keeps retrying until systemd kills it at the stop timeout (90 seconds in the log above). Naturally, the main reason you'd want to restart openshift-sdn-node in the first place is to fix this problem.

Comment 3 Scott Dodson 2015-03-19 13:38:08 UTC
I can tighten up the time that systemd waits between SIGTERM and SIGKILL'ing the process. I don't know the service that well, but I don't think there's any sort of cleanup that it's doing currently that we're at risk of interrupting so this is probably safe.
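A sketch of what that tightening could look like, as a systemd drop-in. The 5-second value and the file path are illustrative, not what actually shipped; the systemd default TimeoutStopSec of 90s matches the kill delay seen in the log above.

```ini
# /etc/systemd/system/openshift-sdn-node.service.d/timeout.conf
# Hypothetical drop-in: shorten the window systemd waits between
# sending SIGTERM and escalating to SIGKILL.
[Service]
TimeoutStopSec=5s
```

A drop-in like this only takes effect after `systemctl daemon-reload`.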

Comment 4 Rajat Chopra 2015-05-26 19:14:22 UTC
Is this still a valid issue now that openshift-sdn is not a separate executable?

Comment 5 Scott Dodson 2015-05-26 19:21:44 UTC
Rajat,

This will still happen when the node cannot determine its subnet from the API, but that should never happen assuming the node has been registered properly.

Unfortunately I wasn't able to fully debug why that was happening. I know that the nodes showed up in 'osc get nodes', but perhaps there's something that the node initializes the first time it comes up for real? I think I may have worked around it by bringing the node up first without a network plugin, then restarting it with the network plugin set. Does that seem possible?

--
Scott

Comment 6 Rajat Chopra 2015-05-29 18:35:36 UTC
We should ensure that an appropriate signal handler is installed before openshift-sdn goes into the loop.
SIGKILL works anyway, so lowering the priority.

Comment 7 Ravi Sankar 2015-11-03 22:20:35 UTC
Fixed in https://github.com/openshift/openshift-sdn/pull/204

Comment 8 Meng Bo 2015-11-04 06:03:56 UTC
Checked with openshift-sdn version cb0e352cd7591ace30d592d4f82685d2bcd38a04.

The node service now enters the failed state after failing several times to get its subnet.

Will close the bug once the PR is merged into the latest OSE puddle.

Nov 04 13:55:54 node2.bmeng.local openshift-node[32762]: W1104 13:55:54.850845   32762 common.go:577] Could not find an allocated subnet for node: node2.bmeng.local, Waiting...
Nov 04 13:55:55 node2.bmeng.local openshift-node[32762]: E1104 13:55:55.165221   32762 event.go:207] Unable to write event: 'Post https://10.66.128.62:8443/api/v1/namespaces/default/events: dial tcp 10.66.128.62:8443: getsockopt: connection refused' (may retry after sleeping)
Nov 04 13:55:55 node2.bmeng.local openshift-node[32762]: W1104 13:55:55.351724   32762 common.go:577] Could not find an allocated subnet for node: node2.bmeng.local, Waiting...
Nov 04 13:55:55 node2.bmeng.local openshift-node[32762]: W1104 13:55:55.852718   32762 common.go:577] Could not find an allocated subnet for node: node2.bmeng.local, Waiting...
Nov 04 13:55:56 node2.bmeng.local openshift-node[32762]: W1104 13:55:56.353912   32762 common.go:577] Could not find an allocated subnet for node: node2.bmeng.local, Waiting...
Nov 04 13:55:56 node2.bmeng.local openshift-node[32762]: W1104 13:55:56.854742   32762 common.go:577] Could not find an allocated subnet for node: node2.bmeng.local, Waiting...
Nov 04 13:55:57 node2.bmeng.local openshift-node[32762]: I1104 13:55:57.094665   32762 kubelet.go:2395] Recording NodeReady event message for node node2.bmeng.local
Nov 04 13:55:57 node2.bmeng.local openshift-node[32762]: I1104 13:55:57.094696   32762 kubelet.go:922] Attempting to register node node2.bmeng.local
Nov 04 13:55:57 node2.bmeng.local openshift-node[32762]: I1104 13:55:57.095747   32762 kubelet.go:925] Unable to register node2.bmeng.local with the apiserver: Post https://10.66.128.62:8443/api/v1/nodes: dial tcp 10.66.128.62:8443: getsockopt: connection refused
Nov 04 13:55:57 node2.bmeng.local openshift-node[32762]: W1104 13:55:57.355638   32762 common.go:577] Could not find an allocated subnet for node: node2.bmeng.local, Waiting...
Nov 04 13:55:57 node2.bmeng.local openshift-node[32762]: W1104 13:55:57.856507   32762 common.go:577] Could not find an allocated subnet for node: node2.bmeng.local, Waiting...
Nov 04 13:55:58 node2.bmeng.local openshift-node[32762]: W1104 13:55:58.357417   32762 common.go:577] Could not find an allocated subnet for node: node2.bmeng.local, Waiting...
Nov 04 13:55:58 node2.bmeng.local openshift-node[32762]: W1104 13:55:58.858349   32762 common.go:577] Could not find an allocated subnet for node: node2.bmeng.local, Waiting...
Nov 04 13:55:59 node2.bmeng.local openshift-node[32762]: W1104 13:55:59.359236   32762 common.go:577] Could not find an allocated subnet for node: node2.bmeng.local, Waiting...
Nov 04 13:55:59 node2.bmeng.local openshift-node[32762]: W1104 13:55:59.860128   32762 common.go:577] Could not find an allocated subnet for node: node2.bmeng.local, Waiting...
Nov 04 13:56:00 node2.bmeng.local openshift-node[32762]: F1104 13:56:00.360409   32762 multitenant.go:49] SDN Node failed: Failed to get subnet for this host: node2.bmeng.local, error: Get https://10.66.128.62:8443/oapi/v1/hostsubnets/node2.bmeng.local: dial tcp 10.66.128.62:8443: getsockopt: connection refused
Nov 04 13:56:00 node2.bmeng.local systemd[1]: openshift-node.service: main process exited, code=exited, status=255/n/a
Nov 04 13:56:00 node2.bmeng.local systemd[1]: Unit openshift-node.service entered failed state.

Comment 9 Meng Bo 2015-11-05 10:18:12 UTC
Verified on openshift v1.0.7-287-g60781e3