Bug 1194467
| Summary: | openshift-sdn-node takes forever to stop/restart | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Erik M Jacobs <ejacobs> |
| Component: | Networking | Assignee: | Ravi Sankar <rpenta> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Meng Bo <bmeng> |
| Severity: | low | Docs Contact: | |
| Priority: | low | ||
| Version: | 3.0.0 | CC: | aos-bugs, bleanhar, dmcphers, jokerman, libra-onpremise-devel, lmeyer, mmccomas, rchopra, rpenta, xtian |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2015-11-23 14:44:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Erik M Jacobs
2015-02-19 21:38:32 UTC
This is instantaneous when openshift-sdn-node is working properly. It hangs when the -hostname it's reporting doesn't match any node records the master has (e.g. master has a record for "node1.example.com" and sdn-node is reporting "node1", or there's just no master record at all). I *suspect* this is because sdn-node is trying to look up its subnet on the master (perhaps in order to release it or otherwise change its status - I haven't looked at the code). But that of course is hung up on the problem above, so it just continues retrying until it is killed at a 30s timeout. Naturally, the main reason you'd want to restart openshift-sdn-node in the first place is to fix this problem. I can tighten up the time that systemd waits between SIGTERM and SIGKILL'ing the process. I don't know the service that well, but I don't think there's any sort of cleanup that it's doing currently that we're at risk of interrupting so this is probably safe. Is this still a valid issue now that openshift-sdn is not a separate executable? Rajat, This will still happen when the node cannot determine its subnet from the api, but that should never happen assuming the node has been registered properly. Unfortunately I wasn't able to fully debug why that was happening, I know that the nodes showed up in 'osc get nodes' but perhaps there's something that the node initializes the first time it comes up for real? I think I may have worked around it by bringing the node up first without a network plugin then restarting it with the network plugin set. Does that seem possible? -- Scott We should ensure that an appropriate signal handler is installed before openshift-sdn goes into the loop. SIGKILL works anyway, so lowering the priority. Checked with openshift-sdn version cb0e352cd7591ace30d592d4f82685d2bcd38a04, The node service will get into failed status after several times failing to get subnet. Will close the bug once the PR merged into latest OSE puddle. Nov 04 13:55:54 node2.bmeng.local openshift-node[32762]: W1104 13:55:54.850845 32762 common.go:577] Could not find an allocated subnet for node: node2.bmeng.local, Waiting... Nov 04 13:55:55 node2.bmeng.local openshift-node[32762]: E1104 13:55:55.165221 32762 event.go:207] Unable to write event: 'Post https://10.66.128.62:8443/api/v1/namespaces/default/events: dial tcp 10.66.128.62:8443: getsockopt: connection refused' (may retry after sleeping) Nov 04 13:55:55 node2.bmeng.local openshift-node[32762]: W1104 13:55:55.351724 32762 common.go:577] Could not find an allocated subnet for node: node2.bmeng.local, Waiting... Nov 04 13:55:55 node2.bmeng.local openshift-node[32762]: W1104 13:55:55.852718 32762 common.go:577] Could not find an allocated subnet for node: node2.bmeng.local, Waiting... Nov 04 13:55:56 node2.bmeng.local openshift-node[32762]: W1104 13:55:56.353912 32762 common.go:577] Could not find an allocated subnet for node: node2.bmeng.local, Waiting... Nov 04 13:55:56 node2.bmeng.local openshift-node[32762]: W1104 13:55:56.854742 32762 common.go:577] Could not find an allocated subnet for node: node2.bmeng.local, Waiting... Nov 04 13:55:57 node2.bmeng.local openshift-node[32762]: I1104 13:55:57.094665 32762 kubelet.go:2395] Recording NodeReady event message for node node2.bmeng.local Nov 04 13:55:57 node2.bmeng.local openshift-node[32762]: I1104 13:55:57.094696 32762 kubelet.go:922] Attempting to register node node2.bmeng.local Nov 04 13:55:57 node2.bmeng.local openshift-node[32762]: I1104 13:55:57.095747 32762 kubelet.go:925] Unable to register node2.bmeng.local with the apiserver: Post https://10.66.128.62:8443/api/v1/nodes: dial tcp 10.66.128.62:8443: getsockopt: connection refused Nov 04 13:55:57 node2.bmeng.local openshift-node[32762]: W1104 13:55:57.355638 32762 common.go:577] Could not find an allocated subnet for node: node2.bmeng.local, Waiting... Nov 04 13:55:57 node2.bmeng.local openshift-node[32762]: W1104 13:55:57.856507 32762 common.go:577] Could not find an allocated subnet for node: node2.bmeng.local, Waiting... Nov 04 13:55:58 node2.bmeng.local openshift-node[32762]: W1104 13:55:58.357417 32762 common.go:577] Could not find an allocated subnet for node: node2.bmeng.local, Waiting... Nov 04 13:55:58 node2.bmeng.local openshift-node[32762]: W1104 13:55:58.858349 32762 common.go:577] Could not find an allocated subnet for node: node2.bmeng.local, Waiting... Nov 04 13:55:59 node2.bmeng.local openshift-node[32762]: W1104 13:55:59.359236 32762 common.go:577] Could not find an allocated subnet for node: node2.bmeng.local, Waiting... Nov 04 13:55:59 node2.bmeng.local openshift-node[32762]: W1104 13:55:59.860128 32762 common.go:577] Could not find an allocated subnet for node: node2.bmeng.local, Waiting... Nov 04 13:56:00 node2.bmeng.local openshift-node[32762]: F1104 13:56:00.360409 32762 multitenant.go:49] SDN Node failed: Failed to get subnet for this host: node2.bmeng.local, error: Get https://10.66.128.62:8443/oapi/v1/hostsubnets/node2.bmeng.local: dial tcp 10.66.128.62:8443: getsockopt: connection refused Nov 04 13:56:00 node2.bmeng.local systemd[1]: openshift-node.service: main process exited, code=exited, status=255/n/a Nov 04 13:56:00 node2.bmeng.local systemd[1]: Unit openshift-node.service entered failed state. Verified on openshift v1.0.7-287-g60781e3 |