Bug 1290967
| Summary: | Hostsubnet is not created and OSE node host doesn't do OVS setup | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Kenjiro Nakayama <knakayam> |
| Component: | Networking | Assignee: | Dan Williams <dcbw> |
| Status: | CLOSED ERRATA | QA Contact: | Meng Bo <bmeng> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 3.1.0 | CC: | aloughla, aos-bugs, bleanhar, danw, dcbw, erich, haowang, jkaur, jokerman, knakayam, mhepburn, mwysocki, pep, pruan, rkhan |
| Target Milestone: | --- | Flags: | mleitner: needinfo- |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-01-26 19:19:49 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1267746 | ||
Description
Kenjiro Nakayama
2015-12-12 06:59:13 UTC
As a workaround, cleaning up and re-installing OSE did not fix this issue.
Creating the hostsubnet manually could be a workaround.
(e.g.)
# oc create -f subnet.json
# cat subnet.json
{
  "kind": "HostSubnet",
  "apiVersion": "v1",
  "metadata": {
    "name": "node1-example.com",
    "selfLink": "/oapi/v1/hostsubnets/node1-example.com",
    "uid": "05f650a6-970e-11e5-a489-525400b33d1d",
    "resourceVersion": "382",
    "creationTimestamp": "2015-12-13T02:56:59Z"
  },
  "host": "node1-example.com",
  "hostIP": "192.168.133.2",
  "subnet": "10.1.1.0/24"
}
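The manifest above looks like it was copied from an existing object, so it carries fields that the API server normally fills in itself (selfLink, uid, resourceVersion, creationTimestamp). As a rough sketch only, the following Python helper builds a minimal manifest with just the fields a user would supply. The helper name `build_hostsubnet` is hypothetical, and leaving out the server-populated fields is an assumption about what a create request needs, not something confirmed in this report.

```python
import ipaddress
import json


def build_hostsubnet(name, host_ip, subnet):
    """Return a minimal HostSubnet manifest as a dict.

    Hypothetical helper: it keeps only the user-supplied fields from the
    example in this report and omits the server-populated metadata
    (selfLink, uid, resourceVersion, creationTimestamp).
    """
    # Both calls raise ValueError on malformed input; ip_network also
    # rejects a subnet with host bits set (for example 10.1.1.5/24).
    ipaddress.ip_address(host_ip)
    ipaddress.ip_network(subnet)
    return {
        "kind": "HostSubnet",
        "apiVersion": "v1",
        "metadata": {"name": name},
        "host": name,
        "hostIP": host_ip,
        "subnet": subnet,
    }


if __name__ == "__main__":
    manifest = build_hostsubnet("node1-example.com", "192.168.133.2", "10.1.1.0/24")
    # Write this output to subnet.json before running: oc create -f subnet.json
    print(json.dumps(manifest, indent=2))
```

The subnet must not overlap a subnet already assigned to another node; this sketch does not check for that.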
Can you confirm that the RPM version of atomic openshift you have installed is: 3.1.0.4-1.git.15.5e061c3.el7aos

The master is responsible for allocating subnets for each node, so we'd also need the master logs (ideally with --loglevel=5) to debug further. Any chance you could get those?

(In reply to Dan Williams from comment #6)
> Can you confirm that the RPM version of atomic openshift you have installed
> is:
>
> 3.1.0.4-1.git.15.5e061c3.el7aos

Yes, it is. The packages are:
atomic-openshift-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
atomic-openshift-node-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
atomic-openshift-master-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64

Yes, we have the master logs, although only at loglevel=2. I have attached them to this bugzilla as private, since they contain a lot of customer information.

Thanks for the logs; at least they indicate that the master had no errors starting up the SDN. Can you grab the output of:
    oc get hostsubnets
    oc get nodes
and for each of the nodes listed:
    oc describe node <name from 'oc get nodes'>

(Although I have already provided the information in comment #11 as a private comment, I will post it again as a public comment.) We couldn't get the result of oc get node/hostsubnets from the failed node.

My diagnosis here is that akprdlog.dirapigw.nz seems to be taking a *very* long time to start the atomic-openshift-node process. It launches at 00:42:29 and the last interesting log message we get is at 00:42:35 for "Registering credential provider:". Node registration with the master happens a bit after that message, so it's plausible that node registration isn't actually run within 10 seconds after startup. This machine seems to be pretty heavily loaded. But I guess we should increase the timeout for waiting for the node subnet allocation to cover cases like this.

https://github.com/openshift/openshift-sdn/pull/230 bumps the node subnet timeout to 30 seconds.
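To illustrate why the length of the wait matters on a slow node, here is a minimal Python sketch of a poll-until-deadline loop. It is not the openshift-sdn implementation (which is written in Go); the function name `wait_for_subnet`, the `fetch_subnet` callback, and the 0.5 second polling interval are all assumptions made for illustration. Only the 10 second and 30 second limits come from this report.

```python
import time


def wait_for_subnet(fetch_subnet, timeout=30.0, interval=0.5,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll fetch_subnet() until it returns a subnet or the timeout expires.

    fetch_subnet is a hypothetical callback that returns the node's subnet
    (for example "10.1.1.0/24") once the master has allocated it, or None
    while no HostSubnet exists yet. clock and sleep are injectable so the
    loop can be exercised without real waiting.
    """
    deadline = clock() + timeout
    while True:
        subnet = fetch_subnet()
        if subnet is not None:
            return subnet
        if clock() >= deadline:
            raise TimeoutError(f"no subnet allocated within {timeout} seconds")
        sleep(interval)
```

In this sketch, a node whose registration takes about 12 seconds fails with `timeout=10` and succeeds with `timeout=30`, which matches the heavily loaded machine described above.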
Upstream origin pull request with fix is https://github.com/openshift/origin/pull/6532

Upstream origin pull request has been merged. Now we wait until it gets into OpenShift Enterprise RPMs.

I have checked this fix on latest origin v1.1-730-gad80e1f. I deleted the hostsubnet for a specific node right after the node registered with the master, and checked the node log for the subnet allocation failure. It keeps trying for 30 seconds.

@Dan Do you have any other suggestions to verify the bug, or to simulate the problem in comment #1?

Checked on AOS build 2016-01-13.2 with rpm versions:
atomic-openshift-sdn-ovs-3.1.1.2-1.git.0.30f8d65.el7aos.x86_64
atomic-openshift-3.1.1.2-1.git.0.30f8d65.el7aos.x86_64
It waits for 30 seconds when it fails to allocate the subnet from the master. Marking the bug as verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2016:0070

Seeing the same issue here: https://bugzilla.redhat.com/show_bug.cgi?id=1320959
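For anyone checking whether they are hitting the same symptom (a node with no hostsubnet), the two listings requested earlier in this report can be compared. Below is a small Python sketch that assumes both listings were saved as JSON (for example with `-o json`) and that each is a list object with an `items` array. The function name `nodes_without_hostsubnet` is hypothetical.

```python
def nodes_without_hostsubnet(node_list, hostsubnet_list):
    """Return the names of nodes that have no matching HostSubnet.

    Both arguments are parsed JSON list objects. The assumed shape is
    {"items": [...]}, with node names under metadata.name and the
    HostSubnet host name under the top-level "host" field, as in the
    manifest shown in this report.
    """
    hosts = {h["host"] for h in hostsubnet_list.get("items", [])}
    return [
        n["metadata"]["name"]
        for n in node_list.get("items", [])
        if n["metadata"]["name"] not in hosts
    ]
```

A non-empty result means at least one registered node never received a subnet from the master.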