Bug 1290967
Summary: | Hostsubnet is not created and OSE node host doesn't do OVS setup | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Kenjiro Nakayama <knakayam> |
Component: | Networking | Assignee: | Dan Williams <dcbw> |
Status: | CLOSED ERRATA | QA Contact: | Meng Bo <bmeng> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 3.1.0 | CC: | aloughla, aos-bugs, bleanhar, danw, dcbw, erich, haowang, jkaur, jokerman, knakayam, mhepburn, mwysocki, pep, pruan, rkhan |
Target Milestone: | --- | Flags: | mleitner: needinfo- |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2016-01-26 19:19:49 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1267746 |
Description
Kenjiro Nakayama
2015-12-12 06:59:13 UTC
As for a workaround, cleaning up and re-installing OSE did not fix this issue. Creating the hostsubnet manually could be a workaround, e.g.:

    # oc create -f subnet.json
    # cat subnet.json
    {
      "kind": "HostSubnet",
      "apiVersion": "v1",
      "metadata": {
        "name": "node1-example.com",
        "selfLink": "/oapi/v1/hostsubnets/node1-example.com",
        "uid": "05f650a6-970e-11e5-a489-525400b33d1d",
        "resourceVersion": "382",
        "creationTimestamp": "2015-12-13T02:56:59Z"
      },
      "host": "node1-example.com",
      "hostIP": "192.168.133.2",
      "subnet": "10.1.1.0/24"
    }

Can you confirm that the RPM version of atomic openshift you have installed is:

    3.1.0.4-1.git.15.5e061c3.el7aos

The master is responsible for allocating subnets for each node, so we'd also need the master logs (ideally with --loglevel=5) to debug further. Any chance you could get those?

(In reply to Dan Williams from comment #6)
> Can you confirm that the RPM version of atomic openshift you have installed
> is:
>
> 3.1.0.4-1.git.15.5e061c3.el7aos

Yes, it is. The packages are:

    atomic-openshift-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
    atomic-openshift-node-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
    atomic-openshift-master-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64

Yes, we have the master logs, though they are at loglevel=2. I am attaching them to this bugzilla as private, since they contain a lot of customer information.

Thanks for the logs; at least they indicate that the master had no errors starting up the SDN. Can you grab the output of:

    oc get hostsubnets
    oc get nodes

and for each of the nodes listed:

    oc describe node <name from 'oc get nodes'>

(Although I have already provided this information in comment #11 as a private comment, I am posting it again as a public comment.) We couldn't get any results from oc get node/hostsubnets on the failed node.

My diagnosis here is that it seems to be taking akprdlog.dirapigw.nz a *very* long time to start the atomic-openshift-node process. It launches at 00:42:29 and the last interesting log message we get is at 00:42:35 for "Registering credential provider:".
Node registration with the master happens a bit after that message, so it's plausible that node registration isn't actually run within 10 seconds after startup. This machine seems to be pretty heavily loaded. But I guess we should increase the timeout for waiting for the node subnet allocation to cover cases like this.

https://github.com/openshift/openshift-sdn/pull/230 bumps the node subnet timeout to 30 seconds.

Upstream origin pull request with the fix is https://github.com/openshift/origin/pull/6532

Upstream origin pull request has been merged. Now we wait until it gets into OpenShift Enterprise RPMs.

I have checked this fix on latest origin v1.1-730-gad80e1f. I deleted the hostsubnet for a specific node right after the node registered with the master, and checked the node log for the subnet allocation failure. It keeps trying for 30 seconds.

@Dan Do you have any other suggestions to verify the bug, or to simulate the problem in comment #1?

Checked on AOS build 2016-01-13.2 with rpm versions:

    atomic-openshift-sdn-ovs-3.1.1.2-1.git.0.30f8d65.el7aos.x86_64
    atomic-openshift-3.1.1.2-1.git.0.30f8d65.el7aos.x86_64

It waits for 30 seconds when it fails to get the subnet allocation from the master. Marking the bug as verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:0070

Seeing the same issue here: https://bugzilla.redhat.com/show_bug.cgi?id=1320959
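
Illustrative note on the timeout discussed in this bug: the node waits for the master to allocate its HostSubnet and gives up after a fixed period, which the fix raises to 30 seconds (the earlier value appears to have been about 10 seconds, judging from the diagnosis above). The following is only a minimal sketch of that wait-with-deadline pattern, not the openshift-sdn code, which is written in Go. The names `wait_for_subnet` and `lookup`, the polling interval, and the injectable `clock`/`sleep` parameters are all hypothetical.

```python
import time
from typing import Callable, Optional


def wait_for_subnet(
    lookup: Callable[[], Optional[str]],
    timeout: float = 30.0,
    interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `lookup` until it returns a subnet or `timeout` seconds elapse.

    `lookup` is a hypothetical stand-in for asking the master for this
    node's HostSubnet; it returns None while no subnet has been allocated.
    """
    deadline = clock() + timeout
    while True:
        subnet = lookup()
        if subnet is not None:
            return subnet
        if clock() >= deadline:
            # Allocation did not show up in time; the caller decides what to do.
            raise TimeoutError(f"no subnet allocated within {timeout} seconds")
        sleep(interval)
```

On a heavily loaded machine where the allocation only appears after roughly 20 seconds, a call with `timeout=10.0` raises `TimeoutError`, while the default of 30 seconds returns the subnet. That is the situation the longer timeout is meant to cover.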