Description of problem:

After the OSE v3.1 installation, some nodes don't set up OVS, so "ip -o addr" outputs only:

~~~~
1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
2: ens32 inet 172.21.151.11/24 brd 172.21.151.255 scope global ens32\ valid_lft forever preferred_lft forever
3: docker0 inet 172.17.42.1/16 scope global docker0\ valid_lft forever preferred_lft forever
~~~~

There are no lbr0, tun0, vovsbr or vlinuxbr interfaces. The customer's node log doesn't say why the network was not set up. It only outputs:

~~~
Dec 12 00:42:27 xxxx.node.example.com atomic-openshift-node[30464]: W1212 00:42:27.608514 30464 common.go:577] Could not find an allocated subnet for node: xxxx.node.example.com, Waiting...
Dec 12 00:42:28 xxxx.node.example.com atomic-openshift-node[30464]: W1212 00:42:28.110946 30464 common.go:577] Could not find an allocated subnet for node: xxxx.node.example.com, Waiting...
Dec 12 00:42:28 xxxx.node.example.com atomic-openshift-node[30464]: F1212 00:42:28.611204 30464 flatsdn.go:42] SDN Node failed: Failed to get subnet for this host: xxxx.node.example.com, error: hostsubnet "xxxx.node.example.com" not found
~~~

Also, the output of "# oc get node" and "# oc get hostsubnets" and the etcd dump don't contain the affected nodes.

Version-Release number of selected component (if applicable):
- atomic-openshift-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64

How reproducible:
- After installation, the customer got this issue.

Steps to Reproduce:
- No clear reproducer at the moment. The customer installed OSE v3.1 and got this error.
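The interface check from the description can be scripted. The following is only a minimal sketch: the function names are hypothetical, the list of expected interfaces is taken from the description above, and the parser assumes the one-record-per-line layout that "ip -o addr" prints.

```python
# Interfaces the description above expects on a node with a working SDN.
EXPECTED_SDN_INTERFACES = ("lbr0", "tun0", "vovsbr", "vlinuxbr")

# Output reported for the failing node (copied from the description).
SAMPLE = r"""1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
2: ens32 inet 172.21.151.11/24 brd 172.21.151.255 scope global ens32\ valid_lft forever preferred_lft forever
3: docker0 inet 172.17.42.1/16 scope global docker0\ valid_lft forever preferred_lft forever"""


def interfaces_from_ip_output(text):
    """Return the set of interface names found in `ip -o addr` output."""
    names = set()
    for line in text.splitlines():
        fields = line.split()
        # Each record starts with "<index>: <ifname> ...".
        if len(fields) >= 2 and fields[0].rstrip(":").isdigit():
            names.add(fields[1].rstrip(":"))
    return names


def missing_sdn_interfaces(text, expected=EXPECTED_SDN_INTERFACES):
    """Return the expected interfaces that do not appear in the output."""
    present = interfaces_from_ip_output(text)
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    print(missing_sdn_interfaces(SAMPLE))
```

For the output quoted in this report, all four SDN interfaces are reported as missing.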
As for a workaround, cleaning up and re-installing OSE didn't fix this issue. Creating the hostsubnet manually could be a workaround, e.g.:

  # oc create -f subnet.json
  # cat subnet.json
  {
    "kind": "HostSubnet",
    "apiVersion": "v1",
    "metadata": {
      "name": "node1-example.com",
      "selfLink": "/oapi/v1/hostsubnets/node1-example.com",
      "uid": "05f650a6-970e-11e5-a489-525400b33d1d",
      "resourceVersion": "382",
      "creationTimestamp": "2015-12-13T02:56:59Z"
    },
    "host": "node1-example.com",
    "hostIP": "192.168.133.2",
    "subnet": "10.1.1.0/24"
  }
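The subnet.json for this workaround could also be generated rather than copied from another node. The sketch below is an illustration only: host_subnet_manifest is a hypothetical helper, and it assumes that the metadata fields normally filled in by the API server (selfLink, uid, resourceVersion, creationTimestamp) can be left out of a hand-written manifest. That assumption was not tested against OSE 3.1 in this report.

```python
import ipaddress
import json


def host_subnet_manifest(node_name, host_ip, subnet):
    """Build a HostSubnet object with only the user-supplied fields."""
    # Validate the addresses early so a typo fails here, not in `oc create`.
    ipaddress.ip_address(host_ip)
    ipaddress.ip_network(subnet)
    return {
        "kind": "HostSubnet",
        "apiVersion": "v1",
        # Assumption: server-populated metadata is omitted on purpose.
        "metadata": {"name": node_name},
        "host": node_name,
        "hostIP": host_ip,
        "subnet": subnet,
    }


if __name__ == "__main__":
    manifest = host_subnet_manifest("node1-example.com", "192.168.133.2", "10.1.1.0/24")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved as subnet.json and passed to "oc create -f subnet.json" as in the example above.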
Can you confirm that the RPM version of atomic openshift you have installed is: 3.1.0.4-1.git.15.5e061c3.el7aos
The master is responsible for allocating subnets for each node, so we'd also need the master logs (ideally with --loglevel=5) to debug further. Any chance you could get those?
(In reply to Dan Williams from comment #6)
> Can you confirm that the RPM version of atomic openshift you have installed
> is:
>
> 3.1.0.4-1.git.15.5e061c3.el7aos

Yes, it is. The packages are:

atomic-openshift-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
atomic-openshift-node-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
atomic-openshift-master-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
Yes, we have them, though only at loglevel=2. I have attached them to this bugzilla as a private attachment, since they contain a lot of customer information.
Thanks for the logs; at least they indicate that the master had no errors starting up the SDN. Can you grab the output of:

oc get hostsubnets
oc get nodes

and for each of the nodes listed:

oc describe node <name from 'oc get nodes'>
(I already provided this information in private comment #11, but I am posting it again as a public comment.) We couldn't get any result for the failed node from oc get node/hostsubnets.
My diagnosis here is that it seems to be taking akprdlog.dirapigw.nz a *very* long time to start the atomic-openshift-node process. It launches at 00:42:29 and the last interesting log message we get is at 00:42:35 for "Registering credential provider:". Node registration with the master happens a bit after that message, so it's plausible that node registration isn't actually run within 10 seconds after startup. This machine seems to be pretty heavily loaded. But I guess we should increase the timeout for waiting for the node subnet allocation to cover cases like this.
https://github.com/openshift/openshift-sdn/pull/230 bumps the node subnet timeout to 30 seconds.
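To illustrate why the longer timeout helps, here is a minimal sketch of a poll-until-deadline loop. It is not the openshift-sdn code: wait_for_subnet and lookup are hypothetical names, and the half-second interval is only inferred from the spacing of the "Waiting..." warnings in the node log.

```python
import time


def wait_for_subnet(lookup, timeout=30.0, interval=0.5,
                    clock=time.monotonic, sleep=time.sleep):
    """Call lookup() until it returns a subnet or the timeout expires.

    lookup is a hypothetical callable that returns the subnet string, or
    None while the master has not allocated one for this node yet.
    """
    deadline = clock() + timeout
    while True:
        subnet = lookup()
        if subnet is not None:
            return subnet
        if clock() >= deadline:
            raise TimeoutError("no subnet allocated within %.0f seconds" % timeout)
        sleep(interval)
```

With a slow node whose registration only completes after, say, 15 seconds, a 10-second timeout raises before the subnet exists, while a 30-second timeout returns it. The clock and sleep parameters are injectable so the loop can be exercised without real waiting.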
Upstream origin pull request with fix is https://github.com/openshift/origin/pull/6532
Upstream origin pull request has been merged. Now we wait until it gets into OpenShift Enterprise RPMs.
I have checked this fix on the latest origin, v1.1-730-gad80e1f. I deleted the hostsubnet for a specific node right after the node registered with the master, and checked the node log for the subnet allocation failure. It keeps trying for 30 seconds. @Dan, do you have any other suggestions to verify the bug, or to simulate the problem in comment #1?
Checked on AOS build 2016-01-13.2 with rpm versions:

atomic-openshift-sdn-ovs-3.1.1.2-1.git.0.30f8d65.el7aos.x86_64
atomic-openshift-3.1.1.2-1.git.0.30f8d65.el7aos.x86_64

The node now waits for 30 seconds when it fails to get a subnet allocated by the master. Marking the bug as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:0070
Seeing the same issue here: https://bugzilla.redhat.com/show_bug.cgi?id=1320959