Red Hat Bugzilla – Bug 1290967
Hostsubnet is not created and OSE node host doesn't do OVS setup
Last modified: 2016-03-29 08:20:09 EDT
Description of problem:
After OSEv3.1 installation, some nodes do not set up OVS, so "ip -o addr" outputs only:
1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
2: ens32 inet 172.21.151.11/24 brd 172.21.151.255 scope global ens32\ valid_lft forever preferred_lft forever
3: docker0 inet 172.17.42.1/16 scope global docker0\ valid_lft forever preferred_lft forever
There are no lbr0, tun0, vovsbr, or vlinuxbr devices.
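The missing devices can also be confirmed programmatically. A minimal sketch (the device list is taken from the report above; the helper function is illustrative, not part of OpenShift):

```go
package main

import (
	"fmt"
	"net"
)

// checkDevices reports, for each named network device, whether it
// currently exists on this host.
func checkDevices(names []string) map[string]bool {
	present := make(map[string]bool)
	for _, name := range names {
		_, err := net.InterfaceByName(name)
		present[name] = err == nil
	}
	return present
}

func main() {
	// On a healthy OSE 3.1 node all of these SDN devices should exist.
	for name, ok := range checkDevices([]string{"lbr0", "tun0", "vovsbr", "vlinuxbr"}) {
		fmt.Println(name, ok)
	}
}
```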
Thus, the node does not log why it failed to set up the network; it only outputs:
Dec 12 00:42:27 xxxx.node.example.com atomic-openshift-node: W1212 00:42:27.608514 30464 common.go:577] Could not find an allocated subnet for node: xxxx.node.example.com, Waiting...
Dec 12 00:42:28 xxxx.node.example.com atomic-openshift-node: W1212 00:42:28.110946 30464 common.go:577] Could not find an allocated subnet for node: xxxx.node.example.com, Waiting...
Dec 12 00:42:28 xxxx.node.example.com atomic-openshift-node: F1212 00:42:28.611204 30464 flatsdn.go:42] SDN Node failed: Failed to get subnet for this host: xxxx.node.example.com, error: hostsubnet "xxxx.node.example.com" not found
Also, oc get node, oc get hostsubnets, and the etcd dump do not contain the nodes.
Version-Release number of selected component (if applicable):
- The customer hit this issue right after installation.
Steps to Reproduce:
- No clear reproducer at the moment. The customer installed OSE v3.1 and got this error.
As a workaround, cleaning up and re-installing OSE did not fix the issue.
Creating the hostsubnet manually can serve as a workaround:
# oc create -f subnet.json
# cat subnet.json
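The contents of the customer's subnet.json were not included above. For reference, a HostSubnet definition for the manual workaround might look like the following sketch (the hostname, host IP, and subnet values here are illustrative placeholders, not the customer's actual data):

```json
{
    "kind": "HostSubnet",
    "apiVersion": "v1",
    "metadata": {
        "name": "xxxx.node.example.com"
    },
    "host": "xxxx.node.example.com",
    "hostIP": "172.21.151.11",
    "subnet": "10.1.2.0/24"
}
```

The name, host, hostIP, and subnet fields should match the failed node; the subnet must be one that does not collide with the subnets already shown by oc get hostsubnets.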
Can you confirm that the RPM version of atomic openshift you have installed is:
The master is responsible for allocating subnets for each node, so we'd also need the master logs (ideally with --loglevel=5) to debug further. Any chance you could get those?
(In reply to Dan Williams from comment #6)
> Can you confirm that the RPM version of atomic openshift you have installed
Yes, it is. The package is:
Yes, we have, it is loglevel=2 though. I attach it to this bugzilla in private, since it contains a lot of customer's info.
Thanks for the logs; at least they indicate that the master had no errors starting up the SDN.
Can you grab the output of:
oc get hostsubnets
oc get nodes
and for each of the nodes listed:
oc describe node <name from 'oc get nodes'>
(Although I had already provided this information in comment#11 as a private comment, I will post it again as a public comment.)
We couldn't get any result from oc get node / oc get hostsubnets for the failed node.
My diagnosis here is that it seems to be taking akprdlog.dirapigw.nz a *very* long time to start the atomic-openshift-node process. It launches at 00:42:29 and the last interesting log message we get is at 00:42:35 for "Registering credential provider:". Node registration with the master happens a bit after that message, so it's plausible that node registration isn't actually run within 10 seconds after startup.
This machine seems to be pretty heavily loaded. But I guess we should increase the timeout for waiting for the node subnet allocation to cover cases like this.
https://github.com/openshift/openshift-sdn/pull/230 bumps the node subnet timeout to 30 seconds.
Upstream origin pull request with fix is https://github.com/openshift/origin/pull/6532
Upstream origin pull request has been merged. Now we wait until it gets into OpenShift Enterprise RPMs.
I have checked this fix on the latest origin, v1.1-730-gad80e1f.
I deleted the hostsubnet for a specific node right after the node registered with the master, and checked the node log for the subnet allocation failure.
The node kept retrying for 30 seconds.
@Dan Do you have any other suggestions to verify the bug, or simulate the problem in comment#1 ?
Checked on AOS build 2016-01-13.2 with rpm version:
It waits for 30 seconds when it fails to get a subnet allocated from the master.
Marking the bug verified.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
seeing the same issue here: