Bug 1290967 - Hostsubnet is not created and OSE node host doesn't do OVS setup
Summary: Hostsubnet is not created and OSE node host doesn't do OVS setup
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Dan Williams
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks: 1267746
 
Reported: 2015-12-12 06:59 UTC by Kenjiro Nakayama
Modified: 2019-10-10 10:41 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-01-26 19:19:49 UTC
Target Upstream Version:
Embargoed:
mleitner: needinfo-


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 2087721 | 0 | None | None | None | Never
Red Hat Product Errata | RHSA-2016:0070 | 0 | normal | SHIPPED_LIVE | Important: Red Hat OpenShift Enterprise 3.1.1 bug fix and enhancement update | 2016-01-27 00:12:41 UTC

Description Kenjiro Nakayama 2015-12-12 06:59:13 UTC
Description of problem:

After installing OSE v3.1, some nodes do not set up OVS, so "ip -o addr" shows only:

~~~~
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: ens32    inet 172.21.151.11/24 brd 172.21.151.255 scope global ens32\       valid_lft forever preferred_lft forever
3: docker0    inet 172.17.42.1/16 scope global docker0\       valid_lft forever preferred_lft forever
~~~~

The lbr0, tun0, vovsbr and vlinuxbr interfaces are missing.
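The check above can be automated. The following is a minimal sketch, not part of the original report: it parses one-line `ip` output and lists which of the four interfaces named above are absent. The function names are hypothetical. Note that `ip -o addr` only lists interfaces that carry an address, so `ip -o link` output is probably the safer input; the parser accepts both.

```python
# Interface names taken from the bug report; the report treats their absence
# as the sign that the node did not do its OVS/SDN setup.
EXPECTED_SDN_INTERFACES = ("lbr0", "tun0", "vovsbr", "vlinuxbr")


def interface_names(ip_output):
    """Return the set of interface names in `ip -o addr` or `ip -o link` output."""
    names = set()
    for line in ip_output.splitlines():
        fields = line.split()
        # Each one-line record starts with "<index>: <name> ...".
        if len(fields) >= 2 and fields[0].endswith(":"):
            # Drop a trailing ":" (link output) and any "@peer" suffix.
            names.add(fields[1].split("@")[0].rstrip(":"))
    return names


def missing_sdn_interfaces(ip_output):
    """Return the expected SDN interfaces that are absent, in a fixed order."""
    present = interface_names(ip_output)
    return [name for name in EXPECTED_SDN_INTERFACES if name not in present]


# Output quoted in the description of this bug.
SAMPLE = r"""
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: ens32    inet 172.21.151.11/24 brd 172.21.151.255 scope global ens32\       valid_lft forever preferred_lft forever
3: docker0    inet 172.17.42.1/16 scope global docker0\       valid_lft forever preferred_lft forever
"""

if __name__ == "__main__":
    print(missing_sdn_interfaces(SAMPLE))
```

A non-empty result on a node that should be part of the SDN matches the symptom described here.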

The customer's node log does not say why the network is not set up. It only shows:

~~~
Dec 12 00:42:27 xxxx.node.example.com atomic-openshift-node[30464]: W1212 00:42:27.608514   30464 common.go:577] Could not find an allocated subnet for node: xxxx.node.example.com, Waiting...
Dec 12 00:42:28 xxxx.node.example.com atomic-openshift-node[30464]: W1212 00:42:28.110946   30464 common.go:577] Could not find an allocated subnet for node: xxxx.node.example.com, Waiting...
Dec 12 00:42:28 xxxx.node.example.com atomic-openshift-node[30464]: F1212 00:42:28.611204   30464 flatsdn.go:42] SDN Node failed: Failed to get subnet for this host: xxxx.node.example.com, error: hostsubnet "xxxx.node.example.com" not found
~~~
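The `W` and `F` prefixes in these lines are glog severity letters (warning and fatal), followed by the date, time, process id and source location. As a rough sketch, not taken from the report, the lines can be split into those fields to pull the fatal entry out of a longer journal. The regular expression and function names below are assumptions made for illustration.

```python
import re

# glog header: severity letter, MMDD, hh:mm:ss.uuuuuu, process/thread id,
# source file and line, then "] " and the message.
GLOG_RE = re.compile(
    r"(?P<severity>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<pid>\d+) "
    r"(?P<source>[^\s:]+:\d+)\] (?P<message>.*)"
)


def parse_glog(line):
    """Return the glog fields of a journal line as a dict, or None."""
    match = GLOG_RE.search(line)
    return match.groupdict() if match else None


def fatal_messages(lines):
    """Return the messages of all F (fatal) entries, in order."""
    parsed = (parse_glog(line) for line in lines)
    return [p["message"] for p in parsed if p and p["severity"] == "F"]


# Two of the journal lines quoted in this bug.
SAMPLE_LINES = [
    'Dec 12 00:42:28 xxxx.node.example.com atomic-openshift-node[30464]: '
    'W1212 00:42:28.110946   30464 common.go:577] Could not find an allocated '
    'subnet for node: xxxx.node.example.com, Waiting...',
    'Dec 12 00:42:28 xxxx.node.example.com atomic-openshift-node[30464]: '
    'F1212 00:42:28.611204   30464 flatsdn.go:42] SDN Node failed: Failed to get '
    'subnet for this host: xxxx.node.example.com, error: hostsubnet '
    '"xxxx.node.example.com" not found',
]

if __name__ == "__main__":
    for message in fatal_messages(SAMPLE_LINES):
        print(message)
```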

In addition, the output of "# oc get node" and "# oc get hostsubnets", as well as the etcd dump, does not contain the affected nodes.

Version-Release number of selected component (if applicable):
- atomic-openshift-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64

How reproducible:
- After installation, the customer got this issue.


Steps to Reproduce:
- No clear reproducer at the moment. The customer installed OSE v3.1 and got this error.

Comment 5 Kenjiro Nakayama 2015-12-13 06:05:09 UTC
As for workarounds, cleaning up and reinstalling OSE did not fix this issue.
Creating the hostsubnet manually may work as a workaround.

(e.g.)
# oc create -f subnet.json
# cat subnet.json
{
    "kind": "HostSubnet",
    "apiVersion": "v1",
    "metadata": {
        "name": "node1-example.com",
        "selfLink": "/oapi/v1/hostsubnets/node1-example.com",
        "uid": "05f650a6-970e-11e5-a489-525400b33d1d",
        "resourceVersion": "382",
        "creationTimestamp": "2015-12-13T02:56:59Z"
    },
    "host": "node1-example.com",
    "hostIP": "192.168.133.2",
    "subnet": "10.1.1.0/24"
}
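
The example above looks like it was exported from an existing object: `selfLink`, `uid`, `resourceVersion` and `creationTimestamp` are metadata fields that the API server normally populates itself. The sketch below is an illustration and not part of the original workaround. It strips those fields before the file is passed to `oc create -f`. Whether OSE 3.1 rejects or ignores them on create was not tested here, and the function name is hypothetical.

```python
import json

# Metadata fields that the API server normally fills in on its own. Dropping
# them before "oc create" is an assumption of this sketch; the report does not
# say whether leaving them in causes a problem.
SERVER_POPULATED = ("selfLink", "uid", "resourceVersion", "creationTimestamp")


def minimal_hostsubnet(exported):
    """Return a copy of a HostSubnet object without server-populated metadata."""
    cleaned = dict(exported)
    cleaned["metadata"] = {
        key: value
        for key, value in exported.get("metadata", {}).items()
        if key not in SERVER_POPULATED
    }
    return cleaned


# The object quoted in the comment above.
EXPORTED = {
    "kind": "HostSubnet",
    "apiVersion": "v1",
    "metadata": {
        "name": "node1-example.com",
        "selfLink": "/oapi/v1/hostsubnets/node1-example.com",
        "uid": "05f650a6-970e-11e5-a489-525400b33d1d",
        "resourceVersion": "382",
        "creationTimestamp": "2015-12-13T02:56:59Z",
    },
    "host": "node1-example.com",
    "hostIP": "192.168.133.2",
    "subnet": "10.1.1.0/24",
}

if __name__ == "__main__":
    # Write the result to subnet.json and feed it to "oc create -f".
    print(json.dumps(minimal_hostsubnet(EXPORTED), indent=4))
```

The `host`, `hostIP` and `subnet` values still have to match the affected node, and the subnet presumably must not collide with one already assigned to another node.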

Comment 6 Dan Williams 2015-12-14 18:21:04 UTC
Can you confirm that the RPM version of atomic openshift you have installed is:

3.1.0.4-1.git.15.5e061c3.el7aos

Comment 7 Dan Williams 2015-12-14 18:44:37 UTC
The master is responsible for allocating subnets for each node, so we'd also need the master logs (ideally with --loglevel=5) to debug further.  Any chance you could get those?

Comment 8 Kenjiro Nakayama 2015-12-14 23:52:14 UTC
(In reply to Dan Williams from comment #6)
> Can you confirm that the RPM version of atomic openshift you have installed
> is:
> 
> 3.1.0.4-1.git.15.5e061c3.el7aos

Yes, it is. The installed packages are:

atomic-openshift-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
atomic-openshift-node-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
atomic-openshift-master-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64

Comment 9 Kenjiro Nakayama 2015-12-14 23:57:42 UTC
Yes, we have the master logs, though only at loglevel=2. I have attached them to this Bugzilla as a private attachment, since they contain a lot of customer information.

Comment 11 Dan Williams 2015-12-15 17:12:07 UTC
Thanks for the logs; at least they indicate that the master had no errors starting up the SDN.

Can you grab the output of:

oc get hostsubnets
oc get nodes

and for each of the nodes listed:

oc describe node <name from 'oc get nodes'>

Comment 13 Kenjiro Nakayama 2015-12-16 02:21:37 UTC
Although I have already provided the information requested in comment #11 in a private comment, I am repeating it here in a public comment.

We couldn't get any results for the failed node from oc get node or oc get hostsubnets.

Comment 15 Dan Williams 2015-12-17 22:03:38 UTC
My diagnosis here is that it seems to be taking akprdlog.dirapigw.nz a *very* long time to start the atomic-openshift-node process.  It launches at 00:42:29 and the last interesting log message we get is at 00:42:35 for "Registering credential provider:".  Node registration with the master happens a bit after that message, so it's plausible that node registration isn't actually run within 10 seconds after startup.

This machine seems to be pretty heavily loaded.  But I guess we should increase the timeout for waiting for the node subnet allocation to cover cases like this.

Comment 16 Dan Williams 2015-12-18 16:54:12 UTC
https://github.com/openshift/openshift-sdn/pull/230 bumps the node subnet timeout to 30 seconds.
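
To illustrate why the longer timeout helps, here is a minimal sketch of a poll-until-deadline loop. It is not the openshift-sdn code, which is written in Go. The 0.5 second interval is an assumption read off the log timestamps in the description, the 10 second value comes from comment 15, and `lookup` is a hypothetical callable that returns the node's subnet or `None`.

```python
import time


def wait_for_subnet(lookup, timeout=30.0, interval=0.5,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll lookup() until it returns a subnet or `timeout` seconds pass.

    `clock` and `sleep` are injectable so the loop can be exercised
    without real waiting.
    """
    deadline = clock() + timeout
    while True:
        subnet = lookup()
        if subnet is not None:
            return subnet
        if clock() >= deadline:
            # Corresponds to the fatal "Failed to get subnet" exit in the log.
            raise TimeoutError("no subnet allocated within %.0f seconds" % timeout)
        sleep(interval)
```

If a heavily loaded node only registers with the master, say, 12 seconds after startup, a 10 second deadline expires before the master has had a chance to allocate a subnet, while a 30 second deadline leaves room for it.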

Comment 19 Dan Williams 2016-01-06 21:33:20 UTC
Upstream origin pull request with fix is https://github.com/openshift/origin/pull/6532

Comment 20 Dan Williams 2016-01-11 18:15:18 UTC
Upstream origin pull request has been merged.  Now we wait until it gets into OpenShift Enterprise RPMs.

Comment 21 Meng Bo 2016-01-12 03:18:26 UTC
I have checked this fix on latest origin v1.1-730-gad80e1f

I deleted the hostsubnet for a specific node right after the node registered with the master, and checked the node log for the subnet allocation failure.
The node keeps retrying for 30 seconds.


@Dan Do you have any other suggestions for verifying the bug, or for simulating the problem in comment #1?

Comment 22 Meng Bo 2016-01-14 11:01:06 UTC
Checked on AOS build 2016-01-13.2 with rpm version:
atomic-openshift-sdn-ovs-3.1.1.2-1.git.0.30f8d65.el7aos.x86_64
atomic-openshift-3.1.1.2-1.git.0.30f8d65.el7aos.x86_64


The node now waits for 30 seconds when it fails to get a subnet allocated by the master.

Marking the bug as verified.

Comment 25 errata-xmlrpc 2016-01-26 19:19:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:0070

Comment 27 Marcel Wysocki 2016-03-29 12:20:09 UTC
seeing the same issue here:
https://bugzilla.redhat.com/show_bug.cgi?id=1320959

