1290967 – Hostsubnet is not created and OSE node host doesn't do OVS setup

Summary: Hostsubnet is not created and OSE node host doesn't do OVS setup

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	3.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Dan Williams
QA Contact:	Meng Bo
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1267746

Reported:	2015-12-12 06:59 UTC by Kenjiro Nakayama
Modified:	2019-10-10 10:41 UTC
CC List:	15 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-01-26 19:19:49 UTC
Target Upstream Version:
Embargoed:
Flags:	mleitner: needinfo-

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	2087721	0	None	None	None	Never
Red Hat Product Errata	RHSA-2016:0070	0	normal	SHIPPED_LIVE	Important: Red Hat OpenShift Enterprise 3.1.1 bug fix and enhancement update	2016-01-27 00:12:41 UTC

Description Kenjiro Nakayama 2015-12-12 06:59:13 UTC

Description of problem:

After installing OSE v3.1, some nodes don't set up OVS, so "ip -o addr" shows only:

~~~~
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: ens32    inet 172.21.151.11/24 brd 172.21.151.255 scope global ens32\       valid_lft forever preferred_lft forever
3: docker0    inet 172.17.42.1/16 scope global docker0\       valid_lft forever preferred_lft forever
~~~~

There is no lbr0, tun0, vovsbr, or vlinuxbr.

The customer's node log doesn't say why the network isn't set up. It only outputs:

~~~
Dec 12 00:42:27 xxxx.node.example.com atomic-openshift-node[30464]: W1212 00:42:27.608514   30464 common.go:577] Could not find an allocated subnet for node: xxxx.node.example.com, Waiting...
Dec 12 00:42:28 xxxx.node.example.com atomic-openshift-node[30464]: W1212 00:42:28.110946   30464 common.go:577] Could not find an allocated subnet for node: xxxx.node.example.com, Waiting...
Dec 12 00:42:28 xxxx.node.example.com atomic-openshift-node[30464]: F1212 00:42:28.611204   30464 flatsdn.go:42] SDN Node failed: Failed to get subnet for this host: xxxx.node.example.com, error: hostsubnet "xxxx.node.example.com" not found
~~~
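When going through a longer node journal for this symptom, it can help to pull out just the subnet-wait warnings and the final fatal line. The following is a minimal sketch in Python, assuming the log lines follow the glog-style prefix seen above (severity letter, month and day, time, PID, source file and line). The names `parse_glog` and `subnet_wait_summary` are hypothetical helpers written for this illustration and are not part of any OpenShift tooling.

```python
import re

# Matches the glog-style prefix seen in the excerpt above, e.g.
#   W1212 00:42:27.608514   30464 common.go:577] message...
GLOG_RE = re.compile(
    r"(?P<severity>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<pid>\d+) (?P<source>[\w.]+:\d+)\] (?P<message>.*)"
)


def parse_glog(line):
    """Return the glog fields of a journal line as a dict, or None if absent."""
    match = GLOG_RE.search(line)
    return match.groupdict() if match else None


def subnet_wait_summary(lines):
    """Count the subnet-wait warnings and return the first fatal entry, if any."""
    warnings = 0
    fatal = None
    for line in lines:
        entry = parse_glog(line)
        if entry is None:
            continue
        if (entry["severity"] == "W"
                and "Could not find an allocated subnet" in entry["message"]):
            warnings += 1
        elif entry["severity"] == "F" and fatal is None:
            fatal = entry
    return warnings, fatal
```

Feeding the saved journal lines to `subnet_wait_summary` gives the number of retries and the fatal entry, which shows how long the node waited before giving up.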

Also, the output of "# oc get node" and "# oc get hostsubnets", and the etcd dump, don't contain the affected nodes.

Version-Release number of selected component (if applicable):
- atomic-openshift-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64

How reproducible:
- After installation, the customer got this issue.


Steps to Reproduce:
- No clear reproducer at the moment. The customer installed OSE v3.1 and got this error.

Comment 5 Kenjiro Nakayama 2015-12-13 06:05:09 UTC

As for workarounds, cleaning up and re-installing OSE did not fix this issue.
Creating the hostsubnet manually could be a workaround.

(e.g.)
# oc create -f subnet.json
# cat subnet.json
{
    "kind": "HostSubnet",
    "apiVersion": "v1",
    "metadata": {
        "name": "node1-example.com",
        "selfLink": "/oapi/v1/hostsubnets/node1-example.com",
        "uid": "05f650a6-970e-11e5-a489-525400b33d1d",
        "resourceVersion": "382",
        "creationTimestamp": "2015-12-13T02:56:59Z"
    },
    "host": "node1-example.com",
    "hostIP": "192.168.133.2",
    "subnet": "10.1.1.0/24"
}
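
The example above looks like a copy of an existing object, so it carries metadata fields (selfLink, uid, resourceVersion, creationTimestamp) that the API server normally populates itself. A minimal sketch of generating a fresh manifest with only the fields a user would supply might look like the following, assuming the server fills in the rest on creation. The function name `host_subnet_manifest` is hypothetical, and the script does not check whether the chosen subnet is already allocated to another node.

```python
import ipaddress
import json


def host_subnet_manifest(node_name, host_ip, subnet):
    """Build a HostSubnet manifest with only user-supplied fields.

    Field names are taken from the example in this comment; the values are
    validated only for being well-formed IP/CIDR strings.
    """
    ipaddress.ip_address(host_ip)   # raises ValueError if not a valid IP
    ipaddress.ip_network(subnet)    # raises ValueError if not a valid CIDR
    return {
        "kind": "HostSubnet",
        "apiVersion": "v1",
        "metadata": {"name": node_name},
        "host": node_name,
        "hostIP": host_ip,
        "subnet": subnet,
    }


if __name__ == "__main__":
    manifest = host_subnet_manifest(
        "node1-example.com", "192.168.133.2", "10.1.1.0/24"
    )
    print(json.dumps(manifest, indent=4))
```

The printed JSON could be saved as subnet.json and passed to "oc create -f subnet.json" as in the example above.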

Comment 6 Dan Williams 2015-12-14 18:21:04 UTC

Can you confirm that the RPM version of atomic openshift you have installed is:

3.1.0.4-1.git.15.5e061c3.el7aos

Comment 7 Dan Williams 2015-12-14 18:44:37 UTC

The master is responsible for allocating subnets for each node, so we'd also need the master logs (ideally with --loglevel=5) to debug further.  Any chance you could get those?

Comment 8 Kenjiro Nakayama 2015-12-14 23:52:14 UTC

(In reply to Dan Williams from comment #6)
> Can you confirm that the RPM version of atomic openshift you have installed
> is:
> 
> 3.1.0.4-1.git.15.5e061c3.el7aos

Yes, it is. The package is:

atomic-openshift-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
atomic-openshift-node-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
atomic-openshift-master-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64

Comment 9 Kenjiro Nakayama 2015-12-14 23:57:42 UTC

Yes, we have the master logs, though only at loglevel=2. I am attaching them to this bugzilla as private, since they contain a lot of the customer's information.

Comment 11 Dan Williams 2015-12-15 17:12:07 UTC

Thanks for the logs; at least they indicate that the master had no errors starting up the SDN.

Can you grab the output of:

oc get hostsubnets
oc get nodes

and for each of the nodes listed:

oc describe node <name from 'oc get nodes'>

Comment 13 Kenjiro Nakayama 2015-12-16 02:21:37 UTC

Although I have already provided the information requested in comment #11 in a private comment, I am posting it again as a public comment.

We couldn't get a result for the failed node with oc get node/hostsubnets.

Comment 15 Dan Williams 2015-12-17 22:03:38 UTC

My diagnosis here is that it seems to be taking akprdlog.dirapigw.nz a *very* long time to start the atomic-openshift-node process.  It launches at 00:42:29 and the last interesting log message we get is at 00:42:35 for "Registering credential provider:".  Node registration with the master happens a bit after that message, so it's plausible that node registration isn't actually run within 10 seconds after startup.

This machine seems to be pretty heavily loaded.  But I guess we should increase the timeout for waiting for the node subnet allocation to cover cases like this.

Comment 16 Dan Williams 2015-12-18 16:54:12 UTC

https://github.com/openshift/openshift-sdn/pull/230 bumps the node subnet timeout to 30 seconds.
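
To illustrate why the longer timeout helps on a slow host, here is a minimal sketch of a poll-until-deadline loop. It is written in Python for illustration only and is not the openshift-sdn code, which is Go. The name `wait_for_subnet`, its parameters, and the 0.5 second retry interval (inferred from the spacing of the "Waiting..." log lines in the description) are assumptions.

```python
import time


def wait_for_subnet(lookup, timeout=30.0, interval=0.5,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll lookup() until it returns a subnet or `timeout` seconds elapse.

    lookup   -- callable returning the allocated subnet, or None if not yet
                allocated (a stand-in for asking the master).
    clock    -- injectable time source, so the loop can be tested without
                real waiting.
    sleep    -- injectable sleep function, for the same reason.
    """
    deadline = clock() + timeout
    while True:
        subnet = lookup()
        if subnet is not None:
            return subnet
        if clock() >= deadline:
            raise TimeoutError(
                "no subnet allocated within %.0f seconds" % timeout
            )
        sleep(interval)
```

With a 10 second timeout, a node whose registration only completes after, say, 12 seconds would hit the deadline and fail, while the same node succeeds with a 30 second timeout.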

Comment 19 Dan Williams 2016-01-06 21:33:20 UTC

Upstream origin pull request with fix is https://github.com/openshift/origin/pull/6532

Comment 20 Dan Williams 2016-01-11 18:15:18 UTC

Upstream origin pull request has been merged.  Now we wait until it gets into OpenShift Enterprise RPMs.

Comment 21 Meng Bo 2016-01-12 03:18:26 UTC

I have checked this fix on latest origin v1.1-730-gad80e1f

I deleted the hostsubnet for a specific node right after the node registered with the master, and checked the node log for the subnet allocation failure.
The node keeps retrying for 30 seconds.


@Dan Do you have any other suggestions for verifying the bug or simulating the problem in comment #1?

Comment 22 Meng Bo 2016-01-14 11:01:06 UTC

Checked on AOS build 2016-01-13.2 with rpm version:
atomic-openshift-sdn-ovs-3.1.1.2-1.git.0.30f8d65.el7aos.x86_64
atomic-openshift-3.1.1.2-1.git.0.30f8d65.el7aos.x86_64


The node now waits for 30 seconds when it fails to get a subnet allocated by the master.

Marking the bug as verified.

Comment 25 errata-xmlrpc 2016-01-26 19:19:49 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:0070

Comment 27 Marcel Wysocki 2016-03-29 12:20:09 UTC

seeing the same issue here:
https://bugzilla.redhat.com/show_bug.cgi?id=1320959
