Bug 1290967 - Hostsubnet is not created and OSE node host doesn't do OVS setup [NEEDINFO]
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Assigned To: Dan Williams
QA Contact: Meng Bo
Depends On:
Blocks: 1267746
 
Reported: 2015-12-12 01:59 EST by Kenjiro Nakayama
Modified: 2016-03-29 08:20 EDT (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-01-26 14:19:49 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
bmeng: needinfo? (dcbw)




External Trackers
Tracker: Red Hat Knowledge Base (Solution)
ID: 2087721
Priority: None
Status: None
Summary: None
Last Updated: Never

Description Kenjiro Nakayama 2015-12-12 01:59:13 EST
Description of problem:

After OSEv3.1 installation, some nodes do not set up OVS, so "ip -o addr" outputs only:

~~~~
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: ens32    inet 172.21.151.11/24 brd 172.21.151.255 scope global ens32\       valid_lft forever preferred_lft forever
3: docker0    inet 172.17.42.1/16 scope global docker0\       valid_lft forever preferred_lft forever
~~~~

There is no lbr0, tun0, vovsbr, or vlinuxbr.

The customer's node log does not say why the network is not set up. It only outputs:

~~~
Dec 12 00:42:27 xxxx.node.example.com atomic-openshift-node[30464]: W1212 00:42:27.608514   30464 common.go:577] Could not find an allocated subnet for node: xxxx.node.example.com, Waiting...
Dec 12 00:42:28 xxxx.node.example.com atomic-openshift-node[30464]: W1212 00:42:28.110946   30464 common.go:577] Could not find an allocated subnet for node: xxxx.node.example.com, Waiting...
Dec 12 00:42:28 xxxx.node.example.com atomic-openshift-node[30464]: F1212 00:42:28.611204   30464 flatsdn.go:42] SDN Node failed: Failed to get subnet for this host: xxxx.node.example.com, error: hostsubnet "xxxx.node.example.com" not found
~~~
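To see how long a node actually waited before giving up, the timestamps in lines like the ones above can be compared. The following is only a rough sketch: the regular expression is inferred from the three lines in this excerpt, the function names are hypothetical, and the year 2000 is a placeholder because the log lines carry no year.

```python
import re
from datetime import datetime

# Prefix as it appears in the excerpt: severity letter, MMDD, time with
# microseconds, PID, source file:line, then the message.
KLOG = re.compile(r"\b([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ (\S+)\] (.*)$")


def parse_line(line):
    """Return (severity, timestamp, source, message), or None if the line does not match."""
    match = KLOG.search(line)
    if match is None:
        return None
    severity, mmdd, clock, source, message = match.groups()
    # Assumption: a fixed placeholder year, since only differences matter here.
    stamp = datetime.strptime("2000" + mmdd + " " + clock, "%Y%m%d %H:%M:%S.%f")
    return severity, stamp, source, message


def subnet_wait_seconds(lines):
    """Seconds between the first 'allocated subnet' warning and the first fatal line."""
    first_wait = None
    fatal = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        severity, stamp, _source, message = parsed
        if first_wait is None and "Could not find an allocated subnet" in message:
            first_wait = stamp
        if fatal is None and severity == "F":
            fatal = stamp
    if first_wait is None or fatal is None:
        return None
    return (fatal - first_wait).total_seconds()
```

Feeding it the output of journalctl for the atomic-openshift-node unit, one line per list element, gives the length of the retry window that is visible in the log.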

Also, the output of "oc get node" and "oc get hostsubnets" and the etcd dump do not contain the affected nodes.

Version-Release number of selected component (if applicable):
- atomic-openshift-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64

How reproducible:
- After installation, the customer got this issue.


Steps to Reproduce:
- No clear reproducer at the moment. The customer installed OSE v3.1 and got this error.
Comment 5 Kenjiro Nakayama 2015-12-13 01:05:09 EST
As for a workaround, cleaning up and re-installing OSE did not fix this issue.
Creating the hostsubnet manually could be a workaround.

(e.g.)
# oc create -f subnet.json
# cat subnet.json
{
    "kind": "HostSubnet",
    "apiVersion": "v1",
    "metadata": {
        "name": "node1-example.com",
        "selfLink": "/oapi/v1/hostsubnets/node1-example.com",
        "uid": "05f650a6-970e-11e5-a489-525400b33d1d",
        "resourceVersion": "382",
        "creationTimestamp": "2015-12-13T02:56:59Z"
    },
    "host": "node1-example.com",
    "hostIP": "192.168.133.2",
    "subnet": "10.1.1.0/24"
}
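
The selfLink, uid, resourceVersion and creationTimestamp values in the example above look like they were copied from an existing object. In Kubernetes-style APIs those fields are normally filled in by the API server, so a manifest for a new node would usually leave them out. A minimal sketch for generating such a manifest follows. The function name is hypothetical, and it is an assumption (not verified against OSE 3.1) that the server accepts the object without those fields.

```python
import ipaddress
import json


def host_subnet_manifest(node_name, host_ip, subnet_cidr):
    """Build a HostSubnet object with only the fields the admin has to choose."""
    # Fail early on typos; both calls raise ValueError on malformed input.
    ipaddress.ip_address(host_ip)
    ipaddress.ip_network(subnet_cidr)
    return {
        "kind": "HostSubnet",
        "apiVersion": "v1",
        # Assumption: uid, resourceVersion, etc. are set by the server on create.
        "metadata": {"name": node_name},
        "host": node_name,
        "hostIP": host_ip,
        "subnet": subnet_cidr,
    }


if __name__ == "__main__":
    manifest = host_subnet_manifest("node1-example.com", "192.168.133.2", "10.1.1.0/24")
    print(json.dumps(manifest, indent=4))
```

The printed JSON can be saved as subnet.json and passed to "oc create -f subnet.json" as shown above. The subnet has to be one that is not already assigned to another node.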
Comment 6 Dan Williams 2015-12-14 13:21:04 EST
Can you confirm that the RPM version of atomic openshift you have installed is:

3.1.0.4-1.git.15.5e061c3.el7aos
Comment 7 Dan Williams 2015-12-14 13:44:37 EST
The master is responsible for allocating subnets for each node, so we'd also need the master logs (ideally with --loglevel=5) to debug further.  Any chance you could get those?
Comment 8 Kenjiro Nakayama 2015-12-14 18:52:14 EST
(In reply to Dan Williams from comment #6)
> Can you confirm that the RPM version of atomic openshift you have installed
> is:
> 
> 3.1.0.4-1.git.15.5e061c3.el7aos

Yes, it is. The packages are:

atomic-openshift-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
atomic-openshift-node-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
atomic-openshift-master-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
Comment 9 Kenjiro Nakayama 2015-12-14 18:57:42 EST
Yes, we have the master logs, although they are at loglevel=2. I am attaching them to this bug as private, since they contain a lot of the customer's information.
Comment 11 Dan Williams 2015-12-15 12:12:07 EST
Thanks for the logs; at least they indicate that the master had no errors starting up the SDN.

Can you grab the output of:

oc get hostsubnets
oc get nodes

and for each of the nodes listed:

oc describe node <name from 'oc get nodes'>
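
If there are many nodes, the two lists can also be compared by script. This is a sketch under assumptions: it expects the saved output of "oc get nodes -o json" and "oc get hostsubnets -o json" as strings, it needs a separately maintained list of the hostnames that should exist, and the function names are made up for illustration.

```python
import json


def _names(list_json, key=None):
    """Collect names from an 'oc get ... -o json' list document."""
    names = set()
    for item in json.loads(list_json).get("items", []):
        name = item.get(key) if key else None
        # Fall back to metadata.name when the optional key is absent.
        names.add(name or item["metadata"]["name"])
    return names


def check_hosts(expected_hosts, nodes_json, hostsubnets_json):
    """Report, per expected host, whether it is registered and has a subnet."""
    registered = _names(nodes_json)
    with_subnet = _names(hostsubnets_json, key="host")
    return {
        host: {"registered": host in registered, "has_subnet": host in with_subnet}
        for host in expected_hosts
    }
```

A host that shows up as neither registered nor with a subnet matches the symptom in the description, where the failed nodes were missing from both lists.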
Comment 13 Kenjiro Nakayama 2015-12-15 21:21:37 EST
Although I have already provided the information requested in comment #11 in a private comment, I am posting it again as a public comment.

We couldn't get the result of oc get node/hostsubnets from the failed node.
Comment 15 Dan Williams 2015-12-17 17:03:38 EST
My diagnosis here is that it seems to be taking akprdlog.dirapigw.nz a *very* long time to start the atomic-openshift-node process.  It launches at 00:42:29 and the last interesting log message we get is at 00:42:35 for "Registering credential provider:".  Node registration with the master happens a bit after that message, so it's plausible that node registration isn't actually run within 10 seconds after startup.

This machine seems to be pretty heavily loaded.  But I guess we should increase the timeout for waiting for the node subnet allocation to cover cases like this.
Comment 16 Dan Williams 2015-12-18 11:54:12 EST
https://github.com/openshift/openshift-sdn/pull/230 bumps the node subnet timeout to 30 seconds.
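
For readers unfamiliar with the pattern, the wait being tuned here is a poll-until-deadline loop. The sketch below is not the openshift-sdn code (which is Go); it only illustrates the idea in Python. The lookup callable, the function name and the 0.5 second interval are assumptions, the interval being guessed from the spacing of the "Waiting..." log lines in the description.

```python
import time


def wait_for_subnet(lookup, timeout=30.0, interval=0.5,
                    clock=time.monotonic, sleep=time.sleep):
    """Call lookup() until it returns a subnet or the timeout expires.

    lookup is a hypothetical callable returning the subnet, or None while
    the master has not allocated one yet.
    """
    deadline = clock() + timeout
    while True:
        subnet = lookup()
        if subnet is not None:
            return subnet
        if clock() >= deadline:
            raise TimeoutError("no subnet allocated within %.1f seconds" % timeout)
        sleep(interval)
```

With a short timeout, a slow node that registers late hits the TimeoutError branch before the master has had a chance to allocate anything. A longer timeout only widens that window; it does not remove the dependency on registration finishing first.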
Comment 19 Dan Williams 2016-01-06 16:33:20 EST
Upstream origin pull request with fix is https://github.com/openshift/origin/pull/6532
Comment 20 Dan Williams 2016-01-11 13:15:18 EST
Upstream origin pull request has been merged.  Now we wait until it gets into OpenShift Enterprise RPMs.
Comment 21 Meng Bo 2016-01-11 22:18:26 EST
I have checked this fix on the latest origin, v1.1-730-gad80e1f.

I deleted the hostsubnet for a specific node right after the node registered with the master, and checked the node log for the subnet allocation failure.
The node keeps trying for 30 seconds.


@Dan Do you have any other suggestions for verifying the bug, or for simulating the problem in comment #1?
Comment 22 Meng Bo 2016-01-14 06:01:06 EST
Checked on AOS build 2016-01-13.2 with rpm version:
atomic-openshift-sdn-ovs-3.1.1.2-1.git.0.30f8d65.el7aos.x86_64
atomic-openshift-3.1.1.2-1.git.0.30f8d65.el7aos.x86_64


The node waits for 30 seconds when it fails to get a subnet allocated by the master.

Bug verified.
Comment 25 errata-xmlrpc 2016-01-26 14:19:49 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:0070
Comment 27 Marcel Wysocki 2016-03-29 08:20:09 EDT
Seeing the same issue here:
https://bugzilla.redhat.com/show_bug.cgi?id=1320959
