Bug 1290967

Summary: Hostsubnet is not created and OSE node host doesn't do OVS setup
Product: OpenShift Container Platform
Reporter: Kenjiro Nakayama <knakayam>
Component: Networking
Assignee: Dan Williams <dcbw>
Status: CLOSED ERRATA
QA Contact: Meng Bo <bmeng>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.1.0
CC: aloughla, aos-bugs, bleanhar, danw, dcbw, erich, haowang, jkaur, jokerman, knakayam, mhepburn, mwysocki, pep, pruan, rkhan
Target Milestone: ---
Flags: mleitner: needinfo-
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-01-26 19:19:49 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1267746

Description Kenjiro Nakayama 2015-12-12 06:59:13 UTC
Description of problem:

After installing OSE v3.1, some nodes don't set up OVS, so "ip -o addr" outputs only:

~~~~
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: ens32    inet 172.21.151.11/24 brd 172.21.151.255 scope global ens32\       valid_lft forever preferred_lft forever
3: docker0    inet 172.17.42.1/16 scope global docker0\       valid_lft forever preferred_lft forever
~~~~

There are no lbr0, tun0, vovsbr, or vlinuxbr interfaces.

The customer's node log doesn't say why the network isn't set up. It just outputs:

~~~
Dec 12 00:42:27 xxxx.node.example.com atomic-openshift-node[30464]: W1212 00:42:27.608514   30464 common.go:577] Could not find an allocated subnet for node: xxxx.node.example.com, Waiting...
Dec 12 00:42:28 xxxx.node.example.com atomic-openshift-node[30464]: W1212 00:42:28.110946   30464 common.go:577] Could not find an allocated subnet for node: xxxx.node.example.com, Waiting...
Dec 12 00:42:28 xxxx.node.example.com atomic-openshift-node[30464]: F1212 00:42:28.611204   30464 flatsdn.go:42] SDN Node failed: Failed to get subnet for this host: xxxx.node.example.com, error: hostsubnet "xxxx.node.example.com" not found
~~~
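
When several nodes are affected, it can help to scan the node journal for this pattern rather than reading it by eye. The following is a minimal sketch, not part of OpenShift: it assumes the log lines have the shape shown above, and the function name summarize_subnet_failures is made up for illustration.

```python
import re

# Matches the logging part of a journal line as seen in this report, e.g.
# "W1212 00:42:27.608514   30464 common.go:577] message"
LOG_RE = re.compile(
    r"(?P<level>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<pid>\d+) (?P<source>[\w.]+:\d+)\] (?P<message>.*)$"
)


def summarize_subnet_failures(lines):
    """Count subnet-wait warnings and collect fatal messages from node log lines."""
    waiting = 0
    fatal = []
    for line in lines:
        match = LOG_RE.search(line)
        if match is None:
            continue  # line is not in the expected format; skip it
        if match["level"] == "W" and "Could not find an allocated subnet" in match["message"]:
            waiting += 1
        elif match["level"] == "F":
            fatal.append(match["message"])
    return waiting, fatal
```

Feeding it the three lines above would report two warnings followed by one fatal "SDN Node failed" message, which is the signature of this bug.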

Also, the output of "# oc get node" and "# oc get hostsubnets", as well as the etcd dump, doesn't contain the affected nodes.

Version-Release number of selected component (if applicable):
- atomic-openshift-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64

How reproducible:
- After installation, the customer got this issue.


Steps to Reproduce:
- No clear reproducer at the moment. The customer installed OSE v3.1 and got this error.

Comment 5 Kenjiro Nakayama 2015-12-13 06:05:09 UTC
As for a workaround: cleaning up and reinstalling OSE did not fix this issue.
Creating the hostsubnet manually could be a workaround.

(e.g.)
# oc create -f subnet.json
# cat subnet.json
{
    "kind": "HostSubnet",
    "apiVersion": "v1",
    "metadata": {
        "name": "node1-example.com",
        "selfLink": "/oapi/v1/hostsubnets/node1-example.com",
        "uid": "05f650a6-970e-11e5-a489-525400b33d1d",
        "resourceVersion": "382",
        "creationTimestamp": "2015-12-13T02:56:59Z"
    },
    "host": "node1-example.com",
    "hostIP": "192.168.133.2",
    "subnet": "10.1.1.0/24"
}
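
The selfLink, uid, resourceVersion, and creationTimestamp values in the example above look as if they were copied from an existing object. Assuming the API server fills those fields in on creation, as it normally does for other resources, a manifest with only the client-supplied fields should be sufficient. The sketch below generates such a manifest; the helper name minimal_hostsubnet is hypothetical, and the validation uses only the Python standard library.

```python
import ipaddress
import json


def minimal_hostsubnet(name, host_ip, subnet):
    """Build a HostSubnet manifest containing only client-supplied fields."""
    # Raises ValueError if the address or the CIDR is malformed
    # (including a CIDR with host bits set, e.g. 10.1.1.5/24).
    ipaddress.ip_address(host_ip)
    ipaddress.ip_network(subnet)
    return {
        "kind": "HostSubnet",
        "apiVersion": "v1",
        "metadata": {"name": name},  # server-populated metadata is omitted
        "host": name,
        "hostIP": host_ip,
        "subnet": subnet,
    }


if __name__ == "__main__":
    manifest = minimal_hostsubnet("node1-example.com", "192.168.133.2", "10.1.1.0/24")
    print(json.dumps(manifest, indent=4))
```

The printed JSON can be saved as subnet.json and passed to "oc create -f subnet.json" as in the example above. The subnet has to be one that the master has not handed out to another node.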

Comment 6 Dan Williams 2015-12-14 18:21:04 UTC
Can you confirm that the RPM version of atomic openshift you have installed is:

3.1.0.4-1.git.15.5e061c3.el7aos

Comment 7 Dan Williams 2015-12-14 18:44:37 UTC
The master is responsible for allocating subnets for each node, so we'd also need the master logs (ideally with --loglevel=5) to debug further.  Any chance you could get those?

Comment 8 Kenjiro Nakayama 2015-12-14 23:52:14 UTC
(In reply to Dan Williams from comment #6)
> Can you confirm that the RPM version of atomic openshift you have installed
> is:
> 
> 3.1.0.4-1.git.15.5e061c3.el7aos

Yes, it is. The package is:

atomic-openshift-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
atomic-openshift-node-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
atomic-openshift-master-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64

Comment 9 Kenjiro Nakayama 2015-12-14 23:57:42 UTC
Yes, we have the master logs, though they are only at loglevel=2. I am attaching them to this bugzilla as a private attachment, since they contain a lot of the customer's information.

Comment 11 Dan Williams 2015-12-15 17:12:07 UTC
Thanks for the logs; at least they indicate that the master had no errors starting up the SDN.

Can you grab the output of:

oc get hostsubnets
oc get nodes

and for each of the nodes listed:

oc describe node <name from 'oc get nodes'>
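
To avoid typing the describe command once per node, the node names can be pulled out of the "oc get nodes" output. This is a rough sketch that assumes the output is a whitespace-separated table whose first column is headed NAME; the function names are made up for illustration.

```python
import shlex


def node_names(oc_get_nodes_output):
    """Extract the first (NAME) column from tabular `oc get nodes` output."""
    lines = [line for line in oc_get_nodes_output.splitlines() if line.strip()]
    if not lines or lines[0].split()[0] != "NAME":
        return []  # unexpected format; return nothing rather than guess
    return [line.split()[0] for line in lines[1:]]


def describe_commands(names):
    """Build one `oc describe node` command line per node name."""
    return ["oc describe node " + shlex.quote(name) for name in names]
```

The resulting command lines can be run one by one, or joined with newlines and piped to a shell.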

Comment 13 Kenjiro Nakayama 2015-12-16 02:21:37 UTC
Although I have already provided the information requested in comment #11 in a private comment, I will post it again in a public comment.

We couldn't get the result with oc get node/hostsubnets from the failed node.

Comment 15 Dan Williams 2015-12-17 22:03:38 UTC
My diagnosis here is that it seems to be taking akprdlog.dirapigw.nz a *very* long time to start the atomic-openshift-node process.  It launches at 00:42:29 and the last interesting log message we get is at 00:42:35 for "Registering credential provider:".  Node registration with the master happens a bit after that message, so it's plausible that node registration isn't actually run within 10 seconds after startup.

This machine seems to be pretty heavily loaded.  But I guess we should increase the timeout for waiting for the node subnet allocation to cover cases like this.

Comment 16 Dan Williams 2015-12-18 16:54:12 UTC
https://github.com/openshift/openshift-sdn/pull/230 bumps the node subnet timeout to 30 seconds.
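
For readers unfamiliar with the failure mode, the wait can be pictured as a poll-until-deadline loop. The sketch below is not the openshift-sdn code: the function name, the injectable clock, and the 0.5 second interval are assumptions (the interval is inferred from the spacing of the "Waiting..." messages in the description). It only illustrates why a node that registers after 10 seconds fails with a 10 second timeout and succeeds with a 30 second one.

```python
import time


def wait_for_subnet(lookup, timeout=30.0, interval=0.5,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll lookup() until it returns a subnet or the timeout expires.

    lookup is any callable returning the subnet, or None if the master
    has not allocated one yet. clock and sleep are injectable for testing.
    """
    deadline = clock() + timeout
    while True:
        subnet = lookup()
        if subnet is not None:
            return subnet
        if clock() >= deadline:
            raise TimeoutError("no subnet allocated within %.0f seconds" % timeout)
        sleep(interval)
```

With a lookup that only starts returning a subnet 15 seconds after startup, timeout=10.0 raises TimeoutError, while timeout=30.0 returns the subnet.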

Comment 19 Dan Williams 2016-01-06 21:33:20 UTC
Upstream origin pull request with fix is https://github.com/openshift/origin/pull/6532

Comment 20 Dan Williams 2016-01-11 18:15:18 UTC
Upstream origin pull request has been merged.  Now we wait until it gets into OpenShift Enterprise RPMs.

Comment 21 Meng Bo 2016-01-12 03:18:26 UTC
I have checked this fix on the latest origin, v1.1-730-gad80e1f.

I deleted the hostsubnet for a specific node right after the node registered with the master, and checked the node log for the subnet allocation failure.
The node keeps trying for 30 seconds.


@Dan Do you have any other suggestions to verify the bug, or to simulate the problem in comment #1?

Comment 22 Meng Bo 2016-01-14 11:01:06 UTC
Checked on AOS build 2016-01-13.2 with rpm version:
atomic-openshift-sdn-ovs-3.1.1.2-1.git.0.30f8d65.el7aos.x86_64
atomic-openshift-3.1.1.2-1.git.0.30f8d65.el7aos.x86_64


The node waits for 30 seconds when it fails to get a subnet allocated by the master.

Verifying the bug.

Comment 25 errata-xmlrpc 2016-01-26 19:19:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:0070

Comment 27 Marcel Wysocki 2016-03-29 12:20:09 UTC
Seeing the same issue here:
https://bugzilla.redhat.com/show_bug.cgi?id=1320959