Created attachment 1139901 [details]
node log

Description of problem:
After the ansible installer finished successfully, all nodes are NotReady.

Mar 24 09:01:45 master-a1-6cu85xf5.osp.sfa.se docker[31379]: W0324 09:01:45.088372 31430 subnets.go:150] Could not find an allocated subnet for node: master-a1-6cu85xf5.osp.sfa.se, Waiting...
Mar 24 09:01:45 master-a1-6cu85xf5.osp.sfa.se docker[31379]: W0324 09:01:45.590567 31430 subnets.go:150] Could not find an allocated subnet for node: master-a1-6cu85xf5.osp.sfa.se, Waiting...
Mar 24 09:01:46 master-a1-6cu85xf5.osp.sfa.se docker[31379]: W0324 09:01:46.092582 31430 subnets.go:150] Could not find an allocated subnet for node: master-a1-6cu85xf5.osp.sfa.se, Waiting...
Mar 24 09:01:46 master-a1-6cu85xf5.osp.sfa.se docker[31379]: F0324 09:01:46.592739 31430 node.go:175] SDN Node failed: Failed to start plugin: Failed to get subnet for this host: master-a1-6cu85xf5.osp.sfa.se, error: hostsubnet "master-a1-6cu85xf5.osp.sfa.se" not found

There are no HostSubnets created for any nodes.

Version-Release number of selected component (if applicable):
[root@master-a2-ez8xmop2 origin]# openshift version
openshift v3.1.1.6-21-gcd70c35
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2
Created attachment 1139902 [details] master api log
This also seems to be related to https://bugzilla.redhat.com/show_bug.cgi?id=1290967
Marcel, Thanks, the logs are helpful. What we really need logs from is the controllers service, can you verify that `atomic-openshift-master-controllers` is running on your masters and gather their logs? That's the service that watches for new nodes to be registered and creates other resources. -- Scott
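For anyone following along, the check and log collection asked for above would look roughly like this (unit name taken from the comment; adjust if your install differs):

```
[root@master ~]# systemctl status atomic-openshift-master-controllers
[root@master ~]# journalctl -u atomic-openshift-master-controllers --no-pager > controllers.log
```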
Created attachment 1141137 [details] new master-api log
Created attachment 1141138 [details] new node log
Created attachment 1141139 [details] master controllers log
I added logs from a fresh deployment.
I don't see anything wrong that'd be specific to the installer. Assigning to networking team.
Do you have any idea _when_ the HostSubnets should be created?
I have a gut feeling this could be related to DNS not being set up properly. When using IP addresses instead of hostnames it worked, even though /etc/hosts contains all nodes and their addresses.
After retrying with proper DNS settings, the issue still occurs.
I don't think the problem here is the HostSubnets per se. There seems to be some sort of generic master-node communication problem. E.g.:

Mar 29 08:25:55 master-a1-y57c1gdo.osp.sfa.se atomic-openshift-master-controllers[21487]: I0329 04:25:55.883698 1 nodecontroller.go:631] node node-a2-x3jt6e89.osp.sfa.se hasn't been updated for 3m15.900125363s. Last ready condition is: {Type:Ready Status:Unknown LastHeartbeatTime:2016-03-29 04:22:35 -0400 EDT LastTransitionTime:2016-03-29 04:23:20 -0400 EDT Reason:NodeStatusUnknown Message:Kubelet stopped posting node status.}

Also, the node logs start out with:

Starting a node connected to https://openshift.sfa.se:8443

which doesn't seem to match the "master-a1-y57c1gdo.osp.sfa.se" name seen elsewhere. Is it possible that you have two different OpenShift clusters configured, and the nodes are connecting to the wrong master?
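A quick way to see both symptoms side by side, node readiness and whether any HostSubnets were allocated at all (node names here are placeholders):

```
[root@master ~]# oc get nodes
[root@master ~]# oc get hostsubnets
[root@master ~]# oc describe node node-a2.example.com
```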
No, it's one cluster, using:

openshift_master_cluster_method=native
openshift_master_cluster_hostname=openshift.sfa.se
openshift_master_cluster_public_hostname=openshift.sfa.se

That hostname resolves to the 3 floating IP addresses of the masters, but the issue also happened when it was resolving to the internal IP of the first master. The node stops posting its own status because its SDN fails; it just gives up at that point.
(In reply to Marcel Wysocki from comment #9)
> Do you have any idea _when_ the HostSubnets should be created?

The master should create it when it observes that a new node has been added:

Mar 29 08:14:45 master-a1-y57c1gdo.osp.sfa.se atomic-openshift-master-controllers[21487]: I0329 04:14:45.577085 1 nodecontroller.go:357] NodeController observed a new Node: api.Node...

If an error occurred, it would log it. So given that it did observe the node being added, and didn't log an error about creating the HostSubnet (and apparently didn't create a HostSubnet), that makes me wonder if maybe the master doesn't realize we're using the openshift-sdn networking plugin? Can you attach the master config file?
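A minimal sketch of the check being asked for, run against a hypothetical excerpt of a master config (the values below are illustrative; the real file is the attachment on this bug):

```shell
# Hypothetical excerpt of /etc/origin/master/master-config.yaml, written to a
# temp file so the check below has something to run against. The master only
# allocates HostSubnets when an openshift-sdn plugin is configured here.
cat > /tmp/master-config-excerpt.yaml <<'EOF'
networkConfig:
  networkPluginName: redhat/openshift-ovs-subnet
  clusterNetworkCIDR: 10.1.0.0/16
  hostSubnetLength: 8
  serviceNetworkCIDR: 172.30.0.0/16
EOF

# If this prints nothing, the master is not running an SDN plugin and will
# never create HostSubnets for new nodes.
grep networkPluginName /tmp/master-config-excerpt.yaml
```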
I set up a second environment. same issue happens here, but this will give me more possibilities to debug. I'll attach the master config.
Created attachment 1143901 [details] master-config
Deleting the node and restarting the service (on the same host) also doesn't work. It gets re-added, but the HostSubnet is never created.
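For reference, the delete-and-restart sequence tried here looks roughly like this (node name is a placeholder):

```
[root@master ~]# oc delete node node-a1.example.com
[root@master ~]# systemctl restart atomic-openshift-node
[root@master ~]# oc get hostsubnets
```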
Do you need any more information from me ?
Hi,

I also have seen this problem while trying out the OpenStack integration on OSE in OS1. If I add to /etc/origin/node/node-config.yaml:

kubeletArguments:
  cloud-provider:
  - "openstack"
  cloud-config:
  - "/etc/cloud.conf"

and restart atomic-openshift-node, I get the error:

Apr 07 06:18:29 master02.jlab.rhc-ose.labs.redhat.com atomic-openshift-node[9301]: I0407 06:18:29.036810 9301 openstack.go:289] Claiming to support Instances
Apr 07 06:18:29 master02.jlab.rhc-ose.labs.redhat.com atomic-openshift-node[9301]: I0407 06:18:29.135445 9301 kubelet.go:2499] Recording NodeReady event message for node master02-jlab.os1.phx2.redhat.com
Apr 07 06:18:29 master02.jlab.rhc-ose.labs.redhat.com atomic-openshift-node[9301]: I0407 06:18:29.135526 9301 kubelet.go:972] Attempting to register node master02-jlab.os1.phx2.redhat.com
Apr 07 06:18:29 master02.jlab.rhc-ose.labs.redhat.com atomic-openshift-node[9301]: E0407 06:18:29.186169 9301 kubelet.go:1011] Previously "master02-jlab.os1.phx2.redhat.com" had externalID "9a3923a3-fe4d-49ac
Apr 07 06:18:29 master02.jlab.rhc-ose.labs.redhat.com atomic-openshift-node[9301]: E0407 06:18:29.190361 9301 kubelet.go:1013] Unable to delete old node: User "system:node:master02-jlab.os1.phx2.redhat.com"

So I delete the node with:

[root@master02 ~]# oc delete node master02-jlab.os1.phx2.redhat.com
node "master02-jlab.os1.phx2.redhat.com" deleted

and restart atomic-openshift-node:

[root@master02 ~]# systemctl restart atomic-openshift-node
Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.
And in the log I get:

Apr 07 05:08:38 master02.jlab.rhc-ose.labs.redhat.com atomic-openshift-node[2491]: I0407 05:08:38.768106 2491 openstack.go:289] Claiming to support Instances
Apr 07 05:08:39 master02.jlab.rhc-ose.labs.redhat.com atomic-openshift-node[2491]: W0407 05:08:39.278492 2491 subnets.go:150] Could not find an allocated subnet for node: master02-jlab.os1.phx2.redhat.com,
Apr 07 05:08:39 master02.jlab.rhc-ose.labs.redhat.com atomic-openshift-node[2491]: W0407 05:08:39.791434 2491 subnets.go:150] Could not find an allocated subnet for node: master02-jlab.os1.phx2.redhat.com,

Nothing works until I comment out the OpenStack-related lines in /etc/origin/node/node-config.yaml, remove the node with oc delete node, and restart atomic-openshift-node. So maybe it's related to the OpenStack integration in the node?

/Jonas Nordell
Marking this as not a release blocker since we don't think it's a regression. We are still going to keep working on it.
I also hit this in an OSE on OpenStack setup (single master, 2 additional nodes). In my case, though, the OpenStack integration worked: the cluster was initially deployed with the heat templates, which don't configure OpenStack integration. After that, I manually configured the OpenStack integration and things worked fine: after a manual delete of the previous nodes, they re-registered and everything worked, including cinder-backed PVs.

Later I tried to reconfigure the cluster to use the openshift-ovs-multitenant plugin and things stopped working. I deleted the nodes; they re-registered but would not get a subnet. I switched back to the previous config (with the openshift-ovs-subnet plugin) and it is still not working. The master was failing a few seconds after start with:

apr 12 06:43:09 master.example.com atomic-openshift-master[5509]: F0412 06:43:09.891855 5509 run_components.go:340] SDN initialization failed: Failed to start plugin: Invalid node IP

Deleting the nodes allowed the master to start and keep running. Restarting the node services, I see them registering but no HostSubnet created.

atomic-openshift-3.1.1.6-4.git.32.adf8ec9.el7aos.x86_64
I've built some packages with additional debug logging on master at http://people.redhat.com/dwinship/aos/. Would any of you be able to try those on master, restart master, then try adding a node, and see what gets logged?

Oh, better yet, if you could run master with additional logging (e.g., add "--loglevel=5" to the command line in /usr/lib/systemd/system/atomic-openshift-master.service before restarting).
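The loglevel change suggested above would go something like this; the exact ExecStart line varies between installs, so a manual edit of the unit file works just as well. Remember the daemon-reload, or systemd keeps the old command line:

```
[root@master ~]# vi /usr/lib/systemd/system/atomic-openshift-master.service
    (append --loglevel=5 to the ExecStart= line)
[root@master ~]# systemctl daemon-reload
[root@master ~]# systemctl restart atomic-openshift-master
```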
I also have an issue where the nodeIP in node-config.yaml is not getting populated; could that be related? Would it usually derive the HostSubnet from the nodeIP?
OK, I can confirm this is not related to the missing nodeIP.
Yes, in 3.1 and later installs we no longer populate the nodeIP value in the config unless openshift_set_node_ip is set to true. This is because the master resolves the node's IP when it's being registered. If the master cannot resolve the node's hostname to the IP address that should be used by the SDN, then you'll want to set openshift_set_node_ip=True and set openshift_ip to the IP address that should be used by the SDN.
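A minimal inventory fragment for that, per the comment above (the IP here is a documentation placeholder, and would normally be set per host rather than globally):

```
[OSEv3:vars]
openshift_set_node_ip=True
openshift_ip=192.0.2.10
```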
Further investigation pointed out that this problem is related to https://github.com/kubernetes/kubernetes/issues/18409, which is already tracked for inclusion in OSE via bug 1303085, so closing this one as a duplicate.

*** This bug has been marked as a duplicate of bug 1303085 ***