Created attachment 1139901 [details]
node log

Description of problem:
After the ansible installer finished successfully, all nodes are NotReady.

Mar 24 09:01:45 master-a1-6cu85xf5.osp.sfa.se docker[31379]: W0324 09:01:45.088372 31430 subnets.go:150] Could not find an allocated subnet for node: master-a1-6cu85xf5.osp.sfa.se, Waiting...
Mar 24 09:01:45 master-a1-6cu85xf5.osp.sfa.se docker[31379]: W0324 09:01:45.590567 31430 subnets.go:150] Could not find an allocated subnet for node: master-a1-6cu85xf5.osp.sfa.se, Waiting...
Mar 24 09:01:46 master-a1-6cu85xf5.osp.sfa.se docker[31379]: W0324 09:01:46.092582 31430 subnets.go:150] Could not find an allocated subnet for node: master-a1-6cu85xf5.osp.sfa.se, Waiting...
Mar 24 09:01:46 master-a1-6cu85xf5.osp.sfa.se docker[31379]: F0324 09:01:46.592739 31430 node.go:175] SDN Node failed: Failed to start plugin: Failed to get subnet for this host: master-a1-6cu85xf5.osp.sfa.se, error: hostsubnet "master-a1-6cu85xf5.osp.sfa.se" not found

There are no HostSubnets created for any nodes.

Version-Release number of selected component (if applicable):
[root@master-a2-ez8xmop2 origin]# openshift version
openshift v3.1.1.6-21-gcd70c35
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2
Created attachment 1139902 [details] master api log
This also seems to be related to https://bugzilla.redhat.com/show_bug.cgi?id=1290967
Marcel, Thanks, the logs are helpful. What we really need logs from is the controllers service, can you verify that `atomic-openshift-master-controllers` is running on your masters and gather their logs? That's the service that watches for new nodes to be registered and creates other resources. -- Scott
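For anyone following along, the check and log collection asked for above would look roughly like this (unit name taken from the comment; adjust if your install differs):

```
[root@master ~]# systemctl status atomic-openshift-master-controllers
[root@master ~]# journalctl -u atomic-openshift-master-controllers --no-pager > controllers.log
```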
Created attachment 1141137 [details] new master-api log
Created attachment 1141138 [details] new node log
Created attachment 1141139 [details] master controllers log
I added logs from a fresh deployment.
I don't see anything wrong that'd be specific to the installer. Assigning to networking team.
Do you have any idea _when_ the HostSubnets should be created?
I have a gut feeling this could be related to DNS not being set up properly. When using IP addresses instead of hostnames it worked, even though /etc/hosts contains all nodes and their addresses.
After retrying with proper DNS settings, the issue still occurs.
I don't think the problem here is the HostSubnets per se. There seems to be some sort of generic master-node communication problem. E.g.:

Mar 29 08:25:55 master-a1-y57c1gdo.osp.sfa.se atomic-openshift-master-controllers[21487]: I0329 04:25:55.883698 1 nodecontroller.go:631] node node-a2-x3jt6e89.osp.sfa.se hasn't been updated for 3m15.900125363s. Last ready condition is: {Type:Ready Status:Unknown LastHeartbeatTime:2016-03-29 04:22:35 -0400 EDT LastTransitionTime:2016-03-29 04:23:20 -0400 EDT Reason:NodeStatusUnknown Message:Kubelet stopped posting node status.}

Also, the node logs start out with:

Starting a node connected to https://openshift.sfa.se:8443

which doesn't seem to match the "master-a1-y57c1gdo.osp.sfa.se" name seen elsewhere. Is it possible that you have two different OpenShift clusters configured, and the nodes are connecting to the wrong master?
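A quick way to see both symptoms side by side, node readiness and whether any HostSubnets were allocated at all (node names here are placeholders):

```
[root@master ~]# oc get nodes
[root@master ~]# oc get hostsubnets
[root@master ~]# oc describe node node-a2.example.com
```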
No, it's one cluster, using:

openshift_master_cluster_method=native
openshift_master_cluster_hostname=openshift.sfa.se
openshift_master_cluster_public_hostname=openshift.sfa.se

That hostname resolves to the 3 floating IP addresses of the masters, but the issue also happened when it was resolving to the internal IP of the first master. The node stops posting its own status because its SDN fails; it just gives up at that point.
(In reply to Marcel Wysocki from comment #9)
> Do you have any idea _when_ the HostSubnets should be created?

The master should create it when it observes that a new node has been added:

Mar 29 08:14:45 master-a1-y57c1gdo.osp.sfa.se atomic-openshift-master-controllers[21487]: I0329 04:14:45.577085 1 nodecontroller.go:357] NodeController observed a new Node: api.Node...

If an error occurred, it would log it. So given that it did observe the node being added, and didn't log an error about creating the HostSubnet (and apparently didn't create a HostSubnet), that makes me wonder if maybe the master doesn't realize we're using the openshift-sdn networking plugin? Can you attach the master config file?
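A minimal sketch of the check being asked for, run against a hypothetical excerpt of a master config (the values below are illustrative; the real file is the attachment on this bug):

```shell
# Hypothetical excerpt of /etc/origin/master/master-config.yaml, written to a
# temp file so the check below has something to run against. The master only
# allocates HostSubnets when an openshift-sdn plugin is configured here.
cat > /tmp/master-config-excerpt.yaml <<'EOF'
networkConfig:
  networkPluginName: redhat/openshift-ovs-subnet
  clusterNetworkCIDR: 10.1.0.0/16
  hostSubnetLength: 8
  serviceNetworkCIDR: 172.30.0.0/16
EOF

# If this prints nothing, the master is not running an SDN plugin and will
# never create HostSubnets for new nodes.
grep networkPluginName /tmp/master-config-excerpt.yaml
```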
I set up a second environment. same issue happens here, but this will give me more possibilities to debug. I'll attach the master config.
Created attachment 1143901 [details] master-config
Deleting the node and restarting the service (on the same host) also doesn't work. It gets re-added, but the HostSubnet is never created.
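For reference, the delete-and-restart sequence tried here looks roughly like this (node name is a placeholder):

```
[root@master ~]# oc delete node node-a1.example.com
[root@master ~]# systemctl restart atomic-openshift-node
[root@master ~]# oc get hostsubnets
```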
Do you need any more information from me ?
Hi,

I also have seen this problem while trying out the OpenStack integration on OSE in OS1. If I add to /etc/origin/node/node-config.yaml:

kubeletArguments:
  cloud-provider:
  - "openstack"
  cloud-config:
  - "/etc/cloud.conf"

and restart atomic-openshift-node, I get the error:

Apr 07 06:18:29 master02.jlab.rhc-ose.labs.redhat.com atomic-openshift-node[9301]: I0407 06:18:29.036810 9301 openstack.go:289] Claiming to support Instances
Apr 07 06:18:29 master02.jlab.rhc-ose.labs.redhat.com atomic-openshift-node[9301]: I0407 06:18:29.135445 9301 kubelet.go:2499] Recording NodeReady event message for node master02-jlab.os1.phx2.redhat.com
Apr 07 06:18:29 master02.jlab.rhc-ose.labs.redhat.com atomic-openshift-node[9301]: I0407 06:18:29.135526 9301 kubelet.go:972] Attempting to register node master02-jlab.os1.phx2.redhat.com
Apr 07 06:18:29 master02.jlab.rhc-ose.labs.redhat.com atomic-openshift-node[9301]: E0407 06:18:29.186169 9301 kubelet.go:1011] Previously "master02-jlab.os1.phx2.redhat.com" had externalID "9a3923a3-fe4d-49ac
Apr 07 06:18:29 master02.jlab.rhc-ose.labs.redhat.com atomic-openshift-node[9301]: E0407 06:18:29.190361 9301 kubelet.go:1013] Unable to delete old node: User "system:node:master02-jlab.os1.phx2.redhat.com"

So I delete the node with:

[root@master02 ~]# oc delete node master02-jlab.os1.phx2.redhat.com
node "master02-jlab.os1.phx2.redhat.com" deleted

and restart atomic-openshift-node:

[root@master02 ~]# systemctl restart atomic-openshift-node
Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.
And in the log I get:

Apr 07 05:08:38 master02.jlab.rhc-ose.labs.redhat.com atomic-openshift-node[2491]: I0407 05:08:38.768106 2491 openstack.go:289] Claiming to support Instances
Apr 07 05:08:39 master02.jlab.rhc-ose.labs.redhat.com atomic-openshift-node[2491]: W0407 05:08:39.278492 2491 subnets.go:150] Could not find an allocated subnet for node: master02-jlab.os1.phx2.redhat.com,
Apr 07 05:08:39 master02.jlab.rhc-ose.labs.redhat.com atomic-openshift-node[2491]: W0407 05:08:39.791434 2491 subnets.go:150] Could not find an allocated subnet for node: master02-jlab.os1.phx2.redhat.com,

Nothing works until I comment out the OpenStack-related lines in /etc/origin/node/node-config.yaml, remove the node with oc delete node, and restart atomic-openshift-node. So maybe it's related to the OpenStack integration in the node?

/Jonas Nordell
Marking this as not a release blocker since we don't think it's a regression. We are still going to keep working on it.
I also hit this in an OSE on OpenStack setup (single master, 2 additional nodes). In my case, though, the OpenStack integration worked: the cluster was initially deployed with the heat templates, which don't configure OpenStack integration. After that, I manually configured the OpenStack integration and things worked fine: after a manual delete of the previous nodes, they re-registered and everything worked, including cinder-backed PVs.

Later I tried to reconfigure the cluster to use the openshift-ovs-multitenant plugin and things stopped working. I deleted the nodes; they re-registered but would not get a subnet. I switched back to the previous config (with the openshift-ovs-subnet plugin) and it is still not working. The master was failing a few seconds after start with:

apr 12 06:43:09 master.example.com atomic-openshift-master[5509]: F0412 06:43:09.891855 5509 run_components.go:340] SDN initialization failed: Failed to start plugin: Invalid node IP

Deleting the nodes allowed the master to start and keep running. Restarting the node services, I see them registering but no HostSubnet created.

atomic-openshift-3.1.1.6-4.git.32.adf8ec9.el7aos.x86_64
I've built some packages with additional debug logging on master at http://people.redhat.com/dwinship/aos/. Would any of you be able to try those on master, restart master, then try adding a node, and see what gets logged?

Oh, better yet, if you could run master with additional logging (e.g., add "--loglevel=5" to the command line in /usr/lib/systemd/system/atomic-openshift-master.service before restarting).
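The loglevel change suggested above would go something like this; the exact ExecStart line varies between installs, so a manual edit of the unit file works just as well. Remember the daemon-reload, or systemd keeps the old command line:

```
[root@master ~]# vi /usr/lib/systemd/system/atomic-openshift-master.service
    (append --loglevel=5 to the ExecStart= line)
[root@master ~]# systemctl daemon-reload
[root@master ~]# systemctl restart atomic-openshift-master
```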
I also have an issue where the nodeIP in node-config.yaml is not getting populated; could that be related? Would it usually derive the HostSubnet from the nodeIP?
OK, I can confirm this is not related to the missing nodeIP.
Yes, in 3.1 and later installs we no longer populate the nodeIP value in the config unless openshift_set_node_ip is set to true. This is because the master resolves the node's IP when it's being registered. If the master cannot resolve the node's hostname to the IP address that should be used by the SDN, then you'll want to set openshift_set_node_ip=True and set openshift_ip to the IP address that should be used by the SDN.
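A minimal inventory fragment for that, per the comment above (the IP here is a documentation placeholder, and would normally be set per host rather than globally):

```
[OSEv3:vars]
openshift_set_node_ip=True
openshift_ip=192.0.2.10
```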
Further investigation pointed out that this problem is related to https://github.com/kubernetes/kubernetes/issues/18409, which is already tracked for inclusion in OSE via bug 1303085, so closing this one as a duplicate.

*** This bug has been marked as a duplicate of bug 1303085 ***