Description of problem:
The master starts looking for a different externalID for the node. As a result, the kubelet attempts to delete and recreate the Node API object but is unable to, and the node service fails.

Version-Release number of selected component (if applicable):
3.1

How reproducible:
Unsure of steps to reproduce

Steps to Reproduce:
1. Install on OpenStack with hostnames set in the Ansible hosts file.
2. The environment works for some time and then randomly results in the error.

Actual results:
Error with external ID

Expected results:
OpenShift to resolve the externalID change

Additional info:
Related to upstream issue https://github.com/kubernetes/kubernetes/issues/17731

Error message from node:
Jan 06 11:28:14 mad-osshift-master01.cisco.com atomic-openshift-node[10624]: I0106 11:28:14.952785 10624 kubelet.go:928] Attempting to register node mad-osshift-master01.cisco.com
Jan 06 11:28:14 mad-osshift-master01.cisco.com atomic-openshift-node[10624]: E0106 11:28:14.967250 10624 kubelet.go:951] Previously "mad-osshift-master01.cisco.com" had externalID "mad-osshift-master01.cisco.com"; now it is "10.42.137.150"; will delete and recreate.
Jan 06 11:28:14 mad-osshift-master01.cisco.com atomic-openshift-node[10624]: E0106 11:28:14.969106 10624 kubelet.go:953] Unable to delete old node: User "system:node:mad-osshift-master01.cisco.com" cannot delete nodes at the cluster scope

Workaround:
# oc delete node mad-osshift-node01.cisco.com
# ssh mad-osshift-node01.cisco.com
# openshift start node --config='/etc/origin/node/node-config.yaml'
How the node's ExternalID is set depends on whether a cloud provider has been configured. If a cloud provider has been configured, Kubernetes uses the cloud provider to get the ExternalID. If there is no cloud provider, the ExternalID is set to the same value as the node's "hostname."

I put "hostname" in quotes because there is logic around how the node's hostname is set, and it can be an IP address. If the node-config.yaml file has nodeIP set, then the node's hostname is set to match nodeIP, which means that ExternalID will also be that IP address. Otherwise, the node's hostname comes from the nodeName field in node-config.yaml.

I looked at the customer's logs and they are not running with a cloud provider set, which means that the ExternalID came from the node's "hostname." I have not yet found any code that looks like it is responsible for replacing the hostname with an IP address. I can imagine a few possibilities for how this happened:

1) Someone manually set nodeIP in node-config.yaml. This doesn't seem likely, as it appears that every node's ExternalID is an IP address.
2) Some tooling (openshift-ansible?) set the nodeIP, or set nodeName to be an IP.
3) Some code either exists currently or existed in the past that set either nodeIP or the node's hostname to an IP.
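The resolution order described above can be sketched roughly as follows. This is an illustration of the described behavior, not the kubelet's actual code: the function names `resolve_hostname` and `resolve_external_id` and the `cloud_lookup` callback are hypothetical, and passing the hostname to the cloud provider is an assumption.

```python
from typing import Callable, Optional


def resolve_hostname(node_name: str, node_ip: Optional[str] = None) -> str:
    """Return the node's effective "hostname" (hypothetical helper).

    A non-empty nodeIP takes precedence over nodeName, so the
    "hostname" can end up being an IP address.
    """
    return node_ip if node_ip else node_name


def resolve_external_id(
    node_name: str,
    node_ip: Optional[str] = None,
    cloud_lookup: Optional[Callable[[str], str]] = None,
) -> str:
    """Return the ExternalID the node would register with (sketch only).

    cloud_lookup stands in for a configured cloud provider; it is an
    assumption that the provider is queried by the effective hostname.
    """
    hostname = resolve_hostname(node_name, node_ip)
    if cloud_lookup is not None:
        # With a cloud provider, the provider decides the ExternalID.
        return cloud_lookup(hostname)
    # Without a cloud provider, ExternalID equals the effective hostname.
    return hostname
```

With the values from the report, `resolve_external_id("mad-osshift-master01.cisco.com")` yields the hostname, while adding `node_ip="10.42.137.150"` yields the IP address, which is the kind of mismatch the kubelet log complains about.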
openshift-ansible at one point was setting nodeIP in node-config.yaml. The latest version does not. The customer runs ansible every night to ensure the node configs are all correct. It sounds like they ran ansible when it was setting nodeIP, then later ran it again after it was no longer setting nodeIP. According to Andrew Butcher, ansible would remove nodeIP from the configs. I'm thinking this is what happened.
PR was merged to master: https://github.com/openshift/openshift-ansible/pull/970
Errata release with changes to the ansible installer: https://access.redhat.com/errata/RHBA-2015:2667
There's a PR in to address several facets of this problem: https://github.com/openshift/origin/pull/6310 (among other things, it tolerates switching between nodeIP and hostname for externalID without deleting and recreating the node).
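To make the idea of tolerating a nodeIP/hostname switch concrete, here is a minimal sketch of such a registration decision. It is not the code from the PR: the function `registration_action`, its return values, and the `aliases` parameter (the set of identifiers the node is known by, such as its nodeName and node IPs) are all hypothetical.

```python
from typing import Iterable


def registration_action(
    existing_external_id: str,
    new_external_id: str,
    aliases: Iterable[str],
) -> str:
    """Decide what to do when a node re-registers (illustrative only).

    aliases is an assumed input: every identifier that refers to the
    same node, for example its nodeName and its IP addresses.
    """
    if existing_external_id == new_external_id:
        return "reuse"
    known = set(aliases)
    # Both IDs refer to the same node (hostname vs. IP): tolerate the switch.
    if existing_external_id in known and new_external_id in known:
        return "reuse"
    # Anything else is treated as a genuinely different node identity.
    return "delete-and-recreate"
```

Under these assumptions, a change from "mad-osshift-master01.cisco.com" to "10.42.137.150" is tolerated when both are listed as aliases, while a change from a cloud provider instance ID to a hostname still falls through to delete and recreate.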
https://github.com/openshift/origin/pull/6310 has merged in origin
Fix is in latest OSE build, moving to QE.
Verified on
openshift v3.1.1.3
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

Test env: OpenStack

Test scenarios:

==Without a cloud provider==

1. Add the public IP associated with the instance as nodeIP.
In node-config.yaml:
nodeName: openshift-125.lab.eng.nay.redhat.com
nodeIP: "10.66.79.125"
Result: Failed to start node service; on master, node status is unknown, kubelet stopped posting node status.
Reason: "failed to create kubelet: Node IP: "10.66.79.125" not found in the host's network interfaces". In the OpenStack env, "10.66.79.125" is an associated floating IP.

2. Set nodeIP: "" or remove nodeIP from node-config.yaml.
Result: Node status is ready, no errors or warnings seen.

3. Set nodeIP: "192.168.0.116" in node-config.yaml, where "192.168.0.116" is the eth0 network interface.
nodeName: openshift-125.lab.eng.nay.redhat.com
nodeIP: "192.168.0.116"
Result: Node status is ready; the externalID is shown as openshift-125.lab.eng.nay.redhat.com, not the nodeIP (192.168.0.116).

==With OpenStack as cloud provider (OpenStack instance names are updated to be the same as nodeName)==

Result: The cloud provider gets the ExternalID; here the node has to be deleted in order to be updated successfully.

Jan 18 14:36:15 openshift-125.lab.eng.nay.redhat.com atomic-openshift-node[7486]: I0118 14:36:15.714228 7486 kubelet.go:972] Attempting to register node openshift-125.lab.eng.nay.redhat.com
Jan 18 14:36:15 openshift-125.lab.eng.nay.redhat.com atomic-openshift-node[7486]: E0118 14:36:15.723288 7486 kubelet.go:1011] Previously "openshift-125.lab.eng.nay.redhat.com" had externalID "af53b164-a3a4-48c9-bb6a-b3725c1dcae4"; now it is "openshift-125.lab.eng.nay.redhat.com"; will delete and recreate.
Jan 18 14:36:15 openshift-125.lab.eng.nay.redhat.com atomic-openshift-node[7486]: E0118 14:36:15.724619 7486 kubelet.go:1013] Unable to delete old node: User "system:node:openshift-125.lab.eng.nay.redhat.com" cannot delete nodes at the cluster scope

After deleting the original node as admin, the node is launched successfully.

oc get node -o yaml
```
spec:
  externalID: af53b164-a3a4-48c9-bb6a-b3725c1dcae4
  providerID: openstack:///af53b164-a3a4-48c9-bb6a-b3725c1dcae4
status:
  addresses:
  - address: 192.168.0.116
    type: InternalIP
  - address: 10.66.79.125
    type: InternalIP
  - address: ""
    type: ExternalIP
```
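The failure in the first scenario (a floating IP used as nodeIP) is consistent with a check that nodeIP must be bound to one of the host's own interfaces. A rough sketch of such a check follows; the function `validate_node_ip` and the idea of passing in the interface addresses as a list are assumptions for illustration, not the kubelet's implementation.

```python
import ipaddress
from typing import Iterable


def validate_node_ip(node_ip: str, interface_addrs: Iterable[str]) -> None:
    """Raise ValueError if node_ip is set but not bound locally (sketch).

    interface_addrs is an assumed input: the addresses configured on
    the host's network interfaces, e.g. ["127.0.0.1", "192.168.0.116"].
    """
    if not node_ip:
        # An empty or missing nodeIP means there is nothing to validate.
        return
    wanted = ipaddress.ip_address(node_ip)
    for addr in interface_addrs:
        if ipaddress.ip_address(addr) == wanted:
            return
    raise ValueError(
        f'Node IP: "{node_ip}" not found in the host\'s network interfaces'
    )
```

A floating IP such as 10.66.79.125 is translated outside the instance and never appears on eth0, so a check of this kind would reject it, while 192.168.0.116 would pass.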
(In reply to Eric Jones from comment #23)
> What version of AEP should this not be a problem in?

Per comment #20 this should be in OSE 3.1.1 (RHSA-2016:0070). Updating the missing "fixed in version". AEP preview is based on the same packages, so it should be fixed in atomic-openshift-3.1.1.6-1.git.0.b57e8bd.el7aos there too.