Bug 1297075 - externalID changes and kubelet attempts to delete/recreate Node API object
Summary: externalID changes and kubelet attempts to delete/recreate Node API object
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Solly Ross
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks: 1267746
Reported: 2016-01-08 23:13 UTC by Ryan Howe
Modified: 2019-10-10 10:50 UTC (History)

Fixed In Version: atomic-openshift-3.1.1.6-1.git.0.b57e8bd.el7aos
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-01-29 20:30:24 UTC
Target Upstream Version:



Description Ryan Howe 2016-01-08 23:13:15 UTC
Description of problem:

The master starts looking for a different externalID for the node. As a result, the kubelet attempts to delete and recreate the Node API object, but it is not permitted to do so, and the node service fails.


Version-Release number of selected component (if applicable):
3.1

How reproducible:
Unsure of the steps to reproduce.

Steps to Reproduce:
1. Install on OpenStack with hostnames set in the Ansible hosts file.
2. The environment works for some time and then, seemingly at random, fails with the error below.

Actual results:
Error with external ID

Expected results:
OpenShift should handle the externalID change without the node service failing.

Additional info:

Related to upstream issue https://github.com/kubernetes/kubernetes/issues/17731


Error Message from node: 

Jan 06 11:28:14 mad-osshift-master01.cisco.com atomic-openshift-node[10624]: I0106 11:28:14.952785   10624 kubelet.go:928] Attempting to register node mad-osshift-master01.cisco.com
Jan 06 11:28:14 mad-osshift-master01.cisco.com atomic-openshift-node[10624]: E0106 11:28:14.967250   10624 kubelet.go:951] Previously "mad-osshift-master01.cisco.com" had externalID "mad-osshift-master01.cisco.com"; now it is "10.42.137.150"; will delete and recreate.
Jan 06 11:28:14 mad-osshift-master01.cisco.com atomic-openshift-node[10624]: E0106 11:28:14.969106   10624 kubelet.go:953] Unable to delete old node: User "system:node:mad-osshift-master01.cisco.com" cannot delete nodes at the cluster scope
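To check whether other nodes are hitting the same condition, the node logs can be scanned for the externalID-change message. The following is a minimal sketch, not part of OpenShift: the pattern is copied from the wording of the log line quoted above and may not match other versions, and the function name `find_external_id_changes` is hypothetical.

```python
import re

# Pattern taken from the kubelet message quoted in this report.
# Assumption: other releases may word this message differently.
EXTERNAL_ID_CHANGE = re.compile(
    r'Previously "(?P<node>[^"]+)" had externalID "(?P<old>[^"]*)"; '
    r'now it is "(?P<new>[^"]*)"'
)


def find_external_id_changes(log_lines):
    """Yield (node, old_id, new_id) for each externalID-change message found."""
    for line in log_lines:
        match = EXTERNAL_ID_CHANGE.search(line)
        if match:
            yield match.group("node"), match.group("old"), match.group("new")
```

Passing it the lines of a saved node log yields one tuple per occurrence, which makes it easy to see which direction the ID changed in (hostname to IP, or IP to hostname).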

Workaround:

# oc delete node mad-osshift-node01.cisco.com
# ssh mad-osshift-node01.cisco.com
# openshift start node --config='/etc/origin/node/node-config.yaml'

Comment 9 Andy Goldstein 2016-01-12 04:21:00 UTC
How the node's ExternalID is set depends on whether a cloud provider has been configured. If a cloud provider has been configured, Kubernetes uses the cloud provider to get the ExternalID. If there is no cloud provider, the ExternalID is set to the same value as the node's "hostname." I put "hostname" in quotes because there is logic around how the node's hostname is set, and it can be an IP address. If the node-config.yaml file has nodeIP set, then the node's hostname is set to match nodeIP, which means that ExternalID will also be that IP address. Otherwise, the node's hostname comes from the nodeName field in node-config.yaml.
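The decision described above can be sketched as follows. This is an illustration of the description in this comment only, not the kubelet's actual code: the function names are hypothetical, and `cloud_lookup` stands in for whatever the configured cloud provider does to return an instance ID.

```python
def resolve_hostname(node_name, node_ip=""):
    """Return the node's effective "hostname" as described in this comment.

    If nodeIP is set in node-config.yaml it overrides nodeName, so the
    "hostname" can end up being an IP address.
    """
    return node_ip if node_ip else node_name


def resolve_external_id(node_name, node_ip="", cloud_lookup=None):
    """Return the ExternalID a node would report (pre-fix behavior).

    cloud_lookup is a hypothetical callable standing in for a cloud
    provider; when it is None, no cloud provider is configured.
    """
    hostname = resolve_hostname(node_name, node_ip)
    if cloud_lookup is not None:
        # With a cloud provider, the provider supplies the ExternalID.
        return cloud_lookup(hostname)
    # Without a cloud provider, the ExternalID is the effective hostname.
    return hostname
```

Under this model, adding or removing nodeIP in node-config.yaml changes the reported ExternalID even though the machine itself has not changed, which is the mismatch seen in the logs.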

I looked at the customer's logs and they are not running with a cloud provider set, which means that the ExternalID came from the node's "hostname." I have not yet found any code that looks like it is responsible for replacing the hostname with an IP address.

I can imagine a few possibilities for how this happened:

1) Someone manually set nodeIP in node-config.yaml. This doesn't seem likely, as it appears that every node's ExternalID is an IP address.

2) Some tooling (openshift-ansible?) set the nodeIP, or set nodeName to be an IP.

3) Some code, either current or from a past release, set nodeIP or the node's hostname to an IP.

Comment 12 Andy Goldstein 2016-01-12 18:45:12 UTC
openshift-ansible at one point was setting nodeIP in node-config.yaml. The latest version does not. The customer runs ansible every night to ensure the node configs are all correct. It sounds like they ran ansible when it was setting nodeIP, then later ran it again after it was no longer setting nodeIP. According to Andrew Butcher, ansible would remove nodeIP from the configs. I'm thinking this is what happened.
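To see which state a given node's config is in, the nodeIP entry can be read from node-config.yaml. The snippet below is a rough sketch under stated assumptions: it handles only a simple top-level `key: value` line, as in the config excerpts in this report, and the function name `top_level_value` is hypothetical. A real YAML parser would be more robust.

```python
def top_level_value(config_text, key):
    """Return the value of a top-level `key: value` line, or None if absent.

    Assumption: the key sits at column 0 on a single line; nested keys,
    comments and multi-line values are not handled.
    """
    prefix = key + ":"
    for line in config_text.splitlines():
        if line.startswith(prefix):
            value = line[len(prefix):].strip()
            # Drop surrounding quotes, e.g. nodeIP: "10.66.79.125"
            return value.strip("\"'")
    return None
```

A result of None or an empty string means nodeIP is not set, so the node's "hostname" comes from nodeName.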

Comment 13 Ryan Howe 2016-01-12 19:01:57 UTC
PR was merged to master
https://github.com/openshift/openshift-ansible/pull/970

Errata release with changes to ansible installer 
https://access.redhat.com/errata/RHBA-2015:2667

Comment 14 Solly Ross 2016-01-13 21:06:28 UTC
There's a PR in to address several facets of this problem: https://github.com/openshift/origin/pull/6310 (among other things, it tolerates switching between nodeIP and hostname for externalID without deleting and recreating the node).
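One way to picture the tolerance described here is a check that treats a change between the node's own name and one of its own addresses as harmless. This is only a sketch of the idea and not the code from the PR; the function name `needs_recreate` and the `node_addresses` parameter (for example, addresses already recorded on the Node object) are assumptions made for illustration.

```python
def needs_recreate(old_external_id, new_external_id, node_name, node_addresses):
    """Decide whether an externalID change calls for delete-and-recreate.

    Hypothetical rule: a switch between the node's name and one of its
    own IP addresses is tolerated; any other change is not.
    """
    if old_external_id == new_external_id:
        return False
    own_ids = {node_name} | set(node_addresses)
    # Tolerate hostname <-> nodeIP switches for the same node.
    return not (old_external_id in own_ids and new_external_id in own_ids)
```

With such a rule, the hostname-to-IP change from the original report would not trigger a delete, while a change to or from a cloud provider instance ID still would.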

Comment 15 Jordan Liggitt 2016-01-14 13:49:38 UTC
https://github.com/openshift/origin/pull/6310 has merged in origin

Comment 16 Troy Dawson 2016-01-15 04:34:06 UTC
Fix is in latest OSE build, moving to QE.

Comment 17 Jianwei Hou 2016-01-18 06:45:23 UTC
Verified on 
openshift v3.1.1.3
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

Test env: OpenStack

Test scenarios:
==Without a cloud provider==
1. Set the public IP associated with the instance as nodeIP
In node-config.yaml:
nodeName: openshift-125.lab.eng.nay.redhat.com
nodeIP: "10.66.79.125"

  Result: The node service failed to start. On the master, the node status is unknown and the kubelet stopped posting node status.
  Reason: "failed to create kubelet: Node IP: "10.66.79.125" not found in the host's network interfaces". In the OpenStack environment, "10.66.79.125" is an associated floating IP, so it is not configured on any interface of the host.

2. Set nodeIP: "" or remove nodeIP from node-config.yaml

  Result: node status is ready, no errors or warnings seen

3. Set nodeIP: "192.168.0.116" in node-config.yaml, where "192.168.0.116" is the address of the eth0 network interface
nodeName: openshift-125.lab.eng.nay.redhat.com
nodeIP: "192.168.0.116"

  Result: Node status is ready. The externalID is shown as openshift-125.lab.eng.nay.redhat.com, not the nodeIP (192.168.0.116).
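The difference between scenarios 1 and 3 comes down to whether nodeIP is an address actually configured on the host. A minimal sketch of that kind of check follows; it is not the kubelet's implementation, the function name `node_ip_is_local` is hypothetical, and the list of interface addresses is passed in rather than read from the system.

```python
import ipaddress


def node_ip_is_local(node_ip, interface_addresses):
    """Return True if node_ip matches one of the host's interface addresses."""
    wanted = ipaddress.ip_address(node_ip)
    return any(wanted == ipaddress.ip_address(addr) for addr in interface_addresses)


# Addresses assumed for illustration: loopback plus the eth0 address
# from scenario 3.
host_addresses = ["127.0.0.1", "192.168.0.116"]
floating_ip_ok = node_ip_is_local("10.66.79.125", host_addresses)
eth0_ip_ok = node_ip_is_local("192.168.0.116", host_addresses)
```

A floating IP is mapped to the instance outside the host, so it does not appear among the host's interface addresses and a check of this kind rejects it.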


==With OpenStack as cloud provider (OpenStack instance names are updated to match nodeName)==

Result: The cloud provider supplies the ExternalID. Here the existing node has to be deleted before the node can register successfully with the new ExternalID.

Jan 18 14:36:15 openshift-125.lab.eng.nay.redhat.com atomic-openshift-node[7486]: I0118 14:36:15.714228    7486 kubelet.go:972] Attempting to register node openshift-125.lab.eng.nay.redhat.com
Jan 18 14:36:15 openshift-125.lab.eng.nay.redhat.com atomic-openshift-node[7486]: E0118 14:36:15.723288    7486 kubelet.go:1011] Previously "openshift-125.lab.eng.nay.redhat.com" had externalID "af53b164-a3a4-48c9-bb6a-b3725c1dcae4"; now it is "openshift-125.lab.eng.nay.redhat.com"; will delete and recreate.
Jan 18 14:36:15 openshift-125.lab.eng.nay.redhat.com atomic-openshift-node[7486]: E0118 14:36:15.724619    7486 kubelet.go:1013] Unable to delete old node: User "system:node:openshift-125.lab.eng.nay.redhat.com" cannot delete nodes at the cluster scope

After deleting the original node as admin, the node starts successfully:
oc get node -o yaml
```
spec:
  externalID: af53b164-a3a4-48c9-bb6a-b3725c1dcae4
  providerID: openstack:///af53b164-a3a4-48c9-bb6a-b3725c1dcae4
status:
  addresses:
  - address: 192.168.0.116
    type: InternalIP
  - address: 10.66.79.125
    type: InternalIP
  - address: ""
    type: ExternalIP
```

Comment 24 Josep 'Pep' Turro Mauri 2016-03-02 08:53:05 UTC
(In reply to Eric Jones from comment #23)
> What version of AEP should this not be a problem in?

Per comment #20 this should be in OSE 3.1.1 (RHSA-2016:0070). Updating the missing "fixed in version".

AEP preview is based on the same packages, so it should be fixed in atomic-openshift-3.1.1.6-1.git.0.b57e8bd.el7aos there too.

