Bug 1589396

Summary:	atomic-openshift-node.service unable to start because "network.go:100] Unable to get a bind address: failed to retrieve node IP"
Product:	OpenShift Container Platform	Reporter:	Chris Kim <chrkim>
Component:	Cloud Compute	Assignee:	Jan Chaloupka <jchaloup>
Status:	CLOSED CURRENTRELEASE	QA Contact:	DeShuai Ma <dma>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	3.9.0	CC:	andrew.rolls-drew, aos-bugs, bleanhar, bowe, byount, chrkim, cshereme, hongli, jchaloup, jokerman, jolee, mmccomas, rbost, tatanaka
Target Milestone:	---	Keywords:	Reopened
Target Release:	3.9.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-09-27 15:29:22 UTC	Type:	Bug

Description Chris Kim 2018-06-08 23:37:17 UTC

Description of problem:
In an OpenShift 3.9.30 environment running on an OpenStack 10 (Newton) cloud, atomic-openshift-node can fail to start because it retrieves incorrect node data from the OpenShift API.

This environment is on an OpenStack provider network, so the IPs that "oc describe node" is supposed to list are of type "InternalIP", like so:

Addresses:
  InternalIP:  172.17.0.5
  Hostname:    master-host-1

Jun 08 23:16:27 master-host-1.example.com atomic-openshift-node[6809]: I0608 23:16:27.592307    6809 node.go:350] Starting openshift-sdn pod manager
Jun 08 23:16:27 master-host-1.example.com atomic-openshift-node[6809]: I0608 23:16:27.595137    6809 node.go:393] openshift-sdn network plugin ready
Jun 08 23:16:27 master-host-1.example.com atomic-openshift-node[6809]: I0608 23:16:27.597720    6809 network.go:95] Using iptables Proxier.
Jun 08 23:16:27 master-host-1.example.com atomic-openshift-node[6809]: I0608 23:16:27.599898    6809 multitenant.go:154] SyncVNIDRules: 0 unused VNIDs
Jun 08 23:16:27 master-host-1.example.com atomic-openshift-node[6809]: F0608 23:16:27.602358    6809 network.go:100] Unable to get a bind address: failed to retrieve node IP: host IP unknown; known addresses: [{Hostname master-host-1}]

This condition (Unable to get a bind address) occurs if the InternalIP attribute is lost for some reason; most recently, I found that the attribute was lost during an upgrade from OCP 3.9.25 to 3.9.30.
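A minimal sketch of how the failing lookup appears to behave, written in Python for illustration rather than taken from the Go source. The function name and the data shapes are hypothetical; the preference order (InternalIP first, then ExternalIP) is an assumption based on the error text in the log above, and the real code may differ in detail.

```python
def get_node_host_ip(addresses):
    """Return the first InternalIP, else the first ExternalIP, else raise.

    `addresses` is a list of {"type": ..., "address": ...} dicts, an assumed
    stand-in for the node's status.addresses list.
    """
    by_type = {}
    for addr in addresses:
        by_type.setdefault(addr["type"], []).append(addr["address"])
    for wanted in ("InternalIP", "ExternalIP"):
        if wanted in by_type:
            return by_type[wanted][0]
    # Only a Hostname entry is left, so there is nothing usable as an IP.
    raise LookupError("host IP unknown; known addresses: %r" % (addresses,))


# Address list as shown by "oc describe node" on a healthy node.
healthy = [
    {"type": "InternalIP", "address": "172.17.0.5"},
    {"type": "Hostname", "address": "master-host-1"},
]
# Address list after the InternalIP attribute has been lost.
broken = [{"type": "Hostname", "address": "master-host-1"}]
```

With `healthy` the lookup yields the InternalIP; with `broken` it raises, which corresponds to the fatal "host IP unknown" message.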

The workaround to get the node to re-join the cluster is to temporarily set the bind address explicitly in the node-config.yaml file. Reverting to 0.0.0.0 after the node re-registers with the cluster causes no problem, because the InternalIP field is then re-populated with the correct IP.
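A rough sketch of that temporary edit, assuming the bind address in question is the `bindAddress` entry under `servingInfo` with a value of the form `0.0.0.0:10250`; the report does not name the field, so treat this as a guess. The helper name and the sample text are hypothetical, and the helper only handles IPv4-style `HOST:PORT` values.

```python
import re

# Matches one "bindAddress: HOST:PORT" line, keeping indentation and port.
_BIND_LINE = re.compile(
    r"^([ \t]*bindAddress:[ \t]*)[^\s:]+(:\d+[ \t]*)$", re.MULTILINE
)


def set_bind_host(config_text, host):
    """Return config_text with the host of the first bindAddress replaced."""
    new_text, count = _BIND_LINE.subn(
        lambda m: m.group(1) + host + m.group(2), config_text, count=1
    )
    if count == 0:
        raise ValueError("no bindAddress line found")
    return new_text


# Hypothetical excerpt of a node-config.yaml file.
SAMPLE_CONFIG = """\
servingInfo:
  bindAddress: 0.0.0.0:10250
  bindNetwork: tcp4
"""

pinned = set_bind_host(SAMPLE_CONFIG, "172.17.0.5")  # temporary workaround
reverted = set_bind_host(pinned, "0.0.0.0")          # after re-registration
```

The node service still has to be restarted after each change; the sketch only covers the text edit.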

It appears that this may stem from the fact that the node does not set its node IP via the API until after it gets past this step: https://github.com/openshift/origin/blob/d67b8ce9d32b4defd6bebba2082e5cadf185590b/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet_node_status.go#L1084

I believe this may be possible to reproduce with the following steps:
1. Start the node with a pre-existing InternalIP attribute while the OpenStack nova API is inaccessible, so that the node status may get updated without the InternalIP attribute
2. Restart the node on a master host

Oddly enough, I was able to restart the atomic-openshift-node service successfully on non-master nodes; the only instances that failed were master hosts.

Version-Release number of selected component (if applicable):
3.9.30

How reproducible:
Intermittent; waiting until InternalIP value is lost again

Steps to Reproduce:
1. On an OpenShift cluster integrated with the OpenStack cloud provider, the node's InternalIP value must be missing
2. Restart atomic-openshift-node
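
Since the precondition in step 1 is intermittent, a small check like the following may help spot it. This is a sketch that assumes the input is the JSON printed by `oc get nodes -o json`; the function name is hypothetical, and `node-host-1` in the sample is an invented node name.

```python
import json


def nodes_missing_internal_ip(node_list_json):
    """Return the names of nodes whose status.addresses has no InternalIP."""
    missing = []
    for node in json.loads(node_list_json).get("items", []):
        addresses = node.get("status", {}).get("addresses", [])
        if not any(a.get("type") == "InternalIP" for a in addresses):
            missing.append(node["metadata"]["name"])
    return missing


# Hand-written sample in the shape of a node list; not real cluster output.
SAMPLE_NODES = json.dumps({
    "items": [
        {
            "metadata": {"name": "master-host-1"},
            "status": {"addresses": [
                {"type": "Hostname", "address": "master-host-1"},
            ]},
        },
        {
            "metadata": {"name": "node-host-1"},
            "status": {"addresses": [
                {"type": "InternalIP", "address": "172.17.0.6"},
                {"type": "Hostname", "address": "node-host-1"},
            ]},
        },
    ]
})

affected = nodes_missing_internal_ip(SAMPLE_NODES)
```

Any node named in `affected` is a candidate for hitting the failure on its next restart.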

Actual results:
atomic-openshift-node fails to start, citing "failed to retrieve node IP"

Expected results:
atomic-openshift-node starts and properly updates the InternalIP attribute.

Additional info:

Comment 11 Takayoshi Tanaka 2018-08-30 01:39:26 UTC

I am reopening this bug because another customer case, #02171117, appears to hit the same problem. This customer is using Azure and sees similar error messages. We have a sosreport from an affected master node; I'll attach it privately.

I’m going to ask the customer to collect the master and service logs with LogLevel=10. If there is any other information you would like me to request from the customer, please let me know.

Comment 20 DeShuai Ma 2018-09-26 03:12:35 UTC

We have already tested the errata and no extra testing is needed; moving to verified.

Comment 21 jolee 2018-09-27 15:43:30 UTC

I see this was released as errata for 3.9.43 and has a target of 3.12

What is the status for 3.10 and 3.11?

Comment 22 Bowe Strickland 2019-01-26 23:54:59 UTC

fyi... i came across the same issue on one of what should have been 3 identical nodes, reference architecture installed on AWS, upgrading (some components) from 3.9.30 -> 3.9.60....

pre upgrade:
[root@ip-172-15-23-65 git]# rpm -qa | grep openshift
atomic-openshift-node-3.9.30-1.git.0.dec1ba7.el7.x86_64
atomic-openshift-sdn-ovs-3.9.30-1.git.0.dec1ba7.el7.x86_64
atomic-openshift-clients-3.9.30-1.git.0.dec1ba7.el7.x86_64
atomic-openshift-docker-excluder-3.9.30-1.git.0.dec1ba7.el7.noarch
atomic-openshift-excluder-3.9.30-1.git.0.dec1ba7.el7.noarch
atomic-openshift-3.9.30-1.git.0.dec1ba7.el7.x86_64

post-upgrade:
[root@ip-172-15-43-148 ec2-user]# rpm -qa |grep openshift
atomic-openshift-node-3.9.30-1.git.0.dec1ba7.el7.x86_64
atomic-openshift-excluder-3.9.60-1.git.0.f8b38ff.el7.noarch
atomic-openshift-sdn-ovs-3.9.30-1.git.0.dec1ba7.el7.x86_64
atomic-openshift-docker-excluder-3.9.60-1.git.0.f8b38ff.el7.noarch
atomic-openshift-clients-3.9.30-1.git.0.dec1ba7.el7.x86_64
atomic-openshift-3.9.30-1.git.0.dec1ba7.el7.x86_64

the workaround mentioned above resolved the issue.

Comment 23 Brenton Leanhardt 2019-01-28 13:43:32 UTC

@jan, regarding Comment #22: should he have to explicitly set the node's bind address to work around the problem in the 3.9.30 build even after upgrading to 3.9.60 (which has the fix)?

If not, perhaps Bowe could provide the exact steps he performed to reproduce the problem and we could ask QE to double check.

Comment 24 Jan Chaloupka 2019-01-29 12:19:55 UTC

No additional workaround is needed after upgrading to 3.9.60. The timeout issue was fully fixed. If anything else is malfunctioning, it is a different issue that we need to revisit.

> Azure REST API seems unstable. However, this timeout error didn't show up at the next service restart.

@Bowe, are you referring to this workaround?

The fix is part of the atomic-openshift rpm, which needs to be updated on each node. Until that happens, the only way to temporarily fix the issue is to restart the node daemon.
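
A small sketch for checking whether a host already carries the fixed package, given lines from `rpm -qa | grep openshift`. It assumes 3.9.43 (the errata version mentioned in Comment 21) is the first fixed build; the function names are hypothetical.

```python
import re

FIXED_VERSION = (3, 9, 43)  # assumption: first build containing the fix


def package_version(rpm_name, package="atomic-openshift"):
    """Return (major, minor, patch) if rpm_name is exactly `package`, else None."""
    m = re.match(re.escape(package) + r"-(\d+)\.(\d+)\.(\d+)-", rpm_name)
    return tuple(int(part) for part in m.groups()) if m else None


def has_fix(rpm_lines, package="atomic-openshift"):
    """True if the named package is installed at FIXED_VERSION or later."""
    for line in rpm_lines:
        version = package_version(line.strip(), package)
        if version is not None:
            return version >= FIXED_VERSION
    return False  # package not installed at all


# Post-upgrade package list from Comment 22.
POST_UPGRADE = [
    "atomic-openshift-node-3.9.30-1.git.0.dec1ba7.el7.x86_64",
    "atomic-openshift-excluder-3.9.60-1.git.0.f8b38ff.el7.noarch",
    "atomic-openshift-sdn-ovs-3.9.30-1.git.0.dec1ba7.el7.x86_64",
    "atomic-openshift-docker-excluder-3.9.60-1.git.0.f8b38ff.el7.noarch",
    "atomic-openshift-clients-3.9.30-1.git.0.dec1ba7.el7.x86_64",
    "atomic-openshift-3.9.30-1.git.0.dec1ba7.el7.x86_64",
]

fixed = has_fix(POST_UPGRADE)
```

For the list in Comment 22, only the excluder packages moved to 3.9.60, so `fixed` is False and the host would still be running the 3.9.30 build.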