Bug 1505266 - Node could not start due to the error: SDN node startup failed: could not find egress network interface
Summary: Node could not start due to the error: SDN node startup failed: could not find egress network interface
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.7.0
Assignee: Ravi Sankar
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-10-23 07:37 UTC by Yan Du
Modified: 2017-11-28 22:18 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The configured node hostname does not resolve to the local/private node IP.
Consequence: OpenShift SDN pod-to-pod communication fails.
Fix: Detect this misconfiguration early and do not start the openshift-node service.
Result: Avoids unnecessary debugging, and the error message helps the user fix the configuration.
Clone Of:
Environment:
Last Closed: 2017-11-28 22:18:47 UTC
Target Upstream Version:
Embargoed:


Attachments
node log (75.06 KB, text/plain), 2017-10-23 07:41 UTC, Yan Du
log (10.08 KB, text/plain), 2017-11-01 08:58 UTC, Yan Du


Links
Origin (GitHub) 17132, last updated 2017-11-01 15:54:14 UTC
Red Hat Product Errata RHSA-2017:3188 (SHIPPED_LIVE): Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update, last updated 2017-11-29 02:34:54 UTC

Description Yan Du 2017-10-23 07:37:05 UTC
Description of problem:
Oct 23 03:11:02 preserve-qe-yadu-nrr-1 atomic-openshift-node[118785]: F1023 03:11:02.546716  118785 network.go:45] SDN node startup failed: could not find egress network interface: could not find network interface with the address "10.8.241.73"
Oct 23 03:11:02 preserve-qe-yadu-nrr-1 systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Oct 23 03:11:02 preserve-qe-yadu-nrr-1 systemd[1]: Failed to start OpenShift Node.
Oct 23 03:11:02 preserve-qe-yadu-nrr-1 systemd[1]: Unit atomic-openshift-node.service entered failed state.
Oct 23 03:11:02 preserve-qe-yadu-nrr-1 systemd[1]: atomic-openshift-node.service failed.


Version-Release number of selected component (if applicable):
openshift v3.7.0-0.174.0
kubernetes v1.7.6+a08f5eeb6
openshift-ansible-3.7.0-0.174.0.git.0.01932ad.el7.noarch.rpm


How reproducible:
Always

Steps to Reproduce:
1. Set up an OCP env with the multitenant plugin
2. Check the node
3.

Actual results:
Node could not start due to the error: SDN node startup failed: could not find egress network interface: could not find network interface with the address "10.8.241.73" 

Full node log has been attached

Expected results:
Node should start normally

Additional info:
1. Cannot reproduce the issue in an env with the subnet plugin
2. Workaround: adding the node's eth0 IP to the node config (nodeIP field) fixes the issue (see the excerpt after this list)
3. The issue may be caused by https://github.com/openshift/origin/pull/16866
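
For reference, a minimal excerpt of what the workaround in item 2 looks like in /etc/origin/node/node-config.yaml; 172.16.0.10 is a hypothetical stand-in for the node's actual local eth0 address, not a value taken from this bug:

nodeName: host-8-241-73.host.centralci.eng.rdu2.redhat.com
nodeIP: 172.16.0.10   # node's local eth0 address (placeholder value)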

Comment 1 Yan Du 2017-10-23 07:41:03 UTC
Created attachment 1342024 [details]
node log

Comment 2 Ravi Sankar 2017-10-23 21:13:03 UTC
I did the initial analysis and this is what happened:
 - OpenShift Node started with 'host-8-241-73.host.centralci.eng.rdu2.redhat.com' as nodeName in node-config.yaml
 - SDN node resolved this hostname to 10.8.241.73
 - Trying to fetch network interface for IP '10.8.241.73' failed as part of egressIP start.
 
Logging into the machine:
- DNS: host-8-241-73.host.centralci.eng.rdu2.redhat.com resolved to 10.8.241.73, but this IP is not assigned to any of the local interfaces on the node; this could be due to how the *openstack* cluster did the network setup (similar case with AWS?)

==============
[root@qe-cryan-37-2mrrne-1 ~]# ping 10.8.241.73
PING 10.8.241.73 (10.8.241.73) 56(84) bytes of data.
64 bytes from 10.8.241.73: icmp_seq=1 ttl=63 time=0.354 ms

[root@qe-cryan-37-2mrrne-1 ~]# tcpdump -i eth0 -p icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
16:45:50.484256 IP host-172-16-120-7.openstacklocal > host-8-241-73.host.centralci.eng.rdu2.redhat.com: ICMP echo request, id 12607, seq 1, length 64
16:45:50.484478 IP host-8-241-73.host.centralci.eng.rdu2.redhat.com > host-172-16-120-7.openstacklocal: ICMP echo request, id 12607, seq 1, length 64
===============

Maybe we need to reject nodeName in node-config.yaml if the corresponding IP is not local?

@bbennett @danw what do you think?

Comment 3 Dan Winship 2017-10-23 21:45:43 UTC
> Maybe we need to reject nodeName in node-config.yaml if the corresponding IP is not local?

Yes. The error talks about egress, but we should make it more generic since the whole SDN will be broken in this case. The problem is that, to avoid spoofing, nodes check the source IP of VXLAN packets that they receive, and only accept packets from known node IP addresses. That means that the HostSubnet record (and by extension, the Node record) needs to contain the local/private IP of the node, not the external/public IP, or else all inter-node SDN traffic will get dropped silently.

openshift-ansible lets you configure both IP addresses (openshift_ip vs openshift_public_ip). I'm not sure exactly what that translates to in terms of node-config.yaml. Probably we should make openshift-ansible actually check if you're misconfiguring your nodes this way, rather than only failing when we get to node startup.

(@yadu, are you using ansible to install, or something else?)

Comment 4 Yan Du 2017-10-24 02:54:28 UTC
@Dan: Yes, I'm using ansible to install openshift

Comment 5 Dan Winship 2017-10-24 13:25:44 UTC
In that case, you want to set openshift_public_ip=10.8.241.73 and openshift_ip to whatever the local eth0 IP address is.
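
As an illustrative sketch, the inventory entry for this host would look roughly like the line below; openshift_ip and openshift_public_ip are openshift-ansible host variables, and 172.16.0.10 is a placeholder for the node's actual local eth0 address:

[nodes]
host-8-241-73.host.centralci.eng.rdu2.redhat.com openshift_public_ip=10.8.241.73 openshift_ip=172.16.0.10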

Comment 6 Hongan Li 2017-10-26 03:16:06 UTC
openshift v3.7.0-0.178.0 + Openstack + networkpolicy plugin + containerized installation: CAN reproduce this issue.

openshift v3.7.0-0.178.0 + GCE + networkpolicy plugin + RPM installation: CANNOT reproduce this issue.

In the GCE env, the hostname resolves to the local interface IP, as shown below:

[root@qe-hongli-master-etcd-1 ~]# oc get hostsubnet 
NAME                               HOST                               HOST IP      SUBNET          EGRESS IPS
qe-hongli-master-etcd-1            qe-hongli-master-etcd-1            10.240.0.5   10.130.0.0/23   []

[root@qe-hongli-master-etcd-1 ~]# ip a show eth0 | grep inet
    inet 10.240.0.5/32 brd 10.240.0.5 scope global dynamic eth0
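
For reference, one quick way to check this on any node is to compare what the configured hostname resolves to against the local addresses; getent and ip are standard Linux tools, and the hostname below is the one from comment 2. If the resolved IP does not appear in the ip output, the node will hit this error:

# getent hosts host-8-241-73.host.centralci.eng.rdu2.redhat.com
# ip -4 addr show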

Comment 7 Ravi Sankar 2017-10-26 06:18:29 UTC
Added SDN node IP validation: https://github.com/openshift/origin/pull/17043
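
For illustration only, a minimal standalone Go sketch of the kind of check this adds; this is not the code from that PR, it just shows the idea of verifying the resolved node IP against the host's interface addresses before starting the SDN node:

package main

import (
	"fmt"
	"net"
	"os"
)

// isLocalIP reports whether nodeIP is assigned to one of this host's interfaces.
// Illustrative sketch, not the openshift/origin implementation.
func isLocalIP(nodeIP net.IP) (bool, error) {
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		return false, err
	}
	for _, addr := range addrs {
		if ipnet, ok := addr.(*net.IPNet); ok && ipnet.IP.Equal(nodeIP) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// The IP from this bug report; in practice this would be the IP that the
	// configured nodeName resolves to.
	nodeIP := net.ParseIP("10.8.241.73")
	local, err := isLocalIP(nodeIP)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if !local {
		// Fail fast instead of letting pod-to-pod traffic break silently later.
		fmt.Fprintf(os.Stderr, "node IP %s is not a local address\n", nodeIP)
		os.Exit(1)
	}
	fmt.Println("node IP is local")
}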

Comment 8 Meng Bo 2017-10-26 09:42:30 UTC
Raising the severity since it blocks installing the env on cloud IaaS with the cloud provider enabled.

Comment 9 Dan Winship 2017-10-26 12:45:14 UTC
@bmeng/@hongli: the PR will NOT make your system work; it will just log a slightly clearer error message.

We believe that this error will only show up if the node is configured in such a way that all pod-to-pod traffic between different nodes would fail. I was assuming before that you were configuring a single-node cluster, and thus simply hadn't noticed the misconfiguration. The "oc get hostsubnet" output above seems to confirm that; there is only one node there (which is also the master). In that case, the problem is NOT with OpenShift, it's that your cluster is misconfigured. You need to separately set openshift_ip and openshift_public_ip when configuring the ansible install, as per comment 5.

Comment 10 Dan Winship 2017-10-26 17:37:50 UTC
filed bug 1506750 about making ansible check for this

Comment 11 Meng Bo 2017-10-27 10:42:15 UTC
I think I was wrong in comment #8.
This only affects installation on OpenStack, regardless of cloud provider status.

Removing the testblocker tag.

Comment 12 openshift-github-bot 2017-10-28 20:41:38 UTC
Commits pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/75adfb4eb815aab3d0e8c444aac240b77eca0185
Bug 1505266 - Validate node IP is local during sdn node initialization

https://github.com/openshift/origin/commit/f12efeae4ee416860a0bb9e124b0ea876cf5a0f0
Merge pull request #17043 from pravisankar/validate-nodeip

Automatic merge from submit-queue.

 Bug 1505266 - Validate node IP is local during sdn node initialization

Fix for https://bugzilla.redhat.com/show_bug.cgi?id=1505266

Comment 13 Hongan Li 2017-10-31 10:35:14 UTC
Failed to install OCP v3.7.0-0.188.0 with the subnet plugin on OpenStack; node errors below:

Oct 31 05:17:51 qe-hongli-37-node-registry-router-2 atomic-openshift-node[67539]: I1031 05:17:51.008268   67539 start_node.go:288] Reading node configuration from /etc/origin/node/node-config.yaml
Oct 31 05:17:51 qe-hongli-37-node-registry-router-2 atomic-openshift-node[67539]: W1031 05:17:51.010169   67539 server.go:190] WARNING: all flags other than --config, --write-config-to, and --cleanup-iptables are deprecated. Please begin using a config file ASAP.
Oct 31 05:17:51 qe-hongli-37-node-registry-router-2 atomic-openshift-node[67539]: I1031 05:17:51.012994   67539 node.go:146] Initializing SDN node of type "redhat/openshift-ovs-subnet" with configured hostname "host-8-241-0.host.centralci.eng.rdu2.redhat.com" (IP ""), iptables sync period "30s"
Oct 31 05:17:51 qe-hongli-37-node-registry-router-2 atomic-openshift-node[67539]: I1031 05:17:51.013992   67539 start_node.go:459] Unable to initialize network configuration: SDN initialization failed: node IP "10.8.241.0" is not a local/private address (hostname "host-8-241-0.host.centralci.eng.rdu2.redhat.com")
Oct 31 05:17:51 qe-hongli-37-node-registry-router-2 atomic-openshift-node[67539]: F1031 05:17:51.014007   67539 start_node.go:159] SDN initialization failed: node IP "10.8.241.0" is not a local/private address (hostname "host-8-241-0.host.centralci.eng.rdu2.redhat.com")
Oct 31 05:17:51 qe-hongli-37-node-registry-router-2 systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Oct 31 05:17:51 qe-hongli-37-node-registry-router-2 systemd[1]: Failed to start OpenShift Node.
Oct 31 05:17:51 qe-hongli-37-node-registry-router-2 systemd[1]: Unit atomic-openshift-node.service entered failed state.
Oct 31 05:17:51 qe-hongli-37-node-registry-router-2 systemd[1]: atomic-openshift-node.service failed.

Comment 14 Dan Winship 2017-10-31 14:04:28 UTC
Again, our current understanding of the situation is that the test is failing because *the test is designed incorrectly*, not because OpenShift is broken.

It used to be possible to configure nodes in such a way that everything worked right locally, but pod-to-pod traffic between nodes would silently get dropped. Eventually people would figure out what was wrong and reinstall the node correctly, but it was a big waste of time. So now, we try to detect the problem immediately and error out if we see that the node is misconfigured. That's what's happening here; it appears that the node's configuration is incorrect, so OpenShift is refusing to run until it's fixed.


Can you attach the ansible config you're using to configure this cluster? And what does "ip a" show on qe-hongli-37-node-registry-router-2?

Was I correct in assuming that this test uses a single-node cluster? If so, are there any tests that install a multi-node OpenStack cluster? Do those tests specify separate openshift_public_hostname/openshift_hostname or openshift_public_ip/openshift_ip ?

Comment 16 Yan Du 2017-11-01 08:58:21 UTC
Created attachment 1346383 [details]
log

Comment 18 Dan Winship 2017-11-01 15:12:02 UTC
We're reverting the fatal error for 3.7 (https://github.com/openshift/origin/pull/17132). But it will be coming back once 3.7 branches.

Note that the set of features broken by having an incorrect nodeIP includes at least:

  - all SDN traffic between pods on different nodes
  - pod liveness checks for pods that use hostports
  - automatic per-namespace egress IPs
  - ...?

So you can get away with having misconfigured clusters when you're only testing features that aren't in that list, but we really shouldn't allow normal users to install clusters in this way.

Comment 19 Johnny Liu 2017-11-01 15:52:50 UTC
(In reply to Dan Winship from comment #18)
> We're reverting the fatal error for 3.7
> (https://github.com/openshift/origin/pull/17132). But it will be coming back
> once 3.7 branches.
> 
> Note that the set of features broken by having an incorrect nodeIP includes
> at least:
> 
>   - all SDN traffic between pods on different nodes
>   - pod liveness checks for pods that use hostports
>   - automatic per-namespace egress IPs
>   - ...?
> 
> So you can get away with having misconfigured clusters when you're only
> testing features that aren't in that list, but we really shouldn't allow
> normal users to install clusters in this way.

If QE's cluster is indeed misconfigured, we will try our best to find a good way to get the cluster installed on OpenStack. This probably involves OpenStack network configuration; still under investigation.

Comment 20 Xiaoli Tian 2017-11-09 03:34:08 UTC
Please test it on build 3.7.4-1 or a newer version.

Comment 21 Yan Du 2017-11-10 03:11:04 UTC
Tested on the latest OCP:
openshift v3.7.5
kubernetes v1.7.6+a08f5eeb62

The node now uses the host IP as the hostname in the node config, and it starts normally.

Comment 24 errata-xmlrpc 2017-11-28 22:18:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

