Description of problem:

Oct 23 03:11:02 preserve-qe-yadu-nrr-1 atomic-openshift-node[118785]: F1023 03:11:02.546716  118785 network.go:45] SDN node startup failed: could not find egress network interface: could not find network interface with the address "10.8.241.73"
Oct 23 03:11:02 preserve-qe-yadu-nrr-1 systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Oct 23 03:11:02 preserve-qe-yadu-nrr-1 systemd[1]: Failed to start OpenShift Node.
Oct 23 03:11:02 preserve-qe-yadu-nrr-1 systemd[1]: Unit atomic-openshift-node.service entered failed state.
Oct 23 03:11:02 preserve-qe-yadu-nrr-1 systemd[1]: atomic-openshift-node.service failed.

Version-Release number of selected component (if applicable):
openshift v3.7.0-0.174.0
kubernetes v1.7.6+a08f5eeb6
openshift-ansible-3.7.0-0.174.0.git.0.01932ad.el7.noarch.rpm

How reproducible:
Always

Steps to Reproduce:
1. Set up an OCP env with the multitenant plugin
2. Check the node

Actual results:
The node could not start due to the error:
SDN node startup failed: could not find egress network interface: could not find network interface with the address "10.8.241.73"
Full node log is attached.

Expected results:
The node starts normally.

Additional info:
1. Cannot reproduce the issue in an env with the subnet plugin.
2. Workaround: adding the node's eth0 IP to the node config (nodeIP field) fixes the issue.
3. The issue may be caused by https://github.com/openshift/origin/pull/16866
Created attachment 1342024 [details] node log
I did the initial analysis and this is what happened:
- OpenShift Node started with 'host-8-241-73.host.centralci.eng.rdu2.redhat.com' as nodeName in node-config.yaml
- The SDN node resolved it to 10.8.241.73
- Fetching the network interface for IP '10.8.241.73' failed as part of egress IP startup

Logging into the machine:
- DNS resolved host-8-241-73.host.centralci.eng.rdu2.redhat.com to 10.8.241.73, but this IP is not on any of the local interfaces on the node. This could be due to how the *openstack* cluster did the network setup (similar case with AWS?)

==============
[root@qe-cryan-37-2mrrne-1 ~]# ping 10.8.241.73
PING 10.8.241.73 (10.8.241.73) 56(84) bytes of data.
64 bytes from 10.8.241.73: icmp_seq=1 ttl=63 time=0.354 ms

[root@qe-cryan-37-2mrrne-1 ~]# tcpdump -i eth0 -p icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
16:45:50.484256 IP host-172-16-120-7.openstacklocal > host-8-241-73.host.centralci.eng.rdu2.redhat.com: ICMP echo request, id 12607, seq 1, length 64
16:45:50.484478 IP host-8-241-73.host.centralci.eng.rdu2.redhat.com > host-172-16-120-7.openstacklocal: ICMP echo request, id 12607, seq 1, length 64
===============

Maybe we need to reject nodeName in node-config.yaml if the corresponding IP is not local?

@bbennett @danw what do you think?
> May be we need to reject nodeName in node-config.yaml if corresponding IP is not local? Yes. The error talks about egress, but we should make it more generic since the whole SDN will be broken in this case. The problem is that, to avoid spoofing, nodes check the source IP of VXLAN packets that they receive, and only accept packets from known node IP addresses. That means that the HostSubnet record (and by extension, the Node record) needs to contain the local/private IP of the node, not the external/public IP, or else all inter-node SDN traffic will get dropped silently. openshift-ansible lets you configure both IP addresses (openshift_ip vs openshift_public_ip). I'm not sure exactly what that translates to in terms of node-config.yaml. Probably we should make openshift-ansible actually check if you're misconfiguring your nodes this way, rather than only failing when we get to node startup. (@yadu, are you using ansible to install, or something else?)
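The source-IP check described above is essentially a whitelist of known node IPs applied to packets arriving on the VXLAN tunnel. In OVS flow terms it looks roughly like the sketch below (illustrative only, not the actual origin flow table; the table numbers and the node IP are made up):

```
# accept tunnel packets whose outer source IP matches a known node's HostSubnet IP
table=10, tun_src=10.240.0.5, actions=goto_table:30
# anything else arriving on the tunnel is silently dropped -- which is why a
# wrong nodeIP breaks all inter-node pod traffic without an obvious error
table=10, actions=drop
```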
@Dan: Yes, I'm using ansible to install openshift
In that case, you want to set openshift_public_ip=10.8.241.73, and openshift_ip= whatever the local eth0 IP address is
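For reference, in the ansible inventory that would look something like the fragment below (the hostname is taken from this report; the local IP 172.16.120.7 is only an illustrative guess based on the tcpdump output in comment 3 -- substitute the real eth0 address):

```
[nodes]
host-8-241-73.host.centralci.eng.rdu2.redhat.com openshift_public_ip=10.8.241.73 openshift_ip=172.16.120.7
```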
openshift v3.7.0-0.178.0 + OpenStack + networkpolicy plugin + containerized installation: CAN reproduce this issue.
openshift v3.7.0-0.178.0 + GCE + networkpolicy plugin + RPM installation: CANNOT reproduce this issue.

In the GCE env, the hostname resolves to the local interface IP as below:

[root@qe-hongli-master-etcd-1 ~]# oc get hostsubnet
NAME                      HOST                      HOST IP      SUBNET          EGRESS IPS
qe-hongli-master-etcd-1   qe-hongli-master-etcd-1   10.240.0.5   10.130.0.0/23   []

[root@qe-hongli-master-etcd-1 ~]# ip a show eth0 | grep inet
    inet 10.240.0.5/32 brd 10.240.0.5 scope global dynamic eth0
Added SDN node IP validation: https://github.com/openshift/origin/pull/17043
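The validation in that PR boils down to checking whether the node IP (after resolving nodeName) is actually assigned to a local interface. A minimal standalone sketch of such a check in Go follows; this is not the actual origin code, and the function name is made up:

```go
package main

import (
	"fmt"
	"net"
)

// isLocalIP reports whether ip is assigned to any network interface
// on this host. An SDN node whose resolved nodeIP fails this check
// would be the misconfiguration described in this bug.
func isLocalIP(ip string) bool {
	target := net.ParseIP(ip)
	if target == nil {
		return false
	}
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		return false
	}
	for _, addr := range addrs {
		// interface addresses are typically *net.IPNet (IP + mask)
		if ipnet, ok := addr.(*net.IPNet); ok && ipnet.IP.Equal(target) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isLocalIP("127.0.0.1")) // loopback is always assigned locally
}
```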
Raising the severity since this blocks installing the env on cloud IaaS with a cloud provider enabled.
@bmeng/@hongli: the PR will NOT make your system work, it will just log a slightly clearer error message. We believe that this error will only show up if the node is configured in such a way that all pod-to-pod traffic between different nodes would fail. I was assuming before that you were configuring a single-node cluster, and thus simply hadn't noticed the misconfiguration. The "oc get hostsubnet" output above seems to confirm that; there is only one node there (which is also the master). In that case, the problem is NOT with OpenShift, it's that your cluster is misconfigured. You need to separately set openshift_ip and openshift_public_ip when configuring the ansible install, as per comment 5.
filed bug 1506750 about making ansible check for this
I think I was wrong in comment #8. This only affects installation on OpenStack, regardless of cloud provider status. Removing the testblocker tag.
Commits pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/75adfb4eb815aab3d0e8c444aac240b77eca0185
Bug 1505266 - Validate node IP is local during sdn node initialization

https://github.com/openshift/origin/commit/f12efeae4ee416860a0bb9e124b0ea876cf5a0f0
Merge pull request #17043 from pravisankar/validate-nodeip

Automatic merge from submit-queue.

Bug 1505266 - Validate node IP is local during sdn node initialization

Fix for https://bugzilla.redhat.com/show_bug.cgi?id=1505266
Failed to install OCP v3.7.0-0.188.0 with the subnet plugin on OpenStack; node errors below:

Oct 31 05:17:51 qe-hongli-37-node-registry-router-2 atomic-openshift-node[67539]: I1031 05:17:51.008268   67539 start_node.go:288] Reading node configuration from /etc/origin/node/node-config.yaml
Oct 31 05:17:51 qe-hongli-37-node-registry-router-2 atomic-openshift-node[67539]: W1031 05:17:51.010169   67539 server.go:190] WARNING: all flags other than --config, --write-config-to, and --cleanup-iptables are deprecated. Please begin using a config file ASAP.
Oct 31 05:17:51 qe-hongli-37-node-registry-router-2 atomic-openshift-node[67539]: I1031 05:17:51.012994   67539 node.go:146] Initializing SDN node of type "redhat/openshift-ovs-subnet" with configured hostname "host-8-241-0.host.centralci.eng.rdu2.redhat.com" (IP ""), iptables sync period "30s"
Oct 31 05:17:51 qe-hongli-37-node-registry-router-2 atomic-openshift-node[67539]: I1031 05:17:51.013992   67539 start_node.go:459] Unable to initialize network configuration: SDN initialization failed: node IP "10.8.241.0" is not a local/private address (hostname "host-8-241-0.host.centralci.eng.rdu2.redhat.com")
Oct 31 05:17:51 qe-hongli-37-node-registry-router-2 atomic-openshift-node[67539]: F1031 05:17:51.014007   67539 start_node.go:159] SDN initialization failed: node IP "10.8.241.0" is not a local/private address (hostname "host-8-241-0.host.centralci.eng.rdu2.redhat.com")
Oct 31 05:17:51 qe-hongli-37-node-registry-router-2 systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Oct 31 05:17:51 qe-hongli-37-node-registry-router-2 systemd[1]: Failed to start OpenShift Node.
Oct 31 05:17:51 qe-hongli-37-node-registry-router-2 systemd[1]: Unit atomic-openshift-node.service entered failed state.
Oct 31 05:17:51 qe-hongli-37-node-registry-router-2 systemd[1]: atomic-openshift-node.service failed.
Again, our current understanding of the situation is that the test is failing because *the test is designed incorrectly*, not because OpenShift is broken. It used to be possible to configure nodes in such a way that everything worked right locally, but pod-to-pod traffic between nodes would silently get dropped. Eventually people would figure out what was wrong and reinstall the node correctly, but it was a big waste of time. So now, we try to detect the problem immediately and error out if we see that the node is misconfigured. That's what's happening here; it appears that the node's configuration is incorrect, so OpenShift is refusing to run until it's fixed. Can you attach the ansible config you're using to configure this cluster? And what does "ip a" show on qe-hongli-37-node-registry-router-2? Was I correct in assuming that this test uses a single-node cluster? If so, are there any tests that install a multi-node OpenStack cluster? Do those tests specify separate openshift_public_hostname/openshift_hostname or openshift_public_ip/openshift_ip ?
Created attachment 1346383 [details] log
We're reverting the fatal error for 3.7 (https://github.com/openshift/origin/pull/17132). But it will be coming back once 3.7 branches. Note that the set of features broken by having an incorrect nodeIP includes at least: - all SDN traffic between pods on different nodes - pod liveness checks for pods that use hostports - automatic per-namespace egress IPs - ...? So you can get away with having misconfigured clusters when you're only testing features that aren't in that list, but we really shouldn't allow normal users to install clusters in this way.
(In reply to Dan Winship from comment #18)
> We're reverting the fatal error for 3.7
> (https://github.com/openshift/origin/pull/17132). But it will be coming back
> once 3.7 branches.
>
> Note that the set of features broken by having an incorrect nodeIP includes
> at least:
>
> - all SDN traffic between pods on different nodes
> - pod liveness checks for pods that use hostports
> - automatic per-namespace egress IPs
> - ...?
>
> So you can get away with having misconfigured clusters when you're only
> testing features that aren't in that list, but we really shouldn't allow
> normal users to install clusters in this way.

If QE's cluster is misconfigured, we will try our best to find a good way to get the cluster installed on OpenStack. This probably involves OpenStack network configuration; still under investigation.
Please test it on build 3.7.4-1 or a newer version.
Tested on the latest OCP:
openshift v3.7.5
kubernetes v1.7.6+a08f5eeb62

The node uses the host IP as the hostname in node-config.yaml and starts normally.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188