Description of problem:
First off, we have now experienced this issue at 3 different customer sites, and I believe we hit it once in our lab environment a few months back. The issue is that some nodes stop talking on the SDN. Looking a bit closer, we found that the routing table was different on these nodes: an "lbr0" entry was causing traffic for the assigned SDN subnet (10.1.x.0) to be routed incorrectly (see examples below). After rebooting the node, it came back up cleanly and everything worked.

Before host reboot:

# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         10.20.180.1     0.0.0.0         UG    0      0        0 eth0
10.1.0.0        0.0.0.0         255.255.0.0     U     0      0        0 tun0
10.1.6.0        0.0.0.0         255.255.255.0   U     0      0        0 lbr0
10.1.6.0        0.0.0.0         255.255.255.0   U     0      0        0 tun0
10.20.180.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
link-local      0.0.0.0         255.255.0.0     U     0      0        0 eth0
link-local      0.0.0.0         255.255.0.0     U     1002   0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0

After host reboot:

# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         10.20.180.1     0.0.0.0         UG    0      0        0 eth0
10.1.0.0        0.0.0.0         255.255.0.0     U     0      0        0 tun0
10.1.6.0        0.0.0.0         255.255.255.0   U     0      0        0 tun0
10.20.180.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
link-local      0.0.0.0         255.255.0.0     U     0      0        0 eth0
link-local      0.0.0.0         255.255.0.0     U     1002   0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0

Note: In some cases the reboot doesn't fix it (or multiple reboots are needed). In those cases manual steps have been taken to alter the routing table (see the workaround sketch at the end of this report), but this shouldn't be necessary.

Version-Release number of selected component (if applicable):
Appears to be a problem with both 3.0.1 and 3.0.2

How reproducible:
Not sure - it seems to happen "randomly"

Steps to Reproduce:
1. See above

Actual results:
A non-working routing table that includes the extra "lbr0" entry for the SDN subnet.

Expected results:
See description above - a routing table without the "lbr0" entry.

Additional info:
There's a customer support ticket open for this case as well: https://access.redhat.com/support/cases/#/case/01527050
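For reference, the manual workaround amounts to removing the stale lbr0 route so traffic for the SDN subnet falls through to the tun0 route again. A minimal sketch, assuming the node's assigned subnet is 10.1.6.0/24 as in the output above (substitute the affected node's subnet); this is illustrative, not the exact steps taken at the customer sites:

# route -n | grep lbr0
10.1.6.0        0.0.0.0         255.255.255.0   U     0      0        0 lbr0
# ip route del 10.1.6.0/24 dev lbr0

Note this only clears the symptom on a running node; it doesn't address whatever left the stale lbr0 route behind in the first place.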
This ought to be fixed by https://github.com/openshift/openshift-sdn/pull/193
Fixed in https://github.com/openshift/openshift-sdn/pull/196 and https://github.com/openshift/openshift-sdn/pull/193
This issue occurred again today. This is the second time it has occurred for me; on both occasions the systems had SELinux disabled, which meant the OpenShift installer needed to be rerun. Not sure if this was a contributing factor to this issue.
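In case it helps correlate reports, a quick way to capture the SELinux state alongside the routing table the next time this is hit (standard commands, nothing OpenShift-specific):

# getenforce
# route -n | grep lbr0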
This fix is available in OpenShift Enterprise 3.1.