Bug 1275388

Summary: Issue with nodes on SDN with inconsistent connectivity

Product: OpenShift Container Platform
Component: Networking
Version: 3.0.0
Hardware: All
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Target Milestone: ---
Target Release: ---
Reporter: Øystein Bedin <obedin>
Assignee: Ravi Sankar <rpenta>
QA Contact: Meng Bo <bmeng>
Docs Contact:
CC: aos-bugs, bvincell, danw, imckinle, jeder, jialiu, obedin, rhowe
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Last Closed: 2015-11-23 14:26:00 UTC

Description Øystein Bedin 2015-10-26 18:29:41 UTC
Description of problem:
First off, we have now experienced this issue at 3 different customer sites, and I believe we have hit it once in our lab environment a few months back.

The issue is that some nodes are not communicating on the SDN. Looking a bit closer at this, we found that the routing table was different on the affected nodes - i.e. an extra "lbr0" entry was causing traffic for the node's assigned SDN subnet (10.1.x.0) to be routed incorrectly (see examples below). After rebooting the node, it came back up cleanly and everything worked.

Before host reboot:
# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         10.20.180.1     0.0.0.0         UG    0      0        0 eth0
10.1.0.0        0.0.0.0         255.255.0.0     U     0      0        0 tun0
10.1.6.0        0.0.0.0         255.255.255.0   U     0      0        0 lbr0
10.1.6.0        0.0.0.0         255.255.255.0   U     0      0        0 tun0
10.20.180.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
link-local      0.0.0.0         255.255.0.0     U     0      0        0 eth0
link-local      0.0.0.0         255.255.0.0     U     1002   0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0

After host reboot:
# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         10.20.180.1     0.0.0.0         UG    0      0        0 eth0
10.1.0.0        0.0.0.0         255.255.0.0     U     0      0        0 tun0
10.1.6.0        0.0.0.0         255.255.255.0   U     0      0        0 tun0
10.20.180.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
link-local      0.0.0.0         255.255.0.0     U     0      0        0 eth0
link-local      0.0.0.0         255.255.0.0     U     1002   0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0

Note: In some cases the reboot doesn't fix it (or multiple reboots are needed). In those cases manual steps have been taken to alter the routing table, but this shouldn't be necessary.
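For illustration only, here is a minimal sketch of the kind of manual step referenced above, assuming the stale route is the duplicate 10.1.6.0/24 entry on lbr0 shown in the "Before host reboot" output (the subnet varies per node, and the exact workaround applied at the customer sites is not recorded here):

# ip route show | grep lbr0
# ip route del 10.1.6.0/24 dev lbr0

The first command checks whether the node still has an SDN subnet route via lbr0; the second removes that route, leaving the tun0 route in place so the table matches the "After host reboot" output. This is only a temporary workaround sketch, not the underlying fix.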


Version-Release number of selected component (if applicable):
Appears to be a problem with both 3.0.1 and 3.0.2


How reproducible:
Not sure - it seems to "randomly" happen


Steps to Reproduce:
1. See above

Actual results:
A non-working routing table that includes an extra "lbr0" entry for the node's assigned SDN subnet.


Expected results:
A routing table without the extra "lbr0" entry, as shown in the "After host reboot" output above.


Additional info:
There's a customer support ticket open for this case as well:
https://access.redhat.com/support/cases/#/case/01527050

Comment 2 Dan Winship 2015-10-26 20:33:04 UTC
This ought to be fixed by https://github.com/openshift/openshift-sdn/pull/193

Comment 4 Ivan 2015-10-28 10:55:34 UTC
This issue occurred again today.
This is the second time it has occurred for me; on both occasions the systems had SELinux disabled, which meant the OpenShift installer needed to be rerun. Not sure if this was a contributing factor to this issue.

Comment 8 Brenton Leanhardt 2015-11-23 14:26:00 UTC
This fix is available in OpenShift Enterprise 3.1.