Description of problem:
First off, we have now experienced this issue at 3 different customer sites, and I believe we hit it once in our lab environment a few months back. The issue is that some nodes stop talking on the SDN. Looking a bit closer, we found that the routing table was different on these nodes: an "lbr0" entry was causing traffic for the assigned SDN subnet (10.1.x.0) to be routed incorrectly (see examples below). After rebooting the node, it came back up cleanly and everything worked.

Before host reboot:

# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         10.20.180.1     0.0.0.0         UG    0      0        0 eth0
10.1.0.0        0.0.0.0         255.255.0.0     U     0      0        0 tun0
10.1.6.0        0.0.0.0         255.255.255.0   U     0      0        0 lbr0
10.1.6.0        0.0.0.0         255.255.255.0   U     0      0        0 tun0
10.20.180.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
link-local      0.0.0.0         255.255.0.0     U     0      0        0 eth0
link-local      0.0.0.0         255.255.0.0     U     1002   0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0

After host reboot:

# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         10.20.180.1     0.0.0.0         UG    0      0        0 eth0
10.1.0.0        0.0.0.0         255.255.0.0     U     0      0        0 tun0
10.1.6.0        0.0.0.0         255.255.255.0   U     0      0        0 tun0
10.20.180.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
link-local      0.0.0.0         255.255.0.0     U     0      0        0 eth0
link-local      0.0.0.0         255.255.0.0     U     1002   0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0

Note: In some cases the reboot doesn't fix it (or multiple reboots are needed). In those cases manual steps have been taken to alter the routing table (see the workaround sketch at the end of this report), but this shouldn't be necessary.

Version-Release number of selected component (if applicable):
Appears to be a problem with both 3.0.1 and 3.0.2

How reproducible:
Not sure - it seems to happen "randomly"

Steps to Reproduce:
1. See above

Actual results:
A non-working routing table that includes the extra "lbr0" entry for the SDN subnet.

Expected results:
See description above - a routing table without the "lbr0" entry.

Additional info:
There's a customer support ticket open for this case as well: https://access.redhat.com/support/cases/#/case/01527050
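For reference, the manual workaround amounts to removing the stale lbr0 route so traffic for the SDN subnet falls through to the tun0 route again. A minimal sketch, assuming the node's assigned subnet is 10.1.6.0/24 as in the output above (substitute the affected node's subnet); this is illustrative, not the exact steps taken at the customer sites:

# route -n | grep lbr0
10.1.6.0        0.0.0.0         255.255.255.0   U     0      0        0 lbr0
# ip route del 10.1.6.0/24 dev lbr0

Note this only clears the symptom on a running node; it doesn't address whatever left the stale lbr0 route behind in the first place.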
This ought to be fixed by https://github.com/openshift/openshift-sdn/pull/193
Fixed in https://github.com/openshift/openshift-sdn/pull/196 and https://github.com/openshift/openshift-sdn/pull/193
This issue occurred again today. This is the second time it has occurred for me; on both occasions the systems had SELinux disabled, which meant the OpenShift installer needed to be rerun. Not sure if this was a contributing factor to this issue.
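In case it helps correlate reports, a quick way to capture the SELinux state alongside the routing table the next time this is hit (standard commands, nothing OpenShift-specific):

# getenforce
# route -n | grep lbr0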
This fix is available in OpenShift Enterprise 3.1.