Bug 1293251 - Cannot access service endpoint between different nodes.
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Assigned To: Dan Winship
QA Contact: Meng Bo
Keywords: Regression, TestBlocker
Reported: 2015-12-21 03:46 EST by Johnny Liu
Modified: 2016-01-26 14:20 EST
CC: 6 users

Doc Type: Bug Fix
Last Closed: 2016-01-26 14:20:25 EST
Type: Bug


Attachments
debug log (272.99 KB, application/x-gzip)
2016-01-06 22:04 EST, Meng Bo
Description Johnny Liu 2015-12-21 03:46:06 EST
Description of problem:
After installation, the service endpoint cannot be accessed across nodes.

Version-Release number of selected component (if applicable):
atomic-openshift-3.1.1.0-1.git.0.8632732.el7aos.x86_64
AtomicOpenShift/3.1/2015-12-19.3

How reproducible:
Always

Steps to Reproduce:
1. Set up an environment: 1 master + 1 node
2. Install docker-registry or create a simple service, e.g.:
oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/list_for_pods.json
3. Get the service endpoint:
# oc get svc
NAME              CLUSTER_IP      EXTERNAL_IP   PORT(S)                 SELECTOR                  AGE
docker-registry   172.30.53.111   <none>        5000/TCP                docker-registry=default   2h
kubernetes        172.30.0.1      <none>        443/TCP,53/UDP,53/TCP   <none>                    2h
test-service      172.30.54.65    <none>        27017/TCP               name=test-pods            1h
4. On the master, access the service endpoint.
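The steps above can be scripted. A minimal sketch, assuming `oc` is already logged in to the cluster; `svc_ip` is a hypothetical helper (not from the report) that pulls a service's CLUSTER_IP out of `oc get svc` output:

```shell
# Hypothetical helper: print the CLUSTER_IP of the named service.
# $1 = service name; reads `oc get svc` output on stdin.
svc_ip() {
  awk -v name="$1" '$1 == name { print $2 }'
}

# Assumed reproduction flow (needs a running cluster, so left commented out):
# oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/list_for_pods.json
# ip=$(oc get svc | svc_ip test-service)
# curl --max-time 5 "$ip:27017"   # --max-time makes a broken datapath fail fast instead of hanging
```

The `--max-time` flag is just a convenience so the broken case returns an error instead of hanging until ^C as in the output below.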

Actual results:
No response
# curl 172.30.54.65:27017
^C
# curl 172.30.53.111:5000
^C

Expected results:
The service endpoint should be accessible.

Additional info:
The endpoint can be accessed on the node where the pod is running.
This issue does NOT happen on the released version, 3.1.0.4.
Comment 2 Dan Winship 2016-01-05 13:48:47 EST
Should be fixed by https://github.com/openshift/openshift-sdn/pull/236
Comment 3 Meng Bo 2016-01-06 03:24:08 EST
Checked on origin with the latest origin and openshift-sdn code; the issue can still be reproduced.

$ oc get po -o wide 
NAME            READY     STATUS    RESTARTS   AGE       NODE
test-rc-vra8y   1/1       Running   0          10m       node2.bmeng.local
test-rc-z2hv3   1/1       Running   0          10m       node3.bmeng.local

All the pods can be accessed from the nodes directly.

$ oc get svc
NAME           CLUSTER_IP      EXTERNAL_IP   PORT(S)     SELECTOR         AGE
test-service   172.30.19.173   <none>        27017/TCP   name=test-pods   11m

The service can be accessed only from node2 and node3.
And when accessing the service from node2 or node3, about 50% of the tries fail, presumably because the failed tries are load-balanced to the pod on the other node.
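That failure rate can be quantified with a quick probe loop. A sketch, assuming the `172.30.19.173:27017` service above; `count_failures` and `probe_http` are my names, not from the report, and the probe command is injectable so the counting logic is separable from the network call:

```shell
# Default probe: one HTTP request with a short timeout; non-zero exit = failure.
probe_http() { curl -s --max-time 3 -o /dev/null "$1"; }

# Count how many of $2 probes against $1 fail.
# $1 = host:port, $2 = number of attempts, $3 = probe command (default probe_http).
count_failures() {
  local target=$1 total=$2 probe=${3:-probe_http} fails=0 i
  for i in $(seq 1 "$total"); do
    "$probe" "$target" || fails=$((fails + 1))
  done
  echo "$fails"
}

# Usage against the service above (needs cluster access):
# count_failures 172.30.19.173:27017 20
```

A result near half the attempts would match the symptom of one of the two backing pods being unreachable.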


The OpenFlow rule below appears on all the nodes:
 cookie=0x0, duration=848.950s, table=3, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.1.0.0/24 actions=move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[],goto_table:6
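To compare that rule across nodes, the flow table can be dumped on each one. A sketch, assuming openshift-sdn's usual bridge name `br0`; `flow_nw_dst` is a hypothetical filter that makes the per-node table-3 rules easy to diff:

```shell
# Hypothetical filter: print the nw_dst subnet of each flow line fed on stdin.
flow_nw_dst() { sed -n 's@.*nw_dst=\([0-9./]*\).*@\1@p'; }

# Run on each node (assumes the openshift-sdn bridge is named br0):
# ovs-ofctl dump-flows br0 table=3 | flow_nw_dst
```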
Comment 4 Dan Winship 2016-01-06 08:56:49 EST
Hm... it works for me. Can you get debug.sh output? (https://raw.githubusercontent.com/openshift/openshift-sdn/master/hack/debug.sh)
Comment 5 Meng Bo 2016-01-06 22:04 EST
Created attachment 1112328 [details]
debug log

Here is the debug log from my env.
Comment 6 Meng Bo 2016-01-11 04:12:14 EST
I cannot reproduce this bug anymore with:

origin build:
# openshift version
openshift v1.1-730-gad80e1f
kubernetes v1.1.0-origin-1107-g4c8e6f4

openshift-sdn build:
da8ad5dc5c94012eb222221d909b2b6fa678500f
Comment 7 Meng Bo 2016-01-11 04:15:38 EST
Should be fixed by https://github.com/openshift/openshift-sdn/pull/237 ?
Comment 8 Johnny Liu 2016-01-12 00:50:53 EST
Re-tested this bug with atomic-openshift-sdn-ovs-3.1.1.1-1.git.0.dba03a7.el7aos.x86_64 in AtomicOpenShift/3.1/2016-01-11.1; it still does NOT work. It seems the fix PR has not been merged into OSE from upstream yet, so moving the status to MODIFIED.
Comment 9 Johnny Liu 2016-01-12 21:52:51 EST
To make this bug's status easier to track, moving it to ASSIGNED.
Comment 10 Eric Paris 2016-01-13 12:01:26 EST
Moving to ON_QA, as I'm told there was a new OSE build this morning.
Comment 12 Johnny Liu 2016-01-14 00:14:03 EST
Verified this bug with the AtomicOpenShift/3.1/2016-01-13.1 puddle, and it passes.


[root@openshift-125 ~]# curl 172.30.240.45:5000
[root@openshift-125 ~]# 

No hang is seen there.
Comment 14 errata-xmlrpc 2016-01-26 14:20:25 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:0070
