Bug 1293251 - Cannot access service endpoint between different nodes.
Summary: Cannot access service endpoint between different nodes.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Dan Winship
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-12-21 08:46 UTC by Johnny Liu
Modified: 2016-01-26 19:20 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-01-26 19:20:25 UTC
Target Upstream Version:
Embargoed:


Attachments
debug log (272.99 KB, application/x-gzip)
2016-01-07 03:04 UTC, Meng Bo


Links
Red Hat Product Errata RHSA-2016:0070, SHIPPED_LIVE: Important: Red Hat OpenShift Enterprise 3.1.1 bug fix and enhancement update (2016-01-27 00:12:41 UTC)

Description Johnny Liu 2015-12-21 08:46:06 UTC
Description of problem:
After installation, the service endpoint cannot be accessed across nodes.

Version-Release number of selected component (if applicable):
atomic-openshift-3.1.1.0-1.git.0.8632732.el7aos.x86_64
AtomicOpenShift/3.1/2015-12-19.3

How reproducible:
Always

Steps to Reproduce:
1. Set up an env: 1 master + 1 node
2. Install docker-registry or create a simple service, e.g.:
oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/list_for_pods.json
3. Get the service endpoint:
# oc get svc
NAME              CLUSTER_IP      EXTERNAL_IP   PORT(S)                 SELECTOR                  AGE
docker-registry   172.30.53.111   <none>        5000/TCP                docker-registry=default   2h
kubernetes        172.30.0.1      <none>        443/TCP,53/UDP,53/TCP   <none>                    2h
test-service      172.30.54.65    <none>        27017/TCP               name=test-pods            1h
4. On the master, access the service endpoint.

Actual results:
No response
# curl 172.30.54.65:27017
^C
# curl 172.30.53.111:5000
^C

Expected results:
The service endpoint should be accessible.

Additional info:
The endpoint can be accessed on the node where the pod is running.
This issue does NOT happen on the released version, 3.1.0.4.
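For quick re-testing, the probe in the steps above can be scripted with a short timeout so a dead endpoint fails fast instead of hanging like the curl calls above (a sketch; the cluster IPs and ports are the example values from this report and will differ per environment):

```shell
#!/usr/bin/env bash
# Probe a TCP service endpoint with a 2-second timeout instead of
# letting curl hang forever (the failure mode seen in this bug).
probe() {
    local host=${1%:*} port=${1#*:}
    # /dev/tcp/<host>/<port> is a bash pseudo-device that opens a TCP
    # connection; `timeout` aborts the attempt if nothing answers.
    if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "OK: $1"
    else
        echo "no response from $1"
    fi
}

# Cluster IPs taken from the `oc get svc` output above.
probe 172.30.54.65:27017     # test-service
probe 172.30.53.111:5000     # docker-registry
```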

Comment 2 Dan Winship 2016-01-05 18:48:47 UTC
Should be fixed by https://github.com/openshift/openshift-sdn/pull/236

Comment 3 Meng Bo 2016-01-06 08:24:08 UTC
Checked on origin with the latest origin and openshift-sdn code; the issue can still be reproduced.

$ oc get po -o wide 
NAME            READY     STATUS    RESTARTS   AGE       NODE
test-rc-vra8y   1/1       Running   0          10m       node2.bmeng.local
test-rc-z2hv3   1/1       Running   0          10m       node3.bmeng.local

All the pods can be accessed directly from the nodes.

$ oc get svc
NAME           CLUSTER_IP      EXTERNAL_IP   PORT(S)     SELECTOR         AGE
test-service   172.30.19.173   <none>        27017/TCP   name=test-pods   11m

The service can be accessed only from node2 and node3.
Even there, about 50% of tries fail when accessing the service; presumably the failed tries are the ones directed to the pod on the other node.


The OpenFlow rule below appears on all the nodes:
 cookie=0x0, duration=848.950s, table=3, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.1.0.0/24 actions=move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[],goto_table:6
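For reference, these rules can be inspected per node with ovs-ofctl (a diagnostic sketch; br0 is the bridge name openshift-sdn creates, and the OpenFlow13 flag is assumed here because the rules use goto_table):

```shell
# On each node, dump the flows openshift-sdn programmed in table 3.
# Requires root; br0 is the SDN bridge created by openshift-sdn.
ovs-ofctl -O OpenFlow13 dump-flows br0 table=3

# Expect one priority=100 rule per node subnet (e.g. nw_dst=10.1.0.0/24)
# that copies the VXLAN tunnel ID into REG0 and jumps to table 6, like
# the rule quoted above. Note that n_packets=0 on a rule that should be
# matching cross-node traffic is a hint the packets never arrive.
```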

Comment 4 Dan Winship 2016-01-06 13:56:49 UTC
Hm... works for me. Can you get debug.sh output? (https://raw.githubusercontent.com/openshift/openshift-sdn/master/hack/debug.sh)

Comment 5 Meng Bo 2016-01-07 03:04:17 UTC
Created attachment 1112328 [details]
debug log

Here is the debug log from my env.

Comment 6 Meng Bo 2016-01-11 09:12:14 UTC
I cannot reproduce this bug anymore with 

origin build:
# openshift version
openshift v1.1-730-gad80e1f
kubernetes v1.1.0-origin-1107-g4c8e6f4

openshift-sdn build:
da8ad5dc5c94012eb222221d909b2b6fa678500f

Comment 7 Meng Bo 2016-01-11 09:15:38 UTC
Should be fixed by https://github.com/openshift/openshift-sdn/pull/237 ?

Comment 8 Johnny Liu 2016-01-12 05:50:53 UTC
Re-tested this bug with atomic-openshift-sdn-ovs-3.1.1.1-1.git.0.dba03a7.el7aos.x86_64 in AtomicOpenShift/3.1/2016-01-11.1; it still does NOT work. It seems the fix PR has not been merged into OSE from upstream yet, so moving the status to MODIFIED.

Comment 9 Johnny Liu 2016-01-13 02:52:51 UTC
To make this bug's status easier to track, moving it to "ASSIGNED".

Comment 10 Eric Paris 2016-01-13 17:01:26 UTC
Moving to ON_QA as I'm told there was a new OSE build this morning.

Comment 12 Johnny Liu 2016-01-14 05:14:03 UTC
Verified this bug with the AtomicOpenShift/3.1/2016-01-13.1 puddle; it PASSES.


[root@openshift-125 ~]# curl 172.30.240.45:5000
[root@openshift-125 ~]# 

No hang is seen there.

Comment 14 errata-xmlrpc 2016-01-26 19:20:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:0070

