Bug 1293251 - Cannot access service endpoint between different nodes.
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Assigned To: Dan Winship
QA Contact: Meng Bo
Keywords: Regression, TestBlocker
Reported: 2015-12-21 03:46 EST by Johnny Liu
Modified: 2016-01-26 14:20 EST
CC: 6 users

Doc Type: Bug Fix
Last Closed: 2016-01-26 14:20:25 EST
Type: Bug


Attachments
debug log (272.99 KB, application/x-gzip)
2016-01-06 22:04 EST, Meng Bo
Description Johnny Liu 2015-12-21 03:46:06 EST
Description of problem:
After installation, the service endpoint cannot be accessed across nodes.

Version-Release number of selected component (if applicable):
atomic-openshift-3.1.1.0-1.git.0.8632732.el7aos.x86_64
AtomicOpenShift/3.1/2015-12-19.3

How reproducible:
Always

Steps to Reproduce:
1. Set up an environment: 1 master + 1 node
2. Install docker-registry or create a simple service, e.g.:
oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/list_for_pods.json
3. Get the service endpoint:
# oc get svc
NAME              CLUSTER_IP      EXTERNAL_IP   PORT(S)                 SELECTOR                  AGE
docker-registry   172.30.53.111   <none>        5000/TCP                docker-registry=default   2h
kubernetes        172.30.0.1      <none>        443/TCP,53/UDP,53/TCP   <none>                    2h
test-service      172.30.54.65    <none>        27017/TCP               name=test-pods            1h
4. On the master, access the service endpoint.
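The steps above can be scripted. A minimal sketch, assuming `oc` is already logged in to the cluster; `svc_ip` is a hypothetical helper (not from the report) that pulls a service's CLUSTER_IP out of `oc get svc` output:

```shell
# Hypothetical helper: print the CLUSTER_IP of the named service.
# $1 = service name; reads `oc get svc` output on stdin.
svc_ip() {
  awk -v name="$1" '$1 == name { print $2 }'
}

# Assumed reproduction flow (needs a running cluster, so left commented out):
# oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/list_for_pods.json
# ip=$(oc get svc | svc_ip test-service)
# curl --max-time 5 "$ip:27017"   # --max-time makes a broken datapath fail fast instead of hanging
```

The `--max-time` flag is just a convenience so the broken case returns an error instead of hanging until ^C as in the output below.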

Actual results:
No response
# curl 172.30.54.65:27017
^C
# curl 172.30.53.111:5000
^C

Expected results:
The service endpoint should be accessible.

Additional info:
The endpoint can be accessed on the node where the pod is running.
This issue does NOT happen on the released version, 3.1.0.4.
Comment 2 Dan Winship 2016-01-05 13:48:47 EST
Should be fixed by https://github.com/openshift/openshift-sdn/pull/236
Comment 3 Meng Bo 2016-01-06 03:24:08 EST
Checked on origin with the latest origin and openshift-sdn code; the issue can still be reproduced.

$ oc get po -o wide 
NAME            READY     STATUS    RESTARTS   AGE       NODE
test-rc-vra8y   1/1       Running   0          10m       node2.bmeng.local
test-rc-z2hv3   1/1       Running   0          10m       node3.bmeng.local

All the pods can be accessed from the nodes directly.

$ oc get svc
NAME           CLUSTER_IP      EXTERNAL_IP   PORT(S)     SELECTOR         AGE
test-service   172.30.19.173   <none>        27017/TCP   name=test-pods   11m

The service can be accessed only from node2 and node3.
And when accessing the service from node2 or node3, about 50% of the tries fail, presumably because the failed tries are load-balanced to the pod on the other node.
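That failure rate can be quantified with a quick probe loop. A sketch, assuming the `172.30.19.173:27017` service above; `count_failures` and `probe_http` are my names, not from the report, and the probe command is injectable so the counting logic is separable from the network call:

```shell
# Default probe: one HTTP request with a short timeout; non-zero exit = failure.
probe_http() { curl -s --max-time 3 -o /dev/null "$1"; }

# Count how many of $2 probes against $1 fail.
# $1 = host:port, $2 = number of attempts, $3 = probe command (default probe_http).
count_failures() {
  local target=$1 total=$2 probe=${3:-probe_http} fails=0 i
  for i in $(seq 1 "$total"); do
    "$probe" "$target" || fails=$((fails + 1))
  done
  echo "$fails"
}

# Usage against the service above (needs cluster access):
# count_failures 172.30.19.173:27017 20
```

A result near half the attempts would match the symptom of one of the two backing pods being unreachable.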


The OpenFlow rule below appears on all the nodes:
 cookie=0x0, duration=848.950s, table=3, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.1.0.0/24 actions=move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[],goto_table:6
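To compare that rule across nodes, the flow table can be dumped on each one. A sketch, assuming openshift-sdn's usual bridge name `br0`; `flow_nw_dst` is a hypothetical filter that makes the per-node table-3 rules easy to diff:

```shell
# Hypothetical filter: print the nw_dst subnet of each flow line fed on stdin.
flow_nw_dst() { sed -n 's@.*nw_dst=\([0-9./]*\).*@\1@p'; }

# Run on each node (assumes the openshift-sdn bridge is named br0):
# ovs-ofctl dump-flows br0 table=3 | flow_nw_dst
```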
Comment 4 Dan Winship 2016-01-06 08:56:49 EST
Hm... it works for me. Can you get debug.sh output? (https://raw.githubusercontent.com/openshift/openshift-sdn/master/hack/debug.sh)
Comment 5 Meng Bo 2016-01-06 22:04 EST
Created attachment 1112328 [details]
debug log

Here is the debug log from my env.
Comment 6 Meng Bo 2016-01-11 04:12:14 EST
I cannot reproduce this bug anymore with:

origin build:
# openshift version
openshift v1.1-730-gad80e1f
kubernetes v1.1.0-origin-1107-g4c8e6f4

openshift-sdn build:
da8ad5dc5c94012eb222221d909b2b6fa678500f
Comment 7 Meng Bo 2016-01-11 04:15:38 EST
Should be fixed by https://github.com/openshift/openshift-sdn/pull/237 ?
Comment 8 Johnny Liu 2016-01-12 00:50:53 EST
Re-tested this bug with atomic-openshift-sdn-ovs-3.1.1.1-1.git.0.dba03a7.el7aos.x86_64 in AtomicOpenShift/3.1/2016-01-11.1; it still does NOT work. It seems the fix PR has not been merged into OSE from upstream yet, so moving the status to MODIFIED.
Comment 9 Johnny Liu 2016-01-12 21:52:51 EST
To make this bug's status easier to track, moving it to ASSIGNED.
Comment 10 Eric Paris 2016-01-13 12:01:26 EST
Moving to ON_QA, as I'm told there was a new OSE build this morning.
Comment 12 Johnny Liu 2016-01-14 00:14:03 EST
Verified this bug with the AtomicOpenShift/3.1/2016-01-13.1 puddle, and it passes.


[root@openshift-125 ~]# curl 172.30.240.45:5000
[root@openshift-125 ~]# 

No hang is seen there.
Comment 14 errata-xmlrpc 2016-01-26 14:20:25 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:0070
