Bug 1293251

Summary:

Cannot access service endpoint between different nodes.

Product:

OpenShift Container Platform

Reporter:

Johnny Liu <jialiu>

Component:

Networking

Assignee:

Dan Winship <danw>

Status:

CLOSED ERRATA

QA Contact:

Meng Bo <bmeng>

Severity:

urgent

Docs Contact:

Priority:

urgent

Version:

3.1.0

CC:

aos-bugs, bleanhar, eparis, haowang, jkrieger, jokerman

Target Milestone:

---

Keywords:

Regression, TestBlocker

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2016-01-26 19:20:25 UTC

Type:

Bug

Regression:

---

Attachments:

Description	Flags
debug log	none

Description Johnny Liu 2015-12-21 08:46:06 UTC

Description of problem:
After installation, service endpoints cannot be accessed across nodes.

Version-Release number of selected component (if applicable):
atomic-openshift-3.1.1.0-1.git.0.8632732.el7aos.x86_64
AtomicOpenShift/3.1/2015-12-19.3

How reproducible:
Always

Steps to Reproduce:
1. Set up an environment: 1 master + 1 node.
2. Install docker-registry or create a simple service, e.g.:
oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/list_for_pods.json
3. Get the service endpoints:
# oc get svc
NAME              CLUSTER_IP      EXTERNAL_IP   PORT(S)                 SELECTOR                  AGE
docker-registry   172.30.53.111   <none>        5000/TCP                docker-registry=default   2h
kubernetes        172.30.0.1      <none>        443/TCP,53/UDP,53/TCP   <none>                    2h
test-service      172.30.54.65    <none>        27017/TCP               name=test-pods            1h
4. On the master, access the service endpoints.

Actual results:
No response; curl hangs until interrupted:
# curl 172.30.54.65:27017
^C
# curl 172.30.53.111:5000
^C

Expected results:
The service endpoints should be accessible and respond successfully.

Additional info:
The endpoint can be accessed from the node where the pod is running.
This issue does NOT occur in the released version, 3.1.0.4.
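
The manual curl checks above can be bounded with a timeout so that a broken path is reported rather than left hanging. The following is a minimal sketch, not part of the original report: the function names svc_tcp_endpoints and probe_endpoints are hypothetical, and it assumes the column layout of `oc get svc` shown in step 3.

```shell
#!/bin/sh
# Hypothetical helper: read `oc get svc` tabular output on stdin and print
# one "CLUSTER_IP:PORT" line per TCP port (column layout assumed from step 3).
svc_tcp_endpoints() {
    awk 'NR > 1 {
        n = split($4, ports, ",")
        for (i = 1; i <= n; i++) {
            if (ports[i] ~ /\/TCP$/) {
                sub(/\/TCP$/, "", ports[i])
                print $2 ":" ports[i]
            }
        }
    }'
}

# Hypothetical helper: probe each endpoint read from stdin with a 5 second
# limit. curl exits with 28 on timeout and 7 when the connection fails.
probe_endpoints() {
    while read -r ep; do
        if curl --silent --output /dev/null --max-time 5 "$ep"; then
            echo "OK   $ep"
        else
            echo "FAIL $ep (curl exit $?)"
        fi
    done
}

# Usage on the master or a node (requires a logged-in oc client):
#   oc get svc | svc_tcp_endpoints | probe_endpoints
```

A non-zero curl exit other than 28 or 7 may only mean that the service does not speak HTTP, so the exit code is printed instead of being interpreted.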

Comment 2 Dan Winship 2016-01-05 18:48:47 UTC

Should be fixed by https://github.com/openshift/openshift-sdn/pull/236

Comment 3 Meng Bo 2016-01-06 08:24:08 UTC

Checked on origin with the latest origin and openshift-sdn code; the issue can still be reproduced.

$ oc get po -o wide 
NAME            READY     STATUS    RESTARTS   AGE       NODE
test-rc-vra8y   1/1       Running   0          10m       node2.bmeng.local
test-rc-z2hv3   1/1       Running   0          10m       node3.bmeng.local

All the pods can be accessed directly from the nodes.

$ oc get svc
NAME           CLUSTER_IP      EXTERNAL_IP   PORT(S)     SELECTOR         AGE
test-service   172.30.19.173   <none>        27017/TCP   name=test-pods   11m

The service can be accessed only from node2 and node3.
When accessing the service from node2 or node3, about 50% of attempts fail, possibly because the failed attempts are routed to the pod on the other node.
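
To put a number on the intermittent failures, the request can be repeated and the failures counted. This is a rough sketch only; count_failures is a hypothetical helper, and the attempt count and timeout are arbitrary choices.

```shell
#!/bin/sh
# Hypothetical helper: run a command N times and print "<failures>/<N>".
count_failures() {
    total=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$total" ]; do
        # Discard output; only the exit status matters here.
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures/$total"
}

# Usage on node2 or node3, with the service IP and port from the output above:
#   count_failures 20 curl --silent --max-time 3 172.30.19.173:27017
```

With two pods behind the service, a result near half of the attempts would be consistent with the guess that only requests sent to the remote pod fail.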


The following OpenFlow rule appears on all nodes:
 cookie=0x0, duration=848.950s, table=3, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.1.0.0/24 actions=move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[],goto_table:6
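
The rule quoted above has n_packets=0, meaning no packet has matched it so far. A small sketch for listing such rules follows; unmatched_flows is a hypothetical helper, and the bridge name br0 and the OpenFlow 1.3 flag in the usage line are assumptions about how openshift-sdn sets up Open vSwitch on the node.

```shell
#!/bin/sh
# Hypothetical helper: from `ovs-ofctl dump-flows` output on stdin, keep only
# the flows whose packet counter is still zero.
unmatched_flows() {
    grep -E '(^|[ ,])n_packets=0,'
}

# Usage on a node (bridge name and protocol version are assumptions):
#   ovs-ofctl -O OpenFlow13 dump-flows br0 | unmatched_flows
```

A zero counter alone does not prove a rule is wrong; it only shows which rules the test traffic never reached.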

Comment 4 Dan Winship 2016-01-06 13:56:49 UTC

hm... works for me... can you get debug.sh output? (https://raw.githubusercontent.com/openshift/openshift-sdn/master/hack/debug.sh)

Comment 5 Meng Bo 2016-01-07 03:04:17 UTC

Created attachment 1112328 [details]
debug log

Here is the debug log from my env.

Comment 6 Meng Bo 2016-01-11 09:12:14 UTC

I cannot reproduce this bug anymore with 

origin build:
# openshift version
openshift v1.1-730-gad80e1f
kubernetes v1.1.0-origin-1107-g4c8e6f4

openshift-sdn build:
da8ad5dc5c94012eb222221d909b2b6fa678500f

Comment 7 Meng Bo 2016-01-11 09:15:38 UTC

Should be fixed by https://github.com/openshift/openshift-sdn/pull/237 ?

Comment 8 Johnny Liu 2016-01-12 05:50:53 UTC

Retested this bug with atomic-openshift-sdn-ovs-3.1.1.1-1.git.0.dba03a7.el7aos.x86_64 in AtomicOpenShift/3.1/2016-01-11.1; it still does NOT work. It seems the fix PR has not yet been merged into OSE from upstream, so I am moving the status to MODIFIED.

Comment 9 Johnny Liu 2016-01-13 02:52:51 UTC

To make this bug's status easier to track, I am moving it to "ASSIGNED".

Comment 10 Eric Paris 2016-01-13 17:01:26 UTC

Moving to ON_QA, as I'm told there was a new OSE build this morning.

Comment 12 Johnny Liu 2016-01-14 05:14:03 UTC

Verified this bug with the AtomicOpenShift/3.1/2016-01-13.1 puddle; it passes.


[root@openshift-125 ~]# curl 172.30.240.45:5000
[root@openshift-125 ~]# 

No hang is seen there.

Comment 14 errata-xmlrpc 2016-01-26 19:20:25 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:0070