Description of problem:
After installation, the service endpoint cannot be accessed across nodes.

Version-Release number of selected component (if applicable):
atomic-openshift-3.1.1.0-1.git.0.8632732.el7aos.x86_64
AtomicOpenShift/3.1/2015-12-19.3

How reproducible:
Always

Steps to Reproduce:
1. Set up an env: 1 master + 1 node
2. Install docker-registry or create a simple service, e.g.:
   oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/list_for_pods.json
3. Get the service endpoints:
   # oc get svc
   NAME              CLUSTER_IP      EXTERNAL_IP   PORT(S)                 SELECTOR                  AGE
   docker-registry   172.30.53.111   <none>        5000/TCP                docker-registry=default   2h
   kubernetes        172.30.0.1      <none>        443/TCP,53/UDP,53/TCP   <none>                    2h
   test-service      172.30.54.65    <none>        27017/TCP               name=test-pods            1h
4. On the master, access the service endpoint.

Actual results:
No response:
# curl 172.30.54.65:27017
^C
# curl 172.30.53.111:5000
^C

Expected results:
The endpoint should be accessible.

Additional info:
The endpoint can be accessed on the node where the pod is running. This issue does NOT happen on the released version, 3.1.0.4.
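The manual curl-and-Ctrl-C check in the steps above can be sketched as a small helper. This is an illustrative sketch, not part of the report: the function name `check_endpoint` is made up, and `curl --max-time` stands in for the manual Ctrl-C so a hang shows up as a timeout instead.

```shell
#!/bin/sh
# Hypothetical helper for the reproduction steps above: probe a service
# endpoint and report whether it answered. A hung endpoint times out
# after 5 seconds (curl exit code 28) instead of needing Ctrl-C.
check_endpoint() {
    addr="$1"
    if curl -s --max-time 5 "$addr" >/dev/null 2>&1; then
        echo "$addr: reachable"
        return 0
    else
        echo "$addr: no response (curl exit $?)"
        return 1
    fi
}

# On the master this would be run against the service IPs quoted above,
# e.g.: check_endpoint 172.30.54.65:27017
```

With the bug present, running this on the master should print "no response" for both service IPs, while the same call succeeds on the node hosting the pod.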
Should be fixed by https://github.com/openshift/openshift-sdn/pull/236
Checked on origin with the latest origin and openshift-sdn code; the issue can still be reproduced.

$ oc get po -o wide
NAME            READY   STATUS    RESTARTS   AGE   NODE
test-rc-vra8y   1/1     Running   0          10m   node2.bmeng.local
test-rc-z2hv3   1/1     Running   0          10m   node3.bmeng.local

All the pods can be accessed from the nodes directly.

$ oc get svc
NAME           CLUSTER_IP      EXTERNAL_IP   PORT(S)     SELECTOR         AGE
test-service   172.30.19.173   <none>        27017/TCP   name=test-pods   11m

The service can be accessed only from node2 and node3. And even when accessing the service from node2 or node3, 50% of the tries fail, presumably because those tries are load-balanced to the pod on the other node.

The openflow rule below appears on all the nodes:
cookie=0x0, duration=848.950s, table=3, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.1.0.0/24 actions=move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[],goto_table:6
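For anyone comparing flow tables across nodes, the interesting fields of the rule quoted above can be pulled out with a short awk sketch. This is illustrative only: `parse_flow` is a made-up helper, and on a live node the input would come from `ovs-ofctl dump-flows br0`; here the quoted rule itself is used as sample input.

```shell
#!/bin/sh
# Sketch: extract the table number and destination subnet from a single
# `ovs-ofctl dump-flows` line, so the same rule can be compared across
# nodes. Splits on commas/spaces and picks out the table= and nw_dst=
# key=value fields.
parse_flow() {
    echo "$1" | awk -F'[ ,]+' '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^table=/)  { sub("table=", "", $i);  t = $i }
            if ($i ~ /^nw_dst=/) { sub("nw_dst=", "", $i); d = $i }
        }
        print "table=" t " nw_dst=" d
    }'
}

# Sample input: the rule quoted in this comment.
flow='cookie=0x0, duration=848.950s, table=3, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.1.0.0/24 actions=move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[],goto_table:6'
parse_flow "$flow"
```

If every node reports the same nw_dst subnet in table 3 (as observed above), the per-node pod subnets are not being distinguished, which matches the cross-node symptom.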
Hm... works for me. Can you get debug.sh output? (https://raw.githubusercontent.com/openshift/openshift-sdn/master/hack/debug.sh)
Created attachment 1112328 [details]
debug log

Here is the debug log from my env.
I cannot reproduce this bug anymore with this origin build:

# openshift version
openshift v1.1-730-gad80e1f
kubernetes v1.1.0-origin-1107-g4c8e6f4

openshift-sdn build: da8ad5dc5c94012eb222221d909b2b6fa678500f
Should be fixed by https://github.com/openshift/openshift-sdn/pull/237 ?
Re-tested this bug with atomic-openshift-sdn-ovs-3.1.1.1-1.git.0.dba03a7.el7aos.x86_64 in AtomicOpenShift/3.1/2016-01-11.1; it still does NOT work. It seems the fix PR has not been merged into OSE from upstream yet, so moving the status to MODIFIED.
To make it easier to track this bug's status, moving this bug to "ASSIGNED".
Moving to ON_QA, as I'm told there was a new OSE build this morning.
Verified this bug with the AtomicOpenShift/3.1/2016-01-13.1 puddle, and it PASSES.

[root@openshift-125 ~]# curl 172.30.240.45:5000
[root@openshift-125 ~]#

No hang is seen there.
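Since the original failure was intermittent (about half the tries failed when requests landed on the pod on the other node), a single curl can pass by luck. A repeated check is a stronger verification; this is a sketch only, with a made-up `count_failures` helper, and the service address used in the verification above would be the target.

```shell
#!/bin/sh
# Sketch: hit the endpoint several times and count failed tries, to
# catch the intermittent cross-node failure mode. Prints the number of
# failures; 0 means the service answered every time.
count_failures() {
    addr="$1"; tries="$2"; fails=0; i=0
    while [ "$i" -lt "$tries" ]; do
        curl -s --max-time 5 "$addr" >/dev/null 2>&1 || fails=$((fails + 1))
        i=$((i + 1))
    done
    echo "$fails"
}

# On the verified setup, something like the following should print 0:
# count_failures 172.30.240.45:5000 10
```

Ten consecutive successes makes a half-broken load-balancing path very unlikely, whereas one success does not.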
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:0070