Description of problem:
Two applications (A and B) run in the same namespace. After scaling down B, pods from application A no longer have connectivity to the pods of application B. However, connectivity through the service endpoint still works, as does connectivity from outside the pods.

Version-Release number of selected component (if applicable):
openshift v3.5.5.15
kubernetes v1.5.2+43a9be4
networkPlugin: ovs-multitenant

How reproducible:
Sometimes

Steps to Reproduce:
1. Create a project:
   $ oc new-project pod2pod

2. Create example Perl applications "panda" and "koala":
   $ oc new-app perl~https://github.com/openshift/dancer-ex.git --name=panda
   $ oc new-app perl~https://github.com/openshift/dancer-ex.git --name=koala

3. Label nodes to ensure each app is deployed on a different node:
   $ oc label node infra-0.rromerorhsso.quicklab.pnq2.cee.redhat.com app=panda
   $ oc label node node-0.rromerorhsso.quicklab.pnq2.cee.redhat.com app=koala

4. Patch both deployment configs:
   $ oc patch dc panda -p '{"spec": {"template": {"spec": {"nodeSelector": {"app": "panda"}}}}}'
   $ oc patch dc koala -p '{"spec": {"template": {"spec": {"nodeSelector": {"app": "koala"}}}}}'

5. Scale up koala and test connectivity:
   $ oc scale dc/koala --replicas=3
   $ for panda in `oc get po | grep Running | grep panda | awk '{print$1}'`; do for koala in `oc get po -o wide | grep Running | grep koala | awk '{print$6}'`; do echo "$panda to $koala"; oc exec $panda -- curl -ILs http://$koala:8080 ; done ; done

   panda-2-6292g to 10.130.0.47
   HTTP/1.1 200 OK
   Date: Tue, 23 May 2017 20:25:51 GMT
   Server: Apache/2.4.18 (Red Hat) mod_perl/2.0.9 Perl/v5.24.0
   Content-Length: 42494
   Content-Type: text/html; charset=UTF-8

   panda-2-6292g to 10.130.0.50
   HTTP/1.1 200 OK
   Date: Tue, 23 May 2017 20:25:51 GMT
   Server: Apache/2.4.18 (Red Hat) mod_perl/2.0.9 Perl/v5.24.0
   Content-Length: 42494
   Content-Type: text/html; charset=UTF-8

   panda-2-6292g to 10.130.0.49
   HTTP/1.1 200 OK
   Date: Tue, 23 May 2017 20:25:52 GMT
   Server: Apache/2.4.18 (Red Hat) mod_perl/2.0.9 Perl/v5.24.0
   Content-Length: 42494
   Content-Type: text/html; charset=UTF-8

6. Scale down to 1 and repeat the connectivity test:
   $ oc scale dc/koala --replicas=1
   $ for panda in `oc get po | grep Running | grep panda | awk '{print$1}'`; do for koala in `oc get po -o wide | grep Running | grep koala | awk '{print$6}'`; do echo "$panda to $koala"; oc exec $panda -- curl -ILs http://$koala:8080 ; done ; done

Actual results:
sh-4.2$ curl -IL http://10.129.0.40:8080
curl: (7) Failed connect to 10.129.0.40:8080; Connection timed out

Expected results:
sh-4.2$ curl -IL http://10.129.0.40:8080
HTTP/1.1 200 OK

Additional info:
I tried to replicate the problem in the same cluster and I couldn't. Moreover, after that attempt the initial project started working again and I have not been able to reproduce the issue since. Here is all the information I could retrieve.
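Since the symptom is that the service endpoint keeps answering while direct pod IPs time out, a quick way to see the contrast from a single panda pod is sketched below. This is our own helper loop, not part of the original report; it assumes the koala service created by oc new-app listens on 8080 and is resolvable by name from inside the pod, and the 5-second curl timeout is arbitrary.

PANDA=$(oc get po | awk '/panda.*Running/{print $1; exit}')

# Direct pod IPs -- these are the requests that time out when the bug hits.
for ip in $(oc get po -o wide | awk '/koala.*Running/{print $6}'); do
  echo "== pod IP $ip"
  oc exec "$PANDA" -- curl -ILs --max-time 5 "http://$ip:8080" | head -1
done

# Service endpoint -- per the description, this keeps working even while
# pod-to-pod traffic fails.
echo "== service koala"
oc exec "$PANDA" -- curl -ILs --max-time 5 "http://koala:8080" | head -1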
On infra-0 (inside the nodejs pod's network namespace):

[root@infra-0 ~]# docker inspect -f '{{.State.Pid}}' d493cfd23b27
122423
[root@infra-0 ~]# nsenter -n -t 122423
[root@infra-0 ~]# iptables-save > nodejs.iptables
[root@infra-0 ~]# tcpdump -i any -w nodejs.pcap
[root@infra-0 ~]# ip neigh
10.130.0.41 dev eth0 lladdr 7e:24:b3:77:cc:47 STALE
10.130.0.42 dev eth0 lladdr e6:98:2f:43:5e:bf STALE
10.130.0.38 dev eth0 lladdr 0a:85:3c:d2:52:8e STALE
10.129.0.1 dev eth0 lladdr 42:4a:16:78:ed:aa REACHABLE
10.130.0.39 dev eth0 lladdr 8e:1c:95:72:6d:56 STALE
10.130.0.40 dev eth0 lladdr d2:e7:78:9e:8c:20 STALE
[root@infra-0 ~]# ip route
default via 10.129.0.1 dev eth0
10.128.0.0/14 dev eth0
10.129.0.0/23 dev eth0 proto kernel scope link src 10.129.0.40
224.0.0.0/4 dev eth0

On node-0 (inside the dancer pod's network namespace):

[root@node-0 ~]# docker inspect -f '{{.State.Pid}}' 6c02fd0db3d4
4605
[root@node-0 ~]# nsenter -n -t 4605
[root@node-0 ~]# iptables-save > dancer.iptables
[root@node-0 ~]# tcpdump -i any -w dancer.pcap
[root@node-0 ~]# ip neigh
10.130.0.42 dev eth0 lladdr e6:98:2f:43:5e:bf STALE
10.129.0.43 dev eth0 lladdr ae:2c:78:8c:c3:8a STALE
10.129.0.44 dev eth0 lladdr 0e:45:54:ab:19:e7 STALE
10.129.0.41 dev eth0 lladdr 42:4c:90:91:32:01 STALE
10.130.0.1 dev eth0 lladdr fa:ea:35:8d:8e:ba STALE
10.129.0.42 dev eth0 lladdr 56:1d:83:69:00:84 STALE
10.129.0.40 dev eth0 lladdr 4a:3d:9c:2f:60:a7 STALE
10.129.0.1 dev eth0 lladdr 42:4a:16:78:ed:aa STALE
[root@node-0 ~]# ip route
default via 10.130.0.1 dev eth0
10.128.0.0/14 dev eth0
10.130.0.0/23 dev eth0 proto kernel scope link src 10.130.0.38
224.0.0.0/4 dev eth0
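The same capture can be scripted per container; a minimal sketch following the commands above (the helper name collect_pod_net, the 30-second capture window and the output file names are ours):

# collect_pod_net <container-id> <label>
# Resolves the container's PID, then runs each diagnostic inside its network
# namespace and saves the results on the host for later comparison.
collect_pod_net() {
  local cid=$1 label=$2
  local pid
  pid=$(docker inspect -f '{{.State.Pid}}' "$cid")
  nsenter -n -t "$pid" iptables-save > "${label}.iptables"
  nsenter -n -t "$pid" ip route      > "${label}.route"
  nsenter -n -t "$pid" ip neigh      > "${label}.neigh"
  nsenter -n -t "$pid" timeout 30 tcpdump -i any -w "${label}.pcap"
}

# e.g. on infra-0:
# collect_pod_net d493cfd23b27 nodejs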
As I said in the problem description, I managed to reproduce it the first time but not the second. The "connection timeout" in step 6 was taken from the initial reproducer with the "dancer" and "nodejs" applications, as is the output of the iptables, tcpdump, ip neigh and ip route commands.
Created attachment 1283466 [details] iptables 10.254.185.49
Created attachment 1283467 [details] iptables 10.254.250.55
Created attachment 1283468 [details] oadm diagnostics
Can reproduce this issue in my env:

Below are the steps I used:

oc create -f https://raw.githubusercontent.com/weliang1/Openshift_Networking/master/OCP/deployment-with-pod.yaml
oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/3b3859001d64e0a1aba78ff20646a2fc29078bf3/deployment/deployment-with-service.yaml

for pod in `oc get po | grep Running | grep hello-pod | awk '{print$1}'`; do for service in `oc get po -o wide | grep Running | grep openshift | awk '{print$6}'`; do echo "$pod to $service"; oc exec $pod -- curl -ILs http://$service:8080 ; done ; done

oc scale dc/hello-pod --replicas=5
oc scale dc/hello-openshift --replicas=5

for pod in `oc get po | grep Running | grep hello-pod | awk '{print$1}'`; do for service in `oc get po -o wide | grep Running | grep openshift | awk '{print$6}'`; do echo "$pod to $service"; oc exec $pod -- curl -ILs http://$service:8080 ; done ; done

oc scale dc/hello-pod --replicas=1
oc scale dc/hello-openshift --replicas=1

for pod in `oc get po | grep Running | grep hello-pod | awk '{print$1}'`; do for service in `oc get po -o wide | grep Running | grep openshift | awk '{print$6}'`; do echo "$pod to $service"; oc exec $pod -- curl -ILs http://$service:8080 ; done ; done

oc scale dc/hello-pod --replicas=5
oc scale dc/hello-openshift --replicas=5

for pod in `oc get po | grep Running | grep hello-pod | awk '{print$1}'`; do for service in `oc get po -o wide | grep Running | grep openshift | awk '{print$6}'`; do echo "$pod to $service"; oc exec $pod -- curl -ILs http://$service:8080 ; done ; done

oc rollout latest hello-openshift
oc rollout latest hello-pod

for pod in `oc get po | grep Running | grep hello-pod | awk '{print$1}'`; do for service in `oc get po -o wide | grep Running | grep openshift | awk '{print$6}'`; do echo "$pod to $service"; oc exec $pod -- curl -ILs http://$service:8080 ; done ; done

Even I scale high number as below still can not see the issue:

oc scale dc/hello-pod --replicas=20
oc scale dc/hello-openshift --replicas=20
oc rollout latest hello-openshift
oc rollout latest hello-pod
> Can reproduce this issue in my env:
> ...
> Even I scale high number as below still can not see the issue:

Did you mean to say "CAN'T reproduce this" in the first sentence?
Yes, I meant to say I CAN'T reproduce it in my env.
I reproduced this pod connectivity issue in my env after running a checking script instead of testing manually.
Reproduce Steps:

oc create -f https://raw.githubusercontent.com/weliang1/Openshift_Networking/master/OCP/deployment-with-pod.yaml
oc create -f https://raw.githubusercontent.com/weliang1/Openshift_Networking/master/OCP/test.yaml
sleep 10
oc scale dc/hello-pod --replicas=5
oc scale dc/hello-openshift --replicas=5
sleep 20
for pod in `oc get po | grep Running | grep hello-pod | awk '{print$1}'`; do for service in `oc get po -o wide | grep Running | grep openshift | awk '{print$6}'`; do echo "$pod to $service"; oc exec $pod -- curl -ILs http://$service:8080 ; done ; done

while true
do
  for pod in `oc get po | grep Running | grep hello-pod | awk '{print$1}'`; do for service in `oc get po -o wide | grep Running | grep openshift | awk '{print$6}'`; do echo "$pod to $service"; oc exec $pod -- curl -ILs http://$service:8080 ; done ; done
  oc rollout latest hello-openshift
  oc rollout latest hello-pod
  sleep 35
done
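When that loop runs for a long time the successful responses drown out the failures, so a variant that prints only the failing pairs is sketched below. This is our own wrapper around the same check; the check_pod_to_pod name and the 5-second curl timeout are assumptions, not part of the original script.

#!/bin/bash
# Print only the hello-pod -> hello-openshift pairs whose curl does not
# return "200 OK", so broken flows stand out during the rollout loop.
check_pod_to_pod() {
  local src dst
  for src in $(oc get po | grep Running | grep hello-pod | awk '{print $1}'); do
    for dst in $(oc get po -o wide | grep Running | grep openshift | awk '{print $6}'); do
      if ! oc exec "$src" -- curl -ILs --max-time 5 "http://$dst:8080" | grep -q '200 OK'; then
        echo "FAIL: $src -> $dst"
      fi
    done
  done
}

while true; do
  check_pod_to_pod
  oc rollout latest hello-openshift
  oc rollout latest hello-pod
  sleep 35
done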
My testing env: AWS, multitenant plugin, containerized, one master, two nodes. So far I cannot reproduce this issue when I use a NON-containerized env.
Please note that the original issue was reported against a setup that is _not_ containerized. So, whatever the race condition is, it may happen more often when containerized, but that's not the root cause of the problem.
Created attachment 1286179 [details] ovs-dump-pew05
Created attachment 1286180 [details] ovs-dump-azuur05
I have attached the output of the following command run on the two nodes:

# ovs-ofctl -O OpenFlow13 dump-flows br0

Source pod/node:
uzl-rhel-apache-ipam-115-d58qf   2/2   Running   0   20m   10.1.17.177   osclu1-azuur-05.uz.kuleuven.ac.be

Target pod/node:
uzl-rhel-perl-ipam-102-hbwhb     1/1   Running   0   20m   10.1.11.151   osclu1-pew-05.uz.kuleuven.ac.be

Node IPs:
osclu1-pew-05.uz.kuleuven.ac.be   = 10.254.185.49
osclu1-azuur-05.uz.kuleuven.ac.be = 10.254.250.55

As expected, pod-to-pod connectivity fails, but source-node-to-pod and target-node-to-pod connectivity works.
* Note that connectivity is affected in both directions.
Based on those traces, the OVS flow state is wrong in the same way it was in Weibin's case: the VNID for the project the pods are in is 0x39d500, and that VNID does not exist in table 80 of the ovs-dump-azuur05 dump.
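For reference, a quick way to confirm this on an affected node is sketched below. These are our own commands, not taken from the attachments; 0x39d500 is the VNID quoted above, and with the multitenant plugin the NetID shown by oc get netnamespaces is decimal, so it needs converting to hex before comparing.

# On the node: look for the project's VNID among the table 80 rules.
# No matching rule means traffic for that project is dropped by the
# isolation table.
ovs-ofctl -O OpenFlow13 dump-flows br0 table=80 | grep 0x39d500

# On a master: cross-check which NetID the project was assigned (decimal).
oc get netnamespaces | grep <project-name>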
https://github.com/openshift/origin/pull/14560
*** Bug 1452225 has been marked as a duplicate of this bug. ***
Tested and verified in "atomic-openshift-3.6.96-1.git.0.381dd63.el7" image
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1716