Bug 1454948
| Summary: | pod-to-pod connectivity lost after rescaling with ovs-multitenant | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ruben Romero Montes <rromerom> |
| Component: | Networking | Assignee: | Ben Bennett <bbennett> |
| Status: | CLOSED ERRATA | QA Contact: | Meng Bo <bmeng> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 3.5.0 | CC: | akaiser, aos-bugs, bbennett, danw, dcbw, eparis, gsapienz, javier.ramirez, jkaur, mark.vinkx, mifiedle, misalunk, nbhatt, pdwyer, rhowe, sjr, smunilla, tcarlin, tibrahim, tmanor, vcorrea, wabouham, weliang |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | aos-scalability-36 | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: VNID allow rules were being removed from OVS before they were actually unused; container startup errors could cause the VNID reference tracking to get out of sync. | | |
| | Consequence: The rules that allowed communication for a namespace were removed too early, so pods from that namespace still running on the node could not communicate with one another. | | |
| | Fix: The tracking was reworked to avoid the edge cases around pod creation and deletion failures. | | |
| | Result: VNID tracking no longer fails, so traffic flows. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1462338 (view as bug list) | Environment: | |
| Last Closed: | 2017-08-10 05:25:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1267746, 1462338 | | |
| Attachments: | | | |
Description - Ruben Romero Montes - 2017-05-23 20:49:43 UTC
As I said in the problem description, I managed to reproduce it the first time but not the second. The "connection timeout" in step 6 was taken from the initial reproducer with the "dancer" and "nodejs" applications, as was the output of the iptables, tcpdump, ip neigh and ip route commands.

Created attachment 1283466 [details]
iptables 10.254.185.49
Created attachment 1283467 [details]
iptables 10.254.250.55
Created attachment 1283468 [details]
oadm diagnostics
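The exact commands behind the attachments above are not recorded in this report; something along the following lines would gather comparable per-node data (iptables rules, routing and neighbor tables, a short VXLAN capture, and "oadm diagnostics"). This is only a sketch: the interface name and output paths are placeholders, and it assumes it is run as root on the node itself.

```bash
#!/bin/bash
# Sketch only: gather per-node networking state similar to the attachments
# referenced above. CAPTURE_IF is a placeholder for the node's uplink interface.
CAPTURE_IF="eth0"
OUT_DIR="/tmp/sdn-diag-$(hostname)"
mkdir -p "$OUT_DIR"

iptables-save > "$OUT_DIR/iptables.txt"
ip route      > "$OUT_DIR/ip-route.txt"
ip neigh      > "$OUT_DIR/ip-neigh.txt"

# OpenShift SDN traffic between nodes is VXLAN-encapsulated on UDP 4789;
# capture 30 seconds of it while reproducing the failing curl.
timeout 30 tcpdump -i "$CAPTURE_IF" -nn -w "$OUT_DIR/vxlan.pcap" udp port 4789 || true

# Cluster-level health check, as in the "oadm diagnostics" attachment.
oadm diagnostics > "$OUT_DIR/oadm-diagnostics.txt" 2>&1 || true
```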
Can reproduce this issue in my env. Below are the steps I used:

oc create -f https://raw.githubusercontent.com/weliang1/Openshift_Networking/master/OCP/deployment-with-pod.yaml
oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/3b3859001d64e0a1aba78ff20646a2fc29078bf3/deployment/deployment-with-service.yaml

for pod in `oc get po | grep Running | grep hello-pod | awk '{print$1}'`; do for service in `oc get po -o wide | grep Running | grep openshift | awk '{print$6}'`; do echo "$pod to $service"; oc exec $pod -- curl -ILs http://$service:8080 ; done ; done

oc scale dc/hello-pod --replicas=5
oc scale dc/hello-openshift --replicas=5

for pod in `oc get po | grep Running | grep hello-pod | awk '{print$1}'`; do for service in `oc get po -o wide | grep Running | grep openshift | awk '{print$6}'`; do echo "$pod to $service"; oc exec $pod -- curl -ILs http://$service:8080 ; done ; done

oc scale dc/hello-pod --replicas=1
oc scale dc/hello-openshift --replicas=1

for pod in `oc get po | grep Running | grep hello-pod | awk '{print$1}'`; do for service in `oc get po -o wide | grep Running | grep openshift | awk '{print$6}'`; do echo "$pod to $service"; oc exec $pod -- curl -ILs http://$service:8080 ; done ; done

oc scale dc/hello-pod --replicas=5
oc scale dc/hello-openshift --replicas=5

for pod in `oc get po | grep Running | grep hello-pod | awk '{print$1}'`; do for service in `oc get po -o wide | grep Running | grep openshift | awk '{print$6}'`; do echo "$pod to $service"; oc exec $pod -- curl -ILs http://$service:8080 ; done ; done

oc rollout latest hello-openshift
oc rollout latest hello-pod

for pod in `oc get po | grep Running | grep hello-pod | awk '{print$1}'`; do for service in `oc get po -o wide | grep Running | grep openshift | awk '{print$6}'`; do echo "$pod to $service"; oc exec $pod -- curl -ILs http://$service:8080 ; done ; done

Even when I scale to a higher number, as below, I still cannot see the issue:

oc scale dc/hello-pod --replicas=20
oc scale dc/hello-openshift --replicas=20
oc rollout latest hello-openshift
oc rollout latest hello-pod

> Can reproduce this issue in my env: ...
> Even I scale high number as below still can not see the issue:

Did you mean to say "CAN'T reproduce this" in the first sentence?

Yes, I meant to say I CAN'T reproduce it in my env.

I can reproduce this pod connectivity issue in my env after running a checking script instead of testing manually. Reproduce steps:

oc create -f https://raw.githubusercontent.com/weliang1/Openshift_Networking/master/OCP/deployment-with-pod.yaml
oc create -f https://raw.githubusercontent.com/weliang1/Openshift_Networking/master/OCP/test.yaml
sleep 10
oc scale dc/hello-pod --replicas=5
oc scale dc/hello-openshift --replicas=5
sleep 20

for pod in `oc get po | grep Running | grep hello-pod | awk '{print$1}'`; do for service in `oc get po -o wide | grep Running | grep openshift | awk '{print$6}'`; do echo "$pod to $service"; oc exec $pod -- curl -ILs http://$service:8080 ; done ; done

while true
do
  for pod in `oc get po | grep Running | grep hello-pod | awk '{print$1}'`; do for service in `oc get po -o wide | grep Running | grep openshift | awk '{print$6}'`; do echo "$pod to $service"; oc exec $pod -- curl -ILs http://$service:8080 ; done ; done
  oc rollout latest hello-openshift
  oc rollout latest hello-pod
  sleep 35
done

My testing env: AWS, multitenant plugin, containerized, one master, two nodes. So far I cannot reproduce this issue when I use a NON-containerized env.
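The repeated one-line check from the reproduction steps above can be wrapped in a small script for repeated runs. This is only a convenience sketch derived from those commands; the pod name filters (hello-pod, openshift), the port 8080, and the 5-second timeout are taken from or added to that reproducer and would need adjusting for other workloads.

```bash
#!/bin/bash
# Sketch of a reusable pod-to-pod connectivity check, based on the one-line
# loop in the reproduction steps above. Assumes the "hello-pod" and
# "hello-openshift" deployments from the reproducer and HTTP on port 8080.
set -u

check_connectivity() {
    local failures=0
    # Source pods: every Running hello-pod pod.
    for pod in $(oc get po | grep Running | grep hello-pod | awk '{print $1}'); do
        # Target IPs: every Running hello-openshift pod IP (column 6 of -o wide).
        for ip in $(oc get po -o wide | grep Running | grep openshift | awk '{print $6}'); do
            if oc exec "$pod" -- curl -ILs --max-time 5 "http://$ip:8080" > /dev/null; then
                echo "OK   $pod -> $ip"
            else
                echo "FAIL $pod -> $ip"
                failures=$((failures + 1))
            fi
        done
    done
    return $failures
}

# Run the check once and report overall status.
if check_connectivity; then
    echo "all pod-to-pod checks passed"
else
    echo "some pod-to-pod checks failed"
fi
```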
Please note that the original issue was reported against a setup that is _not_ containerized. So, whatever the race condition is, it may happen more often when containerized, but that's not the root cause of the problem.

Created attachment 1286179 [details]
ovs-dump-pew05
Created attachment 1286180 [details]
ovs-dump-azuur05
I have attached the output of the following command, run on the two nodes:

# ovs-ofctl -O OpenFlow13 dump-flows br0

Source pod/node: uzl-rhel-apache-ipam-115-d58qf   2/2   Running   0   20m   10.1.17.177   osclu1-azuur-05.uz.kuleuven.ac.be
Target pod/node: uzl-rhel-perl-ipam-102-hbwhb     1/1   Running   0   20m   10.1.11.151   osclu1-pew-05.uz.kuleuven.ac.be

Node IPs:
osclu1-pew-05.uz.kuleuven.ac.be   = 10.254.185.49
osclu1-azuur-05.uz.kuleuven.ac.be = 10.254.250.55

As expected, pod-to-pod connectivity fails, but source-node-to-pod and target-node-to-pod connectivity works. Note that the connectivity is affected in both directions.

Based on those traces, the OVS state is wrong in the same way it was in Weibin's case. The VNID for the project that the pods are in is 0x39d500, and that VNID does not exist in table 80 of the ovs-dump-azuur05 dump.

*** Bug 1452225 has been marked as a duplicate of this bug. ***

Tested and verified with the "atomic-openshift-3.6.96-1.git.0.381dd63.el7" image.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716
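As a footnote to the table 80 diagnosis above (VNID 0x39d500 missing from table 80 of br0), that check can be scripted roughly as follows. This is a hedged sketch, not an official diagnostic: it assumes the ovs-multitenant plugin, that it runs on the node itself with access to br0, and that the NETID column of `oc get netnamespaces` is the project's VNID.

```bash
#!/bin/bash
# Rough sketch: check whether a project's VNID still has flows in OVS table 80
# on this node (the rules whose premature removal is described in this bug).
# Assumptions: ovs-multitenant plugin, run on the node, "oc" logged in as a
# user allowed to read netnamespaces, and br0 as the SDN bridge.
set -eu

PROJECT="${1:?usage: $0 <project>}"

# Look up the project's VNID (NETID column of "oc get netnamespaces") and
# convert it to the hex form used in the OVS flow dumps, e.g. 0x39d500.
netid=$(oc get netnamespaces | awk -v p="$PROJECT" '$1 == p {print $2}')
if [ -z "$netid" ]; then
    echo "no netnamespace found for project $PROJECT" >&2
    exit 2
fi
hex_vnid=$(printf '0x%x' "$netid")
echo "project $PROJECT has VNID $netid ($hex_vnid)"

# Table 80 holds the multitenant isolation rules; per the Doc Text above, if the
# VNID's rules are gone while pods of the project still run on this node, those
# pods cannot talk to each other.
if ovs-ofctl -O OpenFlow13 dump-flows br0 table=80 | grep -q "$hex_vnid"; then
    echo "table 80 contains flows for $hex_vnid: OK"
else
    echo "table 80 has NO flows for $hex_vnid: pods of $PROJECT on this node are isolated from each other"
fi
```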