Bug 1419692 - ovs-multitenant: 3.5 UDP RR transactions are zero for 16KB msg size between svc to svc
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.5.0
Hardware: All
OS: All
Target Milestone: ---
Assignee: Dan Williams
QA Contact: Siva Reddy
Whiteboard: aos-scalability-35
Depends On:
Reported: 2017-02-06 18:56 UTC by Siva Reddy
Modified: 2017-08-16 19:51 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: some fragmented IP packets were mistakenly dropped by openshift-node instead of being delivered to pods. Consequence: large UDP and TCP packets could have some or all fragments dropped instead of being delivered. Fix: ensure that fragments are correctly evaluated and sent to their destination. Result: large UDP and TCP packets should be delivered to pods in the cluster
Clone Of:
Last Closed: 2017-08-10 05:17:28 UTC
Target Upstream Version:

Attachments
rc json file (2.06 KB, text/plain)
2017-02-07 20:28 UTC, Siva Reddy
Service template (2.71 KB, text/plain)
2017-02-07 20:29 UTC, Siva Reddy

System ID Private Priority Status Summary Last Updated
Origin (Github) 13162 0 None None None 2017-03-01 18:15:31 UTC
Red Hat Product Errata RHEA-2017:1716 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.6 RPM Release Advisory 2017-08-10 09:02:50 UTC

Description Siva Reddy 2017-02-06 18:56:23 UTC
Description of problem:
    The UDP RR transactions are zero for 16KB msg size but have results when the message size is smaller

Version-Release number of selected component (if applicable):
openshift v3.5.0.14+20b49d0
kubernetes v1.5.2+43a9be4
etcd 3.1.0

Openshift SDN plugin: ovs-multitenant

How reproducible:

Steps to Reproduce:
1. Create receiver-sender services after creating pods.
2. Using pbench-uperf tool run the network statistics for the following options on both 3.4 and 3.5:
   UDP stream for 64, 1024, 16K 
   UDP RR for 64, 1024, 16K
3. Note the transactions for UDP RR for 16K 

Actual results:
   The results are zero for UDP RR 

Expected results:
   The results should be non-zero and closer to the results obtained in 3.4.

Additional info:
   The results collected in 3.4 and 3.5 are:
3.4: http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-5-240/uperf_svc-to-svc_PODS_2_UDP_NN_2016-12-20_14:41:01/result.html
3.5: http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-54-180/uperf_svc-to-svc-NN_PODS_2_UDP_2017-02-03_03:53:12/result.html

    The setup used:
1 master, 2 nodes. The service pods are created on two different nodes, and the tests are run between two projects, that is, two sets of pods.
    The pbench-uperf command run for the test is:

pbench-uperf --test-types=stream,rr --runtime=30 --message-sizes=64,1024,16384 --protocols=udp --instances=1 --samples=3 --max-stddev=10 --clients=<<ip of sender1 svc>>,<<ip of sender2 svc>> --servers=<<ip of receiver1 svc>>,<<ip of receiver2 svc>> --config=svc-to-svc-NN_PODS_2_UDP

Comment 1 Phil Cameron 2017-02-07 19:15:57 UTC
Quick test: host to pod, running iperf3, udp 16k packets work.

Comment 2 Siva Reddy 2017-02-07 19:29:22 UTC
pod to pod UDP RR 16k works fine, as the results below show:
    The issue is with service to service, where the backing pods for the services are on two different nodes of openshift.

Comment 3 Ben Bennett 2017-02-07 19:30:17 UTC
Based on Phil's test with 3.5 (bleeding edge) we can't reproduce this.  Can you give us the full details of the 3.4 and 3.5 machines?  Are they bare-metal?  VMs?

What are the versions of the kernel and of openvswitch?


Comment 4 Ben Bennett 2017-02-07 19:40:54 UTC
Siva: Thanks.  Can you try it directly to the pod IP address and see if that works?  It would help bisect the problem.

Comment 5 Siva Reddy 2017-02-07 20:27:53 UTC
    As I mentioned in Comment 2, I know for sure that pod IP works, and the link I gave is the results of the pod-ip-to-pod-ip UDP stream and RR tests. Basically I run a series of network tests, as follows:
   1. create two pods and run stream and RR for TCP, UDP 64, 1024 and 16K packets between pod ip addresses 
   2. create two services and two pods backing them and run stream and RR for TCP, UDP 64, 1024 and 16K packets between svc ip addresses 

All the scenarios give non-zero results except for svc ip to svc ip UDP RR for packet size 16k.
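(For context, an illustrative sketch that is not part of the original report: 16K is the only message size whose datagram exceeds a typical SDN overlay MTU, so it is the only one that fragments. The 1450-byte MTU below is an assumption, a common value for a VXLAN overlay; the real cluster MTU may differ.)

```python
import math

def udp_fragment_count(msg_size, mtu=1450):
    """Return how many IPv4 fragments a single UDP datagram of msg_size
    bytes produces at the given MTU (20-byte IP header, 8-byte UDP
    header, fragment payloads aligned to 8-byte units)."""
    payload = msg_size + 8          # the UDP header travels in the first fragment
    per_frag = (mtu - 20) // 8 * 8  # usable payload per fragment, 8-byte aligned
    return max(1, math.ceil(payload / per_frag))

for size in (64, 1024, 16384):
    # only the 16384-byte size exceeds the MTU and fragments
    print(size, udp_fragment_count(size))
```

Under this assumption the 64- and 1024-byte messages fit in one packet, while the 16384-byte message splits into a dozen fragments, matching the pattern of which sizes fail.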

Here are the detailed steps for the setup and the requested information:

The 3.4 and 3.5 environments are identical setups on AWS EC2, as follows:
Openshift Nodes        Model               OS
 Master+etcd      AWS EC2 m4.xlarge     RHEL 7.3
 NodeOne          AWS EC2 m4.xlarge     RHEL 7.3
 NodeTwo          AWS EC2 m4.xlarge     RHEL 7.3

The requested versions are the same on both.
# ovs-vswitchd --version
ovs-vswitchd (Open vSwitch) 2.5.0
Compiled Nov 22 2016 12:40:37

# uname -r

      In order to perform the tests, both the sender and receiver pods are created using the following (the json files are attached):
# --- label nodes as sender and receiver so the pods are created on those nodes
oc label node <<nodeOne>> --overwrite region=sender
oc label node <<nodeTwo>> --overwrite region=receiver
# --- create the project
oc create project uperf-1
# --- create sender pod
oc process -p REGION=sender -p ROLE=sender -f uperf-rc-template.json | oc create --namespace=uperf-1 -f -
# --- create sender service
oc process -p ROLE=sender -f uperf-svc-template-1.json | oc create --namespace=uperf-1 -f -
# --- create receiver pod
oc process -p REGION=receiver -p ROLE=receiver -f uperf-rc-template.json | oc create --namespace=uperf-1 -f -
# --- create receiver service
oc process -p ROLE=receiver -f uperf-svc-template-1.json | oc create --namespace=uperf-1 -f -
# --- set sysctl net.ipv4.ip_local_port_range in all pods
oc exec <<receiverPodCreated>> --namespace=uperf-1 -- sysctl net.ipv4.ip_local_port_range="20010 20019"
oc exec <<senderPodCreated>> --namespace=uperf-1 -- sysctl net.ipv4.ip_local_port_range="20010 20019"
# --- get the ips of the sender and receiver services
oc get svc

    After the services are set up and the IPs are obtained, the test is run using the pbench-uperf benchmarking tool (https://github.com/distributed-system-analysis/pbench), which uses uperf to run the network tests, with the command:

pbench-uperf --test-types=stream,rr --runtime=30 --message-sizes=64,1024,16384 --protocols=udp --instances=1 --samples=3 --max-stddev=10 --clients=<<ip of sender1 svc>>,<<ip of sender2 svc>> --servers=<<ip of receiver1 svc>>,<<ip of receiver2 svc>> --config=svc-to-svc-NN_PODS_2_UDP

    Let me know if you want to access the test environment I created for these tests.
The above setup for the tests is done by automated scripts written specifically for these tests, which are at:

Comment 6 Siva Reddy 2017-02-07 20:28:48 UTC
Created attachment 1248493 [details]
rc json file

Comment 7 Siva Reddy 2017-02-07 20:29:11 UTC
Created attachment 1248494 [details]
Service template

Comment 8 Siva Reddy 2017-02-10 14:51:55 UTC
For the same exact setup of the machines, but using the ovs plugin "subnet", the results are non-zero.

      This result indicates some performance issue with the multitenant plugin for UDP RR transactions when the packet size is 16K.

Comment 9 Ben Bennett 2017-02-10 18:48:58 UTC
Can you please grab the output from iptables-save on both multitenant versions?

Comment 10 Ben Bennett 2017-02-10 20:45:08 UTC
We see 16k UDP packets flow through to a pod. (The problem we saw with iperf3 was that it opens a TCP connection to report the results, and that gets blocked by the service proxy; pbench-uperf seems not to have that problem.)

In addition to the iptables-save requested above, can you get the pid for the server pod and do:
  nsenter -n -t <pid>
  tcpdump -n -i eth0 udp

Then paste the results from the tcpdump for 1024 bytes and 16k.  Thanks!

Comment 11 Siva Reddy 2017-02-10 21:16:12 UTC
I assume this is when the tests are running, isn't it?

Comment 12 Siva Reddy 2017-02-13 18:11:43 UTC
I have sent the details via email. Please let me know if you have any issues.

Comment 13 Ben Bennett 2017-02-13 20:15:24 UTC
UDP seems to be fine.  I ran:

On the receiving pod:
  nc -lup 20021 > /tmp/foo

On the sending pod:
  perl -le 'print "A"x16384' | nc -u 20021

I see 16384 bytes transmitted successfully (by looking at /tmp/foo).  And that is to the service address.
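(An aside, not part of the original comment: the nc check can be approximated with a Python sketch that sends one 16384-byte datagram over loopback. Loopback's large MTU means nothing fragments here, which is exactly why a local check like this, or the line-buffered nc test, can pass while the overlay path fails. The OS-assigned port stands in for the 20021 used above.)

```python
import socket
import threading

MSG = b"A" * 16384  # one 16 KB datagram, like the perl|nc test above

# Receiver: bind before starting the sender so they cannot race
srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
srv.bind(("127.0.0.1", 0))          # OS-assigned port (the nc test used 20021)
port = srv.getsockname()[1]

received = []
def recv_one():
    data, _ = srv.recvfrom(65535)   # reads exactly one datagram
    received.append(data)

t = threading.Thread(target=recv_one)
t.start()

cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cli.sendto(MSG, ("127.0.0.1", port))
t.join()
cli.close()
srv.close()

assert received[0] == MSG           # all 16384 bytes arrive as one datagram
```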

I see that you have two projects set up and both receiving IP addresses are passed.  With multitenant, what is supposed to happen?  I would not expect pods in one project to talk to pods in the other.

Comment 14 Siva Reddy 2017-02-13 21:28:38 UTC
   Agreed, simple transmission happens fine, and the expectation is that only pods within a project can talk to each other. The two projects are just for scale: two senders and two receivers communicate with each other simultaneously within the same project, simply to simulate load/scale in the environment. Here is what is happening in multi-tenant; for simplicity, I reduced it to just one project.

1. Click on this link and notice that the throughput and latency for the 16K msg size are zero, whereas for 64 and 1024 they are non-zero, which means those messages got transmitted. This data is summarized by the wrapper around the uperf tool.

If you want system level data collected by tools then they are here:

Here is the environment system status:

root@ip-172-31-27-63: ~/svt/networking/synthetic # oc get pods -o wide
NAME                   READY     STATUS    RESTARTS   AGE       IP            NODE
uperf-receiver-8s04c   1/1       Running   0          10m   ip-172-31-32-182.us-west-2.compute.internal
uperf-sender-ws1gw     1/1       Running   0          11m   ip-172-31-12-44.us-west-2.compute.internal
root@ip-172-31-27-63: ~/svt/networking/synthetic # oc get svc 
NAME             CLUSTER-IP       EXTERNAL-IP   PORT(S)                                                                                                                                                                                                          AGE
uperf-receiver    <none>        22/TCP,20010/TCP,20011/TCP,20012/TCP,20013/TCP,20014/TCP,20015/TCP,20016/TCP,20017/TCP,20018/TCP,20019/TCP,20010/UDP,20011/UDP,20012/UDP,20013/UDP,20014/UDP,20015/UDP,20016/UDP,20017/UDP,20018/UDP,20019/UDP   11m
uperf-sender   <none>        22/TCP,20010/TCP,20011/TCP,20012/TCP,20013/TCP,20014/TCP,20015/TCP,20016/TCP,20017/TCP,20018/TCP,20019/TCP,20010/UDP,20011/UDP,20012/UDP,20013/UDP,20014/UDP,20015/UDP,20016/UDP,20017/UDP,20018/UDP,20019/UDP   11m

the command that runs the UDP 16K transmission:
pbench-uperf --test-types=rr --runtime=30 --message-sizes=64,1024,16384 --protocols=udp --instances=1 --samples=3 --max-stddev=10 --clients= --servers= --config=svc-to-svc-NN_PODS_1_UDP

Basically, the tool is running msg sizes 64, 1024, and 16K for a 30-second runtime between the SVC IPs, over the UDP protocol, as an RR test. As indicated above, for the 64 and 1024 msg sizes we see transmission; the question is why it is zero when the size is 16K.

if you want to see the openvswitch data log, it is here

Comment 15 Ben Bennett 2017-02-15 20:48:38 UTC
Ok... thanks for the access.  @aconole and I looked at the setup and it's "interesting".  The nc works because it line-buffers the input at a smaller value than the MTU.

When I ran uperf in isolation:
  Server: /usr/local/bin/uperf -s -P 20010

  Client: /usr/local/bin/uperf -x -a -i 1 -P 20010 -m file.xml

<?xml version="1.0"?>
<profile name="udp-rr-16384B-1i">
  <group nthreads="1">
    <transaction iterations="1">
      <flowop type="connect" options="remotehost= protocol=udp"/>
    </transaction>
    <transaction duration="1s">
      <flowop type="write" options="size=16384"/>
      <flowop type="read"  options="size=16384"/>
    </transaction>
    <transaction iterations="1">
      <flowop type="disconnect"/>
    </transaction>
  </group>
</profile>

Watching the packets on the sending host, they are not fragmented.  We see large packets pop out of the veth and then go into iptables, never to return.  Whereas any packets smaller than the veth MTU go in and are correctly rewritten by the nat rules.

Comment 16 Ben Bennett 2017-02-15 20:49:52 UTC
Forgot to mention that aconole thinks that the veth ought to be fragmenting the packets... he is going to reassign to the net team to look at further.

BTW Siva: Was the version of the OS the same for the 3.3 and 3.4 tests?

Comment 17 Siva Reddy 2017-02-15 20:58:56 UTC
Ben, yes it has been RHEL 7.3 for 3.3, 3.4. I guess 3.2 was RHEL 7.2 but we are comparing 3.4 and 3.5 now.

Comment 18 Ben Bennett 2017-02-15 21:13:18 UTC
Same version of OVS too, right?

Comment 19 Siva Reddy 2017-02-15 21:26:54 UTC
yes same OVS version. I'm planning to run the same tests again on OVS 2.6 but so far it has been
OVS 2.5

Comment 20 Dan Williams 2017-02-23 22:01:59 UTC
Root cause:

1) OVS applies flows to each fragment of a packet individually
2) The openshift-sdn rules for non-admin-VNID pods sending to a service match on TCP/UDP destination port
3) By default, OVS sets the TCP/UDP port numbers to 0 for *all* fragments of a fragmented packet, even the first one that would otherwise have them available
4) Thus, we drop all fragmented, non-default-VNID service traffic in our OVS flow table=60 because the port numbers don't match

TLDR; openshift-sdn currently drops fragmented multi-tenant (non-admin-VNID) service traffic originating from pods.


Possible fixes:

a) use OVS's conntrack functionality to re-assemble the packet and thus make the port #s available again in the flow tables.  However, OVS conntrack functionality appears to conflict with the kernel's iptables functionality in various ways, and we rely on iptables for service proxying.  Unless we can solve the incompatibility issue this won't work.

b) drop port matching from our OVS rules for non-admin-VNID services.  This would normally work, except if we expect users to manually assign the same service IP to multiple services and differentiate only on port #s.
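(An illustrative sketch, not the actual OVS or openshift-sdn code: when IPv4 fragments a UDP datagram, only the first fragment carries the 8-byte UDP header, so only it contains the destination port that the table=60 rules match on. The helper below is hypothetical and just models the fragment payload layout.)

```python
import struct

def fragment_udp(src_port, dst_port, payload, mtu=1450):
    """Split a UDP datagram into IPv4-style fragment payloads.
    Returns a list of (offset, bytes); only the first fragment
    carries the 8-byte UDP header, hence the L4 ports."""
    udp_header = struct.pack("!HHHH", src_port, dst_port, 8 + len(payload), 0)
    datagram = udp_header + payload
    per_frag = (mtu - 20) // 8 * 8   # fragment payload, 8-byte aligned
    return [(off, datagram[off:off + per_frag])
            for off in range(0, len(datagram), per_frag)]

# A 16384-byte message splits into 12 fragments at MTU 1450
frags = fragment_udp(12345, 20010, b"A" * 16384)
first_dst_port = struct.unpack("!H", frags[0][1][2:4])[0]  # 20010, first fragment only
```

Trailing fragments carry raw payload bytes with no L4 header at all, which is why a flow that matches on the TCP/UDP destination port can never match them, and why zeroing the ports for every fragment of a fragmented packet (point 3 above) makes even the first fragment miss.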

Comment 21 Ben Bennett 2017-03-01 17:01:43 UTC
Dan: How did this work in 3.4?

Comment 23 Dan Williams 2017-03-01 19:18:07 UTC
Possible workaround pushed as https://github.com/openshift/origin/pull/13162

Comment 24 Dan Williams 2017-03-03 02:56:34 UTC
(In reply to Ben Bennett from comment #21)
> Dan: How did this work in 3.4?

We're really not sure about that.  I tried to spin up a 3.4 cluster, but the DIND tooling had a bug back then.  Were you able to get something set up to check 3.4?

Comment 25 Mike Fiedler 2017-03-03 03:38:50 UTC
schituku:  any results from 3.4 env?

Comment 26 Siva Reddy 2017-03-03 14:03:08 UTC
Mike, I have created a 3.4 cluster and ran the tests, but the bug showed up in the results. I've given the details of this environment to Ben so that he can look at it further.
   Not sure how the results were non-zero when I ran the tests back then.

here are the results:
3.4 now: http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-9-161/uperf_svc-to-svc-NN_PODS_2_UDP_2017-03-02_16:44:34/result.html
    shows 0 for the 16KB UDP 
3.4 during 3.4 release: http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-5-240/uperf_svc-to-svc_PODS_2_UDP_NN_2016-12-20_14:41:01/result.html

Comment 28 Troy Dawson 2017-04-11 21:08:16 UTC
This has been merged into ocp and is in OCP v3.6.27 or newer.

Comment 32 Siva Reddy 2017-04-25 08:51:17 UTC
Verified this bug and validated that the UDP RR msgs are not zero any more. Here is the data for the run.

Comment 34 errata-xmlrpc 2017-08-10 05:17:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

