Bug 1419692 - ovs-multitenant: 3.5 UDP RR transactions are zero for 16KB msg size between svc to svc
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.5.0
Hardware: All
OS: All
Target Milestone: ---
Assignee: Dan Williams
QA Contact: Siva Reddy
Whiteboard: aos-scalability-35
Depends On:
Reported: 2017-02-06 18:56 UTC by Siva Reddy
Modified: 2017-08-16 19:51 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: some fragmented IP packets were mistakenly dropped by openshift-node instead of being delivered to pods. Consequence: large UDP and TCP packets could have some or all fragments dropped instead of being delivered. Fix: ensure that fragments are correctly evaluated and sent to their destination. Result: large UDP and TCP packets should be delivered to pods in the cluster
Clone Of:
Last Closed: 2017-08-10 05:17:28 UTC
Target Upstream Version:

Attachments
rc json file (2.06 KB, text/plain)
2017-02-07 20:28 UTC, Siva Reddy
Service template (2.71 KB, text/plain)
2017-02-07 20:29 UTC, Siva Reddy

System ID Private Priority Status Summary Last Updated
Origin (Github) 13162 0 None None None 2017-03-01 18:15:31 UTC
Red Hat Product Errata RHEA-2017:1716 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.6 RPM Release Advisory 2017-08-10 09:02:50 UTC

Description Siva Reddy 2017-02-06 18:56:23 UTC
Description of problem:
    The UDP RR transactions are zero for 16KB msg size but have results when the message size is smaller

Version-Release number of selected component (if applicable):
openshift v3.5.0.14+20b49d0
kubernetes v1.5.2+43a9be4
etcd 3.1.0

Openshift SDN plugin: ovs-multitenant

How reproducible:

Steps to Reproduce:
1. Create receiver-sender services after creating pods.
2. Using pbench-uperf tool run the network statistics for the following options on both 3.4 and 3.5:
   UDP stream for 64, 1024, 16K 
   UDP RR for 64, 1024, 16K
3. Note the transactions for UDP RR for 16K 

Actual results:
   The results are zero for UDP RR 

Expected results:
   The results should be non-zero and closer to the results obtained in 3.4.

Additional info:
   The results collected in 3.4 and 3.5 are:
3.4: http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-5-240/uperf_svc-to-svc_PODS_2_UDP_NN_2016-12-20_14:41:01/result.html
3.5: http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-54-180/uperf_svc-to-svc-NN_PODS_2_UDP_2017-02-03_03:53:12/result.html

    The setup used:
1 master, 2 nodes. The service pods are created on two different nodes, and the tests are run between two projects, that is, two sets of pods.
    The pbench-uperf command run for the test is:

pbench-uperf --test-types=stream,rr --runtime=30 --message-sizes=64,1024,16384 --protocols=udp --instances=1 --samples=3 --max-stddev=10 --clients=<<ip of sender1 svc>>,<<ip of sender2 svc>> --servers=<<ip of receiver1 svc>>,<<ip of receiver2 svc>> --config=svc-to-svc-NN_PODS_2_UDP

Comment 1 Phil Cameron 2017-02-07 19:15:57 UTC
Quick test: host to pod, running iperf3, udp 16k packets work.

Comment 2 Siva Reddy 2017-02-07 19:29:22 UTC
pod to pod UDP RR 16k works fine, as the results below show:
    The issue is with service to service, where the backing pods for the services are on two different nodes of openshift.

Comment 3 Ben Bennett 2017-02-07 19:30:17 UTC
Based on Phil's test with 3.5 (bleeding edge) we can't reproduce this.  Can you give us the full details of the 3.4 and 3.5 machines?  Are they bare-metal?  VMs?

What are the versions of the kernel and of openvswitch?


Comment 4 Ben Bennett 2017-02-07 19:40:54 UTC
Siva: Thanks.  Can you try it directly to the pod IP address and see if that works?  It would help bisect the problem.

Comment 5 Siva Reddy 2017-02-07 20:27:53 UTC
    As I mentioned in Comment 2, I know for sure that pod IP works, and the link I gave is the results of the pod-ip-to-pod-ip UDP stream and RR tests. Basically I run a series of network tests, as follows:
   1. create two pods and run stream and RR for TCP, UDP 64, 1024 and 16K packets between pod ip addresses 
   2. create two services and two pods backing them and run stream and RR for TCP, UDP 64, 1024 and 16K packets between svc ip addresses 

All the scenarios give non-zero results except for svc ip to svc ip UDP RR for packet size 16k.
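(For context, an illustrative sketch that is not part of the original report: 16K is the only message size whose datagram exceeds a typical SDN overlay MTU, so it is the only one that fragments. The 1450-byte MTU below is an assumption, a common value for a VXLAN overlay; the real cluster MTU may differ.)

```python
import math

def udp_fragment_count(msg_size, mtu=1450):
    """Return how many IPv4 fragments a single UDP datagram of msg_size
    bytes produces at the given MTU (20-byte IP header, 8-byte UDP
    header, fragment payloads aligned to 8-byte units)."""
    payload = msg_size + 8          # the UDP header travels in the first fragment
    per_frag = (mtu - 20) // 8 * 8  # usable payload per fragment, 8-byte aligned
    return max(1, math.ceil(payload / per_frag))

for size in (64, 1024, 16384):
    # only the 16384-byte size exceeds the MTU and fragments
    print(size, udp_fragment_count(size))
```

Under this assumption the 64- and 1024-byte messages fit in one packet, while the 16384-byte message splits into a dozen fragments, matching the pattern of which sizes fail.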

Here are the detailed steps for the setup and the requested information:

The 3.4 and 3.5 environments are identical setups on AWS EC2, as follows:
Openshift Nodes        Model               OS
 Master+etcd      AWS EC2 m4.xlarge     RHEL 7.3
 NodeOne          AWS EC2 m4.xlarge     RHEL 7.3
 NodeTwo          AWS EC2 m4.xlarge     RHEL 7.3

The requested versions are the same on both.
# ovs-vswitchd --version
ovs-vswitchd (Open vSwitch) 2.5.0
Compiled Nov 22 2016 12:40:37

# uname -r

      In order to perform the tests, both the sender and receiver pods are created using the following (the json files are attached):
# --- label nodes as sender and receiver so the pods are created on those nodes
oc label node <<nodeOne>> --overwrite region=sender
oc label node <<nodeTwo>> --overwrite region=receiver
# --- create the project
oc create project uperf-1
# --- create sender pod
oc process -p REGION=sender -p ROLE=sender -f uperf-rc-template.json | oc create --namespace=uperf-1 -f -
# --- create sender service
oc process -p ROLE=sender -f uperf-svc-template-1.json | oc create --namespace=uperf-1 -f -
# --- create receiver pod
oc process -p REGION=receiver -p ROLE=receiver -f uperf-rc-template.json | oc create --namespace=uperf-1 -f -
# --- create receiver service
oc process -p ROLE=receiver -f uperf-svc-template-1.json | oc create --namespace=uperf-1 -f -
# --- set sysctl net.ipv4.ip_local_port_range in all pods
oc exec <<receiverPodCreated>> --namespace=uperf-1 -- sysctl net.ipv4.ip_local_port_range="20010 20019"
oc exec <<senderPodCreated>> --namespace=uperf-1 -- sysctl net.ipv4.ip_local_port_range="20010 20019"
# --- get the ips of the sender and receiver services
oc get svc

    After the services are set up and the IPs are obtained, the test is run using the pbench-uperf benchmarking tool (https://github.com/distributed-system-analysis/pbench), which uses uperf to run the network tests, with the command:

pbench-uperf --test-types=stream,rr --runtime=30 --message-sizes=64,1024,16384 --protocols=udp --instances=1 --samples=3 --max-stddev=10 --clients=<<ip of sender1 svc>>,<<ip of sender2 svc>> --servers=<<ip of receiver1 svc>>,<<ip of receiver2 svc>> --config=svc-to-svc-NN_PODS_2_UDP

    Let me know if you want to access the test environment I created for these tests.
The above setup for the tests is done by automated scripts written specifically for these tests, which are at:

Comment 6 Siva Reddy 2017-02-07 20:28:48 UTC
Created attachment 1248493 [details]
rc json file

Comment 7 Siva Reddy 2017-02-07 20:29:11 UTC
Created attachment 1248494 [details]
Service template

Comment 8 Siva Reddy 2017-02-10 14:51:55 UTC
For the same exact setup of the machines, but using the ovs plugin "subnet", the results are non-zero.

      This result indicates some performance issue with the multitenant plugin for UDP RR transactions when the packet size is 16K.

Comment 9 Ben Bennett 2017-02-10 18:48:58 UTC
Can you please grab the output from iptables-save on both multitenant versions?

Comment 10 Ben Bennett 2017-02-10 20:45:08 UTC
We see 16k UDP packets flow through to a pod. (The problem we saw with iperf3 was that it opens a TCP connection to report the results, and that gets blocked by the service proxy; pbench-uperf seems not to have that problem.)

In addition to the iptables-save requested above, can you get the pid for the server pod and do:
  nsenter -n -t <pid>
  tcpdump -n -i eth0 udp

Then paste the results from the tcpdump for 1024 bytes and 16k.  Thanks!

Comment 11 Siva Reddy 2017-02-10 21:16:12 UTC
I assume this is when the tests are running, isn't it?

Comment 12 Siva Reddy 2017-02-13 18:11:43 UTC
I have sent the details via email. Please let me know if you have any issues.

Comment 13 Ben Bennett 2017-02-13 20:15:24 UTC
UDP seems to be fine.  I ran:

On the receiving pod:
  nc -lup 20021 > /tmp/foo

On the sending pod:
  perl -le 'print "A"x16384' | nc -u 20021

I see 16384 bytes transmitted successfully (by looking at /tmp/foo).  And that is to the service address.
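(An aside, not part of the original comment: the nc check can be approximated with a Python sketch that sends one 16384-byte datagram over loopback. Loopback's large MTU means nothing fragments here, which is exactly why a local check like this, or the line-buffered nc test, can pass while the overlay path fails. The OS-assigned port stands in for the 20021 used above.)

```python
import socket
import threading

MSG = b"A" * 16384  # one 16 KB datagram, like the perl|nc test above

# Receiver: bind before starting the sender so they cannot race
srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
srv.bind(("127.0.0.1", 0))          # OS-assigned port (the nc test used 20021)
port = srv.getsockname()[1]

received = []
def recv_one():
    data, _ = srv.recvfrom(65535)   # reads exactly one datagram
    received.append(data)

t = threading.Thread(target=recv_one)
t.start()

cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cli.sendto(MSG, ("127.0.0.1", port))
t.join()
cli.close()
srv.close()

assert received[0] == MSG           # all 16384 bytes arrive as one datagram
```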

I see that you have two projects set up and both receiving IP addresses are passed.  With multitenant, what is supposed to happen?  I would not expect pods in one project to talk to pods in the other.

Comment 14 Siva Reddy 2017-02-13 21:28:38 UTC
   Agreed, simple transmission happens fine, and the expectation is that only pods within a project can talk to each other. The two projects are just for scale: two senders and two receivers communicate with each other simultaneously within the same project, simply to simulate load/scale in the environment. Here is what is happening in multi-tenant; for simplicity, I reduced it to just one project.

1. Click on this link and notice that the throughput and latency for the 16K msg size are zero, whereas for 64 and 1024 they are non-zero, which means those messages got transmitted. This data is summarized by the wrapper around the uperf tool.

If you want system level data collected by tools then they are here:

Here is the environment system status:

root@ip-172-31-27-63: ~/svt/networking/synthetic # oc get pods -o wide
NAME                   READY     STATUS    RESTARTS   AGE       IP            NODE
uperf-receiver-8s04c   1/1       Running   0          10m   ip-172-31-32-182.us-west-2.compute.internal
uperf-sender-ws1gw     1/1       Running   0          11m   ip-172-31-12-44.us-west-2.compute.internal
root@ip-172-31-27-63: ~/svt/networking/synthetic # oc get svc 
NAME             CLUSTER-IP       EXTERNAL-IP   PORT(S)                                                                                                                                                                                                          AGE
uperf-receiver    <none>        22/TCP,20010/TCP,20011/TCP,20012/TCP,20013/TCP,20014/TCP,20015/TCP,20016/TCP,20017/TCP,20018/TCP,20019/TCP,20010/UDP,20011/UDP,20012/UDP,20013/UDP,20014/UDP,20015/UDP,20016/UDP,20017/UDP,20018/UDP,20019/UDP   11m
uperf-sender   <none>        22/TCP,20010/TCP,20011/TCP,20012/TCP,20013/TCP,20014/TCP,20015/TCP,20016/TCP,20017/TCP,20018/TCP,20019/TCP,20010/UDP,20011/UDP,20012/UDP,20013/UDP,20014/UDP,20015/UDP,20016/UDP,20017/UDP,20018/UDP,20019/UDP   11m

the command that runs the UDP 16K transmission:
pbench-uperf --test-types=rr --runtime=30 --message-sizes=64,1024,16384 --protocols=udp --instances=1 --samples=3 --max-stddev=10 --clients= --servers= --config=svc-to-svc-NN_PODS_1_UDP

Basically, the tool is running msg sizes 64, 1024, and 16K for a 30-second runtime between the SVC IPs, over the UDP protocol, as an RR test. As indicated above, for the 64 and 1024 msg sizes we see transmission; the question is why it is zero when the size is 16K.

if you want to see the openvswitch data log, it is here

Comment 15 Ben Bennett 2017-02-15 20:48:38 UTC
Ok... thanks for the access.  @aconole and I looked at the setup and it's "interesting".  The nc works because it line-buffers the input at a smaller value than the MTU.

When I ran uperf in isolation:
  Server: /usr/local/bin/uperf -s -P 20010

  Client: /usr/local/bin/uperf -x -a -i 1 -P 20010 -m file.xml

<?xml version="1.0"?>
<profile name="udp-rr-16384B-1i">
  <group nthreads="1">
    <transaction iterations="1">
      <flowop type="connect" options="remotehost= protocol=udp"/>
    </transaction>
    <transaction duration="1s">
      <flowop type="write" options="size=16384"/>
      <flowop type="read"  options="size=16384"/>
    </transaction>
    <transaction iterations="1">
      <flowop type="disconnect"/>
    </transaction>
  </group>
</profile>

Watching the packets on the sending host, they are not fragmented.  We see large packets pop out of the veth and then go into iptables, never to return.  Whereas any packets smaller than the veth MTU go in and are correctly rewritten by the nat rules.

Comment 16 Ben Bennett 2017-02-15 20:49:52 UTC
Forgot to mention that aconole thinks that the veth ought to be fragmenting the packets... he is going to reassign to the net team to look at further.

BTW Siva: Was the version of the OS the same for the 3.3 and 3.4 tests?

Comment 17 Siva Reddy 2017-02-15 20:58:56 UTC
Ben, yes it has been RHEL 7.3 for 3.3, 3.4. I guess 3.2 was RHEL 7.2 but we are comparing 3.4 and 3.5 now.

Comment 18 Ben Bennett 2017-02-15 21:13:18 UTC
Same version of OVS too, right?

Comment 19 Siva Reddy 2017-02-15 21:26:54 UTC
yes same OVS version. I'm planning to run the same tests again on OVS 2.6 but so far it has been
OVS 2.5

Comment 20 Dan Williams 2017-02-23 22:01:59 UTC
Root cause:

1) OVS applies flows to each fragment of a packet individually
2) The openshift-sdn rules for non-admin-VNID pods sending to a service match on TCP/UDP destination port
3) By default, OVS sets the TCP/UDP port numbers to 0 for *all* fragments of a fragmented packet, even the first one that would otherwise have them available
4) Thus, we drop all fragmented, non-default-VNID service traffic in our OVS flow table=60 because the port numbers don't match

TLDR; openshift-sdn currently drops fragmented multi-tenant (non-admin-VNID) service traffic originating from pods.


Possible fixes:

a) use OVS's conntrack functionality to re-assemble the packet and thus make the port #s available again in the flow tables.  However, OVS conntrack functionality appears to conflict with the kernel's iptables functionality in various ways, and we rely on iptables for service proxying.  Unless we can solve the incompatibility issue this won't work.

b) drop port matching from our OVS rules for non-admin-VNID services.  This would normally work, except if we expect users to manually assign the same service IP to multiple services and differentiate only on port #s.
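(An illustrative sketch, not the actual OVS or openshift-sdn code: when IPv4 fragments a UDP datagram, only the first fragment carries the 8-byte UDP header, so only it contains the destination port that the table=60 rules match on. The helper below is hypothetical and just models the fragment payload layout.)

```python
import struct

def fragment_udp(src_port, dst_port, payload, mtu=1450):
    """Split a UDP datagram into IPv4-style fragment payloads.
    Returns a list of (offset, bytes); only the first fragment
    carries the 8-byte UDP header, hence the L4 ports."""
    udp_header = struct.pack("!HHHH", src_port, dst_port, 8 + len(payload), 0)
    datagram = udp_header + payload
    per_frag = (mtu - 20) // 8 * 8   # fragment payload, 8-byte aligned
    return [(off, datagram[off:off + per_frag])
            for off in range(0, len(datagram), per_frag)]

# A 16384-byte message splits into 12 fragments at MTU 1450
frags = fragment_udp(12345, 20010, b"A" * 16384)
first_dst_port = struct.unpack("!H", frags[0][1][2:4])[0]  # 20010, first fragment only
```

Trailing fragments carry raw payload bytes with no L4 header at all, which is why a flow that matches on the TCP/UDP destination port can never match them, and why zeroing the ports for every fragment of a fragmented packet (point 3 above) makes even the first fragment miss.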

Comment 21 Ben Bennett 2017-03-01 17:01:43 UTC
Dan: How did this work in 3.4?

Comment 23 Dan Williams 2017-03-01 19:18:07 UTC
Possible workaround pushed as https://github.com/openshift/origin/pull/13162

Comment 24 Dan Williams 2017-03-03 02:56:34 UTC
(In reply to Ben Bennett from comment #21)
> Dan: How did this work in 3.4?

We're really not sure about that.  I tried to spin up a 3.4 cluster, but the DIND tooling had a bug back then.  Were you able to get something set up to check 3.4?

Comment 25 Mike Fiedler 2017-03-03 03:38:50 UTC
schituku:  any results from 3.4 env?

Comment 26 Siva Reddy 2017-03-03 14:03:08 UTC
Mike, I have created a 3.4 cluster and ran the tests, but the bug showed up in the results. I've given the details of this environment to Ben so that he can look at it further.
   Not sure how the results were non-zero when I ran the tests back then.

here are the results:
3.4 now: http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-9-161/uperf_svc-to-svc-NN_PODS_2_UDP_2017-03-02_16:44:34/result.html
    shows 0 for the 16KB UDP 
3.4 during 3.4 release: http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-5-240/uperf_svc-to-svc_PODS_2_UDP_NN_2016-12-20_14:41:01/result.html

Comment 28 Troy Dawson 2017-04-11 21:08:16 UTC
This has been merged into ocp and is in OCP v3.6.27 or newer.

Comment 32 Siva Reddy 2017-04-25 08:51:17 UTC
Verified this bug and validated that the UDP RR msgs are not zero any more. Here is the data for the run.

Comment 34 errata-xmlrpc 2017-08-10 05:17:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

