Bug 1424076

Summary: vxlan: performance can suffer unless GRO is disabled on vxlan interface
Product: Red Hat Enterprise Linux 7
Component: kernel
Sub component: Tunnel
Version: 7.3
Hardware: All
OS: Linux
Keywords: ZStream
Severity: urgent
Priority: urgent
Status: CLOSED ERRATA
Reporter: Patrick Talbert <ptalbert>
Assignee: Jiri Benc <jbenc>
QA Contact: Jan Tluka <jtluka>
CC: atomlin, atragler, brault, dhoward, ealcaniz, fbaudin, hsowa, jbenc, jeharris, jiji, mleitner, mmilgram, network-qe, pneedle, qding, rmanes
Target Milestone: rc
Target Release: ---
Fixed In Version: kernel-3.10.0-599.el7
Cloned As: 1431197 (view as bug list)
Type: Bug
Regression: ---
Last Closed: 2017-08-02 05:42:35 UTC
Bug Blocks: 1298243, 1323132, 1429597, 1431197

Description Patrick Talbert 2017-02-17 15:49:23 UTC
Description of problem:
VXLAN interfaces have GRO enabled by default in RHEL 7.3. This is causing a performance regression for a strategic customer who sees reduced throughput across a vxlan link.
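
For reference, the GRO setting on the OVS-created vxlan interface (vxlan_sys_4789, the name used in the workaround below) can be checked and toggled with ethtool:

# ethtool -k vxlan_sys_4789 | grep generic-receive-offload
# ethtool -K vxlan_sys_4789 gro off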


Version-Release number of selected component (if applicable):
kernel-3.10.0-514.6.1.el7.x86_64


How reproducible:
Always?


Steps to Reproduce:
1. Create a VXLAN tunnel between two RHEL 7.3 hosts and enable UDP flow hashing on IP src/dst and L4 src/dst ports (without hashing on the L4 ports, performance is never very good):

Host A:

# ip link set dev enp12s0 mtu 9000
# ip link set dev enp12s0 up
# ip addr add 172.168.0.2/24 dev enp12s0
# ethtool -U enp12s0 rx-flow-hash udp4 sdfn

# ovs-vsctl add-br br-int
# ovs-vsctl add-port br-int enp12s0
# ovs-vsctl add-port br-int vxlan404 -- set Interface vxlan404 type=vxlan options:remote_ip=172.168.0.3

# ip link set dev br-int mtu 1500
# ip link set dev br-int up
# ip addr add 5.5.5.2/24 dev br-int

Host B:

# ip link set dev enp12s0 mtu 9000
# ip link set dev enp12s0 up
# ip addr add 172.168.0.3/24 dev enp12s0
# ethtool -U enp12s0 rx-flow-hash udp4 sdfn

# ovs-vsctl add-br br-int
# ovs-vsctl add-port br-int enp12s0
# ovs-vsctl add-port br-int vxlan404 -- set Interface vxlan404 type=vxlan options:remote_ip=172.168.0.2

# ip link set dev br-int mtu 1500
# ip link set dev br-int up
# ip addr add 5.5.5.3/24 dev br-int



2. Run an iperf3 test from Host B to Host A on the two 10G interfaces (the 172.168.0.0/24 network). Throughput should be > 9Gb/sec.
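
For example (underlay addresses from step 1; the flags mirror the overlay test in step 3, though the exact options are not critical for this baseline):

(Host A)# iperf3 -s

(Host B)# iperf3 -c 172.168.0.2 -P 12 -w 200K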

3. Run an iperf3 test between the two VXLAN interfaces (5.5.5.0/24 network):

(Host A)# iperf3 -s

(Host B)# iperf3 -c 5.5.5.2 -P 12 -w 200K


Actual results:

Throughput never quite reaches 7 Gb/sec:

[root@ibm-x3550m4-9 ~]# iperf3 -c 5.5.5.2 -P 12 -w 200K

[  4]   9.00-10.00  sec  37.0 MBytes   310 Mbits/sec   73   82.0 KBytes       
[  6]   9.00-10.00  sec  96.1 MBytes   805 Mbits/sec    2    208 KBytes       
[  8]   9.00-10.00  sec  98.2 MBytes   823 Mbits/sec    4    205 KBytes       
[ 10]   9.00-10.00  sec  93.1 MBytes   781 Mbits/sec   15    204 KBytes       
[ 12]   9.00-10.00  sec  34.0 MBytes   285 Mbits/sec  215   86.3 KBytes       
[ 14]   9.00-10.00  sec  71.6 MBytes   600 Mbits/sec   50    174 KBytes       
[ 16]   9.00-10.00  sec  65.9 MBytes   553 Mbits/sec   65    164 KBytes       
[ 18]   9.00-10.00  sec  80.3 MBytes   673 Mbits/sec   52    157 KBytes       
[ 20]   9.00-10.00  sec  81.2 MBytes   681 Mbits/sec   90    120 KBytes       
[ 22]   9.00-10.00  sec  34.3 MBytes   288 Mbits/sec  206   86.3 KBytes       
[ 24]   9.00-10.00  sec  66.1 MBytes   554 Mbits/sec   38    116 KBytes       
[ 26]   9.00-10.00  sec  54.9 MBytes   460 Mbits/sec   84    136 KBytes       
[SUM]   9.00-10.00  sec   813 MBytes  6.81 Gbits/sec  894             
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec   414 MBytes   347 Mbits/sec  1375             sender
[  4]   0.00-10.00  sec   414 MBytes   347 Mbits/sec                  receiver
[  6]   0.00-10.00  sec   931 MBytes   781 Mbits/sec  166             sender
[  6]   0.00-10.00  sec   930 MBytes   780 Mbits/sec                  receiver
[  8]   0.00-10.00  sec   932 MBytes   782 Mbits/sec  218             sender
[  8]   0.00-10.00  sec   932 MBytes   781 Mbits/sec                  receiver
[ 10]   0.00-10.00  sec   887 MBytes   744 Mbits/sec  275             sender
[ 10]   0.00-10.00  sec   887 MBytes   744 Mbits/sec                  receiver
[ 12]   0.00-10.00  sec   363 MBytes   305 Mbits/sec  1745             sender
[ 12]   0.00-10.00  sec   363 MBytes   304 Mbits/sec                  receiver
[ 14]   0.00-10.00  sec   834 MBytes   700 Mbits/sec  443             sender
[ 14]   0.00-10.00  sec   834 MBytes   699 Mbits/sec                  receiver
[ 16]   0.00-10.00  sec   756 MBytes   634 Mbits/sec  612             sender
[ 16]   0.00-10.00  sec   756 MBytes   634 Mbits/sec                  receiver
[ 18]   0.00-10.00  sec   838 MBytes   703 Mbits/sec  400             sender
[ 18]   0.00-10.00  sec   838 MBytes   702 Mbits/sec                  receiver
[ 20]   0.00-10.00  sec   707 MBytes   593 Mbits/sec  581             sender
[ 20]   0.00-10.00  sec   707 MBytes   593 Mbits/sec                  receiver
[ 22]   0.00-10.00  sec   316 MBytes   265 Mbits/sec  1840             sender
[ 22]   0.00-10.00  sec   316 MBytes   265 Mbits/sec                  receiver
[ 24]   0.00-10.00  sec   599 MBytes   502 Mbits/sec  882             sender
[ 24]   0.00-10.00  sec   599 MBytes   502 Mbits/sec                  receiver
[ 26]   0.00-10.00  sec   548 MBytes   459 Mbits/sec  1048             sender
[ 26]   0.00-10.00  sec   547 MBytes   459 Mbits/sec                  receiver
[SUM]   0.00-10.00  sec  7.93 GBytes  6.82 Gbits/sec  9585             sender
[SUM]   0.00-10.00  sec  7.93 GBytes  6.81 Gbits/sec                  receiver


Expected results:

Higher throughput with no retransmissions.


With GRO disabled on the vxlan interface on the receive side (Host A), performance is much better and the retransmissions do not occur:

[root@ibm-x3550m4-10 ~]# ethtool -K vxlan_sys_4789 gro off

[root@ibm-x3550m4-9 ~]# iperf3 -c 5.5.5.2 -P 12 -w 200K

[  4]   9.00-10.00  sec   118 MBytes   991 Mbits/sec    0    202 KBytes       
[  6]   9.00-10.00  sec   114 MBytes   957 Mbits/sec    0    205 KBytes       
[  8]   9.00-10.00  sec  56.5 MBytes   474 Mbits/sec    0    206 KBytes       
[ 10]   9.00-10.00  sec  55.1 MBytes   462 Mbits/sec    0    205 KBytes       
[ 12]   9.00-10.00  sec   110 MBytes   919 Mbits/sec    0    279 KBytes       
[ 14]   9.00-10.00  sec   109 MBytes   915 Mbits/sec    0    205 KBytes       
[ 16]   9.00-10.00  sec  53.2 MBytes   447 Mbits/sec    0    209 KBytes       
[ 18]   9.00-10.00  sec  51.4 MBytes   432 Mbits/sec    0    216 KBytes       
[ 20]   9.00-10.00  sec   107 MBytes   896 Mbits/sec    0    202 KBytes       
[ 22]   9.00-10.00  sec  49.8 MBytes   418 Mbits/sec    0    205 KBytes       
[ 24]   9.00-10.00  sec  47.8 MBytes   401 Mbits/sec    0    209 KBytes       
[ 26]   9.00-10.00  sec   102 MBytes   854 Mbits/sec    0    208 KBytes       
[SUM]   9.00-10.00  sec   973 MBytes  8.17 Gbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  1.15 GBytes   990 Mbits/sec    0             sender
[  4]   0.00-10.00  sec  1.15 GBytes   990 Mbits/sec                  receiver
[  6]   0.00-10.00  sec  1.12 GBytes   964 Mbits/sec    0             sender
[  6]   0.00-10.00  sec  1.12 GBytes   963 Mbits/sec                  receiver
[  8]   0.00-10.00  sec   570 MBytes   478 Mbits/sec    0             sender
[  8]   0.00-10.00  sec   569 MBytes   478 Mbits/sec                  receiver
[ 10]   0.00-10.00  sec   552 MBytes   463 Mbits/sec    0             sender
[ 10]   0.00-10.00  sec   552 MBytes   463 Mbits/sec                  receiver
[ 12]   0.00-10.00  sec  1.06 GBytes   910 Mbits/sec    0             sender
[ 12]   0.00-10.00  sec  1.06 GBytes   909 Mbits/sec                  receiver
[ 14]   0.00-10.00  sec  1.06 GBytes   908 Mbits/sec    0             sender
[ 14]   0.00-10.00  sec  1.06 GBytes   907 Mbits/sec                  receiver
[ 16]   0.00-10.00  sec   532 MBytes   446 Mbits/sec    0             sender
[ 16]   0.00-10.00  sec   531 MBytes   446 Mbits/sec                  receiver
[ 18]   0.00-10.00  sec   512 MBytes   429 Mbits/sec    0             sender
[ 18]   0.00-10.00  sec   512 MBytes   429 Mbits/sec                  receiver
[ 20]   0.00-10.00  sec  1.03 GBytes   881 Mbits/sec    0             sender
[ 20]   0.00-10.00  sec  1.03 GBytes   881 Mbits/sec                  receiver
[ 22]   0.00-10.00  sec   495 MBytes   415 Mbits/sec    0             sender
[ 22]   0.00-10.00  sec   495 MBytes   415 Mbits/sec                  receiver
[ 24]   0.00-10.00  sec   476 MBytes   400 Mbits/sec    0             sender
[ 24]   0.00-10.00  sec   476 MBytes   399 Mbits/sec                  receiver
[ 26]   0.00-10.00  sec  1006 MBytes   844 Mbits/sec    0             sender
[ 26]   0.00-10.00  sec  1006 MBytes   844 Mbits/sec                  receiver
[SUM]   0.00-10.00  sec  9.46 GBytes  8.13 Gbits/sec    0             sender
[SUM]   0.00-10.00  sec  9.46 GBytes  8.12 Gbits/sec                  receiver



Additional info:

A 7.2 kernel on the receive side shows good performance, so some change in 7.3 introduced this regression.

I was able to narrow it down to something between 3.10.0-432.el7 and 3.10.0-433.el7. There are a lot of tunnel/vxlan/openvswitch commits between these two revisions, but the most obvious culprit is:

4452a36 [net] vxlan: GRO support at tunnel layer


I tested with an upstream kernel and performance over the VXLAN tunnel is good with GRO enabled, so I assume some later upstream change avoids this regression, but I have not yet found it among the many changes to go through.


Also, note that I am setting the sender's write buffer size with the iperf3 -w option in an attempt to clamp the sender's congestion window and avoid the performance issue caused by BZ1418870.

Comment 5 Jiri Benc 2017-03-03 21:31:26 UTC
I was able to reproduce the problem, identify the fix and verify it.

This is fixed by the following upstream commit:

commit 88340160f3ad22401b00f4efcee44f7ec4769b19
Author: Martin KaFai Lau <kafai>
Date:   Fri Jan 16 10:11:00 2015 -0800

    ip_tunnel: Create percpu gro_cell
    
    In the ipip tunnel, the skb->queue_mapping is lost in ipip_rcv().
    All skb will be queued to the same cell->napi_skbs.  The
    gro_cell_poll is pinned to one core under load.  In production traffic,
    we also see severe rx_dropped in the tunl iface and it is probably due to
    this limit: skb_queue_len(&cell->napi_skbs) > netdev_max_backlog.
    
    This patch is trying to alloc_percpu(struct gro_cell) and schedule
    gro_cell_poll to process the skb in the same core.
    
    Signed-off-by: Martin KaFai Lau <kafai>
    Acked-by: Eric Dumazet <edumazet>
    Signed-off-by: David S. Miller <davem>
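
For reference, the symptoms described in the commit message should be observable from userspace on an unfixed kernel while the iperf3 test is running: one CPU saturated in %soft with the others mostly idle, and the RX "dropped" counter on the vxlan interface climbing (interface name assumed to be the OVS-created vxlan_sys_4789):

# mpstat -P ALL 2 1
# ip -s link show dev vxlan_sys_4789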

Comment 9 Jiri Benc 2017-03-06 13:07:27 UTC
Reproduction script:

#!/bin/bash

iface=em1
h=1
# h=2 for the other side

oh=$((3 - h))

# Underlay: jumbo MTU, fresh IPv4 address, UDP flow hashing on L3 + L4
ip link set "$iface" mtu 9000 up
ip -4 addr flush dev "$iface"
ip addr add 192.168.99.$h/24 dev "$iface"
ethtool -U "$iface" rx-flow-hash udp4 sdfn

# Overlay: OVS bridge with a VXLAN port to the other host plus an internal port
ovs-vsctl --if-exists del-br ovs0
ovs-vsctl add-br ovs0
ovs-vsctl add-port ovs0 vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=192.168.99.$oh
ovs-vsctl add-port ovs0 i0 -- set interface i0 type=internal
ip link set i0 up
ip addr add 192.168.98.$h/24 dev i0

if [[ $h = 2 ]]; then
    iperf3 -s
else
    iperf3 -c 192.168.98.2 -P 100 -w 200K
fi

Comment 14 Rafael Aquini 2017-03-10 14:37:04 UTC
Patch(es) committed to the kernel repository and an interim kernel build is undergoing testing

Comment 17 Rafael Aquini 2017-03-13 11:15:58 UTC
Patch(es) available in kernel-3.10.0-599.el7

Comment 22 Jan Tluka 2017-04-26 14:49:38 UTC
Reproduced on 3.10.0-514.6.1.el7 with an ixgbe NIC: throughput at ~8 Gbit/s.

Verified on kernel 3.10.0-655.el7 with an ixgbe NIC: performance is back to ~9.1 Gbit/s.

Comment 23 errata-xmlrpc 2017-08-02 05:42:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1842