Bug 1424076 - vxlan: performance can suffer unless GRO is disabled on vxlan interface
Summary: vxlan: performance can suffer unless GRO is disabled on vxlan interface
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.3
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Jiri Benc
QA Contact: Jan Tluka
URL:
Whiteboard:
Depends On:
Blocks: 1298243 1323132 1429597 1431197
 
Reported: 2017-02-17 15:49 UTC by Patrick Talbert
Modified: 2020-09-10 10:13 UTC
CC List: 16 users

Fixed In Version: kernel-3.10.0-599.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1431197
Environment:
Last Closed: 2017-08-02 05:42:35 UTC
Target Upstream Version:
Embargoed:


Attachments: none
Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2933951 0 None None None 2017-02-17 16:31:06 UTC
Red Hat Product Errata RHSA-2017:1842 0 normal SHIPPED_LIVE Important: kernel security, bug fix, and enhancement update 2017-08-01 18:22:09 UTC

Description Patrick Talbert 2017-02-17 15:49:23 UTC
Description of problem:
VXLAN interfaces have GRO enabled by default in RHEL 7.3. This causes a performance regression for a strategic customer when measuring throughput across a VXLAN link.
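
The GRO state of the tunnel netdev can be checked and toggled with ethtool. As a sketch of the workaround shown later in this report (assuming the OVS-created vxlan device is named vxlan_sys_4789, as it is on the test hosts below):

# ethtool -k vxlan_sys_4789 | grep generic-receive-offload
# ethtool -K vxlan_sys_4789 gro off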


Version-Release number of selected component (if applicable):
kernel-3.10.0-514.6.1.el7.x86_64


How reproducible:
Always?


Steps to Reproduce:
1. Create a VXLAN tunnel between two RHEL 7.3 hosts and enable UDP flow hashing using IP src/dst and L4 src/dst (without hashing on L4 the performance is never very good):

Host A:

# ip link set dev enp12s0 mtu 9000
# ip link set dev enp12s0 up
# ip addr add 172.168.0.2/24 dev enp12s0
# ethtool -U enp12s0 rx-flow-hash udp4 sdfn

# ovs-vsctl add-br br-int
# ovs-vsctl add-port br-int enp12s0
# ovs-vsctl add-port br-int vxlan404 -- set Interface vxlan404 type=vxlan options:remote_ip=172.168.0.3

# ip link set dev br-int mtu 1500
# ip link set dev br-int up
# ip addr add 5.5.5.2/24 dev br-int

Host B:

# ip link set dev enp12s0 mtu 9000
# ip link set dev enp12s0 up
# ip addr add 172.168.0.3/24 dev enp12s0
# ethtool -U enp12s0 rx-flow-hash udp4 sdfn

# ovs-vsctl add-br br-int
# ovs-vsctl add-port br-int enp12s0
# ovs-vsctl add-port br-int vxlan404 -- set Interface vxlan404 type=vxlan options:remote_ip=172.168.0.2

# ip link set dev br-int mtu 1500
# ip link set dev br-int up
# ip addr add 5.5.5.3/24 dev br-int



2. Run an iperf3 test from Host B to Host A on the two 10G interfaces (the 172.168.0.0/24 network). Throughput should be > 9Gb/sec.
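
The exact invocation for this baseline test is not given in the report; a plausible one, mirroring step 3 but using the underlay addresses, is:

(Host A)# iperf3 -s

(Host B)# iperf3 -c 172.168.0.2 -P 12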

3. Run an iperf3 test between the two VXLAN interfaces (5.5.5.0/24 network):

(Host A)# iperf3 -s

(Host B)# iperf3 -c 5.5.5.2 -P 12 -w 200K


Actual results:

Throughput never quite reaches 7 Gb/sec:

[root@ibm-x3550m4-9 ~]# iperf3 -c 5.5.5.2 -P 12 -w 200K

[  4]   9.00-10.00  sec  37.0 MBytes   310 Mbits/sec   73   82.0 KBytes       
[  6]   9.00-10.00  sec  96.1 MBytes   805 Mbits/sec    2    208 KBytes       
[  8]   9.00-10.00  sec  98.2 MBytes   823 Mbits/sec    4    205 KBytes       
[ 10]   9.00-10.00  sec  93.1 MBytes   781 Mbits/sec   15    204 KBytes       
[ 12]   9.00-10.00  sec  34.0 MBytes   285 Mbits/sec  215   86.3 KBytes       
[ 14]   9.00-10.00  sec  71.6 MBytes   600 Mbits/sec   50    174 KBytes       
[ 16]   9.00-10.00  sec  65.9 MBytes   553 Mbits/sec   65    164 KBytes       
[ 18]   9.00-10.00  sec  80.3 MBytes   673 Mbits/sec   52    157 KBytes       
[ 20]   9.00-10.00  sec  81.2 MBytes   681 Mbits/sec   90    120 KBytes       
[ 22]   9.00-10.00  sec  34.3 MBytes   288 Mbits/sec  206   86.3 KBytes       
[ 24]   9.00-10.00  sec  66.1 MBytes   554 Mbits/sec   38    116 KBytes       
[ 26]   9.00-10.00  sec  54.9 MBytes   460 Mbits/sec   84    136 KBytes       
[SUM]   9.00-10.00  sec   813 MBytes  6.81 Gbits/sec  894             
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec   414 MBytes   347 Mbits/sec  1375             sender
[  4]   0.00-10.00  sec   414 MBytes   347 Mbits/sec                  receiver
[  6]   0.00-10.00  sec   931 MBytes   781 Mbits/sec  166             sender
[  6]   0.00-10.00  sec   930 MBytes   780 Mbits/sec                  receiver
[  8]   0.00-10.00  sec   932 MBytes   782 Mbits/sec  218             sender
[  8]   0.00-10.00  sec   932 MBytes   781 Mbits/sec                  receiver
[ 10]   0.00-10.00  sec   887 MBytes   744 Mbits/sec  275             sender
[ 10]   0.00-10.00  sec   887 MBytes   744 Mbits/sec                  receiver
[ 12]   0.00-10.00  sec   363 MBytes   305 Mbits/sec  1745             sender
[ 12]   0.00-10.00  sec   363 MBytes   304 Mbits/sec                  receiver
[ 14]   0.00-10.00  sec   834 MBytes   700 Mbits/sec  443             sender
[ 14]   0.00-10.00  sec   834 MBytes   699 Mbits/sec                  receiver
[ 16]   0.00-10.00  sec   756 MBytes   634 Mbits/sec  612             sender
[ 16]   0.00-10.00  sec   756 MBytes   634 Mbits/sec                  receiver
[ 18]   0.00-10.00  sec   838 MBytes   703 Mbits/sec  400             sender
[ 18]   0.00-10.00  sec   838 MBytes   702 Mbits/sec                  receiver
[ 20]   0.00-10.00  sec   707 MBytes   593 Mbits/sec  581             sender
[ 20]   0.00-10.00  sec   707 MBytes   593 Mbits/sec                  receiver
[ 22]   0.00-10.00  sec   316 MBytes   265 Mbits/sec  1840             sender
[ 22]   0.00-10.00  sec   316 MBytes   265 Mbits/sec                  receiver
[ 24]   0.00-10.00  sec   599 MBytes   502 Mbits/sec  882             sender
[ 24]   0.00-10.00  sec   599 MBytes   502 Mbits/sec                  receiver
[ 26]   0.00-10.00  sec   548 MBytes   459 Mbits/sec  1048             sender
[ 26]   0.00-10.00  sec   547 MBytes   459 Mbits/sec                  receiver
[SUM]   0.00-10.00  sec  7.93 GBytes  6.82 Gbits/sec  9585             sender
[SUM]   0.00-10.00  sec  7.93 GBytes  6.81 Gbits/sec                  receiver


Expected results:

Higher throughput with no retransmissions.


If you disable GRO on the vxlan interface of the Receive side (Host A), the performance is much better and the retransmissions do not occur:

[root@ibm-x3550m4-10 ~]# ethtool -K vxlan_sys_4789 gro off

[root@ibm-x3550m4-9 ~]# iperf3 -c 5.5.5.2 -P 12 -w 200K

[  4]   9.00-10.00  sec   118 MBytes   991 Mbits/sec    0    202 KBytes       
[  6]   9.00-10.00  sec   114 MBytes   957 Mbits/sec    0    205 KBytes       
[  8]   9.00-10.00  sec  56.5 MBytes   474 Mbits/sec    0    206 KBytes       
[ 10]   9.00-10.00  sec  55.1 MBytes   462 Mbits/sec    0    205 KBytes       
[ 12]   9.00-10.00  sec   110 MBytes   919 Mbits/sec    0    279 KBytes       
[ 14]   9.00-10.00  sec   109 MBytes   915 Mbits/sec    0    205 KBytes       
[ 16]   9.00-10.00  sec  53.2 MBytes   447 Mbits/sec    0    209 KBytes       
[ 18]   9.00-10.00  sec  51.4 MBytes   432 Mbits/sec    0    216 KBytes       
[ 20]   9.00-10.00  sec   107 MBytes   896 Mbits/sec    0    202 KBytes       
[ 22]   9.00-10.00  sec  49.8 MBytes   418 Mbits/sec    0    205 KBytes       
[ 24]   9.00-10.00  sec  47.8 MBytes   401 Mbits/sec    0    209 KBytes       
[ 26]   9.00-10.00  sec   102 MBytes   854 Mbits/sec    0    208 KBytes       
[SUM]   9.00-10.00  sec   973 MBytes  8.17 Gbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  1.15 GBytes   990 Mbits/sec    0             sender
[  4]   0.00-10.00  sec  1.15 GBytes   990 Mbits/sec                  receiver
[  6]   0.00-10.00  sec  1.12 GBytes   964 Mbits/sec    0             sender
[  6]   0.00-10.00  sec  1.12 GBytes   963 Mbits/sec                  receiver
[  8]   0.00-10.00  sec   570 MBytes   478 Mbits/sec    0             sender
[  8]   0.00-10.00  sec   569 MBytes   478 Mbits/sec                  receiver
[ 10]   0.00-10.00  sec   552 MBytes   463 Mbits/sec    0             sender
[ 10]   0.00-10.00  sec   552 MBytes   463 Mbits/sec                  receiver
[ 12]   0.00-10.00  sec  1.06 GBytes   910 Mbits/sec    0             sender
[ 12]   0.00-10.00  sec  1.06 GBytes   909 Mbits/sec                  receiver
[ 14]   0.00-10.00  sec  1.06 GBytes   908 Mbits/sec    0             sender
[ 14]   0.00-10.00  sec  1.06 GBytes   907 Mbits/sec                  receiver
[ 16]   0.00-10.00  sec   532 MBytes   446 Mbits/sec    0             sender
[ 16]   0.00-10.00  sec   531 MBytes   446 Mbits/sec                  receiver
[ 18]   0.00-10.00  sec   512 MBytes   429 Mbits/sec    0             sender
[ 18]   0.00-10.00  sec   512 MBytes   429 Mbits/sec                  receiver
[ 20]   0.00-10.00  sec  1.03 GBytes   881 Mbits/sec    0             sender
[ 20]   0.00-10.00  sec  1.03 GBytes   881 Mbits/sec                  receiver
[ 22]   0.00-10.00  sec   495 MBytes   415 Mbits/sec    0             sender
[ 22]   0.00-10.00  sec   495 MBytes   415 Mbits/sec                  receiver
[ 24]   0.00-10.00  sec   476 MBytes   400 Mbits/sec    0             sender
[ 24]   0.00-10.00  sec   476 MBytes   399 Mbits/sec                  receiver
[ 26]   0.00-10.00  sec  1006 MBytes   844 Mbits/sec    0             sender
[ 26]   0.00-10.00  sec  1006 MBytes   844 Mbits/sec                  receiver
[SUM]   0.00-10.00  sec  9.46 GBytes  8.13 Gbits/sec    0             sender
[SUM]   0.00-10.00  sec  9.46 GBytes  8.12 Gbits/sec                  receiver



Additional info:

A 7.2 kernel on the Receive side shows good performance, so some change in 7.3 introduced this regression.

I was able to narrow it down to something between 3.10.0-432.el7 and 3.10.0-433.el7. There are a lot of tunnel/vxlan/openvswitch commits between these two revisions, but the most obvious culprit is:

4452a36 [net] vxlan: GRO support at tunnel layer


I tested with an upstream kernel and performance over the VXLAN is good with GRO enabled. So I assume there is some change upstream which avoids this regression, but I have not yet found it among the many changes to go through.


Also, just a note that I am setting the sender's write buffer size with iperf3 using the -w option in an attempt to clamp the sender's congestion window to avoid the performance issue caused by BZ1418870.

Comment 5 Jiri Benc 2017-03-03 21:31:26 UTC
I was able to reproduce the problem, identify the fix and verify it.

This is fixed by the following upstream commit:

commit 88340160f3ad22401b00f4efcee44f7ec4769b19
Author: Martin KaFai Lau <kafai>
Date:   Fri Jan 16 10:11:00 2015 -0800

    ip_tunnel: Create percpu gro_cell
    
    In the ipip tunnel, the skb->queue_mapping is lost in ipip_rcv().
    All skb will be queued to the same cell->napi_skbs.  The
    gro_cell_poll is pinned to one core under load.  In production traffic,
    we also see severe rx_dropped in the tunl iface and it is probably due to
    this limit: skb_queue_len(&cell->napi_skbs) > netdev_max_backlog.
    
    This patch is trying to alloc_percpu(struct gro_cell) and schedule
    gro_cell_poll to process the skb in the same core.
    
    Signed-off-by: Martin KaFai Lau <kafai>
    Acked-by: Eric Dumazet <edumazet>
    Signed-off-by: David S. Miller <davem>

Comment 9 Jiri Benc 2017-03-06 13:07:27 UTC
Reproduction script:

#!/bin/bash

iface=em1
h=1
# h=2 for the other side

oh=$((3 - h))

# underlay NIC: jumbo MTU, flush and assign address, enable UDP 4-tuple rx flow hashing
ip l s $iface mtu 9000 up
ip -4 a f $iface
ip a a 192.168.99.$h/24 dev $iface
ethtool -U $iface rx-flow-hash udp4 sdfn
ovs-vsctl del-br ovs0
ovs-vsctl add-br ovs0
ovs-vsctl add-port ovs0 vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=192.168.99.$oh
ovs-vsctl add-port ovs0 i0 -- set interface i0 type=internal
ip l s i0 up
ip a a 192.168.98.$h/24 dev i0

if [[ $h = 2 ]]; then
    iperf3 -s
else
    iperf3 -c 192.168.98.2 -P 100 -w 200K
fi

Comment 14 Rafael Aquini 2017-03-10 14:37:04 UTC
Patch(es) committed on kernel repository and an interim kernel build is undergoing testing

Comment 17 Rafael Aquini 2017-03-13 11:15:58 UTC
Patch(es) available on kernel-3.10.0-599.el7
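
As a minimal check that a host is running a kernel with the fix (the release string comes from the "Fixed In Version" field above), one could compare the running kernel against 3.10.0-599.el7:

# uname -r    # expect 3.10.0-599.el7 or later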

Comment 22 Jan Tluka 2017-04-26 14:49:38 UTC
Reproduced on 3.10.0-514.6.1.el7 on ixgbe NIC. Throughput at ~8 Gbit/s

Verified on kernel 3.10.0-655.el7 on ixgbe NIC. The performance numbers got back to ~9.1 Gbit/s.

Comment 23 errata-xmlrpc 2017-08-02 05:42:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1842

