Red Hat Bugzilla – Bug 1424076
vxlan: performance can suffer unless GRO is disabled on vxlan interface
Last modified: 2017-08-02 01:42:35 EDT
Description of problem:

VXLAN interfaces have GRO enabled by default in RHEL 7.3. This causes a performance regression for a strategic customer when measuring throughput across a VXLAN link.

Version-Release number of selected component (if applicable):
kernel-3.10.0-514.6.1.el7.x86_64

How reproducible:
Always?

Steps to Reproduce:

1. Create a VXLAN tunnel between two RHEL 7.3 hosts and enable UDP flow hashing on IP src/dst and L4 src/dst (without hashing on L4 the performance is never very good):

Host A:
# ip link set dev enp12s0 mtu 9000
# ip link set dev enp12s0 up
# ip addr add 172.168.0.2/24 dev enp12s0
# ethtool -U enp12s0 rx-flow-hash udp4 sdfn
# ovs-vsctl add-br br-int
# ovs-vsctl add-port br-int enp12s0
# ovs-vsctl add-port br-int vxlan404 -- set Interface vxlan404 type=vxlan options:remote_ip=172.168.0.3
# ip link set dev br-int mtu 1500
# ip link set dev br-int up
# ip addr add 5.5.5.2/24 dev br-int

Host B:
# ip link set dev enp12s0 mtu 9000
# ip link set dev enp12s0 up
# ip addr add 172.168.0.3/24 dev enp12s0
# ethtool -U enp12s0 rx-flow-hash udp4 sdfn
# ovs-vsctl add-br br-int
# ovs-vsctl add-port br-int enp12s0
# ovs-vsctl add-port br-int vxlan404 -- set Interface vxlan404 type=vxlan options:remote_ip=172.168.0.2
# ip link set dev br-int mtu 1500
# ip link set dev br-int up
# ip addr add 5.5.5.3/24 dev br-int

2. Run an iperf3 test from Host B to Host A directly over the two 10G interfaces (the 172.168.0.0/24 network). Throughput should be > 9 Gb/sec.

3. Run an iperf3 test between the two VXLAN interfaces (the 5.5.5.0/24 network):
(Host A)# iperf3 -s
(Host B)# iperf3 -c 5.5.5.2 -P 12 -w 200K

Actual results:

Throughput never quite reaches 7 Gb/sec and there are heavy retransmissions:

[root@ibm-x3550m4-9 ~]# iperf3 -c 5.5.5.2 -P 12 -w 200K
[  4]   9.00-10.00 sec  37.0 MBytes   310 Mbits/sec   73   82.0 KBytes
[  6]   9.00-10.00 sec  96.1 MBytes   805 Mbits/sec    2    208 KBytes
[  8]   9.00-10.00 sec  98.2 MBytes   823 Mbits/sec    4    205 KBytes
[ 10]   9.00-10.00 sec  93.1 MBytes   781 Mbits/sec   15    204 KBytes
[ 12]   9.00-10.00 sec  34.0 MBytes   285 Mbits/sec  215   86.3 KBytes
[ 14]   9.00-10.00 sec  71.6 MBytes   600 Mbits/sec   50    174 KBytes
[ 16]   9.00-10.00 sec  65.9 MBytes   553 Mbits/sec   65    164 KBytes
[ 18]   9.00-10.00 sec  80.3 MBytes   673 Mbits/sec   52    157 KBytes
[ 20]   9.00-10.00 sec  81.2 MBytes   681 Mbits/sec   90    120 KBytes
[ 22]   9.00-10.00 sec  34.3 MBytes   288 Mbits/sec  206   86.3 KBytes
[ 24]   9.00-10.00 sec  66.1 MBytes   554 Mbits/sec   38    116 KBytes
[ 26]   9.00-10.00 sec  54.9 MBytes   460 Mbits/sec   84    136 KBytes
[SUM]   9.00-10.00 sec   813 MBytes  6.81 Gbits/sec  894
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00 sec   414 MBytes   347 Mbits/sec  1375   sender
[  4]   0.00-10.00 sec   414 MBytes   347 Mbits/sec          receiver
[  6]   0.00-10.00 sec   931 MBytes   781 Mbits/sec   166   sender
[  6]   0.00-10.00 sec   930 MBytes   780 Mbits/sec          receiver
[  8]   0.00-10.00 sec   932 MBytes   782 Mbits/sec   218   sender
[  8]   0.00-10.00 sec   932 MBytes   781 Mbits/sec          receiver
[ 10]   0.00-10.00 sec   887 MBytes   744 Mbits/sec   275   sender
[ 10]   0.00-10.00 sec   887 MBytes   744 Mbits/sec          receiver
[ 12]   0.00-10.00 sec   363 MBytes   305 Mbits/sec  1745   sender
[ 12]   0.00-10.00 sec   363 MBytes   304 Mbits/sec          receiver
[ 14]   0.00-10.00 sec   834 MBytes   700 Mbits/sec   443   sender
[ 14]   0.00-10.00 sec   834 MBytes   699 Mbits/sec          receiver
[ 16]   0.00-10.00 sec   756 MBytes   634 Mbits/sec   612   sender
[ 16]   0.00-10.00 sec   756 MBytes   634 Mbits/sec          receiver
[ 18]   0.00-10.00 sec   838 MBytes   703 Mbits/sec   400   sender
[ 18]   0.00-10.00 sec   838 MBytes   702 Mbits/sec          receiver
[ 20]   0.00-10.00 sec   707 MBytes   593 Mbits/sec   581   sender
[ 20]   0.00-10.00 sec   707 MBytes   593 Mbits/sec          receiver
[ 22]   0.00-10.00 sec   316 MBytes   265 Mbits/sec  1840   sender
[ 22]   0.00-10.00 sec   316 MBytes   265 Mbits/sec          receiver
[ 24]   0.00-10.00 sec   599 MBytes   502 Mbits/sec   882   sender
[ 24]   0.00-10.00 sec   599 MBytes   502 Mbits/sec          receiver
[ 26]   0.00-10.00 sec   548 MBytes   459 Mbits/sec  1048   sender
[ 26]   0.00-10.00 sec   547 MBytes   459 Mbits/sec          receiver
[SUM]   0.00-10.00 sec  7.93 GBytes  6.82 Gbits/sec  9585   sender
[SUM]   0.00-10.00 sec  7.93 GBytes  6.81 Gbits/sec          receiver

Expected results:

Higher throughput with no retransmissions. If GRO is disabled on the vxlan interface of the receive side (Host A), performance is much better and the retransmissions do not occur:

[root@ibm-x3550m4-10 ~]# ethtool -K vxlan_sys_4789 gro off

[root@ibm-x3550m4-9 ~]# iperf3 -c 5.5.5.2 -P 12 -w 200K
[  4]   9.00-10.00 sec   118 MBytes   991 Mbits/sec    0    202 KBytes
[  6]   9.00-10.00 sec   114 MBytes   957 Mbits/sec    0    205 KBytes
[  8]   9.00-10.00 sec  56.5 MBytes   474 Mbits/sec    0    206 KBytes
[ 10]   9.00-10.00 sec  55.1 MBytes   462 Mbits/sec    0    205 KBytes
[ 12]   9.00-10.00 sec   110 MBytes   919 Mbits/sec    0    279 KBytes
[ 14]   9.00-10.00 sec   109 MBytes   915 Mbits/sec    0    205 KBytes
[ 16]   9.00-10.00 sec  53.2 MBytes   447 Mbits/sec    0    209 KBytes
[ 18]   9.00-10.00 sec  51.4 MBytes   432 Mbits/sec    0    216 KBytes
[ 20]   9.00-10.00 sec   107 MBytes   896 Mbits/sec    0    202 KBytes
[ 22]   9.00-10.00 sec  49.8 MBytes   418 Mbits/sec    0    205 KBytes
[ 24]   9.00-10.00 sec  47.8 MBytes   401 Mbits/sec    0    209 KBytes
[ 26]   9.00-10.00 sec   102 MBytes   854 Mbits/sec    0    208 KBytes
[SUM]   9.00-10.00 sec   973 MBytes  8.17 Gbits/sec    0
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00 sec  1.15 GBytes   990 Mbits/sec    0   sender
[  4]   0.00-10.00 sec  1.15 GBytes   990 Mbits/sec        receiver
[  6]   0.00-10.00 sec  1.12 GBytes   964 Mbits/sec    0   sender
[  6]   0.00-10.00 sec  1.12 GBytes   963 Mbits/sec        receiver
[  8]   0.00-10.00 sec   570 MBytes   478 Mbits/sec    0   sender
[  8]   0.00-10.00 sec   569 MBytes   478 Mbits/sec        receiver
[ 10]   0.00-10.00 sec   552 MBytes   463 Mbits/sec    0   sender
[ 10]   0.00-10.00 sec   552 MBytes   463 Mbits/sec        receiver
[ 12]   0.00-10.00 sec  1.06 GBytes   910 Mbits/sec    0   sender
[ 12]   0.00-10.00 sec  1.06 GBytes   909 Mbits/sec        receiver
[ 14]   0.00-10.00 sec  1.06 GBytes   908 Mbits/sec    0   sender
[ 14]   0.00-10.00 sec  1.06 GBytes   907 Mbits/sec        receiver
[ 16]   0.00-10.00 sec   532 MBytes   446 Mbits/sec    0   sender
[ 16]   0.00-10.00 sec   531 MBytes   446 Mbits/sec        receiver
[ 18]   0.00-10.00 sec   512 MBytes   429 Mbits/sec    0   sender
[ 18]   0.00-10.00 sec   512 MBytes   429 Mbits/sec        receiver
[ 20]   0.00-10.00 sec  1.03 GBytes   881 Mbits/sec    0   sender
[ 20]   0.00-10.00 sec  1.03 GBytes   881 Mbits/sec        receiver
[ 22]   0.00-10.00 sec   495 MBytes   415 Mbits/sec    0   sender
[ 22]   0.00-10.00 sec   495 MBytes   415 Mbits/sec        receiver
[ 24]   0.00-10.00 sec   476 MBytes   400 Mbits/sec    0   sender
[ 24]   0.00-10.00 sec   476 MBytes   399 Mbits/sec        receiver
[ 26]   0.00-10.00 sec  1006 MBytes   844 Mbits/sec    0   sender
[ 26]   0.00-10.00 sec  1006 MBytes   844 Mbits/sec        receiver
[SUM]   0.00-10.00 sec  9.46 GBytes  8.13 Gbits/sec    0   sender
[SUM]   0.00-10.00 sec  9.46 GBytes  8.12 Gbits/sec        receiver
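As a minimal sketch of checking and applying the workaround on the receive side: the snippet below assumes the OVS-created tunnel device is named vxlan_sys_4789, as in the output above; the actual name may differ depending on the VXLAN UDP port in use.

#!/bin/bash
# Workaround sketch (assumption: OVS names the tunnel device vxlan_sys_4789).
DEV=vxlan_sys_4789
# Show the current GRO setting for the tunnel device
ethtool -k $DEV | grep generic-receive-offload
# Disable GRO on the tunnel device (the workaround shown above)
ethtool -K $DEV gro off
# Re-enable GRO once a fixed kernel is installed
# ethtool -K $DEV gro on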
Additional info:

A 7.2 kernel on the receive side shows good performance, so some change in 7.3 introduced this regression. I was able to narrow it down to something between 3.10.0-432.el7 and 3.10.0-433.el7. There are a lot of tunnel/vxlan/openvswitch commits between those two revisions, but the most obvious culprit is:

4452a36 [net] vxlan: GRO support at tunnel layer

I tested with an upstream kernel and performance over the VXLAN is good with GRO enabled, so I assume some upstream change avoids this regression, but I have not yet found it among the many changes to go through.

Also, note that I am setting the sender's write buffer size with the iperf3 -w option in an attempt to clamp the sender's congestion window and avoid the performance issue caused by BZ1418870.
I was able to reproduce the problem, identify the fix and verify it. This is fixed by the following upstream commit:

commit 88340160f3ad22401b00f4efcee44f7ec4769b19
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Fri Jan 16 10:11:00 2015 -0800

    ip_tunnel: Create percpu gro_cell

    In the ipip tunnel, the skb->queue_mapping is lost in ipip_rcv().
    All skb will be queued to the same cell->napi_skbs. The gro_cell_poll
    is pinned to one core under load. In production traffic, we also see
    severe rx_dropped in the tunl iface and it is probably due to this
    limit: skb_queue_len(&cell->napi_skbs) > netdev_max_backlog.

    This patch is trying to alloc_percpu(struct gro_cell) and schedule
    gro_cell_poll to process the skb in the same core.

    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Acked-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
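Not from the original report, but a quick way to observe the behaviour the commit message describes (GRO processing pinned to a single core, plus backlog drops) while the iperf3 test is running; the device name and mpstat usage below are assumptions for this particular setup:

#!/bin/bash
# Observation sketch (assumptions: sysstat installed, OVS tunnel device named as below).
DEV=vxlan_sys_4789

# Per-CPU softirq load: with the pre-fix kernel, %soft concentrates on one CPU
# during the test; with the percpu gro_cell fix it spreads across receiving CPUs.
mpstat -P ALL 1 5

# Second column of /proc/net/softnet_stat is the per-CPU backlog drop count (hex).
awk '{ printf "cpu%d dropped=0x%s\n", NR-1, $2 }' /proc/net/softnet_stat

# rx_dropped on the tunnel device itself, as mentioned in the commit message.
cat /sys/class/net/$DEV/statistics/rx_dropped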
Reproduction script:

#!/bin/bash
iface=em1
h=1        # h=2 for the other side
oh=$((3 - h))

ip l s $iface mtu 9000 up
ip -4 a f $iface
ip a a 192.168.99.$h/24 dev $iface
ethtool -U $iface rx-flow-hash udp4 sdfn

ovs-vsctl del-br ovs0
ovs-vsctl add-br ovs0
ovs-vsctl add-port ovs0 vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=192.168.99.$oh
ovs-vsctl add-port ovs0 i0 -- set interface i0 type=internal
ip l s i0 up
ip a a 192.168.98.$h/24 dev i0

if [[ $h = 2 ]]; then
    iperf3 -s
else
    iperf3 -c 192.168.98.2 -P 100 -w 200K
fi
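Usage note (not part of the original script): run it on both hosts, with h=1 on one side and h=2 on the other; 192.168.99.x is the underlay network and 192.168.98.x the overlay. Before the overlay test it is worth confirming the raw 10G path does > 9 Gbit/s, roughly like this (the -P value is just an example):

#!/bin/bash
# Underlay baseline sketch, assuming the same addressing as the script above.
# On the h=2 host:
#   iperf3 -s
# On the h=1 host:
iperf3 -c 192.168.99.2 -P 4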
Patch(es) committed to the kernel repository; an interim kernel build is undergoing testing
Patch(es) available on kernel-3.10.0-599.el7
Reproduced on kernel 3.10.0-514.6.1.el7 with an ixgbe NIC: throughput at ~8 Gbit/s. Verified on kernel 3.10.0-655.el7 with an ixgbe NIC: the performance numbers got back to ~9.1 Gbit/s.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:1842