Bug 1378656

Summary: [LLNL 7.4 Bug] Serious Performance regression with NATed IPoIB connected mode
Product: Red Hat Enterprise Linux 7 Reporter: Ben Woodard <woodard>
Component: kernel Assignee: Jonathan Toppins <jtoppins>
kernel sub component: Infiniband QA Contact: zguo <zguo>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: bhu, cascardo, cyates, ddutile, dhoward, foraker1, hartsjc, honli, infiniband-qe, ivecera, jarod, jmcnicol, jshortt, jtoppins, lmiksik, mstowell, pabeni, rdma-dev-team, snagar, tgummels, woodard, yizhan, zguo
Version: 7.3 Keywords: ZStream
Target Milestone: rc   
Target Release: 7.4   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: kernel-3.10.0-516.el7 Doc Type: Bug Fix
Doc Text:
Cause: A change to the GSO control block conflicts with the IPoIB control block, so the IPoIB address information cached in the control block gets overwritten.
Consequence: The overwriting of the cached data results in a significant performance degradation on IPoIB fabrics.
Fix: Move the IPoIB address information to another section of the socket buffer, preventing the overwrite.
Result: Original performance is restored.
Story Points: ---
Clone Of:
: 1390668 (view as bug list) Environment:
Last Closed: 2017-08-02 01:46:05 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1298243, 1353018, 1381646, 1390668, 1446211    

Description Ben Woodard 2016-09-23 01:42:43 UTC
Description of problem:
The general network topology is:
Client <--(IPoIB)--> Linux NAT GW <--(10GigE or 1GigE)--> Filer.

With IPoIB in connected mode on the 7.2 kernel, we get about 35 Gb/s over the OPA IPoIB link.
opal2: uname -a
Linux opal2 3.10.0-327.28.2.1chaos.ch6.x86_64 #1 SMP Wed Aug 3 15:09:48 PDT 2016 x86_64 x86_64 x86_64 GNU/Linux
opal2: iperf -c opal191 -e
------------------------------------------------------------
Client connecting to opal191, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 192.168.128.2 port 42924 connected with 192.168.128.191 port 5001
[ ID] Interval        Transfer    Bandwidth       Write/Err  Rtry    Cwnd
[  3] 0.00-10.00 sec  40.9 GBytes  35.2 Gbits/sec  1/0        11  12083K

With the 7.3 kernel we get only about 2.1 Mb/s.
ipa1: iperf -c opal191-nfs  -e
------------------------------------------------------------
Client connecting to opal191-nfs, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.112.1 port 40478 connected with 134.9.6.77 port 5001
[ ID] Interval        Transfer    Bandwidth       Write/Err  Rtry    Cwnd
[  3] 0.00-10.13 sec  2.58 MBytes  2.14 Mbits/sec  1/0       602      2K

Please note that the number of retries goes up considerably.
We know that the performance problem is not due to the slowness of the Ethernet link, because when we take the 7.3 IPoIB link out of the loop we get nearly wire speed.

This greatly impacts performance because user directories and workspaces are on these filers.

This problem was first noticed with NFS, but subsequent testing showed that the same performance problem is easily observable with iperf.
In this case, datagram mode is much faster than connected mode.
[root@ipa1:~]# echo datagram > /sys/class/net/hsi0/mode
[root@ipa1:~]# iperf -c opal191-nfs -e
------------------------------------------------------------
Client connecting to opal191-nfs, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.112.1 port 40486 connected with 134.9.6.77 port 5001
[ ID] Interval        Transfer    Bandwidth       Write/Err  Rtry    Cwnd
[  3] 0.00-10.02 sec  1.12 GBytes   958 Mbits/sec  1/0     16356    708K

The Ethernet link has a normal Ethernet MTU of 1500, while the IPoIB MTU is 64K. Forcing the IPoIB MTU down doesn't seem to help, and it doesn't seem to impact performance between two nodes on the IB fabric.

This affects both OPA and Mellanox cards with the 7.3 kernel, so we do not believe it is a driver issue.

Version-Release number of selected component (if applicable):
3.10.0-506

How reproducible:
Always.

nodes:
 opal2 3.10.0-327.28.2.1chaos
 opal191 3.10.0-327.28.2.1chaos
 ipa1 3.10.0-506.el7
 ipa13 3.10.0-506.el7


iptables NAT rule being used:

Chain POSTROUTING (policy ACCEPT 373K packets, 36M bytes)
pkts bytes target prot opt in  out source           destination
109K 6540K SNAT   all  --  any nfs 192.168.128.0/24 anywhere      to:134.9.6.77

Comment 2 Doug Ledford 2016-09-23 04:19:59 UTC
This issue exists upstream too.  The problem is that an upstream commit to the net core stack (which I didn't know had been backported to RHEL, but after this bug report I'm almost 100% certain it has) causes problems for the IPoIB stack.  The offending upstream commit is

9207f9d45b0a net: preserve IP control block during GSO segmentation

The upstream bug report is:

https://bugzilla.kernel.org/show_bug.cgi?id=111921

So far, we don't have a solution (I haven't had the time to write one up and no one else has stepped up to do the work).  My current thought is to create an IDR for SGIDs and save the IDR handle in the skb->cb area instead of the SGID itself.  Then the data stored in skb->cb by the ipoib code will be only 4 bytes instead of 20, and by our calculations we are only 6 bytes short right now after the upstream patch above, so that actually leaves 10 bytes to spare in skb->cb.
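
To make the idea concrete, here is a minimal, untested sketch.  The names (sgid_idr, ipoib_sgid_to_id, ipoib_id_to_sgid) are invented for illustration only; this was a proposal, not a patch, and the fix that was eventually merged took a different approach (see comment 77).

#include <linux/errno.h>
#include <linux/idr.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/string.h>
#include <rdma/ib_verbs.h>

static DEFINE_IDR(sgid_idr);
static DEFINE_SPINLOCK(sgid_idr_lock);

/* Map a 20-byte SGID to a small integer id that fits in skb->cb. */
static int ipoib_sgid_to_id(const union ib_gid *sgid)
{
        union ib_gid *copy;
        int id;

        copy = kmemdup(sgid, sizeof(*sgid), GFP_ATOMIC);
        if (!copy)
                return -ENOMEM;

        spin_lock(&sgid_idr_lock);
        id = idr_alloc(&sgid_idr, copy, 0, 0, GFP_ATOMIC);
        spin_unlock(&sgid_idr_lock);

        if (id < 0)
                kfree(copy);
        return id;
}

/* Recover the SGID from the 4-byte id stored in skb->cb. */
static union ib_gid *ipoib_id_to_sgid(int id)
{
        union ib_gid *sgid;

        spin_lock(&sgid_idr_lock);
        sgid = idr_find(&sgid_idr, id);
        spin_unlock(&sgid_idr_lock);
        return sgid;
}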

Comment 15 Travis Gummels 2016-09-26 17:41:16 UTC
Jim/Ben,

Kernel with possible fix for the reported defect is available here for testing:

http://people.redhat.com/tgummels/partners/.llnl-b63b38b9fe0e3b7a06169ce18c2c1ad9/

If you could test and provide feedback it would be greatly appreciated.

Thank you,

Travis

Comment 16 Travis Gummels 2016-09-26 17:50:47 UTC
Scratch what I said in Comment 15; this defect still needs to be sorted out.

Comment 36 Honggang LI 2016-10-08 04:59:18 UTC
(In reply to Ben Woodard from comment #0)
Hi, Ben

> [root@ipa1:~]# iperf -c opal191-nfs -e

Which upstream iperf release was used for this test? I tried iperf-2.0.4-3 and 3.0.12.tar.gz; neither of them supports the '-e' option. Could you please provide a URL for the iperf you used?

[root@ib2-qa-06 ~]# iperf -c 10.73.131.37  -e
iperf: invalid option -- e

Comment 66 Doug Ledford 2016-10-11 17:31:56 UTC
Upstream submission of proposed fix:

http://marc.info/?l=linux-rdma&m=147620680520525&w=2

Comment 67 Ben Woodard 2016-10-11 18:02:29 UTC
Doug or Paolo,

Our clusters are homogeneous, so interoperability between different flavors of IB is less of a concern, but I wanted to double-check that the interop problems mentioned in http://marc.info/?l=linux-rdma&m=147620743420718&w=2 won't crop up between Linux and the switches.

Comment 68 Doug Ledford 2016-10-11 18:21:24 UTC
Put the new kernel in place, but don't change any of the configured MTU settings in your network setup, and let's see what comes out.  It may be that path MTU discovery works and, as I mentioned on list, the system uses the max MTU even if you try to set it higher, so things more or less just work.  Your setup is a perfect test bed for that.  So, why don't you tell us if you have problems, and provide the answer in the upstream discussion, as I suspect it would be highly useful.

Comment 69 Ben Woodard 2016-10-11 18:41:16 UTC
Wilco, foraker is building the kernel now and getting it installed on one of our test clusters. We will then try to run it through our testing process. Results forthcoming. Thank all of you for jumping on this.

Comment 70 Jim Foraker 2016-10-11 21:25:10 UTC
Initial testing with kernel-3.10.0-510.el7.IPoIB_fix_2.x86_64.rpm has gone well.  The NFS latency issues we were seeing have disappeared, and I have been able to sustain ~940 Mbps with iperf over a NATed 1 Gbps link, and ~24 Gbps on the local (Mellanox QDR) fabric.

In terms of MTU testing, our node health scripts noticed and complained about the reduced MTU, but iperf still achieved ~24 Gbps between a test machine and one running a 3.10.0-496.el7-based kernel.  As Ben pointed out, our fabrics are generally _very_ homogeneous, and the Ethernet NATs involve a large MTU change regardless, so we may not be particularly good at ferreting out subtle MTU change issues.

Our more modern (OPA, 10GigE) test hardware is currently down for critical hardware maintenance.  Once it's back up, I'll get the patch on there and rerun the iperf tests to see if we can get 10Gbps line speed.  More importantly, those clusters see real (developer) use and are routinely stressed, so it will be a much more thorough test of the changes.

Comment 71 Ben Woodard 2016-10-12 22:10:37 UTC
Due to power work, v1 of the patch didn't quite make it all the way to the OPA test cluster before v2 came out.

V2, like v1, seems to solve the problem on the mlx4 test cluster. However, on OPA we are seeing a huge number of RNR retry failures, and connected mode seems to be dropping. Interestingly, setting the MTU down to 65000 seems to clear up the problem. So this fix may have tickled a latent problem in the OPA hfi1 driver, or there may be some reason it doesn't work as well on OPA. Can you test on OPA nodes to see if you can reproduce this? Without understanding the nature of the problem, I'm not sure whether it lies in the driver or in the fix.

The majority of the affected clusters are running OPA.

Comment 75 Ben Woodard 2016-10-12 23:38:44 UTC
OK we've got a handle on this now.

The problem with OPA appears to be a configuration issue combined with a newly discovered driver bug. The problem we were seeing is not related to this IPoIB issue. Sorry for the confusion. We knew that you were anxious for results and so we were trying to get you feedback ASAP rather than doing our normal careful analysis. You can consider the OPA problem a red herring.

Comment 76 Don Dutile (Red Hat) 2016-10-13 01:51:57 UTC
(In reply to Ben Woodard from comment #75)
> OK we've got a handle on this now.
> 
> The problem with OPA appears to be a configuration issue combined with a
> newly discovered driver bug. The problem we were seeing is not related to
> this IPoIB issue. Sorry for the confusion. We knew that you were anxious for
> results and so we were trying to get you feedback ASAP rather than doing our
> normal careful analysis. You can consider the OPA problem a red herring.

Thanks. Doug expects the patch will land in one of his bug-fix/-next branches,
at which point I can backport it to 7.4 and mark it for 7.3 z-stream, hoping to make next Tuesday's kernel freeze for the first 7.3 z-stream release.

Comment 77 Travis Gummels 2016-10-18 18:26:12 UTC
IB/ipoib: move back IB LL address into the hard header
https://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=fc791b6335152c5278dc4a4991bcb2d329f806f9

[PATCH v3] IB/ipoib: move back IB LL address into the hard header
https://www.spinics.net/lists/linux-rdma/msg41712.html
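
For reference, the core of the merged change is small: instead of caching the 20-byte destination link-layer address in skb->cb, where GSO segmentation can overwrite it, ipoib_hard_header() pushes the address into the skb data itself as a pseudo header. A simplified sketch of the approach follows (details trimmed; see the commit above for the actual code):

#include <linux/if_infiniband.h>        /* INFINIBAND_ALEN (20 bytes) */
#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct ipoib_header {
        __be16 proto;
        u16    reserved;
};

/* Carries the LL address in front of the encap header; never transmitted. */
struct ipoib_pseudo_header {
        u8 hwaddr[INFINIBAND_ALEN];
};

#define IPOIB_HARD_LEN (sizeof(struct ipoib_header) + \
                        sizeof(struct ipoib_pseudo_header))

static int ipoib_hard_header(struct sk_buff *skb, struct net_device *dev,
                             unsigned short type, const void *daddr,
                             const void *saddr, unsigned int len)
{
        struct ipoib_header *header;
        struct ipoib_pseudo_header *phdr;

        header = (struct ipoib_header *)skb_push(skb, sizeof(*header));
        header->proto = htons(type);
        header->reserved = 0;

        /*
         * Stuff the destination address into the hard header itself
         * instead of skb->cb, so GSO segmentation cannot clobber it.
         */
        phdr = (struct ipoib_pseudo_header *)skb_push(skb, sizeof(*phdr));
        memcpy(phdr->hwaddr, daddr, INFINIBAND_ALEN);

        return IPOIB_HARD_LEN;
}

On transmit, the xmit path reads the address back from skb->data and removes the pseudo header with skb_pull() before handing the packet to the HCA, so the extra bytes never go on the wire.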

Comment 84 Jim Foraker 2016-10-20 22:12:03 UTC
Both NFS and iperf testing have been successful for us with the kernel-3.10.0-514.el7.test kernel.  Additionally, we've been running a -510 kernel plus the upstream v2 commit on our 192-node OPA QA cluster for several days now with great results.

Comment 85 Rafael Aquini 2016-10-31 15:51:29 UTC
Patch(es) committed to the kernel repository; an interim kernel build is undergoing testing.

Comment 87 Rafael Aquini 2016-11-01 14:30:54 UTC
Patch(es) available on kernel-3.10.0-516.el7

Comment 90 Ben Woodard 2016-12-05 23:36:25 UTC
Customer explicitly requests that this bug be made publicly visible.

Comment 91 zguo 2016-12-27 10:31:18 UTC
1) The issue was reproduced on 3.10.0-514.el7.x86_64 on rdma-dev-[02,10,11] with the setup described in
https://bugzilla.redhat.com/show_bug.cgi?id=1378656#c42
https://bugzilla.redhat.com/show_bug.cgi?id=1378656#c46

[root@rdma-dev-11 ~]$ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 50534 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  2.45 MBytes  2.06 Mbits/sec
[root@rdma-dev-11 ~]$ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 50540 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-15.0 sec  2.51 MBytes  1.41 Mbits/sec
[root@rdma-dev-11 ~]$ uname -r
3.10.0-514.el7.x86_64

2) The issue is gone on the fixed kernel.

[root@rdma-dev-02 network-scripts]$ sh 2-setup-iperf-server.sh 
+ hostname
rdma-dev-02
+ uname -r
3.10.0-516.el7.x86_64
+ ip addr show
+ grep -w inet
    inet 127.0.0.1/8 scope host lo
    inet 10.16.45.168/21 brd 10.16.47.255 scope global dynamic lom_1
    inet 172.31.0.32/24 brd 172.31.0.255 scope global dynamic mlx5_ib0
+ ip route
default via 10.16.47.254 dev lom_1  proto static  metric 100 
10.16.40.0/21 dev lom_1  proto kernel  scope link  src 10.16.45.168  metric 100 
172.31.0.0/24 dev mlx5_ib0  proto kernel  scope link  src 172.31.0.32  metric 150 
224.0.0.0/4 dev mlx5_ib0  scope link 
+ route add default gw 172.31.0.40 mlx5_ib0
+ ip route
default via 172.31.0.40 dev mlx5_ib0 
default via 10.16.47.254 dev lom_1  proto static  metric 100 
10.16.40.0/21 dev lom_1  proto kernel  scope link  src 10.16.45.168  metric 100 
172.31.0.0/24 dev mlx5_ib0  proto kernel  scope link  src 172.31.0.32  metric 150 
224.0.0.0/4 dev mlx5_ib0  scope link 
+ iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39848
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  1.48 GBytes  1.27 Gbits/sec
[  5] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39850
[  5]  0.0-10.0 sec  1.66 GBytes  1.42 Gbits/sec
[  4] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39852
[  4]  0.0-10.2 sec  1.82 GBytes  1.54 Gbits/sec
[  5] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39854
[  5]  0.0-10.0 sec  1.55 GBytes  1.33 Gbits/sec
[  4] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39856
[  4]  0.0-10.0 sec  1.78 GBytes  1.52 Gbits/sec
[  5] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39858
[  5]  0.0-10.2 sec  1.70 GBytes  1.42 Gbits/sec
[  4] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39860
[  4]  0.0-10.0 sec  1.81 GBytes  1.56 Gbits/sec
[  5] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39862
[  5]  0.0-10.0 sec  1.76 GBytes  1.51 Gbits/sec
[  4] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39864
[  4]  0.0-10.1 sec  1.56 GBytes  1.33 Gbits/sec
[  5] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39866
[  5]  0.0-10.2 sec  1.32 GBytes  1.12 Gbits/sec

[root@rdma-dev-11 network-scripts]$ sh 3-setup-iperf-client.sh 
+ hostname
rdma-dev-11
+ uname -r
3.10.0-516.el7.x86_64
+ ip addr show
+ grep -w inet
    inet 127.0.0.1/8 scope host lo
    inet 10.16.45.208/21 brd 10.16.47.255 scope global dynamic lom_1
    inet 172.31.1.41/24 brd 172.31.1.255 scope global dynamic mlx4_ib1
+ ip route
default via 10.16.47.254 dev lom_1  proto static  metric 100 
10.16.40.0/21 dev lom_1  proto kernel  scope link  src 10.16.45.208  metric 100 
172.31.1.0/24 dev mlx4_ib1  proto kernel  scope link  src 172.31.1.41  metric 150 
+ route add default gw 172.31.1.40 mlx4_ib1
+ ip route
default via 172.31.1.40 dev mlx4_ib1 
default via 10.16.47.254 dev lom_1  proto static  metric 100 
10.16.40.0/21 dev lom_1  proto kernel  scope link  src 10.16.45.208  metric 100 
172.31.1.0/24 dev mlx4_ib1  proto kernel  scope link  src 172.31.1.41  metric 150 
+ ping -c 3 172.31.0.32
PING 172.31.0.32 (172.31.0.32) 56(84) bytes of data.
64 bytes from 172.31.0.32: icmp_seq=2 ttl=63 time=2.53 ms
64 bytes from 172.31.0.32: icmp_seq=3 ttl=63 time=0.129 ms

--- 172.31.0.32 ping statistics ---
3 packets transmitted, 2 received, 33% packet loss, time 2001ms
rtt min/avg/max/mdev = 0.129/1.331/2.533/1.202 ms
++ seq -w 1 10
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39848 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.48 GBytes  1.27 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39850 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.66 GBytes  1.42 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39852 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.2 sec  1.82 GBytes  1.54 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39854 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.55 GBytes  1.33 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39856 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.78 GBytes  1.53 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39858 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.2 sec  1.70 GBytes  1.42 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39860 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.81 GBytes  1.56 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39862 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.76 GBytes  1.51 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39864 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.1 sec  1.56 GBytes  1.33 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39866 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.2 sec  1.32 GBytes  1.12 Gbits/sec

Comment 94 errata-xmlrpc 2017-08-02 01:46:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1842
