Red Hat Bugzilla – Bug 1378656
[LLNL 7.4 Bug] Serious Performance regression with NATed IPoIB connected mode
Last modified: 2017-08-01 22:20:27 EDT
Description of problem:

The general network topology is:

Client <--(IPoIB)--> Linux NAT GW <--(10GigE or 1GigE)--> Filer

With IPoIB in connected mode using the 7.2 kernel we get about 35Gb/s over the OPA IPoIB link.

opal2: uname -a
Linux opal2 3.10.0-327.28.2.1chaos.ch6.x86_64 #1 SMP Wed Aug 3 15:09:48 PDT 2016 x86_64 x86_64 x86_64 GNU/Linux

opal2: iperf -c opal191 -e
------------------------------------------------------------
Client connecting to opal191, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 192.168.128.2 port 42924 connected with 192.168.128.191 port 5001
[ ID] Interval        Transfer     Bandwidth       Write/Err  Rtry   Cwnd
[  3] 0.00-10.00 sec  40.9 GBytes  35.2 Gbits/sec  1/0        11     12083K

With the 7.3 kernel we only get about 2.1Mb/s.

ipa1: iperf -c opal191-nfs -e
------------------------------------------------------------
Client connecting to opal191-nfs, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.112.1 port 40478 connected with 134.9.6.77 port 5001
[ ID] Interval        Transfer     Bandwidth       Write/Err  Rtry   Cwnd
[  3] 0.00-10.13 sec  2.58 MBytes  2.14 Mbits/sec  1/0        602    2K

Please note that the number of retries goes up considerably.

We know that the performance problem is not due to the slowness of the ethernet link, because when we take the 7.3 IPoIB link out of the loop we get nearly wire speed.

This greatly impacts performance because user directories and workspaces are on these filers. This problem was first noticed with NFS, but subsequent testing showed that the same performance problem was easily observable with iperf.

In this case datagram mode is much faster than connected mode.

[root@ipa1:~]# echo datagram > /sys/class/net/hsi0/mode
[root@ipa1:~]# iperf -c opal191-nfs -e
------------------------------------------------------------
Client connecting to opal191-nfs, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.112.1 port 40486 connected with 134.9.6.77 port 5001
[ ID] Interval        Transfer     Bandwidth       Write/Err  Rtry   Cwnd
[  3] 0.00-10.02 sec  1.12 GBytes  958 Mbits/sec   1/0        16356  708K

The ethernet link has a normal Ethernet MTU of 1500, while the MTU of IPoIB is 64K. Forcing the MTU of the IPoIB link down doesn't seem to help. The problem doesn't seem to impact performance between two nodes on the IB fabric.

This affects both OPA and Mellanox cards with the 7.3 kernel, so we do not believe that it is a driver issue.

Version-Release number of selected component (if applicable): 3.10.0-506

How reproducible: Always.

nodes:
opal2    3.10.0-327.28.2.1chaos
opal191  3.10.0-327.28.2.1chaos
ipa1     3.10.0-506.el7
ipa13    3.10.0-506.el7

iptables NAT rule being used:

Chain POSTROUTING (policy ACCEPT 373K packets, 36M bytes)
 pkts bytes target  prot opt in   out  source            destination
 109K 6540K SNAT    all  --  any  nfs  192.168.128.0/24  anywhere     to:134.9.6.77
This issue exists upstream too. The problem is that an upstream commit to the core networking stack (which I didn't know had been backported to RHEL, but after this bug report I'm almost 100% certain it has) causes problems for the IPoIB stack. The offending upstream commit is:

9207f9d45b0a ("net: preserve IP control block during GSO segmentation")

The upstream bug report is: https://bugzilla.kernel.org/show_bug.cgi?id=111921

So far we don't have a solution (I haven't had the time to write one up and no one else has stepped up to do the work). The current thought bouncing around in my head is to create an idr for sgids and save the idr index for the sgid in the skb->cb area instead of the sgid itself. Then the data stored in skb->cb by the ipoib code would only be 4 bytes instead of 20 bytes, and by our calculations we are only 6 bytes short right now after the upstream patch above, so that would actually leave 10 bytes to spare in skb->cb.
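For illustration only, a minimal sketch of that idea using the kernel idr API (DEFINE_IDR/idr_alloc/idr_find). The names and helpers below are hypothetical, not from the driver, and this is not the fix that eventually went upstream; a real version would also have to free the idr entry once the skb is consumed.

#include <linux/idr.h>
#include <linux/skbuff.h>
#include <linux/spinlock.h>
#include <rdma/ib_verbs.h>   /* union ib_gid */

static DEFINE_IDR(ipoib_gid_idr);            /* hypothetical */
static DEFINE_SPINLOCK(ipoib_gid_idr_lock);  /* hypothetical */

/* Map a 20-byte sgid to a small integer and stash only that index in skb->cb. */
static int ipoib_cb_store_sgid(struct sk_buff *skb, union ib_gid *sgid)
{
	int id;

	spin_lock(&ipoib_gid_idr_lock);
	id = idr_alloc(&ipoib_gid_idr, sgid, 0, 0, GFP_ATOMIC);
	spin_unlock(&ipoib_gid_idr_lock);
	if (id < 0)
		return id;

	/* 4 bytes in skb->cb instead of sizeof(union ib_gid) == 20 bytes */
	*(int *)skb->cb = id;
	return 0;
}

/* Recover the sgid on the transmit path; entry must be freed when the skb is done. */
static union ib_gid *ipoib_cb_get_sgid(struct sk_buff *skb)
{
	return idr_find(&ipoib_gid_idr, *(int *)skb->cb);
}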
Jim/Ben,

A kernel with a possible fix for the reported defect is available here for testing:

http://people.redhat.com/tgummels/partners/.llnl-b63b38b9fe0e3b7a06169ce18c2c1ad9/

If you could test and provide feedback it would be greatly appreciated.

Thank you,
Travis
Scratch what I said in Comment 15; this defect still needs to be sorted out.
(In reply to Ben Woodard from comment #0)

Hi, Ben

> [root@ipa1:~]# iperf -c opal191-nfs -e

Which upstream iperf release was used for this test? I tried iperf-2.0.4-3 and 3.0.12.tar.gz; neither of them supports the '-e' option. Could you please provide a URL link to that iperf for me?

[root@ib2-qa-06 ~]# iperf -c 10.73.131.37 -e
iperf: invalid option -- e
Upstream submission of proposed fix: http://marc.info/?l=linux-rdma&m=147620680520525&w=2
Doug or Paolo,

Our clusters are homogeneous, so interoperability between different flavors of IB is less of a concern, but I wanted to double check that the interop problems mentioned in http://marc.info/?l=linux-rdma&m=147620743420718&w=2 won't crop up between Linux and the switches.
Put the new kernel in place, but don't change any of your configured MTU settings in your network setups and let's see what comes out. It may be that the path MTU discovery works, and as I mentioned on list, the system uses the max MTU even if you try to set it higher, and things more or less just work. Your setup is a perfect test bed for that. So, why don't you tell us if you have problems, and provide the answer on the upstream discussion as I suspect it would be highly useful.
Wilco, foraker is building the kernel now and getting it installed on one of our test clusters. We will then try to run it through our testing process. Results forthcoming. Thank all of you for jumping on this.
Initial testing with kernel-3.10.0-510.el7.IPoIB_fix_2.x86_64.rpm has gone well. The NFS latency issues we were seeing have disappeared, and I have been able to sustain ~940Mbps with iperf over a NATted 1Gbps link, and ~24Gbps on the local (Mellanox QDR) fabric.

In terms of MTU testing, our node health scripts noticed and complained about the reduced MTU, but iperf still achieved ~24Gbps between a test machine and one running a 3.10.0-496.el7-based kernel. As Ben pointed out, our fabrics are generally _very_ homogeneous, and the ethernet NATs involve a large MTU change regardless, so we may not be particularly good at ferreting out subtle MTU change issues.

Our more modern (OPA, 10GigE) test hardware is currently down for critical hardware maintenance. Once it's back up, I'll get the patch on there and rerun the iperf tests to see if we can get 10Gbps line speed. More importantly, those clusters see real (developer) use and are routinely stressed, so it will be a much more thorough test of the changes.
V1 of the patch didn't quite make it all the way to the OPA test cluster before v2 came out, due to power work. V2, like V1, seems to solve the problem on the MLX4 test cluster. However, on OPA we seem to be getting a huge number of RNR retry failures and CM seems to be dropping. Interestingly, setting the MTU down to 65000 seems to clear up the problem. So this fix may have tickled a latent problem in the OPA hfi1 driver, or maybe there is some reason why this doesn't work as well on OPA. Can you test on OPA nodes to see if you can reproduce this? Without understanding the nature of this problem, I'm not sure if this is a problem with the driver or the fix. The majority of the affected clusters are running OPA.
OK we've got a handle on this now. The problem with OPA appears to be a configuration issue combined with a newly discovered driver bug. The problem we were seeing is not related to this IPoIB issue. Sorry for the confusion. We knew that you were anxious for results and so we were trying to get you feedback ASAP rather than doing our normal careful analysis. You can consider the OPA problem a red herring.
(In reply to Ben Woodard from comment #75)
> OK we've got a handle on this now.
>
> The problem with OPA appears to be a configuration issue combined with a
> newly discovered driver bug. The problem we were seeing is not related to
> this IPoIB issue. Sorry for the confusion. We knew that you were anxious for
> results and so we were trying to get you feedback ASAP rather than doing our
> normal careful analysis. You can consider the OPA problem a red herring.

Thanks. Doug expects the patch will land in one of his bug-fix/next branches, at which point I can backport it to 7.4 and mark it for 7.3 z-stream, hoping to make next Tuesday's kernel freeze for the first 7.3 z-stream release.
IB/ipoib: move back IB LL address into the hard header
https://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=fc791b6335152c5278dc4a4991bcb2d329f806f9

[PATCH v3] IB/ipoib: move back IB LL address into the hard header
https://www.spinics.net/lists/linux-rdma/msg41712.html
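For context, a condensed sketch of what the upstream fix does (based on commit fc791b633515, but not the verbatim diff): instead of parking the 20-byte destination link-layer address in skb->cb, where GSO segmentation in the core stack can clobber it, ipoib_hard_header() pushes it onto the packet as a pseudo header, and the transmit path pulls it back off before the packet goes on the wire.

#include <linux/types.h>
#include <linux/string.h>
#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/if_infiniband.h>   /* INFINIBAND_ALEN == 20 */

struct ipoib_header {              /* 4-byte IPoIB encapsulation header */
	__be16 proto;
	u16    reserved;
};

struct ipoib_pseudo_header {       /* carries the IB LL address with the skb */
	u8 hwaddr[INFINIBAND_ALEN];
};

#define IPOIB_HARD_LEN (sizeof(struct ipoib_header) + \
			sizeof(struct ipoib_pseudo_header))

static inline void push_pseudo_header(struct sk_buff *skb, const char *daddr)
{
	struct ipoib_pseudo_header *phdr;

	phdr = (struct ipoib_pseudo_header *)skb_push(skb, sizeof(*phdr));
	memcpy(phdr->hwaddr, daddr, INFINIBAND_ALEN);
}

/* header_ops->create: build the encapsulation header plus the pseudo header */
static int ipoib_hard_header(struct sk_buff *skb, struct net_device *dev,
			     unsigned short type, const void *daddr,
			     const void *saddr, unsigned int len)
{
	struct ipoib_header *header;

	header = (struct ipoib_header *)skb_push(skb, sizeof(*header));
	header->proto = htons(type);
	header->reserved = 0;

	/*
	 * Always stuff the destination address into the hard header so the
	 * transmit path can figure out where to send the packet later,
	 * independent of what happens to skb->cb in the meantime.
	 */
	push_pseudo_header(skb, daddr);

	return IPOIB_HARD_LEN;
}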
Both NFS and iperf testing has been successful for us with the kernel-3.10.0-514.el7.test kernel. Additionally, we've been running a -510 kernel plus the upstream v2 commit on our 192 node OPA QA cluster for several days now with great results.
Patch(es) committed on kernel repository and an interim kernel build is undergoing testing
Patch(es) available on kernel-3.10.0-516.el7
Customer explicitly requests that this bug be made publicly visible.
1) Issue was reproduced on 3.10.0-514.el7.x86_64 on rdma-dev-[02,10,11] with the set-up described in:
https://bugzilla.redhat.com/show_bug.cgi?id=1378656#c42
https://bugzilla.redhat.com/show_bug.cgi?id=1378656#c46

[root@rdma-dev-11 ~]$ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 50534 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  2.45 MBytes  2.06 Mbits/sec
[root@rdma-dev-11 ~]$ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 50540 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-15.0 sec  2.51 MBytes  1.41 Mbits/sec
[root@rdma-dev-11 ~]$ uname -r
3.10.0-514.el7.x86_64

2) Issue is gone on the fixed kernel.

[root@rdma-dev-02 network-scripts]$ sh 2-setup-iperf-server.sh
+ hostname
rdma-dev-02
+ uname -r
3.10.0-516.el7.x86_64
+ ip addr show
+ grep -w inet
    inet 127.0.0.1/8 scope host lo
    inet 10.16.45.168/21 brd 10.16.47.255 scope global dynamic lom_1
    inet 172.31.0.32/24 brd 172.31.0.255 scope global dynamic mlx5_ib0
+ ip route
default via 10.16.47.254 dev lom_1 proto static metric 100
10.16.40.0/21 dev lom_1 proto kernel scope link src 10.16.45.168 metric 100
172.31.0.0/24 dev mlx5_ib0 proto kernel scope link src 172.31.0.32 metric 150
224.0.0.0/4 dev mlx5_ib0 scope link
+ route add default gw 172.31.0.40 mlx5_ib0
+ ip route
default via 172.31.0.40 dev mlx5_ib0
default via 10.16.47.254 dev lom_1 proto static metric 100
10.16.40.0/21 dev lom_1 proto kernel scope link src 10.16.45.168 metric 100
172.31.0.0/24 dev mlx5_ib0 proto kernel scope link src 172.31.0.32 metric 150
224.0.0.0/4 dev mlx5_ib0 scope link
+ iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39848
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  1.48 GBytes  1.27 Gbits/sec
[  5] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39850
[  5]  0.0-10.0 sec  1.66 GBytes  1.42 Gbits/sec
[  4] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39852
[  4]  0.0-10.2 sec  1.82 GBytes  1.54 Gbits/sec
[  5] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39854
[  5]  0.0-10.0 sec  1.55 GBytes  1.33 Gbits/sec
[  4] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39856
[  4]  0.0-10.0 sec  1.78 GBytes  1.52 Gbits/sec
[  5] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39858
[  5]  0.0-10.2 sec  1.70 GBytes  1.42 Gbits/sec
[  4] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39860
[  4]  0.0-10.0 sec  1.81 GBytes  1.56 Gbits/sec
[  5] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39862
[  5]  0.0-10.0 sec  1.76 GBytes  1.51 Gbits/sec
[  4] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39864
[  4]  0.0-10.1 sec  1.56 GBytes  1.33 Gbits/sec
[  5] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39866
[  5]  0.0-10.2 sec  1.32 GBytes  1.12 Gbits/sec

[root@rdma-dev-11 network-scripts]$ sh 3-setup-iperf-client.sh
+ hostname
rdma-dev-11
+ uname -r
3.10.0-516.el7.x86_64
+ ip addr show
+ grep -w inet
    inet 127.0.0.1/8 scope host lo
    inet 10.16.45.208/21 brd 10.16.47.255 scope global dynamic lom_1
    inet 172.31.1.41/24 brd 172.31.1.255 scope global dynamic mlx4_ib1
+ ip route
default via 10.16.47.254 dev lom_1 proto static metric 100
10.16.40.0/21 dev lom_1 proto kernel scope link src 10.16.45.208 metric 100
172.31.1.0/24 dev mlx4_ib1 proto kernel scope link src 172.31.1.41 metric 150
+ route add default gw 172.31.1.40 mlx4_ib1
+ ip route
default via 172.31.1.40 dev mlx4_ib1
default via 10.16.47.254 dev lom_1 proto static metric 100
10.16.40.0/21 dev lom_1 proto kernel scope link src 10.16.45.208 metric 100
172.31.1.0/24 dev mlx4_ib1 proto kernel scope link src 172.31.1.41 metric 150
+ ping -c 3 172.31.0.32
PING 172.31.0.32 (172.31.0.32) 56(84) bytes of data.
64 bytes from 172.31.0.32: icmp_seq=2 ttl=63 time=2.53 ms
64 bytes from 172.31.0.32: icmp_seq=3 ttl=63 time=0.129 ms

--- 172.31.0.32 ping statistics ---
3 packets transmitted, 2 received, 33% packet loss, time 2001ms
rtt min/avg/max/mdev = 0.129/1.331/2.533/1.202 ms
++ seq -w 1 10
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39848 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.48 GBytes  1.27 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39850 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.66 GBytes  1.42 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39852 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.2 sec  1.82 GBytes  1.54 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39854 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.55 GBytes  1.33 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39856 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.78 GBytes  1.53 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39858 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.2 sec  1.70 GBytes  1.42 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39860 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.81 GBytes  1.56 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39862 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.76 GBytes  1.51 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39864 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.1 sec  1.56 GBytes  1.33 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39866 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.2 sec  1.32 GBytes  1.12 Gbits/sec
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:1842