Bug 1378656
Summary: [LLNL 7.4 Bug] Serious performance regression with NATed IPoIB connected mode
Product: Red Hat Enterprise Linux 7
Component: kernel
Kernel sub component: Infiniband
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Version: 7.3
Target Milestone: rc
Target Release: 7.4
Hardware: x86_64
OS: Linux
Keywords: ZStream
Reporter: Ben Woodard <woodard>
Assignee: Jonathan Toppins <jtoppins>
QA Contact: zguo <zguo>
CC: bhu, cascardo, cyates, ddutile, dhoward, foraker1, hartsjc, honli, infiniband-qe, ivecera, jarod, jmcnicol, jshortt, jtoppins, lmiksik, mstowell, pabeni, rdma-dev-team, snagar, tgummels, woodard, yizhan, zguo
Fixed In Version: kernel-3.10.0-516.el7
Doc Type: Bug Fix
Doc Text:
Cause: A change to the GSO control block conflicts with the IPoIB control block, causing the IPoIB address information cached in the control block to be overwritten.
Consequence: The overwritten cached data results in a significant performance degradation on IPoIB fabrics.
Fix: Move the IPoIB address information to another section of the socket buffer, preventing the overwrite.
Result: Original performance is restored.
Cloned to: 1390668 (view as bug list)
Last Closed: 2017-08-02 01:46:05 UTC
Type: Bug
Bug Blocks: 1298243, 1353018, 1381646, 1390668, 1446211
Description
Ben Woodard
2016-09-23 01:42:43 UTC
This issue exists upstream too. The problem is that an upstream commit to the net core stack (which I didn't know had been backported to RHEL, but after this bug report I'm almost 100% certain it has) causes problems for the IPoIB stack. The offending upstream commit is:

9207f9d45b0a net: preserve IP control block during GSO segmentation

The upstream bug report is: https://bugzilla.kernel.org/show_bug.cgi?id=111921

So far, we don't have a solution (I haven't had the time to write one up and no one else has stepped up to do the work). My current thought is to create an idr for sgids and save the idr id for the sgid in the skb->cb area instead of the sgid itself. The data stored in skb->cb by the ipoib code would then be only 4 bytes instead of 20 bytes; by our calculations we are only 6 bytes short right now after the upstream patch above, so that actually leaves 10 bytes to spare in skb->cb.

Jim/Ben,

A kernel with a possible fix for the reported defect is available here for testing:
http://people.redhat.com/tgummels/partners/.llnl-b63b38b9fe0e3b7a06169ce18c2c1ad9/

If you could test and provide feedback, it would be greatly appreciated.

Thank you,
Travis

Scratch what I said in Comment 15; this defect still needs to be sorted out.

(In reply to Ben Woodard from comment #0)

Hi, Ben

> [root@ipa1:~]# iperf -c opal191-nfs -e

Which upstream iperf release was used for this test? I tried iperf-2.0.4-3 and 3.0.12.tar.gz; neither of them supports the '-e' option. Could you please provide the URL link to iperf for me?
[root@ib2-qa-06 ~]# iperf -c 10.73.131.37 -e
iperf: invalid option -- e

Upstream submission of proposed fix:
http://marc.info/?l=linux-rdma&m=147620680520525&w=2

Doug or Paolo,

Our clusters are homogeneous, so interoperability between different flavors of IB is less of a concern, but I wanted to double check that the interop problems mentioned in http://marc.info/?l=linux-rdma&m=147620743420718&w=2 won't crop up between Linux and the switches.

Put the new kernel in place, but don't change any of your configured MTU settings in your network setups, and let's see what comes out. It may be that the path MTU discovery works, and as I mentioned on list, the system uses the max MTU even if you try to set it higher, and things more or less just work. Your setup is a perfect test bed for that. So, why don't you tell us if you have problems, and provide the answer on the upstream discussion, as I suspect it would be highly useful.

Wilco. foraker is building the kernel now and getting it installed on one of our test clusters. We will then try to run it through our testing process. Results forthcoming. Thank all of you for jumping on this.

Initial testing with kernel-3.10.0-510.el7.IPoIB_fix_2.x86_64.rpm has gone well. The NFS latency issues we were seeing have disappeared, and I have been able to sustain ~940Mbps with iperf over a NATted 1Gbps link, and ~24Gbps on the local (Mellanox QDR) fabric. In terms of MTU testing, our node health scripts noticed and complained about the reduced MTU, but iperf still achieved ~24Gbps between a test machine and one running a 3.10.0-496.el7-based kernel. As Ben pointed out, our fabrics are generally _very_ homogeneous, and the ethernet NATs involve a large MTU change regardless, so we may not be particularly good at ferreting out subtle MTU change issues.

Our more modern (OPA, 10GigE) test hardware is currently down for critical hardware maintenance.
Once it's back up, I'll get the patch on there and rerun the iperf tests to see if we can get 10Gbps line speed. More importantly, those clusters see real (developer) use and are routinely stressed, so it will be a much more thorough test of the changes.

V1 of the patch didn't quite make it all the way to the OPA test cluster before v2 came out, due to power work. V2, like V1, seems to solve the problem on the MLX4 test cluster. However, on OPA we seem to be getting a huge number of RNR retry failures, and CM seems to be dropping. Interestingly, setting the MTU down to 65000 seems to clear up the problem. So this fix may have tickled a latent problem in the OPA hfi1 driver, or maybe there is some reason why this doesn't work as well on OPA.

Can you test on OPA nodes to see if you can reproduce this? Without understanding the nature of this problem, I'm not sure if this is a problem with the driver or the fix. The majority of the affected clusters are running OPA.

OK, we've got a handle on this now. The problem with OPA appears to be a configuration issue combined with a newly discovered driver bug. The problem we were seeing is not related to this IPoIB issue. Sorry for the confusion. We knew that you were anxious for results, so we were trying to get you feedback ASAP rather than doing our normal careful analysis. You can consider the OPA problem a red herring.

(In reply to Ben Woodard from comment #75)
> OK we've got a handle on this now.
>
> The problem with OPA appears to be a configuration issue combined with a
> newly discovered driver bug. The problem we were seeing is not related to
> this IPoIB issue. Sorry for the confusion. We knew that you were anxious for
> results and so we were trying to get you feedback ASAP rather than doing our
> normal careful analysis. You can consider the OPA problem a red herring.

Thanks.
Doug expects the patch will be in one of his bug-fix/next branches, at which point I can backport it to 7.4 and mark it for 7.3 z-stream, hoping to make next Tuesday's kernel freeze date for the first 7.3 z-stream release.

IB/ipoib: move back IB LL address into the hard header
https://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=fc791b6335152c5278dc4a4991bcb2d329f806f9

[PATCH v3] IB/ipoib: move back IB LL address into the hard header
https://www.spinics.net/lists/linux-rdma/msg41712.html

Both NFS and iperf testing have been successful for us with the kernel-3.10.0-514.el7.test kernel. Additionally, we've been running a -510 kernel plus the upstream v2 commit on our 192-node OPA QA cluster for several days now with great results.

Patch(es) committed on kernel repository and an interim kernel build is undergoing testing.

Patch(es) available on kernel-3.10.0-516.el7.

Customer explicitly requests that this bug be made publicly visible.

1) The issue had been reproduced on 3.10.0-514.el7.x86_64 on rdma-dev-[02,10,11] with the set-up described in:
https://bugzilla.redhat.com/show_bug.cgi?id=1378656#c42
https://bugzilla.redhat.com/show_bug.cgi?id=1378656#c46

[root@rdma-dev-11 ~]$ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 50534 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  2.45 MBytes  2.06 Mbits/sec

[root@rdma-dev-11 ~]$ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 50540 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-15.0 sec  2.51 MBytes  1.41 Mbits/sec
[root@rdma-dev-11 ~]$ uname -r
3.10.0-514.el7.x86_64

2) The issue is gone on the fixed kernel.

[root@rdma-dev-02 network-scripts]$ sh 2-setup-iperf-server.sh
+ hostname
rdma-dev-02
+ uname -r
3.10.0-516.el7.x86_64
+ ip addr show
+ grep -w inet
    inet 127.0.0.1/8 scope host lo
    inet 10.16.45.168/21 brd 10.16.47.255 scope global dynamic lom_1
    inet 172.31.0.32/24 brd 172.31.0.255 scope global dynamic mlx5_ib0
+ ip route
default via 10.16.47.254 dev lom_1 proto static metric 100
10.16.40.0/21 dev lom_1 proto kernel scope link src 10.16.45.168 metric 100
172.31.0.0/24 dev mlx5_ib0 proto kernel scope link src 172.31.0.32 metric 150
224.0.0.0/4 dev mlx5_ib0 scope link
+ route add default gw 172.31.0.40 mlx5_ib0
+ ip route
default via 172.31.0.40 dev mlx5_ib0
default via 10.16.47.254 dev lom_1 proto static metric 100
10.16.40.0/21 dev lom_1 proto kernel scope link src 10.16.45.168 metric 100
172.31.0.0/24 dev mlx5_ib0 proto kernel scope link src 172.31.0.32 metric 150
224.0.0.0/4 dev mlx5_ib0 scope link
+ iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39848
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  1.48 GBytes  1.27 Gbits/sec
[  5] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39850
[  5]  0.0-10.0 sec  1.66 GBytes  1.42 Gbits/sec
[  4] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39852
[  4]  0.0-10.2 sec  1.82 GBytes  1.54 Gbits/sec
[  5] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39854
[  5]  0.0-10.0 sec  1.55 GBytes  1.33 Gbits/sec
[  4] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39856
[  4]  0.0-10.0 sec  1.78 GBytes  1.52 Gbits/sec
[  5] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39858
[  5]  0.0-10.2 sec  1.70 GBytes  1.42 Gbits/sec
[  4] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39860
[  4]  0.0-10.0 sec  1.81 GBytes  1.56 Gbits/sec
[  5] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39862
[  5]  0.0-10.0 sec  1.76 GBytes  1.51 Gbits/sec
[  4] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39864
[  4]  0.0-10.1 sec  1.56 GBytes  1.33 Gbits/sec
[  5] local 172.31.0.32 port 5001 connected with 172.31.0.40 port 39866
[  5]  0.0-10.2 sec  1.32 GBytes  1.12 Gbits/sec

[root@rdma-dev-11 network-scripts]$ sh 3-setup-iperf-client.sh
+ hostname
rdma-dev-11
+ uname -r
3.10.0-516.el7.x86_64
+ ip addr show
+ grep -w inet
    inet 127.0.0.1/8 scope host lo
    inet 10.16.45.208/21 brd 10.16.47.255 scope global dynamic lom_1
    inet 172.31.1.41/24 brd 172.31.1.255 scope global dynamic mlx4_ib1
+ ip route
default via 10.16.47.254 dev lom_1 proto static metric 100
10.16.40.0/21 dev lom_1 proto kernel scope link src 10.16.45.208 metric 100
172.31.1.0/24 dev mlx4_ib1 proto kernel scope link src 172.31.1.41 metric 150
+ route add default gw 172.31.1.40 mlx4_ib1
+ ip route
default via 172.31.1.40 dev mlx4_ib1
default via 10.16.47.254 dev lom_1 proto static metric 100
10.16.40.0/21 dev lom_1 proto kernel scope link src 10.16.45.208 metric 100
172.31.1.0/24 dev mlx4_ib1 proto kernel scope link src 172.31.1.41 metric 150
+ ping -c 3 172.31.0.32
PING 172.31.0.32 (172.31.0.32) 56(84) bytes of data.
64 bytes from 172.31.0.32: icmp_seq=2 ttl=63 time=2.53 ms
64 bytes from 172.31.0.32: icmp_seq=3 ttl=63 time=0.129 ms

--- 172.31.0.32 ping statistics ---
3 packets transmitted, 2 received, 33% packet loss, time 2001ms
rtt min/avg/max/mdev = 0.129/1.331/2.533/1.202 ms
++ seq -w 1 10
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39848 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.48 GBytes  1.27 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39850 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.66 GBytes  1.42 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39852 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.2 sec  1.82 GBytes  1.54 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39854 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.55 GBytes  1.33 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39856 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.78 GBytes  1.53 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39858 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.2 sec  1.70 GBytes  1.42 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39860 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.81 GBytes  1.56 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39862 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.76 GBytes  1.51 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39864 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.1 sec  1.56 GBytes  1.33 Gbits/sec
+ for i in '$(seq -w 1 10)'
+ iperf -c 172.31.0.32
------------------------------------------------------------
Client connecting to 172.31.0.32, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.31.1.41 port 39866 connected with 172.31.0.32 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.2 sec  1.32 GBytes  1.12 Gbits/sec

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1842