Escalated to Bugzilla from IssueTracker
Event posted on 12-08-2009 02:37pm EST by woodard

From: Trent D'Hooge <tdhooge>
Subject: mlx4_en driver issue
Date: December 8, 2009 1:22:54 PM CST
To: Ben Coyote Woodard <woodard>, woodard9, Ira Weiny <weiny2>

Sending this to you first before opening a ticket so that we are on the same page. Then we should open a ticket with RH.

RHEL5.3 used mtnic; RHEL5.4 uses the mlx4_en driver. The Mellanox firmware version on the 10GigE card is 2.7.0. When using the mtnic driver we were at firmware version 2.5.914.

First problem seen: the mlx4_en driver seems to be losing enough packets to cause a number of TCP connections to fail, time out, and then eventually reconnect. Lustre does not like this and gets upset. (Even if Lustre were not upset, this could cause major performance issues...)

First problem found by Ira: the 10GigE card was not using MSI interrupts. He fixed this, but we are still having problems.

From the e-mails going around:

First, our conclusion is that the unified driver is BROKEN... As I say at the bottom of this email, the only thing which has changed is the software. We are using the unified driver from RHEL 5.4; the only modification has been the patch I just applied to get MSI to work.

Now for the gory details...

After enabling MSI we still see connections getting into SYN_RECV and causing problems, yet ifconfig and ethtool show only a few errors on the RX side. I don't know how running these nodes back to back is going to reproduce the problem; right now, running 2 nodes against each other through the switch results in no errors. I believe there is something more complex going on because of the large number of TCP connections which Lustre establishes.

We still see a large number of retransmissions in TCP:

# hype139 /sys/module/mlx4_core/parameters > netstat -s | grep retrans
    1059776 segments retransmited
    569233 fast retransmits
    467332 forward retransmits
    2536 retransmits in slow start
    628 sack retransmits failed

# hype139 /sys/module/mlx4_core/parameters > ifconfig eth2
eth2      Link encap:Ethernet  HWaddr 00:02:C9:04:6E:88
          inet addr:172.16.1.201  Bcast:172.16.7.255  Mask:255.255.248.0
          inet6 addr: fe80::202:c9ff:fe04:6e88/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:7500  Metric:1
          RX packets:102570804 errors:20 dropped:23 overruns:23 frame:43
          TX packets:211462289 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:83809240941 (78.0 GiB)  TX bytes:1162296790558 (1.0 TiB)

# hype139 /sys/module/mlx4_core/parameters > ifconfig eth3
eth3      Link encap:Ethernet  HWaddr 00:02:C9:04:6E:89
          inet addr:172.16.9.203  Bcast:172.16.15.255  Mask:255.255.248.0
          inet6 addr: fe80::202:c9ff:fe04:6e89/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:7500  Metric:1
          RX packets:114022111 errors:0 dropped:29 overruns:29 frame:29
          TX packets:241317415 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:90188106711 (83.9 GiB)  TX bytes:1345718307382 (1.2 TiB)

I attempted to turn on debugging in the driver, but nothing is being printed to the console:

# hype139 /sys/module/mlx4_core/parameters > cat /sys/module/mlx4_core/parameters/debug_level
1

Here are the core settings:
/sys/module/mlx4_core/parameters/block_loopback      1
/sys/module/mlx4_core/parameters/debug_level         1
/sys/module/mlx4_core/parameters/enable_qos          N
/sys/module/mlx4_core/parameters/internal_err_reset  1
/sys/module/mlx4_core/parameters/log_mtts_per_seg    3
/sys/module/mlx4_core/parameters/log_num_cq          0
/sys/module/mlx4_core/parameters/log_num_mac         2
/sys/module/mlx4_core/parameters/log_num_mcg         0
/sys/module/mlx4_core/parameters/log_num_mpt         0
/sys/module/mlx4_core/parameters/log_num_mtt         0
/sys/module/mlx4_core/parameters/log_num_qp          0
/sys/module/mlx4_core/parameters/log_num_srq         0
/sys/module/mlx4_core/parameters/log_num_vlan        0
/sys/module/mlx4_core/parameters/log_rdmarc_per_qp   0
/sys/module/mlx4_core/parameters/msi_x               1
/sys/module/mlx4_core/parameters/panic_on_catas      0
/sys/module/mlx4_core/parameters/set_4k_mtu          0
/sys/module/mlx4_core/parameters/use_prio            N

And the ethernet driver settings:

# hype139 /sys/module/mlx4_core/parameters > for file in /sys/module/mlx4_en/parameters/*; do echo $file; cat $file; done
/sys/module/mlx4_en/parameters/inline_thold  104
/sys/module/mlx4_en/parameters/ip_reasm      1
/sys/module/mlx4_en/parameters/num_lro       0
/sys/module/mlx4_en/parameters/pfcrx         0
/sys/module/mlx4_en/parameters/pfctx         0
/sys/module/mlx4_en/parameters/rss_mask      5
/sys/module/mlx4_en/parameters/rss_xor       0

We are still looking for errors anywhere else in the system (i.e. on the switches or other network cards), but we have NOT FOUND ANY. We ran for 3 days with Myricom cards over the weekend without any issues. The mtnic driver we were using previously worked (after much pain!). So we are highly suspicious of the new unified driver. Once again, ONLY THE SOFTWARE HAS CHANGED here... :-(

Is there perhaps a FW upgrade which needs to be done with the unified driver?

# hype139 /sys/module/mlx4_core/parameters > mstflint -d 02:00.0 q
Image type:      ConnectX
FW Version:      2.7.0
Device ID:       26428
Chip Revision:   A0
Description:     Node             Port1            Port2            Sys image
GUIDs:           0002c9030004e948 0002c9030004e949 0002c9030004e94a 0002c9030004e94b
MACs:                             000000000000     000000000001
Board ID:        (MT_0C40110009)
VSD:
PSID:            MT_0C40110009

# hype139 /sys/module/mlx4_core/parameters > mstflint -d 85:00.0 q
Image type:      ConnectX
FW Version:      2.7.0
Device ID:       25448
Chip Revision:   A0
Description:     Port1            Port2
MACs:            0002c9046e88     0002c9046e89
Board ID:        (MT_0BD0110004)
VSD:
PSID:            MT_0BD0110004

# hype139 /sys/module/mlx4_core/parameters > mstflint -d 86:00.0 q
Image type:      ConnectX
FW Version:      2.7.0
Device ID:       26428
Chip Revision:   A0
Description:     Node             Port1            Port2            Sys image
GUIDs:           0002c9030004e928 0002c9030004e929 0002c9030004e92a 0002c9030004e92b
MACs:                             000000000000     000000000001
Board ID:        (MT_0C40110009)
VSD:
PSID:            MT_0C40110009

I don't know what else to try. We will continue to look for a smaller-scale reproducer, but nothing we have done so far is working.

Ira

Begin forwarded message:

Date: Tue, 8 Dec 2009 09:00:53 -0800
From: Jim Garlick <garlick>
To: weiny2, behlendorf1, morrone2
Subject: SYN_RECV connections are back on hype

Uh oh, looks like the old problem is back.
Jim

ehype139: Active Internet connections (w/o servers)
ehype139: Proto Recv-Q Send-Q Local Address            Foreign Address          State
ehype139: tcp        0      0 hype139-lnet0:lustresvc  strauss2-eth2:1023       SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  pigs7-lnet0:edvrpftpd    SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  levi3-eth2:edvrpftpd     SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  strauss10-eth2:1023      SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  momus12-eth2:1021        SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  tycho12-lnet0:1021       SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  momus2-eth2:1023         SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  strauss13-eth2:1020      SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  levi4-eth2:1021          SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  momus5-eth2:1020         SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  pigs4-lnet0:1021         SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  strauss12-eth2:1021      SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  pigs2-lnet0:1023         SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  tycho4-lnet0:1021        SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  strauss1-eth2:1020       SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  momus14-eth2:1023        SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  tycho7-lnet0:edvrpftpd   SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  momus9-eth2:1020         SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  strauss6-eth2:1023       SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  pigs14-lnet0:1023        SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  tycho10-lnet0:1023       SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  momus6-eth2:1023         SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  levi6-eth2:1023          SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  pigs15-lnet0:edvrpftpd   SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  tycho6-lnet0:1023        SYN_RECV
ehype139: tcp        0      0 hype139-lnet0:lustresvc  momus10-eth2:1023        SYN_RECV

This event sent from IssueTracker by kbaxley [LLNL (HPC)]
issue 373976
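A note on the state shown in the listing above: SYN_RECV means the server received a client's SYN and answered with a SYN-ACK, but the final ACK of the three-way handshake never arrived, which fits the packet-loss theory. A simple way to watch the problem appear and clear would be a generic one-liner like the following (not something actually run in this thread):

# watch -n 5 'netstat -tn | grep -c SYN_RECV'   # count half-open connections every 5 seconds

A count that stays non-zero across many different peers, as above, suggests systematic loss on this host rather than one misbehaving client.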
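Regarding the debug_level setting that produced no console output earlier in this comment: mlx4, like most kernel drivers, emits its debug messages via printk at KERN_DEBUG priority, which the default console loglevel suppresses; the messages still land in the kernel ring buffer. A minimal sketch of how one would normally surface them, assuming standard kernel tooling:

# dmesg | grep mlx4                  # debug messages accumulate in the ring buffer
# echo 8 > /proc/sys/kernel/printk   # raise the console loglevel so KERN_DEBUG lines reach the console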
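It may also be worth double-checking that the MSI fix mentioned at the top of this comment actually took effect. A hedged sketch, reusing the PCI address 85:00.0 from the mstflint output above (vector names vary by system):

# grep -i mlx4 /proc/interrupts          # MSI-X vectors show up as PCI-MSI-X entries
# lspci -vv -s 85:00.0 | grep -i msi-x   # "MSI-X: Enable+" means MSI-X is active

If lspci reports "MSI-X: Enable-", the driver has fallen back to legacy interrupts despite msi_x=1.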
Event posted on 2009-12-16 13:23 PST by woodard

Thanks Doug,

Regarding the FW version: Mellanox did not change much. They sent me "2.7.0" builds which reduced the number of outstanding PCI transactions on the bus from 16 down to 12, 8, and 4, and I tried the 12, 8, and 4 versions. They thought there was evidence of a PCI issue, but none of these helped. From our point of view we did not think this was the issue, but we tried the FW just to make sure.

Ira

This event sent from IssueTracker by woodard
issue 373976
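For reference, trial firmware images like the ones described above would typically be burned and verified with mstflint; the image filename below is a placeholder, and the device address is taken from the earlier query output:

# mstflint -d 85:00.0 -i fw-trial-2.7.0.bin burn   # burn the trial image (filename is hypothetical)
# mstflint -d 85:00.0 q                            # re-query to confirm the running image

The new firmware does not take effect until the adapter is reset or the machine is rebooted.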
Did you try the driver Doug attached? Did it behave any differently than the 5.4 one?
I've just filed a bug of our own (relating to our IBM HPC gear at VLSCI at the University of Melbourne) where we are seeing packets arriving on the physical eth1 10Gb/s interface being delivered incorrectly by the driver to eth0:

https://bugzilla.redhat.com/show_bug.cgi?id=649623

We have replicated this using 3 cards (Mellanox ConnectX2 MT26448) in 2 different servers, so we're pretty confident it's not a hardware problem. This is with RHEL 5.5.

We've found that using the mlx4_en driver from the Mellanox site does seem to fix it, though, so it might be worth investigating yourselves.
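One way to demonstrate that kind of cross-port misdelivery is to capture on the interface that should not be receiving the traffic, filtering on the other port's link-layer or IP address. A sketch with placeholder addresses (substitute eth1's actual MAC and IP):

# tcpdump -i eth0 -e -n ether dst 00:02:c9:00:00:01   # placeholder MAC: eth1's hardware address
# tcpdump -i eth0 -n host 192.0.2.10                  # placeholder IP: eth1's address

Any matching frames captured on eth0 confirm that traffic addressed to eth1 is being handed to the wrong netdev.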
Red Hat have told us this bug won't get fixed in 5.7, but they will look at whether or not they will fix it in 5.8. :-(

It does appear that RHEL 6.1 might have the newer version of the driver without this problem, though.
Closing this as not a bug. The original customer report was closed indicating that the problem was due to faulty hardware. If you disagree with this, please open a support case with Red Hat support at access.redhat.com.