Bug 482747 - NFS in kernel 2.6.18-128.el5 gives RPC: bad TCP reclen error
Status: CLOSED DUPLICATE of bug 475567
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: x86_64 Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Andy Gospodarek
QA Contact: Red Hat Kernel QE team
Depends On:
Blocks: 533192
 
Reported: 2009-01-27 15:43 EST by Brian Smith
Modified: 2014-06-29 19:01 EDT
Doc Type: Bug Fix
Last Closed: 2009-09-30 09:43:41 EDT

Attachments
Ethernet crash on 1st 5.3 reboot. Did not occur again after a cold boot. (1.70 KB, text/plain)
2009-01-30 16:24 EST, Brian Smith
nfsstat during hang (1.74 KB, text/plain)
2009-02-04 12:41 EST, Brian Smith
/var/log/messages during hang (252.15 KB, text/plain)
2009-02-04 12:42 EST, Brian Smith
eth0 2.6.18-128.1.1.el5PAE boot crash (3.15 KB, text/plain)
2009-02-17 11:09 EST, Brian Smith
Panic under nfs load (2.94 KB, text/plain)
2009-02-17 11:10 EST, Brian Smith
panic from the R805 (5.93 KB, image/png)
2009-04-20 17:59 EDT, Matt Bernstein

Description Brian Smith 2009-01-27 15:43:56 EST
Description of problem:

My main NFS server had been on the new 5.3 kernel (2.6.18-128.el5PAE) for 24 hours when I rebooted for a file system check.  After reboot the server gave a sequence of many:

Jan 23 17:08:04 arwen kernel: RPC: bad TCP reclen 0x08c9d76e (large)
Jan 23 17:08:04 arwen kernel: RPC: bad TCP reclen 0x2669447d (large)
Jan 23 17:08:04 arwen kernel: RPC: bad TCP reclen 0x707b6cf3 (large)
Jan 23 17:08:04 arwen kernel: RPC: bad TCP reclen 0x44a5af3b (non-terminal)
Jan 23 17:08:04 arwen kernel: svc: bad direction 1, dropping request
Jan 23 17:08:04 arwen kernel: RPC: bad TCP reclen 0x00fd0004 (non-terminal)

errors to /var/log/messages.  Upon reboot the behavior recommenced, ending in a hang.  When rebooted again the errors started again.  Doing:
  /etc/init.d/nfs stop
stopped the errors, restarting nfs restarted the errors.

How reproducible:

It was reproducible through three reboots, then I needed the server up so I went back to 2.6.18-92.1.22.el5PAE.
Comment 1 Jeff Layton 2009-01-30 13:59:04 EST
I don't think we did any substantial changes to that area of code between 5.2 and 5.3...

Sounds like the client was spewing bad packets, or maybe a problem lower on the network stack. Can you provide a little more info about your environment?

What sort of clients are you using?
What kind of network interface is on this server?
Comment 2 Brian Smith 2009-01-30 16:23:34 EST
This is a Dell 2950 III, network is Broadcom NetXtreme II BCM5708.  The NIC did have a panic on the 1st reboot at the 'enabling eth0' part of boot, but never after that one time.  I have a copy of the panic, I'll attach it.

This machine is one of four nfs servers I have.  The others are running 2.6.18-128, but this is the main one.  It serves nfs to 35 other RHEL 5.3 linux machines, both 32 and 64 bit.  It also serves samba to up to 100 PCs, a mix of XP, Vista and Server 2003/2008.

These clients were rebooted to 5.3 a couple days before the issues, some of them may have been writing nfs at the time, but generally I experience few issues during a quick reboot.
Comment 3 Brian Smith 2009-01-30 16:24:43 EST
Created attachment 330506
Ethernet crash on 1st 5.3 reboot.  Did not occur again after a cold boot.
Comment 4 Brian Smith 2009-02-01 22:44:11 EST
OK, rebooted the server Friday night.  Two days, no issues.  Perhaps it was a fluke client flood at reboot time...
Comment 5 Jeff Layton 2009-02-03 16:44:42 EST
Ok...

I think the thing to do here is open a separate BZ case for the oops, since that's probably a separate issue (you might also want to search BZ to see if someone has already reported that).

On the NFS server side problem, I'll leave this open for a while and flag it needinfo. If the problem crops up again, we can take a harder look at it...
Comment 6 Brian Smith 2009-02-04 12:40:34 EST
Happened again in the middle of the day.  I happened to be untarring a new mysql over NFS and it hung.  Interestingly ssh from the affected machine, arwen, to moria (a client) also failed with:
  ssh: Bad packet length <random number>
I could ssh to moria from other machines.

I'm back on the last 5.2 kernel.

This makes me think it is not nfs, but possibly ethernet driver related.  Though it wasn't a total failure, at least at first, since at least some of the messages got to my syslog server.

I do notice an error from a machine running MTU 1500 to the server which is at MTU 9000, though that was after trouble started.

I'll attach a trimmed /var/log/messages and an nfsstat output.  Didn't think of a lot else to do and wanted to get the machine up asap.
Comment 7 Brian Smith 2009-02-04 12:41:30 EST
Created attachment 330894
nfsstat during hang
Comment 8 Brian Smith 2009-02-04 12:42:36 EST
Created attachment 330895
/var/log/messages during hang
Comment 9 Jeff Layton 2009-02-12 15:39:17 EST
Looking at the code in svc_tcp_recvfrom(), it really looks like we're just ending up with garbage being returned by svc_recvfrom() (which basically just does a kernel_recvmsg() to read from the socket).

I also notice in your messages:

Feb  4 12:03:43 arwen smbd[5470]: [2009/02/04 12:03:43, 0] lib/util_sock.c:read_data(534) 
Feb  4 12:03:43 arwen smbd[5470]:   read_data: read failure for 65469 bytes to client 128.146.6.224. Error = Connection reset by peer 

...and...

Feb  4 11:56:48 arwen named[4189]: socket.c:1156: unexpected error:
Feb  4 11:56:48 arwen named[4189]: internal_send: 0.0.14.16#53: Invalid argument

...and...

Feb  4 11:50:26 arwen smbd[6823]: [2009/02/04 11:50:26, 0] printing/print_cups.c:cups_cache_reload(143) 
Feb  4 11:50:26 arwen smbd[6823]:   Unable to get printer list - server-error-service-unavailable 
Feb  4 11:50:26 arwen nmbd[4884]: [2009/02/04 11:50:26, 0] nmbd/nmbd_packets.c:process_dgram(1270) 
Feb  4 11:50:26 arwen nmbd[4884]:   process_dgram: ignoring malformed3 (datasize = 109, len=24, off=86) datagram packet sent to name STAT<1e> from IP 128.146.7.7

...those really make it look like there may be some sort of problem in a layer lower than in the RPC or NFS code.

You mentioned that you had 2 other servers that were working fine with 5.3 kernels. Do they have similar hardware? In particular, the network cards? If so, this might be indicative of a bad NIC.
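For anyone decoding the reclen values in the log by hand: over TCP, ONC RPC prefixes each record fragment with a 4-byte marker whose top bit flags the last fragment and whose low 31 bits give the fragment length.  A minimal sketch of that decoding follows.  The function name decode_reclen is made up for illustration, and reading "(non-terminal)" as "top bit clear" and "(large)" as "length bigger than the server will accept" is an interpretation of the messages, not something confirmed in this report.

```shell
#!/bin/sh
# Sketch: split a 32-bit RPC-over-TCP record marker into its two fields.
# decode_reclen is a hypothetical helper name, not a kernel or nfs-utils tool.
decode_reclen() {
    marker=$(( $1 ))                    # accepts hex such as 0x80000064
    last=$(( (marker >> 31) & 1 ))      # top bit: 1 = last fragment of the record
    len=$(( marker & 0x7fffffff ))      # low 31 bits: fragment length in bytes
    printf 'last_fragment=%d length=%d\n' "$last" "$len"
}

# A sane marker: last fragment, 100 bytes.
decode_reclen 0x80000064
# One of the values logged in this bug: top bit clear, implausibly long length.
decode_reclen 0x44a5af3b
```

Random-looking values in both fields are what one would expect if arbitrary bytes were being read where a marker should be, which fits the "garbage from the socket" theory.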
Comment 10 Brian Smith 2009-02-17 11:08:18 EST
I have six other compute machines with similar hardware, but they are running the 64-bit RHEL client kernel, and the problem machine is running the 32-bit server kernel.  The compute machines also don't do a ton of nfs/network.

This motherboard has two nics, and I did try the second nic, with the same result.  (Of course, since they are both on the motherboard, I'm not sure if they are truly separate.)

I seem to be able to get the 5.3 kernel to crash after about two hours by going to a 64-bit 5.3 machine and untarring mysql in a loop reading from one nfs mount and writing to another nfs mount.

On 2.6.18-92.1.22.el5PAE I ran this test for twelve hours with no issues.

On 2.6.18-128.1.1.el5PAE it crashed once on boot (attached as crash2.txt) and once under NFS load (nfs_load_crash.txt).  In another test it just went tharn with 'TCP reclen' errors.

I have a new server that just came in and I will install 5.3 32-bit on it and try a few tests.
Comment 11 Brian Smith 2009-02-17 11:09:55 EST
Created attachment 332238
eth0 2.6.18-128.1.1.el5PAE boot crash
Comment 12 Brian Smith 2009-02-17 11:10:41 EST
Created attachment 332239
Panic under nfs load
Comment 13 Jeff Layton 2009-03-02 09:21:24 EST
Brian, any results from testing with the newer host?

Reassigning to Andy since this looks like a problem in lower-level networking (certainly some of the stack traces here don't involve RPC or NFS at all). My guess is that once that problem is addressed, the RPC problems should also go away. If they don't, then we can tackle that separately.
Comment 14 Brian Smith 2009-03-02 15:46:22 EST
I ran the tar over nfs, plus an smbget on the newer host for about 24 hours, no problems.

So I installed a second ethernet card in the main server (Intel 82571EB dual nic pci-x, I didn't have a Broadcom.  I suppose the Intel may even be a better nic).  I ran the tar test on it for 12 hours twice, no issues.

So at this point I have to think it is either the Broadcom driver in 5.3 vs 5.2 or a hardware issue with the Broadcom nic.  I'm leaning to hardware, since even though it worked fine on 5.2 the other machine should have failed too on a 5.3 driver issue.

I'm still a bit wait and see but so far I've got 5 days uptime with no errors.

I'll likely call Dell and see if I can get a new mainboard during a planned downtime.
Comment 15 Andy Gospodarek 2009-03-03 11:45:45 EST
Thanks for the update, Brian.

I agree with Jeff (he and I talked about it before his statement in comment #13 and the reassignment), but I would like to see more testing done on actual bnx2 hardware before we close this one out completely, as I would like to get any driver issue fixed.
Comment 16 Brian Smith 2009-03-05 16:51:28 EST
Obviously, I'm hesitant to crash the production server.  Also, if I get the new mainboard, I may reinstall 64-bit on that server, which will then be changing two things at once (hw and 32->64).  I will likely go back to the bnx2 then to see, and I can report that back.

Although it didn't crash in my test, if there is anything I can do on the newer host to torture test it for you before I put it into production I'd be happy to.
Comment 17 Brian Smith 2009-03-26 13:32:32 EDT
FYI, Dell had me do a BIOS update (2.2.3 to 2.5.0).  That _seems_ to have extended the time to crash with my test from about 2 hours to about 12.

Also, I updated to 5.3 64-bit last night.  It panicked again on the test after about 15 hours.

I'm back on the e1000 interface as the quarter starts next week.
Comment 18 Andy Gospodarek 2009-03-26 15:46:18 EDT
Brian, thanks for keeping us in the loop on your findings.  Did you happen to save the panic logs?  I would like to see how close they are to the ones in comment #11 and comment #12.
Comment 19 Brian Smith 2009-03-30 13:45:32 EDT
(In reply to comment #18)
> Brian, thanks for keeping us in the loop on your findings.  Did you happen to
> save the panic logs?  I would like to see how close they are to the ones in
> comment #11 and comment #12.  


Sadly the scroll back on the terminal server was set wrong for this crash.

I attempted to get it to crash again and it would only spam the 'TCP
reclen' hang.

Dell and I decided to do the motherboard swap, which we did.  After
about an hour of:

while [ 0 ]; do tar -xzvf mysql.tgz; done (from server1; nfs to nfs on client1)
and
while [ 0 ]; do smbget 9.5MB_file; done (from server1 to local disk on client2)

The new motherboard gave the same errors.  So despite the new server
seeming to work, it appears to be software.

I did run the tests for 48 hours using the intel nic, with no issues.
So since the quarter is starting, I'll continue to use it.  I've abused
my users with enough testing. ;)

I still find it odd that no one else has run into this...I don't think
I'm doing anything _that_ odd.

Thanks for the timely responses and suggestions!
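The two loops quoted in this comment run forever and rely on someone watching the console.  A bounded variant that stops as soon as new 'bad TCP reclen' lines show up could look like the sketch below.  The helper names are hypothetical, and it assumes the workload is started from a host that can read the NFS server's log (for example the server itself, or a syslog host); none of this is taken from the test setup described here.

```shell
#!/bin/sh
# Count "bad TCP reclen" lines in a log file; prints 0 if there are none
# or if the file cannot be read.
count_reclen_errors() {
    n=$(grep -c 'RPC: bad TCP reclen' "$1" 2>/dev/null) || n=0
    echo "$n"
}

# Repeat a workload until new reclen errors appear in the log or the
# iteration limit is reached.
# usage: run_until_reclen LOGFILE MAX_ITERATIONS COMMAND [ARGS...]
run_until_reclen() {
    log=$1
    max=$2
    shift 2
    before=$(count_reclen_errors "$log")
    i=0
    while [ "$i" -lt "$max" ]; do
        "$@" >/dev/null 2>&1            # one pass of the workload
        i=$((i + 1))
        now=$(count_reclen_errors "$log")
        if [ "$now" -gt "$before" ]; then
            echo "reclen errors appeared after $i iterations"
            return 1
        fi
    done
    echo "no new reclen errors after $i iterations"
}

# Example invocation (not executed here):
#   run_until_reclen /var/log/messages 1000 tar -xzf mysql.tgz
```

The iteration count at the point of failure gives a rough number to compare between kernels, NICs and MTU settings.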
Comment 20 Andy Gospodarek 2009-03-30 15:58:59 EDT
I agree that what you are doing doesn't seem that odd.  I discovered we have a Dell 2850 here that I will try and use to reproduce this and will let you know if I'm able to do so (let's hope!).

I also wonder if this is specific to 32-bit kernels and there is something odd with a frame length that is calculated when pulling the data out of the hardware in the bnx2 driver, so I might start looking at options there.
Comment 21 Brian Smith 2009-03-30 17:40:07 EDT
I did update to 64-bit as part of the testing.  I was hoping it was 32-bit only, but alas.

If it helps, the system runs dns, sendmail, mimedefang, spamassassin, nfs, dovecot, and samba.  Most of its filesystems are off of a SAS MD3000 (LSI SAS1068 controller), and some off an old Ultra 160 SCSI box (LSI 53c1030 controller).

It's hooked to a Cisco switch using an MTU of 9000.

It isn't that heavily loaded generally.  My tests, though constant, shouldn't tax it.
Comment 22 Andy Gospodarek 2009-03-31 10:07:11 EDT
Did you have your clients and/or servers using an MTU of 9000 as well?  I downloaded a tarball and had the client extract the contents over the network overnight, but didn't see any problems.  Were you just extracting over the network, or were you extracting and writing over the network?

Because of your notes in comment #21 I set the MTU to 9000 on the server (which will use a different memory pool and could make a difference), but left the client at 1500.  I could change it, but I would need some different hardware since the card in that system doesn't do jumbo frames.  Let me know if you had the client and server using 9000 for the MTU or just the server when you can.
Comment 23 Brian Smith 2009-04-01 11:37:09 EDT
The machine doing the tar is running 2.6.18-128.1.1.el5 x86_64 using a Broadcom Corporation NetXtreme BCM5704 with an MTU of 9000.  I discovered this tar crash when I was updating mysql, so I used it. This tar test alone will crash it, though caveat below:

However, during break there are fewer students who get home drives via samba, so I added in the smbget, which did make it crash faster.  That machine is running 2.6.18-128.1.1.el5PAE i686 using an Intel Corporation 82540EM Gigabit with an MTU of 9000.

These machines are connected to a Cisco 3650 switch.  Other machines in the dept are connected through other switches and using myriad different hardware, some at 100Mb (thus 1500MTU) and some at 1000Mb (all linux at 9000, windows probably mixed).  

I never tried to crash the machine with the smbget alone...so theoretically this could be the more sensitive test.
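Since the machines here mix 1500- and 9000-byte MTUs, one way to check whether jumbo frames actually pass end to end between two hosts is to ping with fragmentation prohibited and a payload sized to fill the MTU.  The sketch below only prints the command rather than running it.  The helper names are hypothetical, the host name is just the client mentioned earlier in this bug, and the 28-byte overhead assumes IPv4 (20-byte header without options) plus the 8-byte ICMP header.

```shell
#!/bin/sh
# Payload size that fills one IPv4 packet of the given MTU:
# MTU minus 20 bytes of IPv4 header (no options) minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Print a ping command that sends three full-size packets with the
# don't-fragment behaviour requested (-M do, as in Linux iputils ping).
jumbo_ping_cmd() {
    host=$1
    mtu=${2:-9000}
    echo "ping -M do -c 3 -s $(icmp_payload_for_mtu "$mtu") $host"
}

jumbo_ping_cmd moria 9000
```

If the printed command fails between two hosts that are both configured for 9000, something on the path is not passing jumbo frames, which is a different problem from the corruption discussed in this bug.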
Comment 24 Adam Hough 2009-04-10 12:58:44 EDT
Note: I am running Centos 5.3, not RHEL 5.3, though it should be roughly the same kernel and other packages. The machines were just recently upgraded to 5.3 from 5.2.



I am having the same "RPC: bad TCP reclen" problem, it seems.  Here is a kernel panic that I caught while the system was booting.  This has happened on two ProLiant DL360 G5 systems now.

This crash happened when one of the systems was running some gfs tuning commands after a reboot following the "RPC: bad TCP reclen" problem.
http://www.gradientzero.com/bugs/possibly_related_kernel_panic.txt

Dmesg from one of the systems, though this one has not had any errors yet:
http://www.gradientzero.com/bugs/dmesg.system2.20090410.txt

*.debug syslog of the system2 
http://www.gradientzero.com/bugs/syslog.all.system2.20090410.txt

Both machines have an MTU of 9000 and are connected to a Foundry switch.

If you would like me to test anything I will be happy to do that or if you need more information I would be happy to help with that.

If it is a problem I am running Centos instead of RHEL then that is not impossible to fix as I can install RHEL 5.3 on the machines.
Comment 25 Matt Bernstein 2009-04-20 17:58:10 EDT
This might be the same issue; again it's Dell (PowerEdge R805, M600, 6950) and bnx2, again it's 9000-byte frames.

It's also CentOS. kernel{,-xen}-2.6.18-128.1.6.el5 panics, seemingly under load (though much of our load is UDP: NFSv3 or amanda) only with 9000-byte frames. kernel{,-xen}-2.6.18-92.1.22.el5 is much more stable.

The 6950 is regularly under high load (current loadavg ~ 27), but (unlike the others) uses 1500-byte frames and hasn't yet crashed after two weeks.

The M600s run Xen and an iscsi initiator and appear to crash under load, though I have one with an uptime of 4 days. Sometimes they panic on boot, though.

The R805 is a big file server, and has not survived 24 hours on the 5.3 kernel.

All the above servers use their two bnx2 interfaces bonded (with LACP) to an HP ProCurve zl switch (running K.13.nn) and running vlan interfaces on top of bond0.
Comment 26 Matt Bernstein 2009-04-20 17:59:45 EDT
Created attachment 340444
panic from the R805
Comment 27 Matt Bernstein 2009-04-30 16:46:57 EDT
(In reply to comment #25)

> All the above servers use their two bnx2 interfaces bonded (with LACP) to an HP
> ProCurve zl switch (running K.13.nn) and running vlan interfaces on top of
> bond0.  

Just a data point, this has nothing to do with bonding as an R710 I've been allowed to play with crashed before I got as far as configuring bonding, so this crash appears to be bnx2 + 9000-byte MTU.

Might try the 5.3 kernel with the 5.2 bnx2 driver if I can find the time.
Comment 28 Adam Hough 2009-04-30 17:56:14 EDT
The two servers that were crashing on me have their public interface MTU set to 9000 and most of the clients are set to 9000 but I cannot guarantee that all of the clients were set to 9000.

I have since upgraded another machine with the 2.6.18-128.1.6.el5 kernel but it has been stable running NFS with an MTU of 1500. 

The two that were crashing have been stable since I back leveled their kernel to the latest 5.2 kernel.
Comment 29 Matt Bernstein 2009-05-06 03:33:49 EDT
Rebuilt the 5.3 kernel without any bnx2 patches since 5.2, and it still crashes. I too am stuck on the latest 5.2 kernel if I want to use jumbo frames, and even then they don't work on Xen with pv_ops guests (eg Fedora).
Comment 30 Matt Bernstein 2009-05-06 11:46:30 EDT
Tried kernel-xen-2.6.18-141.el5 (via <http://people.redhat.com/dzickus/el5/>) on the R710 and it too seems to spontaneously combust.
Comment 31 Matt Bernstein 2009-05-28 10:14:58 EDT
The Dell OMSA "netxtreme2" driver has stopped the crashing for me (fingers crossed). So, here is some "ethtool -i eth0" output.

2.6.18-92.1.22.el5 (works)

  version: 1.6.9
  firmware-version: 4.4.1

2.6.18-128.1.6.el5 (crashes with jumbo frames, works with 1500-byte frames)

  version: 1.7.9-1
  firmware-version: 2.9.1

2.6.18-128.1.10.el5 (with dkms-built bnx2 from netxtreme2 RPM, working so far)

  version: 1.8.5b
  firmware-version: 4.6.4 NCSI 1.0.6
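To compare driver and firmware levels across several machines, the relevant lines of "ethtool -i" can be collapsed to one line per host.  This is only a sketch: summarize_ethtool_i is a made-up helper name, and the sample input is modelled on the output quoted in this comment, with the "driver:" line assumed since it was not pasted.

```shell
#!/bin/sh
# Read "ethtool -i" style output on stdin and print driver, driver version
# and firmware version on a single line.
summarize_ethtool_i() {
    awk -F': ' '
        { sub(/^[ \t]+/, "") }              # tolerate indented, pasted output
        $1 == "driver"           { d = $2 }
        $1 == "version"          { v = $2 }
        $1 == "firmware-version" { f = $2 }
        END { printf "%s %s fw=%s\n", d, v, f }'
}

# Sample input only; on a real host use:  ethtool -i eth0 | summarize_ethtool_i
printf '%s\n' 'driver: bnx2' '  version: 1.8.5b' '  firmware-version: 4.6.4 NCSI 1.0.6' |
    summarize_ethtool_i
```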
Comment 32 deanx 2009-06-10 10:12:08 EDT
I have an Oracle RAC which keeps kernel panicking in the bnx2 driver under high load whenever the 2.6.18-128.1.6.el5 kernel is loaded.   While not directly this bug, it contains many of the same elements: MTU=8000, bonding and a similar kernel crash trace output.  Both nodes in a two-node RAC exhibit the same behavior with this kernel version.

I have reverted back to the 2.6.18-92.1.18.el5 kernel, and so far the systems have been stable.  Meanwhile, outside of the kernel, the rest of the system has been patched with current 5.3 patches. It is easy to focus on the oracleasm kernel modules... but I am very suspicious that the bnx2 kernel module is the center of the problem.
Comment 33 Andy Gospodarek 2009-06-15 17:58:52 EDT
It's only a dual processor system, but I've been running netperf and a kernel-build simultaneously (load avg of ~8) and haven't seen any problems.

I'm sure it's a problem since several have reported it, but it would be nice to have an idea how we can reproduce this so we can be sure the next update is fixed.

We are getting close to freezing another update for RHEL5, so if anyone wants to test the latest kernels from:

http://people.redhat.com/dzickus/

they are welcome to do so (in fact it's really appreciated) and report back whether or not they are still broken with the updated bnx2 driver.
Comment 34 deanx 2009-06-16 09:44:23 EDT
RE: RAC crashing ...

 We rebooted the RAC systems back to the 2.6.18-92.1.18.el5 kernel (and corresponding oracleasm modules) and the Oracle RAC has been stable for a week now. Previously, with the 2.6.18-128.1.6.el5 kernel, one or the other node of a two-node RAC would randomly crash every 24-36 hours depending on load.

 Meanwhile we are running with nearly current updates on most other RHN patches except we rebooted on the older kernel.  We set up a kdump server... but we also rebooted onto the older kernel at the same time and have not seen a kernel panic since.  I will have to beg the DBAs on these systems to test your proposed kernel... I am not sure they will let me, given it's working again.

 The RAC configuration has a bonded interface on the private networking.  The bond is set up in active-backup mode and we have tested with both MII and ARP monitoring... we have thoroughly wrung out and tested any issues with the private networking and the switch plant; the problem seems to be isolated to the kernel modules.  Another interesting point is that, for redundancy's sake, the bonded interface uses one interface from the onboard NIC (bnx2 on an IBM 3650) and one interface from the add-on NIC (an e1000-compatible NIC).  Lastly, the MTU is set to 8192 at the bond0 interface level.

 It seems to me the crashing is due to overruns in the receive buffers/interrupts... If I can capture a kernel dump I will forward it.  Meanwhile I will also look at the draft kernel.
Comment 35 Andy Gospodarek 2009-06-16 10:14:56 EDT
deanx, thanks for the details -- those are extremely helpful.

Do you know if the e1000-based NIC or the bnx2-based NIC is normally the active device in this bond?

I also want to let you know that I completely understand about not being able to take down a production system.  Verification of upcoming releases does help us close bugs more quickly, but I understand that not everyone has the ability to easily try them out -- especially if the only system showing any problems is a production system.

I've setup my test again and will let it run for a while with the bnx2-based device in an active-backup bond.
Comment 36 deanx 2009-06-16 13:17:36 EDT
eth1 is the bnx2 and is the Primary.  The other bond is eth2 and is on the e1000 card and is the secondary.

modprobe.conf -
...
alias bond0 bonding
options bond0 mode=1 primary=eth1 arp_interval=2000 arp_ip_target=172.17.228.66,10.1.228.68
--
I have tried this in mii and arp mode... and completely redid the switch plant.  Those turned out not to impact this.

Attached is a screen dump of the last crash.  It ended in a skb_over_panic.   Sorry for the crashtrace in image format... hopefully my kdump server will help.

 -Dean Eckstrom
Comment 37 deanx 2009-06-16 13:31:10 EDT
The crash trace I have is in image format... and I can't attach it.  So I will attempt to summarize the trace (without all the addresses [...]):
----
[...] do_softirq+0x2c/0x85
[...] do_IRQ+0xec/0xf5
[...] ret_from_intr+0x0a/0xa
[...] copy_user_generic+0x3b/0x16c
[...] skb_dequeue+0x48/0x50
[...] memcpy_toiovec+0x36/0x66
[...] skb_copy_datagram_iovec+0x169/0x237
[...] udp_recvmsg+0xc8/0x24f
[...] __pollwait+0x0/0xe2
[...] sock_common_recvmsg+0x2d/0x43
[...] sock_recvmsg+0x101/0x120
[...] autoremove_wake_function+0x0/0x2e
[...] sockfd_lookup_light+0x33/0x56
[...] sys_recvmsg+0x15c/0x24c
[...] lock_task_sighand+0x2c/0x53
[...] getrusage+0x1e1/0x1fc

kernel panic: not syncing skb_over_panic

----
 This is all I have at this time, I have seen similar traces on prior crashes.  Hopefully this provides a glimmer of insight how the system got lost.
Comment 38 Matt Bernstein 2009-07-15 09:50:58 EDT
I'm convinced this bug only triggers on bnx2 with jumbo frames.

I'm wondering if this is a horrible interaction with the NIC firmware, as I've not managed to reproduce the bug with Dell's latest firmware, even without their own driver (which has mysteriously vanished from their yum repo).

http://support.dell.com/support/downloads/download.aspx?c=us&l=en&s=gen&releaseid=R214709&fileid=304120

Anyway, those of you afflicted with this bug on Dell servers might like to try applying the above firmware.
Comment 39 Andy Gospodarek 2009-07-15 09:59:21 EDT
Thanks, Matt!  I agree this seems to be specific to systems with MTU>1500, and possibly some specific vendors.

So after upgrading to the firmware in the link you are running the stock RHEL5.3 bnx2 driver with an MTU>1500 and TSO enabled without any problems?  (I think the answer is 'yes' but I want to be sure. :)
Comment 40 Matt Bernstein 2009-07-18 06:54:28 EDT
It's a "yes so far". I've now survived a week running 2.6.18-128.1.16.el5xen on two Dell M600 hypervisors, with:

# ethtool -i eth0
driver: bnx2
version: 1.7.9-1
firmware-version: 4.6.2

# ethtool -k eth0
Offload parameters for eth0:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off
Comment 41 Andy Gospodarek 2009-07-20 10:20:41 EDT
That's great to hear, Matt.  Thanks for the feedback!

Based on your results, I would recommend to others on this bug who are listening and using Dell hardware to consider this firmware update from Dell:

http://support.dell.com/support/downloads/download.aspx?c=us&l=en&s=gen&releaseid=R214709&fileid=304120
Comment 43 Orlando Richards 2009-08-26 05:35:51 EDT
I'm seeing this same issue on RH5.3 on an IBM X3550 node. I've updated the firmware to the latest from IBM (v4.6.0) - but still no joy. Unfortunately, IBM don't seem to have anything higher available - I see Matt has updated to 4.6.2 (comment #40 above).

# uname -a
Linux host.domain 2.6.18-128.7.1.el5 #1 SMP Wed Aug 19 04:00:49 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

# ethtool -i eth0
driver: bnx2
version: 1.7.9-2
firmware-version: 4.6.0 ipms 1.6.0
bus-info: 0000:04:00.0

# ethtool -k eth0
Offload parameters for eth0:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off

The card works fine with mtu 1500, but not with mtu 9000.

I can confirm that it was working fine with MTU 9000 in kernel version kernel-2.6.18-92.1.22.el5. Anecdotally, I have seen something similar with the latest generation of RH4 kernels as well, but have not investigated further.
Comment 45 Kevin Graham 2009-09-15 19:10:28 EDT
Much agreed with comment 9 -- we're seeing garbage on a couple of levels with what appears to be the same problem (9000 byte MTU on 2.6.18-128.el5's bnx2, specifically on a Dell R900's planar interfaces).

For example, nss_ldap returns assertion failures:

# ps -ef >/dev/null
ps: ../../../libraries/liblber/io.c:491: ber_get_next: Assertion `ber->ber_buf == ((void *)0)' failed.
# 

...in addition to the previously noted example of sshd trashing connections such as (MSS on this connection would be ~1460, so it's trashing more than just jumbo'ed packets):

debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
Received disconnect from 192.2.0.3: 2: Bad packet length 2294956570.

This is on a NFS-attached (v3/tcp) database box; bouncing the network interface (via network init script, not sure if anyone tried simply up/down via ifconfig) clears up the userland stuff, but knfs remains wedged. (Don't have the tcpdumps to prove it, but brief ~30s spikes in sar suggest that it's in a very slow and futile series of retries.)

We'd seen very consistent hangs since being migrated from another box (same hardware) that had been stable, though kernel would have gotten bumped up to 5.3 baseline in the process.

Dropping MTU back to 1500 completely stabilized it; went from ~24h intervals between hangs to no problems in over 40d.
Comment 50 Kevin Graham 2009-09-29 18:44:01 EDT
Followup to (my own) comment 45 -- we walked this machine up to 2.6.18-164.el5 to pick up bug 475567. Even without comment 38 update to firmware, it's been stable and running @ 9000 bytes clean for +5 days now.

That bug (bug 475567 comment 57) cites http://kbase.redhat.com/faq/docs/DOC-18867, which I had previously dismissed as just the 'bnx2-triggered panic on MTU change' bug, but reading it in the context of this bug, it definitely looks like a match.
Comment 51 Andy Gospodarek 2009-09-30 09:43:41 EDT
(In reply to comment #50)
> Followup to (my own) comment 45 -- we walked this machine up to 2.6.18-164.el5
> to pick up bug 475567. Even without comment 38 update to firmware, it's been
> stable and running @ 9000 bytes clean for +5 days now.
> 
> That bug (bug 475567 comment 57) cites
> http://kbase.redhat.com/faq/docs/DOC-18867, which I had previously dismissed as
> just the 'bnx2-triggered panic on MTU change' bug, but reading it in the
> context of this bug, it definitely looks like a match.  

Thanks for the feedback, Kevin.  I'm glad to hear the new kernel is working for you.

There were definitely problems with bnx2 and jumbo frames when TSO was enabled on 5.3 (and upstream kernels).  I hope that all who cannot update to 5.4 (2.6.18-164) can work around this issue by either disabling TSO with ethtool or using a non-jumbo MTU.
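The two workarounds can be sketched as shell commands.  This is a hedged example rather than a tested procedure: the wrapper and function names are hypothetical, eth0 is only a placeholder for the bnx2 interface, and by default the script just prints the commands instead of running them.

```shell
#!/bin/sh
# Print each command by default; set DRY_RUN=0 (and run as root) to execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Workaround 1: turn off TCP segmentation offload on the interface.
disable_tso() {
    run ethtool -K "$1" tso off
}

# Workaround 2: drop back to the standard 1500-byte MTU.
use_standard_mtu() {
    run ip link set dev "$1" mtu 1500
}

disable_tso eth0
use_standard_mtu eth0
```

Either one alone is meant to avoid the bad combination (jumbo frames plus TSO).  Neither setting persists across a reboot unless it is also added to the interface's startup configuration, and "ethtool -k eth0" shows the current offload state.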

I am going to close this as a duplicate of bug 475567.  If anyone is still having problems with bnx2 on 5.3 with jumbo frames, *please* re-open this bug and we can get it worked out.

Thanks!

*** This bug has been marked as a duplicate of bug 475567 ***
