Red Hat Bugzilla – Bug 218725
crash under heavy NFS traffic on HP DL360G4 with BCM5704 and tg3 driver
Last modified: 2010-10-22 03:18:36 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:220.127.116.11) Gecko/20060909 Firefox/18.104.22.168
Description of problem:
Hardware: HP DL360G4, 4GB RAM
Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
uname -a: Linux jojo 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:28:02 EDT 2006 i686 i686 i386 GNU/Linux
Under heavy NFS load (multiple cpios from NFS mounted filesystems), machine first becomes unresponsive to the network and subsequently crashes.
Running version 3.52rh of the tg3 driver. Firmware is the latest released by HP (Firmware v. 7.60).
Settings for eth0:
Supported ports: [ MII ]
Supported link modes: 10baseT/Half 10baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
Advertised auto-negotiation: Yes
Port: Twisted Pair
Supports Wake-on: g
Current message level: 0x00000010 (16)
Link detected: yes
We have compiled some data which is attached. A crash dump is also available though it is 1.3GB compressed.
Also open as RH support case: 1118936
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Start ~15 cpios reading through large NFS mounted filesystems.
2. Wait approximately 1 hour (or less)
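For anyone trying to reproduce the load, the steps above can be sketched roughly as follows. This is only an illustration: the exact cpio invocation used in the original test is not stated in this report, so the `find | cpio -o` pipeline, the helper names `build_reader_command` and `start_readers`, and any mount points passed in are assumptions.

```python
import shlex
import subprocess


def build_reader_command(mount_point):
    """Return a shell pipeline that reads every file under mount_point via cpio."""
    quoted = shlex.quote(mount_point)
    # find lists the files; cpio -o reads each one and the archive is discarded.
    # (Assumed invocation, not the reporter's actual command line.)
    return f"find {quoted} -xdev -type f -print0 | cpio -o -0 > /dev/null 2>&1"


def start_readers(mount_points, count=15):
    """Start `count` concurrent readers, cycling over the given mount points."""
    procs = []
    for i in range(count):
        target = mount_points[i % len(mount_points)]
        procs.append(subprocess.Popen(build_reader_command(target), shell=True))
    return procs
```

Calling `start_readers(["/mnt/nfs1", "/mnt/nfs2"])` with the real NFS mount points (the paths here are placeholders) and then waiting on the returned processes approximates step 1.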
Actual Results:
Machine first drops off the network (no ping). Logging in on the console and running /etc/init.d/network restart restores the system IF CAUGHT IN TIME. If left alone, the machine crashes on its own.
Expected Results:
Machine should not drop off the network or crash due to NFS load.
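As a stopgap while the bug is open, the manual recovery (restart networking once ping stops working) could be automated along these lines. This is a sketch under assumptions, not a fix: the probe target, the interval, the three-failure threshold and the function names are all hypothetical, and the script has to run as root on the affected machine.

```python
import subprocess
import time


def ping_ok(host):
    """Return True if a single ICMP echo to host succeeds within 2 seconds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def should_restart(history, threshold=3):
    """True when the last `threshold` probes all failed."""
    return len(history) >= threshold and not any(history[-threshold:])


def watch(host, interval=10, threshold=3):
    """Probe host forever; restart networking after `threshold` straight failures."""
    history = []
    while True:
        history.append(ping_ok(host))
        history = history[-threshold:]  # keep only the most recent probes
        if should_restart(history, threshold):
            subprocess.run(["/etc/init.d/network", "restart"])
            history = []  # start counting again after the restart
        time.sleep(interval)
```

`watch()` would be pointed at something that normally always answers, such as the default gateway. Whether a restart issued this way arrives "in time" is exactly what is uncertain here.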
Created attachment 143013
Compendium of data related to crash
Tried 3.66 tg3 driver directly from Broadcom. No change in failure mode.
Last thing on console screen:
Call Trace:
lssysmon S EF18F980 2980 7612 7603 7613 (NOTLB)
e3ed6f9c 00000082 0000000a ef18f980 ef18f680 f002c030 00000019 c180ede0
f002c030 00000000 c1817740 c1816de0 00000001 00000000 2ea20840 000f4416
f002c030 efd41130 efd4129c 00000001 e3ed6000 e3ed6000 e3ed6fac 08075a80
lssys R running 2540 7613 7612 (NOTLB)
tg3: eth0: transmit timed out, resetting
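To check how often the transmit timeout above shows up before a crash, saved console or syslog output can be scanned for that message. A minimal sketch, assuming the message text is exactly as shown on the console; the function name and the choice of input file are illustrative.

```python
import re

# Matches the message seen on the console, e.g.
# "tg3: eth0: transmit timed out, resetting"
TX_TIMEOUT = re.compile(r"tg3: (?P<iface>\w+): transmit timed out, resetting")


def find_tx_timeouts(lines):
    """Yield (line_number, interface) for each tg3 transmit-timeout message."""
    for number, line in enumerate(lines, start=1):
        match = TX_TIMEOUT.search(line)
        if match:
            yield number, match.group("iface")
```

Passing an open log file (for example `/var/log/messages`, if that is where kernel messages end up on this system) to `find_tx_timeouts` gives the line numbers of each occurrence.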
OK, we have some progress to report:
Upgraded the NIC (NC7782 built-in BCM5704 based) firmware from 3.26 to 3.27b.
This was an adventure in itself as the utility provided by HP for on-line Linux
upgrade does NOT work -- though it gives no error message. Had to get a MS-DOS
floppy(!) to do the FW upgrade.
We have now successfully stress tested the machine for 24 hours with FW 3.27b
and the 3.66 tg3 driver from Broadcom.
When we reverted back to tg3 driver 3.52rh (as supplied from RH under RHEL4U4),
the same stress test causes a crash within 90 minutes. HP does note that the
new FW requires 3.58b or higher, so this may not be unexpected.
It is looking like a combination of buggy FW and old drivers...
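The driver requirement HP gives (3.58b or higher) can be checked against the versions mentioned in this report with a small comparison like the one below. It is a sketch with stated assumptions: it assumes versions look like `major.minor` plus an optional letter suffix, that the minor part compares as an integer, and it deliberately does not guess how suffixes such as `rh`, `b` or `d` order against each other.

```python
import re


def parse_version(text):
    """Split '3.52rh' into ((3, 52), 'rh'); raise ValueError if it does not fit."""
    match = re.fullmatch(r"(\d+)\.(\d+)([A-Za-z]*)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised version: {text!r}")
    return (int(match.group(1)), int(match.group(2))), match.group(3)


def meets_minimum(installed, minimum):
    """Return True/False when the numeric parts decide it, else None.

    Suffixes are only compared for equality, because their relative
    ordering is not something this sketch assumes.
    """
    inst_num, inst_suffix = parse_version(installed)
    min_num, min_suffix = parse_version(minimum)
    if inst_num != min_num:
        return inst_num > min_num
    return True if inst_suffix == min_suffix else None
```

With these assumptions, `meets_minimum("3.52rh", "3.58b")` is `False` and `meets_minimum("3.66", "3.58b")` is `True`, which matches the behaviour seen in the stress tests.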
John Linville's experimental kernel:
in conjunction with HP's latest NIC FW:
MAC PCI-ID BC PXE IPMI UMP NIC
001185C1A841 14E4-1648-0E11-00D0 3.27 - - - 2.36 - - NetXtreme BCM5704 Gigabit
001185C1A840 14E4-1648-0E11-00D0 3.27 - - - 2.36 - - NetXtreme BCM5704 Gigabit
appears to have resolved the issue.
Would you mind verifying with Jason Baron's test kernels?
Those should have the same version of tg3 that is in my kernels, but are closer
overall to what will become the official RHEL 4.5 kernels. Do they also
resolve this issue?
OK, will download and test kernel. Because testing is by exhaustion, this will
take 24-36 hours.
Linux dumbo 2.6.9-42.39.ELsmp #1 SMP Fri Jan 5 18:58:47 EST 2007 i686 i686 i386
crashes under heavy NFS load. This kernel does NOT fix the bug.
Known fixes are:
2.6.9-42.0.3.ELsmp with Broadcom's 3.66d tg3 driver module
Please fix before releasing U5.
Well, I'm at a loss. The tg3 driver between those kernels is the same.
FWIW, I have published a jwltest.181 kernel. Please give that a try and post
the results. I'm not sure what it will tell us, but it would be good to know
if things change.
Based on customer comments, I believe this issue to no longer be reproducible.