From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7

Description of problem:
Hardware: HP DL360G4, 4GB RAM
Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
uname -a: Linux jojo 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:28:02 EDT 2006 i686 i686 i386 GNU/Linux

Under heavy NFS load (multiple cpios from NFS mounted filesystems), machine first becomes unresponsive to the network and subsequently crashes. Running version 3.52rh of the tg3 driver. Firmware is latest released from HP (Firmware v. 7.60)

ethtool output:
Settings for eth0:
        Supported ports: [ MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: d
        Current message level: 0x00000010 (16)
        Link detected: yes

We have compiled some data which is attached. A crash dump is also available though it is 1.3GB compressed. Also open as RH support case: 1118936

Version-Release number of selected component (if applicable):
2.6.9-42.0.3.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. Start ~15 cpios reading through large NFS mounted filesystems.
2. Wait approximately 1 hour (or less)

Actual Results:
Machine first drops off network (no ping). Can log into console and /etc/init.d/network restart will restore system IF CAUGHT IN TIME. If left alone, will crash on its own.

Expected Results:
Machine should not drop off network / crash due to NFS load.

Additional info:
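The reproduction steps could be scripted roughly as follows. This is only a sketch: the report does not give the exact cpio invocation, so the `find ... | cpio -o` pipeline, the function names, and the mount points are assumptions chosen for illustration.

```python
import shlex
import subprocess


def build_reader_command(root):
    """Return a shell pipeline that reads every file under `root` through cpio.

    Assumption: the report only says "cpios reading through large NFS mounted
    filesystems"; this particular pipeline is a guess at an equivalent load.
    """
    quoted = shlex.quote(root)
    # cpio -o takes file names on stdin and writes an archive to stdout;
    # discarding the archive leaves only the NFS read traffic.
    return f"find {quoted} -depth -print | cpio -o > /dev/null 2>&1"


def start_readers(roots, readers=15):
    """Start `readers` background pipelines, cycling through the given roots."""
    procs = []
    for i in range(readers):
        cmd = build_reader_command(roots[i % len(roots)])
        procs.append(subprocess.Popen(cmd, shell=True))
    return procs


if __name__ == "__main__":
    # Hypothetical mount points; substitute the real NFS mounts.
    for proc in start_readers(["/mnt/nfs1", "/mnt/nfs2"]):
        proc.wait()
```

Running about 15 readers in parallel matches step 1; the report then expects the failure within roughly an hour.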
Created attachment 143013 [details] Compendium of data related to crash
Tried 3.66 tg3 driver directly from Broadcom. No change in failure mode.

Last thing on console screen:

Call Trace:
 [<c02d268d>] schedule+0x83d/0x8db
 [<c02d26bd>] schedule+0x86d/0x8db
 [<c0105157>] sys_rt_sigsuspend+0xed/0x108
 [<c02d47cb>] syscall_call+0x7/0xb
lssysmon      S EF18F980  2980  7612   7603  7613 (NOTLB)
e3ed6f9c 00000082 0000000a ef18f980 ef18f680 f002c030 00000019 c180ede0
       f002c030 00000000 c1817740 c1816de0 00000001 00000000 2ea20840 000f4416
       f002c030 efd41130 efd4129c 00000001 e3ed6000 e3ed6000 e3ed6fac 08075a80
Call Trace:
 [<c0105157>] sys_rt_sigsuspend+0xed/0x108
 [<c02d47cb>] syscall_call+0x7/0xb
lssys         R running  2540  7613   7612 (NOTLB)
tg3: eth0: transmit timed out, resetting
 [<c015af11>] vfs_read+0xb6/0xe2
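Since the machine can be recovered only if the hang is caught in time, it may help to watch for the driver's timeout message. The sketch below scans captured console or log text for the "transmit timed out, resetting" line quoted above; the function name and the idea of feeding it saved log text are assumptions, not part of the original report.

```python
import re

# Matches the driver message seen on the console, capturing the interface name.
TIMEOUT_RE = re.compile(r"tg3: (\w+): transmit timed out, resetting")


def find_tx_timeouts(log_text):
    """Return the interface name for every tg3 transmit-timeout line found."""
    return TIMEOUT_RE.findall(log_text)


if __name__ == "__main__":
    sample = (
        "lssys         R running  2540  7613   7612 (NOTLB)\n"
        "tg3: eth0: transmit timed out, resetting\n"
        " [<c015af11>] vfs_read+0xb6/0xe2\n"
    )
    print(find_tx_timeouts(sample))
```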
OK, we have some progress to report:

Upgraded the NIC (NC7782 built-in BCM5704 based) firmware from 3.26 to 3.27b. This was an adventure in itself as the utility provided by HP for on-line Linux upgrade does NOT work -- though it gives no error message. Had to get a MS-DOS floppy(!) to do the FW upgrade.

We have now successfully stress tested the machine for 24 hours with FW 3.27b and the 3.66 tg3 driver from Broadcom. When we reverted back to tg3 driver 3.52rh (as supplied from RH under RHEL4U4), the same stress test causes a crash within 90 minutes. HP does note that the new FW requires 3.58b or higher, so this may not be unexpected.

It is looking like a combination of buggy FW and old drivers...

Ken
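The driver requirement HP notes (3.58b or higher for the new firmware) can be checked mechanically. The sketch below assumes tg3 version strings have the form major.minor plus an optional letter suffix, as in the versions quoted in this report, and compares mainly on the numeric part. How suffixes such as "rh" and "b" should order against each other is not documented here, so the suffix tie-break is a hypothetical choice.

```python
import re


def parse_driver_version(text):
    """Split a version such as '3.52rh' into (major, minor, suffix)."""
    match = re.match(r"(\d+)\.(\d+)([a-z]*)", text)
    if match is None:
        raise ValueError(f"unrecognized driver version: {text!r}")
    return int(match.group(1)), int(match.group(2)), match.group(3)


def meets_minimum(installed, minimum="3.58b"):
    """Return True if `installed` is at least `minimum`.

    Assumption: numeric major.minor decides the comparison; the letter
    suffix is compared lexically only when the numbers are equal.
    """
    have = parse_driver_version(installed)
    need = parse_driver_version(minimum)
    if have[:2] != need[:2]:
        return have[:2] > need[:2]
    return have[2] >= need[2]


if __name__ == "__main__":
    for version in ("3.52rh", "3.66"):
        print(version, meets_minimum(version))
```

Under these assumptions 3.52rh falls below the stated minimum and 3.66 satisfies it, which is consistent with the stress-test results described above.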
John Linville's experimental kernel:

2.6.9-42.32.EL.jwltest.180smp

in conjunction with HP's latest NIC FW:

# /usr/sbin/hpnicfwupg -c
MAC PCI-ID BC PXE IPMI UMP NIC
001185C1A841 14E4-1648-0E11-00D0 3.27 - - - 2.36 - - NetXtreme BCM5704 Gigabit Ethernet
001185C1A840 14E4-1648-0E11-00D0 3.27 - - - 2.36 - - NetXtreme BCM5704 Gigabit Ethernet

appears to have resolved the issue.
Would you mind verifying with Jason Baron's test kernels? http://people.redhat.com/~jbaron/rhel4/ Those should have the same version of tg3 that is in my kernels, but are closer overall to what will become the official RHEL 4.5 kernels. Do they also resolve this issue?
OK, will download and test kernel. Because testing is by exhaustion, this will take 24-36 hours. Ken
FAILURE!!!!!

Linux dumbo 2.6.9-42.39.ELsmp #1 SMP Fri Jan 5 18:58:47 EST 2007 i686 i686 i386 GNU/Linux

crashes under heavy NFS load. This kernel does NOT fix the bug.

Known fixes are:
2.6.9-42.32.EL.jwltest.180smp
OR
2.6.9-42.0.3.ELsmp with Broadcom's 3.66d tg3 driver module

Please fix before releasing U5.

Ken
Well, I'm at a loss. The tg3 driver between those kernels is the same. FWIW, I have published a jwltest.181 kernel. Please give that a try and post the results. I'm not sure what it will tell us, but it would be good to know if things change.
Based on customer comments, I believe this issue to no longer be reproducible.