Bug 218725 - crash under heavy NFS traffic on HP DL360G4 with BCM5704 and tg3 driver
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.4
Hardware: i686 Linux
Priority: medium
Severity: high
Assigned To: John W. Linville
QA Contact: Brian Brock
Reported: 2006-12-06 19:55 EST by Ken Nishimura
Modified: 2010-10-22 03:18 EDT (History)
Doc Type: Bug Fix
Last Closed: 2007-01-26 11:19:52 EST

Attachments
Compendium of data related to crash (658.99 KB, application/x-gzip)
2006-12-06 19:57 EST, Ken Nishimura
Description Ken Nishimura 2006-12-06 19:55:48 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7

Description of problem:
Hardware: HP DL360G4, 4GB RAM
Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
uname -a: Linux jojo 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:28:02 EDT 2006 i686 i686 i386 GNU/Linux

Under heavy NFS load (multiple cpio processes reading from NFS-mounted filesystems), the machine first becomes unresponsive on the network and subsequently crashes.

Running version 3.52rh of the tg3 driver.  Firmware is latest released from HP (Firmware v. 7.60)

ethtool output:
Settings for eth0:
        Supported ports: [ MII ]
        Supported link modes:   10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Half 1000baseT/Full 
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Half 1000baseT/Full 
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: d
        Current message level: 0x00000010 (16)
        Link detected: yes


We have compiled some data which is attached.  A crash dump is also available though it is 1.3GB compressed. 

Also open as RH support case: 1118936

Version-Release number of selected component (if applicable):
2.6.9-42.0.3.ELsmp

How reproducible:
Always


Steps to Reproduce:
1. Start ~15 cpios reading through large NFS mounted filesystems.
2. Wait approximately 1 hour (or less)
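
The reproduction steps above could be scripted roughly as follows. This is only a sketch: the report does not give the exact cpio invocation, so the `find ... | cpio -o > /dev/null` pipeline, the function names, and the mount points in the usage note are assumptions for illustration. It also assumes a modern Python 3 on the test client rather than the interpreter shipped with RHEL 4.

```python
import shlex
import subprocess


def build_reader_commands(mount_points, readers=15):
    """Return one shell pipeline per reader, cycling over the mount points."""
    if not mount_points:
        raise ValueError("need at least one NFS mount point")
    commands = []
    for i in range(readers):
        mount = mount_points[i % len(mount_points)]
        # Assumed invocation: walk the tree and archive it to /dev/null so
        # every file is read over NFS but nothing is written locally.
        commands.append(
            "find {} -depth -print | cpio -o > /dev/null 2>&1".format(
                shlex.quote(mount)
            )
        )
    return commands


def start_readers(commands):
    """Start each pipeline in the background and return the Popen handles."""
    return [subprocess.Popen(cmd, shell=True) for cmd in commands]
```

To generate load, something like `start_readers(build_reader_commands(["/mnt/nfs1", "/mnt/nfs2"]))` would start 15 readers; the mount point names here are hypothetical.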

Actual Results:
The machine first drops off the network (no ping).  Logging in at the console and running /etc/init.d/network restart restores the system IF CAUGHT IN TIME.  If left alone, the machine crashes on its own.
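
As a stopgap while the root cause is open, the manual recovery described above could in principle be automated with a small watchdog. The sketch below is an assumption-laden illustration, not something used in this report: the failure threshold, the function names, and the choice of host to ping are all hypothetical, and it assumes Python 3 with the Linux iputils `ping`.

```python
import subprocess


def ping_ok(host, timeout_s=2):
    """Return True if a single ICMP echo to host succeeds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def restart_network():
    """Run the same init script the report uses for manual recovery."""
    subprocess.run(["/etc/init.d/network", "restart"], check=False)


def watchdog_step(failures, reachable, threshold=3):
    """Update the consecutive-failure count.

    Returns (new_failure_count, should_restart). The count resets after a
    successful ping and after a restart has been requested.
    """
    if reachable:
        return 0, False
    failures += 1
    if failures >= threshold:
        return 0, True
    return failures, False
```

A driver loop would call `ping_ok()` against a host that should always answer (for example the default gateway), feed the result to `watchdog_step()`, and call `restart_network()` when it returns `True`. This only masks the symptom; it does not address the crash.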

Expected Results:
Machine should not drop off network / crash due to NFS load.

Additional info:
Comment 1 Ken Nishimura 2006-12-06 19:57:35 EST
Created attachment 143013 [details]
Compendium of data related to crash
Comment 2 Ken Nishimura 2006-12-07 17:59:55 EST
Tried 3.66 tg3 driver directly from Broadcom.  No change in failure mode.
Last thing on console screen:
Call Trace:
 [<c02d268d>] schedule+0x83d/0x8db                 
 [<c02d26bd>] schedule+0x86d/0x8db                                              
 [<c0105157>] sys_rt_sigsuspend+0xed/0x108     
 [<c02d47cb>] syscall_call+0x7/0xb                                              
lssysmon      S EF18F980  2980  7612   7603  7613               (NOTLB)
e3ed6f9c 00000082 0000000a ef18f980 ef18f680 f002c030 00000019 c180ede0
       f002c030 00000000 c1817740 c1816de0 00000001 00000000 2ea20840 000f4416
       f002c030 efd41130 efd4129c 00000001 e3ed6000 e3ed6000 e3ed6fac 08075a80
Call Trace:                                                                   
 [<c0105157>] sys_rt_sigsuspend+0xed/0x108                             
 [<c02d47cb>] syscall_call+0x7/0xb                 
lssys         R running  2540  7613   7612                     (NOTLB)
tg3: eth0: transmit timed out, resetting                              
 [<c015af11>] vfs_read+0xb6/0xe2
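
The key line in the trace above is the tg3 transmit timeout. When comparing test runs, a captured console or syslog file can be scanned for that message per interface. The following is a minimal sketch; the function name is hypothetical and the pattern is taken only from the single message shown above, so other tg3 message variants would not be counted.

```python
import re

# Matches the driver message seen in the console output above.
TX_TIMEOUT = re.compile(r"tg3: (\w+): transmit timed out, resetting")


def count_tx_timeouts(lines):
    """Return a dict mapping interface name to number of timeout messages."""
    counts = {}
    for line in lines:
        match = TX_TIMEOUT.search(line)
        if match:
            iface = match.group(1)
            counts[iface] = counts.get(iface, 0) + 1
    return counts
```

Passing an open log file (for example `count_tx_timeouts(open(path))`, with `path` pointing at a saved console capture) gives a quick count to compare between driver and firmware combinations.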

Comment 3 Ken Nishimura 2006-12-13 15:25:03 EST
OK, we have some progress to report:

Upgraded the NIC (NC7782 built-in BCM5704 based) firmware from 3.26 to 3.27b. 
This was an adventure in itself as the utility provided by HP for on-line Linux
upgrade does NOT work, though it gives no error message.  Had to get an MS-DOS
floppy(!) to do the FW upgrade.

We have now successfully stress tested the machine for 24 hours with FW 3.27b
and the 3.66 tg3 driver from Broadcom.

When we reverted to tg3 driver 3.52rh (as supplied by RH in RHEL4U4),
the same stress test caused a crash within 90 minutes.  HP does note that the
new FW requires driver 3.58b or higher, so this may not be unexpected.
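
The driver requirement can be checked mechanically against the versions mentioned in this report. The sketch below is an assumption about how to compare tg3 version strings: it compares only the numeric `major.minor` part and ignores suffixes such as `rh`, `b`, or `d`, so a plain `3.58` would be treated as meeting `3.58b`. The function names are hypothetical.

```python
import re


def parse_driver_version(text):
    """Split a version such as '3.52rh' into ((3, 52), 'rh')."""
    match = re.match(r"^(\d+)\.(\d+)(.*)$", text.strip())
    if match is None:
        raise ValueError("unrecognised driver version: %r" % text)
    major, minor, suffix = match.groups()
    return (int(major), int(minor)), suffix


def meets_minimum(installed, minimum="3.58b"):
    """Compare numeric parts only; letter suffixes are ignored (assumption)."""
    installed_num, _ = parse_driver_version(installed)
    minimum_num, _ = parse_driver_version(minimum)
    return installed_num >= minimum_num
```

Under these assumptions the stock 3.52rh driver falls below the stated minimum, while Broadcom's 3.66 driver meets it, which is consistent with the test results above.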

It is looking like a combination of buggy FW and old drivers...

Ken
Comment 4 Ken Nishimura 2006-12-21 13:01:22 EST
John Linville's experimental kernel:

 2.6.9-42.32.EL.jwltest.180smp

in conjunction with HP's latest NIC FW:

#/usr/sbin/hpnicfwupg -c
MAC          PCI-ID              BC   PXE   IPMI  UMP  NIC
001185C1A841 14E4-1648-0E11-00D0 3.27 - - - 2.36  - -  NetXtreme BCM5704 Gigabit Ethernet
001185C1A840 14E4-1648-0E11-00D0 3.27 - - - 2.36  - -  NetXtreme BCM5704 Gigabit Ethernet

appears to have resolved the issue.
Comment 6 John W. Linville 2007-01-10 10:00:17 EST
Would you mind verifying with Jason Baron's test kernels?

   http://people.redhat.com/~jbaron/rhel4/

Those should have the same version of tg3 that is in my kernels, but are closer 
overall to what will become the official RHEL 4.5 kernels.  Do they also 
resolve this issue?
Comment 7 Ken Nishimura 2007-01-10 10:40:40 EST
OK, will download and test kernel.  Because testing is by exhaustion, this will
take 24-36 hours.

Ken
Comment 8 Ken Nishimura 2007-01-10 18:51:09 EST
FAILURE!!!!!

Linux dumbo 2.6.9-42.39.ELsmp #1 SMP Fri Jan 5 18:58:47 EST 2007 i686 i686 i386
GNU/Linux

crashes under heavy NFS load.  This kernel does NOT fix the bug. 

Known fixes are:

2.6.9-42.32.EL.jwltest.180smp
OR
2.6.9-42.0.3.ELsmp with Broadcom's 3.66d tg3 driver module

Please fix before releasing U5.

Ken
Comment 9 John W. Linville 2007-01-11 15:46:46 EST
Well, I'm at a loss.  The tg3 driver between those kernels is the same.

FWIW, I have published a jwltest.181 kernel.  Please give that a try and post 
the results.  I'm not sure what it will tell us, but it would be good to know 
if things change.
Comment 11 John W. Linville 2007-01-26 11:21:48 EST
Based on customer comments, I believe this issue to no longer be reproducible.
