Bug 103267 - tg3 locks up under heavy traffic, recovers
Summary: tg3 locks up under heavy traffic, recovers
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.3
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Arjan van de Ven
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2003-08-28 11:15 UTC by Jan Iven
Modified: 2007-04-18 16:57 UTC
CC List: 1 user

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2004-09-30 15:41:29 UTC
Embargoed:



Description Jan Iven 2003-08-28 11:15:36 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030703

Description of problem:
Under heavy traffic on a dual 2.4 GHz Xeon machine, the tg3 driver locks up after
about 90 minutes. We get messages like:

NETDEV WATCHDOG: eth0: transmit timed out
tg3: eth0: transmit timed out, resetting
tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2

with timeouts long enough that NFS/AFS connections time out. The system
eventually recovers by itself (which is good, btw).
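A quick way to tell whether a test machine hit this lockup, and how often, is to count the driver's messages in the kernel log. This is a minimal sketch, assuming syslog writes kernel messages to a plain-text file such as /var/log/messages; the path and the helper name count_tg3_timeouts are assumptions and not part of the original report.

```shell
#!/bin/sh
# count_tg3_timeouts: hypothetical helper (not part of tg3 or any tool) that
# counts the lockup messages quoted above in a kernel log file.
# Usage (log path is an assumption): count_tg3_timeouts /var/log/messages
count_tg3_timeouts() {
    log="$1"
    # Driver-initiated resets after a watchdog transmit timeout
    resets=$(grep -c 'transmit timed out, resetting' "$log")
    # Hardware blocks that failed to stop during the reset
    stops=$(grep -c 'tg3_stop_block timed out' "$log")
    echo "resets=$resets stop_block_timeouts=$stops"
}
```

Run it before and after a soak test; a growing reset count means the lockup recurred even if the machine recovered on its own.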

Version-Release number of selected component (if applicable):
2.4.20-18.7.cernsmp (recompiled at CERN), tg3.c: 1.5

How reproducible:
Always

Steps to Reproduce:
1. Criss-cross traffic inside a farm is enough to produce this. We use
memory-to-memory transfers for tests.

    

Additional info:

I have also tried this with the tg3 driver from the current "severn" beta
(tg3.c: 1.6), recompiled inside the 2.4.20-18 kernel: same problem, same
frequency. Will try to repeat with "severn" next week.


# lspci -vvv -s 03:01.0
03:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5700 Gigabit Ethernet (rev 12)
        Subsystem: 3Com Corporation 3C996-T 1000BaseTX
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (16000ns min), cache line size 08
        Interrupt: pin A routed to IRQ 48
        Region 0: Memory at fc200000 (64-bit, non-prefetchable) [size=64K]
        Capabilities: [40] PCI-X non-bridge device.
                Command: DPERE- ERO- RBC=0 OST=0
                Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-, DC=simple, DMMRBC=0, DMOST=0, DMCRS=0, RSCEM-
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable+ DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
                Address: 08d90c80d0a44488  Data: 0002

Comment 1 Jan Iven 2003-08-29 13:53:12 UTC
FYI, the bcm5700 driver does NOT work in this case; it falls over completely
(kernel panic).

Comment 2 Jan Iven 2003-09-12 11:36:04 UTC
Next try, this time with the 2.4.21-20.1.2024.2.1.nptlsmp kernel from
severn-beta1. The lockup happens a lot sooner: the first machine went down after
30 seconds or so.

I have also tried "tg3_debug=0x7fff" (and reloaded the driver); it doesn't seem
to make any difference to verbosity.




Comment 3 Robert Binz 2004-09-14 20:46:13 UTC
I was getting the same error on a Sun V20z server running RHES 3 with the
tg3 driver. I installed the BCM Linux driver v7.3.5
from http://www.broadcom.com/drivers/downloaddrivers.php and then used
the following command to configure my network card:
ethtool -s eth0 speed 100 duplex full autoneg off
(The Cisco switch is set to 100baseT-FD.)

With these two changes, I have been able to run a 36-hour test with 495
concurrent users with no errors.

Comment 4 Bugzilla owner 2004-09-30 15:41:29 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/


