From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030703

Description of problem:
Under intensive traffic on a dual 2.4GHz Xeon, the tg3 driver locks up after ~90 minutes. We get messages like

    NETDEV WATCHDOG: eth0: transmit timed out
    tg3: eth0: transmit timed out, resetting
    tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
    tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
    tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
    tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2

with timeouts long enough that NFS/AFS connections time out. The system eventually recovers by itself (which is good, btw).

Version-Release number of selected component (if applicable):
2.4.20-18.7.cernsmp (recompiled at CERN), tg3.c: 1.5

How reproducible:
Always

Steps to Reproduce:
1. Criss-cross traffic inside a farm is enough to produce this. We use memory-to-memory transfers for tests.

Additional info:
I have also tried this with the tg3 driver from the current "severn" beta (tg3.c: 1.6), recompiled inside the 2.4.20-18 kernel. Same problem, same frequency. Will try to repeat with "severn" next week.

# lspci -vvv -s 03:01.0
03:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5700 Gigabit Ethernet (rev 12)
        Subsystem: 3Com Corporation 3C996-T 1000BaseTX
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (16000ns min), cache line size 08
        Interrupt: pin A routed to IRQ 48
        Region 0: Memory at fc200000 (64-bit, non-prefetchable) [size=64K]
        Capabilities: [40] PCI-X non-bridge device.
                Command: DPERE- ERO- RBC=0 OST=0
                Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-, DC=simple, DMMRBC=0, DMOST=0, DMCRS=0, RSCEM-
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable+ DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
                Address: 08d90c80d0a44488  Data: 0002
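The watchdog messages quoted in the report are regular enough to tally with a short script, which may help when comparing how often the lockup occurs between kernels. This is only a minimal sketch and not part of the original report: the function name `summarize_tg3_timeouts` is hypothetical, the patterns are taken solely from the lines quoted above, and it assumes the `ofs` value is printed in hexadecimal (as `ofs=c00` suggests).

```python
import re
from collections import Counter

# Patterns derived only from the messages quoted in the report.
RESET_RE = re.compile(r"tg3: (\w+): transmit timed out, resetting")
STOP_BLOCK_RE = re.compile(
    r"tg3: tg3_stop_block timed out, ofs=([0-9a-f]+) enable_bit=([0-9a-f]+)"
)


def summarize_tg3_timeouts(log_text):
    """Count tg3 resets per interface and stop_block timeouts per offset.

    Hypothetical helper; assumes `ofs` is a hexadecimal register offset.
    """
    resets = Counter()
    stop_blocks = Counter()
    for line in log_text.splitlines():
        match = RESET_RE.search(line)
        if match:
            resets[match.group(1)] += 1
            continue
        match = STOP_BLOCK_RE.search(line)
        if match:
            stop_blocks[int(match.group(1), 16)] += 1
    return resets, stop_blocks


# Sample input: the exact lines from the report.
SAMPLE = """\
NETDEV WATCHDOG: eth0: transmit timed out
tg3: eth0: transmit timed out, resetting
tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2
"""

if __name__ == "__main__":
    resets, stop_blocks = summarize_tg3_timeouts(SAMPLE)
    print(dict(resets))
    print({hex(ofs): count for ofs, count in stop_blocks.items()})
```

In practice the input would be the saved kernel log rather than `SAMPLE`; one reset followed by four `tg3_stop_block` lines matches the pattern shown in the report.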
FYI, the bcm5700 driver does NOT work in this case; it falls over completely (kernel panic).
Next try, this time with the 2.4.21-20.1.2024.2.1.nptlsmp kernel from severn-beta1. The lockup is a lot quicker: the first machine was down after 30 seconds or so. I have tried with "tg3_debug=0x7fff" (and reloading the driver), but it doesn't seem to make any difference in terms of verbosity.
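For reference, the value 0x7fff can be decoded into message categories. The sketch below assumes that `tg3_debug` is interpreted as the kernel's generic `NETIF_MSG_*` message-enable bitmap from `netdevice.h`; that interpretation is an assumption and is not stated in this report, and the helper name `decode_msg_mask` is hypothetical.

```python
# Bit values follow the NETIF_MSG_* constants in the kernel's netdevice.h.
# Assumption: tg3_debug is used as this message-enable bitmap.
NETIF_MSG_BITS = [
    ("DRV", 0x0001), ("PROBE", 0x0002), ("LINK", 0x0004), ("TIMER", 0x0008),
    ("IFDOWN", 0x0010), ("IFUP", 0x0020), ("RX_ERR", 0x0040), ("TX_ERR", 0x0080),
    ("TX_QUEUED", 0x0100), ("INTR", 0x0200), ("TX_DONE", 0x0400),
    ("RX_STATUS", 0x0800), ("PKTDATA", 0x1000), ("HW", 0x2000), ("WOL", 0x4000),
]


def decode_msg_mask(mask):
    """Return the names of the message categories enabled in `mask`."""
    return [name for name, bit in NETIF_MSG_BITS if mask & bit]


if __name__ == "__main__":
    # 0x7fff sets all fifteen category bits listed above.
    print(decode_msg_mask(0x7FFF))
```

Under that assumption, 0x7fff requests every category, so the lack of extra output would point to the driver printing little at these levels rather than to a mask that is too narrow.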
I was getting the same error on a Sun V20z server with RHES 3 loaded, using the tg3 driver. I have installed the BCM Linux driver v 7.3.5 from http://www.broadcom.com/drivers/downloaddrivers.php and then used the following to set my network card:

    ethtool -s eth0 speed 100 duplex full autoneg off

(The Cisco switch is set to 100baseT-FD.) With these two changes, I have been able to run a 36-hour test with 495 concurrent users with no errors.
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/