Bug 308041

Summary: tg3: watchdog timeout in BCM95700A6 rev 7104
Product: Red Hat Enterprise Linux 4 Reporter: Marcus Alves Grando <marcus>
Component: kernel Assignee: Andy Gospodarek <agospoda>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Martin Jenner <mjenner>
Severity: high Docs Contact:
Priority: high    
Version: 4.5 CC: agospoda, benlu, davem, duck, fhirtz, jtorrice, mcarlson, mchan, pale, peterm, tao, zbuhman
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-01-05 19:00:56 UTC Type: ---
Bug Blocks: 461297    

Description Marcus Alves Grando 2007-09-26 20:50:22 UTC
I upgraded the kernel to 2.6.9-59 (jbarton) on a Dell PE 6650; since then my
network card has had many watchdog timeouts. After turning off TSO with
ethtool, the server works fine.

Some info:

# uname -a
Linux elimba.terra.com 2.6.9-59.ELsmp #1 SMP Tue Sep 25 09:01:25 BRT 2007 i686
i686 i386 GNU/Linux

* From messages
Sep 26 13:54:56 elimba kernel: NETDEV WATCHDOG: eth1: transmit timed out
Sep 26 13:54:56 elimba kernel: tg3: eth1: transmit timed out, resetting
Sep 26 13:54:56 elimba kernel: tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
Sep 26 13:54:56 elimba kernel: tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
Sep 26 13:54:56 elimba kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
Sep 26 13:54:56 elimba kernel: tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
Sep 26 13:54:57 elimba kernel: tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
Sep 26 13:54:57 elimba kernel: tg3: eth1: Link is down.
Sep 26 13:55:00 elimba kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
Sep 26 13:55:00 elimba kernel: tg3: eth1: Flow control is off for TX and off for RX.

# lspci -vv
08:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5700 Gigabit
Ethernet (rev 14)
	Subsystem: Dell Broadcom BCM5700 1000Base-T
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR+ FastB2B-
	Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 32 (16000ns min), Cache Line Size 10
	Interrupt: pin A routed to IRQ 185
	Region 0: Memory at fba10000 (64-bit, non-prefetchable) [size=64K]
	Capabilities: [40] PCI-X non-bridge device.
		Command: DPERE- ERO- RBC=0 OST=0
		Status: Bus=255 Dev=31 Func=1 64bit+ 133MHz+ SCD- USC-, DC=simple, DMMRBC=0,
DMOST=0, DMCRS=0, RSCEM-
	Capabilities: [48] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [50] Vital Product Data
	Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
		Address: 12e930b1d68dea04  Data: a194

08:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5700 Gigabit
Ethernet (rev 14)
	Subsystem: Dell Broadcom BCM5700 1000Base-T
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR+ FastB2B-
	Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 32 (16000ns min), Cache Line Size 10
	Interrupt: pin A routed to IRQ 193
	Region 0: Memory at fba00000 (64-bit, non-prefetchable) [size=64K]
	Capabilities: [40] PCI-X non-bridge device.
		Command: DPERE- ERO- RBC=0 OST=0
		Status: Bus=255 Dev=31 Func=1 64bit+ 133MHz+ SCD- USC-, DC=simple, DMMRBC=0,
DMOST=0, DMCRS=0, RSCEM-
	Capabilities: [48] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [50] Vital Product Data
	Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
		Address: 39790c2c58ab9d20  Data: f291

# lspci -n
08:01.0 Class 0200: 14e4:1644 (rev 14)
08:02.0 Class 0200: 14e4:1644 (rev 14)

Maybe turning off TSO on this model is a good idea.
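
The `tg3_stop_block timed out, ofs=...` lines in the log above identify hardware blocks only by register offset. A minimal Python sketch that maps those offsets to `*_MODE` register names follows. It assumes the offsets match the `tg3.h` of the driver in use; they are recalled from the upstream header, so verify them against your source tree. `decode_stop_block` is a hypothetical helper name, not part of any tool.

```python
import re

# Offsets of the *_MODE registers, as recalled from upstream tg3.h.
# Assumption: verify against the tg3.h of the driver version in use.
TG3_MODE_REGS = {
    0x1800: "SNDBDI_MODE",
    0x2400: "RCVDBDI_MODE",
    0x2c00: "RCVBDI_MODE",
    0x3400: "RCVLSC_MODE",
    0x4800: "RDMAC_MODE",
}

STOP_BLOCK_RE = re.compile(
    r"tg3_stop_block timed out, ofs=([0-9a-f]+) enable_bit=(\d+)"
)

def decode_stop_block(line):
    """Return (register name, enable bit) for a tg3_stop_block line, or None."""
    m = STOP_BLOCK_RE.search(line)
    if m is None:
        return None
    offset = int(m.group(1), 16)
    name = TG3_MODE_REGS.get(offset, "unknown(0x%x)" % offset)
    return name, int(m.group(2))

sample = ("Sep 26 13:54:56 elimba kernel: tg3: tg3_stop_block timed out, "
          "ofs=2c00 enable_bit=2")
print(decode_stop_block(sample))
```

Under that assumption, the five offsets in the log correspond to receive, send and read-DMA blocks that did not stop when the driver tried to reset the chip.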

Comment 1 Marcus Alves Grando 2007-09-26 20:53:42 UTC
I forgot to include this:

# modinfo tg3
filename:       /lib/modules/2.6.9-59.ELsmp/kernel/drivers/net/tg3.ko
author:         David S. Miller (davem) and Jeff Garzik
(jgarzik)
description:    Broadcom Tigon3 ethernet driver
license:        GPL
version:        3.77 71727D9639669384D3745EA
parm:           tg3_debug:Tigon3 bitmapped debugging message enable value
vermagic:       2.6.9-59.ELsmp SMP 686 REGPARM 4KSTACKS gcc-3.4

Comment 2 Andy Gospodarek 2007-09-26 21:29:02 UTC
Do you see this problem only when sending large amounts of traffic on the network?

Comment 3 Marcus Alves Grando 2007-09-26 22:15:19 UTC
(In reply to comment #2)
> Do you see this problem only when sending large amounts of traffic on the network?

Yes, it occurs only when this server has heavy network usage.

Regards

Comment 4 Marcus Alves Grando 2007-10-16 22:34:07 UTC
More info about that:

I rebuilt the jbarton kernel 2.6.9-62 with the tg3 3.81 update and installed it
on this server. After that I see that TSO is disabled by default on this NIC
model. However, after some time I saw the watchdog timeout again. See below:

NETDEV WATCHDOG: eth1: transmit timed out
tg3: eth1: transmit timed out, resetting
tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
tg3: eth1: Link is down.
tg3: eth1: Link is up at 1000 Mbps, full duplex.
tg3: eth1: Flow control is off for TX and off for RX.
NETDEV WATCHDOG: eth1: transmit timed out
tg3: eth1: transmit timed out, resetting
tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
tg3: eth1: Link is down.
tg3: eth1: Link is up at 1000 Mbps, full duplex.
tg3: eth1: Flow control is off for TX and off for RX.
NETDEV WATCHDOG: eth1: transmit timed out
tg3: eth1: transmit timed out, resetting
tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
tg3: eth1: Link is down.
tg3: eth1: Link is up at 1000 Mbps, full duplex.
tg3: eth1: Flow control is off for TX and off for RX.

At this time the network traffic on this card is about 30 Mbps. eth1 on this
server is used only for NFS.
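
To see how often each interface hits the watchdog, the log can be tallied per device. A small sketch, assuming the kernel message text is exactly as shown in the log above; `count_watchdog_timeouts` is a hypothetical helper name.

```python
import re
from collections import Counter

WATCHDOG_RE = re.compile(r"NETDEV WATCHDOG: (\w+): transmit timed out")

def count_watchdog_timeouts(log_text):
    """Count NETDEV WATCHDOG transmit timeouts per interface."""
    return Counter(WATCHDOG_RE.findall(log_text))

log = """\
NETDEV WATCHDOG: eth1: transmit timed out
tg3: eth1: transmit timed out, resetting
tg3: eth1: Link is down.
NETDEV WATCHDOG: eth1: transmit timed out
tg3: eth1: transmit timed out, resetting
"""
print(dict(count_watchdog_timeouts(log)))
```

Only the `NETDEV WATCHDOG` line is counted, so the follow-up `tg3: ... resetting` message does not double the total.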

dmesg on boot

tg3.c:v3.81 (September 5, 2007)
divert: allocating divert_blk for eth0
eth0: Tigon3 [partno(BCM95700A6) rev 7104 PHY(5411)] (PCI:66MHz:64-bit)
10/100/1000Base-T Ethernet 00:0d:56:70:df:4d
eth0: RXcsums[1] LinkChgREG[1] MIirq[1] ASF[0] WireSpeed[0] TSOcap[0]
eth0: dma_rwctrl[76ff000f] dma_mask[64-bit]
divert: allocating divert_blk for eth1
eth1: Tigon3 [partno(BCM95700A6) rev 7104 PHY(5411)] (PCI:66MHz:64-bit)
10/100/1000Base-T Ethernet 00:0d:56:70:df:4e
eth1: RXcsums[1] LinkChgREG[1] MIirq[1] ASF[0] WireSpeed[0] TSOcap[0]
eth1: dma_rwctrl[76ff000f] dma_mask[64-bit]
tg3: eth0: Link is up at 1000 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.
tg3: eth1: Link is up at 1000 Mbps, full duplex.
tg3: eth1: Flow control is off for TX and off for RX.

# modinfo tg3
filename:       /lib/modules/2.6.9-62.EL.mnag.2smp/kernel/drivers/net/tg3.ko
author:         David S. Miller (davem) and Jeff Garzik
(jgarzik)
description:    Broadcom Tigon3 ethernet driver
license:        GPL
version:        3.81 7761AE4F01E4189F0085C8E
parm:           tg3_debug:Tigon3 bitmapped debugging message enable value
vermagic:       2.6.9-62.EL.mnag.2smp SMP 686 REGPARM 4KSTACKS gcc-3.4

Comment 5 Marcus Alves Grando 2007-10-18 16:47:54 UTC
OK, I rebuilt the kernel from the src.rpm so that tg3_tx_timeout() calls
tg3_dump_state().

I don't know if it helps; the result is below:

NETDEV WATCHDOG: eth1: transmit timed out
tg3: eth1: transmit timed out, resetting
DEBUG: PCI status [02b0] TG3PCI state[0000008e]
DEBUG: MAC_MODE[00e04c04] MAC_STATUS[01400003]
       MAC_EVENT[00000000] MAC_LED_CTRL[00000100]
DEBUG: MAC_TX_MODE[00000002] MAC_TX_STATUS[00000008]
       MAC_RX_MODE[00000402] MAC_RX_STATUS[00000000]
DEBUG: SNDDATAI_MODE[00000002] SNDDATAI_STATUS[00000000]
       SNDDATAI_STATSCTRL[00000003]
DEBUG: SNDDATAC_MODE[00000002]
DEBUG: SNDBDS_MODE[00000006] SNDBDS_STATUS[00000000]
DEBUG: SNDBDI_MODE[00000006] SNDBDI_STATUS[00000000]
DEBUG: SNDBDC_MODE[00000002]
DEBUG: RCVLPC_MODE[00000002] RCVLPC_STATUS[00000000]
       RCVLPC_STATSCTRL[00000001]
DEBUG: RCVDBDI_MODE[00000012] RCVDBDI_STATUS[00000000]
DEBUG: RCVDCC_MODE[00000006]
DEBUG: RCVBDI_MODE[00000006] RCVBDI_STATUS[00000000]
DEBUG: RCVCC_MODE[00000006] RCVCC_STATUS[00000000]
DEBUG: RCVLSC_MODE[00000006] RCVLSC_STATUS[00000000]
DEBUG: MBFREE_MODE[00000002] MBFREE_STATUS[00000000]
DEBUG: HOSTCC_MODE[00000002] HOSTCC_STATUS[00000000]
DEBUG: HOSTCC_STATS_BLK_HOST_ADDR[000000000d568000]
DEBUG: HOSTCC_STATUS_BLK_HOST_ADDR[0000000024870000]
DEBUG: HOSTCC_STATS_BLK_NIC_ADDR[00000300]
DEBUG: HOSTCC_STATUS_BLK_NIC_ADDR[00000b00]
DEBUG: MEMARB_MODE[00000002] MEMARB_STATUS[00000000]
DEBUG: BUFMGR_MODE[00000006] BUFMGR_STATUS[00000010]
DEBUG: BUFMGR_MB_POOL_ADDR[00008000] BUFMGR_MB_POOL_SIZE[00018000]
DEBUG: BUFMGR_DMA_DESC_POOL_ADDR[00002000] BUFMGR_DMA_DESC_POOL_SIZE[00002000]
DEBUG: RDMAC_MODE[000003fe] RDMAC_STATUS[00000000]
DEBUG: WDMAC_MODE[000003fe] WDMAC_STATUS[00000000]
DEBUG: DMAC_MODE[00000002]
DEBUG: GRC_MODE[04130034] GRC_MISC_CFG[0001f082]
DEBUG: GRC_LOCAL_CTRL[01009709]
DEBUG: RCVDBDI_JUMBO_BD[0000000000000000:00000002:00000000]
DEBUG: RCVDBDI_STD_BD[00000000229a8000:06000000:00006000]
DEBUG: RCVDBDI_MINI_BD[0000000000000000:00000002:00000000]
DEBUG: SRAM_SEND_RCB_0[000000001e838000:02000000:00004000]
DEBUG: SRAM_RCV_RET_RCB_0[0000000036a68000:04000000:00000000]
DEBUG: SRAM_STATUS_BLK[00000001:00000000:01990000:00000000:003a0399]
DEBUG: Host status block [00000000:00000000:(0000:0199:0000):(0399:003a)]
DEBUG: Host statistics block [00000000:00000000:00000000:00000000]
DEBUG: SNDHOST_PROD[0000000000000026] SNDNIC_PROD[000000000000007e]
DEBUG: NIC TXD(0)[00000000:0304ba02:002a0004:00000000]
DEBUG: NIC TXD(1)[00000000:0332f602:002a0004:00000000]
DEBUG: NIC TXD(2)[00000000:37d7da02:002a0004:00000000]
DEBUG: NIC TXD(3)[00000000:37e35802:002a0004:00000000]
DEBUG: NIC TXD(4)[00000000:37d7d202:002a0004:00000000]
DEBUG: NIC TXD(5)[00000000:37fc6602:002a0004:00000000]
DEBUG: NIC RXD_STD(0)[0][00000000:1c2bb812:00000040:00000004]
DEBUG: NIC RXD_STD(0)[1][00006bb4:00000000:00000000:00010180]
DEBUG: NIC RXD_STD(1)[0][00000000:0d832012:00000040:00000004]
DEBUG: NIC RXD_STD(1)[1][00006bb8:00000000:00000000:00010181]
DEBUG: NIC RXD_STD(2)[0][00000000:18ad8812:00000040:00000004]
DEBUG: NIC RXD_STD(2)[1][00006c4c:00000000:00000000:00010182]
DEBUG: NIC RXD_STD(3)[0][00000000:090d4012:00000040:00000004]
DEBUG: NIC RXD_STD(3)[1][00004594:00000000:00000000:00010183]
DEBUG: NIC RXD_STD(4)[0][00000000:2e740812:00000107:00003004]
DEBUG: NIC RXD_STD(4)[1][ffffffff:00000000:00000000:00010184]
DEBUG: NIC RXD_STD(5)[0][00000000:0a49f012:00000040:00000004]
DEBUG: NIC RXD_STD(5)[1][00004594:00000000:00000000:00010185]
DEBUG: NIC RXD_JUMBO(0)[0][0a0e1854:3c15d171:3db6bd4c:d7d77251]
DEBUG: NIC RXD_JUMBO(0)[1][84e41061:52a7221d:bc2576b2:5ec87a1b]
DEBUG: NIC RXD_JUMBO(1)[0][349c8333:c3612bd9:7fdb1ef7:267e9c33]
DEBUG: NIC RXD_JUMBO(1)[1][c2272f85:1ee96985:071fa587:5a526f09]
DEBUG: NIC RXD_JUMBO(2)[0][81f926a3:f08b860d:787552bc:66df5b7e]
DEBUG: NIC RXD_JUMBO(2)[1][c5640bb1:6ec792e3:8388ad34:e18df4a5]
DEBUG: NIC RXD_JUMBO(3)[0][be6a696e:0a440ca9:53668abf:ebb2da3b]
DEBUG: NIC RXD_JUMBO(3)[1][dae40140:1814eccf:f7e855ba:5539f469]
DEBUG: NIC RXD_JUMBO(4)[0][c2a23bf3:73bc7680:bf48b1a7:7bc5a4fc]
DEBUG: NIC RXD_JUMBO(4)[1][c4f751cd:e5e2096d:83ff78e6:77dc9d3b]
DEBUG: NIC RXD_JUMBO(5)[0][16dc7de5:5cd1657f:f2f9d5fb:9fedc6cc]
DEBUG: NIC RXD_JUMBO(5)[1][5a89287e:78413466:a6ba56fb:bff73134]
tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
tg3: eth1: Link is down.
tg3: eth1: Link is up at 100 Mbps, full duplex.
tg3: eth1: Flow control is off for TX and off for RX.
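
The dump above includes the host send producer index (`SNDHOST_PROD`) and the host status block. The sketch below estimates how many send descriptors were still outstanding at the time of the timeout. It rests on two assumptions that should be checked against the driver source: that the last field of the `Host status block` line is the tx consumer index, and that the send ring has 512 entries (`TG3_TX_RING_SIZE` upstream). `tx_outstanding` is a hypothetical helper name.

```python
TX_RING_SIZE = 512  # assumption: TG3_TX_RING_SIZE in this driver version

def tx_outstanding(host_prod, tx_cons, ring_size=TX_RING_SIZE):
    """Descriptors queued by the host but not yet reported consumed by the NIC.

    The ring size must be a power of two for the mask to act as a modulo.
    """
    return (host_prod - tx_cons) & (ring_size - 1)

# Values read from the dump above:
#   SNDHOST_PROD[0000000000000026] and Host status block [...:(0399:003a)]
print(tx_outstanding(0x26, 0x3a))
```

If both assumptions hold, the result would suggest the send ring was close to full when the watchdog fired. That is a reading aid only, not a diagnosis.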

Comment 6 Michael Chan 2007-10-18 17:53:22 UTC
TSO is not supported by 5700 at all and should not be enabled by any version 
of the driver.  I don't see anything obvious in the register dump.

Which version of tg3 before 3.77 did not show this tx timeout problem?



Comment 7 Marcus Alves Grando 2007-10-18 18:23:28 UTC
(In reply to comment #6)
> TSO is not supported by 5700 at all and should not be enabled by any version 
> of the driver.  I don't see anything obvious in the register dump.

Now I know that. ;)

> 
> Which version of tg3 before 3.77 did not show this tx timeout problem?

I really don't know, because this server has always had this problem.

I rebuilt the latest Red Hat kernel again with the 3.81 driver update and
enabled debug on tg3 again. See below:

# uname -r
2.6.9-62.EL.smp

# modinfo tg3
filename:       /lib/modules/2.6.9-62.EL.smp/kernel/drivers/net/tg3.ko
author:         David S. Miller (davem) and Jeff Garzik
(jgarzik)
description:    Broadcom Tigon3 ethernet driver
license:        GPL
version:        3.81 367C507F175A91EAAFC7F7D
parm:           tg3_debug:Tigon3 bitmapped debugging message enable value
vermagic:       2.6.9-62.EL.smp SMP 686 REGPARM 4KSTACKS gcc-3.4

# dmesg | egrep "(tg3|eth)"
divert: not allocating divert_blk for non-ethernet device lo
tg3.c:v3.81 (September 5, 2007)
divert: allocating divert_blk for eth0
eth0: Tigon3 [partno(BCM95700A6) rev 7104 PHY(5411)] (PCI:66MHz:64-bit)
10/100/1000Base-T Ethernet 00:0d:56:70:df:71
eth0: RXcsums[1] LinkChgREG[1] MIirq[1] ASF[0] WireSpeed[0] TSOcap[0]
eth0: dma_rwctrl[76ff000f] dma_mask[64-bit]
divert: allocating divert_blk for eth1
eth1: Tigon3 [partno(BCM95700A6) rev 7104 PHY(5411)] (PCI:66MHz:64-bit)
10/100/1000Base-T Ethernet 00:0d:56:70:df:72
eth1: RXcsums[1] LinkChgREG[1] MIirq[1] ASF[0] WireSpeed[0] TSOcap[0]
eth1: dma_rwctrl[76ff000f] dma_mask[64-bit]
tg3: eth0: Link is up at 1000 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.
tg3: eth1: Link is up at 100 Mbps, full duplex.
tg3: eth1: Flow control is off for TX and off for RX.
NETDEV WATCHDOG: eth1: transmit timed out
tg3: eth1: transmit timed out, resetting
tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
tg3: DEBUG: PCI status [02b0] TG3PCI state[0000008e]
tg3: DEBUG: MAC_MODE[00e04c04] MAC_STATUS[05401013]
tg3: DEBUG: MAC_TX_MODE[00000002] MAC_TX_STATUS[00000008]
tg3: DEBUG: SNDDATAI_MODE[00000002] SNDDATAI_STATUS[00000000]
tg3: DEBUG: SNDDATAC_MODE[00000002]
tg3: DEBUG: SNDBDS_MODE[00000006] SNDBDS_STATUS[00000000]
tg3: DEBUG: SNDBDI_MODE[00000006] SNDBDI_STATUS[00000000]
tg3: DEBUG: SNDBDC_MODE[00000002]
tg3: DEBUG: RCVLPC_MODE[00000002] RCVLPC_STATUS[00000000]
tg3: DEBUG: RCVDBDI_MODE[00000012] RCVDBDI_STATUS[00000000]
tg3: DEBUG: RCVDCC_MODE[00000006]
tg3: DEBUG: RCVBDI_MODE[00000006] RCVBDI_STATUS[00000000]
tg3: DEBUG: RCVCC_MODE[00000006] RCVCC_STATUS[00000000]
tg3: DEBUG: RCVLSC_MODE[00000006] RCVLSC_STATUS[00000000]
tg3: DEBUG: MBFREE_MODE[00000002] MBFREE_STATUS[00000000]
tg3: DEBUG: HOSTCC_MODE[00000002] HOSTCC_STATUS[00000000]
tg3: DEBUG: HOSTCC_STATS_BLK_HOST_ADDR[0000000036529000]
tg3: DEBUG: HOSTCC_STATUS_BLK_HOST_ADDR[0000000036574000]
tg3: DEBUG: HOSTCC_STATS_BLK_NIC_ADDR[00000300]
tg3: DEBUG: HOSTCC_STATUS_BLK_NIC_ADDR[00000b00]
tg3: DEBUG: MEMARB_MODE[00000002] MEMARB_STATUS[00000000]
tg3: DEBUG: BUFMGR_MODE[00000006] BUFMGR_STATUS[00000010]
tg3: DEBUG: BUFMGR_MB_POOL_ADDR[00008000] BUFMGR_MB_POOL_SIZE[00018000]
tg3: DEBUG: BUFMGR_DMA_DESC_POOL_ADDR[00002000] BUFMGR_DMA_DESC_POOL_SIZE[00002000]
tg3: DEBUG: RDMAC_MODE[000003fe] RDMAC_STATUS[00000000]
tg3: DEBUG: WDMAC_MODE[000003fe] WDMAC_STATUS[00000000]
tg3: DEBUG: DMAC_MODE[00000002]
tg3: DEBUG: GRC_MODE[04130034] GRC_MISC_CFG[0001f082]
tg3: DEBUG: GRC_LOCAL_CTRL[01009709]
tg3: DEBUG: RCVDBDI_JUMBO_BD[0000000000000000:00000002:00000000]
tg3: DEBUG: RCVDBDI_STD_BD[0000000035ed4000:06000000:00006000]
tg3: DEBUG: RCVDBDI_MINI_BD[0000000000000000:00000002:00000000]
tg3: DEBUG: SRAM_SEND_RCB_0[0000000035ee0000:02000000:00004000]
tg3: DEBUG: SRAM_RCV_RET_RCB_0[0000000035ed8000:04000000:00000000]
tg3: DEBUG: SRAM_STATUS_BLK[00000001:00000000:01fd0000:00000000:01d203fd]
tg3: DEBUG: Host status block [00000000:00000000:(0000:01fd:0000):(03fd:01d2)]
tg3: DEBUG: Host statistics block [00000000:00000000:00000000:00000000]
tg3: DEBUG: SNDHOST_PROD[00000000000001be] SNDNIC_PROD[0000000000000016]
tg3: DEBUG: NIC TXD(0)[00000000:030bba02:002a0004:00000000]
tg3: DEBUG: NIC TXD(1)[00000000:030fa202:002a0004:00000000]
tg3: DEBUG: NIC TXD(2)[00000000:030ffe02:002a0004:00000000]
tg3: DEBUG: NIC TXD(3)[00000000:030be402:005a0005:00000000]
tg3: DEBUG: NIC TXD(4)[00000000:03090602:002a0004:00000000]
tg3: DEBUG: NIC TXD(5)[00000000:0309a602:002a0004:00000000]
tg3: DEBUG: NIC RXD_STD(0)[0][00000000:35ee7012:00000108:00003004]
tg3: DEBUG: NIC RXD_STD(0)[1][ffffffff:00000000:00000000:00010180]
tg3: DEBUG: NIC RXD_STD(1)[0][00000000:35a7e812:00000040:00000004]
tg3: DEBUG: NIC RXD_STD(1)[1][00006bb9:00000000:00000000:00010181]
tg3: DEBUG: NIC RXD_STD(2)[0][00000000:35a46012:00000040:00000004]
tg3: DEBUG: NIC RXD_STD(2)[1][0000d11e:00000000:00000000:00010182]
tg3: DEBUG: NIC RXD_STD(3)[0][00000000:35d7f812:00000040:00000004]
tg3: DEBUG: NIC RXD_STD(3)[1][00006c4b:00000000:00000000:00010183]
tg3: DEBUG: NIC RXD_STD(4)[0][00000000:35ee5012:00000040:00000004]
tg3: DEBUG: NIC RXD_STD(4)[1][00006b9f:00000000:00000000:00010184]
tg3: DEBUG: NIC RXD_STD(5)[0][00000000:35a82812:00000040:00000004]
tg3: DEBUG: NIC RXD_STD(5)[1][00006ba3:00000000:00000000:00010185]
tg3: DEBUG: NIC RXD_JUMBO(0)[0][020e9254:2c11d913:3db69d7c:d7c77250]
tg3: DEBUG: NIC RXD_JUMBO(0)[1][84c41071:12a7229c:fe2576b2:7ac93a1b]
tg3: DEBUG: NIC RXD_JUMBO(1)[0][34bc8b33:83612bd9:7fdb1ef7:267e1c33]
tg3: DEBUG: NIC RXD_JUMBO(1)[1][e2232f85:9e096984:471ba587:9a5a3f99]
tg3: DEBUG: NIC RXD_JUMBO(2)[0][81f922a3:e08b8e8c:587552bc:66ff5b7e]
tg3: DEBUG: NIC RXD_JUMBO(2)[1][c56403b1:6ec793e3:8394ad34:e18df0a0]
tg3: DEBUG: NIC RXD_JUMBO(3)[0][be6a412e:0bc61ca9:d36e8abf:e1b2da3b]
tg3: DEBUG: NIC RXD_JUMBO(3)[1][f8e40040:3810eccf:f7ead13e:d539f469]
tg3: DEBUG: NIC RXD_JUMBO(4)[0][c2a238f3:73ac7680:bb68b1a7:7bc5a4fc]
tg3: DEBUG: NIC RXD_JUMBO(4)[1][d1f751cd:e56a1d69:83f778e6:77cc9d1b]
tg3: DEBUG: NIC RXD_JUMBO(5)[0][17dc7de7:5cd0655e:f3f9ddfa:9fed46cc]
tg3: DEBUG: NIC RXD_JUMBO(5)[1][5a89287e:f84124e6:a6ba56f3:bff73534]
tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
tg3: eth1: Link is down.
tg3: eth1: Link is up at 100 Mbps, full duplex.
tg3: eth1: Flow control is off for TX and off for RX.

# ethtool eth1
Settings for eth1:
	Supported ports: [ MII ]
	Supported link modes:   10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	                        1000baseT/Half 1000baseT/Full 
	Supports auto-negotiation: Yes
	Advertised link modes:  10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	                        1000baseT/Half 1000baseT/Full 
	Advertised auto-negotiation: Yes
	Speed: 100Mb/s
	Duplex: Full
	Port: Twisted Pair
	PHYAD: 1
	Transceiver: internal
	Auto-negotiation: on
	Supports Wake-on: g
	Wake-on: d
	Current message level: 0x000000ff (255)
	Link detected: yes

# ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
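
When comparing offload settings across several servers, the `ethtool -k` output above can be reduced to a small dictionary. A minimal sketch, assuming the older output format shown here (one `name: on|off` pair per line); `parse_offloads` is a hypothetical helper name.

```python
def parse_offloads(ethtool_k_output):
    """Parse `ethtool -k` output into {feature name: True/False}."""
    features = {}
    for line in ethtool_k_output.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # The "Offload parameters for ethX:" header has no on/off value.
        if sep and value in ("on", "off"):
            features[name.strip()] = (value == "on")
    return features

sample = """\
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
"""
offloads = parse_offloads(sample)
print(offloads["tcp segmentation offload"], offloads["tx-checksumming"])
```

On a live system the text would come from running `ethtool -k eth1`. Settings are changed with the capital `-K` form, for example `ethtool -K eth1 tx off`.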

Comment 8 Marcus Alves Grando 2007-10-18 18:33:03 UTC
Maybe disabling "tx-checksumming: on" can help? In dmesg I see RXcsums[1] but
nothing about TX checksum. Does that make any sense?

Regards

Comment 9 Andy Gospodarek 2007-10-18 18:51:33 UTC
(In reply to comment #8)
> Maybe disabling "tx-checksumming: on" can help? In dmesg I see RXcsums[1] but
> nothing about TX checksum. Does that make any sense?
> 
> Regards

You could try it, but I wouldn't guess that it will save much.  By the way, was
the 3.81 update from here?

http://people.redhat.com/agospoda/#rhel4

I'm hoping that it is.  Thanks! :-)


Comment 10 Michael Chan 2007-10-18 18:52:07 UTC
I don't think checksum will have any effect on tx timeout, but you can try 
turning it off.

You have 2 5700 devices eth0 and eth1.  Do you see tx timeout only on eth1?

I just tested a 5700 NIC card with the same rev of chip using netperf and it 
ran fine.  What kind of traffic do you have on eth1?

Comment 11 Marcus Alves Grando 2007-10-18 19:53:02 UTC
(In reply to comment #9)
> (In reply to comment #8)
> > Maybe disabling "tx-checksumming: on" can help? In dmesg I see RXcsums[1] but
> > nothing about TX checksum. Does that make any sense?
> > 
> > Regards
> 
> You could try it, but I wouldn't guess that it will save much.  By the way, was
> the 3.81 update from here?
> 
> http://people.redhat.com/agospoda/#rhel4
> 
> I'm hoping that it is.  Thanks! :-)
> 

Yes. I took 3.81 from your repo.

(In reply to comment #10)
> I don't think checksum will have any effect on tx timeout, but you can try 
> turning it off.
> 
> You have 2 5700 devices eth0 and eth1.  Do you see tx timeout only on eth1?

Yes. I see that only on eth1.

> 
> I just tested a 5700 NIC card with the same rev of chip using netperf and it 
> ran fine.  What kind of traffic do you have on eth1?

It's very strange: sometimes it occurs constantly, and then more than 2 days
pass without it.

This NIC is used for NFS traffic, more precisely email traffic over NFS.

I'll try turning off tx checksum to see what happens.

Comment 12 Michael Chan 2007-10-18 20:11:10 UTC
Similar traffic goes through eth0 and eth1, but you only saw timeout on eth1? 
The 2 devices are on the same bus (08) and so if there are any issues on the 
bus, both devices should be affected.

Comment 13 Andy Gospodarek 2007-10-18 20:16:10 UTC
Marcus,

Is your MTU 1500 for these interfaces or larger?  If larger can you reproduce
this issue with an MTU of 1500?

Thanks!

Comment 14 Marcus Alves Grando 2007-10-18 20:36:42 UTC
(In reply to comment #12)
> Similar traffic goes through eth0 and eth1, but you only saw timeout on eth1? 

More or less. For example, right now eth0 receives 0.8Mb and sends 8.51Mb, and
eth1 sends 2.02Mb and receives 6.89Mb.

> The 2 devices are on the same bus (08) and so if there are any issues on the 
> bus, both devices should be affected.

Hmmm... actually eth0 is plugged into a Cisco Catalyst 297024 (IOS 12.2(25)SEB4)
and eth1 is plugged into a Cisco Catalyst 2924XLv (IOS 12.0(5)WC9a).

But I don't think that can cause the watchdog timeout on eth1.

(In reply to comment #13)
> Marcus,
> 
> Is your MTU 1500 for these interfaces or larger?  If larger can you reproduce
> this issue with an MTU of 1500?
> 
> Thanks!

On all servers where this occurs the MTU is 1500. I don't use an MTU greater
than 1500.

Regards

Comment 15 Michael Chan 2007-10-18 20:45:22 UTC
>> The 2 devices are on the same bus (08) and so if there are any issues on
>> the bus, both devices should be affected.
>
> Hmmm... actually eth0 is plugged into a Cisco Catalyst 297024 (IOS
> 12.2(25)SEB4) and eth1 is plugged into a Cisco Catalyst 2924XLv (IOS
> 12.0(5)WC9a)

I was referring to the PCI bus.  Both devices are on bus 8 based on your lspci 
output.

Comment 16 Marcus Alves Grando 2007-10-18 20:49:27 UTC
(In reply to comment #15)
> >> The 2 devices are on the same bus (08) and so if there are any issues on
> >> the bus, both devices should be affected.
> >
> > Hmmm... actually eth0 is plugged into a Cisco Catalyst 297024 (IOS
> > 12.2(25)SEB4) and eth1 is plugged into a Cisco Catalyst 2924XLv (IOS
> > 12.0(5)WC9a)
> 
> I was referring to the PCI bus.  Both devices are on bus 8 based on your lspci 
> output.

Yes, both devices are on bus 08.

08:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5700 Gigabit
Ethernet (rev 14)
08:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5700 Gigabit
Ethernet (rev 14)
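
The point about both ports sharing PCI bus 08 can be checked mechanically from `lspci -n` output. A minimal sketch that groups device addresses by bus number; `group_by_bus` is a hypothetical helper name, and the parsing assumes the `bus:device.function` address format shown in this report, optionally prefixed with a PCI domain.

```python
from collections import defaultdict

def group_by_bus(lspci_n_output):
    """Map PCI bus number -> list of device addresses from `lspci -n` output."""
    buses = defaultdict(list)
    for line in lspci_n_output.splitlines():
        fields = line.split()
        if not fields or ":" not in fields[0] or "." not in fields[0]:
            continue  # skip blank or wrapped continuation lines
        address = fields[0]      # "08:01.0" or "0000:08:01.0"
        bus = address.split(":")[-2]
        buses[bus].append(address)
    return dict(buses)

sample = """\
08:01.0 Class 0200: 14e4:1644 (rev 14)
08:02.0 Class 0200: 14e4:1644 (rev 14)
"""
print(group_by_bus(sample))
```

Two devices listed under the same key share a bus, which is the situation comment 12 describes.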

Comment 17 Marcus Alves Grando 2007-10-27 20:55:19 UTC
Any news on this?

Regards

Comment 18 Michael Chan 2007-10-29 05:47:28 UTC
Because you always see the problem on eth0 only no matter what version of the 
driver you use, I think it is possible that you have a bad chip in eth0.  Do 
you have different machines exhibiting the same problem?

Comment 19 Marcus Alves Grando 2007-11-05 18:53:54 UTC
Yes.

I have many servers with this problem. Last Friday I updated the driver to 3.84
and I see the problem there too. All the servers that I maintain have this
problem.

Any other ideas for debugging this problem?

Regards

Comment 20 Michael Chan 2007-11-05 19:11:49 UTC
OK, I'll ask our QA lab to see if they have a Dell PE6650 to reproduce the 
problem.  Can you find a simple traffic pattern (such as netperf, iperf) that 
will easily trigger the problem?  This will make it easier for us to reproduce 
the problem.

Comment 21 Marcus Alves Grando 2007-11-07 16:07:21 UTC
(In reply to comment #20)
> OK, I'll ask our QA lab to see if they have a Dell PE6650 to reproduce the 
> problem.  Can you find a simple traffic pattern (such as netperf, iperf) that 
> will easily trigger the problem?  This will make it easier for us to reproduce 
> the problem.

I'll try to use netperf/iperf to reproduce it.

One interesting point: after disabling tx checksum offload with ethtool, my
servers work normally. I made that change one week ago and they have worked
fine since.

# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: off
scatter-gather: on
tcp segmentation offload: off

Maybe it's related to tx checksum?

Regards

Comment 22 Andy Gospodarek 2007-11-07 18:45:25 UTC
Hmmm, tg3_get_invariants definitely sets checksumming off for what I would guess
is an older version of 5700.  Any chance this needs to be extended to other
versions too?


        /* 5700 B0 chips do not support checksumming correctly due
         * to hardware bugs.
         */
        if (tp->pci_chip_rev_id == CHIPREV_ID_5700_B0)
                tp->tg3_flags |= TG3_FLAG_BROKEN_CHECKSUMS;



Comment 23 Michael Chan 2007-11-07 18:57:59 UTC
He is using B4 which shouldn't have the problem any more.  But in any case, 
the checksum problem was algorithmic, meaning that it would generate the wrong 
checksum on B0 chips.  I don't understand how tx checksum can cause tx timeout 
on eth0 only and not on eth1.
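
The check quoted in comment 22 compares the whole chip revision ID against one constant, so it only fires for an exact B0 part. A small Python model of that comparison follows. It assumes `CHIPREV_ID_5700_B0` is `0x7100` (recalled from upstream `tg3.h`, verify in your tree) and that the `rev 7104` printed in dmesg is the same `pci_chip_rev_id` value the driver tests. `has_broken_checksums` is a hypothetical name.

```python
# Assumption: value recalled from upstream tg3.h; verify in your tree.
CHIPREV_ID_5700_B0 = 0x7100

def has_broken_checksums(pci_chip_rev_id):
    """Mirror the quoted check: only an exact B0 revision ID sets the flag."""
    return pci_chip_rev_id == CHIPREV_ID_5700_B0

# dmesg in this report shows "rev 7104" for both ports.
print(has_broken_checksums(0x7104))
```

Under these assumptions the flag is not set for the reporter's hardware, which matches the `tx-checksumming: on` default seen in the `ethtool -k` output.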

Comment 24 Andy Gospodarek 2008-02-26 20:30:38 UTC
(In reply to comment #23)
> He is using B4 which shouldn't have the problem any more.  But in any case, 
> the checksum problem was algorithmic, meaning that it would generate the wrong 
> checksum on B0 chips.  I don't understand how tx checksum can cause tx timeout 
> on eth0 only and not on eth1.

It would seem odd to me as well since these are on 2 different cards, right? 
The 5700 isn't a dual-port card (that is just the 5704 iirc), is it?  I could
understand if it was a 5704 where two ports are sharing one chip, so it would
seem that congestion on one port could cause problems on another port, but I
cannot guarantee that would even be a problem since I know little about the
hardware design itself.

Comment 25 Andy Gospodarek 2008-04-07 21:06:03 UTC
Marcus,

Is this still a problem?

We can try to clean up the tg3 driver and disable tx checksumming on rev B4 5700
chips, but Michael doesn't seem to think that is needed so I'm reluctant to do
that (and he knows the hardware well enough to know what is needed).

I'll take a look at the patch for tg3 that was added to 2.6.9-59 if you feel
that was the first kernel that you noticed having problems.



Comment 26 Marcus Alves Grando 2008-04-23 13:13:10 UTC
Andy,

Well, my test server that had the problem has been running for 43 days without
a problem. Now all my servers that have a tg3 NIC run either an old driver or
the 3.84 driver. I'll update to 2.6.9-69 (tg3 3.86) and see what happens.

I don't know how to reproduce this; if I get more info I'll add it here.

Regards

Comment 27 Andy Gospodarek 2008-04-23 19:14:32 UTC
Thanks for the update, Marcus.  I am glad your servers are running well, but
concerned that the problem has gone away.

Did you do anything else to the servers (like update the BIOS) recently? 

Comment 28 Marcus Alves Grando 2008-04-24 12:41:32 UTC
Ok, more news...

We have now moved some servers with this chip to AS5, and the watchdog timeout
occurs every time.

# uname -r
2.6.18-89.el5.gtest.46PAE
# cat /var/log/dmesg | egrep "(eth|tg3)"
tg3.c:v3.86-rh (November 9, 2007)
eth0: Tigon3 [partno(BCM95700A6) rev 7104 PHY(5411)] (PCI:66MHz:64-bit)
10/100/1000Base-T Ethernet 00:11:43:32:41:58
eth0: RXcsums[1] LinkChgREG[1] MIirq[1] ASF[0] WireSpeed[0] TSOcap[0]
eth0: dma_rwctrl[76ff000f] dma_mask[64-bit]
eth1: Tigon3 [partno(BCM95700A6) rev 7104 PHY(5411)] (PCI:66MHz:64-bit)
10/100/1000Base-T Ethernet 00:11:43:32:41:59
eth1: RXcsums[1] LinkChgREG[1] MIirq[1] ASF[0] WireSpeed[0] TSOcap[0]
eth1: dma_rwctrl[76ff000f] dma_mask[64-bit]
# dmesg | egrep "(eth|tg3)"
tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
tg3: eth0: Link is down.
tg3: eth0: Link is up at 1000 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.
NETDEV WATCHDOG: eth0: transmit timed out
tg3: eth0: transmit timed out, resetting
tg3: DEBUG: MAC_TX_STATUS[00000018] MAC_RX_STATUS[00000008]
tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
tg3: eth0: Link is down.
tg3: eth0: Link is up at 1000 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.
NETDEV WATCHDOG: eth0: transmit timed out
tg3: eth0: transmit timed out, resetting
tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000008]
tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
tg3: eth0: Link is down.
tg3: eth0: Link is up at 1000 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.
# ethtool -i eth0
driver: tg3
version: 3.86-rh
firmware-version: 
bus-info: 0000:08:01.0
# ethtool -i eth1
driver: tg3
version: 3.86-rh
firmware-version: 
bus-info: 0000:08:02.0

I have ethtool -d output too, but I don't know if it helps.

So, Michael, I have the hardware to test; maybe you can prepare a patch to
identify this?

Regards

Comment 29 Marcus Alves Grando 2008-04-24 17:38:15 UTC
Andy,

Maybe you can clone this bug for AS5? It's critical since tg3 3.86 is already
committed to the AS5 kernel and AS5 is in beta.

Regards

Comment 30 Andy Gospodarek 2008-04-24 18:13:57 UTC
Is the tg3 using MSI interrupts?  Problems like these seem to happen when
network cards don't operate with some bridge chips.

If your system uses MSI, can you boot with pci=nomsi on the kernel command line
and let me know how well that works?

Also an lspci -vvv from the system would be helpful.

Thanks.

Comment 31 Michael Chan 2008-04-24 18:29:42 UTC
Andy, this old chip does not support MSI.  lspci will show that MSI is 
supported but tg3 will not use MSI.

Joe@broadcom, can you see if you can reproduce this problem using the AS5 
kernel?

We can also send a debug patch to Marcus to dump all registers during 
watchdog.  ethtool -d won't help because by then the chip has been reset 
already.

Comment 32 Andy Gospodarek 2008-04-24 19:02:02 UTC
Thanks for chiming in, Michael.

I get so many of these watchdog timeouts on various drivers, and many of them
seem to come from IRQs not working well and not servicing the TX ring buffers.
Most of these come from poor interaction between MSI bridges and NICs.


Comment 33 Marcus Alves Grando 2008-04-24 19:25:15 UTC
(In reply to comment #30)
> Is the tg3 using MSI interrupts?  Problems like these seem to happen when
> network cards don't operate with some bridge chips.

# lspci -vvv
08:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5700 Gigabit Ethernet (rev 14)
	Subsystem: Dell Broadcom BCM5700 1000Base-T
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 32 (16000ns min), Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 193
	Region 0: Memory at fcd10000 (64-bit, non-prefetchable) [size=64K]
	Capabilities: [40] PCI-X non-bridge device
		Command: DPERE- ERO- RBC=512 OST=1
		Status: Dev=ff:1f.1 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=512 DMOST=1 DMCRS=8 RSCEM- 266MHz- 533MHz-
	Capabilities: [48] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [50] Vital Product Data
	Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
		Address: 6451b204961402c0  Data: c620

08:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5700 Gigabit Ethernet (rev 14)
	Subsystem: Dell Broadcom BCM5700 1000Base-T
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 32 (16000ns min), Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 201
	Region 0: Memory at fcd00000 (64-bit, non-prefetchable) [size=64K]
	Capabilities: [40] PCI-X non-bridge device
		Command: DPERE- ERO- RBC=512 OST=1
		Status: Dev=ff:1f.1 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=512 DMOST=1 DMCRS=8 RSCEM+ 266MHz- 533MHz-
	Capabilities: [48] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [50] Vital Product Data
	Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
		Address: 43a2947549442a08  Data: 1983

# cat /proc/interrupts 
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
  0:  686828617          0          0          0          0          0          0          0    IO-APIC-edge  timer
  1:         22        426         35          0         12          0          0          0    IO-APIC-edge  i8042
  6:          3          0          0          0          0          0          0          0    IO-APIC-edge  floppy
  8:          1          0          0          0          0          0          0          0    IO-APIC-edge  rtc
  9:          1          0          0          0          0          0          0          0   IO-APIC-level  acpi
 10:          0          0          0          0          0          0          0          0   IO-APIC-level  ohci_hcd:usb1
 12:        106          7          0          0          0          0          0          0    IO-APIC-edge  i8042
 14:         33        930      10175         66       2568          0          0          0    IO-APIC-edge  ide0
177:       5895     194684      14801    1586214          0    5248290          0          0   IO-APIC-level  megaraid
185:         15          0          0          0          0          0          0          0   IO-APIC-level  aic7xxx
193:        451   67297841          0       2085          0          0          0 3462533099   IO-APIC-level  eth0
201:        385    5131688       2311 1197221598          0  331115277          0          0   IO-APIC-level  eth1
NMI:          0          0          0          0          0          0          0          0
LOC:  686919525  686911835  686893240  686913213  686919644  686919498  686919687  686919644
ERR:          0
MIS:          0

So, as Michael said about MSI: lspci lists the MSI capability, but no MSI
interrupt appears in /proc/interrupts.

Michael, feel free to send me a patch to test; a watchdog timeout appears about every 5 minutes.

Thanks all.
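
The check done by eye above (is the NIC on a legacy IO-APIC interrupt or on MSI?) can be scripted. The following is only a sketch: the function name `irq_types` is made up for illustration, the sample rows are the eth0/eth1 rows from this comment written on one line each, and it assumes that a kernel using MSI for a device shows a type containing "MSI" (for example "PCI-MSI") in that column.

```python
# Sketch: report the interrupt type that /proc/interrupts shows per device.
# SAMPLE holds the eth0/eth1 rows from this comment, unwrapped.
SAMPLE = """\
193:        451   67297841          0       2085          0          0          0 3462533099   IO-APIC-level  eth0
201:        385    5131688       2311 1197221598          0  331115277          0          0   IO-APIC-level  eth1
"""


def irq_types(text):
    """Return {device name: interrupt type} for numbered IRQ rows."""
    types = {}
    for line in text.splitlines():
        fields = line.split()
        # Only rows such as "193: ..." describe device interrupts;
        # the CPU header and rows like "NMI:" or "LOC:" are skipped.
        if not fields or not fields[0].endswith(":") or not fields[0][:-1].isdigit():
            continue
        # Skip the per-CPU counters that follow the IRQ number.
        i = 1
        while i < len(fields) and fields[i].isdigit():
            i += 1
        # Need both a type column and at least one device name.
        if i + 1 < len(fields):
            irq_type = fields[i]
            # Shared IRQs list several devices separated by commas.
            for dev in " ".join(fields[i + 1:]).split(","):
                types[dev.strip()] = irq_type
    return types


for dev, irq_type in sorted(irq_types(SAMPLE).items()):
    print(dev, irq_type, "MSI" if "MSI" in irq_type else "not MSI")
```

On a live system the input would come from reading /proc/interrupts instead of the pasted sample.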

Comment 34 Marcus Alves Grando 2008-05-07 18:08:50 UTC
Michael, any news?

Comment 35 Matt Carlson 2008-05-08 20:03:38 UTC
Marcus,

A couple of things you could try:

1) Can you add the following just above 'schedule_work(&tp->reset_task);' in
   tg3_tx_timeout:

    printk(KERN_NOTICE "MAILBOX_INTERRUPT_0 = 0x%x, tp->irq_sync = %d\n",
           tr32_mailbox(MAILBOX_INTERRUPT_0 + TG3_64BIT_REG_LOW),
           tp->irq_sync);

   I want to make sure interrupts are still enabled.

2) I noticed that the link reports tx and rx flow control is off.  Is it
   possible to reproduce the problem if you connect to a switch that supports
   flow control?

3) I also noticed that PHY autopolling is turned on.  I really don't think it
   would have any effect, but could you comment out the following block of
   code in tg3_setup_copper_phy:

#if 0
    /* ??? Without this setting Netgear GA302T PHY does not
     * ??? send/receive packets...
     */
    if ((tp->phy_id & PHY_ID_MASK) == PHY_ID_BCM5411 &&
        tp->pci_chip_rev_id == CHIPREV_ID_5700_ALTIMA) {
        tp->mi_mode |= MAC_MI_MODE_AUTO_POLL;
        tw32_f(MAC_MI_MODE, tp->mi_mode);
        udelay(80);
    }
#endif


Comment 36 Marcus Alves Grando 2008-05-09 18:11:22 UTC
Well,

I've added the printk for the first item to the running kernel. Let's wait for
a new watchdog timeout.

About the second item, I need to find a switch that can do that, but I'll try.

I've added a similar patch for the third item. When the condition is true I
print a message to see when this code is executed, and the rest is commented
out. When I booted the server with the new kernel I already saw this printk;
let's wait for a new watchdog timeout.

--
if ((tp->phy_id & PHY_ID_MASK) == PHY_ID_BCM5411 && tp->pci_chip_rev_id == CHIPREV_ID_5700_ALTIMA) {
if ((tp->phy_id & PHY_ID_MASK) == PHY_ID_BCM5411 && tp->pci_chip_rev_id == CHIPREV_ID_5700_ALTIMA) {
tg3: eth0: Link is up at 1000 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.
if ((tp->phy_id & PHY_ID_MASK) == PHY_ID_BCM5411 && tp->pci_chip_rev_id == CHIPREV_ID_5700_ALTIMA) {
if ((tp->phy_id & PHY_ID_MASK) == PHY_ID_BCM5411 && tp->pci_chip_rev_id == CHIPREV_ID_5700_ALTIMA) {
tg3: eth1: Link is up at 100 Mbps, full duplex.
tg3: eth1: Flow control is off for TX and off for RX.
--

Comment 37 Marcus Alves Grando 2008-05-12 12:52:40 UTC
mcarlson,

Now it has happened again...

--dmesg--
NETDEV WATCHDOG: eth0: transmit timed out
tg3: eth0: transmit timed out, resetting
tg3: DEBUG: MAC_TX_STATUS[00000018] MAC_RX_STATUS[00000008]
tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
tg3: eth0: Link is down.
if ((tp->phy_id & PHY_ID_MASK) == PHY_ID_BCM5411 && tp->pci_chip_rev_id == CHIPREV_ID_5700_ALTIMA) {
if ((tp->phy_id & PHY_ID_MASK) == PHY_ID_BCM5411 && tp->pci_chip_rev_id == CHIPREV_ID_5700_ALTIMA) {
tg3: eth0: Link is up at 1000 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.
NETDEV WATCHDOG: eth0: transmit timed out
tg3: eth0: transmit timed out, resetting
tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
tg3: eth0: Link is down.
if ((tp->phy_id & PHY_ID_MASK) == PHY_ID_BCM5411 && tp->pci_chip_rev_id == CHIPREV_ID_5700_ALTIMA) {
if ((tp->phy_id & PHY_ID_MASK) == PHY_ID_BCM5411 && tp->pci_chip_rev_id == CHIPREV_ID_5700_ALTIMA) {
tg3: eth0: Link is up at 1000 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.
NETDEV WATCHDOG: eth0: transmit timed out
tg3: eth0: transmit timed out, resetting
tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000008]
tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
tg3: eth0: Link is down.
--

Today I'll test with flow control turned on.

Regards

Comment 38 Marcus Alves Grando 2008-05-12 12:55:49 UTC
I forgot to include an important part... sorry.

--
NETDEV WATCHDOG: eth0: transmit timed out
tg3: eth0: transmit timed out, resetting
tg3: DEBUG: MAC_TX_STATUS[00000018] MAC_RX_STATUS[00000008]
tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
MAILBOX_INTERRUPT_0 = 0x0, tp->irq_sync = 0
tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
tg3: eth0: Link is down.
if ((tp->phy_id & PHY_ID_MASK) == PHY_ID_BCM5411 && tp->pci_chip_rev_id == CHIPREV_ID_5700_ALTIMA) {
if ((tp->phy_id & PHY_ID_MASK) == PHY_ID_BCM5411 && tp->pci_chip_rev_id == CHIPREV_ID_5700_ALTIMA) {
tg3: eth0: Link is up at 1000 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.
NETDEV WATCHDOG: eth0: transmit timed out
tg3: eth0: transmit timed out, resetting
tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
MAILBOX_INTERRUPT_0 = 0x0, tp->irq_sync = 0
tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
tg3: eth0: Link is down.
--
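
The MAC_TX_STATUS values in these dumps alternate between 00000018 and 00000008. A small decoder makes the difference readable. This is a sketch under a stated assumption: the bit names are the TX_STATUS_* definitions as I recall them from the upstream tg3.h and should be checked against the header in the kernel tree actually in use. The function name `decode_tx_status` is hypothetical, and MAC_RX_STATUS is deliberately not decoded.

```python
import re

# Assumed bit layout of MAC_TX_STATUS (TX_STATUS_* in upstream tg3.h);
# verify against the driver source before relying on it.
TX_STATUS_BITS = {
    0x01: "XOFFED",
    0x02: "SENT_XOFF",
    0x04: "SENT_XON",
    0x08: "LINK_UP",
    0x10: "ODI_UNDERRUN",
    0x20: "ODI_OVERRUN",
}

TX_STATUS_RE = re.compile(r"MAC_TX_STATUS\[([0-9a-fA-F]+)\]")


def decode_tx_status(line):
    """Return the names of the bits set in the MAC_TX_STATUS field of a log line."""
    match = TX_STATUS_RE.search(line)
    if match is None:
        return None
    value = int(match.group(1), 16)
    names = [name for bit, name in sorted(TX_STATUS_BITS.items()) if value & bit]
    # Report any bits outside the assumed table instead of hiding them.
    unknown = value & ~sum(TX_STATUS_BITS)
    if unknown:
        names.append("UNKNOWN(0x%x)" % unknown)
    return names


print(decode_tx_status("tg3: DEBUG: MAC_TX_STATUS[00000018] MAC_RX_STATUS[00000008]"))
print(decode_tx_status("tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]"))
```

If the assumed table is right, the only difference between the two values seen in this bug is the ODI_UNDERRUN bit.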

Comment 39 Marcus Alves Grando 2008-05-16 18:33:11 UTC
NETDEV WATCHDOG: eth0: transmit timed out
tg3: eth0: transmit timed out, resetting
tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000008]
tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
MAILBOX_INTERRUPT_0 = 0x0, tp->irq_sync = 0
tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
tg3: eth0: Link is down.
if ((tp->phy_id & PHY_ID_MASK) == PHY_ID_BCM5411 && tp->pci_chip_rev_id == CHIPREV_ID_5700_ALTIMA) {
if ((tp->phy_id & PHY_ID_MASK) == PHY_ID_BCM5411 && tp->pci_chip_rev_id == CHIPREV_ID_5700_ALTIMA) {
tg3: eth0: Link is up at 1000 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.

Again... any news?

Comment 40 Matt Carlson 2008-05-17 00:40:13 UTC
I really am interested to see if flow control has any effect on the problem. 
The "tg3_stop_block timed out" messages appear to be telling us the internal
state machines are hung.  I'm thinking flow control will help.

Could you give me the output of 'ethtool -e <eth> offset 0x94 length 4'?

If you were to put a print message in tg3_reset_task() to tell us when it is
called, do you see it happen near these transmit timeouts?
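
To see which internal blocks the "tg3_stop_block timed out" messages refer to, the ofs= values can be mapped to register block names. The sketch below covers only the offsets that appear in this bug. The names are the *_MODE register names as I understand them from the upstream tg3.h, so treat the table as an assumption to verify against the driver source; `stalled_blocks` is a hypothetical helper name.

```python
import re

# Assumed mapping of tg3 *_MODE register offsets to block names
# (only the offsets seen in this bug report are listed).
BLOCK_NAMES = {
    0x0C00: "SNDDATAI",  # send data initiator
    0x1800: "SNDBDI",    # send BD initiator
    0x2400: "RCVDBDI",   # receive data and receive BD initiator
    0x2C00: "RCVBDI",    # receive BD initiator
    0x3400: "RCVLSC",    # receive list selector
    0x4800: "RDMAC",     # read DMA
}

STOP_BLOCK_RE = re.compile(
    r"tg3_stop_block timed out, ofs=([0-9a-fA-F]+) enable_bit=(\d+)"
)


def stalled_blocks(log_text):
    """Return block names, in log order, for every tg3_stop_block timeout."""
    names = []
    for match in STOP_BLOCK_RE.finditer(log_text):
        ofs = int(match.group(1), 16)
        names.append(BLOCK_NAMES.get(ofs, "unknown(0x%x)" % ofs))
    return names


SAMPLE = """\
tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
"""

print(stalled_blocks(SAMPLE))
```

Under that mapping, the shorter dumps (ofs=1800 and ofs=4800) stall only on the transmit side, while the longer ones also include receive blocks.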

Comment 44 Andy Gospodarek 2008-08-11 20:51:58 UTC
Matt were you suggesting that the customer enable flow control or ensure it is disabled?  From your past comments, I'm guessing you would like to see it enabled.

Comment 46 Matt Carlson 2008-08-14 18:47:29 UTC
Actually, I did mean to turn flow control off.  It isn't enough to just turn off the autonegotiation field though.  You have to turn off rx and tx flow control too.  Otherwise, the driver interprets the settings to mean the administrator wanted to force flow control on.

Comment 47 Andy Gospodarek 2008-08-14 18:58:28 UTC
Thanks, Matt!

Marcus, if you could test with flow-control completely disabled that would be great.

Comment 58 Andy Gospodarek 2008-10-24 16:58:35 UTC
Matt is correct that we really need to split this out into two separate bugs.  For now, I'd like to find out how things are progressing with Marcus.  In comment #46 Matt suggested disabling flow control completely.  Since flow control is a parameter that can be autonegotiated between the host and the switch, you will need to disable it in all three spots, so that your output from 'ethtool -a' looks like this:

# ethtool -a eth2
Pause parameters for eth2:
Autonegotiate:  off
RX:             off
TX:             off

Would you be willing to try that, Marcus?
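
The three settings can be changed in one step with 'ethtool -A eth2 autoneg off rx off tx off' and then checked against the 'ethtool -a' output. The sketch below automates only the check. It parses pasted output instead of running ethtool, and the function name `pause_fully_disabled` is made up for illustration.

```python
# Sketch: confirm that 'ethtool -a' output shows flow control fully off.
SAMPLE = """\
Pause parameters for eth2:
Autonegotiate:  off
RX:             off
TX:             off
"""


def pause_fully_disabled(ethtool_a_output):
    """True only if Autonegotiate, RX and TX are all reported as off."""
    settings = {}
    for line in ethtool_a_output.splitlines():
        key, sep, value = line.partition(":")
        # The "Pause parameters for ..." header has nothing after the colon.
        if sep and value.strip():
            settings[key.strip()] = value.strip()
    return all(settings.get(k) == "off" for k in ("Autonegotiate", "RX", "TX"))


print(pause_fully_disabled(SAMPLE))
```

A missing field counts as "not disabled", so truncated output is not mistaken for a good configuration.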

Comment 59 Andy Gospodarek 2008-10-24 17:33:53 UTC
I created bug 468420 to address the 5704 issues, so I'm going to make all those private and we should just focus on 5700 in this bug.

Comment 60 Andy Gospodarek 2008-12-15 20:33:36 UTC
Marcus,  I'm trying to get these RHEL4 bugzillas completed since we are doing another update soon.  I realize this has been around for a while, but I wonder if you have had a chance to try some of the flow-control changes from comment #58 and how your systems are now doing.  Thanks!

Comment 61 Marcus Alves Grando 2009-01-05 18:00:15 UTC
(In reply to comment #60)
> Marcus,  I'm trying to get these RHEL4 bugzillas completed since we are doing
> another update soon.  I realize this has been around for a while, but I wonder
> if you have had a chance to try some of the flow-control changes from comment
> #58 and how your systems are now doing.  Thanks!

Guys, I can't reproduce this anymore, since we replaced the related servers. I tried installing again, but without real usage the NICs work fine.

Regards

Comment 62 Andy Gospodarek 2009-01-05 19:00:56 UTC
Thanks, Marcus.

I hate to close this issue without any resolution, but it seems like you were the only person who could reproduce this issue and with your servers now being out of production, I'm not sure there is much we can do.  Please re-open this bug if you are able to reproduce this or if you or anyone else are experiencing problems.

Comment 63 Zack Buhman 2011-07-19 17:02:25 UTC
Hello. I can reproduce this bug on similar hardware (in this case a PE 2550) using kernel 2.6.39. I've been able to consistently reproduce this error; it happens every time my network administrator feels like resetting the switch that the machine is connected to, and while there is some substantial network load. I have also been able to reproduce this on kernels 2.6.32 and 2.6.38 (those are the only other two I've tested).

I would be happy to produce any additional information.

root@server1:~# dmesg | tail -n 44
[ 3538.016036] ------------[ cut here ]------------
[ 3538.021040] WARNING: at /build/buildd-linux-2.6_2.6.39-3-i386-0YkQQW/linux-2.6-2.6.39/debian/build/source_i386_none/net/sched/sch_generic.c:256 dev_watchdog+0xc9/0x15d()
[ 3538.031611] Hardware name: PowerEdge 2550                  
[ 3538.037070] NETDEV WATCHDOG: eth1 (tg3): transmit queue 0 timed out
[ 3538.042569] Modules linked in: decnet loop snd_pcm snd_timer snd soundcore snd_page_alloc evdev pcspkr i2c_piix4 i2c_core psmouse serio_raw dcdbas shpchp pci_hotplug parport_pc parport processor thermal_sys button ext4 mbcache jbd2 crc16 sr_mod cdrom ata_generic sg sd_mod crc_t10dif pata_serverworks libata ohci_hcd aacraid ehci_hcd floppy tg3 usbcore scsi_mod e100 libphy mii [last unloaded: scsi_wait_scan]
[ 3538.073246] Pid: 0, comm: swapper Not tainted 2.6.39-2-686-pae #1
[ 3538.079758] Call Trace:
[ 3538.086161]  [<c1036b45>] ? warn_slowpath_common+0x6a/0x7b
[ 3538.092660]  [<c1225fc4>] ? dev_watchdog+0xc9/0x15d
[ 3538.099123]  [<c1036bbc>] ? warn_slowpath_fmt+0x28/0x2c
[ 3538.105595]  [<c1225fc4>] ? dev_watchdog+0xc9/0x15d
[ 3538.112130]  [<c102ba83>] ? get_nohz_timer_target+0x3f/0x5c
[ 3538.118680]  [<c1041b1a>] ? __mod_timer+0x10c/0x116
[ 3538.125188]  [<c103bc50>] ? irq_enter+0x49/0x49
[ 3538.131626]  [<c1041be5>] ? mod_timer+0x67/0x6c
[ 3538.138048]  [<c103bc50>] ? irq_enter+0x49/0x49
[ 3538.144364]  [<c1041485>] ? run_timer_softirq+0x167/0x20b
[ 3538.150653]  [<c1225efb>] ? netif_tx_lock+0x4f/0x4f
[ 3538.156888]  [<c103bc50>] ? irq_enter+0x49/0x49
[ 3538.163064]  [<c103bceb>] ? __do_softirq+0x9b/0x14e
[ 3538.169106]  [<c103bc50>] ? irq_enter+0x49/0x49
[ 3538.175219]  <IRQ>  [<c103bb4b>] ? irq_exit+0x2f/0x79
[ 3538.181309]  [<c101bc1b>] ? smp_apic_timer_interrupt+0x6b/0x75
[ 3538.187449]  [<c12b2711>] ? apic_timer_interrupt+0x31/0x38
[ 3538.193526]  [<c1021c68>] ? native_safe_halt+0x2/0x3
[ 3538.199534]  [<c100dc94>] ? default_idle+0x50/0x87
[ 3538.205460]  [<c1007f9f>] ? cpu_idle+0x95/0xb2
[ 3538.211416]  [<c14207d4>] ? start_kernel+0x337/0x33c
[ 3538.217465] ---[ end trace f74ed1aa79d1afe1 ]---
[ 3538.223529] tg3 0000:01:08.0: eth1: transmit timed out, resetting
[ 3538.229698] tg3 0000:01:08.0: eth1: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000008]
[ 3538.236001] tg3 0000:01:08.0: eth1: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
[ 3538.343602] tg3 0000:01:08.0: tg3_stop_block timed out, ofs=1800 enable_bit=2
[ 3538.449609] tg3 0000:01:08.0: tg3_stop_block timed out, ofs=c00 enable_bit=2
[ 3538.555402] tg3 0000:01:08.0: tg3_stop_block timed out, ofs=4800 enable_bit=2
[ 3538.692079] tg3 0000:01:08.0: eth1: Link is down
[ 3542.708988] tg3 0000:01:08.0: eth1: Link is up at 1000 Mbps, full duplex
[ 3542.714535] tg3 0000:01:08.0: eth1: Flow control is off for TX and off for RX
[ 6729.236040] tg3 0000:01:08.0: BAR 0: set to [mem 0xfeb00000-0xfeb0ffff 64bit] (PCI address [0xfeb00000-0xfeb0ffff])
[ 6729.462135] ADDRCONF(NETDEV_UP): eth1: link is not ready
[ 6732.460994] tg3 0000:01:08.0: eth1: Link is up at 1000 Mbps, full duplex
[ 6732.466846] tg3 0000:01:08.0: eth1: Flow control is off for TX and off for RX
[ 6732.473256] ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[ 6742.968016] eth1: no IPv6 routers present

root@server1:~# lspci -vvv
00:00.0 Host bridge: Broadcom CNB20HE Host Bridge (rev 23)
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

00:00.1 Host bridge: Broadcom CNB20HE Host Bridge (rev 01)
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR- INTx-
	Latency: 32, Cache Line Size: 32 bytes

00:00.2 Host bridge: Broadcom CNB20HE Host Bridge (rev 01)
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR- INTx-

00:00.3 Host bridge: Broadcom CNB20HE Host Bridge (rev 01)
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR- INTx-

00:0e.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27) (prog-if 00 [VGA controller])
	Subsystem: Dell PowerEdge 2550
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping+ SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32 (2000ns min), Cache Line Size: 32 bytes
	Region 0: Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: I/O ports at ec00 [size=256]
	Region 2: Memory at fe101000 (32-bit, non-prefetchable) [size=4K]
	[virtual] Expansion ROM at 80000000 [disabled] [size=128K]
	Capabilities: [5c] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-

00:0f.0 ISA bridge: Broadcom OSB4 South Bridge (rev 50)
	Subsystem: Broadcom OSB4 South Bridge
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Kernel driver in use: piix4_smbus

00:0f.1 IDE interface: Broadcom OSB4 IDE Controller (prog-if 8a [Master SecP PriP])
	Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 64
	Region 0: [virtual] Memory at 000001f0 (32-bit, non-prefetchable) [size=8]
	Region 1: [virtual] Memory at 000003f0 (type 3, non-prefetchable) [size=1]
	Region 2: [virtual] Memory at 00000170 (32-bit, non-prefetchable) [size=8]
	Region 3: [virtual] Memory at 00000370 (type 3, non-prefetchable) [size=1]
	Region 4: I/O ports at 08b0 [size=16]
	Kernel driver in use: pata_serverworks

00:0f.2 USB Controller: Broadcom OSB4/CSB5 OHCI USB Controller (rev 04) (prog-if 10 [OHCI])
	Subsystem: Broadcom OSB4/CSB5 OHCI USB Controller
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32 (20000ns max), Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 11
	Region 0: Memory at fe100000 (32-bit, non-prefetchable) [size=4K]
	Kernel driver in use: ohci_hcd

01:08.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5700 Gigabit Ethernet (rev 12)
	Subsystem: Dell Broadcom BCM5700
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32 (16000ns min), Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 17
	Region 0: Memory at feb00000 (64-bit, non-prefetchable) [size=64K]
	Capabilities: [40] PCI-X non-bridge device
		Command: DPERE- ERO+ RBC=512 OST=1
		Status: Dev=ff:1f.1 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=512 DMOST=1 DMCRS=8 RSCEM- 266MHz- 533MHz-
	Capabilities: [48] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [50] Vital Product Data
		Unknown large resource type 00, will not decode more.
	Capabilities: [58] MSI: Enable- Count=1/8 Maskable- 64bit+
		Address: 604ac44e9ead9ed0  Data: 8db0
	Kernel driver in use: tg3

02:02.0 PCI bridge: Intel Corporation 80960RM (i960RM) Bridge (rev 01) (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32, Cache Line Size: 64 bytes
	Bus: primary=02, secondary=03, subordinate=03, sec-latency=32
	I/O behind bridge: 0000f000-00000fff
	Memory behind bridge: fff00000-000fffff
	Prefetchable memory behind bridge: fff00000-000fffff
	Secondary status: 66MHz- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA+ VGA- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-

02:02.1 RAID bus controller: Dell PowerEdge Expandable RAID Controller 3/Di (rev 01)
	Subsystem: Dell PERC 3/DiV [Viper]
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32, Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 31
	Region 0: Memory at f0000000 (32-bit, prefetchable) [size=128M]
	Expansion ROM at fe800000 [disabled] [size=64K]
	Kernel driver in use: aacraid

02:04.0 Ethernet controller: Intel Corporation 82557/8/9/0/1 Ethernet Pro 100 (rev 08)
	Subsystem: Dell 10/100 Ethernet Server Adapter
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32 (2000ns min, 14000ns max), Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 16
	Region 0: Memory at fe900000 (32-bit, non-prefetchable) [size=4K]
	Region 1: I/O ports at ccc0 [size=64]
	Region 2: Memory at fe700000 (32-bit, non-prefetchable) [size=1M]
	Expansion ROM at 80100000 [disabled] [size=1M]
	Capabilities: [dc] Power Management version 2
		Flags: PMEClk- DSI+ D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=2 PME-
	Kernel driver in use: e100

root@server1:~# cat /proc/interrupts 
            CPU0       CPU1       
   0:         42          0   IO-APIC-edge      timer
   1:        545        512   IO-APIC-edge      i8042
   6:          0          3   IO-APIC-edge      floppy
   7:          0          0   IO-APIC-edge      parport0
   8:          3          0   IO-APIC-edge      rtc0
   9:          0          0   IO-APIC-fasteoi   acpi
  11:          0          0   IO-APIC-fasteoi   ohci_hcd:usb1
  12:        400        345   IO-APIC-edge      i8042
  14:         75         29   IO-APIC-edge      pata_serverworks
  15:          0          0   IO-APIC-edge      pata_serverworks
  16:          6         10   IO-APIC-fasteoi 
  17:     709399     709871   IO-APIC-fasteoi   eth1
  31:      23738      23477   IO-APIC-fasteoi   aacraid
 NMI:       5593       5593   Non-maskable interrupts
 LOC:    2479111    1979429   Local timer interrupts
 SPU:          0          0   Spurious interrupts
 PMI:       5593       5593   Performance monitoring interrupts
 IWI:          0          0   IRQ work interrupts
 RES:      43696      43299   Rescheduling interrupts
 CAL:      14332      12977   Function call interrupts
 TLB:       4580       4019   TLB shootdowns
 TRM:          0          0   Thermal event interrupts
 THR:          0          0   Threshold APIC interrupts
 MCE:          0          0   Machine check exceptions
 MCP:         32         32   Machine check polls
 ERR:          0
 MIS:          0

I can dig deeper into my kernel logs for .38 errors.

root@server1:/var/log# cat messages.3 | grep kernel | tail -n 35
Jun 27 14:40:46 server1 kernel: imklog 5.8.1, log source = /proc/kmsg started.
Jun 28 02:42:05 server1 kernel: [51106.000021] ------------[ cut here ]------------
Jun 28 02:42:05 server1 kernel: [51106.005564] WARNING: at /build/buildd-linux-2.6_2.6.38-5-i386-gvX4XH/linux-2.6-2.6.38/debian/build/source_i386_none/net/sched/sch_generic.c:256 dev_watchdog+0xc9/0x15d()
Jun 28 02:42:05 server1 kernel: [51106.017080] Hardware name: PowerEdge 2550                  
Jun 28 02:42:05 server1 kernel: [51106.022989] NETDEV WATCHDOG: eth1 (tg3): transmit queue 0 timed out
Jun 28 02:42:05 server1 kernel: [51106.029031] Modules linked in: fuse btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs exportfs reiserfs ext3 jbd ext2 dm_mod decnet loop snd_pcm snd_timer snd soundcore snd_page_alloc psmouse tpm_tis shpchp dcdbas evdev pcspkr parport_pc processor i2c_piix4 tpm tpm_bios serio_raw i2c_core pci_hotplug parport thermal_sys button ext4 mbcache jbd2 crc16 sr_mod cdrom ata_generic sg sd_mod crc_t10dif pata_serverworks libata aacraid ohci_hcd ehci_hcd tg3 usbcore scsi_mod libphy e100 floppy mii nls_base [last unloaded: scsi_wait_scan]
Jun 28 02:42:05 server1 kernel: [51106.068981] Pid: 0, comm: kworker/0:0 Not tainted 2.6.38-2-686 #1
Jun 28 02:42:05 server1 kernel: [51106.075786] Call Trace:
Jun 28 02:42:05 server1 kernel: [51106.082625]  [<c102fa29>] ? warn_slowpath_common+0x6a/0x7b
Jun 28 02:42:05 server1 kernel: [51106.089485]  [<c12080c3>] ? dev_watchdog+0xc9/0x15d
Jun 28 02:42:05 server1 kernel: [51106.096251]  [<c102faa0>] ? warn_slowpath_fmt+0x28/0x2c
Jun 28 02:42:05 server1 kernel: [51106.102908]  [<c12080c3>] ? dev_watchdog+0xc9/0x15d
Jun 28 02:42:05 server1 kernel: [51106.109462]  [<c10349b1>] ? __do_softirq+0x0/0x14f
Jun 28 02:42:05 server1 kernel: [51106.116085]  [<c1041068>] ? __queue_work+0x2a9/0x2c3
Jun 28 02:42:05 server1 kernel: [51106.122758]  [<c10349b1>] ? __do_softirq+0x0/0x14f
Jun 28 02:42:05 server1 kernel: [51106.129506]  [<c1039b9d>] ? run_timer_softirq+0x167/0x20b
Jun 28 02:42:05 server1 kernel: [51106.136217]  [<c1207ffa>] ? dev_watchdog+0x0/0x15d
Jun 28 02:42:05 server1 kernel: [51106.142906]  [<c10349b1>] ? __do_softirq+0x0/0x14f
Jun 28 02:42:05 server1 kernel: [51106.149909]  [<c1034a4c>] ? __do_softirq+0x9b/0x14f
Jun 28 02:42:05 server1 kernel: [51106.156533]  [<c10349b1>] ? __do_softirq+0x0/0x14f
Jun 28 02:42:05 server1 kernel: [51106.163165]  <IRQ>  [<c1034935>] ? irq_exit+0x26/0x59
Jun 28 02:42:05 server1 kernel: [51106.169828]  [<c1015e30>] ? smp_apic_timer_interrupt+0x6b/0x75
Jun 28 02:42:05 server1 kernel: [51106.176521]  [<c1290651>] ? apic_timer_interrupt+0x31/0x38
Jun 28 02:42:05 server1 kernel: [51106.183122]  [<c101bb94>] ? native_safe_halt+0x2/0x3
Jun 28 02:42:05 server1 kernel: [51106.189536]  [<c10085c9>] ? default_idle+0x50/0x87
Jun 28 02:42:05 server1 kernel: [51106.195768]  [<c1002201>] ? cpu_idle+0x95/0xb0
Jun 28 02:42:05 server1 kernel: [51106.201959]  [<c128c207>] ? start_secondary+0x1b8/0x1bd
Jun 28 02:42:05 server1 kernel: [51106.208157] ---[ end trace d2c7eb5d333ff5a9 ]---
Jun 28 02:42:05 server1 kernel: [51106.577902] tg3 0000:01:08.0: eth1: Link is down
Jun 28 02:42:09 server1 kernel: [51109.997016] tg3 0000:01:08.0: eth1: Link is up at 1000 Mbps, full duplex
Jun 28 02:42:09 server1 kernel: [51110.002972] tg3 0000:01:08.0: eth1: Flow control is off for TX and off for RX
Jun 28 16:51:07 server1 kernel: [102048.068257] ices2[32063]: segfault at 0 ip 0805079f sp bfc17180 error 4 in ices2[8048000+13000]
Jun 29 15:20:13 server1 kernel: [182994.015710] ip_tables: (C) 2000-2006 Netfilter Core Team
Jul  2 14:32:40 server1 kernel: Kernel logging (proc) stopped.
Jul  2 14:32:40 server1 kernel: imklog 5.8.2, log source = /proc/kmsg started.

Comment 64 Patrick Ale 2013-10-09 07:22:16 UTC
This bug still seems to occur in RHEL 6.4 x64.

2.6.32-358.14.1.el6.x86_64 #1 SMP Mon Jun 17 15:54:20 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: ProLiant DL360p Gen8
NETDEV WATCHDOG: eth0 (tg3): transmit queue 0 timed out
Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl autofs4 sunrpc cpufreq_ondemand freq_table pcc_cpufreq bonding 8021q garp stp llc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 ext3 jbd dm_round_robin hpilo hpwdt tg3 microcode sg ses enclosure serio_raw iTCO_wdt iTCO_vendor_support ioatdma dca power_meter shpchp ext4 mbcache jbd2 sd_mod crc_t10dif hpsa qla2xxx scsi_transport_fc scsi_tgt pata_acpi ata_generic ata_piix dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-358.14.1.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8106e307>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff8106e3f6>] ? warn_slowpath_fmt+0x46/0x50
 [<ffffffff81467d3d>] ? dev_watchdog+0x26d/0x280
 [<ffffffff81090dad>] ? insert_work+0x6d/0xb0
 [<ffffffff81012bf9>] ? sched_clock+0x9/0x10
 [<ffffffff81467ad0>] ? dev_watchdog+0x0/0x280
 [<ffffffff81081857>] ? run_timer_softirq+0x197/0x340
 [<ffffffff810a7f80>] ? tick_sched_timer+0x0/0xc0
 [<ffffffff8102ea2d>] ? lapic_next_event+0x1d/0x30
 [<ffffffff81076fd1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff8109b79b>] ? hrtimer_interrupt+0x14b/0x260
 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
 [<ffffffff81076db5>] ? irq_exit+0x85/0x90
 [<ffffffff81517420>] ? smp_apic_timer_interrupt+0x70/0x9b
 [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
 <EOI>  [<ffffffff812d3a9e>] ? intel_idle+0xde/0x170
 [<ffffffff812d3a81>] ? intel_idle+0xc1/0x170
 [<ffffffff814153a7>] ? cpuidle_idle_call+0xa7/0x140
 [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
 [<ffffffff814f35ca>] ? rest_init+0x7a/0x80
 [<ffffffff81c27f7b>] ? start_kernel+0x424/0x430
 [<ffffffff81c2733a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81c27438>] ? x86_64_start_kernel+0xfa/0x109
---[ end trace e5884b70674dc1df ]---

I had to ifdown/ifup the interface via ILOM. Is there a usable workaround?
Our switches autonegotiate 1000 Mbps full duplex and there is no way I can get this changed.

Comment 65 Patrick Ale 2013-10-09 11:09:42 UTC
This is a different (newer) model though.

3:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)

Comment 66 Andy Gospodarek 2013-10-10 01:32:11 UTC
Patrick, this was a RHEL4 bug, so adding a report about RHEL6 isn't going to get much attention.  Please open a new bug on the product 'Red Hat Enterprise Linux 6' with the information you have.  You can assign this bug directly to ivecera.