From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.6) Gecko/20050317 Firefox/1.0.2

Description of problem:
Under heavy network load eth0 stopped responding. I could not ping the machine or ping out from its console. "service network restart" fixed it; a reboot was not necessary. The following was in the log after it happened:

Mar 24 13:27:54 challenger kernel: tg3: tg3_stop_block timed out, ofs=2000 enable_bit=2
Mar 24 13:27:54 challenger kernel: tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
Mar 24 13:27:54 challenger kernel: tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2

The machine is a Tyan 4882 quad-Opteron with 16GB of RAM, four 3ware 9500 SATA RAID cards, and 40 400GB hard drives. There is one ext3 filesystem on each 3ware card. The machine has frequently been under heavy network load for a couple of weeks with this kernel with no problems, rsyncing several terabytes of space from other machines. Most of this went to one filesystem on a 3ware card that does not share the same PCI bus as the onboard Broadcom gigabit ethernets. The problem occurred when about 100GB was being written to a filesystem on a 3ware card that does share the same PCI bus as the onboard Broadcom gigabit ethernets.

The motherboard has two gigabit ethernets, and when the problem occurred only the one under heavy load, eth0, stopped; the other was still usable. I don't know if the Broadcom and 3ware sharing the same PCI bus is part of the problem, but I did find the following from someone who appears to have had the same problem with the 2.6.9 kernel: http://lkml.org/lkml/2005/1/16/179 He later follows up saying 2.6.11-rc1 fixed his problem: http://lkml.org/lkml/2005/1/23/77

Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-6.25.EL

How reproducible:
Didn't try

Additional info:
Me too! On x86_64 (dual Opterons) with 2.6.9-5.0.3.ELsmp. I just start a big transfer with scp and I will either get a kernel panic (see bug #152525) or this bug's error. Slightly different ofs numbers, though:

tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
Well, let's start by trying an update of the tg3 driver. Pre-built test kernels available here: http://people.redhat.com/linville/kernels/rhel4/ Please give them a try and post the results. Thanks!
Created attachment 112652 [details] jwltest-tg3-3_25-rh.patch
I've been running the new kernel for almost a day:

[root@challenger]# uptime
08:34:25 up 23:29, 12 users, load average: 0.98, 1.29, 0.96
[root@challenger]# uname -a
Linux challenger 2.6.9-6.38.EL.jwltest.10smp #1 SMP Fri Apr 1 16:53:56 EST 2005 x86_64 x86_64 x86_64 GNU/Linux
[root@challenger]#

and just before I got in today it did it again:

Apr 5 07:39:11 challenger kernel: tg3: tg3_stop_block timed out, ofs=2000 enable_bit=2

After the first time it happened I added this crontab entry that runs once a minute. It tries to ping my gateway machine and if it can't, it restarts the network. Well, it couldn't ping it, and it restarted the network successfully right after the above message.

[root@challenger]# cat checktg3
#!/bin/bash
/bin/ping -c 1 10.95.176.3 > /dev/null 2>&1
if [ $? -ne 0 ] ; then
  /etc/init.d/network restart
  echo | mail -s "tg3 appears to be down on `uname -n`. Restarting network." marc
fi
[root@challenger]#

I don't know for sure what was going on when it just happened, but we have a cluster of dual-Opteron machines running parallel numerical models constantly, and their output is now being written to a filesystem on this machine. The particular filesystem that users are writing to now is on one 3ware 9500 card that shares the same PCI bus as the onboard Broadcoms.
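One caveat with a single-ping check like checktg3 is that one lost ICMP packet will bounce the interface. A slightly hardened variant might require a few consecutive failures before restarting. This is only a sketch, not the script above; the gateway address and mail recipient are carried over from it, and the retry helper (try_n) is my own addition:

```shell
#!/bin/bash
# Hardened sketch of the checktg3 watchdog; gateway address and mail
# recipient are taken from the original script, the rest is assumed.

# try_n N CMD...: run CMD up to N times, succeeding as soon as one run does.
try_n() {
    local n=$1 i
    shift
    for ((i = 0; i < n; i++)); do
        "$@" && return 0
        sleep 1
    done
    return 1
}

check_gateway() {
    # Only restart the network after three consecutive ping failures,
    # so a single dropped packet does not bounce the interface.
    if ! try_n 3 /bin/ping -c 1 10.95.176.3 > /dev/null 2>&1; then
        /etc/init.d/network restart
        echo | mail -s "tg3 appears to be down on $(uname -n). Restarting network." marc
    fi
}
```

To use it the same way as the original, append a line calling check_gateway and run the script from a once-a-minute crontab entry.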
I'd been able to restart the networking successfully several times without issue. Just got a kernel panic:

Apr 30 15:04:01 cpq100 crond(pam_unix)[23467]: session opened for user root by (uid=0)
Apr 30 15:04:11 cpq100 kernel: tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
Apr 30 15:04:11 cpq100 syslogd: sendto: Network is unreachable
Apr 30 15:04:11 cpq100 kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
Apr 30 15:04:11 cpq100 kernel: tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
Apr 30 15:04:11 cpq100 kernel: tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
Apr 30 15:04:11 cpq100 network: Shutting down interface eth0: succeeded
Apr 30 15:04:12 cpq100 network: Shutting down interface eth1: succeeded
Apr 30 15:04:12 cpq100 network: Shutting down loopback interface: succeeded
Apr 30 15:04:12 cpq100 sysctl: net.ipv4.ip_forward = 0
Apr 30 15:04:12 cpq100 sysctl: net.ipv4.conf.default.rp_filter = 1
Apr 30 15:04:12 cpq100 sysctl: net.ipv4.conf.default.accept_source_route = 0
Apr 30 15:04:12 cpq100 sysctl: kernel.sysrq = 0
Apr 30 15:04:12 cpq100 sysctl: kernel.core_uses_pid = 1
Apr 30 15:04:12 cpq100 network: Setting network parameters: succeeded
Apr 30 15:04:14 cpq100 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000005
Apr 30 15:04:14 cpq100 kernel: printing eip:
Apr 30 15:04:14 cpq100 kernel: c0182ebb
Apr 30 15:04:14 cpq100 kernel: *pde = 00004001
Apr 30 15:04:14 cpq100 kernel: Oops: 0000 [#1]
Apr 30 15:04:14 cpq100 kernel: SMP
Apr 30 15:04:14 cpq100 kernel: Modules linked in: ipt_LOG ipt_limit ipt_state ip_conntrack iptable_filter ip_tables md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc dm_mod button battery ac uhci_hcd ehci_hcd hw_random tg3 floppy ext3 jbd cciss sd_mod scsi_mod
Apr 30 15:04:14 cpq100 kernel: CPU: 0
Apr 30 15:04:14 cpq100 kernel: EIP: 0060:[<c0182ebb>] Not tainted VLI
Apr 30 15:04:14 cpq100 kernel: EFLAGS: 00010286 (2.6.9-5.0.3.ELsmp)
Apr 30 15:04:14 cpq100 kernel: EIP is at remove_proc_entry+0x2f/0xe4
Apr 30 15:04:14 cpq100 kernel: eax: 00000000 ebx: 00000005 ecx: ffffffff edx: f7f87b80
Apr 30 15:04:14 cpq100 kernel: esi: c03332e0 edi: 00000005 ebp: c03b8fcc esp: c03b8f7c
Apr 30 15:04:14 cpq100 kernel: ds: 007b es: 007b ss: 0068
Apr 30 15:04:14 cpq100 kernel: Process swapper (pid: 0, threadinfo=c03b8000 task=c0312a60)
Apr 30 15:04:14 cpq100 kernel: Stack: f7f87b80 00000005 c35dd800 c03332e0 00000000 c03b8fcc f8ac2a6b c35dd800
Apr 30 15:04:14 cpq100 kernel:        f8aa6f72 ee501080 dae0ea80 c0276c7f dae0ea80 c0454ea0 c33ba760 c0276a4c
Apr 30 15:04:14 cpq100 kernel:        00000000 c02769fc c01283df 00000246 c03b8fcc c03b8fcc 0000000a 00000001
Apr 30 15:04:14 cpq100 kernel: Call Trace:
Apr 30 15:04:14 cpq100 kernel:  [<f8ac2a6b>] snmp6_unregister_dev+0x2f/0x3e [ipv6]
Apr 30 15:04:14 cpq100 kernel:  [<f8aa6f72>] in6_dev_finish_destroy+0x71/0x80 [ipv6]
Apr 30 15:04:14 cpq100 kernel:  [<c0276c7f>] dst_destroy+0x63/0xac
Apr 30 15:04:14 cpq100 kernel:  [<c0276a4c>] dst_run_gc+0x50/0xd3
Apr 30 15:04:14 cpq100 kernel:  [<c02769fc>] dst_run_gc+0x0/0xd3
Apr 30 15:04:14 cpq100 kernel:  [<c01283df>] run_timer_softirq+0x123/0x145
Apr 30 15:04:14 cpq100 kernel:  [<c0124b2c>] __do_softirq+0x4c/0xb1
Apr 30 15:04:14 cpq100 kernel:  [<c0107f39>] do_softirq+0x4f/0x56
Apr 30 15:04:14 cpq100 kernel:  =======================
Apr 30 15:04:14 cpq100 kernel:  [<c011633f>] smp_apic_timer_interrupt+0xd9/0xdd
Apr 30 15:04:14 cpq100 kernel:  [<c02c6aea>] apic_timer_interrupt+0x1a/0x20
Apr 30 15:04:14 cpq100 kernel:  [<c01040e5>] mwait_idle+0x33/0x42
Apr 30 15:04:14 cpq100 kernel:  [<c010409d>] cpu_idle+0x26/0x3b
Apr 30 15:04:14 cpq100 kernel:  [<c0382784>] start_kernel+0x194/0x198
Apr 30 15:04:14 cpq100 kernel: Code: 56 53 55 55 89 14 24 89 44 24 04 75 13 8d 4c 24 04 89 e2 e8 11 f9 ff ff 85 c0 0f 85 b6 00 00 00 8b 5c 24 04 31 c0 83 c9 ff 89 df <f2> ae f7 d1 49 8b 04 24 89 cd 8d 70 34 83 78 34 00 0f 84 94 00
Apr 30 15:04:14 cpq100 kernel: <0>Kernel panic - not syncing: Fatal exception in interrupt

Any ideas on:
1. How long until we get a patch?
2. How to mitigate things until a patch arrives?

We're running RHEL 4 on an HP ProLiant DL380 G4, which appears to contain tg3-based Broadcom Gigabit Ethernet NICs ("NC7781" cards according to HP). I'd use HP's driver: http://h18004.www1.hp.com/support/files/server/us/download/22321.html but I'd rather stick with the vanilla driver from RH. We've had loads of success with both RHEL 3 and 4 on a wide variety of DL320 gear. The cron job from marc.edu has saved me several 30+ minute drives up until now... thanks Marc.
The oops in comment 5 would appear to be a different problem, the one reported in bug 151874. The kernels mentioned in comment 2 contain a patch for that problem. That patch has been submitted and should be available for U2.
I have an internal report that updating the firmware on the card resolved this issue for the RHEL3 version of this problem. Any chance you can get a firmware update for the tg3 hardware?
Still no solid leads... however, I have taken an update of the tg3 driver into the kernels here (same location as in comment 2): http://people.redhat.com/linville/kernels/rhel4/ Please try those and let me know the results with the current driver. Thanks!
(In reply to comment #9)
> I have an internal report that updating the firmware on the card resolved
> this issue for the RHEL3 version of this problem. Any chance you can get a
> firmware update for the tg3 hardware?

Do you have more specifics on this internal report, such as which firmware level and which cards? I'm seeing exactly the same problem with RHEL 3.0 update 3 on HP DL360 G4s with HP's NC7782 card. I want to make sure that the firmware level I'm going to will resolve this issue.
No, sorry, nothing more specific. It was "the latest" of about 2 weeks ago, I believe. I'm pretty sure it was an HP card in that case, FWIW...
More test kernels w/ latest version of tg3 available at same location as in comment 11...please give them a try and post your results here...thanks!
I have a Supermicro server with dual gigabit Broadcom NICs. I am having this problem as well; for me it seems to be associated with encrypted traffic. I run openvpn and ssh/sftp to this box often. Whenever I transfer a large file (50MB+) I get these same error messages, and the eth0 NIC freezes. It comes back up after 10-15 minutes on its own, as I see:

Aug 8 01:15:00 office kernel: NETDEV WATCHDOG: eth0: transmit timed out
Aug 8 01:15:00 office kernel: tg3: eth0: transmit timed out, resetting
Aug 8 01:15:00 office kernel: tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
Aug 8 01:15:00 office kernel: tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
Aug 8 01:15:00 office kernel: tg3: eth0: Link is down.
Aug 8 01:15:02 office kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
Aug 8 01:15:02 office kernel: tg3: eth0: Flow control is on for TX and on for RX.

However, that occurs only after the NIC has been down for 10-15 minutes, during which I see a ton of these:

Aug 8 01:13:30 office openvpn[30076]: read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Aug 8 01:13:35 office openvpn[30076]: read UDPv4 [EHOSTUNREACH]: No route to host (code=113)

I just tried your latest test kernel (with tg3 v3.33) and it still occurs. The eth0 NIC is plugged into a DSL router and has an iptables firewall on it. I don't think this has much to do with load (the DSL link is only 1.5Mb and isn't getting close to saturating the 100Mb link), unless it has something to do with TCP retransmits because the DSL link is saturated.
(In reply to comment #18)

Have you tried updating the firmware on the card? All reports of tg3 hangs have apparently been solved after that.
I am experiencing this on two Dell PowerEdge 2550s that were recently upgraded to RHEL4. The 2.6.9-20.EL.jwltest.61smp kernel seems to have fixed the problem for me. A user was doing a large rsync, and each time he started it the tg3 would die. With this kernel, the rsync seems to be working so far.
Based on that info, I'll use this bug to track the update of tg3 in RHEL4 U3.
I spoke too soon. The tg3 was dead this morning, and after a reboot another rsync run by a user killed it quickly again. These systems are dual 1.4GHz P3s with 2GB of memory. There is no firewall configured, and they are plugged into a gigabit switch with several other servers. During the install of RHEL4U1, booting with ACPI off did not help; I haven't tried it since. The errors from this morning:

Sep 16 07:15:10 crouchingtiger kernel: NETDEV WATCHDOG: eth1: transmit timed out
Sep 16 07:15:10 crouchingtiger kernel: tg3: eth1: transmit timed out, resetting
Sep 16 07:15:10 crouchingtiger kernel: tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
Sep 16 07:15:10 crouchingtiger kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
Sep 16 07:15:10 crouchingtiger kernel: tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
Sep 16 07:15:10 crouchingtiger kernel: tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
Sep 16 07:15:11 crouchingtiger kernel: tg3: eth1: Link is down.
Sep 16 07:15:15 crouchingtiger kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
Sep 16 07:15:15 crouchingtiger kernel: tg3: eth1: Flow control is on for TX and on for RX.
Does a simple ifdown/ifup get traffic flowing again? Or is the reboot absolutely necessary?
The last time this happened, I did an ifdown, rmmod tg3, ifup and it started working again, but NFS mounts were still hosed. They may have recovered over time, but rebooting was quicker.
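For reference, that recovery sequence written out as a root-run sketch (the interface name and the commented NFS mount point are assumptions for illustration, not taken from this system):

```shell
# Sketch of the in-place recovery: bounce the interface and reload tg3.
# Interface name (eth1) is an assumption; adjust to the wedged device.
reload_tg3() {
    local dev=${1:-eth1}
    ifdown "$dev"
    rmmod tg3
    modprobe tg3
    ifup "$dev"
    # NFS mounts routed over this interface may still be stale afterwards;
    # a lazy unmount and remount is often quicker than rebooting, e.g.:
    #   umount -l /mnt/data && mount /mnt/data   # /mnt/data is hypothetical
}
```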
(In reply to comment #19)

How do you update the firmware on the card? Where do I get it? Where do I put it? Is it like a BIOS update?
It would likely be similar to a BIOS update. If something is available, it would come from your card or system vendor.
I just had a kernel panic with the test kernel. It seems unrelated to the tg3:

Sep 22 09:42:59 crouchingtiger kernel: Unable to handle kernel paging request at virtual address 0040b709
Sep 22 09:42:59 crouchingtiger kernel: printing eip:
Sep 22 09:42:59 crouchingtiger kernel: c0170496
Sep 22 09:42:59 crouchingtiger kernel: *pde = 00000000
Sep 22 09:42:59 crouchingtiger kernel: Oops: 0000 [#1]
Sep 22 09:42:59 crouchingtiger kernel: SMP
Sep 22 09:42:59 crouchingtiger kernel: Modules linked in: nfs nfsd exportfs lockd md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc button battery ac tg3 e100 mii floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod megaraid_mbox megaraid_mm mptscsih mptbase aic7xxx sd_mod scsi_mod
Sep 22 09:42:59 crouchingtiger kernel: CPU: 0
Sep 22 09:42:59 crouchingtiger kernel: EIP: 0060:[<c0170496>] Not tainted VLI
Sep 22 09:42:59 crouchingtiger kernel: EFLAGS: 00010206 (2.6.9-20.EL.jwltest.61smp)
Sep 22 09:42:59 crouchingtiger kernel: EIP is at iput+0x25/0x61
Sep 22 09:42:59 crouchingtiger kernel: eax: 0040b6f5 ebx: c5d4494c ecx: f8bcabae edx: c5d4494c
Sep 22 09:42:59 crouchingtiger kernel: esi: dcb0d364 edi: dcb0d36c ebp: 0000007b esp: f7cf6eec
Sep 22 09:42:59 crouchingtiger kernel: ds: 007b es: 007b ss: 0068
Sep 22 09:42:59 crouchingtiger kernel: Process kswapd0 (pid: 47, threadinfo=f7cf6000 task=f7d17690)
Sep 22 09:42:59 crouchingtiger kernel: Stack: c5d4494c c016e0bc 00000000 0000008e 00000000 f7ffe9c0 c016e443 c0148718
Sep 22 09:42:59 crouchingtiger kernel:        005f1e00 00000000 00000021 00000000 0002ddaa 000000d0 00000020 c0324f00
Sep 22 09:42:59 crouchingtiger kernel:        00000002 c0324f00 0000000c c01499a4 c02cf3b4 0002ddaa f7cf6f9c 00000000
Sep 22 09:42:59 crouchingtiger kernel: Call Trace:
Sep 22 09:42:59 crouchingtiger kernel:  [<c016e0bc>] prune_dcache+0x14b/0x19a
Sep 22 09:42:59 crouchingtiger kernel:  [<c016e443>] shrink_dcache_memory+0x14/0x2b
Sep 22 09:42:59 crouchingtiger kernel:  [<c0148718>] shrink_slab+0xf8/0x161
Sep 22 09:42:59 crouchingtiger kernel:  [<c01499a4>] balance_pgdat+0x1d2/0x2f8
Sep 22 09:42:59 crouchingtiger kernel:  [<c02cf3b4>] schedule+0x844/0x87a
Sep 22 09:42:59 crouchingtiger kernel:  [<c011fedc>] prepare_to_wait+0x12/0x4c
Sep 22 09:42:59 crouchingtiger kernel:  [<c0149b94>] kswapd+0xca/0xcc
Sep 22 09:42:59 crouchingtiger kernel:  [<c011ffb1>] autoremove_wake_function+0x0/0x2d
Sep 22 09:42:59 crouchingtiger kernel:  [<c02d103a>] ret_from_fork+0x6/0x14
Sep 22 09:42:59 crouchingtiger kernel:  [<c011ffb1>] autoremove_wake_function+0x0/0x2d
Sep 22 09:42:59 crouchingtiger kernel:  [<c0149aca>] kswapd+0x0/0xcc
Sep 22 09:42:59 crouchingtiger kernel:  [<c01041f1>] kernel_thread_helper+0x5/0xb
Sep 22 09:42:59 crouchingtiger kernel: Code: ff e9 e5 fe ff ff 53 85 c0 89 c3 74 58 83 bb 3c 01 00 00 20 8b 80 a4 00 00 00 8b 40 24 75 08 0f 0b 54 04 6c 8a 2e c0 85 c0 74 0b <8b> 50 14 85 d2 74 04 89 d8 ff d2 8d 43 1c ba f0 9d 32 c0 e8 66
Sep 22 09:42:59 crouchingtiger kernel: <0>Fatal exception: panic in 5 seconds

The tg3 hardware in question is listed by lspci -vv as:

03:08.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5700 Gigabit Ethernet (rev 12)

I'm running the latest Dell 2550 BIOS, and I've searched and haven't been able to find any tg3 firmware updates from Dell.
Created attachment 119196 [details]
ifdebug

Collect some handy network device debugging info...

Usage: ifdebug <network device>
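The attachment body isn't quoted in the thread, so for anyone following along without access to it: a collector of this general sort might look like the sketch below. The exact commands the real ifdebug runs are not known; everything here is a guess at the kind of state worth capturing.

```shell
#!/bin/bash
# Guess at an ifdebug-style collector; the real attachment may differ.
dev=${1:-eth0}
out="ifdebug-$dev.txt"
{
    echo "=== ifdebug for $dev on $(uname -nr), $(date) ==="
    # Log each command's output, or its error message if the tool is missing.
    for cmd in "ifconfig $dev" \
               "ethtool $dev" \
               "ethtool -d $dev" \
               "ethtool -S $dev" \
               "cat /proc/interrupts"; do
        echo "--- $cmd ---"
        $cmd 2>&1
    done
} > "$out"
echo "wrote $out"
```

Run it once while the device works and again after it hangs, so the two dumps can be diffed.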
Anyone/everyone seeing this problem, please be sure to attach the output of running the "ifdebug" script from comment 49 against the failing device.
Do you want the script to be run when the device is failing or is any time OK?
Specifically after the failure. I suppose a "working" sample wouldn't hurt either.
Created attachment 119341 [details] ifdebug output on broken eth0 running tg3 driver
Created attachment 119342 [details] ifdebug output working tg3
OK, I am still running 2.6.9-11.ELsmp (I've tried all of the test kernels on your site, John, and none of them fix this issue for me). I am using a Supermicro X6DAL-TB2 with dual integrated BCM5721 NICs. Supermicro does not have firmware updates. This is what I have so far:

1) Turning ACPI off definitely helps. The NIC will stay up a couple of days vs. a couple of hours with it turned on.
2) Only the eth0 NIC ever goes down. I can dump 5+GB across my 1Gb/s LAN to eth1 and it never drops; only eth0, which can never move data faster than ~1.5Mbps (the speed of the DSL that eth0 is plugged into), ever drops.
3) I tried the bcm5700 driver from Broadcom and it acts exactly the same way as the tg3 driver (with ACPI turned on it will stay up for a couple of hours; with ACPI off it will work for 3-5 days).
4) I don't believe it's a hardware problem, as I have 3 boxes (all with the same mobo/NICs) that all exhibit exactly the same problem.

OK, so I have run the ifdebug script on the latest crashes (above); these were done with ACPI turned on. As it takes longer for the box to die with ACPI off, I'll have to post again when I get those debugs (if they would be helpful).
Could this bug possibly be related to a race condition somewhere in interrupt handling? I use these 3 boxes for Asterisk, and they all have Digium hardware (which generates about 8000 interrupts per second per card; each box has 2 cards). I have been talking with Supermicro and they cannot reproduce this bug in their lab with the exact same hardware (as far as motherboard, etc., though not the Digium cards). The other "clue", as I mentioned above: it seems that the NICs only fail when they are connected at, say, 100Mbps but their actual available bandwidth is much less (i.e., they are transmitting data across a WAN link at 1Mbps).
OK, a little more info. I changed the setup of one of the boxes: put eth0 on the inside, eth1 on the outside. That broke my theory :). The eth1 NIC just never dies; the eth0 NIC on the inside dies. Another possible clue: ifconfig shows really weird stats on eth0:

eth0      Link encap:Ethernet  HWaddr 00:30:48:54:E5:78
          inet addr:192.168.55.1  Bcast:192.168.55.255  Mask:255.255.255.0
          inet6 addr: fe80::230:48ff:fe54:e578/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:162224 errors:0 dropped:0 overruns:0 frame:4294839772
          TX packets:1249098 errors:4294935415 dropped:0 overruns:0 carrier:0
          collisions:4294935415 txqueuelen:1000
          RX bytes:26933356 (25.6 MiB)  TX bytes:1619811485 (1.5 GiB)
          Interrupt:145

First of all, notice the number of collisions: 4.3 billion collisions on 1.4 million packets? Also, the TX bytes number jumps all over the place (517MB, then I downloaded a large file onto the internal network and it went up to ~700MB, then dropped back down to ~300MB, and then counted up to the 1.5GiB stated above). I then uploaded a large file and the RX bytes number didn't increase at all. None of this weirdness happens on eth1.
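One plausible reading of those numbers, for what it's worth: values like 4294935415 sit just below 2^32, which is what a small negative number looks like when a 32-bit quantity is printed as unsigned. This is a guess at the symptom, not a confirmed diagnosis, but the arithmetic is easy to check:

```shell
# Reinterpret an unsigned 32-bit counter value as signed.
# Bash arithmetic is 64-bit, so the subtraction below is exact.
to_signed32() {
    local v=$1
    if (( v >= 2147483648 )); then
        echo $(( v - 4294967296 ))
    else
        echo "$v"
    fi
}

to_signed32 4294935415   # the collisions/TX errors value above -> -31881
to_signed32 4294839772   # the frame value above -> -127524
```

Small negative counter values like these would be consistent with the chip or driver handing back garbage deltas around its resets, rather than billions of real collisions.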
Did some more testing: the eth0 NIC is resetting itself pretty regularly. I set up a cron job to save the output of ifconfig every minute. Pretty much any time I push data over eth0, the NIC resets. eth1 doesn't reset; its stats keep counting up as they should. So there is a bug that is causing eth0 to reset, and sometimes this reset fails and locks the NIC. I have three symptoms:

1) NIC freezes; doing a network restart brings everything back
2) NIC freezes; network restart doesn't work, must restart entire box
3) Kernel panic; entire box freezes

As I stated before, this happens with both the tg3 and bcm5700 drivers, so it may be a motherboard layout issue, something with the PCI-E bus, I don't know. But if anyone has an answer: at this point, replacing the hardware may be preferable to sitting around waiting for a software fix. I read the tg3 source; it seems there are lots of workarounds and fixups for the various Broadcom NICs, but I notice there are none for the 5721 model, which is what I'm running. Could it be that a needed workaround is not being applied to this model?
Have you tried the test kernels at the location in comment 11? That has a fairly up-to-date version of the driver that would be worth trying. The 5721 is fairly new. It may be that it _needs_ a workaround or two, but they may not yet have been identified/written... :-( Please let me know the results of running with the aforementioned test kernels...thanks!
Yes, I ran the latest test kernel there about a week ago. Same exact problem still. I think it was version 67. I haven't tried 70 yet, but I see that it is still the same version of the tg3 driver (3.39) as the last one. I also tried 2.6.14 rc1 which has tg3 3.40. Still broken.
Created attachment 120121 [details] working ifdebug of tg3 on a Dell 2550 i686 machine
Created attachment 120122 [details] broken ifdebug of tg3 on Dell 2550 i686 SMP machine
Attachment #62 is the working ifdebug, taken after rebooting (from a hung tg3) into kernel-smp-2.6.9-22.EL; #63 is the broken ifdebug, taken before rebooting while running kernel-smp-2.6.9-20.EL.jwltest.61.

Since the problem started, we've switched from a D-Link DGS-1008T gigabit switch to a Dell 2716 gigabit switch. That seemed to help some, but the problem persists.
The kernels currently at the location from comment 2 contain a very late version of the tg3 driver (based on 3.43). Please give those a try and post the results...thanks!
Created attachment 120691 [details] ifdebug of tg3 while working running 2.6.9-22.8.EL.jwltest.80smp
Created attachment 120692 [details] ifdebug of tg3 while broken running 2.6.9-22.8.EL.jwltest.80smp
The kernel with the latest driver (2.6.9-22.8.EL.jwltest.80smp) doesn't seem to help. I've got three PE2550 SMP systems, all with this problem, but this one doesn't display the "transmit timed out" error messages; it just stops working. I don't know whether the latest errata kernel (kernel-smp-2.6.9-22.0.1.EL) and the test kernel have eliminated those messages or not. I can provide any other information you need. Anything to get this fixed!
(In reply to comment #68)
> I've got three PE2550 SMP systems all with this problem

It seems that you're using a 5700 B2 chip. Was this problem recently introduced? Did it happen before tg3 v3.32? Can you try running a UP (uni-processor) kernel and see if the problem goes away?
The problem started when I went from RHEL3 to RHEL4. I don't know which revision of the driver was in RHEL3. It's difficult to try UP because these are production systems, but I'll try. Details of the controller:

03:08.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5700 Gigabit Ethernet (rev 12)
        Subsystem: Dell Broadcom BCM5700
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32 (16000ns min), Cache Line Size 08
        Interrupt: pin A routed to IRQ 193
        Region 0: Memory at feb00000 (64-bit, non-prefetchable) [size=64K]
        Capabilities: [40] PCI-X non-bridge device.
                Command: DPERE- ERO- RBC=0 OST=0
                Status: Bus=255 Dev=31 Func=1 64bit+ 133MHz+ SCD- USC-, DC=simple, DMMRBC=0, DMOST=0, DMCRS=0, RSCEM-
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
                Address: bede463202944c04  Data: fc9f
Thomas, which update of RHEL3 were you using when you switched? And which RHEL4 update were you using when it started? I should be able to pin down the tg3 versions involved.
I'm pretty sure it was RHEL3U5 clean-installed to RHEL4U1, with an outside chance of it being RHEL3U6. I don't recall ever having a problem with any version of RHEL3 on these systems, though. I've actually got one system running the UP version of the latest errata kernel right now. Did it matter which UP kernel I ran? Do you want me to try the test one?
(In reply to comment #72)
> Did it matter which UP kernel I ran?

It doesn't matter which UP kernel you use, as long as you use the same tg3 driver on the UP kernel that would otherwise fail on an SMP kernel. I just want to know whether the failure is SMP-related or not.
It's just failed with the UP kernel (2.6.9-22.0.1.EL to be exact.)
(In reply to comment #61)
> I also tried 2.6.14 rc1 which has tg3 3.40. Still broken.

Please modify tg3_tx_timeout() to call tg3_dump_state() before the reset. Or comment out the schedule_work(&tp->reset_task) line in tg3_tx_timeout() and use "ethtool -d" to dump the registers after a transmit timeout. This way I can get a register dump in the failed state, before the chip is reset.
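With the reset_task line commented out as described, the chip stays in its failed state after the "transmit timed out" message, so the registers can still be read. A sketch of the capture step (the device name and output filenames here are my assumptions, not from the comment):

```shell
# Save tg3 register and statistics dumps after a transmit timeout.
# Device name and file naming are illustrative assumptions.
capture_tg3_state() {
    local dev=${1:-eth0}
    local stamp
    stamp=$(date +%Y%m%d-%H%M%S)
    # ethtool -d: raw register dump; ethtool -S: driver statistics.
    ethtool -d "$dev" > "tg3-regs-$dev-$stamp.txt" 2>&1
    ethtool -S "$dev" > "tg3-stats-$dev-$stamp.txt" 2>&1
    echo "tg3-regs-$dev-$stamp.txt"
}
```

Attaching the resulting files to the bug gives the driver maintainers the pre-reset state they asked for.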
Created attachment 120861 [details] jwltest-tg3-debug.patch
Test kernels w/ the above patch are available at the same location as in comment 2. Please give those a try, and post the information Michael requested in comment 75...thanks!
(In reply to comment #72)
> I'm pretty sure it was RHEL3U5 clean installed to RHEL4U1

John sent me the tg3 drivers in these 2 RH kernels, and they were essentially the same version, 3.22, ported to run on the 2 different kernels. I don't see anything in these 2 drivers that could explain one driver working and the other failing. So please use John's patch in comment 76 to capture the registers before the chip is reset during tx timeout.
Which kernel should I patch?
The kernels here are already patched appropriately: http://people.redhat.com/linville/kernels/rhel4/
OK, I'll try to get those kernels booted soon.
Created attachment 121300 [details] ethtool dump
Never mind that ethtool dump; it's not valid.
Created attachment 121368 [details]
ethtool dump of problematic interface

I'm not seeing the tg3_stop_block messages, or any error messages for that matter, but the interface seems to be locking up. Running the 2.6.9-22.17.EL.jwltest.88smp kernel.
(In reply to comment #85)
> ethtool dump of problematic interface

The register dump doesn't show anything unusual: no error status in any register, and interrupts were enabled. I'll ask our QA department to try to reproduce this if they have the same machine.
Regarding comment 9, I only meant to indicate that someone had told me that their problem was resolved by updating the firmware. I am not privy to any specific information in that regard.
From bug 123218 comment 47:

> I solved our tg3 locking problem by installing the HP firmware update found
> here:
>
> http://h18004.www1.hp.com/support/files/server/us/download/23367.html
>
> It's for hpnicfwupg-1.2.2-1.i386.rpm which has been running without issue
> for a couple months now. YMMV.
>
> Steve
This bug has hung around for a long time...

The "tg3_stop_block timed out" message is somewhat generic and not necessarily a problem. Please do not report it as a bug (or re-open this one) unless it results in an actual stop in the flow of traffic (i.e., an actual failure).

Nearly everyone who has reported this as a real issue (i.e., an actual failure) has seen the problem disappear after applying a tg3 firmware update from their vendor. Please do not report this as a bug (or re-open this one) unless you have already obtained and applied the latest tg3 firmware from your vendor.

Given the above two facts, and the fact that this bug has persisted so long that it no longer contains coherent information, I am closing this bug as "CANTFIX". If you really believe you are having a problem, please open a new bugzilla. Thanks...