69920 – Kernel Crashes in TG3 Driver

Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Version:	7.3
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Jeff Garzik
Duplicates (3):	78166 78427 78822
Blocks:	79997

Reported:	2002-07-26 13:14 UTC by Thomas J. Baker
Modified:	2013-07-03 02:06 UTC
CC List:	45 users
Last Closed:	2003-03-04 20:04:46 UTC

Attachments
Possible fix (1.79 KB, patch) 2002-11-20 19:36 UTC, Jeff Garzik
lspci -vvv for 2650 with tg3 problems (11.67 KB, text/plain) 2003-01-20 20:17 UTC, Need Real Name
dmesg for 2650 with tg3 problems (13.13 KB, text/plain) 2003-01-20 20:19 UTC, Need Real Name
PE2650 crash screen from kernel-2.4.18-24.7.xsmp.i686.rpm (64.52 KB, image/jpeg) 2003-02-12 19:40 UTC, Carl Litt
Synthetic load cpu/network test, in perl. (2.03 KB, text/plain) 2003-02-22 02:55 UTC, Rodrigo Cunha

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2002:262	0	normal	SHIPPED_LIVE	: New kernel fixes local denial of service issue	2002-09-23 04:00:00 UTC

Description Thomas J. Baker 2002-07-26 13:14:26 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 Galeon/1.2.5 (X11; Linux i686; U;) Gecko/20020606

Description of problem:
After running for some amount of time, the kernel crashes in what looks like the
tigon 3 driver. The system is a Dell PowerEdge 2550 and the kernel is
kernel-smp-2.4.18-5.

Version-Release number of selected component (if applicable):


How reproducible:
Didn't try

Steps to Reproduce:
This system is on a gigabit switch with another identically configured system
and there is a lot of nfs traffic between the two. This is the second crash this
system has experienced.
	

Actual Results:  fitzcarraldo.sr.unh.edu login: Unable to handle kernel NULL
pointer dereference at virtual address 00000060
 printing eip:
f891e0e2
*pde = 00000000
Oops: 0000
ip_conntrack_ftp ip_conntrack_irc ip_conntrack loop nfs nfsd lockd sunrpc auto
CPU:    0
EIP:    0010:[<f891e0e2>]    Not tainted
EFLAGS: 00010206

EIP is at tg3_rx [tg3] 0x112 (2.4.18-5smp)
eax: 00000642   ebx: 00010000   ecx: ceef5340   edx: 00000000
esi: 00000000   edi: 0000001e   ebp: 000005ea   esp: f5ebdb5c
ds: 0018   es: 0018   ss: 0018
Process nfsd (pid: 1941, stackpage=f5ebd000)
Stack: 000005ee f3f44970 c03483c0 00000021 8000001e 00010000 f882ad30 f724e200 
       000000d0 00000001 c035a000 f3f44960 04000001 00000011 f891e52e f3f44960 
       f3f44960 c035a000 f891e5c4 f3f44960 f6f4c420 00000000 c010a53e 00000011 
Call Trace: [<f882ad30>] rw_intr [sd_mod] 0x210 
[<f891e52e>] tg3_interrupt_main_work [tg3] 0x3e 
[<f891e5c4>] tg3_interrupt [tg3] 0x44 
[<c010a53e>] handle_IRQ_event [kernel] 0x5e 
[<c010a755>] do_IRQ [kernel] 0xa5 
[<c0107a48>] __read_lock_failed [kernel] 0x8 
[<c0146898>] .text.lock.buffer [kernel] 0xcf 
[<c0143ba8>] getblk [kernel] 0x18 
[<c0143e18>] bread [kernel] 0x18 
[<f8872da4>] ext3_get_inode_loc [ext3] 0x124 
[<f8873840>] ext3_reserve_inode_write [ext3] 0x20 
[<f88738e8>] ext3_mark_inode_dirty [ext3] 0x18 
[<c0143d73>] __refile_buffer [kernel] 0x53 
[<f88739a8>] ext3_dirty_inode [ext3] 0x98 
[<c0143c66>] balance_dirty [kernel] 0x6 
[<c0144951>] __block_commit_write [kernel] 0xb1 
[<c0156f2e>] __mark_inode_dirty [kernel] 0x2e 
[<c0145172>] generic_commit_write [kernel] 0x92 
[<f88719cb>] ext3_commit_write [ext3] 0x19b 
[<c0139af8>] __alloc_pages [kernel] 0xa8 
[<c013246f>] generic_file_write [kernel] 0x55f 
[<f8968ef7>] nfsd_open [nfsd] 0x27 
[<f886ed02>] ext3_file_write [ext3] 0x22 
[<f896960c>] nfsd_write [nfsd] 0x14c 
[<c0215eb5>] inet_sendmsg [kernel] 0x35 
[<f887df20>] ext3_file_operations [ext3] 0x0 
[<f896ea20>] nfsd3_proc_write [nfsd] 0xf0 
[<f8977b9c>] nfsd_procedures3 [nfsd] 0xfc 
[<f8977b9c>] nfsd_procedures3 [nfsd] 0xfc 
[<f8965667>] nfsd_dispatch [nfsd] 0xb7 
[<f8946047>] svc_process_Rsmp_6ad37799 [sunrpc] 0x347 
[<f897745c>] nfsd_version3 [nfsd] 0x0 
[<f897747c>] nfsd_program [nfsd] 0x0 
[<f8965442>] nfsd [nfsd] 0x252 
[<f89651f0>] nfsd [nfsd] 0x0 
[<c0107286>] kernel_thread [kernel] 0x26 
[<f89651f0>] nfsd [nfsd] 0x0 


Code: 8b 46 60 85 c0 74 13 68 15 03 00 00 68 a0 4a 92 f8 e8 78 92 
 <0>Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing
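
A note for readers of the oops above: the first bytes of the Code: line appear to account for the fault address. `8b 46 60` is a 32-bit load from offset 0x60 past ESI into EAX, and the register dump shows esi: 00000000, which gives the reported virtual address 00000060. The following is a minimal sketch in C that decodes only this one instruction form. The names `decode_mov_disp8` and `faulting_address` are hypothetical helpers written for this note; they are not part of the kernel or of any oops-decoding tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* General-purpose registers in x86 ModRM encoding order. */
enum reg32 { EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI };

struct mov_load {
    enum reg32 dest;  /* register that receives the loaded value */
    enum reg32 base;  /* register holding the base pointer */
    int8_t disp;      /* signed 8-bit displacement added to the base */
};

/* Decode only "8b /r" with mod=01 (base register + disp8), the form at
 * the start of the Code: line. Returns false for any other encoding. */
static bool decode_mov_disp8(const uint8_t *code, struct mov_load *out)
{
    uint8_t modrm = code[1];
    uint8_t mod = modrm >> 6;
    uint8_t reg = (modrm >> 3) & 7;
    uint8_t rm = modrm & 7;

    /* rm == 4 would mean a SIB byte follows, which this sketch skips. */
    if (code[0] != 0x8b || mod != 1 || rm == 4)
        return false;

    out->dest = (enum reg32)reg;
    out->base = (enum reg32)rm;
    out->disp = (int8_t)code[2];
    return true;
}

/* Address the load touches, given register values in ModRM order. */
static uint32_t faulting_address(const struct mov_load *insn,
                                 const uint32_t regs[8])
{
    assert(insn->base <= EDI);
    return regs[insn->base] + (uint32_t)(int32_t)insn->disp;
}
```

Feeding it the bytes `8b 46 60` and the register values from the dump (eax through edi) yields base ESI, destination EAX, and address 0x60. Which structure field lives at that offset is not something the oops itself shows.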
 


Additional info:

Comment 1 Thomas J. Baker 2002-08-09 12:40:27 UTC

Here's another crash. 

fitzcarraldo.sr.unh.edu login:
Red Hat Linux release 7.3 (Valhalla)
Kernel 2.4.18-5smp on an i686

fitzcarraldo.sr.unh.edu login: Unable to handle kernel NULL pointer dereference
at virtual address 00000060
 printing eip:
f891e0e2
*pde = 00000000
Oops: 0000
nfs nfsd lockd sunrpc autofs tg3 eepro100 ext3 jbd megaraid aic7xxx sd_mod scs
CPU:    0
EIP:    0010:[<f891e0e2>]    Not tainted
EFLAGS: 00010206

EIP is at tg3_rx [tg3] 0x112 (2.4.18-5smp)
eax: 00000642   ebx: 00010000   ecx: c8f99760   edx: 00000000
esi: 00000000   edi: 0000005e   ebp: 000005ea   esp: c030bef8
ds: 0018   es: 0018   ss: 0018
Process swapper (pid: 0, stackpage=c030b000)
Stack: 000005ee f4a7e170 c034cbc0 0000025f 8000025e 00010000 f58765e0 f4dfea80
       db94d4e0 c03de6c0 c035a000 f4a7e160 04000001 00000011 f891e52e f4a7e160
       f4a7e160 c035a000 f891e5c4 f4a7e160 f50e4b20 00000000 c010a53e 00000011
Call Trace: [<f891e52e>] tg3_interrupt_main_work [tg3] 0x3e
[<f891e5c4>] tg3_interrupt [tg3] 0x44
[<c010a53e>] handle_IRQ_event [kernel] 0x5e
[<c010a755>] do_IRQ [kernel] 0xa5
[<c0106e70>] default_idle [kernel] 0x0
[<c0105000>] stext [kernel] 0x0
[<c0106e70>] default_idle [kernel] 0x0
[<c0105000>] stext [kernel] 0x0
[<c0106e9c>] default_idle [kernel] 0x2c
[<c0106ef4>] cpu_idle [kernel] 0x24


Code: 8b 46 60 85 c0 74 13 68 15 03 00 00 68 a0 4a 92 f8 e8 78 92
 <0>Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing

End Data

Comment 2 Brian Brock 2002-09-04 15:20:39 UTC

Have you attempted to use the 2.4.18-10 kernel?  If so, what were the results?

Comment 3 Thomas J. Baker 2002-09-04 17:12:52 UTC

I'm running the new kernel on two systems with TG3 ethernet and haven't had any
problems yet. It's only been a few days though.

Comment 4 Jefferson Ogata 2002-11-13 20:56:13 UTC

I'm seeing this problem also on Dell 2550s and on 6450s. This is with the latest
Red Hat kernel 2.4.18-17.7.x on Red Hat 7.3, all patches applied. smp and bigmem
kernels both appear to be affected.

The problem is IMHO unambiguously the tg3 driver. I had three different hosts
all exhibiting the same problem -- run for a few hours then hard hang. I
disabled the built-in Broadcom adapters and installed Intel Gb adapters and have
been running for over a week with no problem.

Comment 5 Arjan van de Ven 2002-11-16 11:06:57 UTC

An errata has been issued which should help the problem described in this bug report. 
This report is therefore being closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, please follow the link below. You may reopen 
this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2002-262.html

Comment 6 Need Real Name 2002-11-18 06:21:19 UTC

Experienced periodic system hangs (every few hours at times), usually without
anything in the message log.

Nov 18 06:08:44 vfdb kernel:
Nov 18 06:08:44 vfdb kernel: wait_on_irq, CPU 0:
Nov 18 06:08:44 vfdb kernel: irq:  1 [ 0 0 1 0 ]
Nov 18 06:08:44 vfdb kernel: bh:   0 [ 0 0 1 0 ]
Nov 18 06:08:44 vfdb kernel: Stack dumps:
Nov 18 06:08:44 vfdb kernel: CPU 1:00000000 00000000 00000000 00000000 00000000 
00000000 00000000 00000000
Nov 18 06:08:44 vfdb kernel:        00000000 00000000 00000000 00000000 
00000000 00000000 00000000 00000000
Nov 18 06:08:44 vfdb kernel:        00000000 00000000 00000000 00000000 
00000000 00000000 00000000 00000000
Nov 18 06:08:44 vfdb kernel: Call Trace: [<f89ca28b>] tg3_start_xmit [tg3] 
0x12b (0xc36b99c4))
Nov 18 06:08:44 vfdb kernel: [<f897f117>] ipfw_output_check [ipchains] 0x77 
(0xc36b99f4))
Nov 18 06:08:44 vfdb kernel: [<f89ca28b>] tg3_start_xmit [tg3] 0x12b 
(0xc36b9a38))
Nov 18 06:08:44 vfdb kernel: [<c01e7e5e>] dev_queue_xmit [kernel] 0x14e 
(0xc36b9a54))
Nov 18 06:08:44 vfdb kernel: [<f897f117>] ipfw_output_check [ipchains] 0x77 
(0xc36b9a68))
Nov 18 06:08:44 vfdb kernel: [<f8978b63>] check_for_unredirect [ipchains] 0x63 
(0xc36b9a88))
Nov 18 06:08:44 vfdb kernel: [<c01f1c94>] qdisc_restart [kernel] 0x14 
(0xc36b9aa4))
Nov 18 06:08:44 vfdb kernel: [<c01e7e5e>] dev_queue_xmit [kernel] 0x14e 
(0xc36b9ac8))
Nov 18 06:08:44 vfdb kernel: [<c01eee3e>] nf_iterate [kernel] 0x2e (0xc36b9ad0))
Nov 18 06:08:44 vfdb kernel: [<c02010bf>] ip_finish_output2 [kernel] 0xaf 
(0xc36b9aec))
Nov 18 06:08:44 vfdb kernel: [<c0201010>] ip_finish_output2 [kernel] 0x0 
(0xc36b9af4))
Nov 18 06:08:44 vfdb kernel: [<c01ef173>] nf_hook_slow [kernel] 0xd3 
(0xc36b9af8))
Nov 18 06:08:44 vfdb kernel: [<c01ef1aa>] nf_hook_slow [kernel] 0x10a 
(0xc36b9b10))
Nov 18 06:08:44 vfdb kernel: [<c01ffb02>] ip_output [kernel] 0x162 (0xc36b9b40))
Nov 18 06:08:44 vfdb kernel: [<c0201010>] ip_finish_output2 [kernel] 0x0 
(0xc36b9b58))
Nov 18 06:08:44 vfdb kernel: [<c01ffea0>] ip_queue_xmit [kernel] 0x390 
(0xc36b9b88))
Nov 18 06:08:44 vfdb kernel: [<c01e42b8>] skb_clone [kernel] 0x78 (0xc36b9ba4))
Nov 18 06:08:44 vfdb kernel: [<c0214a1e>] tcp_v4_send_check [kernel] 0x6e 
(0xc36b9bc8))
Nov 18 06:08:44 vfdb kernel: [<c020f6c5>] tcp_transmit_skb [kernel] 0x565 
(0xc36b9bf0))
Nov 18 06:08:44 vfdb kernel: [<c02101df>] tcp_write_xmit [kernel] 0x1df 
(0xc36b9c34))
Nov 18 06:08:44 vfdb kernel: [<c01e51b4>] skb_checksum [kernel] 0x54 
(0xc36b9c50))
Nov 18 06:08:44 vfdb kernel: [<c020d592>] __tcp_data_snd_check [kernel] 0x52 
(0xc36b9c68))
Nov 18 06:08:44 vfdb kernel: [<c020d53c>] tcp_new_space [kernel] 0x7c 
(0xc36b9c84))
Nov 18 06:08:44 vfdb kernel: [<c01e51b4>] skb_checksum [kernel] 0x54 
(0xc36b9c94))
Nov 18 06:08:44 vfdb kernel: [<c011a02b>] __wake_up [kernel] 0x4b (0xc36b9cd4))
Nov 18 06:08:44 vfdb kernel: [<c0215e6c>] tcp_v4_rcv [kernel] 0x3cc 
(0xc36b9cf0))
Nov 18 06:08:44 vfdb kernel: [<c01eee3e>] nf_iterate [kernel] 0x2e (0xc36b9d2c))
Nov 18 06:08:44 vfdb kernel: [<c01fd337>] ip_local_deliver_finish [kernel] 0xb7 
(0xc36b9d44))
Nov 18 06:08:44 vfdb kernel: [<c01fd280>] ip_local_deliver_finish [kernel] 0x0 
(0xc36b9d50))
Nov 18 06:08:44 vfdb kernel: [<c01ef173>] nf_hook_slow [kernel] 0xd3 
(0xc36b9d54))
Nov 18 06:08:44 vfdb kernel: [<c01fd280>] ip_local_deliver_finish [kernel] 0x0 
(0xc36b9d68))
Nov 18 06:08:44 vfdb kernel: [<c01ef1aa>] nf_hook_slow [kernel] 0x10a 
(0xc36b9d6c))
Nov 18 06:08:44 vfdb kernel: [<c0126115>] update_process_times [kernel] 0x25 
(0xc36b9da8))
Nov 18 06:08:44 vfdb kernel: [<c0116999>] smp_apic_timer_interrupt [kernel] 
0xa9 (0xc36b9dcc))
Nov 18 06:08:44 vfdb kernel: [<c01a0cc4>] account_io_start [kernel] 0x44 
(0xc36b9dd8))
Nov 18 06:08:44 vfdb kernel: [<c01a0c07>] locate_hd_struct [kernel] 0x27 
(0xc36b9de0))
Nov 18 06:08:44 vfdb kernel: [<c01a0d69>] req_new_io [kernel] 0x49 (0xc36b9df4))
Nov 18 06:08:44 vfdb kernel: [<f8814f0c>] scsi_queue_next_request [scsi_mod] 
0x5c (0xc36b9e50))
Nov 18 06:08:44 vfdb kernel: [<f8815139>] __scsi_end_request [scsi_mod] 0x139 
(0xc36b9e68))
Nov 18 06:08:44 vfdb kernel: [<c0126115>] update_process_times [kernel] 0x25 
(0xc36b9e84))
Nov 18 06:08:44 vfdb kernel: [<c0126115>] update_process_times [kernel] 0x25 
(0xc36b9ea0))
:
Nov 18 06:08:44 vfdb kernel: [<c0116999>] smp_apic_timer_interrupt [kernel] 
0xa9 (0xc36b9ef8))
Nov 18 06:08:44 vfdb kernel: [<c01266cc>] schedule_timeout [kernel] 0x7c 
(0xc36b9f80))
Nov 18 06:08:44 vfdb kernel: [<c0126640>] process_timeout [kernel] 0x0 
(0xc36b9f98))
Nov 18 06:08:44 vfdb kernel: [<c013a76e>] wakeup_memwaiters [kernel] 0xde 
(0xc36b9fb0))
Nov 18 06:08:44 vfdb kernel: [<c013a541>] kswapd [kernel] 0x381 (0xc36b9fd8))
Nov 18 06:08:44 vfdb kernel: [<c0105000>] stext [kernel] 0x0 (0xc36b9fe8))
Nov 18 06:08:44 vfdb kernel: [<c0107286>] kernel_thread [kernel] 0x26 
(0xc36b9ff0))
Nov 18 06:08:44 vfdb kernel: [<c013a1c0>] kswapd [kernel] 0x0 (0xc36b9ff8))
Nov 18 06:08:44 vfdb kernel:
Nov 18 06:08:44 vfdb kernel:
Nov 18 06:08:44 vfdb kernel: CPU 2:00000000 00000000 00000000 00000000 00000000 
00000000 00000000 00000000
Nov 18 06:08:44 vfdb kernel:        00000000 00000000 00000000 00000000 
00000000 00000000 00000000 00000000
Nov 18 06:08:44 vfdb kernel:        00000000 00000000 00000000 00000000 
00000000 00000000 00000000 00000000
Nov 18 06:08:44 vfdb kernel: Call Trace:
Nov 18 06:08:44 vfdb kernel:
Nov 18 06:08:44 vfdb kernel: CPU 3:55514246 55514246 55514246 55514246 55514246 
55514246 55514246 55514246
Nov 18 06:08:44 vfdb kernel:        55514246 55514246 55514246 55514246 
55514246 55514246 51514246 55514b30
Nov 18 06:08:44 vfdb kernel:        55514246 55514246 55514246 55514246 
55514246 55514246 55514246 55514246
Nov 18 06:08:44 vfdb kernel: Call Trace:
Nov 18 06:08:44 vfdb kernel:
Nov 18 06:08:44 vfdb kernel: CPU 0:f7fb3f28 c0246f0e 00000000 00000001 ffffffff 
00000000 c010a452 c0246f23
Nov 18 06:08:44 vfdb kernel:        00000001 f66de000 00000001 c017e1bf 
f66de368 c030a284 f7fb3f74 00000000
Nov 18 06:08:44 vfdb kernel:        f7fb2000 c01226de f66de000 f66de130 
c030a284 c0304f00 00000000 c012b3e5
Nov 18 06:08:44 vfdb kernel: Call Trace: [<c010a452>] __global_cli [kernel] 
0xe2 (0xf7fb3f40))
Nov 18 06:08:44 vfdb kernel: [<c017e1bf>] flush_to_ldisc [kernel] 0x9f 
(0xf7fb3f54))
Nov 18 06:08:44 vfdb kernel: [<c01226de>] __run_task_queue [kernel] 0x5e 
(0xf7fb3f6c))
Nov 18 06:08:44 vfdb kernel: [<c012b3e5>] context_thread [kernel] 0x155 
(0xf7fb3f84))
Nov 18 06:08:44 vfdb kernel: [<c012b290>] context_thread [kernel] 0x0 
(0xf7fb3fc8))
Nov 18 06:08:44 vfdb kernel: [<c0105000>] stext [kernel] 0x0 (0xf7fb3fe8))
Nov 18 06:08:44 vfdb kernel: [<c0107286>] kernel_thread [kernel] 0x26 
(0xf7fb3ff0))
Nov 18 06:08:44 vfdb kernel: [<c012b290>] context_thread [kernel] 0x0 
(0xf7fb3ff8))
Nov 18 06:08:44 vfdb kernel:

Comment 7 jmccann 2002-11-18 23:07:54 UTC

The errata (2.4.18-18.8.0smp) did not solve the problem for me.  Approximately
10 minutes after booting the kernel the system hung as before (with no audit trail).

I am still running the new kernel but I operate off my Fast Ethernet interface
and manually unload the tg3 driver.  This is the same workaround I was using for
the 2.4.18-17.8.0smp kernel.

RH8.0 on a Dell 4600.

Comment 8 Need Real Name 2002-11-19 10:00:06 UTC

Just to add:
my problems described above at 2002-11-18 01:21:19 are for 2002-11-18 01:21:19.
so the problem is not fixed.

Also, I am running dual proc 2G Xeon CPU on two DELL 2650's which experienced 
the same problem, so it is definitely the kernel.

Comment 9 redbugs 2002-11-19 22:09:13 UTC

Please re-open this bug.  The tg3 driver is still broken in the currently
available (2.4.18-18) kernel for RH 7.3 as of Tuesday 2002-11-19.

Admittedly I get a crash now that is unequivocally caused by tg3, where before I
had a "mystery lockup" with no errors in ESM, no errors in syslog, no messages
on screen, and no response to terminal or network I/O.

Systems are two identical Dell 2650's running drbd, heartbeat, nfs in a highly
redundant configuration with a crossover 1000bt cable.

Comment 10 Need Real Name 2002-11-20 11:28:57 UTC

To add my weight: this is still a problem. I have two Dell 2650s with 2 x 2 GHz
Xeon processors. These boxes are meant to replace our old GroupWise mail
systems with a spanking Red Hat mail system. The boxes have shown this fault on
RH7.1 through RH8.0. The current config is RH7.3, installed using the Dell
install disk with ALL errata applied and using the
kernel-bigmem-2.4.18-18.7.x.i686.rpm 'fix'. The system runs sendmail, 180 * 1
MByte emails an hour, with virus scanning & spam stomping. This seems fairly
stable, approx. 2 days uptime before locking, but if I run httpd with PHP
scripts as well then the crash occurs within 15 minutes. The httpd is not under
load. A fix for this problem is sorely needed as I'm getting mud on my face at
the moment: our MS Exchange server is more stable than the Red Hat server....
Not good.

Comment 13 Need Real Name 2002-11-20 14:23:30 UTC

I've just used the "noapic" kernel boot option & have found this to make my
system a lot more stable than it has ever been. I'm compressing 6 GBytes of
data as well as carrying out the functions that the server should be doing, &
everything is running sweetly (& very fast). Prior to adding "noapic" I
would have expected the machine to have locked up by now, even without the large
compression test. I'm not saying this is a fix, as it hasn't been running long
enough... but it certainly seems to point to where the problem may lie. Any
thoughts?

Comment 14 Amit Bhutani 2002-11-20 15:42:17 UTC

The latest Red Hat errata kernel 2.4.18-18.8.0 states that it addresses the
"Kernel Crashes in TG3 Driver" issue (Bugzilla ID:69920). 
After installing the kernel-source for the errata rpm and performing a diff
between the errata kernel (2.4.18-18.8.0) and the RH 8.0 stock kernel
(2.4.18-14), it was evident that the tg3 patch was "not" included in the
errata kernel. 

Refer below for the actual patch (originally posted on Linux Kernel Mailing
List)

ChangeSet 1.790, 2002/11/14 14:43:47-05:00, davem

	Fix tg3 net driver to properly disable interrupts during some TX operations


# This patch includes the following deltas:
#	           ChangeSet	1.789   -> 1.790  
#	   drivers/net/tg3.c	1.37    -> 1.38   
#

 tg3.c |   46 ++++++++++++++++++++++++++++++++++++++--------
 1 files changed, 38 insertions(+), 8 deletions(-)


diff -Nru a/drivers/net/tg3.c b/drivers/net/tg3.c
--- a/drivers/net/tg3.c	Fri Nov 15 09:08:21 2002
+++ b/drivers/net/tg3.c	Fri Nov 15 09:08:21 2002
@@ -59,8 +59,8 @@
 
 #define DRV_MODULE_NAME		"tg3"
 #define PFX DRV_MODULE_NAME	": "
-#define DRV_MODULE_VERSION	"1.1"
-#define DRV_MODULE_RELDATE	"Aug 30, 2002"
+#define DRV_MODULE_VERSION	"1.2"
+#define DRV_MODULE_RELDATE	"Nov 14, 2002"
 
 #define TG3_DEF_MAC_MODE	0
 #define TG3_DEF_RX_MODE		0
@@ -2373,13 +2373,28 @@
 	/* No BH disabling for tx_lock here.  We are running in BH disabled
 	 * context and TX reclaim runs via tp->poll inside of a software
 	 * interrupt.  Rejoice!
+	 *
+	 * Actually, things are not so simple.  If we are to take a hw
+	 * IRQ here, we can deadlock, consider:
+	 *
+	 *       CPU1		CPU2
+	 *   tg3_start_xmit
+	 *   take tp->tx_lock
+	 *			tg3_timer
+	 *			take tp->lock
+	 *   tg3_interrupt
+	 *   spin on tp->lock
+	 *			spin on tp->tx_lock
+	 *
+	 * So we really do need to disable interrupts when taking
+	 * tx_lock here.
 	 */
-	spin_lock(&tp->tx_lock);
+	spin_lock_irq(&tp->tx_lock);
 
 	/* This is a hard error, log it. */
 	if (unlikely(TX_BUFFS_AVAIL(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
 		netif_stop_queue(dev);
-		spin_unlock(&tp->tx_lock);
+		spin_unlock_irq(&tp->tx_lock);
 		printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n",
 		       dev->name);
 		return 1;
@@ -2520,7 +2535,7 @@
 		netif_stop_queue(dev);
 
 out_unlock:
-	spin_unlock(&tp->tx_lock);
+	spin_unlock_irq(&tp->tx_lock);
 
 	dev->trans_start = jiffies;
 
@@ -2538,13 +2553,28 @@
 	/* No BH disabling for tx_lock here.  We are running in BH disabled
 	 * context and TX reclaim runs via tp->poll inside of a software
 	 * interrupt.  Rejoice!
+	 *
+	 * Actually, things are not so simple.  If we are to take a hw
+	 * IRQ here, we can deadlock, consider:
+	 *
+	 *       CPU1		CPU2
+	 *   tg3_start_xmit
+	 *   take tp->tx_lock
+	 *			tg3_timer
+	 *			take tp->lock
+	 *   tg3_interrupt
+	 *   spin on tp->lock
+	 *			spin on tp->tx_lock
+	 *
+	 * So we really do need to disable interrupts when taking
+	 * tx_lock here.
 	*/
-	spin_lock(&tp->tx_lock);
+	spin_lock_irq(&tp->tx_lock);
 
 	/* This is a hard error, log it. */
 	if (unlikely(TX_BUFFS_AVAIL(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
 		netif_stop_queue(dev);
-		spin_unlock(&tp->tx_lock);
+		spin_unlock_irq(&tp->tx_lock);
 		printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n",
 		       dev->name);
 		return 1;
@@ -2635,7 +2665,7 @@
 	if (TX_BUFFS_AVAIL(tp) <= (MAX_SKB_FRAGS + 1))
 		netif_stop_queue(dev);
 
-	spin_unlock(&tp->tx_lock);
+	spin_unlock_irq(&tp->tx_lock);
 
 	dev->trans_start = jiffies;
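
The deadlock described in the patch comment can be modeled outside the kernel. What follows is a sketch in plain C, not driver code: it represents each CPU by the locks it holds and the lock it is spinning on, then checks whether the chain of waiters loops back on itself. The names `cpu_state`, `is_deadlocked` and `scenario_deadlocks` are made up for illustration; `TX_LOCK` and `MAIN_LOCK` stand in for tp->tx_lock and tp->lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two locks named in the patch comment. */
enum lock_id { NO_LOCK = -1, TX_LOCK = 0, MAIN_LOCK = 1, NUM_LOCKS = 2 };

struct cpu_state {
    bool holds[NUM_LOCKS];  /* locks this CPU currently owns */
    enum lock_id spins_on;  /* lock it is busy-waiting for, or NO_LOCK */
};

/* Index of the CPU that owns `lock`, or -1 if nobody holds it. */
static int owner_of(const struct cpu_state *cpus, size_t n, enum lock_id lock)
{
    if (lock == NO_LOCK)
        return -1;
    assert(lock < NUM_LOCKS);
    for (size_t i = 0; i < n; i++)
        if (cpus[i].holds[lock])
            return (int)i;
    return -1;
}

/* Follow "waits for the owner of the lock I spin on" edges from `start`.
 * Arriving back at `start` means nobody in the chain can make progress. */
static bool is_deadlocked(const struct cpu_state *cpus, size_t n, size_t start)
{
    size_t cur = start;
    for (size_t steps = 0; steps < n; steps++) {
        int next = owner_of(cpus, n, cpus[cur].spins_on);
        if (next < 0)
            return false;       /* lock is free, or CPU is not waiting */
        if ((size_t)next == start)
            return true;        /* cycle closed */
        cur = (size_t)next;
    }
    return false;
}

/* The interleaving from the patch comment. cpus[0] is CPU1 inside
 * tg3_start_xmit, cpus[1] is CPU2 inside tg3_timer. */
static bool scenario_deadlocks(bool irqs_disabled_under_tx_lock)
{
    struct cpu_state cpus[2] = {
        { .holds = { [TX_LOCK] = true },   .spins_on = NO_LOCK },
        { .holds = { [MAIN_LOCK] = true }, .spins_on = TX_LOCK },
    };

    /* tg3_interrupt can only run on CPU1 if hardware IRQs are enabled
     * there; once it runs, it spins on tp->lock. */
    if (!irqs_disabled_under_tx_lock)
        cpus[0].spins_on = MAIN_LOCK;

    return is_deadlocked(cpus, 2, 0);
}
```

Here `scenario_deadlocks(false)` models the plain spin_lock() case and reports a cycle, while `scenario_deadlocks(true)` models spin_lock_irq(), where the interrupt handler cannot run on CPU1 while tx_lock is held, and reports none. The model covers only the single interleaving drawn in the comment.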

Comment 15 Jefferson Ogata 2002-11-20 18:53:54 UTC

FWIW, I diffed the 2.4.18-17.7.x and 2.4.18-18.7.x tg3 sources and there are
definitely some differences. I seem to recall the spin_lock and spin_lock_irq
calls differ, but I don't think the stuff about TX_BUFFS_AVAIL differed.

This bug is still marked CLOSED. Hey Red Hat, please REOPEN.

Comment 16 Jeff Garzik 2002-11-20 19:36:17 UTC

Created attachment 85747 [details]
Possible fix

Comment 17 Jeff Garzik 2002-11-20 19:36:57 UTC

Everyone: please try the patch I just attached to this bug report, and see if it
fixes the problem.

Comment 18 Need Real Name 2002-11-21 13:31:05 UTC

Does anyone have a compiled kernel with this patch for the Dell 2650?

Comment 19 Matthew Melvin 2002-11-21 21:37:49 UTC

FWIW I've had success with this patch on a Dual 2Ghz Xeon IBM 335 xserver with 2
tg3 NICs.  Previously a large rdist to this box would cause it to hang within
about an hour under both 2.4.18-17.7.xsmp and 2.4.18-18.7.xsmp kernels. With the
patch applied to 2.4.18-18.7.xsmp it has been going without fault for the last
12 hours.

Comment 20 Jefferson Ogata 2002-11-22 18:01:29 UTC

Regarding 2.4.18-18.7.x (not using the patch): this is much more unstable than
2.4.18-17.7.x was. I have a 2650 that hadn't crashed at all -- since installing
the new kernel it won't stay up more than 24 hours. I'm building a patched
kernel to test on that machine.

Comment 21 Leos Bitto 2002-11-24 19:53:59 UTC

I have experienced a kernel crash with 2.4.18-18.7.xsmp on an HP ProLiant DL580
G2 (4*Xeon 1.6 GHz, 2 GB RAM) just a few hours after I rebooted from
2.4.18-10smp. This is what 2.4.18-10smp says:

tg3.c:v0.99 (Jun 11, 2002)
eth0: Tigon3 [partno(284685-001) rev 0105 PHY(5701)] (PCIX:100MHz:64-bit) 
10/100/1000BaseT Ethernet xx:xx:xx:xx:xx:xx
eth0: Link is up at 100 Mbps, full duplex.
eth0: Flow control is off for TX and off for RX.

The NIC has always been connected to a 100 Mbps port at the Cisco Catalyst 
switch. It has never been running at 1000 Mbps. 2.4.18-10smp is running just 
fine (if you ignore its security problems).

Comment 22 Jeff Garzik 2002-11-25 15:26:19 UTC

*** Bug 78166 has been marked as a duplicate of this bug. ***

Comment 23 David Morse 2002-11-28 02:03:05 UTC

With a stock RH 8.0 kernel (2.4.18-14smp), tg3 v1.0 driver, a PowerEdge 1655MC 
(2x1.26GHz PIIIs, dual onboard BCM95703A31) would lock hard after 5-10 minutes 
of heavy network traffic (primarily transmits).

Tried jgarzik's 2.4.18-19.7.tg3.120smp kernel (with tg3 v1.2) on the same box, 
and running the same heavy TX load (both interfaces sending ~47MB/sec), it's 
been up for 8 hours straight.

Thus, at least in my case, it has fixed the lockup problem.

Thanks Jeff!

Comment 24 Jeff Garzik 2002-11-28 03:31:18 UTC

Thanks for the feedback so far.  To make things easier to access and test, I
have made available a drop-in tg3.c and tg3.h which should fix the tg3 crashes,
and have also created rpms (including source rpm) to make things easier to test.

Drop-in source code:
http://people.redhat.com/jgarzik/tg3/tg3-1.2/

Unofficial test rpms:
http://people.redhat.com/jgarzik/tg3/tg3-1.2/rpms/

(disclaimer/warning:  these rpms are unofficial, and should not be used in
production.  they have not gone through a full battery of Red Hat Q/A tests.  if
they damage your computer hardware, software, or scare your cat, it's not my fault.)

Comment 25 Ole Craig 2002-11-29 18:55:39 UTC

Hi -
	Just wanted to chime in on this bug. We upgraded to
2.4.18-18.7.x on a machine with a 3com 3c996T ethernet card, and had
*severe* stability problems: after an indeterminate period ranging
between five minutes and three hours, the system would hang. No
OOPSes, logged errors to syslog, or other diagnostics, just locked up
solid and requiring a hard reset. This system had been rock-steady
under the stock 7.3 kernel and the 2.4.18-5 errata kernel.
Retrograding to the 2.4.18-10bigmem kernel appears to have solved the
problem for now, but I'll definitely wait for a resolution on this
bugzilla entry before installing a newer kernel...


Hardware: dual 933MHz Piii system, SuperMicro 370DE6 mainboard, 2G
RAM, 3com 3c996-T gigabit ethernet NIC, U180 scsi system disk and 6-disk RAID
array (off of a Mylex A352.)

more-or-less stock RH7.3 with all relevant errata updates (except the
kernel, as detailed above.) Non-stock SW includes local Apache+mod_ssl
and php, and Sendmail 8.12.6/mimedefang/spamassassin. Primary
functions: mail server (averaging ~200 emails/hour), webmail server,
NIS master, and NFS server (with about 50 client machines.)

Comment 26 Chris Haag 2002-12-02 19:21:50 UTC

I have solved the problem on our machine using the "noapic" option for the 
kernel line in grub.conf. 

Machine: Dell PowerEdge 2650, dual Xeon 2.4 GHz, dual Broadcom 10/100/1000
Ethernet adapters (tg3 driver!), 1 GB RAM, onboard SCSI RAID

Symptom: 
- freezing, after 30 minutes or up to 10 hours
- sometimes displaying a message on console, sometimes not
- no response to ping
- fans running faster
- hard reset required

RedHat: Kernel 2.4.18-17.7.x and 2.4.18-18.7.x (SMP versions)

Solution: adding "noapic" in grub.conf, machine is now up for 4 1/2 days, no 
more trouble

Credits: Thanks to Guiseppe Raimondi from RedHat/UK

Next Steps: Maybe I will try jgarzik's new kernel with corrected tg3 drivers 
(hmm it's a production machine, see if I wait for the next official kernel...)

Idea: As far as I understand, apic distributes the interrupts to the four 
logical processors. Is it possible, that the tg3 driver is faulty in that area? 
When not using apic it works fine.

Still strange: I updated from the -17 kernel to the -18 kernel. Using the -17
kernel everything was running properly. After doing the update, booting with
the -18 *AND* the -17 kernel produced the same problem of freezing the machine.
Is something wrong with the kernel update procedure, or did I misunderstand
something? I thought I could go back to a previous kernel?

Comment 27 David Morse 2002-12-02 22:24:39 UTC

Jeff,
I left two systems w/ tg3 v1.2 running heavy network traffic continuously over 
the holidays.  It was still running fine until I stopped it this afternoon (5 
days straight!).

Tried tg3 v1.2txlock from
http://people.redhat.com/jgarzik/tg3/tg3-1.2txlock/

Here's some performance data from running netperf between 2 PE1655MC blades, 
each with 2 integrated BCM5703 NICs:
- Both ends using bcm5700 2.2.26: ~50.3MB/sec per NIC
- Both ends using v1.2 (no txlock): ~71MB/sec per NIC
- Both ends using v1.2txlock: ~71MB/sec per NIC

1.2txlock seems to be just as stable and performance seems to be equivalent (at 
least in this test).

Comment 28 Thomas J. Baker 2002-12-03 12:56:27 UTC

My original problem was with two Dell PE2550s and the problem was fixed with the
2.4.18-17.7.x or maybe the -10 one. But I've got a new 2650 with dual 2.8GHz P4
Xeons and 6GBs of memory and it hasn't gone overnight without hanging. There is
no debugging information at all, just a consistent hang. I've tried loading the
network and it seems fine but by the next morning, it's hung again. The test
kernel with the TG3 1.2 driver didn't make a difference. Admittedly, it could be
because of something else but it appears others are having trouble with the
2650s too.

Comment 29 Need Real Name 2002-12-03 13:08:01 UTC

I had this problem (see my previous posts on this bug) with a dual Xeon Dell
2650 with 1 GByte of memory. I've found that the xbigmem kernel on RH7.3,
together with the drivers supplied on the Dell site for the Broadcom network
card for RH7.2, has given me a stable platform. I used to get a lockup (no
debug output) after 4 hours or so; uptime so far is 12 days with this
combination, without adding the noapic option to the grub config. Basically I
have two servers that were displaying the problem under light load and now
don't, even under heavy load.

Hope this is of use.

Comment 30 Christopher McCrory 2002-12-03 17:05:52 UTC

Does the near future hold an errata kernel?

Comment 31 Alex Finkel 2002-12-04 16:36:46 UTC

I have this problem with a Dell 2650 2x2.4Ghz CPU/2GB RAM dual BCM5701 gbit
ethernet running RH 8.0 with Kernel 2.4.18-18.8.0smp.

Reproducible hang while doing a recursive scp of a single directory from the
Dell 2650 to another machine. The directory contains about 20+ files and totals
about 1 MB of data.  Approximately halfway through the directory, the system hangs.

Problem occurs with hyper-threading enabled or disabled.

Comment 32 Thomas J. Baker 2002-12-04 18:26:18 UTC

The noapic kernel parameter seems to keep the system from hanging in my case. I
don't know if that helps in the debugging or not.

Comment 33 Jeff Garzik 2002-12-05 18:46:00 UTC

Based on feedback, I will confirm that tg3 driver version 1.2 definitely fixes
these problems.

Users can get unofficial rpms containing these fixes from:
http://people.redhat.com/jgarzik/tg3/tg3-1.2/rpms/

or simply download the tg3 1.2 source code and drop it into your current kernel
build, from
http://people.redhat.com/jgarzik/tg3/tg3-1.2/

or simply download the latest stock kernel, 2.4.20,

or download the latest Red Hat rawhide kernel,

or wait for the next Red Hat release.

If you continue to see crashes with tg3 1.2, please open a new bug.

Comment 34 Jeff Garzik 2002-12-09 17:17:15 UTC

*** Bug 78822 has been marked as a duplicate of this bug. ***

Comment 35 Jeff Garzik 2002-12-11 20:32:48 UTC

*** Bug 78427 has been marked as a duplicate of this bug. ***

Comment 36 Chris Haag 2002-12-13 14:58:42 UTC

My test results so far:

1) Kernel 2.4.18-17.7.x -> OK
2) Update to Kernel 2.4.18-18.7.x -> crashes after 30 min. up to 12 hours (6 
times)
3) Running 2.4.18-18.7.x with noapic option -> crashes after approx. 7 days (2 
times)
4) Jeff Garzik's tg31.2 (120) kernel -> crash after 5 hours (once)

What are the experiences using Jeff Garzik's txlock (121) kernel? Anyone 
succeeded on a Dell Poweredge (mine is a 2650 dual xeon, dual broadcom)? Anyone 
out there still having troubles with a Dell Poweredge? (My so called solution 
using noapic option was in fact wrong - I wrote it too early)

(For more

Comment 37 Gary Mansell 2002-12-13 15:10:05 UTC

The bug still exists as far as I am concerned, the new tg3 module does not fix
the problem that I am seeing nor does the noapic option to the kernel.

I have had to go back to using the bcm5700 module instead of the tg3. I have
been up for 48hours now, and counting...

I have a dell PE2650 2x2.4Ghz Xeons, 2Gb RAM 2x HW mirrored sysdisks, 500Gb RAID
5 array attached via 2xQLA2310F cards. and an autoloader attached via SCSI card.
The machine has twin onboard Broadcom network cards

Comment 38 Chris Haag 2002-12-13 17:27:34 UTC

Since it seems that this bug is not solved, could you reopen it, Jeff?

I still have an 8 k$ machine here that is not reliable, and I think I am not
the only one. It does not make sense to open a new bug, as the history could be
useful.

Comment 39 Jeff Garzik 2002-12-20 18:00:50 UTC

People are still seeing problems, re-opening bug.

Comment 40 Need Real Name 2002-12-31 18:59:50 UTC

I have a Dell PowerEdge 2650 dual Xeon 2.8 GHz, dual on-board Broadcom Gigabit 
NICS, with 6 gigabytes of memory. I have also been experiencing the system 
crashes on Redhat 8.0 - most likely due to the tg3 driver. I am running the 
bigmem version of the kernel.

By running PostgreSQL heavily, I am able to cause the system to crash 
regularly. With kernel-bigmem-2.4.18-19.8.0, the system crashed after eight 
days. That is actually an improvement over kernel-bigmem-2.4.18-18.8.0, which I 
was able to crash consistently after less than one day. So it looks like an 
improvement was made, but there are still bugs in the driver.

I don't know if this is related, but I am seeing the following messages 
in /var/log/messages (once or twice a day):

kernel: ENOMEM in do_get_write_access, retrying.
kernel: ENOMEM in journal_alloc_journal_head, retrying.

Comment 41 Jeff Garzik 2002-12-31 22:35:57 UTC

To all still experiencing problems,

1) please boot with "noapic" on the kernel command line.  You can run "cat
/proc/cmdline" to check for sure.

2) I have posted some new rpms for testing, based on the latest errata:

latest production tg3 release, 1.2a, built into unofficial rpms:
http://people.redhat.com/jgarzik/tg3/tg3-1.2a/rpms/

but I would like people to test my experiment which should provide additional
stability:
http://people.redhat.com/jgarzik/tg3/tg3-1.2a/exp1-rpms/

...and if that doesn't work for people, fall back to experiment 2:
http://people.redhat.com/jgarzik/tg3/tg3-1.2a/exp2-rpms/

Feedback requested!  On several systems, there is evidence that the lock-ups are
not directly related to driver but more to system board.  So please make sure to
attach 'dmesg' and 'lspci -vvv' output in future bug reports.
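The check in step 1 can also be scripted when several machines need to be verified. The following is a minimal sketch in Python, assuming the option shows up as a whitespace-separated token in /proc/cmdline; the helper names `has_boot_option` and `running_kernel_has` are made up for illustration.

```python
def has_boot_option(cmdline: str, option: str = "noapic") -> bool:
    """Return True if `option` appears as a whole token on a kernel command line."""
    # Boot options are whitespace separated. Comparing whole tokens means a
    # longer option that merely starts with "noapic" is not counted.
    return option in cmdline.split()


def running_kernel_has(option: str = "noapic", path: str = "/proc/cmdline") -> bool:
    """Read the running kernel's command line (Linux only) and look for `option`."""
    with open(path) as f:
        return has_boot_option(f.read(), option)
```

Calling `running_kernel_has()` on the affected machine does the same job as reading the output of "cat /proc/cmdline" by eye.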

Comment 42 Ian McGuire 2003-01-01 06:01:08 UTC

I don't know if this helps; but, I have two Dell PE 2650's (dual 2.4 GHz P4 
Xeon) connected to a 100 Mbps full duplex switch and running the 2.4.18-10smp 
kernel that are very stable (90+ days under moderate to heavy load).  I have a 
third PE 2650 with the same hardware configuration running the 2.4.18-19smp 
kernel that won't run for more than 2 days without locking up under virtually 
no load.

I will try the tg3 release 1.2a kernel.

Comment 43 Chris Haag 2003-01-02 19:05:55 UTC

After several tests (please see above), I found that using the bcm5700 
driver instead of the tg3 driver works fine on my Dell PE 2650, dual Xeon 2.4 
GHz, dual Broadcom 10/100/1000. Even without the kernel "noapic" option, the 
system has now been up for 17 days.

I will try the tg3 1.2a driver/kernel and report the results.

Comment 44 Jeff Garzik 2003-01-02 19:18:46 UTC

Please try "experiment 1" rpms, as well.

These are for testing potential "tg3 lockup" problems.  tg3 1.2a is a
maintenance release which should improve performance and fix a PXE issue,
but does not directly address the lockup problems people are seeing.

The driver version string will show up as "tg3 1.2a+exp1" after bootup.

Comment 45 Thomas J. Baker 2003-01-02 19:22:55 UTC

As another data point, I recently upgraded a Dell PowerEdge 2550 (dual P3 933,
2.5GB RAM) to Red Hat 8.0 and had the system hang overnight when using the tg3
gig port and kernel 2.4.18-19.8.0smp. I then switched it to use the eepro100
port and it has been up a week without problems. There was nothing in the logs
about the hang. Unfortunately, I can't do much testing as all the machines
experiencing the problem are production systems.

Comment 46 Randy Berdan 2003-01-03 02:50:01 UTC

I also have a Dell 2650 dual 2.4 GHz with 3GB of RAM and a 64GB RAID array that 
is locking up on me with a 2 or 3 day frequency.  

I tried the experimental kernel rpms 
http://people.redhat.com/jgarzik/tg3/tg3-1.2a/exp1-rpms/
http://people.redhat.com/jgarzik/tg3/tg3-1.2a/exp2-rpms/
and http://people.redhat.com/jgarzik/tg3/tg3-1.2a/exp3-rpms/

I can't speak yet to the lock up, but I have lost all network capability.  I 
can't ping anything. The interface is up, has a link light, and works when I 
switch to a non-experimental kernel.

Side question: With a dual proc server and 3GB of RAM, which kernel do you 
want? bigmem or smp?

Comment 47 Need Real Name 2003-01-20 20:14:57 UTC

kernel-smp-2.4.18-19.7.tg3.126.i686.rpm doesn't work at all for
me. I get no networking (ping etc.). This is with or without
the noapic set. To test things, I unloaded the tg3 and
then loaded the bcm5700 driver. This made networking work again.

Comment 48 Need Real Name 2003-01-20 20:17:18 UTC

Created attachment 89446 [details]
lspci -vvv for 2650 with tg3 problems

Comment 49 Need Real Name 2003-01-20 20:19:51 UTC

Created attachment 89447 [details]
dmesg for  2650 with tg3 problems

Comment 50 Jeff Garzik 2003-01-20 20:56:33 UTC

Ok, some of these reports have actually been fixed in more recently posted rpms.

Just to get everybody on the latest page, please use "aragorn2" test rpms,
posted at http://people.redhat.com/jgarzik/pub/

This is the latest Red Hat errata kernel for 7.x/8.x, with the recent tg3 bug fixes.

Comment 51 Scott Comboni 2003-01-27 16:04:31 UTC

I have a Dell 2650 and have tried all mentioned with no luck.  I recently tried
aragorn2 and noapic which seems to help.  Without noapic it crashes hard. Is
there indeed a fix for this? 

Thanks Scott

Comment 52 Jeff Garzik 2003-01-27 16:13:18 UTC

Ladies and gentlemen,

I have received permission to post the latest release candidate of
Red Hat's errata kernel.  It contains not only fixes for e1000 and tg3 
net drivers, but also system-level fixes which may address the problems 
users on this list were seeing.

This kernel is currently in Red Hat Q/A, and has NOT yet been 
"qualified" as official, nor has it been released.  

Errata kernel 21 release candidate, for Red Hat 8.0:
        http://people.redhat.com/jgarzik/pub/2.4.18-21.8.0/

Errata kernel 21 release candidate, for Red Hat 7.x:
        http://people.redhat.com/jgarzik/pub/2.4.18-21.7.x/

It is requested that people who were seeing crash problems test this 
kernel, as this will be the next official Red Hat errata kernel, after 
it passes Q/A.

Comment 53 Thomas J. Baker 2003-01-28 12:57:44 UTC

Can the noapic option be removed with these latest kernels?

Comment 54 Daniel Grandjean 2003-01-30 10:34:02 UTC

With the tg3 driver, in promiscuous mode, the server hang occurs within 
the next 30 minutes.
The same setup with bcm5700 driver does not hang.

Machine: Dell Poweredge 2650, dual Xeon 2.4 GHz, dual Broadcom 10/100/1000 
Ethernet adapters (tg3 driver), 1 GB RAM, Onboard SCSI Raid, 
100Mbps Ethernet port.
The port traffic is heavy as it is the span of a busy subnet.

RedHat 7.3 up2date, Kernel 2.4.18-19.7.xsmp
but no hang  with 2.4.18-19.7.xdebug

Thanks Daniel.

Comment 55 Gary Mansell 2003-02-05 14:14:20 UTC

I can confirm that the latest production released Redhat kernel 2.4.18-24.7.xsmp
does not fix the problem. My PE2650 crashed in the usual manner after about 5
hours of normal (minimal) activity.

I am concerned that the bcm5700 modules (the only work around) do not exist in
/lib/modules for this new kernel - it would appear that they have been
deprecated. This is unacceptable to me as my machine has run for two months on
these modules perfectly fine. Hence I cannot run the latest kernel and have had
to revert my machine to the 2.4.18-18.7.xsmp kernel with the bcm5700 kernel module.

I also have a call (ref #222224) logged with Redhat's Patrick Ernzer
(pernzer), who has been working with Dell UK to try to resolve this issue
for me for the last 4 months.

I will also submit this report to bug #79997 on bugzilla.redhat.com as I am not
sure which bug I am actually suffering from.

Comment 56 Scott Comboni 2003-02-05 17:23:17 UTC

I upgraded my redhat 7.3 installation based on this posting
http://people.redhat.com/jgarzik/pub/2.4.18-21.7.x/
and I have not suffered a system lockup since.

Scott

Comment 57 Jefferson Ogata 2003-02-06 09:03:37 UTC

Yesterday I installed 2.4.18-24.7.xbigmem on a Dell PE2550 with Red Hat 7.3,
fully patched. The new kernel hung after less than 12 hours. No apparent
difference over 2.4.18-19.

The same system was completely stable with an Intel e1000 card on 2.4.18-18. I
got hopeful when 2.4.18-19 came out and went back to the built-in Broadcom. Woe
is me.

Comment 58 Rick Gaudette 2003-02-10 17:25:27 UTC

I'll second the above results.  On a dual Xeon, 2.4.18-24.7.xsmp hung in less
than 12 hours.

On a much more important note, where do I find previous kernels to back out this
kernel?  RHN deleted all of the kernel source directories we had, which are
needed to build external modules for this machine.

Comment 59 Need Real Name 2003-02-11 17:48:23 UTC

I am also reliably seeing this on an IBM x335 single 2.00ghz xeon running the
smp kernel to get the hyperthreading support.  Would running the uniprocessor
kernel likely fix the problem?  It reliably crashes every 10 hours or so when
under network load.

Comment 60 Christopher McCrory 2003-02-11 18:13:29 UTC

FYI,  I ran a test and was able to crash my test box within an hour or so
RH 7.3 , kernel-smp-2.4.18-24.7.x.i686, dell 2550 , tg3 


I tested using ttcp  (test tcp)

something like this

receiver:
while true ; do ttcp -r -s ; done


sender:
while true ; do ttcp -t -s receiver.ip.address ; done


I tested with different amounts of packets, e.g. -n 100000


FYI, ifconfig shows total traffic throughput, but loops at, IIRC, 4 gigs
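A wrap at about 4 gigs is what a 32-bit byte counter would do, since it rolls over at 2^32 bytes. When comparing two ifconfig samples during a long ttcp run, the rollover can be handled as in this minimal Python sketch. It assumes a 32-bit unsigned counter and at most one wrap between samples; `counter_delta` is a hypothetical helper, not part of any tool mentioned here.

```python
COUNTER_BITS = 32            # assumption: the byte counter is 32 bits wide
MODULUS = 1 << COUNTER_BITS  # 2**32 bytes, i.e. 4 GiB


def counter_delta(previous: int, current: int, modulus: int = MODULUS) -> int:
    """Bytes transferred between two samples of a wrapping counter.

    Correct only if the counter wrapped at most once between the samples.
    """
    # Modular subtraction: if `current` is smaller than `previous`, the
    # counter rolled over and the modulus is effectively added back in.
    return (current - previous) % modulus
```

Sampling often enough that less than 4 GiB passes between readings keeps the single-wrap assumption valid.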

Comment 61 Jeff Garzik 2003-02-11 18:19:22 UTC

Responding to the last message, yes, it would be a useful datapoint to determine
if tg3 still crashes for you, on a uniprocessor kernel.

Also, make sure you are updated to the latest kernel errata, which includes
several tg3 bug fixes.

Comment 62 Need Real Name 2003-02-11 21:44:15 UTC

Oh sorry, I should have noted that we were running the 2.4.18-24.7.xsmp kernel
from out of errata updates.  I am now switching it over to run the uniprocessor
version of this kernel and will report back tomorrow with whether or not it
still locks up.

Comment 63 Need Real Name 2003-02-12 17:43:39 UTC

Running in uniprocessor mode with the 2.4.18-24.7.x, the server has been up for
19:56 without crashing.  I believe if it stays up for one more day this might be
a reasonable workaround for the short term.

Comment 64 John Sopko 2003-02-12 18:03:52 UTC

I confirmed Comment #60. I ran 2 of my 2650 single processor servers one with 
noapic one with apic using the 2.4.18-24.7.xsmp kernel, (I want to use the
hyper-threading).

I downloaded the ttcp software from:

http://www.linuxtested.com/linux_tools.html

I set up a link so each system would transmit to and receive from the
other. The server with noapic stayed up; the other died.

Jeff,

I will test again with  2.4.18-24.7.xsmp kernel and noapic on both systems.

What do you recommend, should I use the "noapic" option and single processor
kernel?

Comment 65 Need Real Name 2003-02-12 18:07:16 UTC

Looks like I was incorrect.  One of my colleagues rebooted these machines due to
a hang earlier today.  It would appear the uptime is incorrect because the
hwclock was off and when the system came up it synced it's clock into the
future. :-(

If I use the noapic option, I will lose the high resolution timer on my smp
system correct?  If so, I cannot afford to do this.  I was going to go back and
try the bcm5700 driver, but it appears to not be included in the new kernels.

Comment 66 Jim Laverty 2003-02-12 18:27:36 UTC

Question on the Uniprocessor test:  

What is the current packet count (e.g. > 2.4 billion packets)?  

If you flood the adapter (e.g. with mgen, ttcp, etc) does the box stay connected/up?

We are currently rebooting our Dell SMP based servers every ten (10) business
days, which is approximately when we are hitting the packet limit.  We have been
staying on 2.4.18-5xsmp and testing with each of the latest kernels to no avail. 

Is anyone using any scripts and/or test cases to accelerate their testing of the
driver/kernel?  We have been using mgen to flood the NIC, it has given us very
fast results vs. waiting for the box to die over time.  I'm going to look at
using ttcp also.

What details beyond "this doesn't work" can we supply to help with the debugging
(e.g. logs, stats, etc)?

I feel for Jeff, as Broadcom is historically not very responsive to their user
community's needs.  Even leaning on our hardware vendors who supply the Broadcom
NICs in their servers has gone nowhere, their tech support groups have gotten
nowhere with Broadcom.  We need our server vendors to back us and refuse to
resell/use their NICs, unless they open up their code and specs to the *nix
community.

Comment 67 Need Real Name 2003-02-12 18:32:14 UTC

Unfortunately, I'm unaware what the packet count was before it died.  As I
mentioned, one of my colleagues snuck in and rebooted the box unbeknownst to me,
so I was rejoicing over a false test result. :-(

We have a demo that we're giving in a few hrs, then I'll switch back to the old
(kernel-smp-2.4.18-19.7.x) kernel and the bcm5700 driver and try that on one of
the machines.  More later.

Comment 68 Dirk Hufnagel 2003-02-12 19:35:08 UTC

What crashes a machine instantly for me when using the tg3 driver
is netperf. Tests were run on a Tyan K7X (760MPX) with dual Athlon
MP2000+ CPUs and a 3Com 3C996-T Gigabit NIC.

Netperf just blasts TCP packets at full speed from one machine
to another. Run it as follows:

  - start 'netserver' on the target machine.

  - start 'netperf -H <target> -t TCP_STREAM -n 2' on sender to send tcp
    packets for 10s at maximum speeds using two cpus to the target

Kernel 2.4.18-24.7.x with the tg3 driver crashes instantly if it's the
target machine. Strangely enough, it works if it's the sender.

BTW, before anybody asks, the second machine for the rate tests is a
Tyan K7 (760MP) with dual Athlon MP2000+ and an Intel Gigabit Server NIC.

Comment 69 Carl Litt 2003-02-12 19:40:26 UTC

Created attachment 90037 [details]
PE2650 crash screen from kernel-2.4.18-24.7.xsmp.i686.rpm

This is the crash screen from a PE2650/dual onboard BCM95701A10, taken from the
remote access console.  Kernel was kernel-2.4.18-24.7.xsmp, command line was
"ro root=/dev/sda2 nmi_watchdog=1".

Comment 70 Carl Litt 2003-02-12 19:48:26 UTC

Attached the crash screen from my PE2650/dual X1.8/dual BCM95701A10 running 
kernel-2.4.18-24.7.xsmp.  Crash appears to be in the tg3 code.  Kernel command 
line was "ro root=/dev/sda2 nmi_watchdog=1".  I estimate the uptime was just 
over 1 day.  This machine is in development, and was doing nothing but running 
setiathome.  It barely had any Ethernet activity.  Machine must go into service 
next week.

Red Hat: Put the bcm5700 module back in the kernel tree, tg3 is clearly not 
stable.  People need a stable production kernel, we're not here to debug.

Comment 71 Carl Litt 2003-02-13 07:05:06 UTC

Crashed again.  Caught the full output on a serial console:

NMI Watchdog detected LOCKUP on CPU3, eip e0ccbae0, registers:
esm ppp_async ppp_generic slhc racser tg3 ipt_LOG ipt_state ip_conntrack_ftp 
ip_conntrack iptable_filter ip_tables usb-ohci usbcore reiserfs lvm-mod aacraid 
s
CPU:    3
EIP:    0010:[<e0ccbae0>]    Tainted: P 
EFLAGS: 00000086

EIP is at .text.lock.tg3 [tg3] 0xa4 (2.4.18-24.7.xsmp)
eax: c03de300   ebx: d4435d80   ecx: d4435d80   edx: c03de3f4
esi: e0cc80d0   edi: 00000282   ebp: 00000000   esp: d3103f38
ds: 0018   es: 0018   ss: 0018
Process setiathome (pid: 1251, stackpage=d3103000)
Stack: d4435ebc e0cc80d0 00000180 c012635b d4435d80 d3103f54 00000086 d3103f54 
       d3103f54 00000000 00000001 00000180 00000000 c012256b c03ce600 c0122411 
       00000000 00000001 c03a8980 fffffffe 00000003 c012219b c03a8980 00000046 
Call Trace: [<e0cc80d0>] tg3_timer [tg3] 0x0 (0xd3103f3c))
[<c012635b>] timer_bh [kernel] 0x29b (0xd3103f44))
[<c012256b>] bh_action [kernel] 0x4b (0xd3103f6c))
[<c0122411>] tasklet_hi_action [kernel] 0x61 (0xd3103f74))
[<c012219b>] do_softirq [kernel] 0x6b (0xd3103f8c))
[<c010a8b0>] do_IRQ [kernel] 0x100 (0xd3103fa8))
[<c010d078>] call_do_IRQ [kernel] 0x5 (0xd3103fc0))


Code: 80 3b 00 f3 90 7e f9 e9 ee c5 ff ff 80 7b 2c 00 f3 90 7e f8 
console shuts up ...
 [ <c01e8744>] netif_receive_skb [kernel] 0x184 (0xd3105ec0))
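For anyone collecting several of these traces, the frames can be pulled out mechanically to see which modules are involved. This is a rough Python sketch, assuming the frame layout shown above ("[<address>] symbol [module] 0xoffset"); the regular expression and the helper names are illustrative only and have not been checked against other oops formats.

```python
import re

# One frame looks like: [<c012635b>] timer_bh [kernel] 0x29b (0xd3103f44))
FRAME_RE = re.compile(
    r"\[\s*<([0-9a-f]+)>\]\s+(\S+)\s+\[([^\]]+)\]\s+(0x[0-9a-f]+)"
)


def parse_frames(text):
    """Return one dict per call-trace frame found in `text`, in order."""
    return [
        {
            "addr": int(addr, 16),
            "symbol": symbol,
            "module": module,
            "offset": int(offset, 16),
        }
        for addr, symbol, module, offset in FRAME_RE.findall(text)
    ]


def modules_in_trace(text):
    """Modules that appear in the trace, in order of first appearance."""
    seen = []
    for frame in parse_frames(text):
        if frame["module"] not in seen:
            seen.append(frame["module"])
    return seen
```

Run against the trace above, the first frame it reports is tg3_timer in the tg3 module, matching the "EIP is at .text.lock.tg3" line.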

Comment 72 John Sopko 2003-02-13 13:40:38 UTC

I have not been able to lock my system up as long as I use the "noapic"
option. From all the comments it is difficult to tell if this is the case.
Has anyone crashed with the 2.4.18-24.7.xsmp kernel and noapic set?

On my 2 Dell 2650 servers (single Xeon processor, which emulates 2 logical 
processors), while running 2.4.18-24.7.xsmp with noapic set,
I ran ttcp tests for about 19.5 hours, setting up each system to both
transmit and receive between each other, without a problem.

I could not get the "netperf" to compile so I could not test as 
comment #68 did.

Here is the packet count from netstat for each system, note the first
system has been up for just over 3 days. The other system crashed
when I did not have "noapic" set but has been up ever since with 
noapic set:


sopko@firebird:8% uname -a
Linux firebird.cs.unc.edu 2.4.18-24.7.xsmp #1 SMP Fri Jan 31 06:10:55 EST 2003 
i686 unknown
sopko@firebird:9% uptime
  8:30am  up 3 days, 9 min,  5 users,  load average: 0.10, 0.08, 0.08
sopko@firebird:10% netstat -s
Ip:
    676708765 total packets received
    0 forwarded
    0 incoming packets discarded
    675390068 incoming packets delivered
    494066595 requests sent out
    1416872 reassemblies required
    366370 packets reassembled ok
    791 fragments created

sopko@rockx:1% uname -a
Linux rockx.cs.unc.edu 2.4.18-24.7.xsmp #1 SMP Fri Jan 31 06:10:55 EST 2003 
i686 unknown
sopko@rockx:2% uptime
  8:30am  up 19:26,  4 users,  load average: 0.01, 0.03, 0.00
sopko@rockx:3% netstat -s
Ip:
    312139470 total packets received
    0 forwarded
    0 incoming packets discarded
    312006240 incoming packets delivered
    314833333 requests sent out
    736 reassemblies required
    203 packets reassembled ok
    349 fragments created
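Because the two systems have very different uptimes, the raw counts are easier to compare as average rates. Here is a small Python sketch that does the arithmetic, assuming the netstat counters started at zero at boot; the function names are made up for illustration.

```python
def uptime_seconds(days=0, hours=0, minutes=0):
    """Convert an uptime such as '3 days, 9 min' or '19:26' to seconds."""
    return (days * 24 + hours) * 3600 + minutes * 60


def average_packet_rate(total_packets, seconds):
    """Average packets per second over the given interval."""
    if seconds <= 0:
        raise ValueError("interval must be positive")
    return total_packets / seconds


# Figures copied from the netstat and uptime output above.
firebird_rate = average_packet_rate(676708765, uptime_seconds(days=3, minutes=9))
rockx_rate = average_packet_rate(312139470, uptime_seconds(hours=19, minutes=26))
```

Both machines come out at a few thousand received packets per second on average, with rockx the higher of the two. That average includes idle time before the ttcp run started, so the rate during the test itself would have been higher.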

Comment 73 Dirk Hufnagel 2003-02-13 17:43:07 UTC

I remember that compiling netperf wasn't trivial, I had to
do a few changes in the Makefile. If anybody wants to try
my RH 7.2 executables, here is a link. If you want to compile
it yourself, there also is a link to the modified makefile
(for netperf 2.2pl2).

http://www.physics.ohio-state.edu/~hufnagel/netperf
http://www.physics.ohio-state.edu/~hufnagel/netserver
http://www.physics.ohio-state.edu/~hufnagel/makefile

Comment 74 Christopher McCrory 2003-02-16 06:29:58 UTC

Tested kernel kernel-smp-2.4.18-24.7x.legolas2.i686.rpm from
http://people.redhat.com/jgarzik/pub/legolas2-7.x/i686/

I'm running ttcp and 'RX bytes:' has looped four or five times.  Last time, it
crashed after one or two.

Yea!

Comment 75 John Sopko 2003-02-17 12:52:16 UTC

My two Dell 2650 single processor servers running with the multi-processor
kernel and my one dual processor Dell 2650 have been up for almost a week
using the "noapic" option. I have been running the ttcp network test software
on all 3 systems over the weekend. One system has received 256 million packets,
the others 1.8 billion and 2.5 billion:

sopko@firebird:6% uname -a
Linux firebird.cs.unc.edu 2.4.18-24.7.xsmp #1 SMP Fri Jan 31 06:10:55 EST 2003 
i686 unknown
sopko@firebird:7% uptime
  7:35am  up 6 days, 23:15,  5 users,  load average: 0.02, 0.09, 0.11
sopko@firebird:8% netstat -s|head -9
Ip:
    256585885 total packets received
    0 forwarded
    0 incoming packets discarded
    253744336 incoming packets delivered
    74837210 requests sent out
    1417403 reassemblies required
    366526 packets reassembled ok
    802 fragments created

sopko@rockx:3% uname -a
Linux rockx.cs.unc.edu 2.4.18-24.7.xsmp #1 SMP Fri Jan 31 06:10:55 EST 2003 
i686 unknown
sopko@rockx:4% uptime
  7:36am  up 4 days, 18:34,  3 users,  load average: 0.15, 0.09, 0.02
sopko@rockx:5% netstat -s|head -9
Ip:
    1868251366 total packets received
    0 forwarded
    0 incoming packets discarded
    1867471797 incoming packets delivered
    1874499843 requests sent out
    16714 reassemblies required
    5724 packets reassembled ok
    392 fragments created

Linux swan.cs.unc.edu 2.4.18-24.7.xsmp #1 SMP Fri Jan 31 06:10:55 EST 2003 i686 
unknown
sopko@swan:2% uptime
  7:39am  up 6 days, 22:35,  6 users,  load average: 0.11, 0.04, 0.01
sopko@swan:3% netstat -s|head -9
Ip:
    2520303211 total packets received
    0 forwarded
    0 incoming packets discarded
    2513597627 incoming packets delivered
    2682733841 requests sent out
    7729216 reassemblies required
    2040387 packets reassembled ok
    5034907 fragments created

Comment 76 Need Real Name 2003-02-17 17:26:06 UTC

An update:

the bcm5700 driver which came with 2.4.18-19.7.xsmp seems quite stable.

In addition, I did some testing with the 2.4.9-e.10summit that is an errata
update for the AS 2.1 kernel and the tg3 driver which is included there appears
stable.  I've been smacking it down with ttcp for 3.5 days now and no crashes.

Comment 77 Need Real Name 2003-02-17 17:52:13 UTC

I guess I should specify that the 2.4.9-e.10summit is the summit kernel for
running on an IBM x440 machine.

Comment 78 Pete Zaitcev 2003-02-17 17:58:01 UTC

Summit kernels must not be used except on an obscure IBM box for which
they were intended.

The -e kernels feature a slightly different tg3 with simplified locking.
They also lack NAPI support. This removes lockups at the expense of
performance. I think users of normal RHL systems should stick
to Jeff's tg3. If someone tests -e for me - great, thanks.
But if your ksoftirqd eats all CPU on -e, or something else is weird,
please open a new bug. This bug is about a specific problem in the normal tg3.

Comment 79 Dirk Hufnagel 2003-02-17 18:39:40 UTC

Tested kernel-smp-2.4.18-24.7x.legolas2.athlon.rpm with netperf
today. Same behavior as before: sending a TCP stream at full
speed works, receiving one crashes the machine instantly.

Comment 80 Carl Litt 2003-02-17 20:49:43 UTC

Tried kernel-smp-2.4.18-24.7x.legolas2.i686 on PE2650 with dual BCM5701's, 
completely locked up on boot when initializing eth0.  No kernel messages.  -up 
kernel does it too.

Comment 81 Rodrigo Cunha 2003-02-19 14:04:18 UTC

I'm running 2.4.18-19.7.aragorn2smp without a single problem until now. All
other kernels crashed one way or another under heavy traffic and high load (yes,
load due to many small processes seems to matter).

smp-2.4.18-24.8.0.i686 also crashed, so I went back to aragorn2. I can't really
use these machines to test kernels since they are production systems, so I won't
be installing new kernels.

Just my 2 cents.

Comment 82 Jeff Garzik 2003-02-19 17:57:25 UTC

More forward progress.  From a message sent to linux-poweredge mailing list, by me:

As hinted in previous emails, here are the deadlock and hardware bug
fixes in tg3, which fix the crashes in previous versions.

http://people.redhat.com/jgarzik/pub/legolas4-7.x/      (redhat 7.x)
http://people.redhat.com/jgarzik/pub/legolas4-8.0/      (redhat 8.0)
  
As usual, these are based on the latest Red Hat errata kernel,
currently 2.4.18-24.  Also as usual, these kernels are unofficial, not
intended for production, and have not been through the Red Hat Q/A
process.

A user requested that I be less vague on the changes and describe what
has changed.  In my defense, I didn't think people really wanted the
depth of information, other than a simple "there's been progress."
I stand corrected.

Here are the tg3 driver changes in this latest kernel (legolas4),
taken directly from BitKeeper:


# --------------------------------------------
# 03/02/18      jgarzik      1.990
# [netdrvr tg3] disable 5701 h/w bug workaround during core clock reset
# --------------------------------------------

...

# --------------------------------------------
# 03/02/18      jgarzik      1.991
# [netdrvr tg3] fix NAPI deadlock
# * do not hold driver spinlock during RX processing in tg3_poll
#   (this is the deadlock fix... works around a NAPI net stack bug)
# * create netif_poll_{en,dis}able to synchronize against dev->poll()
# * create __netif_rx_complete to avoid a third irq-save in tg3_poll
# * create tg3_netif_{start,stop} as driver-specific helper functions
#   which disable and enable NAPI polling and TX queueing.  Note that
#   the TX queueing enable/disable is purely advisory, and is not
#   intended to prevent any races.
# * remove tg3_halt call from tg3_set_power_state, as all callers
#   have already called tg3_halt, making it redundant.  Removing this
#   function call also eliminates some locking complications.
# * use new helper __netif_rx_complete in tg3_poll
# * create tg3_reset_task, as a function that runs in process context
#   which resets the NIC.  This is needed because tg3_netif_stop()
#   calls schedule() in the process of disabling dev->poll.
# * schedule tg3_reset_task from tg3_tx_timeout
# * schedule tg3_reset_task from tg3_timer
# * wrap several tg3_halt...tg3_init_hw sequences with
#   tg3_netif_stop...tg3_netif_start.  In addition to synchronizing
#   with dev->poll, this additionally fixes bugs where we were not
#   calling netif_wake_queue, when we should have been.
# * move netif_start_queue call to very bottom of tg3_open
# * add missing tg3_netif_{start,stop} to tg3_{suspend,resume},
#   further fixing obvious bugs.
# --------------------------------------------

...

# --------------------------------------------
# 03/02/18      jgarzik      1.992
# [netdrvr tg3] bump version to 1.4c / Feb 18
# --------------------------------------------

...

# --------------------------------------------
# 03/02/18      jgarzik      1.993
# [netdrvr tg3] properly synchronize with TX, in tg3_netif_stop
# --------------------------------------------

...

# --------------------------------------------
# 03/02/18      jgarzik      1.994
# [netdrvr tg3] fix TX race in previous code, and another buglet
# 
# * call netif_tx_disable after netif_poll_disable, fixing TX race,
#   in tg3_netif_stop
# * follow the ordering of the tg3_netif_stop change, and enable
#   poll after waking TX, in tg3_netif_start
# * after doing those two steps in tg3_netif_start, check for work
#   using new helper function tg3_cond_int
# * add helper function tg3_cond_int, which delivers an interrupt
#   if and only if the status block was updated (i.e. if work
#   is likely to be available)
# --------------------------------------------
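The first item in the 1.991 entry (do not hold the driver spinlock during RX processing) follows a general locking pattern: take the lock only long enough to pull work off the shared ring, then release it before handing packets to code that may call back into the driver. The Python sketch below is only an analogy for that pattern. It is not the tg3 code, the class and function names are invented, and the re-entry path it shows (the stack transmitting from the receive callback) is an assumed example of how holding the lock could deadlock, not a statement of what the NAPI bug actually was.

```python
import threading
from collections import deque


class ToyDriver:
    """Toy model of a driver with one non-reentrant lock (stand-in for a spinlock)."""

    def __init__(self, deliver):
        self.lock = threading.Lock()   # acquiring it twice on one thread would hang
        self.rx_ring = deque()
        self.deliver = deliver         # "network stack" callback
        self.tx_count = 0

    def enqueue_rx(self, packet):
        with self.lock:
            self.rx_ring.append(packet)

    def transmit(self, packet):
        # The transmit path needs the same lock as the receive path.
        with self.lock:
            self.tx_count += 1

    def poll(self, budget):
        # Hold the lock only while detaching up to `budget` packets.
        with self.lock:
            count = min(budget, len(self.rx_ring))
            batch = [self.rx_ring.popleft() for _ in range(count)]
        # The lock is released here, so the callback may safely re-enter
        # the driver, for example by calling transmit().
        for packet in batch:
            self.deliver(self, packet)
        return len(batch)


def echo_stack(driver, packet):
    """Hypothetical stack behaviour: answer every received packet at once."""
    driver.transmit(packet)
```

If `poll` called `self.deliver` while still inside the `with self.lock` block, `echo_stack` would block forever in `transmit`, which is the single-threaded equivalent of a CPU spinning on a lock it already holds.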

Comment 83 Dirk Hufnagel 2003-02-19 20:04:39 UTC

My netperf test worked fine with the new legolas4 kernel.
I received and sent approx. 50MB/s for 10 minutes on
the machine with the 3C996-T running the tg3 driver
without the machine crashing on me.

Comment 84 Jeff Garzik 2003-02-19 22:06:17 UTC

Thanks.  I'm highly confident that "legolas4" kernel solves the existing issues
users were seeing... now we just have to confirm that those issues did not hide
other existing issues.

legolas4 kernel (tg3 version 1.4c) has survived the scenarios which killed
previous drivers in the lab, so now we just need user feedback to verify that
problems in the field are resolved.

Comment 85 Rodrigo Cunha 2003-02-19 23:29:35 UTC

Jeff, does your lab testing include something along the following lines:

About 40 new (small) processes per second, each one doing a few checks via snmp.
Sometimes there might be as much as 500 processes using the network, but each
one only sending and receiving a few packets. The traffic averages at about
200kb/s with very few bursts.

I've noticed that a particular server behind a Cisco 2600, which is used as a
firewall but severely limits the peak network usage to about 10Mb/s because the
2600 processor is so slow at this, never crashed. Another server doing exactly
the same things (same number of processes, same network usage) but behind a PIX
and connected to a high-performance network crashes all the time (except with
aragorn2). I also think the 2600 basically makes the network behave as if
half-duplexed. Might this (half-duplex) have completely masked the problem with
the kernel?

I suggest that people still having problems with tg3 try half-duplex operation,
since that might completely hide the problems. Does this have some scientific
explanation? :-)

(PS: the machines are Dell PE2650 Dual Xeon 2.4GHz)

Comment 86 Jeff Garzik 2003-02-20 17:40:47 UTC

Rodrigo,

No, we did not test snmp in relation to tg3.  Are you seeing failures with
tg3, and the "legolas4" kernel?  (posted at http://people.redhat.com/pub/)

In any case, if you can contribute snmp test scripts or other descriptions of
how you are testing, we can certainly add that to our network test suite.

WRT half-duplex, that would only be a factor inasmuch as it slows down the
driver enough to hide the recently-solved problems.

Comment 87 Need Real Name 2003-02-20 17:47:22 UTC

I noticed there are a few new enterprise kernels released:

kernel-2.4.9-e.12.src.rpm
kernel-2.4.18-e.25.src.rpm

Do these have any new tg3 and/or bcm5700 related changes?

Comment 88 Michael K. Johnson 2003-02-20 17:59:59 UTC

jeff:
Please redirect questions of that sort through your support
representative -- bugzilla really isn't meant for that kind
of request.  However:
https://rhn.redhat.com/errata/RHBA-2002-319.html
does mention that there's a new tg3 driver.  The bcm5700 driver
is not changed, and the tg3 driver is (obviously) not up to the
latest level being tested here.  There are actually two tg3
drivers in there; the older version for folks for whom it works,
and the newer tg3_12e3 with more recent updates.  But for more
details, do please contact your support representative.  Thanks!

Comment 89 Rodrigo Cunha 2003-02-21 04:01:15 UTC

> No, we did not test snmp in relation to tg3.  Are you seeing
> failures with tg3, and the "legolas4" kernel?  (posted at
> http://people.redhat.com/pub/)

Failures? Not yet :-) (Damn, I really hope not...!)

I decided to give legolas4 a shot and everything has been stable for ~20 hours.

About my loads: a synthetic way to emulate them would be to launch about 40
processes per second, each taking about 4 or 5 seconds to exit, and each making
a few dozen snmp queries. A few other processes are crunching the incoming data.
As I said the network load is light at about ~200kb/s, and all this is perl so
the processor is the main bottleneck.

I'll make a simple synthetic test along these lines if anyone is interested. All
the kernels I tested until now crashed in 1 or 2 days (except aragorn2 > 20 days).
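The shape of that load can be sketched in a few lines. At 40 launches per second with each process living 4 to 5 seconds, roughly 40 x 4.5 = 180 processes are alive at any moment. The Python sketch below is only an illustration of the launch pattern and is not the Perl script attached to this bug; `NOOP_WORKER` is a placeholder command that does nothing, where a real test would run something that issues snmp queries.

```python
import subprocess
import sys
import time

# Placeholder worker: starts a Python interpreter that exits immediately.
NOOP_WORKER = [sys.executable, "-c", "pass"]


def steady_state_processes(rate_per_second, lifetime_seconds):
    """Average number of live processes (arrival rate times lifetime)."""
    return rate_per_second * lifetime_seconds


def launch_at_rate(command, rate_per_second, total):
    """Start `total` copies of `command`, spaced 1/rate seconds apart.

    Returns the exit codes once every child has finished.
    """
    interval = 1.0 / rate_per_second
    children = []
    for _ in range(total):
        children.append(subprocess.Popen(command))
        time.sleep(interval)
    return [child.wait() for child in children]
```

For example, `launch_at_rate(NOOP_WORKER, 40, 400)` would approximate ten seconds of the launch rate described above, minus the per-process snmp traffic.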

Comment 90 Jeff Garzik 2003-02-21 15:47:27 UTC

Yes, I am interested in a synthetic test like you describe.

You may add it as an attachment to this bug report, or email it to me directly
at jgarzik.

Thanks for the success report, also!

Comment 91 Rodrigo Cunha 2003-02-22 00:41:11 UTC

ok, legolas4 just crashed...

Now, this time the crash was a bit different: the machine answered pings and
tcps, but the connection would stall after the first packet. No logs existed, as
usual.

The console, on a serial port, was stalled, as usual.

Comment 92 Rodrigo Cunha 2003-02-22 02:32:16 UTC

Jeff,

A friend of mine just had a similar crash with the latest official release: The
machine answers ICMP, refuses tcp (RST). The console is not connected, so I
can't report on that. Anyway... it seems legolas4 suffers from the same problem.
The hardware is the same: Dell PE 2650 Dual Xeon 2.4GHz

BTW, I'm sending the benchmark code right away, with usage instructions.

Comment 93 Rodrigo Cunha 2003-02-22 02:55:46 UTC

Created attachment 90274 [details]
Synthetic load cpu/network test, in perl.

Comment 94 Lance 2003-02-22 14:22:52 UTC

I have an IBM x235 (2x2.4 Xeon & Broadcom Integrated NIC) which I believe
suffers from this issue.  System is ok until it encounters heavy network
traffic.  I did upgrade to the latest kernel through RHN to no avail.  It still
freezes up in a matter of hours.  One thing I noticed, which may be unrelated as
I'm not that familiar with the internals of Linux, is that with the system
monitor up during initial heavy network traffic, the RAM fills to capacity, not
swap space, just RAM and it doesn't seem to be released if network traffic dies
down.  It's jumping from ~350MB being used to the full 2.5GB.  Has anyone
experienced anything similar?

Comment 95 Rodrigo Cunha 2003-02-22 20:15:09 UTC

> One thing I noticed, which may be unrelated as
> I'm not that familiar with the internals of Linux, is that with the
> system monitor up during initial heavy network traffic, the RAM fills
> to capacity, not swap space, just RAM

If your network traffic also means disk reads and writes, then that's quite
normal, since Linux allocates almost all free memory to disk buffers.

The algorithm for physical memory is something like:

1 - Allocate up to 3/4 to text/data/stack memory, but only if needed.
2 - Keep a small pool of free pages, ready to allocate to (1)
3 - Use the rest as read/write buffer space, and "to be written buffers" space.
4 - The remaining least used text/data/stack pages are dumped to swap space to
maintain space for (2) and (3).
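
One way to check whether "full" RAM is really just buffer and cache space is to read /proc/meminfo and subtract the Buffers and Cached fields. The following is a minimal sketch in Python; the helper names `parse_meminfo` and `memory_breakdown` are made up for this example, and it only looks at the "Name: value kB" lines of the file.

```python
def parse_meminfo(text):
    """Return the 'Name: value kB' lines of /proc/meminfo as a dict of kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only lines of the form "Name:   12345 kB"; skip anything else.
        if sep and len(parts) == 2 and parts[0].isdigit() and parts[1] == "kB":
            fields[name.strip()] = int(parts[0])
    return fields


def memory_breakdown(fields):
    """Split RAM into free, buffer/cache, and everything else (all in kB)."""
    cache = fields.get("Buffers", 0) + fields.get("Cached", 0)
    free = fields["MemFree"]
    return {"free": free,
            "cache": cache,
            "other": fields["MemTotal"] - free - cache}
```

Calling `memory_breakdown(parse_meminfo(open("/proc/meminfo").read()))` on the affected machine should show whether the jump in used memory lands in "cache" (reclaimable) or in "other".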

Comment 96 John Sopko 2003-02-24 16:02:03 UTC

I ran the test kernel "2.4.18-24.7x.legolas4smp" (without noapic kernel option) 
on 2 Dell 2650 machines, (100MB ethernet connection), and have not had a crash.
The machines have been running for just over 3 days. I ran the ttcp test
program between the 2 machines, they each sent/received just over 
2 billion packets.

Comment 97 Jeff Garzik 2003-02-24 17:12:35 UTC

Rodrigo,

Can you file a new bug with your latest failure?  That does not sound
specifically tg3 related, and in any case would be a different bug from this one.

Second, obtaining output from a freeze can be done by passing
     nmi_watchdog=1
on the kernel command line.  (though if it receives ICMP traffic, NMI watchdog
isn't going to come into play...)
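
To confirm after a reboot that the option actually reached the kernel, the booted command line can be read back from /proc/cmdline. A small sketch in Python follows; the helper name `kernel_param` is hypothetical and not part of any existing tool.

```python
def kernel_param(cmdline, name):
    """Look up `name` on a kernel command line.

    Returns the value after '=' as a string, "" for a bare flag such as
    noapic, or None when the parameter is not present.
    """
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key == name:
            return value
    return None
```

For example, `kernel_param(open("/proc/cmdline").read(), "nmi_watchdog")` should give "1" when the kernel was booted with nmi_watchdog=1, and None otherwise.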

Comment 98 Rodrigo Cunha 2003-02-26 17:09:01 UTC

Well, no crashes in 5 days, so that one on 21/02 might have been spurious...

Comment 99 Chris Haag 2003-02-26 17:50:52 UTC

Sorry, I could not help by testing kernel versions as promised earlier. For the
record: our Dell PE 2650, Dual Xeon 2.4 GHz, Dual Broadcom 10/100/1000 has now
been up for more than 72 days of (heavy) production use with:
- Kernel 2.4.18-18
- smp
- not using "noapic"
- bcm5700 driver instead of tg3 

Anyone succeeded for a longer period of time (> 14 days) with any kernel using 
tg3 and comparable machine so far?

Comment 100 lnelson 2003-02-26 18:30:03 UTC

Up 67 days on 2.4.18-19.7xsmp, NetXtreme BCM5701 Gigabit Ethernet from BROADCOM, 
tg3 driver, dual XEON 2.0 GHz CPUs, 2GB Ram, Intel SE7500WV2 motherboard, 2U rack
mount Open Storage Solutions Server chassis under fairly heavy network loads - loads
that always hung a machine before.  This applies to many identical servers.

They would not stay up more than a couple of days on 2.4.18-18.7xsmp, and even less
(on the order of minutes) when using the "noapic" option on the same 2.4.18-18.7xsmp kernel.

We have racks of these, all maintained via RHN/up2date, so testing out non-official
kernels that I can't get through up2date is not really an option.

Comment 101 Christopher McCrory 2003-02-26 19:16:49 UTC

>> Anyone succeeded for a longer period of time (> 14 days) with any
>> kernel using tg3 and comparable machine so far?


58 days with tg3 , web server in server farm (only one running tg3)


chrismcc]$ uname -r
2.4.18-19.7.xsmp


chrismcc]$ uptime
 11:11am  up 58 days,  3:12,  1 user,  load average: 1.40, 1.12, 1.11

chrismcc]$ /sbin/lsmod
Module                  Size  Used by    Not tainted
ipt_REJECT              3744   0 (autoclean)
ipt_state               1248   0 (autoclean)
ip_conntrack           22188   1 (autoclean) [ipt_state]
iptable_filter          2464   0 (autoclean)
ip_tables              14304   3 [ipt_REJECT ipt_state iptable_filter]
tg3                    47008   1
ext3                   67392   3
jbd                    51528   3 [ext3]
aic7xxx               129568   4
sd_mod                 12832   8
scsi_mod              108048   2 [aic7xxx sd_mod]



I suspect I just haven't tickled it right.

Comment 102 Matt Domsch 2003-02-26 21:05:03 UTC

Dell has been testing the "legolas4" kernel and drivers in association with 
our testing of the public beta known as Phoebe.  No tests have yet induced a 
failure.  nttcp tests ran Friday through Monday with no problems.  NFS and 
Samba tests were interrupted by a power outage yesterday, restarted today - 
we'll provide an update when we can.

Comment 103 Matt Domsch 2003-02-27 20:30:14 UTC

Dell has performed additional testing using samba and nfs with the "legolas4"  
test kernel.  We have been unable to induce a failure with any of our tests, 
where with previous kernels and tg3 driver versions we have been able to 
induce failure.  This is an excellent sign that the issues so far uncovered 
have been fixed.

Comment 104 Conor Wynne 2003-03-03 18:01:53 UTC

Hi all,

I started testing of Redhat 7.3 kernel-smp-2.4.18-24.80.legolas4.i686 on Friday 
28 Feb 19:30 and at the time of writing Monday 3rd March 17:55 there have been 
no errors.

I have only been testing NFS connections as I have always been able to 
reproduce the TG3 issue doing just that. 

NFS server is Redhat 8 & 2.4.18-24.7.xsmp [FYI]

So far there have been zero errors reported to dmesg and none of the previous 
black screen and crashing problems. 

I shall continue testing this all week. 
If requested I can add more load / services. 

Regards
CW

Comment 105 Christopher McCrory 2003-03-03 18:23:29 UTC

another success here with legolas4 

Dell 2650
web server

chrismcc]$ uptime
 10:18am  up 4 days, 23:20,  1 user,  load average: 1.17, 1.06, 1.06

chrismcc]$ uname -rm
2.4.18-24.7x.legolas4smp i686

5 days with no problems
( google hit us hard over the weekend so it got a good workout )

Comment 106 Jeff Garzik 2003-03-04 20:04:46 UTC

Fixed in 2.4.18-26 errata kernel, just released.

To prevent further "pile-on" of unrelated tg3 issues, please open a new bug
against this errata kernel, if other issues develop.
