Bug 69920
Summary: Kernel Crashes in TG3 Driver
Product: [Retired] Red Hat Linux
Component: kernel
Version: 7.3
Hardware: i686
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Reporter: Thomas J. Baker <tjb>
Assignee: Jeff Garzik <jgarzik>
CC: afinkel, amit_bhutani, anne.possoz, bitto, carl, ch, chrismcc, ckato, dale_kaisner, daniel.grandjean, david_j_morse, dombek, eiwanski, gabor.kondorosi, gary.mansell, hufnagel, ian, imcguire, jan.iven, jaroslaw.polok, jbootle, jeff, jefferson.ogata, jgarzik, jim.laverty, jmarquart, jmccann, john_hull, karl.bailey, lnelson, marc.schmitt, matt_domsch, nreilly, olc, pcfe, peterm, randy, scott, sopko, star, sven, tao, tecklee, vkarasik, zaitcev
Doc Type: Bug Fix
Last Closed: 2003-03-04 20:04:46 UTC
Bug Blocks: 79997

Description
Thomas J. Baker
2002-07-26 13:14:26 UTC
Here's another crash.

fitzcarraldo.sr.unh.edu login:
Red Hat Linux release 7.3 (Valhalla)
Kernel 2.4.18-5smp on an i686

fitzcarraldo.sr.unh.edu login: Unable to handle kernel NULL pointer dereference at virtual address 00000060
printing eip:
f891e0e2
*pde = 00000000
Oops: 0000
nfs nfsd lockd sunrpc autofs tg3 eepro100 ext3 jbd megaraid aic7xxx sd_mod scs
CPU: 0
EIP: 0010:[<f891e0e2>] Not tainted
EFLAGS: 00010206
EIP is at tg3_rx [tg3] 0x112 (2.4.18-5smp)
eax: 00000642 ebx: 00010000 ecx: c8f99760 edx: 00000000
esi: 00000000 edi: 0000005e ebp: 000005ea esp: c030bef8
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=c030b000)
Stack: 000005ee f4a7e170 c034cbc0 0000025f 8000025e 00010000 f58765e0 f4dfea80
       db94d4e0 c03de6c0 c035a000 f4a7e160 04000001 00000011 f891e52e f4a7e160
       f4a7e160 c035a000 f891e5c4 f4a7e160 f50e4b20 00000000 c010a53e 00000011
Call Trace:
[<f891e52e>] tg3_interrupt_main_work [tg3] 0x3e
[<f891e5c4>] tg3_interrupt [tg3] 0x44
[<c010a53e>] handle_IRQ_event [kernel] 0x5e
[<c010a755>] do_IRQ [kernel] 0xa5
[<c0106e70>] default_idle [kernel] 0x0
[<c0105000>] stext [kernel] 0x0
[<c0106e70>] default_idle [kernel] 0x0
[<c0105000>] stext [kernel] 0x0
[<c0106e9c>] default_idle [kernel] 0x2c
[<c0106ef4>] cpu_idle [kernel] 0x24
Code: 8b 46 60 85 c0 74 13 68 15 03 00 00 68 a0 4a 92 f8 e8 78 92
<0>Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing
End Data

Have you attempted to use the 2.4.18-10 kernel? If so, what were the results?

I'm running the new kernel on two systems with TG3 ethernet and haven't had any problems yet. It's only been a few days though.

I'm seeing this problem also on Dell 2550s and on 6450s. This is with the latest Red Hat kernel 2.4.18-17.7.x on Red Hat 7.3, all patches applied. smp and bigmem kernels both appear to be affected. The problem is IMHO unambiguously the tg3 driver. I had three different hosts all exhibiting the same problem -- run for a few hours then hard hang.
I disabled the built-in Broadcom adapters and installed Intel Gb adapters and have been running for over a week with no problem.

An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2002-262.html

Experienced periodic system hangs (every few hours at times), usually without anything in the message log.

Nov 18 06:08:44 vfdb kernel:
Nov 18 06:08:44 vfdb kernel: wait_on_irq, CPU 0:
Nov 18 06:08:44 vfdb kernel: irq: 1 [ 0 0 1 0 ]
Nov 18 06:08:44 vfdb kernel: bh: 0 [ 0 0 1 0 ]
Nov 18 06:08:44 vfdb kernel: Stack dumps:
Nov 18 06:08:44 vfdb kernel: CPU 1:00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Nov 18 06:08:44 vfdb kernel: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Nov 18 06:08:44 vfdb kernel: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Nov 18 06:08:44 vfdb kernel: Call Trace: [<f89ca28b>] tg3_start_xmit [tg3] 0x12b (0xc36b99c4))
Nov 18 06:08:44 vfdb kernel: [<f897f117>] ipfw_output_check [ipchains] 0x77 (0xc36b99f4))
Nov 18 06:08:44 vfdb kernel: [<f89ca28b>] tg3_start_xmit [tg3] 0x12b (0xc36b9a38))
Nov 18 06:08:44 vfdb kernel: [<c01e7e5e>] dev_queue_xmit [kernel] 0x14e (0xc36b9a54))
Nov 18 06:08:44 vfdb kernel: [<f897f117>] ipfw_output_check [ipchains] 0x77 (0xc36b9a68))
Nov 18 06:08:44 vfdb kernel: [<f8978b63>] check_for_unredirect [ipchains] 0x63 (0xc36b9a88))
Nov 18 06:08:44 vfdb kernel: [<c01f1c94>] qdisc_restart [kernel] 0x14 (0xc36b9aa4))
Nov 18 06:08:44 vfdb kernel: [<c01e7e5e>] dev_queue_xmit [kernel] 0x14e (0xc36b9ac8))
Nov 18 06:08:44 vfdb kernel: [<c01eee3e>] nf_iterate [kernel] 0x2e (0xc36b9ad0))
Nov 18 06:08:44 vfdb kernel: [<c02010bf>] ip_finish_output2 [kernel] 0xaf (0xc36b9aec))
Nov 18 06:08:44 vfdb kernel: [<c0201010>] ip_finish_output2 [kernel] 0x0 (0xc36b9af4))
Nov 18 06:08:44 vfdb kernel: [<c01ef173>] nf_hook_slow [kernel] 0xd3 (0xc36b9af8))
Nov 18 06:08:44 vfdb kernel: [<c01ef1aa>] nf_hook_slow [kernel] 0x10a (0xc36b9b10))
Nov 18 06:08:44 vfdb kernel: [<c01ffb02>] ip_output [kernel] 0x162 (0xc36b9b40))
Nov 18 06:08:44 vfdb kernel: [<c0201010>] ip_finish_output2 [kernel] 0x0 (0xc36b9b58))
Nov 18 06:08:44 vfdb kernel: [<c01ffea0>] ip_queue_xmit [kernel] 0x390 (0xc36b9b88))
Nov 18 06:08:44 vfdb kernel: [<c01e42b8>] skb_clone [kernel] 0x78 (0xc36b9ba4))
Nov 18 06:08:44 vfdb kernel: [<c0214a1e>] tcp_v4_send_check [kernel] 0x6e (0xc36b9bc8))
Nov 18 06:08:44 vfdb kernel: [<c020f6c5>] tcp_transmit_skb [kernel] 0x565 (0xc36b9bf0))
Nov 18 06:08:44 vfdb kernel: [<c02101df>] tcp_write_xmit [kernel] 0x1df (0xc36b9c34))
Nov 18 06:08:44 vfdb kernel: [<c01e51b4>] skb_checksum [kernel] 0x54 (0xc36b9c50))
Nov 18 06:08:44 vfdb kernel: [<c020d592>] __tcp_data_snd_check [kernel] 0x52 (0xc36b9c68))
Nov 18 06:08:44 vfdb kernel: [<c020d53c>] tcp_new_space [kernel] 0x7c (0xc36b9c84))
Nov 18 06:08:44 vfdb kernel: [<c01e51b4>] skb_checksum [kernel] 0x54 (0xc36b9c94))
Nov 18 06:08:44 vfdb kernel: [<c011a02b>] __wake_up [kernel] 0x4b (0xc36b9cd4))
Nov 18 06:08:44 vfdb kernel: [<c0215e6c>] tcp_v4_rcv [kernel] 0x3cc (0xc36b9cf0))
Nov 18 06:08:44 vfdb kernel: [<c01eee3e>] nf_iterate [kernel] 0x2e (0xc36b9d2c))
Nov 18 06:08:44 vfdb kernel: [<c01fd337>] ip_local_deliver_finish [kernel] 0xb7 (0xc36b9d44))
Nov 18 06:08:44 vfdb kernel: [<c01fd280>] ip_local_deliver_finish [kernel] 0x0 (0xc36b9d50))
Nov 18 06:08:44 vfdb kernel: [<c01ef173>] nf_hook_slow [kernel] 0xd3 (0xc36b9d54))
Nov 18 06:08:44 vfdb kernel: [<c01fd280>] ip_local_deliver_finish [kernel] 0x0 (0xc36b9d68))
Nov 18 06:08:44 vfdb kernel: [<c01ef1aa>] nf_hook_slow [kernel] 0x10a (0xc36b9d6c))
Nov 18 06:08:44 vfdb kernel: [<c0126115>] update_process_times [kernel] 0x25 (0xc36b9da8))
Nov 18 06:08:44 vfdb kernel: [<c0116999>] smp_apic_timer_interrupt [kernel] 0xa9 (0xc36b9dcc))
Nov 18 06:08:44 vfdb kernel: [<c01a0cc4>] account_io_start [kernel] 0x44 (0xc36b9dd8))
Nov 18 06:08:44 vfdb kernel: [<c01a0c07>] locate_hd_struct [kernel] 0x27 (0xc36b9de0))
Nov 18 06:08:44 vfdb kernel: [<c01a0d69>] req_new_io [kernel] 0x49 (0xc36b9df4))
Nov 18 06:08:44 vfdb kernel: [<f8814f0c>] scsi_queue_next_request [scsi_mod] 0x5c (0xc36b9e50))
Nov 18 06:08:44 vfdb kernel: [<f8815139>] __scsi_end_request [scsi_mod] 0x139 (0xc36b9e68))
Nov 18 06:08:44 vfdb kernel: [<c0126115>] update_process_times [kernel] 0x25 (0xc36b9e84))
Nov 18 06:08:44 vfdb kernel: [<c0126115>] update_process_times [kernel] 0x25 (0xc36b9ea0))
:
Nov 18 06:08:44 vfdb kernel: [<c0116999>] smp_apic_timer_interrupt [kernel] 0xa9 (0xc36b9ef8))
Nov 18 06:08:44 vfdb kernel: [<c01266cc>] schedule_timeout [kernel] 0x7c (0xc36b9f80))
Nov 18 06:08:44 vfdb kernel: [<c0126640>] process_timeout [kernel] 0x0 (0xc36b9f98))
Nov 18 06:08:44 vfdb kernel: [<c013a76e>] wakeup_memwaiters [kernel] 0xde (0xc36b9fb0))
Nov 18 06:08:44 vfdb kernel: [<c013a541>] kswapd [kernel] 0x381 (0xc36b9fd8))
Nov 18 06:08:44 vfdb kernel: [<c0105000>] stext [kernel] 0x0 (0xc36b9fe8))
Nov 18 06:08:44 vfdb kernel: [<c0107286>] kernel_thread [kernel] 0x26 (0xc36b9ff0))
Nov 18 06:08:44 vfdb kernel: [<c013a1c0>] kswapd [kernel] 0x0 (0xc36b9ff8))
Nov 18 06:08:44 vfdb kernel:
Nov 18 06:08:44 vfdb kernel:
Nov 18 06:08:44 vfdb kernel: CPU 2:00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Nov 18 06:08:44 vfdb kernel: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Nov 18 06:08:44 vfdb kernel: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Nov 18 06:08:44 vfdb kernel: Call Trace:
Nov 18 06:08:44 vfdb kernel:
Nov 18 06:08:44 vfdb kernel: CPU 3:55514246 55514246 55514246 55514246 55514246 55514246 55514246 55514246
Nov 18 06:08:44 vfdb kernel: 55514246 55514246 55514246 55514246 55514246 55514246 51514246 55514b30
Nov 18 06:08:44 vfdb kernel: 55514246 55514246 55514246 55514246 55514246 55514246 55514246 55514246
Nov 18 06:08:44 vfdb kernel: Call Trace:
Nov 18 06:08:44 vfdb kernel:
Nov 18 06:08:44 vfdb kernel: CPU 0:f7fb3f28 c0246f0e 00000000 00000001 ffffffff 00000000 c010a452 c0246f23
Nov 18 06:08:44 vfdb kernel: 00000001 f66de000 00000001 c017e1bf f66de368 c030a284 f7fb3f74 00000000
Nov 18 06:08:44 vfdb kernel: f7fb2000 c01226de f66de000 f66de130 c030a284 c0304f00 00000000 c012b3e5
Nov 18 06:08:44 vfdb kernel: Call Trace: [<c010a452>] __global_cli [kernel] 0xe2 (0xf7fb3f40))
Nov 18 06:08:44 vfdb kernel: [<c017e1bf>] flush_to_ldisc [kernel] 0x9f (0xf7fb3f54))
Nov 18 06:08:44 vfdb kernel: [<c01226de>] __run_task_queue [kernel] 0x5e (0xf7fb3f6c))
Nov 18 06:08:44 vfdb kernel: [<c012b3e5>] context_thread [kernel] 0x155 (0xf7fb3f84))
Nov 18 06:08:44 vfdb kernel: [<c012b290>] context_thread [kernel] 0x0 (0xf7fb3fc8))
Nov 18 06:08:44 vfdb kernel: [<c0105000>] stext [kernel] 0x0 (0xf7fb3fe8))
Nov 18 06:08:44 vfdb kernel: [<c0107286>] kernel_thread [kernel] 0x26 (0xf7fb3ff0))
Nov 18 06:08:44 vfdb kernel: [<c012b290>] context_thread [kernel] 0x0 (0xf7fb3ff8))
Nov 18 06:08:44 vfdb kernel:

The errata (2.4.18-18.8.0smp) did not solve the problem for me. Approximately 10 minutes after booting the kernel the system hung as before (with no audit trail). I am still running the new kernel but I operate off my Fast Ethernet interface and manually unload the tg3 driver. This is the same workaround I was using for the 2.4.18-17.8.0smp kernel. RH8.0 on a Dell 4600.

Just to add: my problems described above at 2002-11-18 01:21:19 are for 2002-11-18 01:21:19, so the problem is not fixed. Also, I am running dual-processor 2 GHz Xeon CPUs on two Dell 2650s which experienced the same problem, so it is definitely the kernel.

Please re-open this bug. The tg3 driver is still broken in the currently available (2.4.18-18) kernel for RH 7.3 as of Tuesday 2002-11-19.
Admittedly I get a crash now that is unequivocally caused by tg3, where before I had a "mystery lockup" with no errors in ESM, no errors in syslog, no messages on screen, and no response to terminal or network I/O. Systems are two identical Dell 2650s running drbd, heartbeat, nfs in a highly redundant configuration with a crossover 1000bt cable.

To add my weight: this is still a problem. I have two Dell 2650s with 2*2GHz Xeon processors. These boxes are meant to be replacing our old GroupWise mail systems with a spanking Red Hat mail system. The boxes have shown this fault on RH7.1 through to RH8.0. Current config is RH7.3 installed using the Dell install disk with ALL errata applied and using the kernel-bigmem-2.4.18-18.7.x.i686.rpm 'fix'. The system runs sendmail, 180 * 1MByte emails an hour using virus scanning & spam stomping. This seems fairly stable, approx 2 days uptime before locking, but if I run httpd with php scripts as well then the crash occurs within 15 minutes. The httpd is not under load. A fix to this problem is sorely needed as I'm getting mud on my face at the moment as our MS Exchange server is more stable than the Red Hat server.... Not good.

I've just used the "noapic" kernel boot option & have found this to make my system a lot more stable than it ever has been. I'm compressing 6 Gbytes of data as well as carrying out the functions that the server should be doing, & everything is running sweetly (& very fast). Prior to adding "noapic" I would have expected the machine to have locked up by now even without the large compression test. I'm not saying this is a fix as it hasn't been running long enough... but it certainly seems to point to where the problem may lie. Any thoughts?

The latest Red Hat errata kernel 2.4.18-18.8.0 states that it addresses the "Kernel Crashes in TG3 Driver" issue (Bugzilla ID:69920). After installing the kernel-source for the errata rpm and performing a diff between the errata kernel (2.4.18-18.8.0) and the RH 8.0 stock kernel (2.4.18-14), it was evident that the tg3 patch was "not" included in the errata kernel.
Refer below for the actual patch (originally posted on Linux Kernel Mailing List)

ChangeSet 1.790, 2002/11/14 14:43:47-05:00, davem

	Fix tg3 net driver to properly disable interrupts during some TX operations

# This patch includes the following deltas:
#	ChangeSet		1.789 -> 1.790
#	drivers/net/tg3.c	1.37 -> 1.38
#

 tg3.c |   46 ++++++++++++++++++++++++++++++++++++++--------
 1 files changed, 38 insertions(+), 8 deletions(-)

diff -Nru a/drivers/net/tg3.c b/drivers/net/tg3.c
--- a/drivers/net/tg3.c	Fri Nov 15 09:08:21 2002
+++ b/drivers/net/tg3.c	Fri Nov 15 09:08:21 2002
@@ -59,8 +59,8 @@

 #define DRV_MODULE_NAME		"tg3"
 #define PFX DRV_MODULE_NAME	": "
-#define DRV_MODULE_VERSION	"1.1"
-#define DRV_MODULE_RELDATE	"Aug 30, 2002"
+#define DRV_MODULE_VERSION	"1.2"
+#define DRV_MODULE_RELDATE	"Nov 14, 2002"

 #define TG3_DEF_MAC_MODE	0
 #define TG3_DEF_RX_MODE		0
@@ -2373,13 +2373,28 @@
 	/* No BH disabling for tx_lock here. We are running in BH disabled
 	 * context and TX reclaim runs via tp->poll inside of a software
 	 * interrupt. Rejoice!
+	 *
+	 * Actually, things are not so simple. If we are to take a hw
+	 * IRQ here, we can deadlock, consider:
+	 *
+	 *       CPU1                CPU2
+	 *   tg3_start_xmit
+	 *   take tp->tx_lock
+	 *                       tg3_timer
+	 *                       take tp->lock
+	 *   tg3_interrupt
+	 *   spin on tp->lock
+	 *                       spin on tp->tx_lock
+	 *
+	 * So we really do need to disable interrupts when taking
+	 * tx_lock here.
 	 */
-	spin_lock(&tp->tx_lock);
+	spin_lock_irq(&tp->tx_lock);

 	/* This is a hard error, log it. */
 	if (unlikely(TX_BUFFS_AVAIL(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
 		netif_stop_queue(dev);
-		spin_unlock(&tp->tx_lock);
+		spin_unlock_irq(&tp->tx_lock);
 		printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n",
 		       dev->name);
 		return 1;
@@ -2520,7 +2535,7 @@
 		netif_stop_queue(dev);

 out_unlock:
-	spin_unlock(&tp->tx_lock);
+	spin_unlock_irq(&tp->tx_lock);

 	dev->trans_start = jiffies;

@@ -2538,13 +2553,28 @@
 	/* No BH disabling for tx_lock here. We are running in BH disabled
 	 * context and TX reclaim runs via tp->poll inside of a software
 	 * interrupt. Rejoice!
+	 *
+	 * Actually, things are not so simple. If we are to take a hw
+	 * IRQ here, we can deadlock, consider:
+	 *
+	 *       CPU1                CPU2
+	 *   tg3_start_xmit
+	 *   take tp->tx_lock
+	 *                       tg3_timer
+	 *                       take tp->lock
+	 *   tg3_interrupt
+	 *   spin on tp->lock
+	 *                       spin on tp->tx_lock
+	 *
+	 * So we really do need to disable interrupts when taking
+	 * tx_lock here.
 	 */
-	spin_lock(&tp->tx_lock);
+	spin_lock_irq(&tp->tx_lock);

 	/* This is a hard error, log it. */
 	if (unlikely(TX_BUFFS_AVAIL(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
 		netif_stop_queue(dev);
-		spin_unlock(&tp->tx_lock);
+		spin_unlock_irq(&tp->tx_lock);
 		printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n",
 		       dev->name);
 		return 1;
@@ -2635,7 +2665,7 @@
 	if (TX_BUFFS_AVAIL(tp) <= (MAX_SKB_FRAGS + 1))
 		netif_stop_queue(dev);

-	spin_unlock(&tp->tx_lock);
+	spin_unlock_irq(&tp->tx_lock);

 	dev->trans_start = jiffies;

FWIW, I diffed the 2.4.18-17.7.x and 2.4.18-18.7.x tg3 sources and there are definitely some differences. I seem to recall the spin_lock and spin_lock_irq calls differ, but I don't think the stuff about TX_BUFFS_AVAIL differed.

This bug is still marked CLOSED. Hey Red Hat, please REOPEN.

Created attachment 85747 [details]
Possible fix
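The deadlock sketched in the patch comment above (CPU1 holding tp->tx_lock and spinning on tp->lock from the interrupt handler, while CPU2 holds tp->lock and spins on tp->tx_lock) can be replayed with a small user-space model. This is only a sketch in plain C, not driver code: `struct model`, `try_lock` and `tg3_model_deadlocks` are hypothetical names invented for the illustration, the two "locks" are plain integers recording an owner, and the interleaving is fixed to the one drawn in the comment.

```c
#include <assert.h>
#include <stdbool.h>

enum { FREE = -1, CPU1 = 0, CPU2 = 1 };

/* Toy model (assumption): each lock only records which CPU owns it. */
struct model {
    int tx_lock_owner;  /* stands in for tp->tx_lock */
    int lock_owner;     /* stands in for tp->lock */
    int waits_for[2];   /* CPU each CPU is spinning on, or FREE */
};

/* Take the lock if it is free; otherwise record whom we are spinning on. */
static bool try_lock(struct model *m, int *owner, int cpu)
{
    assert(cpu == CPU1 || cpu == CPU2);
    if (*owner == FREE) {
        *owner = cpu;
        m->waits_for[cpu] = FREE;
        return true;
    }
    m->waits_for[cpu] = *owner;
    return false;
}

/* Both CPUs spinning on a lock held by the other one. */
static bool deadlocked(const struct model *m)
{
    return m->waits_for[CPU1] == CPU2 && m->waits_for[CPU2] == CPU1;
}

/*
 * Replays the interleaving from the patch comment.
 * irqs_off_under_tx_lock == false models spin_lock(),
 * irqs_off_under_tx_lock == true  models spin_lock_irq().
 * Returns true if the two CPUs end up waiting on each other.
 */
bool tg3_model_deadlocks(bool irqs_off_under_tx_lock)
{
    struct model m = { FREE, FREE, { FREE, FREE } };
    bool irq_pending = true;  /* a NIC interrupt is raised on CPU1 */

    try_lock(&m, &m.tx_lock_owner, CPU1);  /* CPU1: xmit takes tx_lock */
    try_lock(&m, &m.lock_owner, CPU2);     /* CPU2: timer takes lock */

    if (!irqs_off_under_tx_lock) {
        /* The handler runs on CPU1 while tx_lock is still held. */
        irq_pending = false;
        try_lock(&m, &m.lock_owner, CPU1);
    }

    try_lock(&m, &m.tx_lock_owner, CPU2);  /* CPU2: timer wants tx_lock */
    if (deadlocked(&m))
        return true;

    /* IRQs were off: CPU1 finishes the transmit and drops tx_lock. */
    m.tx_lock_owner = FREE;
    try_lock(&m, &m.tx_lock_owner, CPU2);  /* CPU2 now gets tx_lock */
    m.tx_lock_owner = FREE;                /* CPU2 drops both locks */
    m.lock_owner = FREE;

    if (irq_pending)                       /* deferred handler runs on CPU1 */
        try_lock(&m, &m.lock_owner, CPU1);
    m.lock_owner = FREE;
    return deadlocked(&m);
}
```

In this model `tg3_model_deadlocks(false)` reports a deadlock and `tg3_model_deadlocks(true)` does not, which is the whole point of switching to spin_lock_irq in the patch: the interrupt handler cannot run on the CPU that holds tx_lock until the lock has been dropped. The real driver has more lock users than this, so treat it as an illustration of the ordering problem only.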
Everyone: please try the patch I just attached to this bug report, and see if it fixes the problem.

Does anyone have a compiled kernel with this patch for the Dell 2650?

FWIW I've had success with this patch on a dual 2GHz Xeon IBM 335 xserver with 2 tg3 NICs. Previously a large rdist to this box would cause it to hang within about an hour under both 2.4.18-17.7.xsmp and 2.4.18-18.7.xsmp kernels. With the patch applied to 2.4.18-18.7.xsmp it has been going without fault for the last 12 hours.

Regarding 2.4.18-18.7.x (not using the patch): this is much more unstable than 2.4.18-17.7.x was. I have a 2650 that hadn't crashed at all -- since installing the new kernel it won't stay up more than 24 hours. I'm building a patched kernel to test on that machine.

I have experienced a kernel crash with 2.4.18-18.7.xsmp on an HP ProLiant DL580 G2 (4*Xeon 1.6 GHz, 2 GB RAM) just a few hours after I rebooted from 2.4.18-10smp. This is what 2.4.18-10smp says:

tg3.c:v0.99 (Jun 11, 2002)
eth0: Tigon3 [partno(284685-001) rev 0105 PHY(5701)] (PCIX:100MHz:64-bit) 10/100/1000BaseT Ethernet xx:xx:xx:xx:xx:xx
eth0: Link is up at 100 Mbps, full duplex.
eth0: Flow control is off for TX and off for RX.

The NIC has always been connected to a 100 Mbps port on the Cisco Catalyst switch. It has never been running at 1000 Mbps. 2.4.18-10smp is running just fine (if you ignore its security problems).

*** Bug 78166 has been marked as a duplicate of this bug. ***

With a stock RH 8.0 kernel (2.4.18-14smp), tg3 v1.0 driver, a PowerEdge 1655MC (2x1.26GHz PIIIs, dual onboard BCM95703A31) would lock hard after 5-10 minutes of heavy network traffic (primarily transmits). Tried jgarzik's 2.4.18-19.7.tg3.120smp kernel (with tg3 v1.2) on the same box, and running the same heavy TX load (both interfaces sending ~47MB/sec), it's been up for 8 hours straight. Thus, at least in my case, it has fixed the lockup problem. Thanks Jeff!

Thanks for the feedback so far.
To make things easier to access and test, I have made available a drop-in tg3.c and tg3.h which should fix the tg3 crashes, and have also created rpms (including source rpm).

Drop-in source code: http://people.redhat.com/jgarzik/tg3/tg3-1.2/
Unofficial test rpms: http://people.redhat.com/jgarzik/tg3/tg3-1.2/rpms/

(disclaimer/warning: these rpms are unofficial, and should not be used in production. They have not gone through a full battery of Red Hat Q/A tests. If they damage your computer hardware, software, or scare your cat, it's not my fault.)

Hi - Just wanted to chime in on this bug. We upgraded to 2.4.18-18.7.x on a machine with a 3com 3c996T ethernet card, and had *severe* stability problems: after an indeterminate period ranging between five minutes and three hours, the system would hang. No oopses, errors logged to syslog, or other diagnostics; it just locked up solid and required a hard reset. This system had been rock-steady under the stock 7.3 kernel and the 2.4.18-5 errata kernel. Reverting to the 2.4.18-10bigmem kernel appears to have solved the problem for now, but I'll definitely wait for a resolution on this bugzilla entry before installing a newer kernel...

Hardware: dual 933MHz PIII system, SuperMicro 370DE6 mainboard, 2G RAM, 3com 3c996-T gigabit ethernet NIC, U180 scsi system disk and 6-disk RAID array (off of a Mylex A352). More-or-less stock RH7.3 with all relevant errata updates (except the kernel, as detailed above). Non-stock SW includes local Apache+mod_ssl and php, and Sendmail 8.12.6/mimedefang/spamassassin. Primary functions: mail server (averaging ~200 emails/hour), webmail server, NIS master, and NFS server (with about 50 client machines).

I have solved the problem on our machine using the "noapic" option for the kernel line in grub.conf.
Machine: Dell PowerEdge 2650, dual Xeon 2.4 GHz, dual Broadcom 10/100/1000 Ethernet adapters (tg3 driver!), 1 GB RAM, onboard SCSI RAID

Symptom:
- freezing, after 30 minutes or up to 10 hours
- sometimes displaying a message on console, sometimes not
- no response to ping
- fans running faster
- hard reset required

RedHat: Kernel 2.4.18-17.7.x and 2.4.18-18.7.x (SMP versions)

Solution: adding "noapic" in grub.conf, machine is now up for 4 1/2 days, no more trouble

Credits: Thanks to Guiseppe Raimondi from RedHat/UK

Next Steps: Maybe I will try jgarzik's new kernel with corrected tg3 drivers (hmm, it's a production machine, I'll see if I wait for the next official kernel...)

Idea: As far as I understand, apic distributes the interrupts to the four logical processors. Is it possible that the tg3 driver is faulty in that area? When not using apic it works fine.

Still strange: I updated from the -17 kernel to the -18 kernel. Using the -17 kernel everything was running properly. After doing the update, booting with the -18 *AND* the -17 kernel produced the same problem of freezing the machine. Is something wrong with the kernel update procedure, or did I misunderstand something? I thought I could go back to a previous kernel?

Jeff, I left two systems w/ tg3 v1.2 running heavy network traffic continuously over the holidays. It was still running fine until I stopped it this afternoon (5 days straight!). Tried tg3 v1.2txlock from http://people.redhat.com/jgarzik/tg3/tg3-1.2txlock/

Here's some performance data from running netperf between 2 PE1655MC blades, each with 2 integrated BCM5703 NICs:
- Both ends using bcm5700 2.2.26: ~50.3MB/sec per NIC
- Both ends using v1.2 (no txlock): ~71MB/sec per NIC
- Both ends using v1.2txlock: ~71MB/sec per NIC

1.2txlock seems to be just as stable and performance seems to be equivalent (at least in this test).

My original problem was with two Dell PE2550s and the problem was fixed with the 2.4.18-17.7.x or maybe the -10 one.
But I've got a new 2650 with dual 2.8GHz P4 Xeons and 6GB of memory and it hasn't gone overnight without hanging. There is no debugging information at all, just a consistent hang. I've tried loading the network and it seems fine but by the next morning, it's hung again. The test kernel with the TG3 1.2 driver didn't make a difference. Admittedly, it could be because of something else but it appears others are having trouble with the 2650s too.

I had this problem (see previous posts) with a dual Xeon Dell 2650 with 1 GByte of memory. I've found that the bigmem kernel on RH7.3, together with the drivers supplied on the Dell site for the Broadcom network card for RH7.2, has given me a stable platform. I used to get a lock-up (no debug) after 4 hours or so; uptime so far is 12 days with this combination, without adding the noapic option to the grub config. Basically I have two servers that were displaying the problem under light load and now don't, even under heavy load. Hope this is of use.

Does the near future hold an errata kernel?

I have this problem with a Dell 2650 2x2.4GHz CPU/2GB RAM dual BCM5701 gbit ethernet running RH 8.0 with kernel 2.4.18-18.8.0smp. Reproducible hang while doing a recursive scp of a single directory from the Dell 2650 to another machine. The directory contains about 20+ files and totals about 1MB of data. Approximately halfway through the directory the system hangs. The problem occurs with hyper-threading enabled or disabled. The noapic kernel parameter seems to keep the system from hanging in my case. I don't know if that helps in the debugging or not.

Based on feedback, I will confirm that tg3 driver version 1.2 definitely fixes these problems.
Users can get unofficial rpms containing these fixes from:
http://people.redhat.com/jgarzik/tg3/tg3-1.2/rpms/
or simply download the tg3 1.2 source code and drop it into your current kernel build, from
http://people.redhat.com/jgarzik/tg3/tg3-1.2/
or simply download the latest stock kernel, 2.4.20, or download the latest Red Hat rawhide kernel, or wait for the next Red Hat release.

If you continue to see crashes with tg3 1.2, please open a new bug.

*** Bug 78822 has been marked as a duplicate of this bug. ***

*** Bug 78427 has been marked as a duplicate of this bug. ***

My test results so far:
1) Kernel 2.4.18-17.7.x -> OK
2) Update to Kernel 2.4.18-18.7.x -> crashes after 30 min. up to 12 hours (6 times)
3) Running 2.4.18-18.7.x with noapic option -> crashes after approx. 7 days (2 times)
4) Jeff Garzik's tg3 1.2 (120) kernel -> crash after 5 hours (once)

What are the experiences using Jeff Garzik's txlock (121) kernel? Anyone succeeded on a Dell PowerEdge (mine is a 2650 dual Xeon, dual Broadcom)? Anyone out there still having troubles with a Dell PowerEdge? (My so-called solution using the noapic option was in fact wrong - I wrote it too early) (For more

The bug still exists as far as I am concerned; the new tg3 module does not fix the problem that I am seeing, nor does the noapic option to the kernel. I have had to go back to using the bcm5700 module instead of the tg3. I have been up for 48 hours now, and counting... I have a Dell PE2650, 2x2.4GHz Xeons, 2GB RAM, 2x HW mirrored sysdisks, 500GB RAID 5 array attached via 2xQLA2310F cards, and an autoloader attached via SCSI card. The machine has twin onboard Broadcom network cards.

As it seems that this bug is not solved, could you open this thread again, Jeff? I still have an 8 k$ machine here that is not reliable. And I think I am not the only one. It does not make sense to open a new bug, as the history could be useful.

People are still seeing problems, re-opening bug.
I have a Dell PowerEdge 2650 dual Xeon 2.8 GHz, dual on-board Broadcom Gigabit NICs, with 6 gigabytes of memory. I have also been experiencing the system crashes on Red Hat 8.0 - most likely due to the tg3 driver. I am running the bigmem version of the kernel. By running PostgreSQL heavily, I am able to cause the system to crash regularly. With kernel-bigmem-2.4.18-19.8.0, the system crashed after eight days. That is actually an improvement over kernel-bigmem-2.4.18-18.8.0, which I was able to crash consistently after less than one day. So it looks like an improvement was made, but there are still bugs in the driver. I don't know if this is related, but I am seeing the following messages in /var/log/messages (once or twice a day):

kernel: ENOMEM in do_get_write_access, retrying.
kernel: ENOMEM in journal_alloc_journal_head, retrying.

To all still experiencing problems,

1) please boot with "noapic" on the kernel command line. You can run "cat /proc/cmdline" to check for sure.

2) I have posted some new rpms for testing, based on the latest errata:

latest production tg3 release, 1.2a, built into unofficial rpms:
http://people.redhat.com/jgarzik/tg3/tg3-1.2a/rpms/

but I would like people to test my experiment which should provide additional stability:
http://people.redhat.com/jgarzik/tg3/tg3-1.2a/exp1-rpms/

...and if that doesn't work for people, fall back to experiment 2:
http://people.redhat.com/jgarzik/tg3/tg3-1.2a/exp2-rpms/

Feedback requested!

On several systems, there is evidence that the lock-ups are not directly related to the driver but more to the system board. So please make sure to attach 'dmesg' and 'lspci -vvv' output in future bug reports.

I don't know if this helps, but I have two Dell PE 2650s (dual 2.4 GHz P4 Xeon) connected to a 100 Mbps full duplex switch and running the 2.4.18-10smp kernel that are very stable (90+ days under moderate to heavy load).
I have a third PE 2650 with the same hardware configuration running the 2.4.18-19smp kernel that won't run for more than 2 days without locking up under virtually no load. I will try the tg3 release 1.2a kernel.

After several tests (please see above) I found out that using the bcm5700 driver instead of the tg3 driver works fine on my Dell PE 2650, dual Xeon 2.4 GHz, dual Broadcom 10/100/1000. Even without the kernel "noapic" option the system has now been up for 17 days. I will try the tg3 1.2a driver/kernel and report the results.

Please try "experiment 1" rpms, as well. These are for testing potential "tg3 lockup" problems. tg3 1.2a is a maintenance release which should improve performance and fix a PXE issue, but does not directly address the lockup problems people are seeing. The driver version string will show up as "tg3 1.2a+exp1" after bootup.

As another data point, I recently upgraded a Dell PowerEdge 2550 (dual P3 933, 2.5GB RAM) to Red Hat 8.0 and had the system hang overnight when using the tg3 gig port and kernel 2.4.18-19.8.0smp. I then switched it to use the eepro100 port and it has been up a week without problems. There was nothing in the logs about the hang. Unfortunately, I can't do much testing as all the machines experiencing the problem are production systems.

I also have a Dell 2650 dual 2.4GHz with 3GB of RAM and a 64GB RAID array that is locking up on me with a 2 or 3 day frequency. I tried the experimental kernel rpms
http://people.redhat.com/jgarzik/tg3/tg3-1.2a/exp1-rpms/
http://people.redhat.com/jgarzik/tg3/tg3-1.2a/exp2-rpms/
and
http://people.redhat.com/jgarzik/tg3/tg3-1.2a/exp3-rpms/
I can't speak yet to the lock-up, but I have lost all network capability. I can't ping anything. The interface is up, has a link light, and works when I switch to a non-experimental kernel. Side question: with a dual-processor server and 3GB of RAM, which kernel do you want? bigmem or smp?

kernel-smp-2.4.18-19.7.tg3.126.i686.rpm doesn't work at all for me.
I get no networking (ping etc.). This is with or without noapic set. To test things, I unloaded the tg3 and then loaded the bcm5700 driver. This made networking work again.

Created attachment 89446 [details]
lspci -vvv for 2650 with tg3 problems
Created attachment 89447 [details]
dmesg for 2650 with tg3 problems
Ok, some of these reports have actually been fixed in more recently posted rpms. Just to get everybody on the latest page, please use "aragorn2" test rpms, posted at http://people.redhat.com/jgarzik/pub/ This is the latest Red Hat errata kernel for 7.x/8.x, with the recent tg3 bug fixes. I have a Dell 2650 and have tried all mentioned with no luck. I recently tried aragorn2 and noapic which seems to help. Without noapic it crashes hard. Is there indeed a fix for this? Thanks Scott Ladies and gentlemen, I have received permission to post the latest release candidate of Red Hat's errata kernel. It contains not only fixes for e1000 and tg3 net drivers, but also system-level fixes which may address the problems users on this list were seeing. This kernel is currently in Red Hat Q/A, and has NOT yet been "qualified" as official, nor has it been released. Errata kernel 21 release candidate, for Red Hat 8.0: http://people.redhat.com/jgarzik/pub/2.4.18-21.8.0/ Errata kernel 21 release candidate, for Red Hat 7.x: http://people.redhat.com/jgarzik/pub/2.4.18-21.7.x/ It is requested that people who were seeing crash problems test this kernel, as this will be the next official Red Hat errata kernel, after it passes Q/A. Can the noapic option be removed with these latest kernels? With the tg3 driver, in promiscuous mode, the server hang occurs within the next 30 minutes. The same setup with bcm5700 driver does not hang. Machine: Dell Poweredge 2650, dual Xeon 2,4 Mhz, dual Broadcom 10/100/1000 Ethernet adapters (tg3 driver), 1 GB RAM, Onboard SCSI Raid, 100Mbps Ethernet port. The port trafic is heavy as it is the span of a busy subnet. RedHat 7.3 up2date, Kernel 2.4.18-19.7.xsmp but no hang with 2.4.18-19.7.xdebug Thanks Daniel. I can confirm that the latest production released Redhat kernel 2.4.18-24.7.xsmp does not fix the problem. My PE2650 crashed in the usual manner after about 5 hours of normal (minimal) activity. 
I am concerned that the bcm5700 modules (the only work around) do not exist in /lib/modules for this new kernel - it would appear that they have been deprecated. This is unacceptable to me as my machine has run for two months on these modules perfectly fine. Hence I cannot run the latest kernel and have had to revert my machine to the 2.4.18-18.7.xsmp kernel with the bcm5700 kernel module. I also have a call (ref #222224) logged with Redhat's Patrick Ernzer (pernzer) who is working with Dell UK on trying resolve this issue for me for the last 4 months. I will also submit this report to bug #79997 on bugzilla.redhat.com as I am not sure which bug I am actually suffering from. I upgraded based on this posting http://people.redhat.com/jgarzik/pub/2.4.18-21.7.x/ my redhat 7.3 installation to and I have not suffered a system lockup since. Scott Yesterday I installed 2.4.18-24.7.xbigmem on a Dell PE2550 with Red Hat 7.3, fully patched. The new kernel hung after less than 12 hours. No apparent difference over 2.4.18-19. The same system was completely stable with an Intel e1000 card on 2.4.18-18. I got hopeful when 2.4.18-19 came out and went back to the built-in Broadcom. Woe is me. I'll second the above results. On a dual Xeon 2.4.18-24.7.xsmp hung in less than 12 hours. On a much more important note where do I find previous kernels to back out this kernel. RHN deleted all of the kernel source directories we had, which are needed to be build external modules for this machine. I am also reliably seeing this on an IBM x335 single 2.00ghz xeon running the smp kernel to get the hyperthreading support. Would running the uniprocessor kernel likely fix the problem? It reliably crashes every 10 hours or so when under network load. 
FYI, I ran a test and was able to crash my test box within an hour or so.
RH 7.3, kernel-smp-2.4.18-24.7.x.i686, Dell 2550, tg3
I tested using ttcp (test tcp), something like this:
receiver: while true ; do ttcp -r -s ; done
sender: while true ; do ttcp -t -s receiver.ip.address ; done
I tested with different amounts of packets, e.g. -n 100000
FYI, ifconfig shows total traffic throughput, but loops at, IIRC, 4 gigs
Responding to the last message, yes, it would be a useful datapoint to determine if tg3 still crashes for you, on a uniprocessor kernel. Also, make sure you are updated to the latest kernel errata, which includes several tg3 bug fixes.
Oh sorry, I should have noted that we were running the 2.4.18-24.7.xsmp kernel from the errata updates. I am now switching it over to run the uniprocessor version of this kernel and will report back tomorrow with whether or not it still locks up.
Running in uniprocessor mode with the 2.4.18-24.7.x, the server has been up for 19:56 without crashing. I believe if it stays up for one more day this might be a reasonable workaround for the short term.
I confirmed Comment #60. I ran 2 of my 2650 single processor servers, one with noapic and one with apic, using the 2.4.18-24.7.xsmp kernel (I want to use the hyper-threading). I downloaded the ttcp software from: http://www.linuxtested.com/linux_tools.html I set up a link so each system would transmit to and receive from the other. The server with noapic stayed up, the other died. Jeff, I will test again with the 2.4.18-24.7.xsmp kernel and noapic on both systems. What do you recommend, should I use the "noapic" option and single processor kernel?
Looks like I was incorrect. One of my colleagues rebooted these machines due to a hang earlier today. It would appear the uptime is incorrect because the hwclock was off and when the system came up it synced its clock into the future. :-(
If I use the noapic option, I will lose the high resolution timer on my smp system, correct?
If so, I cannot afford to do this. I was going to go back and try the bcm5700 driver, but it appears to not be included in the new kernels. Question on the Uniprocessor test: What is the current packet count (e.g. > 2.4 billion packets)? If you flood the adapter (e.g. with mgen, ttcp, etc) does the box stay connected/up? We are currently rebooting our Dell SMP based servers every ten (10) business days, which is approximately when we are hitting the packet limit. We have been staying on 2.4.18-5xsmp and testing with each of the latest kernels to no avail. Is anyone using any scripts and/or test cases to accelerate their testing of the driver/kernel? We have been using mgen to flood the NIC, it has given us very fast results vs. waiting for the box to die over time. I'm going to look at using ttcp also. What details beyond "this doesn't work" can we supply to help with the debugging (e.g. logs, stats, etc)? I feel for Jeff, as Broadcom is historically not very responsive to their user's community's needs. Even leaning on our hardware vendors who supply the Broadcom NICs in their servers has gone nowhere, their tech support groups have gotten nowhere with Broadcom. We need our server vendors to back us and refuse to resell/use their NICs, unless they open up their code and specs to the *nix community. Unfortunately, I'm unaware what the packet count was before it died. As I mentioned, one of my colleagues snuck in and rebooted the box unbeknownst to me, so I was rejoicing over a false test result. :-( We have a demo that we're giving in a few hrs, then I'll switch back to the old (kernel-smp-2.4.18-19.7.x) kernel and the bcm5700 driver and try that on one of the machines. More later. What crashes a maschine instantly for me when using the tg3 driver is netperf. Tests were run on a Tyan K7X (760MPX) with dual Athlon MP2000+ cpus and a 3Com 3C996-T Gigabit NIC. Netperf just blast tcp packets at full speed from one maschine to another. 
Run it as follows:
- start 'netserver' on the target machine.
- start 'netperf -H <target> -t TCP_STREAM -n 2' on the sender to send tcp packets for 10s at maximum speed using two cpus to the target.
Kernel 2.4.18-24.7.x with tg3 driver crashes instantly if it's the target machine. Strangely enough it works if it's the sender. BTW, before anybody asks, the second machine for the rate tests is a Tyan K7 (760MP) with dual Athlon MP2000+ and an Intel Gigabit Server NIC.
Created attachment 90037 [details]
PE2650 crash screen from kernel-2.4.18-24.7.xsmp.i686.rpm
This is the crash screen from a PE2650/dual onboard BCM95701A10, taken from the
remote access console. Kernel was kernel-2.4.18-24.7.xsmp, command line was
"ro root=/dev/sda2 nmi_watchdog=1".
Attached the crash screen from my PE2650/dual X1.8/dual BCM95701A10 running kernel-2.4.18-24.7.xsmp. Crash appears to be in the tg3 code. Kernel command line was "ro root=/dev/sda2 nmi_watchdog=1". I estimate the uptime was just over 1 day. This machine is in development, and was doing nothing but running setiathome. It barely had any Ethernet activity. Machine must go into service next week. Red Hat: Put the bcm5700 module back in the kernel tree, tg3 is clearly not stable. People need a stable production kernel, we're not here to debug. Crashed again. Caught the full output on a serial console: NMI Watchdog detected LOCKUP on CPU3, eip e0ccbae0, registers: esm ppp_async ppp_generic slhc racser tg3 ipt_LOG ipt_state ip_conntrack_ftp ip_conntrack iptable_filter ip_tables usb-ohci usbcore reiserfs lvm-mod aacraid s CPU: 3 EIP: 0010:[<e0ccbae0>] Tainted: P EFLAGS: 00000086 EIP is at .text.lock.tg3 [tg3] 0xa4 (2.4.18-24.7.xsmp) eax: c03de300 ebx: d4435d80 ecx: d4435d80 edx: c03de3f4 esi: e0cc80d0 edi: 00000282 ebp: 00000000 esp: d3103f38 ds: 0018 es: 0018 ss: 0018 Process setiathome (pid: 1251, stackpage=d3103000) Stack: d4435ebc e0cc80d0 00000180 c012635b d4435d80 d3103f54 00000086 d3103f54 d3103f54 00000000 00000001 00000180 00000000 c012256b c03ce600 c0122411 00000000 00000001 c03a8980 fffffffe 00000003 c012219b c03a8980 00000046 Call Trace: [<e0cc80d0>] tg3_timer [tg3] 0x0 (0xd3103f3c)) [<c012635b>] timer_bh [kernel] 0x29b (0xd3103f44)) [<c012256b>] bh_action [kernel] 0x4b (0xd3103f6c)) [<c0122411>] tasklet_hi_action [kernel] 0x61 (0xd3103f74)) [<c012219b>] do_softirq [kernel] 0x6b (0xd3103f8c)) [<c010a8b0>] do_IRQ [kernel] 0x100 (0xd3103fa8)) [<c010d078>] call_do_IRQ [kernel] 0x5 (0xd3103fc0)) Code: 80 3b 00 f3 90 7e f9 e9 ee c5 ff ff 80 7b 2c 00 f3 90 7e f8 console shuts up ... [ <c01e8744>] netif_receive_skb [kernel] 0x184 (0xd3105ec0)) I have not been able to lock my system up as long as I use the "noapic" option. 
From all the comments it is difficult to tell if this is the case. Has anyone crashed with the 2.4.18-24.7.xsmp kernel and noapic set? On my Dell 2 dell 2650 servers single Xeon processor, which emaulates 2 logical processors, while running 2.4.18-24.7.xsmp with noapic set I ran ttcp tests for about 19.5 hours setting up each system to both transmitt and receive between each other without a problem. I could not get the "netperf" to compile so I could not test as comment #68 did. Here is the packet count from netstat for each system, note the first system has been up for just over 3 days. The other system crashed when I did not have "noapic" set but has been up ever since with noapic set: sopko@firebird:8% uname -a Linux firebird.cs.unc.edu 2.4.18-24.7.xsmp #1 SMP Fri Jan 31 06:10:55 EST 2003 i686 unknown sopko@firebird:9% uptime 8:30am up 3 days, 9 min, 5 users, load average: 0.10, 0.08, 0.08 sopko@firebird:10% netstat -s Ip: 676708765 total packets received 0 forwarded 0 incoming packets discarded 675390068 incoming packets delivered 494066595 requests sent out 1416872 reassemblies required 366370 packets reassembled ok 791 fragments created sopko@rockx:1% uname -a Linux rockx.cs.unc.edu 2.4.18-24.7.xsmp #1 SMP Fri Jan 31 06:10:55 EST 2003 i686 unknown sopko@rockx:2% uptime 8:30am up 19:26, 4 users, load average: 0.01, 0.03, 0.00 sopko@rockx:3% netstat -s Ip: 312139470 total packets received 0 forwarded 0 incoming packets discarded 312006240 incoming packets delivered 314833333 requests sent out 736 reassemblies required 203 packets reassembled ok 349 fragments created I remember that compiling netperf wasn't trivial, I had to do a few changes in the Makefile. If anybody wants to try my RH 7.2 executables, here is a link. If you want to compile it yourself, there also is a link to the modified makefile (for netperf 2.2pl2). 
http://www.physics.ohio-state.edu/~hufnagel/netperf http://www.physics.ohio-state.edu/~hufnagel/netserver http://www.physics.ohio-state.edu/~hufnagel/makefile Tested kernel kernel-smp-2.4.18-24.7x.legolas2.i686.rpm from http://people.redhat.com/jgarzik/pub/legolas2-7.x/i686/ I'm running ttcp and 'RX bytes:' has looped four or five times. Last time, it crashed after one or two. Yea! My two Dell 2650 single processor servers running with the multi-processor kernel and my one dual processor Dell 2650 have been up for almost a week using the "noapic" option. I have been running the ttcp network test software on all 3 systems over the weekend. One system has received 256 millon packets the others 1.8 billion and 2.5 billion: sopko@firebird:6% uname -a Linux firebird.cs.unc.edu 2.4.18-24.7.xsmp #1 SMP Fri Jan 31 06:10:55 EST 2003 i686 unknown sopko@firebird:7% uptime 7:35am up 6 days, 23:15, 5 users, load average: 0.02, 0.09, 0.11 sopko@firebird:8% netstat -s|head -9 Ip: 256585885 total packets received 0 forwarded 0 incoming packets discarded 253744336 incoming packets delivered 74837210 requests sent out 1417403 reassemblies required 366526 packets reassembled ok 802 fragments created sopko@rockx:3% uname -a Linux rockx.cs.unc.edu 2.4.18-24.7.xsmp #1 SMP Fri Jan 31 06:10:55 EST 2003 i686 unknown sopko@rockx:4% uptime 7:36am up 4 days, 18:34, 3 users, load average: 0.15, 0.09, 0.02 sopko@rockx:5% netstat -s|head -9 Ip: 1868251366 total packets received 0 forwarded 0 incoming packets discarded 1867471797 incoming packets delivered 1874499843 requests sent out 16714 reassemblies required 5724 packets reassembled ok 392 fragments created Linux swan.cs.unc.edu 2.4.18-24.7.xsmp #1 SMP Fri Jan 31 06:10:55 EST 2003 i686 unknown sopko@swan:2% uptime 7:39am up 6 days, 22:35, 6 users, load average: 0.11, 0.04, 0.01 sopko@swan:3% netstat -s|head -9 Ip: 2520303211 total packets received 0 forwarded 0 incoming packets discarded 2513597627 incoming packets delivered 2682733841 
requests sent out 7729216 reassemblies required 2040387 packets reassembled ok 5034907 fragments created An update: the bcm5700 driver which came with 2.4.18-19.7.xsmp seems quite stable. In addition, I did some testing with the 2.4.9-e.10summit that is an errata update for the AS 2.1 kernel and the tg3 driver which is included there appears stable. I've been smacking it down with ttcp for 3.5 days now and no crashes. I guess I should specify that the 2.4.9-e.10summit is the summit kernel for running on an IBM x440 machine. Summit kernels must not be used except on an obscure IBM box for which they were intended. The -e kernels feature a slightly different tg3 with simplified locking. They also lack NAPI support. This removes lockups at the expence of performance. I think users of normal RHL systems should stick to the Jeff's tg3. If someone tests -e for me - great, thanks. Buf if your ksoftirqd eats all CPU on -e, or something else is weird, please open a new bug. This bug is about a specific problem in the normal tg3. Tested kernel-smp-2.4.18-24.7x.legolas2.athlon.rpm with netperf today. Same behavior as before, sending a TCP stream at full speed works, receiving one crashes the maschine instantly. Tried kernel-smp-2.4.18-24.7x.legolas2.i686 on PE2650 with dual BCM5701's, completely locked up on boot when initializing eth0. No kernel messages. -up kernel does it too. I'm running 2.4.18-19.7.aragorn2smp without a single problem until now. All other kernels crashed one way or another under heavy traffic and high load (yes, load due to many small processes seems to matter). smp-2.4.18-24.8.0.i686 also crashed, so I went back to aragorn2. I can't really use this machines to test kernels since they are production systems so I won't be installing new kernels. Just my 2 cents. More forward progress. 
From a message sent to linux-poweredge mailing list, by me:
As hinted in previous emails, here are the deadlock and hardware bug fixes in tg3, which fix the crashes in previous versions.
http://people.redhat.com/jgarzik/pub/legolas4-7.x/ (redhat 7.x)
http://people.redhat.com/jgarzik/pub/legolas4-8.0/ (redhat 8.0)
As usual, these are based on the latest Red Hat errata kernel, currently 2.4.18-24. Also as usual, these kernels are unofficial, not intended for production, and have not been through the Red Hat Q/A process.
A user requested that I be less vague on the changes and describe what has changed. In my defense, I didn't think people really wanted the depth of information, other than a simple "there's been progress." I stand corrected. Here are the tg3 driver changes in this latest kernel (legolas4), taken directly from BitKeeper:
# --------------------------------------------
# 03/02/18 jgarzik 1.990
# [netdrvr tg3] disable 5701 h/w bug workaround during core clock reset
# --------------------------------------------
...
# --------------------------------------------
# 03/02/18 jgarzik 1.991
# [netdrvr tg3] fix NAPI deadlock
# * do not hold driver spinlock during RX processing in tg3_poll
#   (this is the deadlock fix... works around a NAPI net stack bug)
# * create netif_poll_{en,dis}able to synchronize against dev->poll()
# * create __netif_rx_complete to avoid a third irq-save in tg3_poll
# * create tg3_netif_{start,stop} as driver-specific helper functions
#   which disable and enable NAPI polling and TX queueing. Note that
#   the TX queueing enable/disable is purely advisory, and is not
#   intended to prevent any races.
# * remove tg3_halt call from tg3_set_power_state, as all callers
#   have already called tg3_halt, making it redundant. Removing this
#   function call also eliminates some locking complications.
# * use new helper __netif_rx_complete in tg3_poll
# * create tg3_reset_task, as a function that runs in process context
#   which resets the NIC. This is needed because tg3_netif_stop()
#   calls schedule() in the process of disabling dev->poll.
# * schedule tg3_reset_task from tg3_tx_timeout
# * schedule tg3_reset_task from tg3_timer
# * wrap several tg3_halt...tg3_init_hw sequences with
#   tg3_netif_stop...tg3_netif_start. In addition to synchronizing
#   with dev->poll, this additionally fixes bugs where we were not
#   calling netif_wake_queue, when we should have been.
# * move netif_start_queue call to very bottom of tg3_open
# * add missing tg3_netif_{start,stop} to tg3_{suspend,resume},
#   further fixing obvious bugs.
# --------------------------------------------
...
# --------------------------------------------
# 03/02/18 jgarzik 1.992
# [netdrvr tg3] bump version to 1.4c / Feb 18
# --------------------------------------------
...
# --------------------------------------------
# 03/02/18 jgarzik 1.993
# [netdrvr tg3] properly synchronize with TX, in tg3_netif_stop
# --------------------------------------------
...
# --------------------------------------------
# 03/02/18 jgarzik 1.994
# [netdrvr tg3] fix TX race in previous code, and another buglet
#
# * call netif_tx_disable after netif_poll_disable, fixing TX race,
#   in tg3_netif_stop
# * follow the ordering of the tg3_netif_stop change, and enable
#   poll after waking TX, in tg3_netif_start
# * after doing those two steps in tg3_netif_start, check for work
#   using new helper function tg3_cond_int
# * add helper function tg3_cond_int, which delivers an interrupt
#   if and only if the status block was updated (i.e. if work
#   is likely to be available)
# --------------------------------------------
My netperf test worked fine with the new legolas4 kernel. I received and sent approx. 50MB/s for 10 minutes on the machine with the 3C996-T running the tg3 driver without the machine crashing on me. Thanks.
I'm highly confident that "legolas4" kernel solves the existing issues users were seeing...
now we just have to confirm that those issues did not hide other existing issues. legolas4 kernel (tg3 version 1.4c) has survived the scenarios which killed previous drivers in the lab, so now we just need user feedback to verify that problems in the field are resolved. Jeff, does your lab testing include something along the following lines: About 40 new (small) processes per second, each one doing a few checks via snmp. Sometimes there might be as much as 500 processes using the network, but each one only sending and receiving a few packets. The traffic averages at about 200kb/s with very few bursts. I've noticed a particular server behing a Cisco 2600, used as a firewall but severely limiting the peak network usage to about 10Mb/s because the 2600 processor is so slow at this, never crashed. Other server doing exactly the same things (same number of processes, same network usage) but behind a PIX and connected to an high performance network crashes all the time (except with aragorn2). I also think the 2600 basically makes the network behave as if half-duplexed. Might this (half-duplex) have completly masked the problem with the kernel? I suggest persons still having problems with tg3 to try half-duplex operation since that might completly hide the problems. Does this have some scientific explanation? :-) (PS: the machines are Dell PE2650 Dual Xeon 2.4GHz) Rodrigo, No, we have did not test snmp in relation to tg3. Are you seeing failures with tg3, and the "legolas4" kernel? (posted at http://people.redhat.com/pub/) In any case, if you can contribute snmp test scripts or other descriptions of how you are testing, we can certainly add that to our network test suite. WRT half-duplex, that would only be a factor inasmuch as it slows down the driver enough to hide the recently-solved problems. I noticed there are a few new enterprise kernels released: kernel-2.4.9-e.12.src.rpm kernel-2.4.18-e.25.src.rpm Do these have any new tg3 and/or bcm5700 related changes? 
jeff: Please redirect questions of that sort through your support representative -- bugzilla really isn't meant for that kind of request. However: https://rhn.redhat.com/errata/RHBA-2002-319.html does mention that there's a new tg3 driver. The bcm5700 driver is not changed, and the tg3 driver is (obviously) not up to the latest level being tested here. There are actually two tg3 drivers in there; the older version for folks for whom it works, and the newer tg3_12e3 with more recent updates. But for more details, do please contact your support representative. Thanks! > No, we have did not test snmp in relation to tg3. Are you seeing
> failures with tg3, and the "legolas4" kernel? (posted at
> http://people.redhat.com/pub/)
Failures? Not yet :-) (Damn, I really hope not...!)
I decided to give legolas4 a shot and everything has been stable for ~20 hours.
About my loads: a synthetic way to emulate them would be to launch about 40
processes per second, each taking about 4 or 5 seconds to exit, and each making
a few dozen snmp queries. A few other processes are crunching the incoming data.
As I said the network load is light at about ~200kb/s, and all this is perl so
the processor is the main bottleneck.
I'll make a simple synthetic test along these lines if anyone is interested. All
the kernels I tested until now crashed in 1 or 2 days (except aragorn2 > 20 days).
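For anyone who wants to approximate the load described above before a proper test script is posted, here is a rough Python sketch. It is not the Perl test attached later in this bug, and the worker is only a stand-in: each child simply sleeps for its lifetime instead of making SNMP queries. The names `expected_concurrency` and `run_load` are made up for this illustration; pass your own `command` (for example a script that queries your devices) to generate real network traffic.

```python
import subprocess
import sys
import time


def expected_concurrency(launch_rate_per_s, lifetime_s):
    """Average number of live processes: launch rate times lifetime."""
    return launch_rate_per_s * lifetime_s


def run_load(launch_rate_per_s, lifetime_s, duration_s, command=None):
    """Start short-lived child processes at a steady rate.

    Launches round(launch_rate_per_s * duration_s) children, spaced
    evenly, waits for the stragglers, and returns the number launched.
    `command` is a placeholder for the real workload (assumption: the
    real one would issue a few dozen SNMP queries and then exit).
    """
    if command is None:
        # Stand-in worker: a child interpreter that sleeps for lifetime_s.
        command = [sys.executable, "-c",
                   "import sys, time; time.sleep(float(sys.argv[1]))",
                   str(lifetime_s)]
    total = round(launch_rate_per_s * duration_s)
    interval = 1.0 / launch_rate_per_s
    start = time.monotonic()
    running = []
    for i in range(total):
        # Sleep until the scheduled launch time of child number i.
        delay = start + i * interval - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        running.append(subprocess.Popen(command))
        # Drop children that have already exited so the list stays small.
        running = [p for p in running if p.poll() is None]
    for p in running:
        p.wait()
    return total
```

With 40 launches per second and a lifetime of about 4.5 seconds, `expected_concurrency(40, 4.5)` gives 180 processes alive on average, so a call such as `run_load(40, 4.5, 3600)` would keep roughly that many children running for an hour.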
Yes, I am interested in a synthetic test like you describe. You may add it as an attachment to this bug report, or email it to me directly at jgarzik. Thanks for the success report, also! ok, legolas4 just crashed... Now, this time the crash was a bit different: the machine answered pings and tcps, but the connection would stall after the first packet. No logs existed, as usual. The console, on a serial port, was stalled, as usual. Jeff, A friend of mine just had a similar crash with the latest official release: The machine answers ICMP, refuses tcp (RST). The console is not connected, so I can't report on that. Anyway... it seems legolas4 suffers from the same problem. The hardware is the same: Dell PE 2650 Dual Xeon 2.4GHz BTW, I'm sending the benchmark code right away, with usage instructions. Created attachment 90274 [details]
Synthetic load cpu/network test, in perl.
I have an IBM x235 (2x2.4 Xeon & Broadcom Integrated NIC) which I believe suffers from this issue. System is ok until it encounters heavy network traffic. I did upgrade to the latest kernel through RHN to no avail. It still freezes up in a matter of hours. One thing I noticed, which may be unrelated as I'm not that familiar with the internals of Linux, is that with the system monitor up during initial heavy network traffic, the RAM fills to capacity, not swap space, just RAM and it doesn't seem to be released if network traffic dies down. It's jumping from ~350MB being used to the full 2.5GB. Has anyone experienced anything similar? > One thing I noticed, which may be unrelated as
> I'm not that familiar with the internals of Linux, is that with the
> system monitor up during initial heavy network traffic, the RAM fills
> to capacity, not swap space, just RAM
If your network traffic also means disk reads and writes then that's quite
normal, since Linux allocates almost all free memory to disk buffers.
The algorithm for physical memory is something like:
1 - Allocate up to 3/4 to text/data/stack memory, but only if needed.
2 - Keep a small pool of free pages, ready to allocate to (1)
3 - Use the rest as read/write buffer space, and "to be written buffers" space.
4 - The remaining least used text/data/stack pages are dumped to swap space to
maintain space for (2) and (3).
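A quick way to check whether "full" RAM is really buffer/cache rather than program memory is to subtract the `Buffers` and `Cached` figures from the used total in /proc/meminfo. The sketch below is only an illustration under assumptions: it parses the `Key: value kB` lines, the helper names are invented here, and the sample numbers are made up to resemble the 2.5GB machine in the report, not taken from it.

```python
SAMPLE_MEMINFO = """\
MemTotal:      2587000 kB
MemFree:         12000 kB
Buffers:        150000 kB
Cached:        2075000 kB
SwapTotal:     2048000 kB
SwapFree:      2048000 kB
"""


def parse_meminfo(text):
    """Turn 'Key:  12345 kB' lines into a dict of integers (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip header lines or anything whose first field is not a number.
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_breakdown(info):
    """Split physical memory into program use, buffer/cache, and free."""
    cache_kb = info["Buffers"] + info["Cached"]
    programs_kb = info["MemTotal"] - info["MemFree"] - cache_kb
    return {
        "programs_kb": programs_kb,
        "cache_kb": cache_kb,
        "free_kb": info["MemFree"],
    }


breakdown = memory_breakdown(parse_meminfo(SAMPLE_MEMINFO))
print(breakdown)
```

On a live system the same functions can be fed the contents of /proc/meminfo, e.g. `parse_meminfo(open("/proc/meminfo").read())`. If `programs_kb` stays near its idle value while `cache_kb` grows, the memory is being used for disk buffers and will be reclaimed when programs need it.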
I ran the test kernel "2.4.18-24.7x.legolas4smp" (without noapic kernel option) on 2 Dell 2650 machines, (100MB ethernet connection), and have not had a crash. The machines have been running for just over 3 days. I ran the ttcp test program between the 2 machines, they each sent/received just over 2 billion packets. Rodrigo, Can you file a new bug with your latest failure? That does not sound specifically tg3 related, and in any case would be a different bug from this one. Second, obtaining output from a freeze can be done by passing nmi_watchdog=1 on the kernel command line. (though if it receives ICMP traffic, NMI watchdog isn't going to come into play...) Well, no crashes in 5 days, so that one on 21/02 might have been spurious... Sorry, I could not help by testing kernel versions as promised earlier. For the records: Our Dell PE 2650, Dual Xeon 2,4, Dual Broadcom 10/100/1000 is now up for more than 72 days of (heavy) production using: - Kernel 2.4.18-18 - smp - not using "noapic" - bcm5700 driver instead of tg3 Anyone succeeded for a longer period of time (> 14 days) with any kernel using tg3 and comparable machine so far? Up 67 days on 2.4.18-19.7xsmp, NetXtreme BCM5701 Gigabit Ethernet from BROADCOM, tg3 driver, dual XEON 2.0 GHz CPUs, 2GB Ram, Intel SE7500WV2 motherboard, 2U rack mount Open Storage Solutions Server chassis under fairly heavy network loads - loads that always hung a machine before. This applies to many identical servers. They would not stay up more than a couple days on 2.4.18-18.7xsmp - even less (on order of minutes) when using the "no apic" on same 2.4.18-18.7xsmp kernel. We have racks of these, all maintained via RHN/up2date, so testing out non-official kernels that I can't get thru up2date is not really an option >> Anyone succeeded for a longer period of time (> 14 days) with any
>> kernel using tg3 and comparable machine so far?
58 days with tg3 , web server in server farm (only one running tg3)
chrismcc]$ uname -r
2.4.18-19.7.xsmp
chrismcc]$ uptime
11:11am up 58 days, 3:12, 1 user, load average: 1.40, 1.12, 1.11
chrismcc]$ /sbin/lsmod
Module Size Used by Not tainted
ipt_REJECT 3744 0 (autoclean)
ipt_state 1248 0 (autoclean)
ip_conntrack 22188 1 (autoclean) [ipt_state]
iptable_filter 2464 0 (autoclean)
ip_tables 14304 3 [ipt_REJECT ipt_state iptable_filter]
tg3 47008 1
ext3 67392 3
jbd 51528 3 [ext3]
aic7xxx 129568 4
sd_mod 12832 8
scsi_mod 108048 2 [aic7xxx sd_mod]
I suspect I just haven't tickled it right.
Dell has been testing the "legolas4" kernel and drivers in association with our testing of the public beta known as Phoebe. No tests have yet induced a failure. nttcp tests ran Friday through Monday with no problems. NFS and Samba tests were interrupted by a power outage yesterday, restarted today - we'll provide an update when we can. Dell has performed additional testing using samba and nfs with the "legolas4" test kernel. We have been unable to induce a failure with any of our tests, where with previous kernels and tg3 driver versions we have been able to induce failure. This is an excellent sign that the issues so far uncovered have been fixed. Hi all, I started testing of Redhat 7.3 kernel-smp-2.4.18-24.80.legolas4.i686 on Friday 28 Feb 19:30 and at the time of writing Monday 3rd March 17:55 there have been no errors. I have only been testing NFS connections as I have always been able to reproduce the TG3 issue doing just that. NFS server is Redhat 8 & 2.4.18-24.7.xsmp [FYI] So far there have been zero errors reported to dmesg and none of the previous black screen and crashing problems. I shall continue testing this all week. If requested I can add more load / services. Regards CW another success here with legolas4 Dell 2650 web server chrismcc]$ uptime 10:18am up 4 days, 23:20, 1 user, load average: 1.17, 1.06, 1.06 chrismcc]$ uname -rm 2.4.18-24.7x.legolas4smp i686 5 days with no problems ( google hit us hard over the weekend so it got a good workout ) Fixed in 2.4.18-26 errata kernel, just released. To prevent further "pile-on" of unrelated tg3 issues, please open a new bug against this errata kernel, if other issues develop. |