Bug 79997

Summary:

Broadcom tg3 driver hangs dual 2.4GHz xeon server

Product:

[Retired] Red Hat Linux

Reporter:

Rich Holley <rdh>

Component:

kernel

Assignee:

Jeff Garzik <jgarzik>

Status:

CLOSED ERRATA

QA Contact:

Brian Brock <bbrock>

Severity:

high

Priority:

medium

Version:

8.0

CC:

amit_bhutani, gabor.kondorosi, gary.mansell, huiz, jefferson.ogata, kevin.kling, michael_brock, nreilly, pcfe, peterm, pizzof, signal, sopko, vkarasik

Hardware:

i686

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2003-03-04 20:15:04 UTC

Bug Depends On:

69920

Attachments:

Description	Flags
kernel messages in syslog on a PE2650 before it froze	none

Description Rich Holley 2002-12-18 16:31:14 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.79 [en] (X11; U; Linux 2.4.18-4smp i686)

Description of problem:
After installing both Red Hat 7.3 and 8.0 with all patches
and running the 2.4.18-18.7.x / 2.4.18-18.8.0 kernels, the
servers (SuperMicro P4DL6 based) lock up when using the 
Broadcom GBE ports. The motherboards have a Broadcom 10/100/1000
copper port and an Intel 10/100 port. If only the Intel 10/100 port is used (and
the tg3 module NOT loaded), the servers run continuously
for days. Typical lock-ups freeze the machines within 0-10 hours
of operation. Network traffic is low. 

I have tried using the "noapic" kernel option as suggested in
BUG 69920 and have tried the new "beta" tg3 driver with no
success. There is definitely some bug in the tg3 driver. I have
reproduced this bug on 8 servers now.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Boot with Red Hat 2.4.18-18.7.x / 2.4.18-18.8.0 on a P4DL6 motherboard
2. Wait for system to freeze

Actual Results:  All servers eventually lock up

Additional info:

Comment 1 Rich Holley 2002-12-18 21:12:27 UTC

At the suggestion of some folks familiar with this problem, I changed
the driver configured in /etc/modules.conf from "tg3" to "bcm5700". Using
a simple TCP socket test, I was able to lock up all 8 servers (at different
times) in less than 1 hour. So apparently the problem resides either in
BOTH the "tg3" and "bcm5700" drivers, or there is some fundamental
hardware flaw in the BCM5701.

Using "/sbin/lspci -v" reports:
	Broadcom Corporation NetXTreme BCM5701 Gigabit Ethernet (rev 15)
for the PCI device.
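
The "simple tcp socket test" mentioned above is not described any further in
this report. As a rough sketch only, a load generator of that kind might look
like the following Python. The function names, the echo design, the payload
size and the loopback demo are all assumptions made for illustration; this is
not the reporter's actual test.

```python
import socket
import threading


def echo_server(listener):
    """Accept one connection and echo every byte back until the peer closes."""
    conn, _ = listener.accept()
    with conn:
        while True:
            data = conn.recv(65536)
            if not data:
                break
            conn.sendall(data)


def run_echo_test(host, port, rounds, payload):
    """Send payload `rounds` times; return the number of bytes echoed back."""
    received = 0
    with socket.create_connection((host, port)) as sock:
        for _ in range(rounds):
            sock.sendall(payload)
            remaining = len(payload)
            while remaining:
                chunk = sock.recv(remaining)
                if not chunk:
                    raise ConnectionError("peer closed the connection early")
                remaining -= len(chunk)
                received += len(chunk)
    return received


def loopback_demo(rounds=100, size=4096):
    """Hypothetical self-test: run server and client in one process on 127.0.0.1."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    worker = threading.Thread(target=echo_server, args=(listener,), daemon=True)
    worker.start()
    try:
        return run_echo_test("127.0.0.1", port, rounds, b"x" * size)
    finally:
        worker.join(timeout=5)
        listener.close()
```

The loopback demo only shows that the sketch works; it does not touch the
network card. To load the Broadcom port, echo_server would have to run on one
machine and run_echo_test on another, pointed at the first machine's address.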

Comment 2 Need Real Name 2002-12-19 03:30:58 UTC

Basically the same story, with some additional notes:

Two Dell PE2650 systems, Broadcom on-board copper gigabit cards (bcm5701
chipset), 2.4.18-18.7.xsmp kernels. The servers freeze within a couple of
minutes to a couple of hours, usually with no trace of any kernel error
message in syslog. There has been only one occasion when kernel messages
preceded the freeze. The corresponding log entries are attached.

 In contrast to Rich's experience, switching to the bcm5700 driver did help. I
have been using the Broadcom driver (ver. 2.2.26, included in 2.4.18-18.7.xsmp)
for almost two weeks without a single incident.
 By the way, for reasons unknown to me, Dell recommends the bcm5700
driver (see the Dell support download site for the Dell PE2650s). The driver
that can be downloaded from there is v2.2.22, apparently a bit obsolete.
 The source of these drivers is a mystery to me. I was not able to find them
on the Broadcom website. They are not included in the stock 2.4.18,19,20 ...
kernels. Some Debian Linux sites have v2.2.30.
http://packages.qa.debian.org/b/bcm5700.html
http://packages.debian.org/unstable/misc/bcm5700-source.html

 It seems other people have the same problem of not being able to find the
Broadcom driver source:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0210.3/0107.html
Interestingly enough, in this thread Jeff responds and claims that the bcm5700
driver is buggy and "tg3 is the way to go".

 For some notes on the comparison of the tg3 and bcm5700 drivers, you may want
to see:
http://lists.us.dell.com/pipermail/linux-poweredge/2002-November/004995.html

 Regarding the tg3 driver patch by Jeff: it seems there are two different 
versions. The one Jeff regularly refers to is v1.2 (v1.20).
http://people.redhat.com/jgarzik/tg3/tg3-1.2/
This is the one that went into the stock 2.4.20 kernel as well.
 Nevertheless, there is another tree:
http://people.redhat.com/jgarzik/tg3/tg3-1.2txlock/ (a.k.a. v1.21)
and David Morse from Dell claims on the Bug#69920 page that he tested THIS
successfully on a Dell PE2650.

What is the difference between them and which one is supposed to fix the 
problem?

Also, while running 'xosview', I noticed that the IRQs corresponding to the
Broadcom cards are constantly active, regardless of the volume of traffic or
whether there is any traffic at all. This behaviour is the same regardless of
the driver used (tg3 or bcm5700). All 10/100 cards I have ever seen and the
Intel copper gigabit cards (with the e1000 driver) interrupt the CPU only
intermittently, when they actually need to. Is this behaviour normal for the
Broadcom cards/drivers?

Gabor
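
One way to put numbers on the "constantly on" observation, rather than
watching xosview, is to save two copies of /proc/interrupts a known number of
seconds apart and compare the counters. The sketch below is an illustration
only: the function names are invented here, and it assumes the usual layout of
a CPU header row followed by "label: count count ..." rows.

```python
def irq_counts(text):
    """Map each IRQ label to its count summed over all CPU columns."""
    lines = [line for line in text.splitlines() if line.strip()]
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "label: ..." row
        # Only the first ncpu fields can be counters; the rest is descriptive.
        counts = [int(f) for f in rest.split()[:ncpu] if f.isdigit()]
        totals[label.strip()] = sum(counts)
    return totals


def irq_rates(before, after, seconds):
    """Interrupts per second for each IRQ between two snapshots."""
    old = irq_counts(before)
    new = irq_counts(after)
    return {irq: (new[irq] - old.get(irq, 0)) / seconds for irq in new}
```

Feeding it two snapshots taken 10 seconds apart on an idle link would show
whether the Broadcom IRQ line really keeps firing with no traffic, and the same
measurement on an e1000 machine would give a baseline for comparison.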

Comment 3 Need Real Name 2002-12-19 03:36:02 UTC

Created attachment 88809 [details]
kernel messages in syslog on a PE2650 before it froze

Comment 4 Rich Holley 2002-12-30 21:07:21 UTC

I'm not convinced the problem is exclusive to the tg3 driver. With the
SuperMicro P4DL6, I discovered the same lock-up using only the Intel 
10/100 port and the eepro100 driver after several days of intense network
loading. For this round of tests, I disabled the Broadcom GBE ports and
did not even load the "tg3" module. Perhaps the faster GBE Broadcom ports 
simply get to the problem quicker. 

Symptoms are the same as before. I'd send the kernel oops, but I
never seem to get one. The machines just lock hard. Even the motherboard
"reset" buttons don't work (poweroff/on is the only solution).
It is interesting that the Serverworks chipset on the P4DL6 is identical 
to the Dell 2650 chipset. I have verified the memory, CPU, and power
supplies are all top-notch. It seems unlikely that 8 servers built at
different times would all have bad hardware. Just for completeness I
updated the motherboard BIOS on all servers to the latest release before
starting the last round of tests.

Current setup is: Dual Xeon 2.4GHz CPUs on SuperMicro P4DL6 based server
                  Red Hat 8.0 with 2.4.18-19.8.0smp kernel

I also downloaded the latest 2.4.20 and 2.4.20-ac2 kernels and tried these
on a pair of the servers connecting the GBE ports with a crossover cable.
Both machines eventually lock up, but only 1 at a time (in other words, once
the network traffic between the two servers stops, the remaining "live"
server seems o.k.). Is there an interrupt/thread-safety issue relating to
the 2.4.20 kernel/drivers or the GCC 3.2 compiler? I will attempt to recompile
the kernel.org kernels with GCC 2.96 and see if there's a difference.

Comment 5 Rich Holley 2002-12-31 18:52:44 UTC

Just an update. I have now been running 3 servers for over 24 hours
without a lockup using the 2.4.18-19.8.0smp kernel with the "noapic"
option. If I remove the "noapic" kernel option, at least one of the
three servers will lock-up under heavy network traffic in under 1 hour.

I noticed a comment in the "dmesg" output that the "APIC table appears 
buggy...". This message appears on every one of the servers. So I decided
to try Karl's "noapic" solution. It is interesting that this kernel 
option does nothing to prevent the lockup on kernels prior to 
2.4.18-19.8.0smp. The only difference I see is that the interrupt 
assignments for eth0 and eth1 are now shared with other devices on
the motherboard within the (0-15) range. Without the "noapic" option,
each ethernet device is assigned a unique interrupt.

Does anyone have information on why the "APIC" tables on the SuperMicro
P4DL6 / Serverworks chipset might be buggy?

Comment 6 Need Real Name 2003-01-01 14:46:01 UTC

just my 2 cents:

I have the same problem with a Proliant ML370G3 [2 x Xeon 2.4GHz]: RH7.3 with
the 2.4.18-19.7.xsmp kernel from RH updates crashes every 5-10 hours.
I see nothing in syslog.

Comment 7 Rich Holley 2003-01-01 22:03:33 UTC

Another update. Two of the three servers are now hung again. The
remaining server appears to be functioning fine. This makes sense
because when two of the three servers die, the third server is no
longer receiving or responding to any other network traffic...

Looks like the "noapic" solution buys me a couple more days of
stability but the end result is still the same. No kernel oops to
send - just a hard lock. It is interesting that both machines
simply froze this time. Without the "noapic" option, the servers
typically freeze up and trigger the server alarm. With the "noapic"
switch they just die quietly.

It is very interesting that the Proliant ML370G3 also experiences
this problem. Now we have Dell 2650's, Proliant ML370G3's, and
SuperMicro P4DL6 servers locking up after some random period of
network traffic. While the Dell 2650 and SuperMicro P4DL6 have
Broadcom ethernet, the ML370G3 has an NC7781 ethernet. (As previously
mentioned, I also get the lockup using the Intel ethernet port).
However, all the servers use the same ServerWorks GC-HE/LE chipset
with dual 2.X GHz Xeons.

Does anyone else get a message indicating a "buggy APIC" in the
/var/log/dmesg file after booting with the 2.4.18-19.X.Xsmp kernel?
It sure seems like the lockup is somehow related to bad interrupt
handling or missed interrupts.

Is there any information I can provide that might assist someone
in making sense of this? (Other than the kernel oops message which
doesn't exist...)

Comment 8 Jeff Garzik 2003-01-02 16:57:11 UTC

To all still experiencing problems,

1) please boot with "noapic" on the kernel command line.  You can run "cat
/proc/cmdline" to check for sure.

2) I have posted some new rpms for testing, based on the latest errata:

latest production tg3 release, 1.2a, built into unofficial rpms:
http://people.redhat.com/jgarzik/tg3/tg3-1.2a/rpms/

but I would like people to test my experiment which should provide additional
stability:
http://people.redhat.com/jgarzik/tg3/tg3-1.2a/exp1-rpms/

...and if that doesn't work for people, fall back to experiment 2:
http://people.redhat.com/jgarzik/tg3/tg3-1.2a/exp2-rpms/

Feedback requested!  On several systems, there is evidence that the lock-ups are
not directly related to the driver but more to the system board.  So please make
sure to attach 'dmesg' and 'lspci -vvv' output in future bug reports.
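
For anyone checking many machines, the "cat /proc/cmdline" check from step 1
can be scripted. The following is a minimal sketch; the function names are
made up for illustration and are not part of any Red Hat tool. It matches
whole words, so an option that merely starts with "noapic" does not count.

```python
def has_kernel_option(cmdline, option):
    """Return True if `option` appears as a whole word on the command line."""
    return option in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()
```

On a running system, has_kernel_option(read_cmdline(), "noapic") reports
whether the option actually took effect at boot.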

Comment 9 Jeff Garzik 2003-01-20 21:05:23 UTC

Ok, some of these reports have actually been fixed in more recently posted rpms.

Just to get everybody on the latest page, please use "aragorn2" test rpms,
posted at http://people.redhat.com/jgarzik/pub/

This is the latest Red Hat errata kernel for 7.x/8.x, with the recent tg3 bug fixes.

Comment 10 Jeff Garzik 2003-01-27 16:14:56 UTC

Ladies and gentlemen,

I have received permission to post the latest release candidate of
Red Hat's errata kernel.  It contains not only fixes for e1000 and tg3 
net drivers, but also system-level fixes which may address the problems 
users on this list were seeing.

This kernel is currently in Red Hat Q/A, and has NOT yet been 
"qualified" as official, nor has it been released.  

Errata kernel 21 release candidate, for Red Hat 8.0:
        http://people.redhat.com/jgarzik/pub/2.4.18-21.8.0/

Errata kernel 21 release candidate, for Red Hat 7.x:
        http://people.redhat.com/jgarzik/pub/2.4.18-21.7.x/

It is requested that people who were seeing crash problems test this 
kernel, as this will be the next official Red Hat errata kernel, after 
it passes Q/A.

Comment 11 Brad Erickson 2003-01-30 19:24:04 UTC

I installed this latest kernel (2.4.18-21.8.0) on a 4-way Dell 6600 and the 
system still freezes.

Comment 12 Rich Holley 2003-01-30 22:09:29 UTC

I've been out of the loop for a few days. I loaded the 2.4.18-21.8.0 kernel
on two of the SuperMicro P4DL6 servers yesterday and started some 
torture tests. Both machines are still functioning fine, but I'll wait
a few more days before declaring victory...

Comment 13 Kevin Kling 2003-01-31 13:30:06 UTC

I've been following this since late December. I have had the same problem on an
IBM Xseries 235 Dual Xeon 2.2 with both the onboard Broadcom NIC and a 3COM 1Gb
NIC, which also uses the Broadcom chip.

The last fix posted for 7.3 (Errata kernel 21 release candidate, for Red Hat
7.x) seemed to be a big improvement, but I had a hang after about 3 days of
use. Prior to that it would happen every few hours.

I can't confirm that this problem was the source of the hang, since there is no
message and no log entry for it, but it matches the symptoms I have been having.

The machine will run for more than a week on a 3Com 10/100 card without any
problems. I'm still in the testing phase so it's not a big deal, but it would
be nice to have this resolved.

Comment 14 Rich Holley 2003-02-04 20:58:45 UTC

Just an update. The servers have now been up 6 days under 100% loading
of the network ports, disks, and CPUs with hyper-threading enabled and
without using the "noapic" option. Things seem VERY stable and if this
goes for another couple of days I'll probably suggest closing the bug.
I'm using the 2.4.18-21 kernel and have disabled ACPI, APM, and PnP
in the BIOS. Also, I had a chance to smoke test an E7505 chipset server
for 7 days. It had dual Intel E1000 ports and never locked up under the
2.4.18-21 kernel. I think we may have a winner...

Comment 15 Gary Mansell 2003-02-05 14:15:37 UTC

I can confirm that the latest production release of the Red Hat kernel,
2.4.18-24.7.xsmp, does not fix the problem. My PE2650 crashed in the usual
manner after about 5 hours of normal (minimal) activity.

I am concerned that the bcm5700 modules (the only workaround) do not exist in
/lib/modules for this new kernel; it would appear that they have been
deprecated. This is unacceptable to me, as my machine has run perfectly well
for two months on these modules. Hence I cannot run the latest kernel and have
had to revert my machine to the 2.4.18-18.7.xsmp kernel with the bcm5700 kernel
module.

I also have a call (ref #222224) logged with Red Hat's Patrick Ernzer
(pernzer), who has been working with Dell UK for the last 4 months on trying
to resolve this issue for me.

Comment 16 Brian Feeny 2003-02-07 18:35:25 UTC

I have been reading bugs 75680, 78059 as well as this bug.  I hope I am adding
to the correct bug id, if not, someone please let me know.

We are running a Red Hat 7.3 system.

Our system is a Supermicro X5DAE dual Xeon 2.4GHz (Intel E7505 chipset).  It has
the Intel E1000 network card built in, and a PCI 3C59x card installed.  The box
uses 3ware Escalade 7500 series IDE RAID controllers, which I have had great
success with on other Red Hat boxes we run.  Prior to this
motherboard/CPU/network card, this box was running an ASUS A7M266-D with dual
1.8GHz Athlons and two 3C59x cards, and the box was stable at that point.

The box currently has Hyper Threading disabled, and is NOT running with
"noapic", but rather running with no special kernel options passed to it other
than setting the ramdisk size to 512000 (the system has 2GB of memory).  The
problem is there with or without the ramdisk, and with or without Hyper
Threading.  I have not tried "noapic", I am hoping to share the same success
Rich Holley has had.

The box is a mail server we are trying to bring into production to replace our
current server.  It has relatively light loading (just beta testers) compared to
what it's in store for.  The box will only stay up about 12-20 hours on average,
with 48 hours being the record, I believe.

We tried Jeff Garzik's recommendation in earlier threads to try the
linux-2.4.18-18.7.x which he had in his webspace at Red Hat.  That did not seem
to do it for us.  We are now on linux-2.4.18-24.7.x, which contains the version
4.4.x e1000 driver, and have been up since last night.  Only time will
tell.  In the meantime, if any information is needed from me, please let me know
and I will get that to you.

Comment 17 Rich Holley 2003-02-10 15:49:26 UTC

Under 2.4.18-21.8.0 all servers have now been up continuously under
very heavy loading for over 10 days. I'd like to close this bug, but
I see other people are still having similar problems with the
officially released 2.4.18-24 kernel. Are the lock-up problems
really fixed, or have I simply found a magic combination of kernel/
BIOS settings/hardware that is stable?

I plan on updating all servers to the 2.4.18-24 kernels and smoke testing
for a few more days to make sure the good behavior sticks, then I'll 
close this bug (unless someone else with a similar configuration is still
having problems).

Comment 18 John Sopko 2003-02-12 15:40:42 UTC

I have 3 Dell 2650 systems: 2 single-processor, 1 multi-processor. All 3 hung
in a short time with the 2.4.18-19 kernel.

I upgraded to the -24 kernel and the systems have not hung for several
days. I am running with the -24.7.xsmp kernel and have hyper-threading enabled
in the BIOS. I set the "noapic" option on 2 of the systems.

BTW it is difficult to find much info on the apic/noapic option. I did find
it stands for "Advanced Programmable Interrupt Controller".  It is far from
clear what impact this option has on a system.

2 of these servers are to replace older web servers. I cannot release these
until this gets straightened out.

Comment 19 Brian Feeny 2003-02-12 22:28:16 UTC

The machine crashed within 20 hours of running 2.4.18-24.  I have tried -24 with
and without my E1000 loaded, however (I used an Intel Pro/100 card instead,
and/or a 3Com 3C59x).  This is the SuperMicro X5DAE.  Finally I tried:

SMP
noapic

and I have been running for a record 2 1/2 days on this setup.  My
/proc/interrupts shows as follows:

           CPU0       CPU1
  0:   87463017          0          XT-PIC  timer
  1:          7          0          XT-PIC  keyboard
  2:          0          0          XT-PIC  cascade
  8:          3          0          XT-PIC  rtc
 11:  164225889          0          XT-PIC  eth0, eth1
 12:   64568972          0          XT-PIC  3ware Storage Controller, 3ware Storage Controller
 15:          0          0          XT-PIC  ide1
NMI:          0          0
LOC:   87470320   87470319
ERR:          0
MIS:          0


Is it normal to only show interrupts on the first CPU?  Am I really using both
CPUs then with noapic?  I ask because "top" still shows the CPU working on some
of the load.
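
To check both questions against a saved copy of /proc/interrupts, a small
parser along the following lines could be used. This is a sketch, not a tool
referenced anywhere in this report: the function names are invented for
illustration, SAMPLE is an abbreviated copy of the output above, and the
parsing assumes this 2.4-era layout (counters, then a controller type such as
XT-PIC, then a comma-separated device list). On a kernel booted with "noapic",
the XT-PIC lines are expected to be counted on CPU0 only, while the LOC (local
timer) line still advances on every CPU.

```python
SAMPLE = """\
           CPU0       CPU1
  0:   87463017          0          XT-PIC  timer
 11:  164225889          0          XT-PIC  eth0, eth1
 12:   64568972          0          XT-PIC  3ware Storage Controller, 3ware Storage Controller
LOC:   87470320   87470319
ERR:          0
"""


def parse_interrupts(text):
    """Parse /proc/interrupts text into (label, per-CPU counts, devices) rows."""
    lines = [line for line in text.splitlines() if line.strip()]
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    rows = []
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "label: ..." row
        fields = rest.split()
        counts = []
        for field in fields:
            if len(counts) < ncpu and field.isdigit():
                counts.append(int(field))
            else:
                break
        tail = fields[len(counts):]
        # tail[0] is the controller type (e.g. XT-PIC); the rest names devices.
        devices = [d.strip() for d in " ".join(tail[1:]).split(",") if d.strip()]
        rows.append((label.strip(), counts, devices))
    return rows


def shared_irqs(rows):
    """Return {irq label: devices} for lines used by more than one device."""
    return {label: devs for label, _, devs in rows if len(devs) > 1}


def per_cpu_device_totals(rows):
    """Sum device interrupt counts per CPU, skipping LOC/NMI/ERR style rows."""
    ncpu = max(len(counts) for _, counts, _ in rows)
    totals = [0] * ncpu
    for _, counts, devices in rows:
        if devices:
            for cpu, count in enumerate(counts):
                totals[cpu] += count
    return totals
```

Calling shared_irqs(parse_interrupts(SAMPLE)) lists IRQ 11 (eth0, eth1) and
IRQ 12 (both 3ware controllers) as shared, and per_cpu_device_totals shows
every device interrupt landing on CPU0.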

I have plans to possibly replace this board with an Intel SE7501BR2, which is a
Red Hat certified board.  I would like to try anything someone suggests.  My
plan next is to swap out the memory or put that new motherboard/CPU/memory
combo online.

Comment 20 Jeff Garzik 2003-02-21 15:55:43 UTC

This is resolved, please test:

http://people.redhat.com/jgarzik/pub/legolas4-7.x/
(red hat 7.x)

http://people.redhat.com/jgarzik/pub/legolas4-8.0/
(red hat 8.0)

Comment 21 hui zhang 2003-04-14 02:35:54 UTC

Should we try this test kernel (http://people.redhat.com/jgarzik/pub/legolas4-8.0/)
or the official errata release?