Bug 79997
Summary: | Broadcom tg3 driver hangs dual 2.4GHz xeon server
---|---
Product: | [Retired] Red Hat Linux
Reporter: | Rich Holley <rdh>
Component: | kernel
Assignee: | Jeff Garzik <jgarzik>
Status: | CLOSED ERRATA
QA Contact: | Brian Brock <bbrock>
Severity: | high
Priority: | medium
Version: | 8.0
CC: | amit_bhutani, gabor.kondorosi, gary.mansell, huiz, jefferson.ogata, kevin.kling, michael_brock, nreilly, pcfe, peterm, pizzof, signal, sopko, vkarasik
Hardware: | i686
OS: | Linux
Doc Type: | Bug Fix
Last Closed: | 2003-03-04 20:15:04 UTC
Bug Depends On: | 69920
Description
Rich Holley
2002-12-18 16:31:14 UTC
At the suggestion of some folks familiar with this problem, I changed the default driver in /etc/modules.conf from "tg3" to "bcm5700". Using a simple TCP socket test, I was able to lock up all 8 servers (at different times) in less than 1 hour. So apparently the problem resides either in BOTH the "tg3" and "bcm5700" drivers, or perhaps there is some fundamental hardware flaw in the BCM5701. "/sbin/lspci -v" reports the PCI device as: Broadcom Corporation NetXTreme BCM5701 Gigabit Ethernet (rev 15).

Basically the same story, with some additional notes: two Dell PE2650 systems, Broadcom on-board copper gigabit cards (bcm5701 chipset), 2.4.18-18.7.xsmp kernels. The servers freeze within anywhere from a couple of minutes to a couple of hours, usually with no trace of any kernel error message in syslog. There has been only one occasion when kernel messages preceded the freeze. The corresponding log entries are attached.

In contrast to Rich's experience, switching to the bcm5700 driver did help. I have been using the Broadcom driver (ver. 2.2.26, included in 2.4.18-18.7.xsmp) for almost two weeks without a single incident. By the way, for reasons unknown to me, Dell recommends the bcm5700 driver (see the Dell support download site for the Dell PE2650s). The driver that can be downloaded from there is v2.2.22, apparently a bit obsolete.

The source of these drivers is a mystery to me. I was not able to find them on the Broadcom website. They are not included in the stock 2.4.18, 19, 20 ... kernels. Some Debian Linux sites have v2.2.30. http://packages.qa.debian.org/b/bcm5700.html http://packages.debian.org/unstable/misc/bcm5700-source.html It seems other people have the same problem of not being able to find the Broadcom driver source: http://www.uwsg.iu.edu/hypermail/linux/kernel/0210.3/0107.html Interestingly enough, in that thread Jeff responds and claims that the bcm5700 driver is buggy and "tg3 is the way to go".
For some notes on the comparison of the tg3 and bcm5700 drivers, you may want to see: http://lists.us.dell.com/pipermail/linux-poweredge/2002-November/004995.html

Regarding the tg3 driver patch by Jeff: it seems there are two different versions. The one Jeff regularly refers to is v1.2 (v1.20). http://people.redhat.com/jgarzik/tg3/tg3-1.2/ This is the one that went into the stock 2.4.20 kernel as well. Nevertheless, there is another tree: http://people.redhat.com/jgarzik/tg3/tg3-1.2txlock/ (a.k.a. v1.21), and David Morse from Dell claims on the Bug#69920 page that he tested THIS successfully on a Dell PE2650. What is the difference between them, and which one is supposed to fix the problem?

Also, while running 'xosview', I noticed that the IRQs corresponding to the Broadcom cards are constantly on, regardless of the volume of the traffic or whether there is any traffic at all. This behaviour is the same regardless of the driver used (tg3 or bcm5700). All 10/100 cards I have ever seen, and the Intel copper gigabit cards (with the e1000 driver), interrupt the CPU only intermittently, when they actually need to. Is this behaviour normal for the Broadcom cards/drivers?

Gabor

Created attachment 88809
kernel messages in syslog on a PE2650 before it froze
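One way to put a number on the xosview observation above (IRQ lines for the Broadcom cards appearing constantly active, even with no traffic) would be to compare two readings of /proc/interrupts taken a known interval apart. The following is only a minimal sketch under that assumption, not something posted in this bug: the function names `irq_totals` and `irq_rates` are hypothetical, and it assumes the usual layout of a header row of CPU names followed by one row per IRQ.

```python
def irq_totals(snapshot):
    """Sum the per-CPU counters of every numbered IRQ row.

    `snapshot` is the text of /proc/interrupts. Rows whose label is not
    a number (NMI, LOC, ERR, MIS) are skipped.
    """
    lines = snapshot.strip().splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue
        counts = rest.split()[:ncpu]      # one counter per CPU
        totals[label] = sum(int(c) for c in counts)
    return totals


def irq_rates(before, after, seconds):
    """Interrupts per second for each IRQ present in both snapshots."""
    b = irq_totals(before)
    a = irq_totals(after)
    return {irq: (a[irq] - b[irq]) / seconds for irq in a if irq in b}
```

On a live system the two snapshots would come from reading /proc/interrupts, sleeping for the chosen interval, and reading it again. A rate on the NIC's IRQ line that stays high while the link is idle would support the observation; a rate that tracks traffic would not.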
I'm not convinced the problem is exclusive to the tg3 driver. With the SuperMicro P4DL6, I discovered the same lock-up using only the Intel 10/100 port and the eepro100 driver after several days of intense network loading. For this round of tests, I disabled the Broadcom GBE ports and did not even load the "tg3" module. Perhaps the faster GBE Broadcom ports simply get to the problem quicker. Symptoms are the same as before. I'd send the kernel oops, but I never seem to get one. The machines just lock hard. Even the motherboard "reset" buttons don't work (power off/on is the only solution). It is interesting that the Serverworks chipset on the P4DL6 is identical to the Dell 2650 chipset. I have verified that the memory, CPUs, and power supplies are all top-notch. It seems unlikely that 8 servers built at different times would all have bad hardware. Just for completeness, I updated the motherboard BIOS on all servers to the latest release before starting the last round of tests.

Current setup is: dual Xeon 2.4GHz CPUs on a SuperMicro P4DL6 based server, Red Hat 8.0 with the 2.4.18-19.8.0smp kernel.

I also downloaded the latest 2.4.20 and 2.4.20-ac2 kernels and tried these on a pair of the servers, connecting the GBE ports with a crossover cable. Both machines eventually lock up, but only 1 at a time (in other words, once the network traffic between the two servers stops, the remaining "live" server seems o.k.). Is there an interrupt/thread-safety issue relating to the 2.4.20 kernel/drivers or the GCC 3.2 compiler? I will attempt to recompile the kernel.org kernels with GCC 2.96 and see if there's a difference.

Just an update. I have now been running 3 servers for over 24 hours without a lockup using the 2.4.18-19.8.0smp kernel with the "noapic" option. If I remove the "noapic" kernel option, at least one of the three servers will lock up under heavy network traffic in under 1 hour. I noticed a comment in the "dmesg" output that the "APIC table appears buggy...".
This message appears on every one of the servers. So I decided to try Karl's "noapic" solution. It is interesting that this kernel option does nothing to prevent the lockup on kernels prior to 2.4.18-19.8.0smp. The only difference I see is that the interrupt assignments for eth0 and eth1 are now shared with other devices on the motherboard within the (0-15) range. Without the "noapic" option, each ethernet device is assigned a unique interrupt. Does anyone have information on why the "APIC" tables on the SuperMicro P4DL6 / Serverworks chipset might be buggy?

Just my 2 cents: I have the same problem with a Proliant ML370G3 [2 x Xeon 2.4GHz] with RH7.3. The 2.4.18-19.7.xsmp kernel from RH updates crashes every 5-10 hours. I see nothing in syslog.

Another update. Two of the three servers are now hung again. The remaining server appears to be functioning fine. This makes sense because when two of the three servers die, the third server is no longer receiving or responding to any other network traffic... Looks like the "noapic" solution buys me a couple more days of stability, but the end result is still the same. No kernel oops to send, just a hard lock. It is interesting that both machines simply froze this time. Without the "noapic" option, the servers typically freeze up and trigger the server alarm. With the "noapic" switch they just die quietly.

It is very interesting that the Proliant ML370G3 also experiences this problem. Now we have Dell 2650s, Proliant ML370G3s, and SuperMicro P4DL6 servers locking up after some random period of network traffic. While the Dell 2650 and SuperMicro P4DL6 have Broadcom ethernet, the ML370G3 has an NC7781 ethernet. (As previously mentioned, I also get the lockup using the Intel ethernet port.) However, all the servers use the same ServerWorks GC-HE/LE chipset with dual 2.X GHz Xeons. Does anyone else get a message indicating a "buggy APIC" in the /var/log/dmesg file after booting with the 2.4.18-19.X.Xsmp kernel?
It sure seems like the lockup is somehow related to bad interrupt handling or missed interrupts. Is there any information I can provide that might assist someone in making sense of this? (Other than the kernel oops message, which doesn't exist...)

To all still experiencing problems:
1) Please boot with "noapic" on the kernel command line. You can run "cat /proc/cmdline" to check for sure.
2) I have posted some new rpms for testing, based on the latest errata. The latest production tg3 release, 1.2a, built into unofficial rpms: http://people.redhat.com/jgarzik/tg3/tg3-1.2a/rpms/ but I would like people to test my experiment, which should provide additional stability: http://people.redhat.com/jgarzik/tg3/tg3-1.2a/exp1-rpms/ ...and if that doesn't work for people, fall back to experiment 2: http://people.redhat.com/jgarzik/tg3/tg3-1.2a/exp2-rpms/
Feedback requested!

On several systems, there is evidence that the lock-ups are not directly related to the driver but more to the system board. So please make sure to attach 'dmesg' and 'lspci -vvv' output in future bug reports.

Ok, some of these reports have actually been fixed in more recently posted rpms. Just to get everybody on the latest page, please use the "aragorn2" test rpms, posted at http://people.redhat.com/jgarzik/pub/ This is the latest Red Hat errata kernel for 7.x/8.x, with the recent tg3 bug fixes.

Ladies and gentlemen, I have received permission to post the latest release candidate of Red Hat's errata kernel. It contains not only fixes for the e1000 and tg3 net drivers, but also system-level fixes which may address the problems users on this list were seeing. This kernel is currently in Red Hat Q/A, and has NOT yet been "qualified" as official, nor has it been released.
Errata kernel 21 release candidate, for Red Hat 8.0: http://people.redhat.com/jgarzik/pub/2.4.18-21.8.0/
Errata kernel 21 release candidate, for Red Hat 7.x: http://people.redhat.com/jgarzik/pub/2.4.18-21.7.x/
It is requested that people who were seeing crash problems test this kernel, as this will be the next official Red Hat errata kernel after it passes Q/A.

I installed this latest kernel (2.4.18-21.8.0) on a 4-way Dell 6600 and the system still freezes.

I've been out of the loop for a few days. I loaded the 2.4.18-21.8.0 kernel on two of the SuperMicro P4DL6 servers yesterday and started some torture tests. Both machines are still functioning fine, but I'll wait a few more days before declaring victory...

I've been following this since late December. I have had the same problem on an IBM xSeries 235 dual Xeon 2.2 with both the onboard Broadcom NIC and a 3COM 1GB NIC which also uses the Broadcom chip. The last fix posted for 7.3 (Errata kernel 21 release candidate, for Red Hat 7.x) seemed to be a big improvement, but I had a hang after about 3 days of use; prior to that it would happen every few hours. I can't confirm that it was the source of the hang, since there is no message and no log entry for it, but that follows the symptom I have been having. The machine will run for more than a week on a 3com 10/100 card without any problems. I'm still in the testing phase so it's not a big deal, but it would be nice to have this resolved.

Just an update. The servers have now been up 6 days under 100% loading of the network ports, disks, and CPUs with hyper-threading enabled and without using the "noapic" option. Things seem VERY stable, and if this goes for another couple of days I'll probably suggest closing the bug. I'm using the 2.4.18-21 kernel and have disabled ACPI, APM, and PnP in the BIOS. Also, I had a chance to smoke test an E7505 chipset server for 7 days. It had dual Intel E1000 ports and never locked up under the 2.4.18-21 kernel. I think we may have a winner...
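Several reports in this thread depend on whether a machine was really booted with "noapic", which the maintainer suggests confirming with "cat /proc/cmdline". As a minimal sketch (not part of the original report), the same check could be scripted in Python. The function names and the sample command lines in the usage note are hypothetical; only /proc/cmdline and the "noapic" option come from the thread.

```python
def has_boot_option(cmdline, option):
    """Return True if `option` is present as a whole token.

    The kernel command line is whitespace-separated, so comparing whole
    tokens avoids matching a longer option that merely contains the text.
    """
    return option in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()
```

For example, `has_boot_option("ro root=/dev/sda2 noapic", "noapic")` is true, while the same call on `"ro root=/dev/sda2"` is false. On a running Linux system, `has_boot_option(read_cmdline(), "noapic")` performs the check against the live boot parameters, which may be worth recording alongside the 'dmesg' and 'lspci -vvv' output requested above.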
I can confirm that the latest production released Red Hat kernel, 2.4.18-24.7.xsmp, does not fix the problem. My PE2650 crashed in the usual manner after about 5 hours of normal (minimal) activity. I am concerned that the bcm5700 modules (the only workaround) do not exist in /lib/modules for this new kernel; it would appear that they have been deprecated. This is unacceptable to me, as my machine has run for two months on these modules perfectly fine. Hence I cannot run the latest kernel and have had to revert my machine to the 2.4.18-18.7.xsmp kernel with the bcm5700 kernel module. I also have a call (ref #222224) logged with Red Hat's Patrick Ernzer (pernzer), who has been working with Dell UK on trying to resolve this issue for me for the last 4 months.

I have been reading bugs 75680 and 78059 as well as this bug. I hope I am adding to the correct bug id; if not, someone please let me know. We are running a Red Hat 7.3 system. Our system is a Supermicro X5DAE dual Xeon 2.4GHz (Intel E7505 chipset). It has the Intel E1000 network card built in, and a PCI 3C59x card installed. The box uses 3ware Escalade 7500 series IDE RAID controllers, which I have had great success with on other Red Hat boxes we run. Prior to this motherboard/CPU/network card, this box was running an ASUS A7M266-D with dual 1.8GHz Athlons and two 3C59x cards, and the box was stable at that point. The box currently has Hyper-Threading disabled, and is NOT running with "noapic", but rather with no special kernel options passed to it other than setting the ramdisk size to 512000 (the system has 2GB of memory). The problem is there with or without the ramdisk, and with or without Hyper-Threading. I have not tried "noapic"; I am hoping to share the same success Rich Holley has had. The box is a mail server we are trying to bring into production to replace our current server. It has relatively light loading (just beta testers) compared to what it's in store for.
The box will only stay up about 12-20 hours on average, with 48 hours being the record, I believe. We tried Jeff Garzik's recommendation in earlier threads to try the linux-2.4.18-18.7.x kernel which he had in his webspace at Red Hat. That did not seem to do it for us. We are now on linux-2.4.18-24.7.x, which contains the version 4.4.x e1000 driver, and have been up since last night... only time will tell. In the meantime, if any information is needed from me, please let me know and I will get that to you.

Under 2.4.18-21.8.0 all servers have now been up continuously under very heavy loading for over 10 days. I'd like to close this bug, but I see other people are still having similar problems with the officially released 2.4.18-24 kernel. Are the lock-up problems really fixed, or have I simply found a magic combination of kernel/BIOS settings/hardware that is stable? I plan on updating all servers to the 2.4.18-24 kernels and smoke testing for a few more days to make sure the good behavior sticks, then I'll close this bug (unless someone else with a similar configuration is still having problems).

I have 3 Dell 2650 systems: 2 single-processor, 1 multi-processor. All 3 hung in a short time with the 2.4.18-19 kernel. I upgraded to the -24 kernel and the systems have not hung for several days. I am running with the -24.7.xsmp kernel and have hyper-threading enabled in the BIOS. I set the "noapic" option on 2 of the systems. BTW, it is difficult to find much info on the apic/noapic option. I did find that APIC stands for "Advanced Programmable Interrupt Controller". It is far from clear what impact this option has on a system. 2 of these servers are to replace older web servers. I cannot release these until this gets straightened out.

Machine crashed within 20 hours of running 2.4.18-24. I have tried -24 both with and without my E1000 loaded (I used an Intel Pro/100 card instead, and/or a 3Com 3C59x). This is the SuperMicro X5DAE.
Finally I tried SMP with "noapic", and I have been running for a record 2 1/2 days on this setup. My /proc/interrupts shows as follows:

           CPU0       CPU1
  0:   87463017          0    XT-PIC  timer
  1:          7          0    XT-PIC  keyboard
  2:          0          0    XT-PIC  cascade
  8:          3          0    XT-PIC  rtc
 11:  164225889          0    XT-PIC  eth0, eth1
 12:   64568972          0    XT-PIC  3ware Storage Controller, 3ware Storage Controller
 15:          0          0    XT-PIC  ide1
NMI:          0          0
LOC:   87470320   87470319
ERR:          0
MIS:          0

Is it normal to only show interrupts on the first CPU? Am I really using both CPUs then with noapic? I ask because "top" still shows the second CPU working on some of the load. I have plans to possibly replace this board with an Intel SE7501BR2, which is a Red Hat certified board. I would like to try anything someone suggests. My next plan is to swap out the memory or throw that new motherboard/CPU/memory combo online.

This is resolved, please test:
http://people.redhat.com/jgarzik/pub/legolas4-7.x/ (Red Hat 7.x)
http://people.redhat.com/jgarzik/pub/legolas4-8.0/ (Red Hat 8.0)

Should we try this test kernel (http://people.redhat.com/jgarzik/pub/legolas4-8.0/) or the official errata release?
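The /proc/interrupts listing quoted in the "noapic" report above can be summarized mechanically, which makes the two things being asked about easier to see: how the numbered IRQs are spread across CPUs, and which IRQ lines are shared between devices. This is only an illustrative sketch; the function names are hypothetical, the counter values are copied from that comment, and the column layout is assumed to be the conventional one since the thread shows the listing flattened.

```python
SAMPLE = """\
           CPU0       CPU1
  0:   87463017          0    XT-PIC  timer
  1:          7          0    XT-PIC  keyboard
  2:          0          0    XT-PIC  cascade
  8:          3          0    XT-PIC  rtc
 11:  164225889          0    XT-PIC  eth0, eth1
 12:   64568972          0    XT-PIC  3ware Storage Controller, 3ware Storage Controller
 15:          0          0    XT-PIC  ide1
NMI:          0          0
LOC:   87470320   87470319
ERR:          0
MIS:          0
"""


def parse_interrupts(text):
    """Return (cpu_names, rows) for the numbered IRQ rows only.

    Each row is (irq, per_cpu_counts, controller, device_names).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    cpus = lines[0].split()
    rows = []
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():            # NMI, LOC, ERR, MIS
            continue
        fields = rest.split()
        counts = [int(x) for x in fields[:len(cpus)]]
        controller = fields[len(cpus)]
        names = " ".join(fields[len(cpus) + 1:]).split(",")
        devices = [n.strip() for n in names if n.strip()]
        rows.append((int(label), counts, controller, devices))
    return cpus, rows


def per_cpu_totals(cpus, rows):
    """Total interrupts handled by each CPU across numbered IRQs."""
    return {cpu: sum(r[1][i] for r in rows) for i, cpu in enumerate(cpus)}


def shared_irqs(rows):
    """IRQ lines that have more than one device attached."""
    return {irq: devs for irq, _, _, devs in rows if len(devs) > 1}
```

Applied to `SAMPLE`, `per_cpu_totals` reports every numbered IRQ as counted on CPU0, and `shared_irqs` picks out IRQ 11 (eth0, eth1) and IRQ 12 (the two 3ware controllers). The LOC row, which counts local timer interrupts, still shows both CPUs ticking.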