From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

Description of problem:
The kernel shipped with Red Hat 8 (2.4.18-14) has problems on my Dell PowerEdge 4600 server, a dual Xeon (2.2 GHz) with 2 GB of RAM. After about 3 hours, the machine hard locks: no keyboard, no mouse, no response to a network ping. A power cycle is required. Nothing in the kernel log explains why this happened.

I compiled a 2.4.19 kernel from kernel.org source, and the machine seems to run fine (it has been running for 4 days now). So it seems to be a problem with the stock RH 8 kernel.

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes

Steps to Reproduce:
Install Red Hat 8 on a machine like this, and wait. Once it crashed when I wasn't even typing anything: I just wheeled my chair over to use it and it was locked up.

Additional info:
I can confirm this also happens with a brand new IBM xSeries 335 / Intel Xeon 2.0 GHz / 1 GB RAM, running RH 8.0, kernel 2.4.18-18.8.0. I installed the server about two weeks ago. With the uni-processor kernel it ran flawlessly for 10 days without a crash. This Monday I rebooted into the SMP kernel and was able to bring the server down with a stupid trick like this:

    ab -k -n 1000000 -c 100000 http://127.0.0.1:80/

As expected, after a while the server reported "Too many open files", but a couple of seconds later the kernel oopsed, telling something about cpu0. I was able to reproduce this three times in a row and I think I could reproduce it again at any time. With the uni-processor kernel and the same test, only the "Too many open files" message appears and the kernel itself survives.

Well, since the above test is stupid anyway, I thought that maybe the SMP kernel would run just fine in our normal use. Nope. The best uptime the server has had with the SMP-enabled kernel is around 30 hours; it can freeze completely even when totally idle.

The server is not in production use, so I can run any tests you may need. I will also provide the kernel oops message or whatever logs are needed here.
What network cards are in use?
Broadcom Corporation NetXtreme BCM5703X Gigabit Ethernet (rev 02), using tg3 kernel module.
Whoops, forgot to tell you that the server itself has only one physical CPU, but since the Xeon is Hyper-Threading capable I decided to give the SMP kernel a try.
My Dell server also uses a Broadcom NetXtreme BCM5700 network card (10/100/1000) along with an Intel 10/100 (82559).

Mark
Further investigation revealed that our server's SCSI controller (LSI Logic / Symbios Logic 53c1030) is using IRQ 9. Another of our servers has an IBM ServeRAID controller, and its manual states that Linux has some issues with SMP + IRQ 9 devices. I don't know whether that is ServeRAID-specific or not, though...
Can you confirm that the latest rawhide kernel fixes this bug? Or confirm that the following unofficial rpms, based on the latest errata kernel, fix the problem? http://people.redhat.com/jgarzik/tg3/tg3-1.2/rpms/
I just downloaded and installed kernel-smp-2.4.18-19.7.tg3.120.i686.rpm from people.redhat.com.
---
Linux xxx 2.4.18-19.7.tg3.120smp #1 SMP Mon Nov 25 15:33:06 EST 2002 i686 i686 i386 GNU/Linux
---
At least something has changed, since /proc/cpuinfo is now a bit different than it used to be. Previous kernels showed something like
---
processor             : 0
vendor_id             : GenuineIntel
cpu family            : 15
model                 : 2
model name            : Intel(R) XEON(TM) CPU 2.00GHz
stepping              : 4
cpu MHz               : 1993.759
cache size            : 512 KB
Physical processor ID : 51941323214
Number of siblings    : 2
<cut>
bogomips              : 3953.14

processor             : 1
vendor_id             : GenuineIntel
cpu family            : 15
model                 : 2
model name            : Intel(R) XEON(TM) CPU 2.00GHz
stepping              : 4
cpu MHz               : 1993.759
cache size            : 512 KB
Physical processor ID : 51941323214
Number of siblings    : 2
<cut>
bogomips              : 3986.55
---
But this new one tells me
---
processor             : 0
vendor_id             : GenuineIntel
cpu family            : 15
model                 : 2
model name            : Intel(R) XEON(TM) CPU 2.00GHz
stepping              : 4
cpu MHz               : 1993.759
cache size            : 512 KB
Physical processor ID : 0
Number of siblings    : 2
<cut>
bogomips              : 3953.14

processor             : 1
vendor_id             : GenuineIntel
cpu family            : 15
model                 : 2
model name            : Intel(R) XEON(TM) CPU 2.00GHz
stepping              : 4
cpu MHz               : 1993.759
cache size            : 512 KB
Physical processor ID : 0
Number of siblings    : 2
<cut>
bogomips              : 3986.55
---
The difference is the Physical processor ID. In the previous kernels that number was always insanely high, and I think it could even change between my checks. Now the "0" seems much more plausible.

I'll let you know whatever happens with this kernel. If it doesn't crash during the next two days, I'll be back here on Thursday; if it does crash, I'll be back sooner... Thanks for your help and keep up the good work. :-)
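The before/after comparison of /proc/cpuinfo can also be done mechanically. The following is a minimal Python sketch, not part of the original report. The function names, the trimmed sample strings, and the threshold of 255 for a "plausible" ID are illustrative assumptions; only the field name "Physical processor ID" and the quoted values come from the output above.

```python
# Field name exactly as printed by the kernels quoted in this report.
FIELD = "Physical processor ID"

def parse_cpuinfo(text):
    """Split /proc/cpuinfo-style text into one dict per logical processor."""
    cpus = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines and markers such as <cut>
        key, value = key.strip(), value.strip()
        if key == "processor":
            cpus.append({})  # each "processor" line starts a new entry
        if cpus:
            cpus[-1][key] = value
    return cpus

def implausible_ids(cpus, limit=255):
    """Return processor numbers whose physical ID exceeds `limit`.

    The limit is an assumption chosen for illustration, not a kernel rule.
    """
    bad = []
    for cpu in cpus:
        if FIELD in cpu and int(cpu[FIELD]) > limit:
            bad.append(int(cpu["processor"]))
    return sorted(bad)

# Trimmed samples built from the values quoted above.
OLD_SAMPLE = """\
processor : 0
Physical processor ID : 51941323214
Number of siblings : 2
processor : 1
Physical processor ID : 51941323214
Number of siblings : 2
"""

NEW_SAMPLE = """\
processor : 0
Physical processor ID : 0
Number of siblings : 2
processor : 1
Physical processor ID : 0
Number of siblings : 2
"""

old_bad = implausible_ids(parse_cpuinfo(OLD_SAMPLE))
new_bad = implausible_ids(parse_cpuinfo(NEW_SAMPLE))
```

On a live system the same functions could be fed the contents of /proc/cpuinfo. Other kernel versions may use different field names, so FIELD would need adjusting there.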
So I did come back earlier, but only to report this one:
---
           CPU0       CPU1
  0:     178771     178805    IO-APIC-edge   timer
  1:          1          2    IO-APIC-edge   keyboard
  2:          0          0          XT-PIC   cascade
  8:          1          0    IO-APIC-edge   rtc
 11:          0          0   IO-APIC-level   usb-ohci
 15:          1          1    IO-APIC-edge   ide1
 22:       4295       3299   IO-APIC-level   ioc0
 24:       3907       3865   IO-APIC-level   eth0
NMI:          0          0
LOC:     357430     357442
ERR:          0
MIS:          0
---
No more IRQ 9. :-)
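Checking which devices sit on which IRQ can be scripted against the same table. The following is a minimal Python sketch under stated assumptions: the function name and the returned dictionary layout are hypothetical, the sample is a subset of the rows quoted above, and the parser assumes the header line lists one CPUn column per CPU.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {irq: info} for numeric IRQs.

    Summary rows such as NMI, LOC, ERR and MIS are skipped.
    """
    lines = [l for l in text.splitlines() if l.strip()]
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        tokens = line.split()
        label = tokens[0].rstrip(":")
        if not label.isdigit():
            continue
        counts = [int(t) for t in tokens[1:1 + ncpu]]
        rest = tokens[1 + ncpu:]
        table[int(label)] = {
            "counts": counts,
            "controller": rest[0] if rest else "",
            "devices": " ".join(rest[1:]),
        }
    return table

# Subset of the rows quoted above.
SAMPLE = """\
           CPU0       CPU1
  0:     178771     178805    IO-APIC-edge   timer
  2:          0          0          XT-PIC   cascade
 22:       4295       3299   IO-APIC-level   ioc0
 24:       3907       3865   IO-APIC-level   eth0
NMI:          0          0
ERR:          0
"""

irqs = parse_interrupts(SAMPLE)
irq9_in_use = 9 in irqs
```

With the table above, `irq9_in_use` is false, and the SCSI controller (ioc0) and the network card (eth0) appear on IRQ 22 and IRQ 24 respectively.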
It seems that kernel 2.4.18-19.7.tg3.120smp fixed the problem for me:
---
2:46pm up 2 days, 5:02, 1 user, load average: 0.03, 0.07, 0.03
---
Also, my previously mentioned ab test no longer crashes the kernel.

jgarzik: Was this all about IRQ 9 or did you fix something else?
Looks good! Last night I started a stress test (http://weather.ou.edu/~apw/projects/stress/) on the server and left it running all night:

    nice -n 10 stress -c 750 -i 4 --verbose

bonnie++ was also torturing the disks all night. The result? This morning the server was still up and the stress test was still running fine.

    9:54am up 3 days, 10 min, 1 user, load average: 754.44, 754.68, 754.68

I believe this case is over for me. Mark, how's your server?
I haven't tried the patched kernel yet - I've been running a vanilla 2.4.20 kernel, and the machine has been up for over two weeks... Mark
To all still experiencing problems,

1) Please boot with "noapic" on the kernel command line. You can run "cat /proc/cmdline" to check for sure.

2) I have posted some new rpms for testing, based on the latest errata.

Latest production tg3 release, 1.2a, built into unofficial rpms:
http://people.redhat.com/jgarzik/tg3/tg3-1.2a/rpms/

But I would like people to test my experiment, which should provide additional stability:
http://people.redhat.com/jgarzik/tg3/tg3-1.2a/exp1-rpms/

...and if that doesn't work for people, fall back to experiment 2:
http://people.redhat.com/jgarzik/tg3/tg3-1.2a/exp2-rpms/

Feedback requested!

On several systems, there is evidence that the lock-ups are not directly related to the driver but more to the system board. So please make sure to attach 'dmesg' and 'lspci -vvv' output in future bug reports.
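The "cat /proc/cmdline" check in step 1 can be scripted when several machines need verifying. The following is a minimal Python sketch; the function names and the example command lines are hypothetical and not taken from any system in this report.

```python
def has_boot_option(cmdline, option="noapic"):
    """Return True if `option` appears as a whole word on the command line.

    Splitting on whitespace avoids matching longer options that merely
    start with the same characters.
    """
    return option in cmdline.split()

def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read()

# Hypothetical command lines, for illustration only.
with_noapic = has_boot_option("ro root=LABEL=/ noapic")
without_noapic = has_boot_option("ro root=LABEL=/")
```

On the machine being checked, `has_boot_option(read_cmdline())` gives the same answer as inspecting the output of "cat /proc/cmdline" by eye.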
Ok, some of these reports have actually been fixed in more recently posted rpms. Just to get everybody on the latest page, please use "aragorn2" test rpms, posted at http://people.redhat.com/jgarzik/pub/ This is the latest Red Hat errata kernel for 7.x/8.x, with the recent tg3 bug fixes.
Ladies and gentlemen,

I have received permission to post the latest release candidate of Red Hat's errata kernel. It contains not only fixes for e1000 and tg3 net drivers, but also system-level fixes which may address the problems users on this list were seeing. This kernel is currently in Red Hat Q/A, and has NOT yet been "qualified" as official, nor has it been released.

Errata kernel 21 release candidate, for Red Hat 8.0:
http://people.redhat.com/jgarzik/pub/2.4.18-21.8.0/

Errata kernel 21 release candidate, for Red Hat 7.x:
http://people.redhat.com/jgarzik/pub/2.4.18-21.7.x/

It is requested that people who were seeing crash problems test this kernel, as this will be the next official Red Hat errata kernel, after it passes Q/A.