78059 – [tg3?] Kernel Hangs on Xeon SMP

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	8.0
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Jeff Garzik
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2002-11-18 16:19 UTC by Mark Cuss
Modified:	2013-07-03 02:07 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2003-02-21 16:21:04 UTC
Embargoed:

Description Mark Cuss 2002-11-18 16:19:24 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

Description of problem:
The kernel shipped with Red Hat 8.0 (2.4.18-14) has problems on my Dell PowerEdge 
4600 server, a dual Xeon (2.2 GHz) with 2 GB of RAM.  After about 3 hours, 
the machine hard locks: no keyboard, no mouse, no response to a network 
ping.  A power cycle is required.  Nothing in the kernel log explains why 
this happened. 

I compiled a 2.4.19 kernel from kernel.org source, and the machine seems to run 
fine (been running for 4 days now). So, it seems to be a problem with the stock 
RH 8 kernel. 

Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
Install Red Hat 8.0 on a machine like this, and wait.  Once it crashed when I 
wasn't even typing anything - I just wheeled my chair over to use it and it was 
locked up. 

Additional info:

Comment 1 Janne Pikkarainen 2002-12-05 07:45:30 UTC

I can confirm that this also happens on a brand new IBM xSeries 335 / Intel Xeon
2.0 GHz / 1 GB RAM, running RH 8.0, kernel 2.4.18-18.8.0. 

I installed the server about two weeks ago. With the uniprocessor kernel it ran
flawlessly for 10 days without a crash. This Monday I rebooted into the
SMP kernel and was able to bring the server down with a stupid trick like this:

ab -k -n 1000000 -c 100000 http://127.0.0.1:80/

As expected, after a while the server reported "Too many open files", but a couple
of seconds later the kernel oopsed, reporting something about cpu0. I was able to
reproduce this three times in a row and I think I could reproduce it again at any time.

With the uni-processor kernel and the same test only the "Too many open files"
message appears but the kernel itself survives.
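
"Too many open files" is the standard error text for EMFILE, i.e. the process ran into its file descriptor limit, so that part of the test result is expected with -c 100000. A minimal Python sketch of that comparison follows; the helper name descriptor_headroom is hypothetical, and it only inspects the limit of the process running it, not of ab or the web server:

```python
import resource


def descriptor_headroom(requested_connections):
    """Return soft RLIMIT_NOFILE minus the requested connection count.

    A negative result means the request cannot fit under the current soft
    limit. Returns None when the soft limit is unlimited.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return None
    return soft - requested_connections


if __name__ == "__main__":
    # 100000 is the -c value from the ab command line quoted above.
    print(descriptor_headroom(100000))
```

The limit itself is harmless; the bug is that the SMP kernel oopses shortly after it is hit.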

Well, since the above test is stupid anyway, I thought that maybe the SMP kernel
would run just fine in our normal use. Nope. The best uptime the server has had
with the SMP kernel is around 30 hours, and it can freeze completely even when
idle. The server is not in production use, so I can run any tests you may need.
I can also provide the kernel oops message or any other logs here if needed.

Comment 2 Arjan van de Ven 2002-12-05 09:49:42 UTC

What network cards are in use?

Comment 3 Janne Pikkarainen 2002-12-05 10:49:48 UTC

Broadcom Corporation NetXtreme BCM5703X Gigabit Ethernet (rev 02), using tg3
kernel module.

Comment 4 Janne Pikkarainen 2002-12-05 10:53:30 UTC

Whoops, forgot to tell you that the server itself has only one physical CPU, but
since the Xeon is Hyper-Threading capable I decided to give the SMP kernel a try.

Comment 5 Mark Cuss 2002-12-05 15:31:01 UTC

My Dell server also uses a Broadcom NetXtreme BCM5700 network card 
(10/100/1000) along with an Intel 10/100 (82559).

Mark

Comment 6 Janne Pikkarainen 2002-12-09 07:29:08 UTC

Further investigation revealed that our server's SCSI controller (LSI Logic /
Symbios Logic 53c1030) is using IRQ 9. 

Another server of ours has an IBM ServeRAID controller, and its manual states that
Linux has some issues with SMP + IRQ 9 devices. I don't know whether that's
ServeRAID-specific, though...

Comment 7 Jeff Garzik 2002-12-09 19:09:14 UTC

Can you confirm that the latest rawhide kernel fixes this bug?

Or confirm that the following unofficial rpms, based on the latest errata kernel,
fix the problem?

http://people.redhat.com/jgarzik/tg3/tg3-1.2/rpms/

Comment 8 Janne Pikkarainen 2002-12-10 07:54:39 UTC

I just downloaded and installed kernel-smp-2.4.18-19.7.tg3.120.i686.rpm from
people.redhat.com.

---
Linux xxx 2.4.18-19.7.tg3.120smp #1 SMP Mon Nov 25 15:33:06 EST 2002 i686 i686
i386 GNU/Linux
---

At least something has changed, since /proc/cpuinfo now looks a bit different than
it used to. Previous kernels showed something like

---
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) XEON(TM) CPU 2.00GHz
stepping        : 4
cpu MHz         : 1993.759
cache size      : 512 KB
Physical processor ID   : 51941323214
Number of siblings      : 2
<cut>
bogomips        : 3953.14

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) XEON(TM) CPU 2.00GHz
stepping        : 4
cpu MHz         : 1993.759
cache size      : 512 KB
Physical processor ID   : 51941323214
Number of siblings      : 2
<cut>
bogomips        : 3986.55
---

But this new one tells me

---
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) XEON(TM) CPU 2.00GHz
stepping        : 4
cpu MHz         : 1993.759
cache size      : 512 KB
Physical processor ID   : 0
Number of siblings      : 2
<cut>
bogomips        : 3953.14

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) XEON(TM) CPU 2.00GHz
stepping        : 4
cpu MHz         : 1993.759
cache size      : 512 KB
Physical processor ID   : 0
Number of siblings      : 2
<cut>
bogomips        : 3986.55
---

The difference is the Physical processor ID. With the previous kernels that
number was always insanely high, and I think it could even change between my
checks. Now the "0" looks much more plausible. 
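
For anyone who wants to check this on their own box, here is a minimal Python sketch that pulls the "Physical processor ID" field out of /proc/cpuinfo-style text. The sample is trimmed from the old-kernel output quoted above; the function names and the sanity bound of 255 are assumptions made for illustration, not anything the kernel defines:

```python
# Trimmed from the "previous kernels" /proc/cpuinfo output quoted above.
OLD_CPUINFO = """\
processor       : 0
model name      : Intel(R) XEON(TM) CPU 2.00GHz
Physical processor ID   : 51941323214
Number of siblings      : 2

processor       : 1
model name      : Intel(R) XEON(TM) CPU 2.00GHz
Physical processor ID   : 51941323214
Number of siblings      : 2
"""


def physical_ids(cpuinfo_text):
    """Map each logical processor number to its reported physical ID."""
    ids = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        key = key.strip()
        value = value.strip()
        if key == "processor":
            current = int(value)
        elif key == "Physical processor ID" and current is not None:
            ids[current] = int(value)
    return ids


def suspicious_ids(ids, max_plausible=255):
    """Return entries above max_plausible (an arbitrary, assumed bound)."""
    return {cpu: pid for cpu, pid in ids.items() if pid > max_plausible}


if __name__ == "__main__":
    print(suspicious_ids(physical_ids(OLD_CPUINFO)))
```

On a live system the text would come from reading /proc/cpuinfo instead of the embedded sample.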

I'll let you know what happens with this kernel. If it doesn't crash during
the next two days, I'll be back here on Thursday; if it does crash, I'll be back
sooner... thanks for your help and keep up the good work. :-)

Comment 9 Janne Pikkarainen 2002-12-10 07:56:49 UTC

So I did come back earlier, but only to report this one:

---
           CPU0       CPU1       
  0:     178771     178805    IO-APIC-edge  timer
  1:          1          2    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  8:          1          0    IO-APIC-edge  rtc
 11:          0          0   IO-APIC-level  usb-ohci
 15:          1          1    IO-APIC-edge  ide1
 22:       4295       3299   IO-APIC-level  ioc0
 24:       3907       3865   IO-APIC-level  eth0
NMI:          0          0 
LOC:     357430     357442 
ERR:          0
MIS:          0
---

No more IRQ 9. :-)
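
The same check can be scripted. Below is a small Python sketch that parses /proc/interrupts-style text and reports which numeric IRQs are in use; the sample is the output pasted above, and the function name irq_table is hypothetical:

```python
# Copied from the /proc/interrupts output pasted above.
INTERRUPTS = """\
           CPU0       CPU1
  0:     178771     178805    IO-APIC-edge  timer
  1:          1          2    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  8:          1          0    IO-APIC-edge  rtc
 11:          0          0   IO-APIC-level  usb-ohci
 15:          1          1    IO-APIC-edge  ide1
 22:       4295       3299   IO-APIC-level  ioc0
 24:       3907       3865   IO-APIC-level  eth0
NMI:          0          0
LOC:     357430     357442
ERR:          0
MIS:          0
"""


def irq_table(text):
    """Map numeric IRQ -> (controller type, device names)."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())  # header row lists one column per CPU
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue  # skip NMI/LOC/ERR/MIS summary rows
        fields = rest.split()
        if len(fields) < ncpu + 2:
            continue  # row without controller/device columns
        table[int(label)] = (fields[ncpu], " ".join(fields[ncpu + 1:]))
    return table


if __name__ == "__main__":
    table = irq_table(INTERRUPTS)
    print("IRQ 9 in use:", 9 in table)
    print("ioc0 is on IRQ", [n for n, (_, dev) in table.items() if dev == "ioc0"])
```

With the sample above, the SCSI controller (ioc0) shows up on IRQ 22 in IO-APIC-level mode and IRQ 9 is absent.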

Comment 10 Janne Pikkarainen 2002-12-12 12:49:47 UTC

It seems that kernel 2.4.18-19.7.tg3.120smp fixed the problem for me:

---
  2:46pm  up 2 days,  5:02,  1 user,  load average: 0.03, 0.07, 0.03
---

Also, my previously mentioned ab test does not crash the kernel anymore. 

jgarzik: Was this all about IRQ 9 or did you fix something else?

Comment 11 Janne Pikkarainen 2002-12-13 08:07:52 UTC

Looks good! Last night I started a stress test
(http://weather.ou.edu/~apw/projects/stress/) on the server and left it running
all night.

nice -n 10 stress -c 750 -i 4 --verbose

bonnie++ was also torturing the disks all night.

The result? This morning the server was still up and the stress test was still
running OK. 

 9:54am  up 3 days, 10 min,  1 user,  load average: 754.44, 754.68, 754.68

I believe this case is over for me. Mark, how's your server?
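
The load average in that uptime line is about what the stress invocation should produce, assuming each -c and -i worker counts as one runnable (or uninterruptible) task: 750 + 4 = 754. A short Python sketch of the comparison, using the uptime line quoted above (the function name is hypothetical):

```python
import re

# The uptime line quoted above.
UPTIME = " 9:54am  up 3 days, 10 min,  1 user,  load average: 754.44, 754.68, 754.68"


def load_averages(uptime_line):
    """Extract the 1, 5 and 15 minute load averages from an uptime line."""
    match = re.search(
        r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in line")
    return tuple(float(value) for value in match.groups())


# Worker counts from "stress -c 750 -i 4".
expected_workers = 750 + 4
one_min, five_min, fifteen_min = load_averages(UPTIME)

if __name__ == "__main__":
    print("expected about", expected_workers, "observed", one_min)
```

A load that tracks the worker count this closely suggests all workers stayed alive through the night.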

Comment 12 Mark Cuss 2002-12-13 15:48:08 UTC

I haven't tried the patched kernel yet - I've been running a vanilla 2.4.20 
kernel, and the machine has been up for over two weeks...

Mark

Comment 13 Jeff Garzik 2002-12-31 22:37:02 UTC

To all still experiencing problems,

1) please boot with "noapic" on the kernel command line.  After booting, you can
run "cat /proc/cmdline" to verify that the option was actually passed.

2) I have posted some new rpms for testing, based on the latest errata:

latest production tg3 release, 1.2a, built into unofficial rpms:
http://people.redhat.com/jgarzik/tg3/tg3-1.2a/rpms/

but I would like people to test my experiment which should provide additional
stability:
http://people.redhat.com/jgarzik/tg3/tg3-1.2a/exp1-rpms/

...and if that doesn't work for people, fall back to experiment 2:
http://people.redhat.com/jgarzik/tg3/tg3-1.2a/exp2-rpms/

Feedback requested!  On several systems, there is evidence that the lock-ups are
not directly related to the driver but more to the system board.  So please make
sure to attach 'dmesg' and 'lspci -vvv' output in future bug reports.
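
For step 1, the /proc/cmdline check can be scripted when several machines need testing. A minimal Python sketch follows; the function names are hypothetical and the command lines in the example are made up for illustration:

```python
def has_boot_flag(cmdline, flag="noapic"):
    """Return True if flag appears as a whole word on the kernel command line."""
    return flag in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    # Made-up command lines, not taken from any machine in this report.
    print(has_boot_flag("ro root=/dev/sda1 noapic"))   # flag present
    print(has_boot_flag("ro root=/dev/sda1 nolapic"))  # different option
```

Matching on whole words avoids confusing "noapic" with similarly named options.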

Comment 14 Jeff Garzik 2003-01-20 20:57:50 UTC

Ok, some of these reports have actually been fixed in more recently posted rpms.

Just to get everybody on the same page, please use the "aragorn2" test rpms,
posted at http://people.redhat.com/jgarzik/pub/

This is the latest Red Hat errata kernel for 7.x/8.x, with the recent tg3 bug fixes.

Comment 15 Jeff Garzik 2003-01-27 16:14:14 UTC

Ladies and gentlemen,

I have received permission to post the latest release candidate of
Red Hat's errata kernel.  It contains not only fixes for e1000 and tg3 
net drivers, but also system-level fixes which may address the problems 
users on this list were seeing.

This kernel is currently in Red Hat Q/A, and has NOT yet been 
"qualified" as official, nor has it been released.  

Errata kernel 21 release candidate, for Red Hat 8.0:
        http://people.redhat.com/jgarzik/pub/2.4.18-21.8.0/

Errata kernel 21 release candidate, for Red Hat 7.x:
        http://people.redhat.com/jgarzik/pub/2.4.18-21.7.x/

It is requested that people who were seeing crash problems test this 
kernel, as this will be the next official Red Hat errata kernel, after 
it passes Q/A.
