126789 – kernel crash in tg3 driver on dual opteron

Summary: kernel crash in tg3 driver on dual opteron

Keywords:
Status:	CLOSED NEXTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	2
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Dave Jones
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-06-26 20:35 UTC by Mike Hardy
Modified:	2015-01-04 22:07 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2005-04-16 04:30:34 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
lspci -v output from the machine (3.38 KB, text/plain) 2004-06-26 20:37 UTC, Mike Hardy
the /var/log/dmesg file from the machine (11.87 KB, text/plain) 2004-06-26 20:38 UTC, Mike Hardy
dmesg output after boot from kernel 2.6.6-1.435.2.3smp (12.21 KB, text/plain) 2004-07-09 06:50 UTC, Mike Hardy

Description Mike Hardy 2004-06-26 20:35:13 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040113

Description of problem:
I've got a dual Opteron machine on an MSI K8T Master2-FAR motherboard,
and I'm stress testing to make sure I can't crash it before I put it
in production. Unfortunately, I can make it crash pretty easily by
stressing out the network subsystem, using the onboard BCM5705 chip
and either the tg3 driver or the bcm5700 driver.

When it crashes the stack trace looks like the one in this bug:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=53849

Version-Release number of selected component (if applicable):
kernel-smp-2.6.6-1.435 bcm5700-7.1.22

How reproducible:
Always

Steps to Reproduce:
1. start a network stress test (ttcp or wget a huge file or similar)
2. wait for a little while
3. collect oops
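
The steps above can be approximated without ttcp or wget. The following is only a minimal sketch of a TCP sender and a discarding receiver, not the tool used in this report. The names `sink`, `blast`, `loopback_run` and the chunk size are assumptions made for illustration. Note that the loopback helper only checks that the script itself works; it does not touch the NIC or the tg3 driver.

```python
import socket
import threading

CHUNK = 64 * 1024  # bytes per send/recv call; an arbitrary choice


def sink(server: socket.socket, result: dict) -> None:
    """Accept one connection and discard everything it sends (like > /dev/null)."""
    conn, _ = server.accept()
    total = 0
    with conn:
        while True:
            data = conn.recv(CHUNK)
            if not data:  # peer closed the connection
                break
            total += len(data)
    result["received"] = total


def blast(host: str, port: int, nbytes: int) -> None:
    """Send nbytes of zeros to host:port as fast as the socket allows."""
    payload = bytes(CHUNK)
    with socket.create_connection((host, port)) as s:
        sent = 0
        while sent < nbytes:
            n = min(CHUNK, nbytes - sent)
            s.sendall(payload[:n])
            sent += n


def loopback_run(nbytes: int) -> int:
    """Run sink and blast in one process over 127.0.0.1; return bytes received."""
    result = {}
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        receiver = threading.Thread(target=sink, args=(server, result))
        receiver.start()
        blast("127.0.0.1", port, nbytes)
        receiver.join()
    return result["received"]
```

To load the real interface, one would run `sink` on one machine and `blast` from another across the link under test, with a byte count large enough to keep traffic flowing for several minutes.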
    

Actual Results:  kernel crashes

Expected Results:  kernel shouldn't crash

Additional info:

For a little background, I've had problems with the tg3 driver on
other machines before, so when I was able to make the machine hang
while doing network tests I immediately installed the bcm5700 driver
in its place. I did not put much effort into testing the tg3 driver,
but would be happy to if I was able to stabilize the machine with it. 

The kernel command line I'm using is "ro root=/dev/md1 rhgb quiet
noapic console=tty0", and the kernel is the default x86-64 smp kernel
from FC2, after updating fully.

The "noapic" kernel option was added after I was able to make the
machine hang while stressing the I/O subsystem, and it fixed that problem.

I'll attach the lspci -v output and the dmesg output from boot, and
I'd be happy to find and post any other information necessary. 

I'd also be happy to test patches (maybe from x86-64.org?) or run
other tests if that would help isolate this and fix it.

Comment 1 Mike Hardy 2004-06-26 20:37:42 UTC

Created attachment 101446
lspci -v output from the machine

Comment 2 Mike Hardy 2004-06-26 20:38:20 UTC

Created attachment 101447
the /var/log/dmesg file from the machine

Comment 3 Mike Hardy 2004-06-30 04:12:53 UTC

Is there any other information that would help solve this? Would there
be more interest if it was the in-kernel tg3 driver? I had problems
with that too and I'd be happy to switch and work on that driver if I
knew someone who could get changes into mainline was willing to work
on it.

Comment 4 Arjan van de Ven 2004-06-30 06:54:14 UTC

Well how do you expect us to fix problems in drivers we don't ship nor
want to ship ? ;)

If you have problems with tg3... well please file a (separate) bug so
that the tg3 maintainers can investigate...

Comment 5 Mike Hardy 2004-06-30 18:12:04 UTC

Arjan - that certainly makes sense. I did see other bcm5700 bugs in
here, which is why I went ahead and filed it. 

If I isolated an oops or a test case for tg3, shouldn't I just change
the summary line on this bug? It's already got good information and I'd
just add a new comment with the oops.

If you still want a new bug, and you guys have no interest in bcm5700
issues, then you might as well put this one to RESOLVED -> WONTFIX

Comment 6 Mike Hardy 2004-07-09 06:46:04 UTC

Ok, I finally got physical access to the machine again after the July
4th holiday, and I brought it up with the tg3 driver then triggered
the oops using wget from another machine to transfer a big file across
the tg3 interface and dump it to /dev/null.

The stack in the oops was IRQ, tg3_poll+108, net_rx_action+128,
__do_softirq+76, do_softirq+49, do_IRQ+321, default_idle+0,
ret_from_intr+0, etc (I can post more if you want)

There was a "Code" line that read "0f 0b 61 05 06 90 ff ff ff ff 68 08 89 e8
48 6b d8 18 49 03"

At the very bottom there was a "RIP" line with tg3_tx+139 in it, then
"Aiee, killing interrupt handler".

I'm the first to admit I'm not giving a perfect oops report here (I
read oops-tracing.txt) but I think this has the important info.

I can run a higher resolution console maybe and get the full oops if
necessary.

I'm also attaching an updated dmesg since I have updated to the newest
FC2 kernel. That will follow in a moment.

Comment 7 Mike Hardy 2004-07-09 06:50:45 UTC

Created attachment 101742
dmesg output after boot from kernel 2.6.6-1.435.2.3smp

This attachment holds the output of the dmesg command from the machine running
the newest Fedora Core 2 kernel. This kernel produced the oops message in the
comment from 20040708

Comment 8 David Miller 2004-07-09 20:10:00 UTC

We do need the full OOPS log so we can see register
values at the time of the crash, etc.

Set up a serial console to capture it if you need to.
Thanks.
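
As a rough illustration of the serial console suggestion: the kernel accepts a `console=ttyS0,115200n8` style parameter (port, baud rate, parity, data bits). The sketch below appends one to the command line quoted in the description. The helper name `with_serial_console` and the choice of `ttyS0` at 115200 baud are assumptions, not settings taken from this report.

```python
def with_serial_console(cmdline: str, port: str = "ttyS0", baud: int = 115200) -> str:
    """Return cmdline with a serial console= entry appended (no parity, 8 data bits)."""
    entry = f"console={port},{baud}n8"
    args = cmdline.split()
    if entry not in args:  # avoid adding the same entry twice
        args.append(entry)
    return " ".join(args)


# Command line as quoted in the bug description.
reported = "ro root=/dev/md1 rhgb quiet noapic console=tty0"
print(with_serial_console(reported))
# prints: ro root=/dev/md1 rhgb quiet noapic console=tty0 console=ttyS0,115200n8
```

A second machine attached with a null-modem cable and a terminal program set to the same speed can then log whatever the kernel prints when it crashes.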

Comment 9 Mike Hardy 2004-07-09 21:35:56 UTC

Ok - I've done tons of debugging but never a full Oops report or
serial console. So, I will certainly do this but it may take a bit of
time to get perfect so there may be a lull here before I report back.
Thanks in advance for your patience and I'll post again as soon as
I've got it.

Comment 10 Mike Hardy 2004-07-16 03:55:33 UTC

Alright, I get physical access to the machine on Thursday afternoons,
so here's this week's progress. I got serial console to work (nifty
trick, that - extremely useful), and this is what I get on the oops:

Kernel BUG at tg3:2232
invalid operand: 0000 [1] SMP 
CPU 0 
Modules linked in: ipv6 autofs4 nfs lockd sunrpc tg3 dm_mod button battery asus_acpi ac ext3 jbd raid1
Pid: 0, comm: swapper Not tainted 2.6.6-1.435.2.3smp
RIP: 0010:[<ffffffffa00576ec>] <ffffffffa00576ec>{:tg3:tg3_tx+139}
RSP: 0018:ffffffff80436518  EFLAGS: 00010046
RAX: 000001007f209900 RBX: 000001007b325f70 RCX: 0000000000000000
RDX: 0000000000000118 RSI: 000000005553dee8 RDI: 000001007fe14038
RBP: 00000000000001fb R08: 0000000000000001 R09: ffffffff8043fea0
R10: 0000000000000202 R11: 0000000000000003 R12: 000001007b2f1180
R13: 000001007b28b380 R14: 0000000000000001 R15: 00000000000001fb
FS:  0000002a9557e320(0000) GS:ffffffff80496f00(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000002a95558000 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff8049a000, task ffffffff803bb0a0)
Stack: 0000000000000000 0004659101910000 0000000000000002 000001007b371000
       000001007b28b380 000001007b28b000 ffffffff804365ac ffffffff8049bf28
       0000000000000000 ffffffffa0057fcc
Call Trace:<IRQ> <ffffffffa0057fcc>{:tg3:tg3_poll+108} <ffffffff80283b40>{net_rx_action+128}
       <ffffffff80139964>{__do_softirq+76} <ffffffff801399f1>{do_softirq+49}
       <ffffffff80113f29>{do_IRQ+321} <ffffffff8010f710>{default_idle+0}
       <ffffffff8011186b>{ret_from_intr+0}  <EOI> <ffffffff8010f710>{default_idle+0}
       <ffffffff8010f734>{default_idle+36} <ffffffff8010f7a7>{cpu_idle+24}
       <ffffffff8049d817>{start_kernel+451}

Code: 0f 0b 61 05 06 a0 ff ff ff ff b8 08 89 e8 48 6b d8 18 49 03 
RIP <ffffffffa00576ec>{:tg3:tg3_tx+139} RSP <ffffffff80436518>
 <0>Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing
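
For anyone reading the oops above, here is a small hedged sketch that pulls the call trace entries apart and decodes the start of the "Code:" line. Two things are assumptions rather than facts from this report: symbol offsets are treated as decimal, and the bytes after `0f 0b` (the x86 `ud2` opcode) are read as a 64-bit pointer followed by a 16-bit line number. The second assumption is at least consistent with the log, since the decoded number matches "Kernel BUG at tg3:2232". The function names are made up for this example.

```python
import re
import struct

# Matches entries such as <ffffffffa0057fcc>{:tg3:tg3_poll+108}
# and <ffffffff80283b40>{net_rx_action+128}.
TRACE_ENTRY = re.compile(r"<([0-9a-f]{16})>\{(?::(\w+):)?(\w+)\+(\d+)\}")


def parse_trace(text: str):
    """Return (address, module or None, symbol, offset) for each trace entry."""
    return [
        (int(addr, 16), module or None, symbol, int(offset))
        for addr, module, symbol, offset in TRACE_ENTRY.findall(text)
    ]


def decode_bug_frame(code_line: str):
    """Decode 'Code:' bytes as ud2 + 64-bit pointer + 16-bit line number.

    ASSUMPTION: this layout is inferred from the oops in this report,
    not taken from the kernel source.
    """
    raw = bytes.fromhex(code_line.split(":", 1)[1].strip())
    opcode, name_ptr, line = struct.unpack_from("<2sQH", raw)
    if opcode != b"\x0f\x0b":
        raise ValueError("Code bytes do not start with ud2 (0f 0b)")
    return name_ptr, line


code = "Code: 0f 0b 61 05 06 a0 ff ff ff ff b8 08 89 e8 48 6b d8 18 49 03"
pointer, line = decode_bug_frame(code)
print(hex(pointer), line)
# prints: 0xffffffffa0060561 2232
```

Under these assumptions the crash is a deliberate BUG() check firing inside tg3_tx rather than a stray memory access, which is why the RIP lands on a `ud2` instruction.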

Comment 11 Mike Hardy 2004-07-31 16:53:04 UTC

Is there anything else I can do here to assist? I am a programmer
(though it's Java normally these days) and I'll help if possible, I
just need direction when it comes to kernel internals etc.

Comment 12 Dave Jones 2005-04-16 04:30:34 UTC

Fedora Core 2 has now reached end of life, and no further updates will be
provided by Red Hat.  The Fedora legacy project will be producing further kernel
updates for security problems only.

If this bug has not been fixed in the latest Fedora Core 2 update kernel, please
try to reproduce it under Fedora Core 3, and reopen if necessary, changing the
product version accordingly.

Thank you.
