From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040113

Description of problem:
I've got a dual Opteron machine on an MSI K8T Master2-FAR motherboard, and I'm stress testing to make sure I can't crash it before I put it in production. Unfortunately, I can make it crash pretty easily by stressing the network subsystem, using the onboard BCM5705 chip and either the tg3 driver or the bcm5700 driver. When it crashes, the stack trace looks like the one in this bug: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=53849

Version-Release number of selected component (if applicable):
kernel-smp-2.6.6-1.435
bcm5700-7.1.22

How reproducible:
Always

Steps to Reproduce:
1. start a network stress test (ttcp or wget a huge file or similar)
2. wait for a little while
3. collect oops

Actual Results: kernel crashes

Expected Results: kernel shouldn't crash

Additional info:
For a little background, I've had problems with the tg3 driver on other machines before, so when I was able to make the machine hang while doing network tests I immediately installed the bcm5700 driver in its place. I did not put much effort into testing the tg3 driver, but would be happy to if I was able to stabilize the machine with it.

The kernel command line I'm using is "ro root=/dev/md1 rhgb quiet noapic console=tty0", and the kernel is the default x86-64 smp kernel from FC2, after updating fully. The "noapic" kernel option was added after I was able to make the machine hang while stressing the I/O subsystem, and it fixed that problem.

I'll attach the lspci -v output and the dmesg output from boot, and I'd be happy to find and post any other information necessary. I'd also be happy to test patches (maybe from x86-64.org?) or run other tests if that would help isolate this and fix it.
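For anyone without ttcp at hand, step 1 of the steps to reproduce can be approximated with a small bulk TCP sender and sink. This is only a rough sketch of that kind of test, not the tool the reporter ran: the names `sink`, `blast`, and `loopback_demo` and the 64 KiB chunk size are assumptions made for illustration. The loopback demo only checks that the pair works; to load the NIC, the sink would have to run on one machine and the sender on another so the traffic crosses the interface under test.

```python
import socket
import threading

CHUNK = 64 * 1024  # assumed transfer unit; not taken from the report


def sink(server_sock, result):
    """Accept one connection and discard everything received."""
    conn, _ = server_sock.accept()
    total = 0
    with conn:
        while True:
            data = conn.recv(CHUNK)
            if not data:  # peer closed the connection
                break
            total += len(data)
    result["bytes"] = total


def blast(host, port, total_bytes):
    """Send total_bytes of zero bytes to host:port as fast as the socket allows."""
    payload = bytes(CHUNK)
    sent = 0
    with socket.create_connection((host, port)) as s:
        while sent < total_bytes:
            n = min(CHUNK, total_bytes - sent)
            s.sendall(payload[:n])
            sent += n
    return sent


def loopback_demo(total_bytes=1 << 20):
    """Run sender and sink in one process over 127.0.0.1 as a self-check."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    result = {}
    t = threading.Thread(target=sink, args=(server, result))
    t.start()
    sent = blast("127.0.0.1", port, total_bytes)
    t.join()
    server.close()
    return sent, result["bytes"]


if __name__ == "__main__":
    sent, received = loopback_demo()
    print(f"sent {sent} bytes, sink received {received} bytes")
```

For a real run, the transfer size would need to be far larger than the 1 MiB used in the self-check, since the report says the crash appears only after the test has run for a little while.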
Created attachment 101446: lspci -v output from the machine
Created attachment 101447: the /var/log/dmesg file from the machine
Is there any other information that would help solve this? Would there be more interest if it was the in-kernel tg3 driver? I had problems with that too and I'd be happy to switch and work on that driver if I knew someone who could get changes into mainline was willing to work on it.
Well, how do you expect us to fix problems in drivers we don't ship and don't want to ship? ;) If you have problems with tg3, please file a (separate) bug so that the tg3 maintainers can investigate.
Arjan - that certainly makes sense. I did see other bcm5700 bugs in here, which is why I went ahead and filed it. If I isolated an oops or a test case for tg3, shouldn't I just change the summary line on this bug? It's already got good information and I'd just add a new comment with the oops. If you still want a new bug, and you guys have no interest in bcm5700 issues, then you might as well put this one to RESOLVED -> WONTFIX.
Ok, I finally got physical access to the machine again after the July 4th holiday. I brought it up with the tg3 driver, then triggered the oops by using wget from another machine to transfer a big file across the tg3 interface and dump it to /dev/null. The stack in the oops was IRQ, tg3_poll+108, net_rx_action+128, __do_softirq+76, __do_softirq+49, do_IRQ+321, default_idle+0, ret_from_intr+0, etc. (I can post more if you want.) There was a code line that read "0f 0b 61 05 06 90 ff ff ff ff 68 08 89 e8 48 6b d8 18 49 03". At the very bottom there was a "RIP" line with tg3_tx+139 in it, then "Aiee, killing interrupt handler". I'm the first to admit I'm not giving a perfect oops report here (I read oops-tracing.txt), but I think this has the important info. I can maybe run a higher-resolution console and get the full oops if necessary. I'm also attaching an updated dmesg, since I have updated to the newest FC2 kernel. That will follow in a moment.
Created attachment 101742: dmesg output after boot from kernel 2.6.6-1.435.2.3smp
This attachment holds the output of the dmesg command from the machine running the newest Fedora Core 2 kernel. This kernel produced the oops message in the comment from 20040708.
We do need the full OOPS log so we can see register values at the time of the crash, etc. Set up a serial console to capture it if you need to. Thanks.
Ok - I've done tons of debugging but never a full oops report or serial console. I will certainly do this, but it may take a bit of time to get right, so there may be a lull here before I report back. Thanks in advance for your patience, and I'll post again as soon as I've got it.
Alright, I get physical access to the machine on Thursday afternoons, so here's this week's progress. I got serial console to work (nifty trick, that - extremely useful), and this is what I get on the oops:

Kernel BUG at tg3:2232
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: ipv6 autofs4 nfs lockd sunrpc tg3 dm_mod button battery asus_acpi ac ext3 jbd raid1
Pid: 0, comm: swapper Not tainted 2.6.6-1.435.2.3smp
RIP: 0010:[<ffffffffa00576ec>] <ffffffffa00576ec>{:tg3:tg3_tx+139}
RSP: 0018:ffffffff80436518 EFLAGS: 00010046
RAX: 000001007f209900 RBX: 000001007b325f70 RCX: 0000000000000000
RDX: 0000000000000118 RSI: 000000005553dee8 RDI: 000001007fe14038
RBP: 00000000000001fb R08: 0000000000000001 R09: ffffffff8043fea0
R10: 0000000000000202 R11: 0000000000000003 R12: 000001007b2f1180
R13: 000001007b28b380 R14: 0000000000000001 R15: 00000000000001fb
FS: 0000002a9557e320(0000) GS:ffffffff80496f00(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000002a95558000 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff8049a000, task ffffffff803bb0a0)
Stack: 0000000000000000 0004659101910000 0000000000000002 000001007b371000
       000001007b28b380 000001007b28b000 ffffffff804365ac ffffffff8049bf28
       0000000000000000 ffffffffa0057fcc
Call Trace:<IRQ> <ffffffffa0057fcc>{:tg3:tg3_poll+108} <ffffffff80283b40>{net_rx_action+128}
       <ffffffff80139964>{__do_softirq+76} <ffffffff801399f1>{do_softirq+49}
       <ffffffff80113f29>{do_IRQ+321} <ffffffff8010f710>{default_idle+0}
       <ffffffff8011186b>{ret_from_intr+0} <EOI> <ffffffff8010f710>{default_idle+0}
       <ffffffff8010f734>{default_idle+36} <ffffffff8010f7a7>{cpu_idle+24}
       <ffffffff8049d817>{start_kernel+451}

Code: 0f 0b 61 05 06 a0 ff ff ff ff b8 08 89 e8 48 6b d8 18 49 03
RIP <ffffffffa00576ec>{:tg3:tg3_tx+139} RSP <ffffffff80436518>
<0>Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing
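As a cross-check on the oops above, the "Code:" bytes can be decoded by hand. The sketch below assumes that the x86-64 BUG() macro in kernels of this era emits a `ud2` instruction (`0f 0b`) followed by an 8-byte pointer to the file name and a 2-byte line number, all little-endian. That layout is an assumption about this kernel, not something stated in the report, and the function name `decode_bug_frame` is made up for illustration.

```python
import struct


def decode_bug_frame(code_line):
    """Decode an assumed 'ud2; 8-byte filename pointer; 2-byte line' BUG frame.

    code_line is the full "Code: xx xx ..." line copied from an oops.
    Returns (filename_pointer, line_number).
    """
    hex_bytes = code_line.split(":", 1)[1].split()
    raw = bytes(int(b, 16) for b in hex_bytes)
    if len(raw) < 12:
        raise ValueError("need at least 12 code bytes")
    if raw[:2] != b"\x0f\x0b":
        raise ValueError("code does not start with ud2 (0f 0b)")
    # '<' selects little-endian with no padding: Q = 8 bytes, H = 2 bytes.
    filename_ptr, line = struct.unpack_from("<QH", raw, 2)
    return filename_ptr, line


CODE = "Code: 0f 0b 61 05 06 a0 ff ff ff ff b8 08 89 e8 48 6b d8 18 49 03"

if __name__ == "__main__":
    ptr, line = decode_bug_frame(CODE)
    print(f"filename pointer {ptr:#x}, line {line}")
```

Under that assumption the bytes `b8 08` decode to line 2232, which agrees with the "Kernel BUG at tg3:2232" header, and the pointer 0xffffffffa0060561 falls in the same address range as the tg3 module symbols in the trace.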
Is there anything else I can do here to assist? I am a programmer (though it's normally Java these days) and I'll help if possible; I just need direction when it comes to kernel internals, etc.
Fedora Core 2 has now reached end of life, and no further updates will be provided by Red Hat. The Fedora legacy project will be producing further kernel updates for security problems only. If this bug has not been fixed in the latest Fedora Core 2 update kernel, please try to reproduce it under Fedora Core 3, and reopen if necessary, changing the product version accordingly. Thank you.