Description of problem:

We're seeing occasional kernel panics in the bnx2 module's bnx2_poll function. I have a partial backtrace for the panic:

------------------------------------------------------------------
<3>BUG: sleeping function called from invalid context at kernel/rwsem.c:20
in_atomic():1, irqs_disabled():0

Call Trace:
 [<ffffffff80269387>] show_trace+0x34/0x47
 [<ffffffff802693ac>] dump_stack+0x12/0x17
 [<ffffffff8029dcd2>] down_read+0x15/0x23
 [<ffffffff802962c0>] blocking_notifier_call_chain+0x13/0x36
 [<ffffffff80214e75>] do_exit+0x1f/0x8c3
 [<ffffffff80264a70>] do_page_fault+0x79a/0x815
 [<ffffffff8025ce9d>] error_exit+0x0/0x84
DWARF2 unwinder stuck at error_exit+0x0/0x84
Leftover inexact backtrace:
 <IRQ> [<ffffffff88175e8c>] :bnx2:bnx2_poll+0xf9/0xb7b
 [<ffffffff8020c4bf>] net_rx_action+0xa4/0x1a6
 [<ffffffff80211d0f>] __do_softirq+0x5e/0xd5
 [<ffffffff8034b037>] end_msi_irq_wo_maskbit+0x9/0x16
 [<ffffffff8025d3b0>] call_softirq+0x1c/0x28
 [<ffffffff8026a541>] do_softirq+0x1c/0x28
 [<ffffffff8026a3cf>] do_IRQ+0xec/0xf5
 [<ffffffff8025c6c9>] ret_from_intr+0x0/0xa
 <EOI>
Kernel panic - not syncing: Aiee, killing interrupt handler!
------------------------------------------------------------------

The machine is a Dell PowerEdge 1950 with dual 2.66GHz Woodcrest Xeon CPUs and 16GB of RAM. The kernel has only one non-Red Hat patch applied to it, which backs out the following change in order to fix automount /net trouble:

http://kernel.org/git/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a634904a7de0d3a0bc606f608007a34e8c05bfee;hp=ddeff520f02b92128132c282c350fa72afffb84a

Does the backtrace ring any bells? I tried to track it down myself, but I don't know how to get gdb to read the debuginfo symbols for kernel modules from the kernel-debuginfo package. Any pointers to docs on that would be greatly appreciated.

thanks,
lars

Version-Release number of selected component (if applicable):
kernel-2.6.18-1.2200.fc5

How reproducible:
Intermittent; one or two a day in a farm of about 200 machines.

Steps to Reproduce:
1. Boot machine
2. Let it chew through various rendering tasks
3. Read the stack trace when it eventually panics
4. Not a very good list of steps here. Sorry.

Actual results:
a kernel panic

Expected results:
no panics

Additional info:
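On the gdb question above: one common approach is to open the module's debuginfo file in gdb and ask for the source line at a symbol+offset pair with `list *(symbol+offset)`. The sketch below only does the bookkeeping: it pulls the frames out of an oops and prints the gdb commands for one module. The helper names `parse_frames` and `gdb_commands` are hypothetical, and the location of the module's debuginfo file (typically somewhere under /usr/lib/debug/lib/modules/ when kernel-debuginfo is installed) is an assumption that should be checked on the affected machine.

```python
import re

# Matches frames such as "[<ffffffff88175e8c>] :bnx2:bnx2_poll+0xf9/0xb7b".
# The ":module:" prefix is optional; core kernel symbols do not carry one.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?::(?P<module>\w+):)?"
    r"(?P<symbol>[\w.]+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
)


def parse_frames(trace):
    """Return one dict per stack frame found in the oops text (hypothetical helper)."""
    return [m.groupdict() for m in FRAME_RE.finditer(trace)]


def gdb_commands(frames, module):
    """Build gdb 'list' commands for the frames that belong to the given module."""
    return [
        "list *({symbol}+{offset})".format(**frame)
        for frame in frames
        if frame["module"] == module
    ]


# A few frames copied from the backtrace in this report.
SAMPLE = """
 <IRQ> [<ffffffff88175e8c>] :bnx2:bnx2_poll+0xf9/0xb7b
 [<ffffffff8020c4bf>] net_rx_action+0xa4/0x1a6
 [<ffffffff80211d0f>] __do_softirq+0x5e/0xd5
"""

if __name__ == "__main__":
    for command in gdb_commands(parse_frames(SAMPLE), "bnx2"):
        print(command)  # prints: list *(bnx2_poll+0xf9)
```

The printed command is meant to be typed into a gdb session started on the bnx2 debuginfo file that matches the running kernel exactly; a mismatched build will resolve the offset to the wrong source line.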
I've been seeing this recently on some bnx2 hardware. Can you please attach `lspci -vvv` output so I can understand which bnx2 hardware is on the system?
Created attachment 141795
lspci -vv output for the machine suffering from bnx2 segfaults

Here you go. Thankfully we haven't seen one of these panics since submitting the bug report, but we haven't changed anything that would have fixed them. I'd still like to find a cause if we can.

thanks,
lars
Thanks for sending that output. I've been investigating panics like these on other kernels and will let you know when we come up with a solution there since it should apply here as well. Please let me know if you continue to see this panic or if you come up with a reliable way to reproduce it.
No problem. I was incorrect about not having seen it since reporting the bug; we actually catch seven or eight of them a day. The admins responsible for the farm have just been rebooting the machines and not telling me about it. :)

So, if there's any other information I can provide, please let me know! So far we've found no pattern to the panics.

thanks,
lars
Created attachment 142215
bnx2-txdebug2.diff

Currently we are still collecting data for the bnx2 crash and using the attached patch. Do you need me to roll a test kernel with this patch or would you be willing to build one yourself?
I'm happy to build it myself. Thanks, though! It'll probably be a couple of days before we can install it on a significant number of machines, but I'll get the process going.
Hi Andy, We finally had a panic on a machine with this patch installed. I don't see any output from the patch in the messages file from before the crash; would it have been logged to disk anywhere else before the machine froze up? I'm hoping a serial console wouldn't have been required to catch the message; we have hundreds of these machines, and attaching serial consoles to a number of them large enough to catch a panic soon would be pretty difficult. thanks, lars
Lars, The output probably did go to the serial port, but that's OK. I've been working this issue with some others on a different release and arch and the following patch has produced good results: http://people.redhat.com/agospoda/rhel4/gtest/bnx2-poll-fix2.patch This came as a suggestion from the upstream maintainer based on the output from the patch in Comment #7. Based on the other feedback I've gotten it seems this should probably resolve your issue. I realize that installing yet another kernel on that many machines is non-trivial, but based on the results from others it seems like a good candidate to resolve the panics. Please let me know if this resolves your issue. -andy
This patch looks like the final one that will resolve your issue: http://people.redhat.com/agospoda/rhel4/gtest/bnx2-txdesc-error.patch
Lars, Any chance you were able to verify the patch in comment #11? Thanks!
Hello Andy,

We have the patch active on a test group of render machines, and so far things are looking good. We're going to increase the number of machines using it soon, so I should be able to give a more definitive answer then.

Thanks for checking in! I'll update again when I have more info.

-lars
Sounds good, Lars. The patch for this will appear in 2.6.20.