Bug 214526

Summary: sporadic panic in bnx2 module
Product: [Fedora] Fedora
Reporter: Lars Damerow <lars>
Component: kernel
Assignee: Andy Gospodarek <agospoda>
Status: CLOSED UPSTREAM
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 5
CC: davej, linville, peterm, wtogami
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2007-02-02 20:13:48 UTC
Attachments:
  lspci -vv output for the machine suffering from bnx2 segfaults (flags: none)
  bnx2-txdebug2.diff (flags: none)

Description Lars Damerow 2006-11-07 23:22:55 UTC
Description of problem:
We're seeing occasional kernel panics in the bnx2 module's bnx2_poll function.
I have a partial backtrace for the panic:
------------------------------------------------------------------
  <3>BUG: sleeping function called from invalid context at kernel/rwsem.c:20
in_atomic():1, irqs_disabled():0

  Call Trace:
   [<ffffffff80269387>] show_trace+0x34/0x47
   [<ffffffff802693ac>] dump_stack+0x12/0x17
   [<ffffffff8029dcd2>] down_read+0x15/0x23
   [<ffffffff802962c0>] blocking_notifier_call_chain+0x13/0x36
   [<ffffffff80214e75>] do_exit+0x1f/0x8c3
   [<ffffffff80264a70>] do_page_fault+0x79a/0x815
   [<ffffffff8025ce9d>] error_exit+0x0/0x84
  DWARF2 unwinder stuck at error_exit+0x0/0x84
  Leftover inexact backtrace:
   <IRQ>  [<ffffffff88175e8c>] :bnx2:bnx2_poll+0xf9/0xb7b
   [<ffffffff8020c4bf>] net_rx_action+0xa4/0x1a6
   [<ffffffff80211d0f>] __do_softirq+0x5e/0xd5
   [<ffffffff8034b037>] end_msi_irq_wo_maskbit+0x9/0x16
   [<ffffffff8025d3b0>] call_softirq+0x1c/0x28
   [<ffffffff8026a541>] do_softirq+0x1c/0x28
   [<ffffffff8026a3cf>] do_IRQ+0xec/0xf5
   [<ffffffff8025c6c9>] ret_from_intr+0x0/0xa
   <EOI>
  Kernel panic - not syncing: Aiee, killing interrupt handler!
------------------------------------------------------------------

The machine is a Dell PowerEdge 1950 with dual 2.66GHz Woodcrest Xeon CPUs and
16GB of RAM. The kernel only has one non-Red Hat patch applied to it, which
backs out the following change in order to fix automount /net trouble:

http://kernel.org/git/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a634904a7de0d3a0bc606f608007a34e8c05bfee;hp=ddeff520f02b92128132c282c350fa72afffb84a

Does the backtrace ring any bells? I tried to trace it down myself, but don't
know how to get gdb to read debuginfo symbols for kernel modules in the
kernel-debuginfo package. Any pointers to docs on that would be greatly appreciated.

thanks,
lars
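
On the gdb question above: each frame in the trace has the form
[<address>] :module:symbol+0xoffset/0xsize, and gdb accepts
`list *(symbol+0xoffset)` once it has loaded an object file that carries debug
symbols for the module. The sketch below (Python, not part of the original
report) splits a frame into those parts and prints a matching gdb command. The
helper names parse_frame and gdb_list_command are made up for illustration, and
which file from kernel-debuginfo to load is left open here. Frames listed under
"Leftover inexact backtrace" are values found on the stack, so the offset may be
a return address rather than the faulting instruction.

```python
import re

# Matches frames such as "[<ffffffff88175e8c>] :bnx2:bnx2_poll+0xf9/0xb7b".
# The ":module:" part is optional; core-kernel symbols do not have it.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?::(?P<module>\w+):)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frame(line):
    """Return the parts of one backtrace frame, or None if the line has no frame."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    addr = int(m.group("addr"), 16)
    offset = int(m.group("offset"), 16)
    return {
        "module": m.group("module"),      # None for core-kernel symbols
        "symbol": m.group("symbol"),
        "offset": offset,                 # bytes into the function
        "size": int(m.group("size"), 16), # total function size in bytes
        "symbol_start": addr - offset,    # runtime address of the function
    }


def gdb_list_command(frame):
    """Build a gdb command that shows the source near symbol+offset."""
    return "list *({symbol}+{offset:#x})".format(**frame)


frame = parse_frame("<IRQ>  [<ffffffff88175e8c>] :bnx2:bnx2_poll+0xf9/0xb7b")
print(frame["module"], gdb_list_command(frame))
```

The printed command is meant to be typed into a gdb session that was started on
the module's debug object, not on vmlinux.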

Version-Release number of selected component (if applicable):
kernel-2.6.18-1.2200.fc5

How reproducible:
Intermittent; one or two a day in a farm of about 200 machines.

Steps to Reproduce:
1. Boot machine
2. Let it chew through various rendering tasks
3. Read the stack trace when it eventually panics
4. Not a very good list of steps here, sorry.
  
Actual results:
a kernel panic

Expected results:
no panics

Additional info:

Comment 3 Andy Gospodarek 2006-11-21 16:24:48 UTC
I've been seeing this recently on some bnx2 hardware.  Can you please attach
`lspci -vvv` output so I can understand which bnx2 hardware is on the system?  

Comment 4 Lars Damerow 2006-11-21 17:14:25 UTC
Created attachment 141795 [details]
lspci -vv output for the machine suffering from bnx2 segfaults

Here you go. Thankfully we haven't seen one of these panics since submitting
the bug report, but we haven't changed anything that would have fixed them. I'd
still like to find a cause if we can.

thanks,
lars

Comment 5 Andy Gospodarek 2006-11-27 15:14:57 UTC
Thanks for sending that output.  I've been investigating panics like these on
other kernels and will let you know when we come up with a solution there since
it should apply here as well.  

Please let me know if you continue to see this panic or if you come up with a
reliable way to reproduce it.

Comment 6 Lars Damerow 2006-11-27 15:58:36 UTC
No problem. I was incorrect about not having seen it since reporting the
bug--we actually catch seven or eight of them a day. The admins responsible for
the farm have just been rebooting the machines and not telling me about it. :)
So, if there's any other information I can provide, please let me know!

So far we've found no pattern to the panics.

thanks,
lars

Comment 7 Andy Gospodarek 2006-11-27 19:55:25 UTC
Created attachment 142215 [details]
bnx2-txdebug2.diff

We are still collecting data on the bnx2 crash, and we are using the attached
patch to do it.

Do you need me to roll a test kernel with this patch or would you be willing to
build one yourself?

Comment 8 Lars Damerow 2006-11-27 22:22:58 UTC
I'm happy to build it myself. Thanks, though! It'll probably be a couple of days
before we can install it on a significant number of machines, but I'll get the
process going.

Comment 9 Lars Damerow 2006-12-08 22:05:36 UTC
Hi Andy,

We finally had a panic on a machine with this patch installed. I don't see any
output from the patch in the messages file from before the crash; would it have
been logged to disk anywhere else before the machine froze up?

I'm hoping a serial console wouldn't have been required to catch the message; we
have hundreds of these machines, and attaching serial consoles to a number of
them large enough to catch a panic soon would be pretty difficult.

thanks,
lars
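
As a rough way to size the serial-console question, the sketch below estimates
the chance that at least one console-equipped machine panics within a given
number of days. It assumes panics are independent and equally likely on every
machine, which this report does not establish, and it treats the fleet-wide
daily count divided by the fleet size as a per-machine daily probability. The
figures from this report (about 200 machines, seven or eight panics a day) are
used only as example inputs; chance_of_capture is a hypothetical helper.

```python
def chance_of_capture(panics_per_day, fleet_size, consoles, days):
    """Probability that at least one console-equipped machine panics within `days`.

    Assumption: panics are independent and uniform across the fleet, so each
    machine-day has probability panics_per_day / fleet_size of a panic.
    """
    per_machine_daily = panics_per_day / fleet_size
    # Probability that none of the watched machine-days sees a panic.
    p_none = (1.0 - per_machine_daily) ** (consoles * days)
    return 1.0 - p_none


# Example inputs taken from the numbers quoted in this report.
for consoles in (5, 10, 20):
    p = chance_of_capture(panics_per_day=7.5, fleet_size=200,
                          consoles=consoles, days=7)
    print(consoles, "consoles:", round(p, 2))
```

If the panics cluster on particular machines or workloads, the uniform
assumption breaks down and the estimate is only a starting point.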

Comment 10 Andy Gospodarek 2006-12-08 22:27:42 UTC
Lars,

The output probably did go to the serial port, but that's OK.  I've been working
this issue with some others on a different release and arch and the following
patch has produced good results:

http://people.redhat.com/agospoda/rhel4/gtest/bnx2-poll-fix2.patch

This came as a suggestion from the upstream maintainer based on the output from
the patch in Comment #7.

Based on the other feedback I've gotten, this should probably resolve your
issue.  I realize that installing yet another kernel on that many machines is
non-trivial, but the results from others make it a good candidate.  Please let
me know whether it stops the panics.

-andy



Comment 11 Andy Gospodarek 2006-12-15 19:49:58 UTC
This patch looks like the final one that will resolve your issue:

http://people.redhat.com/agospoda/rhel4/gtest/bnx2-txdesc-error.patch



Comment 12 Andy Gospodarek 2007-01-09 15:12:20 UTC
Lars,

Any chance you were able to verify the patch in comment #11?

Thanks!

Comment 13 Lars Damerow 2007-01-11 18:01:58 UTC
Hello Andy,

We have the patch active on a test group of render machines, and so far things
are looking good. We're going to increase the number of machines using it soon,
so I should have a more definitive answer shortly.

Thanks for checking in! I'll update again when I have more info.

-lars

Comment 14 Andy Gospodarek 2007-01-11 19:11:01 UTC
Sounds good, Lars.  The patch for this will appear in 2.6.20.