477945 – Kernel Panic with Bnx2 - Badness in local_bh_enable at kernel/softirq.c:141

Summary: Kernel Panic with Bnx2 - Badness in local_bh_enable at kernel/softirq.c:141

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.7
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Neil Horman
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-12-26 04:47 UTC by Qian Cai
Modified:	2009-05-18 19:22 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-05-18 19:22:29 UTC
Target Upstream Version:
Embargoed:

Attachments
poll to keep bh's disabled locally while poll_lock is held (579 bytes, patch) 2009-01-06 14:41 UTC, Neil Horman

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2009:1024	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 4.8 kernel security and bug fix update	2009-05-18 14:57:26 UTC

Description Qian Cai 2008-12-26 04:47:44 UTC

Description of problem:
After setting up netdump, I can consistently reproduce a kernel panic or hang by doing the following at the same time:

- "echo t >/proc/sysrq-trigger" via serial console.
- SSH to the machine from a remote host.

In addition, netdump fails to save the VMCore after the panic.

# echo t >/proc/sysrq-trigger
SysRq : Show State


                                               sibling

  task             PC      pid father child younger older

init          S C037DF14   936     1      0     2               (NOTLB)

f7e0dea8 00000082 c037df14 c037df14 f1be1870 00000000 e0588c00 000f66ab 

       f7e41770 f7e418fc 02632038 f7e0deb8 00000000 f7e0df74 c03262a5 f7e41770 

       00000000 f74040c4 c037c4a0 02632038 1d244b3c 00000000 00000005 c0335879 

Call Trace:

 [<c03262a5>] schedule_timeout+0x158/0x17c

 [<c012efd9>] process_timeout+0x0/0x13

 [<c01836f1>] do_select+0x347/0x378

 [<c0183271>] __pollwait+0x0/0x94

 [<c0183a15>] sys_select+0x2e0/0x43a

 [<c0327947>] syscall_call+0x7/0xb

 [<c032007b>] unix_stream_sendmsg+0x280/0x318

ksoftirqd/0   S F6A7B7B0  3776     2      1             3       (L-TLB)

f7e10fc4 00000046 f7e10000 f6a7b7b0 f1be1870 00000000 5deed0c0 000f66ac 

       f7e411a0 f7e4132c f7e10000 f7e0df64 00000000 c012a593 c012a5c8 f7e10000 

       c013d79d fffffffc ffffffff ffffffff c013d734 00000000 00000000 00000000 

Call Trace:

 [<c012a593>] ksoftirqd+0x0/0x95

 [<c012a5c8>] ksoftirqd+0x35/0x95

 [<c013d79d>] kthread+0x69/0x91

 [<c013d734>] kthread+0x0/0x91

 [<c01041dd>] kernel_thread_helper+0x5/0xb

events/0      S F7E82E88  2824     3      1     4      58     2 (L-TLB)

f7feff6c 00000046 00000003 f7e82e88 f1be1870 00000000 120e6080 000f66ac 

       f7e40bd0 f7e40d5c c0447200 00000246 00000000 f7e82e40 c01387d3 c0155cb4 

       ffffffff ffffffff 00000001 00000000 c012007b 00010000 00000000 c0324ff5 

Call Trace:

 [<c01387d3>] worker_thread+0xca/0x2f0

 [<c0155cb4>] cache_reap+0x0/0x273

 [<c012007b>] default_wake_function+0x0/0xc

 [<c0324ff5>] schedule+0x44d/0x606

 [<c012007b>] default_wake_function+0x0/0xc

 [<c0138709>] worker_thread+0x0/0x2f0

 [<c013d79d>] kthread+0x69/0x91

 [<c013d734>] kthread+0x0/0x91

 [<c01041dd>] kernel_thread_helper+0x5/0xb

khelper       S F7E82D88  3268     4      3             5       (L-TLB)

f7feef6c 00000046 00000003 f7e82d88 cfdf6700 00000000 fc975a80 000f4cbc 

       f7e40600 f7e4078c e0db6e00 00000202 e0db6e50 f7e82d40 c01387d3 c013830d 

       ffffffff ffffffff 00000001 00000000 c012007b 00010000 00000000 c0324ff5 

Call Trace:

 [<c01387d3>] worker_thread+0xca/0x2f0

 [<c013830d>] __call_usermodehelper+0x0/0x43

 [<c012007b>] default_wake_function+0x0/0xc

 [<c0324ff5>] schedule+0x44d/0x606

 [<c012007b>] default_wake_function+0x0/0xc

 [<c0138709>] worker_thread+0x0/0x2f0Badness in local_bh_enable at kernel/softirq.c:141

 [<c012a30a>] local_bh_enable+0x3d/0x5f

 [<f897a11f>] bnx2_reg_rd_ind+0x11f/0x123 [bnx2]

 [<f897c133>] bnx2_set_remote_link+0x14/0x2b [bnx2]

 [<f897dc3e>] bnx2_poll+0x2c/0x16b [bnx2]

 [<c02d66d4>] poll_napi+0xc8/0x122

 [<c02d6792>] netpoll_poll_dev+0x48/0x59

 [<c02d6dbb>] netpoll_send_skb+0x29e/0x2ba

 [<f89e6156>] write_msg+0x156/0x16d [netconsole]

 [<f89e6000>] write_msg+0x0/0x16d [netconsole]

 [<c0124cc2>] __call_console_drivers+0x36/0x40

 [<c0125311>] release_console_sem+0xec/0x1b6

 [<c0125157>] vprintk+0x22d/0x29d

 [<c0138709>] worker_thread+0x0/0x2f0

 [<c0124f27>] printk+0xe/0x11

 [<c0143a6b>] __print_symbol+0x8a/0x93

 [<c0150a00>] __alloc_pages+0xb4/0x2b1

 [<c02c45b9>] alloc_skb+0x33/0xc5

 [<c02d6d26>] netpoll_send_skb+0x209/0x2ba

 [<c0138709>] worker_thread+0x0/0x2f0

 [<c01064c2>] show_trace+0x3a/0x6b

 [<c0106566>] show_stack+0x73/0x79

 [<c012107d>] show_state+0x3a/0x5f

 [<c0241fe1>] __handle_sysrq+0xd9/0x1b0

 [<c01b19ad>] write_sysrq_trigger+0x23/0x29

 [<c016e1bc>] vfs_write+0xb6/0xe2

 [<c016e286>] sys_write+0x3c/0x62

 [<c0327947>] syscall_call+0x7/0xb

Kernel panic - not syncing: include/linux/netpoll.h:94: spin_lock(net/core/netpoll.c:c19e1880) already locked by net/core/netpoll.c/99


net/core/netpoll.c:99: spin_trylock(net/core/netpoll.c:c19e1880) already locked by net/core/netpoll.c/99

net/core/netpoll.c:99: spin_trylock(net/core/netpoll.c:c19e1880) already locked by net/core/netpoll.c/99

net/core/netpoll.c:99: spin_trylock(net/core/netpoll.c:c19e1880) already locked by net/core/netpoll.c/99

net/core/netpoll.c:99: spin_trylock(net/core/netpoll.c:c19e1880) already locked by net/core/netpoll.c/99

------------[ cut here ]------------

kernel BUG at kernel/panic.c:75!

invalid operand: 0000 [#1]

Modules linked in: lp(U) ohci_hcd nfsd exportfs nfs lockd nfs_acl netconsole netdump md5 ipv6 parport_pc parport autofs4 sunrpc cpufreq_powersave loop button battery ac i5000_edac edac_mc hw_random bnx2 sr_mod dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod usb_storage uhci_hcd ehci_hcd ata_piix libata mptscsih mptsas mptspi mptscsi mptbase sd_mod scsi_mod

CPU:    0

EIP:    0060:[<c01243ca>]    Not tainted VLI

EFLAGS: 00010282   (2.6.9-78.0.12.EL) 

EIP is at panic+0x47/0x142

eax: 0000008b   ebx: 00000001   ecx: c0339206   edx: c0417fb0

esi: f7404000   edi: 026316eb   ebp: f20f5d00   esp: c0417fb8

ds: 007b   es: 007b   ss: 0068

Process rhts-system-inf (pid: 10506, threadinfo=c0417000 task=ed506170)

Stack: 00000001 c02cad7b c0330f41 c035cf0c 0000005e c035dc8c c19e1880 c035dc8c 

       00000063 0000012c 00000001 c0442a38 0000000a c012a289 f20f5cd0 00000046 

       f7404000 c010944e 

Call Trace:

 [<c02cad7b>] net_rx_action+0xbe/0x200

 [<c012a289>] __do_softirq+0x35/0x79

 [<c010944e>] do_softirq+0x46/0x4d

 =======================

 [<f897a11f>] bnx2_reg_rd_ind+0x11f/0x123 [bnx2]

 [<f897c133>] bnx2_set_remote_link+0x14/0x2b [bnx2]

 [<f897dc3e>] bnx2_poll+0x2c/0x16b [bnx2]

 [<c02d66d4>] poll_napi+0xc8/0x122

 [<c02d6792>] netpoll_poll_dev+0x48/0x59

 [<c02d6dbb>] netpoll_send_skb+0x29e/0x2ba

 [<f89e6156>] write_msg+0x156/0x16d [netconsole]

 [<f89e6000>] write_msg+0x0/0x16d [netconsole]

 [<c0124cc2>] __call_console_drivers+0x36/0x40

 [<c0125311>] release_console_sem+0xec/0x1b6

 [<c0125157>] vprintk+0x22d/0x29d

 [<c0138709>] worker_thread+0x0/0x2f0

 [<c0124f27>] printk+0xe/0x11

 [<c0143a6b>] __print_symbol+0x8a/0x93

 [<c0150a00>] __alloc_pages+0xb4/0x2b1

 [<c02c45b9>] alloc_skb+0x33/0xc5

 [<c02d6d26>] netpoll_send_skb+0x209/0x2ba

 [<c0138709>] worker_thread+0x0/0x2f0

 [<c01064c2>] show_trace+0x3a/0x6b

 [<c0106566>] show_stack+0x73/0x79

 [<c012107d>] show_state+0x3a/0x5f

 [<c0241fe1>] __handle_sysrq+0xd9/0x1b0

 [<c01b19ad>] write_sysrq_trigger+0x23/0x29

 [<c016e1bc>] vfs_write+0xb6/0xe2

 [<c016e286>] sys_write+0x3c/0x62

 [<c0327947>] syscall_call+0x7/0xb

Code: 42 c0 e8 47 a1 0c 00 68 c0 1f 42 c0 68 06 92 33 c0 e8 64 0b 00 00 83 c4 0c 83 3d 9c 48 44 c0 00 75 09 83 3d 98 48 44 c0 00 74 08 <0f> 0b 4b 00 29 92 33 c0 31 c0 e8 7b 91 ff ff 31 d2 b9 c0 1f 42 

CPU#0 is executing netdump.

poll_lock is locked, unable to take a dump!

rebooting in 5 seconds

<6>NETDEV WATCHDOG: eth0: transmit timed out

I have seen this on at least 3 machines. All of them use the bnx2 driver.

hp-dl785g5-01.rhts.bos.redhat.com
dell-pe1950-01.rhts.bos.redhat.com
dell-pe1950-01.rhts.englab.brq.redhat.com

# uname -ra
Linux hp-dl785g5-01.rhts.bos.redhat.com 2.6.9-78.ELlargesmp #1 SMP Wed Jul 9 16:03:59 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/modprobe.conf
alias eth0 bnx2
alias eth1 bnx2
alias scsi_hostadapter cciss
alias usb-controller ehci-hcd
alias usb-controller1 ohci-hcd
alias usb-controller2 uhci-hcd

Version-Release number of selected component (if applicable):
kernel-2.6.9-78.0.12.EL5
kernel-2.6.9-78.0.8.EL5
kernel-2.6.9-78.0.6.EL5
kernel-2.6.9-78.EL5

How reproducible:
always

Steps to Reproduce:
1. Configure the netdump client on the machine.
2. Run "while :; do echo t >/proc/sysrq-trigger; done" via the serial console.
3. SSH to the machine from a remote host.
  
Actual results:
Kernel panic or hang.

Expected results:
SSH works normally.

Comment 1 Qian Cai 2008-12-26 05:11:40 UTC

It is probably worth mentioning that this is not limited to SSH; any network operation may trigger it. For example:

Panic during NFS testing:
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=5664964

Panic during HTS network testing:
https://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=5661380

Comment 2 Qian Cai 2008-12-26 05:39:32 UTC

You may ask what the relationship is between this bug and
Bug 466113 - netdump fails when bnx2 has remote copper PHY - Badness in
local_bh_enable at kernel/softirq.c:141,
which has already been fixed in the previous 4.7.z kernel errata.

The answer is that they are not quite the same. Although netdump also fails in
this bug, the kernel should not panic or hang in the first place. In addition,
the fix for Bug 466113 does not appear to change the behaviour of this one.

Comment 3 Qian Cai 2008-12-26 05:39:59 UTC

Actually, it is a regression again RHEL 4.6 like Bug 466113 - netdump fails when
bnx2 has remote copper PHY - Badness in local_bh_enable at kernel/softirq.c:141,
because I have not seen such a problem with kernel-2.6.9-67.0.22.EL.

# uname -ra
Linux hp-dl785g5-01.rhts.bos.redhat.com 2.6.9-67.0.22.EL #1 Fri Jul 11 10:27:41
EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

# while :; do echo t >/proc/sysrq-trigger ; done
...

$ ssh root.bos.redhat.com
root.bos.redhat.com's password: 
...
[root@hp-dl785g5-01 ~]#

Comment 4 Qian Cai 2008-12-26 05:41:33 UTC

(In reply to comment #3)
> Actually, it is a regression again RHEL 4.6 like Bug 466113

Correction -- against RHEL 4.6.

Comment 5 Neil Horman 2008-12-30 01:01:27 UTC

I don't think this is a regression.  You may not see it in earlier kernels, but I think the problem is still there.  This looks like a combination of bz 474479 and bz 477202.  The fix for bz 466113 doesn't handle this case, in which netdump is running on one CPU and we handle an interrupt for the NIC on a different CPU.  That allows the softirq handler to run on the secondary CPU while netdump is working on the first, resulting in the oops of bz 474479.  The fix for bz 474479 is in kernel 2.6.9-78.22, but 477202 is still pending.  I'd suggest testing with at least that level of kernel (or the next build, when the fix for bz 477202 is in place).

Comment 6 Qian Cai 2008-12-30 06:09:52 UTC

So, this isn't a regression similar to,

Bug 461014 - netdump fails when bnx2 has remote copper PHY - Badness in
             local_bh_enable at kernel/softirq.c:141

introduced by bug 311531 - [Broadcom 4.7 feat] Update bnx2 to version 1.6.9 according to,

https://bugzilla.redhat.com/show_bug.cgi?id=461014#c24 ?

I ask because the "Badness" warnings in both bugs are at the same line of code. Anyway, let me know when you have a test kernel ready, and I can try it out.

Comment 7 Neil Horman 2008-12-30 16:42:30 UTC

Yes, that's my assertion.  If you look through the bz, that badness alert isn't really the problem.  The problem is caused by the fact that we run a softirq on one CPU while we are running netdump on another CPU.  I think the other bz's will correct this (if not, I'll certainly look closer).

I shouldn't need to build a test kernel.  Vivek should have the second patch integrated into the kernel on his next build, so you can just grab that.
Thanks!

Comment 8 Vivek Goyal 2009-01-05 14:55:46 UTC

Hi Cai,

The fix for 477202 has been included in 78.23.EL. I have released this kernel. Can you please test it again and see if you still see the problem?

Comment 9 Qian Cai 2009-01-06 04:30:23 UTC

Same problem with kernel-2.6.9-78.23.EL.

Badness in local_bh_enable at kernel/softirq.c:141

Call Trace:<ffffffff8013d44d>{local_bh_enable+90} <ffffffffa00ad032>{:bnx2:bnx2_reg_rd_ind+50} 
       <ffffffffa00af73a>{:bnx2:bnx2_poll+173} <ffffffff802b779b>{alloc_skb+92} 
       <ffffffffa00b3228>{:bnx2:bnx2_start_xmit+449} <ffffffff802c88f2>{netpoll_poll_dev+233} 
       <ffffffff802c87e7>{netpoll_send_skb+397} <ffffffffa017f169>{:netconsole:write_msg+361} 
       <ffffffff80138af8>{__call_console_drivers+68} <ffffffff80138d65>{release_console_sem+276} 
       <ffffffff80138ff0>{vprintk+498} <ffffffff80148533>{worker_thread+0} 
       <ffffffff8013909a>{printk+141} <ffffffff801117e7>{show_trace+426} 
       <ffffffff801118f0>{show_stack+241} <ffffffff8013553f>{show_state+482} 
       <ffffffff8023e54f>{__handle_sysrq+115} <ffffffff801b3699>{write_sysrq_trigger+43} 
       <ffffffff8017b772>{vfs_write+207} <ffffffff8017b85a>{sys_write+69} 
       <ffffffff801102f6>{system_call+126} 
NMI Watchdog detected LOCKUP, CPU=0, registers:
CPU 0 
Modules linked in: md5 ipv6 parport_pc lp parport netconsole netdump autofs4 sunrpc ds yenta_socket pcmcia_core cpufreq_powersave joydev loop button battery ac uhci_hcd ohci_hcd ehci_hcd bnx2 dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod cciss sd_mod scsi_mod
Pid: 5508, comm: sshd Not tainted 2.6.9-78.23.ELlargesmp
RIP: 0010:[<ffffffff801f1721>] <ffffffff801f1721>{__write_lock_failed+9}
RSP: 0018:000001041d977e38  EFLAGS: 00000087
RAX: ffffffff80526700 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 00000000000001bd RSI: 000001041d978000 RDI: ffffffff80526700
RBP: 000001042111f988 R08: 000000000000002b R09: 0000000001200011
R10: 0000000000000038 R11: 0000000000000000 R12: 00000108208a97f0
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000002a96a372a0(0000) GS:ffffffff80520180(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a968c6f70 CR3: 0000000000101000 CR4: 00000000000006e0
Process sshd (pid: 5508, threadinfo 000001041d976000, task 0000011c21ffc7f0)
Stack: ffffffff8031920b ffffffff8013783d 0000000000000246 0000000000000006 
       000001042111f9b8 000001042111f9c8 000001042111f9a0 00000110302ba040 
       0000002a96a37330 0000000000000000 
Call Trace:<ffffffff8031920b>{.text.lock.spinlock+113} <ffffffff8013783d>{copy_process+3725} 
       <ffffffff80137e1f>{do_fork+206} <ffffffff801102f6>{system_call+126} 
       <ffffffff8011066b>{ptregscall_common+103} 

Code: 81 38 00 00 00 01 75 f6 f0 81 28 00 00 00 01 0f 85 e2 ff ff 
Kernel panic - not syncing: nmi watchdog
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at panic:75
invalid operand: 0000 [1] SMP 
CPU 0 
Modules linked in: md5 ipv6 parport_pc lp parport netconsole netdump autofs4 sunrpc ds yenta_socket pcmcia_core cpufreq_powersave joydev loop button battery ac uhci_hcd ohci_hcd ehci_hcd bnx2 dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod cciss sd_mod scsi_mod
Pid: 5508, comm: sshd Not tainted 2.6.9-78.23.ELlargesmp
RIP: 0010:[<ffffffff80138496>] <ffffffff80138496>{panic+211}
RSP: 0018:ffffffff8047d6a8  EFLAGS: 00010086
RAX: 000000000000002c RBX: ffffffff8032c2de RCX: 0000000000000046
RDX: 0000000000039d44 RSI: 0000000000000046 RDI: ffffffff803f8480
RBP: ffffffff8047d858 R08: 0000000000000002 R09: ffffffff8032c2de
R10: 0000000000000000 R11: 0000ffff80413a20 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000002a96a372a0(0000) GS:ffffffff80520180(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a968c6f70 CR3: 0000000000101000 CR4: 00000000000006e0
Process sshd (pid: 5508, threadinfo 000001041d976000, task 0000011c21ffc7f0)
Stack: 0000003000000008 ffffffff8047d788 ffffffff8047d6c8 0000000000000013 
       0000000000000000 0000000000000046 0000000000039d18 0000000000000046 
       0000000000000002 ffffffff8032e99d 
Call Trace:<ffffffff801118f0>{show_stack+241} <ffffffff80111a1a>{show_registers+277} 
<ffffffff80111d21>{die_nmi+130} <ffffffff8011ddd1>{nmi_watchdog_tick+276} 
<ffffffff801125f2>{default_do_nmi+116} <ffffffff8011debb>{do_nmi+115} 
<ffffffff801111ff>{paranoid_exit+0} <ffffffff801f1721>{__write_lock_failed+9} 
 <EOE> <ffffffff8031920b>{.text.lock.spinlock+113} 
       <ffffffff8013783d>{copy_process+3725} <ffffffff80137e1f>{do_fork+206} 
       <ffffffff801102f6>{system_call+126} <ffffffff8011066b>{ptregscall_common+103} 
       

Code: 0f 0b 5e c8 32 80 ff ff ff ff 4b 00 31 ff e8 ab be fe ff e8 
RIP <ffffffff80138496>{panic+211} RSP <ffffffff8047d6a8>
CPU#0 is executing netdump.
CPU#1 is frozen.
CPU#2 is frozen.
CPU#3 is frozen.
CPU#4 is frozen.
CPU#5 is frozen.
CPU#6 is frozen.
CPU#7 is frozen.
CPU#8 is frozen.
CPU#9 is frozen.
CPU#10 is frozen.
CPU#11 is frozen.
CPU#12 is frozen.
CPU#13 is frozen.
CPU#14 is frozen.
CPU#15 is frozen.
CPU#16 is frozen.
CPU#17 is frozen.
CPU#18 is frozen.
CPU#19 is frozen.
CPU#20 is frozen.
CPU#21 is frozen.
CPU#22 is frozen.
CPU#23 is frozen.
CPU#24 is frozen.
CPU#25 is frozen.
CPU#26 is frozen.
CPU#27 is frozen.
CPU#28 is frozen.
CPU#29 is frozen.
CPU#30 is frozen.
CPU#31 is frozen.
poll_lock is locked, unable to take a dump!
rebooting in 5 seconds
Badness in netpoll_reset_locks at net/core/netpoll.c:864

Call Trace:<ffffffff802c9996>{netpoll_reset_locks+184} <ffffffffa01783ea>{:netdump:netpoll_netdump+44} 
       <ffffffffa017839a>{:netdump:netpoll_start_netdump+221}

Comment 10 Neil Horman 2009-01-06 14:40:50 UTC

Grr. OK, I think I see what's going on here.  It's not specifically a bnx2 problem; it's some other deadlock, whose likelihood of occurring is probably increased by the fact that bnx2 offers the opportunity to run softirqs from within its poll routine.  We probably need to fix the underlying deadlock that the NMI watchdog detected, but of course to do that we need to get a vmcore, so it's kind of a chicken-and-egg problem, because the deadlock happens while the poll_lock is held by bnx2.  The most direct fix, I think, is the patch attached below.  You'll still get the NMI panic of course (assuming that you don't need to run a softirq to trigger it), but you should get a vmcore with this patch.  Please give it a test and let me know.  Thanks!

Comment 11 Neil Horman 2009-01-06 14:41:25 UTC

Created attachment 328283
poll to keep bh's disabled locally while poll_lock is held
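
The patch itself is not reproduced in this report, so the following is only a userspace sketch of the idea suggested by the attachment title and by the traces above, where net_rx_action runs from do_softirq directly on top of bnx2_reg_rd_ind while poll_lock is already held. Every name in it (cpu_model, the model_* functions, scenario_deadlocks) is hypothetical and merely mirrors the kernel symbols seen in the traces; it assumes the driver's register read re-enables bottom halves on its way out, and it is not the kernel code or the attached patch.

```c
#include <stdbool.h>

/* Hypothetical single-CPU model; not kernel code. */
struct cpu_model {
    int  bh_disable_depth;  /* models the bottom-half disable nesting count */
    bool softirq_pending;   /* a network RX softirq is waiting to run */
    bool poll_lock_held;    /* models the netpoll poll_lock */
    bool deadlocked;        /* set if poll_lock is taken twice on this CPU */
    int  softirqs_run;      /* how many softirqs completed */
};

/* Models the softirq handler, which takes poll_lock with a blocking lock. */
static void model_net_rx_action(struct cpu_model *c)
{
    if (c->poll_lock_held) {
        /* Same CPU already owns the lock: a real spin_lock would never return. */
        c->deadlocked = true;
        return;
    }
    c->poll_lock_held = true;
    c->softirqs_run++;
    c->poll_lock_held = false;
}

static void model_local_bh_disable(struct cpu_model *c)
{
    c->bh_disable_depth++;
}

/* Pending softirqs run only when the outermost disable is dropped. */
static void model_local_bh_enable(struct cpu_model *c)
{
    c->bh_disable_depth--;
    if (c->bh_disable_depth == 0 && c->softirq_pending) {
        c->softirq_pending = false;
        model_net_rx_action(c);
    }
}

/* Models a driver poll routine whose register read disables and then
 * re-enables bottom halves (assumption based on the traces). */
static void model_driver_poll(struct cpu_model *c)
{
    model_local_bh_disable(c);
    /* ... indirect register read would happen here ... */
    model_local_bh_enable(c);
}

/* Models the netpoll path: trylock poll_lock, call the driver poll, unlock.
 * With keep_bh_disabled set, bottom halves stay disabled for the whole time
 * the lock is held, so the driver's inner enable cannot run a softirq. */
static bool model_poll_napi(struct cpu_model *c, bool keep_bh_disabled)
{
    if (c->poll_lock_held)
        return false;               /* trylock failed, nothing polled */
    if (keep_bh_disabled)
        model_local_bh_disable(c);
    c->poll_lock_held = true;
    model_driver_poll(c);
    c->poll_lock_held = false;
    if (keep_bh_disabled)
        model_local_bh_enable(c);   /* softirq runs here, lock already free */
    return true;
}

/* Returns true if the modeled CPU would deadlock on poll_lock. */
static bool scenario_deadlocks(bool keep_bh_disabled)
{
    struct cpu_model c = { .softirq_pending = true };
    model_poll_napi(&c, keep_bh_disabled);
    return c.deadlocked;
}
```

In this model, scenario_deadlocks(false) reports the self-deadlock, while scenario_deadlocks(true) does not, because the pending softirq is deferred until after poll_lock has been dropped. Deferring softirq work in this way may also delay normal packet processing while the console is busy.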

Comment 12 Qian Cai 2009-01-09 09:51:08 UTC

After applying the above patch, I can no longer reproduce the panic, so there is no VMCore.

Brew build:
https://brewweb.devel.redhat.com/taskinfo?taskID=1642784

Comment 13 Neil Horman 2009-01-09 12:06:52 UTC

Dang, I was still hoping we could trigger the original panic.  Oh well.  This is most likely the best course of action anyway.  I'll post this shortly.  Thanks!

Comment 14 Qian Cai 2009-01-09 17:04:51 UTC

It is probably worth mentioning that although there was no panic, I consistently see packet loss while running "echo t >/proc/sysrq-trigger" in a loop.

From the affected machine's serial console:
# while :; do echo t >/proc/sysrq-trigger; done

From another host:
$ ping hp-dl785g5-01.rhts.bos.redhat.com
...

I see a lot of packet loss here.

Comment 15 RHEL Program Management 2009-01-15 21:41:07 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 16 Vivek Goyal 2009-01-19 19:58:30 UTC

Committed in 79.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 17 Qian Cai 2009-02-01 12:29:51 UTC

I guess the packet loss in comment #14 still needs to be addressed eventually.

Bug 483445 - Packets Loss with Netdump

Comment 19 Jan Tluka 2009-04-30 11:12:47 UTC

The patch is in the -89.EL kernel.

Comment 22 errata-xmlrpc 2009-05-18 19:22:29 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html
