Bug 461014
Summary: | netdump fails when bnx2 has remote copper PHY - Badness in local_bh_enable at kernel/softirq.c:141 | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Flavio Leitner <fleitner> |
Component: | kernel | Assignee: | Neil Horman <nhorman> |
Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 4.7 | CC: | agospoda, akarlsso, dhoward, fluo, lmcilroy, ltroan, qcai, tao, vmayatsk |
Target Milestone: | rc | Keywords: | ZStream |
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2009-05-18 19:10:14 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 466113 | ||
Attachments: |
Per comment #1, a test package was built on an internal server on 9/03 -- it is not available yet for external consumption. So it appears this will be in 4.8 and is being considered for the 4.7.z stream. Can we get the bug updated to ASSIGNED status; it's still in NEW state.

Everybody slow down. Larry, I'm not sure why you think this will be in 4.8; I've not seen anything go up to rhkl about it. The only build we have is Flavio's build in brew. That doesn't mean this is going into 4.8, nor does it imply a backport to 4.7.z. As for the technical merit of the patch, I guess it doesn't hurt anything, but I don't really see a need for it either. Yes, we get a badness dump, but that's just a WARN_ON, not a huge deal, and it's warning us that we're calling an unlock function that can sleep from a context that isn't able to sleep. That rescheduling happens in preempt_schedule, and in that function we immediately check irqs_disabled and return if that's true. Since we got the WARN_ON in local_bh_enable, we should never actually schedule, so practically speaking it's safe. Is there actually a problem here? Are we failing to capture core dumps, or are we just seeing the warning on the console? If it's the latter, then we can safely ignore this bug. If it's the former, then let's figure out why we're hanging before we do anything else.

I agree that the WARN_ON message can sometimes be ignored, but it usually indicates a problem, and in this case we can't capture core dumps because the system hangs after the WARN_ON messages. After applying the proposed patch, which doesn't invoke softirqs, we got a complete core dump as expected. Flavio

"The problem goes away" isn't a good enough reason to take a patch. It says nothing about our understanding of the problem. We may have a timing issue elsewhere in the driver. Do you have a serial console on this box? 
If we were panicking because we were scheduling in an interrupt, I would expect to see us produce a "bad: scheduling while atomic" message when we called schedule, and to dump the requisite stack. If we do, then I'm worried about how we got into a state in which we tried to preempt the kernel while in an interrupt context (which is what we should fix). If we don't see that message, then we have likely deadlocked somewhere else, and we should use sysrq-t (if the system is still responsive to it) or the NMI watchdog to determine where we are deadlocking and solve that problem. I'm not trying to assert that your patch is particularly bad. On the contrary, it's probably a fine change to make. But I can't take a patch that just makes the problem go away without knowing how it fixes the problem in the first place. Can you please try to attach a serial console to the box and see if we get that secondary panic that I mentioned above? Thanks!

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

I'm attaching the serial console log from another system reproducing it. If you don't do anything, the console stays stopped for more than 5 minutes; then if you send sysrq-t nothing happens (you can see telnet send the brk command) until the NMI watchdog comes in and prints the second and third backtraces at once. The system is ibm-hs21-8853-1.gsslab.rdu.redhat.com and it is currently reserved for fleite/olive, so check with him before doing anything there. 
Flavio

This event sent from IssueTracker by fleitner issue 214359 it_file 160919

My initial thought is still that this is a recursive call to netpoll (first netdump, then netconsole) and that's why you are stuck. My first look at the new capture from comment #13 seems to indicate that is the case as well, but I'll look at it more closely to be sure.

Interesting. I agree with Andy; this rather smells like we're taking a recursive spinlock of some sort, but given the traces from the NMI watchdog, it appears that we're getting stuck on the recursive taking of sysrq_key_table_lock in __handle_sysrq, which I'm guessing arose from the fact that you hit sysrq-t five minutes after you started the dump process. That would in turn suggest that prior to the issuance of the sysrq-t, the system was in fact _not_ deadlocked (I assume the NMI never triggers if the system is just left to sit on its own?). If that's the case, it would seem that our cpus are either: a) spinning, doing no useful work somewhere, or b) trying to do useful work but not accomplishing anything. Have we tried to capture a tcpdump of the netdump happening from the netdump server? Given the trace above, I'm starting to wonder if we're sending out some netdump frames but never managing to receive anything on the dumping client. Flavio, can you grab a tcpdump from the netdump server while this system is dumping? It would be nice to see if we get any frames from the netdump client, and if so, when we stop getting them. Thanks! 
This is the messages log on the other end (10.10.56.62):

Oct 2 14:13:44 bl40p-1 kernel: device eth0 entered promiscuous mode
Oct 2 14:13:47 bl40p-1 sshd(pam_unix)[13566]: session opened for user netdump by (uid=0)
Oct 2 14:13:48 bl40p-1 sshd(pam_unix)[13566]: session closed for user netdump
Oct 2 14:14:46 bl40p-1 netdump[6266]: Got too many timeouts in handshaking, ignoring client 10.10.56.163
Oct 2 14:14:49 bl40p-1 netdump[6266]: Got too many timeouts waiting for SHOW_STATUS for client 10.10.56.163, rebooting it
Oct 2 14:16:53 bl40p-1 kernel: device eth0 left promiscuous mode
Oct 2 14:17:04 bl40p-1 sshd(pam_unix)[13570]: session opened for user root by (uid=0)
Oct 2 14:17:16 bl40p-1 sshd(pam_unix)[13570]: session closed for user root

The traffic is attached. netdump server: 10.10.56.62, crash system: 10.10.56.163. Flavio

This event sent from IssueTracker by fleitner issue 214359 it_file 161029

Thank you. So, I'm looking at your tcpdump and a few things stand out to me:

1) Frames 111, 197, 198, 199 & 200 show the startup and usage of netconsole correctly. The source port is 6664 (which I assume you specified with the LOCALPORT option), and everything seems to work well.

2) In frame 228 we seem to have the start of a netdump, except that the contained data looks all wrong. We send from local port 6666 instead of 6664 (which is correct for netdump), except it contains data that looks like log data rather than the netdump reply header from the client with the REPLY_START_NETDUMP command in the header.

3) Frames 229, 230, etc.: the netdump server begins acting as though it _has_ received a REPLY_START_NETDUMP command, since it sends back a COMM_START_NETDUMP_ACK message, which times out, and it then begins to send COMM_HELLO messages every timeout period thereafter.

So from this I think we can conclude that we at least got a netdump start message from the bnx2-based system, but the tcpdump on the netdump server never saw it (even though the netdump server itself did). 
It would be nice if we could figure out why that happened. Can you check the message log on the netdump server to see if any odd messages about clients appeared? Is it possible that a firewall on the client or server is dropping frames to/from port 6666 or some such? It seems really odd to me that we have a properly working netconsole, but the netdump startup message gets oddly dropped in the tcpdump. Also, what version of the netdump-server are you using? I'm looking at 0.7.16 sources, and I couldn't find any server version info in the bz or it.

netdump setup on the client side:

# grep -v '^#' /etc/sysconfig/netdump
NETDUMPADDR=10.10.56.62

The netdump server messages log is available in my previous comment #16.

# rpm -q netdump-server
netdump-server-0.7.16-14

Flavio

netpoll_netdump()
  ...
  netpoll_reset_locks(&np);                            <--- reset poll_lock
  netdump_startup_handshake()
    send_netdump_msg()
      netpoll_send_udp()
        netpoll_send_skb()
          netpoll_poll_dev()
            poll_napi()
              spin_trylock(&npinfo->poll_lock)         <--- hold poll_lock
              dev->poll() => bnx2:bnx2_poll()
                ...
                :bnx2:bnx2_reg_rd_ind()                <--- enables softirqs
                  do_softirq()
                    net_rx_action()
                      local_irq_enable()               <--- watchdog still works
                      have = netpoll_poll_lock(dev);
                        spin_lock(&ndw->npinfo->poll_lock);  <--- deadlock

The watchdog still works, as local interrupts are enabled at this point, but the CPU is stuck there. Triggering sysrq-t does:

__handle_sysrq()
  spin_lock_irqsave(&sysrq_key_table_lock, flags);     <--- deadlock

here, because we had used sysrq to start the crash; but now the interrupts are disabled, so the watchdog is able to come in and show the backtrace we are seeing. My 0.02. Flavio

Created attachment 319285 [details]
patch to disable softirqs in netpoll
Yup, that looks like it. Nicely done. And there are quite a few drivers which use spin_lock_bh in their ->poll paths; it's interesting that we've not seen this happen before. Regardless, it seems like the appropriate fix is to disable softirqs in poll_napi, so that we are guaranteed not to have net_rx_action run on that cpu while the poll is taking place. Please test out the attached patch and confirm that it solves the problem. Thanks!
Created attachment 319290 [details]
log from patch id=319285
The badness still happens, as the interrupts have been off since
netpoll_start_netdump(); it then gets into an endless loop.
attaching the serial console log.
Flavio
I'm not worried about the badness warning yet. You say it loops endlessly, but it appears that it's sending netdump packets (which is why you keep getting the warnings). Are you getting a core on the server?

Created attachment 319294 [details]
traffic dump of the netdump session
I let it run for longer; then the netdump server resets the client.
See the messages log on the server side:
Oct 2 21:17:24 bl40p-1 netdump[6266]: Got too many timeouts waiting for memory page for client 10.10.56.163, ignoring it
Oct 2 21:17:27 bl40p-1 netdump[6266]: Got too many timeouts waiting for SHOW_STATUS for client 10.10.56.163, rebooting it
Oct 2 21:17:27 bl40p-1 netdump[6266]: Got unexpected packet type 3 from ip 10.10.56.163
Oct 2 21:17:35 bl40p-1 last message repeated 123 times
Oct 2 21:17:37 bl40p-1 netdump[6266]: Got unexpected packet type 12 from ip 10.10.56.163
# pwd
/var/crash
# ls -la 10.10.56.163-2008-10-02-21\:16/
total 32
drwx------ 2 netdump netdump 4096 Oct 2 21:16 .
drwxr-xr-x 6 netdump netdump 4096 Oct 2 21:16 ..
-rw------- 1 netdump netdump 114 Oct 2 21:16 log
-rw------- 1 netdump netdump 4096 Oct 2 21:16 vmcore-incomplete
# file 10.10.56.163-2008-10-02-21\:16/vmcore-incomplete
vmcore-incomplete: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'nux'
and on client side:
<snip>
Badness in local_bh_enable at kernel/softirq.c:141
Call Trace:<ffffffff8013d659>{local_bh_enable+70} <ffffffff802c9783>{netpoll_poll_dev+242}
<ffffffff802c965e>{netpoll_send_skb+340} <ffffffffa02185ac>{:netdump:netpoll_netdump+494}
<ffffffff8023eaac>{sysrq_handle_crash+0} <ffffffff8023eaac>{sysrq_handle_crash+0}
<ffffffffa021839a>{:netdump:netpoll_start_netdump+221}
netdump: rebooting in 3 seconds.
< snip, here the client reboots>
attaching the traffic dump.
Flavio
Yeah, the upstream version added several spin_lock_bh clauses; that's why it worked before. So I'm looking at this code, and while I understand in the general case why we don't disable bottom halves for netpoll, I really don't see why we don't just disable them in their entirety for netdump specifically. We don't have any need for bottom halves while executing a netdump, any more than we need hard interrupts. I'm attaching a second patch for you to try, which should simplify all of this greatly.

Created attachment 319400 [details]
patch to disable softirqs entirely during netdump operation
I'm not sure that patch is perfect -- I see the 'badness' errors on rhel4 using netconsole too. I don't see them with basically the same driver on rhel5, so we must have an _irqsave where we have a _bh in rhel5 (or something similar) in the netpoll and netdump paths.

Neil, I think the problem isn't with softirqs anymore. The traffic dump shows some traffic going on between client and server, but it still gets timed-out errors. The fact that my initial patch worked also indicates that your patch should have done the same, yet it still gets the badness messages going on. It seems to me that bnx2_poll() is generating so many badness warning messages that they are stealing time from the real work, causing the timed-out errors on the server side. Following this idea I did one more try, now without the serial console, and it seems better:

[root@bl40p-1 10.10.56.163-2008-10-02-23:53]# ls -la
total 185408
drwx------ 2 netdump netdump       4096 Oct 3 06:04 .
drwxr-xr-x 7 netdump netdump       4096 Oct 2 23:53 ..
-rw------- 1 netdump netdump         41 Oct 3 06:04 log
-rw------- 1 netdump netdump 1073467392 Oct 3 06:04 vmcore

Flavio

In response to comment 27: The _irqsave is in write_msg, which exists in both RHEL4 and RHEL5. Everything works fine in netconsole; it's netdump that's the problem. We can't move that irqsave to poll_napi, because we don't want to create extra latency in the fast, nominal receive path. That's why I wanted to disable softirqs for all of netdump's operation. That way we would never wind up in net_rx_action while holding the poll lock on the same cpu. In response to comment #28, you say that you're still getting badness errors with my patch in comment 26? You absolutely shouldn't be, unless there is an unbalanced local_bh_enable somewhere. Do you have the log from the most recent kernel which tested my patch from comment 26? It seems, however, that despite the timeouts and badness messages you got a complete vmcore; is that correct? 
(In reply to comment #29)
> In response to comment 27: The _irqsave is in write_msg, which exists in both
> RHEL4 and RHEL5. Everything works fine in netconsole; it's netdump that's the
> problem. We can't move that irqsave to poll_napi, because we don't want to
> create extra latency in the fast, nominal receive path. That's why I wanted
> to disable softirqs for all of netdump's operation. That way we would never
> wind up in net_rx_action while holding the poll lock on the same cpu.

Netconsole over a bonding interface (when that bonding interface contains a bnx2-based card) spews this message on rhel4.

Actually, that's a good point. local_bh_enable issues a WARN_ON if local irqs are disabled. write_msg issues a local_save_flags, but never actually disables irqs. So we should never get this badness message when netconsole is running. That's in keeping with what we're seeing in this bug (netconsole works fine, but netdump doesn't). If you see this message spewed with every netconsole packet sent over a bnx2 card in rhel4 from the bonding interface, then something is disabling interrupts somewhere and not re-enabling them properly. I don't see anything that's disabling from write_msg down through the bonding netpoll xmit routine or in the bnx2 xmit routine.

Looking more closely, I think I see the problem. The patch I gave you is doing its job properly, and is keeping softirqs from running, but the WARN_ON is checked unconditionally when we re-enable irqs, so we still get the message spew. What I don't understand is why they keep coming. Once we enter crashdump_mode, netconsole should suppress all messages from a WARN_ON, but we continue to get them (or are you seeing these only on the serial console)? Regardless, the other thing that's bothering me here is the frequency at which we get these messages. The only path that I can see in bnx2 that calls spin_unlock_bh is through bnx2_phy_int. 
Why are we getting so many phy events when we take a netdump? Is there something actually happening on the phy when we go into netdump that we need to query it on every poll (i.e. do we check the phy on every napi poll when operating normally as well)? Or is something out of sync in the driver which inadvertently drops us into this phy-checking clause when we trigger a netdump? I wonder if the best thing to do here isn't to add a condition to the WARN_ON like this:

WARN_ON(!softirq_count() && irqs_disabled())

That way we would only get the warning printed in the event that irqs were disabled when we _actually_ re-enabled softirqs, rather than unilaterally. Flavio, can you add that to the patch and retest?

re #29:
- my previous comments are about the last results in comment #23. I was able to get a vmcore by removing the serial console. I'll try the last patch asap.
re #30:
- the messages are indeed on the serial console and on the tty ones.
- bnx2_phy_int() is frequently called by poll_napi().
- it's probably a heartbeat timer expiring.
Flavio

Neil,

--- linux-2.6.9/drivers/net/netdump.c.orig	2008-10-03 14:14:33.000000000 -0400
+++ linux-2.6.9/drivers/net/netdump.c	2008-10-03 14:15:09.000000000 -0400
@@ -401,6 +401,7 @@ static asmlinkage void netpoll_netdump(s
 	while (netdump_mode) {
 		local_irq_disable();
+		local_bh_disable();
 		Dprintk("main netdump loop: polling controller ...\n");
 		netpoll_poll(&np);

This patch won't work, because before that we call netdump_startup_handshake() and it will hang there as before. I would suggest trying this patch https://bugzilla.redhat.com/attachment.cgi?id=319285 plus the one in comment #32, but I'm not sure this will fix Andy's issue. Flavio

Mine is a printk issue:

asmlinkage int printk(const char *fmt, ...)
{
	va_list args;
	int r;

	va_start(args, fmt);
	r = vprintk(fmt, args);
	va_end(args);

	return r;
}

asmlinkage int vprintk(const char *fmt, va_list args)
{
	unsigned long flags;
	int printed_len;
	char *p;
	static char printk_buf[1024];
	static int log_level_unknown = 1;

	if (unlikely(oops_in_progress))
		zap_locks();

	/* This stops the holder of console_sem just where we want him */
---->	spin_lock_irqsave(&logbuf_lock, flags);

Since printk is always in the stack in my issues, it's the problem. Looks like we need two different solutions....

Created attachment 319686 [details]
patch to disable softirqs during netdump, disable hard irqs sooner and suppress superfluous warning messages
This updated patch works quite well for me on the ibm blade in question. I've captured several dumps successfully with it.
Committed in 78.14.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/

I tested 2.6.9-78.ELsmp (has the problem) and 2.6.9-88.ELsmp (does not):

[root@dell-per805-01 ~]# uname -a
Linux dell-per805-01.rhts.bos.redhat.com 2.6.9-78.ELsmp #1 SMP Wed Jul 9 15:46:26 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
[root@dell-per805-01 ~]# echo c > /proc/sysrq-trigger
SysRq : Crashing the kernel by request
...
< netdump activated - performing handshake with the server. >
Badness in local_bh_enable at kernel/softirq.c:141
Call Trace:<ffffffff8013d595>{local_bh_enable+70} <ffffffffa00f2032>{:bnx2:bnx2_reg_rd_ind+50}
<ffffffffa00f473a>{:bnx2:bnx2_poll+173} <ffffffff801f016b>{vsnprintf+1406}
<ffffffff802c902c>{netpoll_poll_dev+223} <ffffffff802c8f1a>{netpoll_send_skb+340}
<ffffffffa01984f2>{:netdump:netpoll_netdump+308} <ffffffff8011f63c>{flat_send_IPI_mask+0}
<ffffffff8023e49c>{sysrq_handle_crash+0} <ffffffffa019839a>{:netdump:netpoll_start_netdump+221}

[root@dell-per805-01 ~]# uname -a
Linux dell-per805-01.rhts.bos.redhat.com 2.6.9-88.ELsmp #1 SMP Mon Apr 13 19:23:31 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
[root@dell-per805-01 ~]# echo c > /proc/sysrq-trigger
SysRq : Crashing the kernel by request
...
< netdump activated - performing handshake with the server. >
NETDUMP START!
< handshake completed - listening for dump requests. >
...

The testing results are the same on two different machines, dell-per805-01.rhts.bos.redhat.com and hp-dl585g2-01.rhts.bos.redhat.com.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1024.html

Adding issue 304632.
Created attachment 315645 [details] Patch fixing spin lock

Description of problem:

During the crash dump the local interrupts are disabled, and the bnx2 driver tries to read a register doing the following:

bnx2_reg_rd_ind(struct bnx2 *bp, u32 offset)
{
	u32 val;

	spin_lock_bh(&bp->indirect_lock);
	REG_WR(bp, BNX2_PCICFG_REG_WINDOW_ADDRESS, offset);
	val = REG_RD(bp, BNX2_PCICFG_REG_WINDOW);
===>	spin_unlock_bh(&bp->indirect_lock);
	return val;
}

But spin_unlock_bh() can cause preemption, and so there is a warning there in case interrupts are disabled. This should happen only with bnx2 boards with a remote copper PHY that triggered the STATUS_ATTN_BITS_TIMER_ABORT event. The code path is:

bnx2_phy_int(struct bnx2 *bp)
{
	...
	if (bnx2_phy_event_is_set(bp, STATUS_ATTN_BITS_TIMER_ABORT))
=====>		bnx2_set_remote_link(bp);
	...

and

bnx2_set_remote_link(struct bnx2 *bp)
{
	...
=====>	evt_code = REG_RD_IND(bp, bp->shmem_base + BNX2_FW_EVT_CODE_MB);

Here is the console output reproducing the problem:

[root@ ~]# echo c > /proc/sysrq-trigger

< ....Client hangs after triggering the dump here .. 
> Here is the serial console output for the client (x3755):

SysRq : Crashing the kernel by request
Unable to handle kernel NULL pointer dereference at 0000000000000000
RIP: <ffffffff8023e104>{sysrq_handle_crash+0}
PML4 25405c067 PGD 25405a067 PMD 0
Oops: 0002 [1] SMP
CPU 1
Modules linked in: netconsole netdump nfsd exportfs lockd nfs_acl parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ipmi_devintf ipmi_si ipmi_msghandler ds yenta_socket pcmcia_core cpufreq_powersave ib_srp ib_sdp ib_ipoib md5 ipv6 rdma_ucm rdma_cm iw_cm ib_addr ib_umad ib_ucm ib_uverbs ib_cm ib_sa ib_mad ib_core zlib_deflate dm_mirror dm_multipath dm_mod joydev button battery ac ohci_hcd ehci_hcd k8_edac edac_mc e1000 bnx2 ext3 jbd qla2400 aacraid qla2xxx scsi_transport_fc sd_mod scsi_mod
Pid: 26687, comm: bash Not tainted 2.6.9-70.ELsmp
RIP: 0010:[<ffffffff8023e104>] <ffffffff8023e104>{sysrq_handle_crash+0}
RSP: 0018:00000100bc5ffeb0 EFLAGS: 00010012
RAX: 000000000000001f RBX: ffffffff80413380 RCX: ffffffff803f59a8
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000063
RBP: 0000000000000063 R08: ffffffff803f59a8 R09: ffffffff80413380
R10: 0000000100000000 R11: ffffffff8011f5fc R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000006 R15: 0000000000000246
FS: 0000002a9557a3e0(0000) GS:ffffffff8050c500(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 00000000dffc0000 CR4: 00000000000006e0
Process bash (pid: 26687, threadinfo 00000100bc5fe000, task 000001015d43c030)
Stack: ffffffff8023e2c7 0000000000000000 00000100bc5fe000 0000000000000002
       00000100bc5fff50 0000000000000002 0000002a983fb000 0000000000000000
       ffffffff801b391d 0000000000000246
Call Trace:<ffffffff8023e2c7>{__handle_sysrq+115} <ffffffff801b391d>{write_sysrq_trigger+43}
<ffffffff8017bc46>{vfs_write+207} <ffffffff8017bd2e>{sys_write+69}
<ffffffff801102b6>{system_call+126}
Code: c6 04 25 00 00 00 00 00 c3 e9 78 ef f3 ff e9 01 3e f4 ff 48
RIP <ffffffff8023e104>{sysrq_handle_crash+0} RSP <00000100bc5ffeb0>
CR2: 0000000000000000
CPU#0 is frozen.
CPU#1 is executing netdump.
CPU#2 is frozen.
CPU#3 is frozen.
CPU#4 is frozen.
CPU#5 is frozen.
CPU#6 is frozen.
CPU#7 is frozen.
< netdump activated - performing handshake with the server. >
Badness in local_bh_enable at kernel/softirq.c:141
Call Trace:<ffffffff8013d54d>{local_bh_enable+70} <ffffffffa00e7032>{:bnx2:bnx2_reg_rd_ind+50}
<ffffffffa00e9739>{:bnx2:bnx2_poll+173} <ffffffff801f0007>{vsnprintf+1406}
<ffffffff802c89ac>{netpoll_poll_dev+223} <ffffffff802c889a>{netpoll_send_skb+340}
<ffffffffa02f84f2>{:netdump:netpoll_netdump+308} <ffffffff8011f5fc>{flat_send_IPI_mask+0}
<ffffffff8023e104>{sysrq_handle_crash+0} <ffffffffa02f839a>{:netdump:netpoll_start_netdump+221}

How reproducible: Always

Additional info: Attaching a patch fixing the lock, with good feedback.