Bug 477945
| Field | Value |
|---|---|
| Summary | Kernel Panic with Bnx2 - Badness in local_bh_enable at kernel/softirq.c:141 |
| Product | Red Hat Enterprise Linux 4 |
| Component | kernel |
| Version | 4.7 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Target Milestone | rc |
| Keywords | Regression |
| Reporter | Qian Cai <qcai> |
| Assignee | Neil Horman <nhorman> |
| QA Contact | Martin Jenner <mjenner> |
| CC | agospoda, jtluka, mgahagan, vgoyal |
| Doc Type | Bug Fix |
| Last Closed | 2009-05-18 19:22:29 UTC |
Description (Qian Cai, 2008-12-26 04:47:44 UTC):
It is probably worth mentioning that this is not limited to SSH; any network operation may trigger it. For example:

Panic while NFS testing: http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=5664964

Panic while HTS network testing: https://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=5661380

You may ask what the relationship is between this and Bug 466113 (netdump fails when bnx2 has remote copper PHY - Badness in local_bh_enable at kernel/softirq.c:141), which has already been fixed in the previous 4.7.z kernel errata. The answer is: not quite the same. Although netdump in the former bug is also not working, the kernel should not panic or hang in the first place. In addition, the fix for the latter bug does not appear to change the behaviour of this one. Actually, it is a regression again RHEL 4.6 like Bug 466113, because I have not seen such a problem with kernel-2.6.9-67.0.22.EL.

```
# uname -ra
Linux hp-dl785g5-01.rhts.bos.redhat.com 2.6.9-67.0.22.EL #1 Fri Jul 11 10:27:41 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

# while :; do echo t >/proc/sysrq-trigger ; done
...
```

```
$ ssh root.bos.redhat.com
root.bos.redhat.com's password:
...
[root@hp-dl785g5-01 ~]#
```

---

(In reply to comment #3)
> Actually, it is a regression again RHEL 4.6 like Bug 466113

Correction: against RHEL 4.6.

---

I don't think this is a regression. You may not see it in earlier kernels, but I think the problem is still there. This looks like a combination of bz's 474479 and 477202. The fix for bz 466113 doesn't handle this case, in which netdump is running on one CPU and we handle an interrupt on a different CPU for the NIC. That allows the softirq handler to run on the secondary CPU while netdump is working on the first, resulting in the oops of bz 474479. The fix for bz 474479 is in kernel 2.6.9-78.22, but 477202 is still pending. I'd suggest testing with at least that level of kernel (or the next build, when the fix for bz 477202 is in place).

---

So, this isn't a regression similar to Bug 461014 (netdump fails when bnx2 has remote copper PHY - Badness in local_bh_enable at kernel/softirq.c:141), introduced by Bug 311531 ([Broadcom 4.7 feat] Update bnx2 to version 1.6.9), according to https://bugzilla.redhat.com/show_bug.cgi?id=461014#c24? Because I have seen that their "badness" is on the same line of code. Anyway, let me know when you have a test kernel ready, and then I can try it out.

---

Yes, that's my assertion. If you look through the bz, that badness alert isn't really the problem. The problem is caused by the fact that we run a softirq on one CPU while we are running netdump on another CPU. I think the other bz's will correct this (if not, I'll certainly look closer). I shouldn't need to build a test kernel. Vivek should have the second patch integrated into the kernel on his next build, so you can just grab that. Thanks!

---

Hi Cai, the fix for 477202 has been included in 78.23.EL. I have released this kernel. Can you please test it again and see if you still see the problem?

---

Same problem with kernel-2.6.9-78.23.EL:
```
Badness in local_bh_enable at kernel/softirq.c:141
Call Trace:<ffffffff8013d44d>{local_bh_enable+90} <ffffffffa00ad032>{:bnx2:bnx2_reg_rd_ind+50}
       <ffffffffa00af73a>{:bnx2:bnx2_poll+173} <ffffffff802b779b>{alloc_skb+92}
       <ffffffffa00b3228>{:bnx2:bnx2_start_xmit+449} <ffffffff802c88f2>{netpoll_poll_dev+233}
       <ffffffff802c87e7>{netpoll_send_skb+397} <ffffffffa017f169>{:netconsole:write_msg+361}
       <ffffffff80138af8>{__call_console_drivers+68} <ffffffff80138d65>{release_console_sem+276}
       <ffffffff80138ff0>{vprintk+498} <ffffffff80148533>{worker_thread+0}
       <ffffffff8013909a>{printk+141} <ffffffff801117e7>{show_trace+426}
       <ffffffff801118f0>{show_stack+241} <ffffffff8013553f>{show_state+482}
       <ffffffff8023e54f>{__handle_sysrq+115} <ffffffff801b3699>{write_sysrq_trigger+43}
       <ffffffff8017b772>{vfs_write+207} <ffffffff8017b85a>{sys_write+69}
       <ffffffff801102f6>{system_call+126}
NMI Watchdog detected LOCKUP, CPU=0, registers:
CPU 0
Modules linked in: md5 ipv6 parport_pc lp parport netconsole netdump autofs4 sunrpc ds yenta_socket pcmcia_core cpufreq_powersave joydev loop button battery ac uhci_hcd ohci_hcd ehci_hcd bnx2 dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod cciss sd_mod scsi_mod
Pid: 5508, comm: sshd Not tainted 2.6.9-78.23.ELlargesmp
RIP: 0010:[<ffffffff801f1721>] <ffffffff801f1721>{__write_lock_failed+9}
RSP: 0018:000001041d977e38  EFLAGS: 00000087
RAX: ffffffff80526700 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 00000000000001bd RSI: 000001041d978000 RDI: ffffffff80526700
RBP: 000001042111f988 R08: 000000000000002b R09: 0000000001200011
R10: 0000000000000038 R11: 0000000000000000 R12: 00000108208a97f0
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000002a96a372a0(0000) GS:ffffffff80520180(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a968c6f70 CR3: 0000000000101000 CR4: 00000000000006e0
Process sshd (pid: 5508, threadinfo 000001041d976000, task 0000011c21ffc7f0)
Stack: ffffffff8031920b ffffffff8013783d 0000000000000246 0000000000000006
       000001042111f9b8 000001042111f9c8 000001042111f9a0 00000110302ba040
       0000002a96a37330 0000000000000000
Call Trace:<ffffffff8031920b>{.text.lock.spinlock+113} <ffffffff8013783d>{copy_process+3725}
       <ffffffff80137e1f>{do_fork+206} <ffffffff801102f6>{system_call+126}
       <ffffffff8011066b>{ptregscall_common+103}
Code: 81 38 00 00 00 01 75 f6 f0 81 28 00 00 00 01 0f 85 e2 ff ff
Kernel panic - not syncing: nmi watchdog
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at panic:75
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: md5 ipv6 parport_pc lp parport netconsole netdump autofs4 sunrpc ds yenta_socket pcmcia_core cpufreq_powersave joydev loop button battery ac uhci_hcd ohci_hcd ehci_hcd bnx2 dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod cciss sd_mod scsi_mod
Pid: 5508, comm: sshd Not tainted 2.6.9-78.23.ELlargesmp
RIP: 0010:[<ffffffff80138496>] <ffffffff80138496>{panic+211}
RSP: 0018:ffffffff8047d6a8  EFLAGS: 00010086
RAX: 000000000000002c RBX: ffffffff8032c2de RCX: 0000000000000046
RDX: 0000000000039d44 RSI: 0000000000000046 RDI: ffffffff803f8480
RBP: ffffffff8047d858 R08: 0000000000000002 R09: ffffffff8032c2de
R10: 0000000000000000 R11: 0000ffff80413a20 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000002a96a372a0(0000) GS:ffffffff80520180(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a968c6f70 CR3: 0000000000101000 CR4: 00000000000006e0
Process sshd (pid: 5508, threadinfo 000001041d976000, task 0000011c21ffc7f0)
Stack: 0000003000000008 ffffffff8047d788 ffffffff8047d6c8 0000000000000013
       0000000000000000 0000000000000046 0000000000039d18 0000000000000046
       0000000000000002 ffffffff8032e99d
Call Trace:<ffffffff801118f0>{show_stack+241} <ffffffff80111a1a>{show_registers+277}
       <ffffffff80111d21>{die_nmi+130} <ffffffff8011ddd1>{nmi_watchdog_tick+276}
       <ffffffff801125f2>{default_do_nmi+116} <ffffffff8011debb>{do_nmi+115}
       <ffffffff801111ff>{paranoid_exit+0} <ffffffff801f1721>{__write_lock_failed+9}
       <EOE> <ffffffff8031920b>{.text.lock.spinlock+113} <ffffffff8013783d>{copy_process+3725}
       <ffffffff80137e1f>{do_fork+206} <ffffffff801102f6>{system_call+126}
       <ffffffff8011066b>{ptregscall_common+103}
Code: 0f 0b 5e c8 32 80 ff ff ff ff 4b 00 31 ff e8 ab be fe ff e8
RIP <ffffffff80138496>{panic+211} RSP <ffffffff8047d6a8>
CPU#0 is executing netdump.
CPU#1 is frozen.  CPU#2 is frozen.  CPU#3 is frozen.  CPU#4 is frozen.
CPU#5 is frozen.  CPU#6 is frozen.  CPU#7 is frozen.  CPU#8 is frozen.
CPU#9 is frozen.  CPU#10 is frozen. CPU#11 is frozen. CPU#12 is frozen.
CPU#13 is frozen. CPU#14 is frozen. CPU#15 is frozen. CPU#16 is frozen.
CPU#17 is frozen. CPU#18 is frozen. CPU#19 is frozen. CPU#20 is frozen.
CPU#21 is frozen. CPU#22 is frozen. CPU#23 is frozen. CPU#24 is frozen.
CPU#25 is frozen. CPU#26 is frozen. CPU#27 is frozen. CPU#28 is frozen.
CPU#29 is frozen. CPU#30 is frozen. CPU#31 is frozen.
poll_lock is locked, unable to take a dump!
rebooting in 5 seconds
Badness in netpoll_reset_locks at net/core/netpoll.c:864
Call Trace:<ffffffff802c9996>{netpoll_reset_locks+184} <ffffffffa01783ea>{:netdump:netpoll_netdump+44}
       <ffffffffa017839a>{:netdump:netpoll_start_netdump+221}
```

---

Grr, OK, I think I see what's going on here. It's not specifically a bnx2 problem; it's some other deadlock, whose likelihood of occurring is likely increased by the fact that bnx2 offers the opportunity to run softirqs from within its poll routine. We probably need to fix the underlying deadlock that the NMI detected, but of course to do that we need to get a vmcore, so it's kind of a chicken-and-egg problem, because the deadlock happens while the poll_lock is held by bnx2. The most direct fix is the patch attached below, I think. You'll still get the NMI panic of course (assuming that you don't need to run a softirq to trigger it), but you should get a vmcore with this patch. Please give it a test and let me know. Thanks!

Created attachment 328283 [details]
poll to keep bh's disabled locally while poll_lock is held
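The attachment's title describes the approach: keep bottom halves disabled on the polling CPU for as long as poll_lock is held. Below is a minimal illustrative sketch of that idea, not the code from attachment 328283; the function name, the poll budget, and the reliance on 2.6.9-era struct net_device fields (poll_lock, poll_owner, poll, poll_controller, __LINK_STATE_RX_SCHED) are assumptions made for illustration only.

```c
/*
 * Sketch only -- not the actual RHEL 4 patch.  Assumes the 2.6.9-era
 * netpoll/net_device layout (poll_lock, poll_owner, poll, poll_controller).
 */
#include <linux/netdevice.h>
#include <linux/interrupt.h>
#include <linux/spinlock.h>
#include <linux/smp.h>

static void netpoll_poll_dev_sketch(struct net_device *dev)
{
	int budget = 16;	/* arbitrary poll budget for the sketch */

	/* poll_controller is only present with CONFIG_NET_POLL_CONTROLLER. */
	if (!dev->poll_controller)
		return;

	/*
	 * Keep softirqs off on this CPU for the whole time poll_lock is
	 * held.  Without this, local_bh_enable() reached from the driver's
	 * poll path (bnx2_poll in the traces above) can run softirq work on
	 * this CPU while another CPU is inside netdump, which is the
	 * deadlock this bug describes.
	 */
	local_bh_disable();
	spin_lock(&dev->poll_lock);
	dev->poll_owner = smp_processor_id();

	/* Have the driver service the device without a real interrupt. */
	dev->poll_controller(dev);

	/* If a NAPI poll is scheduled, run it with bottom halves still off. */
	if (dev->poll && test_bit(__LINK_STATE_RX_SCHED, &dev->state))
		dev->poll(dev, &budget);

	dev->poll_owner = -1;
	spin_unlock(&dev->poll_lock);
	local_bh_enable();
}
```

A consequence of holding bottom halves off across the entire poll is that local softirq processing is deferred while netconsole/netdump owns the device, which is consistent with the packet loss reported further down and later tracked as Bug 483445.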
After applying the attached patch (attachment 328283), I am not able to see the panic anymore, so there is no vmcore. Brew build: https://brewweb.devel.redhat.com/taskinfo?taskID=1642784

---

Dang, I was still hoping we could trigger the original panic. Oh well. This is most likely the best course of action anyway. I'll post this shortly. Thanks!

---

It is probably worth mentioning that although there was no panic, I have consistently seen packet loss while running "echo t >/proc/sysrq-trigger" in a loop.

From the affected machine's serial console:

```
# while :; do echo t >/proc/sysrq-trigger; done
```

From another host:

```
$ ping hp-dl785g5-01.rhts.bos.redhat.com
...
```

I have seen a lot of packet loss here.

---

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

---

Committed in 79.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/

---

I guess the packet loss in comment #14 still needs to be addressed eventually: Bug 483445 - Packets Loss with Netdump.

---

Patch is in -89.EL kernel.

---

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html