Bug 526481

Summary: bnx2: panic in bnx2_poll_work()
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.4
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: urgent
Target Milestone: rc
Target Release: 5.5
Keywords: ZStream
Reporter: John Feeney <jfeeney>
Assignee: John Feeney <jfeeney>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: agospoda, anton, bzeranski, caiqian, davidkwood, dhoward, emcnabb, hjia, jane.lv, jburke, jpirko, jvillalo, lcm, luyu, mchan, nobody+PNT0273897, prarit, tis
Doc Type: Bug Fix
Clones: 623265
Last Closed: 2010-03-30 06:54:04 UTC
Bug Blocks: 532386, 515318, 533941, 539686, 623265

Description John Feeney 2009-09-30 14:56:40 UTC
Description of problem:
Running the NFS testing component of the Tier1 test suite causes the bnx2 driver to panic within a couple of hours, with the following oops:

Unable to handle kernel NULL pointer dereference at 00000000000000e8 RIP:
 [<ffffffff881fa7c1>] :bnx2:bnx2_poll_work+0x90/0x1227
PGD 130efb067 PUD 160580067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/local_cpus
CPU 63
Modules linked in: pppoe pppox atm ppp_generic slhc rds rdma_cm af_key ib_cm iw_cm ib_sa ib_mad ib_core ib_addr testmgr_cipher testmgr aead crypto_blkcipher crypto_algapi nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ipv6 xfrm_nalgo crypto_api cpufreq_ondemand acpi_cpufreq freq_table dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sr_mod cdrom sg i2c_i801 i2c_core bnx2 pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 14110, comm: dumpcap Tainted: G      2.6.18-164.el5experimental.2 #1
RIP: 0010:[<ffffffff881fa7c1>]  [<ffffffff881fa7c1>] :bnx2:bnx2_poll_work+0x90/0x1227
RSP: 0018:ffff81037f8d3d10  EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff810105634e20 RCX: ffff81011c2b3e40
RDX: ffff81017c6e16df RSI: 00000000000000e2 RDI: 0000000000000000
RBP: 0000000016d516e2 R08: ffff8103ff143070 R09: 0000000000001000
R10: 0000000000000001 R11: 0000000000000001 R12: ffff81017c6e86c0
R13: 0000000000000000 R14: ffff81017c6e8500 R15: 000000007c6e16df
FS:  00002b890bd3e840(0000) GS:ffff8103ff415640(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000000e8 CR3: 0000000170e26000 CR4: 00000000000006e0
Process dumpcap (pid: 14110, threadinfo ffff8102f1abc000, task ffff81037ec6c7a0)
Stack:  0000000000000246 ffff81011c2b3e40 ffff810372a2e9c0 000000000000012c
 ffff81017c6e8520 000000017c6e86c0 ffff810300000006 ffff81017c6e8500
 0000000000000018 ffff8102812942a0 000000000000040c ffffffff800c69b5
Call Trace:
 <IRQ>  [<ffffffff800c69b5>] free_pages_bulk+0x1f0/0x268
 [<ffffffff80148c7f>] deadline_init_queue+0xdc/0x11b
 [<ffffffff80088b7f>] elf_core_dump+0xc1c/0xc2c
 [<ffffffff881fbd0e>] :bnx2:bnx2_poll+0xdf/0x209
 [<ffffffff8000c845>] net_rx_action+0xac/0x1e0
 [<ffffffff8001235a>] __do_softirq+0x89/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006cb14>] do_softirq+0x2c/0x85
 [<ffffffff8006c99c>] do_IRQ+0xec/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff8000e38f>] mark_page_accessed+0x2/0x68
 [<ffffffff8000b3e2>] __find_get_block+0x15c/0x16c
 [<ffffffff800076c0>] find_get_page+0x21/0x51
 [<ffffffff80019a8a>] __getblk+0x1d/0x236
 [<ffffffff8804daa9>] :ext3:__ext3_get_inode_loc+0x12f/0x2f9
 [<ffffffff8804dca7>] :ext3:ext3_reserve_inode_write+0x23/0x90
 [<ffffffff8804dd35>] :ext3:ext3_mark_inode_dirty+0x21/0x3c
 [<ffffffff88050c8a>] :ext3:ext3_dirty_inode+0x63/0x7b
 [<ffffffff80013b89>] __mark_inode_dirty+0x29/0x16e
 [<ffffffff8804e04f>] :ext3:ext3_generic_write_end+0x3e/0x46
 [<ffffffff8804ff70>] :ext3:ext3_ordered_write_end+0xb3/0x116
 [<ffffffff8000fd42>] generic_file_buffered_write+0x1cc/0x675
 [<ffffffff8001651f>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff8002157b>] generic_file_aio_write+0x65/0xc1
 [<ffffffff8804c1b6>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff8001812f>] do_sync_write+0xc7/0x104
 [<ffffffff8009f7b6>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800420a5>] do_ioctl+0x21/0x6b
 [<ffffffff80016927>] vfs_write+0xce/0x174
 [<ffffffff800171df>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0


Code: 49 8b 85 e8 00 00 00 66 83 78 06 00 74 25 8b 40 04 8d 54 05
RIP  [<ffffffff881fa7c1>] :bnx2:bnx2_poll_work+0x90/0x1227
 RSP <ffff81037f8d3d10>
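
A note on reading the oops above: the register dump shows R13 = 0 and CR2 = 0xe8, and the first bytes of the Code: line (49 8b 85 e8 00 00 00) look like a 64-bit load from [r13 + 0xe8]. The sketch below checks that arithmetic. It assumes the Code: bytes start at the faulting RIP (this dump has no <..> marker to confirm that), and it decodes only this one instruction form; the helper names are made up for illustration. It does not establish which structure R13 was supposed to point to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First seven bytes of the "Code:" line from the oops. */
static const uint8_t oops_code[] = { 0x49, 0x8b, 0x85, 0xe8, 0x00, 0x00, 0x00 };

/*
 * Decode "REX.W MOV r64, [base + disp32]" (opcode 0x8b, ModRM mod=10,
 * no SIB byte). On success returns 0 and stores the base register
 * number (0-15) and the signed 32-bit displacement. Returns -1 for any
 * other encoding; this is not a general x86-64 decoder.
 */
static int decode_mov_disp32(const uint8_t *code, size_t len,
                             unsigned *base_reg, int32_t *disp)
{
    assert(code != NULL && base_reg != NULL && disp != NULL);
    if (len < 7)
        return -1;

    uint8_t rex = code[0], opcode = code[1], modrm = code[2];

    if ((rex & 0xf8) != 0x48 || opcode != 0x8b)  /* need REX.W and a MOV load */
        return -1;
    if ((modrm >> 6) != 2 || (modrm & 7) == 4)   /* need mod=10 and no SIB */
        return -1;

    *base_reg = (unsigned)(modrm & 7) | ((unsigned)(rex & 1) << 3); /* REX.B extends rm */

    /* The displacement is stored little-endian after the ModRM byte. */
    uint32_t raw = (uint32_t)code[3] | ((uint32_t)code[4] << 8) |
                   ((uint32_t)code[5] << 16) | ((uint32_t)code[6] << 24);
    *disp = (int32_t)raw;
    return 0;
}

/* Effective address of the load, given the base register's value. */
static uint64_t fault_address(uint64_t base_value, int32_t disp)
{
    return base_value + (uint64_t)(int64_t)disp;
}
```

Feeding it the bytes above gives base register 13 (R13) and displacement 0xe8; with R13 = 0 the computed address matches CR2, which is consistent with a field being read through a NULL pointer.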


Version-Release number of selected component (if applicable):
5.4

How reproducible:
Every time on x86_64 and i386.

Steps to Reproduce:
1. Run the Tier1 test suite (NFS testing component).
2. Wait a couple of hours for the panic.
  
Actual results:
see above

Expected results:
no panic

Additional info:

Comment 3 John Feeney 2009-09-30 17:28:05 UTC
The RHEL5 patch being posted is:


--- linux-2.6.18.noarch/drivers/net/bnx2.c.orig
+++ linux-2.6.18.noarch/drivers/net/bnx2.c
@@ -2750,6 +2750,7 @@ bnx2_get_hw_tx_cons(struct bnx2_napi *bn
        /* Tell compiler that status block fields can change. */
        barrier();
        cons = *bnapi->hw_tx_cons_ptr;
+       barrier();
        if (unlikely((cons & MAX_TX_DESC_CNT) == MAX_TX_DESC_CNT))
                cons++;
        return cons;
@@ -3031,6 +3032,7 @@ bnx2_get_hw_rx_cons(struct bnx2_napi *bn
        /* Tell compiler that status block fields can change. */
        barrier();
        cons = *bnapi->hw_rx_cons_ptr;
+       barrier();
        if (unlikely((cons & MAX_RX_DESC_CNT) == MAX_RX_DESC_CNT))
                cons++;
        return cons;
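
For readers unfamiliar with barrier(): the userspace sketch below mirrors the shape of the patched helper. It is an illustration only, not the driver code. The macro is the usual GCC/Clang compiler barrier, the value 255 for MAX_TX_DESC_CNT is an assumption (256 descriptors per ring page), and hw_tx_cons is a hypothetical stand-in for the status block field that the NIC updates by DMA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 256 descriptors per ring page, so the mask is 255. */
#define MAX_TX_DESC_CNT 255

/* Compiler-only barrier (GCC/Clang); it emits no CPU instruction. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical stand-in for the status block field written by the NIC. */
static uint16_t hw_tx_cons;

static uint16_t get_hw_tx_cons(const uint16_t *hw_tx_cons_ptr)
{
    uint16_t cons;

    assert(hw_tx_cons_ptr != NULL);

    barrier();               /* do not reuse a value loaded before this point */
    cons = *hw_tx_cons_ptr;  /* take one snapshot of the hardware index */
    barrier();               /* do not re-read the field after this point */

    /* An index on the last slot of a ring page is advanced past it. */
    if ((cons & MAX_TX_DESC_CNT) == MAX_TX_DESC_CNT)
        cons++;
    return cons;
}
```

The intent of the second barrier() is that the mask test and the returned value both use the same snapshot, rather than the compiler being free to read the field twice and see two different values.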

Comment 5 John Feeney 2009-10-01 17:51:02 UTC
Unfortunately, the patch provided in comment #3 does not fix the problem. The system under test still panicked.

Comment 6 Michael Chan 2009-10-01 18:55:32 UTC
This is most likely caused by a NULL skb when we are handling the tx interrupt in bnx2_tx_int().  A similar issue was reported upstream a while ago, as shown in the thread below.

http://marc.info/?t=121362387400001&r=1&w=2

This issue was ultimately fixed by the patch below.  Does RHEL5.4 have this patch?

69747650c814a8a79fef412c7416adf823293a3e
pkt_sched: Fix return value corruption in HTB and TBF.

This problem was only seen when using HTB or TBF qdisc though.

Comment 7 Prarit Bhargava 2009-10-01 19:02:11 UTC
(In reply to comment #6)
> This is most likely caused by NULL skb when we are handling tx interrupt in
> bnx2_tx_int().  A similar issue was reported upstream a while ago shown by the
> thread below.
> 
> http://marc.info/?t=121362387400001&r=1&w=2
> 
> This issue was ultimately fixed by the patch below.  Does RHEL5.4 have this
> patch?

The first two chunks look like they're in RHEL5, but the last one is not.

P.

> 
> 69747650c814a8a79fef412c7416adf823293a3e
> pkt_sched: Fix return value corruption in HTB and TBF.
> 
> This problem was only seen when using HTB or TBF qdisc though.

Comment 8 John Feeney 2009-10-02 19:46:34 UTC
An x86_64 kernel RPM that includes the patch referenced in comment #6 can be found on my people page. An i686 version will be there as soon as it finishes building. See
http://people.redhat.com/jfeeney/.bz526481

Comment 9 IBM Bug Proxy 2009-10-29 21:10:31 UTC
------- Comment From kumarr@linux.ibm.com 2009-10-29 17:09 EDT-------
Mirroring over to IBM

Comment 10 Peter Bogdanovic 2009-11-10 22:50:44 UTC
I am available to test the kernel listed in comment #8 but it's not clear to me from the bug which Broadcom adapter the problem was discovered on. Does it only appear on the IBM Ghidorah?

Peter

Comment 11 Michael Chan 2009-11-11 02:37:39 UTC
Any Broadcom bnx2 NIC can potentially encounter this problem, because the driver relies on nr_frags in the SKB not changing while the SKB is queued for transmission.  The problem is only known to occur when using the HTB or TBF qdisc.
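
To make the nr_frags dependency concrete, here is a small userspace model. It is a sketch under stated assumptions, not bnx2 code: struct pkt, struct sw_tx_slot, struct tx_ring and both functions are hypothetical names. The model assumes a packet occupies one head descriptor plus one per fragment and that only the head slot stores the packet pointer. If the live nr_frags differs at completion time from what it was at enqueue time, the consumer index lands on a slot with no packet, which is the NULL-pointer case. Keeping a snapshot of nr_frags in the ring is one defensive option and is not necessarily what the shipped fix does.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 16 /* hypothetical, kept small for illustration */

struct pkt {
    unsigned short nr_frags;   /* stand-in for the skb fragment count */
};

struct sw_tx_slot {
    struct pkt *pkt;           /* set only on a packet's head descriptor */
    unsigned short nr_frags;   /* snapshot taken at enqueue time */
};

struct tx_ring {
    struct sw_tx_slot slot[RING_SIZE];
    unsigned prod, cons;
};

/* Queue a packet: one head descriptor plus one per fragment. */
static void tx_enqueue(struct tx_ring *r, struct pkt *p)
{
    assert(r != NULL && p != NULL);
    struct sw_tx_slot *head = &r->slot[r->prod % RING_SIZE];

    head->pkt = p;
    head->nr_frags = p->nr_frags;
    r->prod += 1u + p->nr_frags;
}

/*
 * Complete one packet. With use_snapshot == 0 the walk trusts the live
 * p->nr_frags; otherwise it uses the count saved at enqueue time.
 * Returns 0 on success, or -1 if the slot at the consumer index holds
 * no packet.
 */
static int tx_complete_one(struct tx_ring *r, int use_snapshot)
{
    assert(r != NULL);
    struct sw_tx_slot *head = &r->slot[r->cons % RING_SIZE];
    struct pkt *p = head->pkt;

    if (p == NULL)
        return -1;

    unsigned short frags = use_snapshot ? head->nr_frags : p->nr_frags;

    head->pkt = NULL;
    r->cons += 1u + frags;
    return 0;
}
```

For example, queue a packet with two fragments followed by one with none, then change the first packet's nr_frags to 1 before completing. Without the snapshot the second completion returns -1; with it, both packets complete and cons catches up with prod.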

Comment 12 Linda Wang 2009-11-20 19:43:01 UTC
Patch posted on 11/20/09 at 2:40 PM EDT.

Comment 15 Don Zickus 2009-11-23 15:33:34 UTC
in kernel-2.6.18-175.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 19 Jiri Pirko 2010-02-16 08:24:35 UTC
*** Bug 523873 has been marked as a duplicate of this bug. ***

Comment 20 IBM Bug Proxy 2010-03-02 20:41:15 UTC
------- Comment From kumarr@linux.ibm.com 2010-03-02 15:38 EDT-------
(In reply to comment #4)
> I am available to test the kernel listed in comment #8 but it's not clear to me
> from the bug which Broadcom adapter the problem was discovered on. Does it only
> appear on the IBM Ghidorah?
>
> Peter

Peter,

Can you please verify this fix?

Thanks

Comment 21 IBM Bug Proxy 2010-03-03 22:10:48 UTC
------- Comment From coschult@us.ibm.com 2010-03-03 17:07 EDT-------
What kind of test is the nfs test in Tier1? Would fstress be a sufficient test for verifying this fix?

Comment 23 John Feeney 2010-03-05 15:40:48 UTC
With regard to comment #21, running fstress would provide a level of sanity checking for the fix. I would appreciate knowing the results from its test run.

Comment 24 IBM Bug Proxy 2010-03-06 00:00:37 UTC
------- Comment From coschult@us.ibm.com 2010-03-05 18:52 EDT-------
I ran bonnie++ (instead of fstress) for several hours with no problems. It looks like the fix is good. This is the command I used:

bonnie++ -d /test_dir/bonnie -s 10000 -n 5 -x 10 -u corinna -b -r 5000

Comment 26 errata-xmlrpc 2010-03-30 06:54:04 UTC
An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html