Description of problem:
Run nfs testing component of Tier1 test suite and bnx2 driver will panic within a couple of hours with the following panic:

Unable to handle kernel NULL pointer dereference at 00000000000000e8 RIP:
 [<ffffffff881fa7c1>] :bnx2:bnx2_poll_work+0x90/0x1227
PGD 130efb067 PUD 160580067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/local_cpus
CPU 63
Modules linked in: pppoe pppox atm ppp_generic slhc rds rdma_cm af_key ib_cm iw_cm ib_sa ib_mad ib_core ib_addr testmgr_cipher testmgr aead crypto_blkcipher crypto_algapi nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ipv6 xfrm_nalgo crypto_api cpufreq_ondemand acpi_cpufreq freq_table dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sr_mod cdrom sg i2c_i801 i2c_core bnx2 pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 14110, comm: dumpcap Tainted: G 2.6.18-164.el5experimental.2 #1
RIP: 0010:[<ffffffff881fa7c1>] [<ffffffff881fa7c1>] :bnx2:bnx2_poll_work+0x90/0x1227
RSP: 0018:ffff81037f8d3d10 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff810105634e20 RCX: ffff81011c2b3e40
RDX: ffff81017c6e16df RSI: 00000000000000e2 RDI: 0000000000000000
RBP: 0000000016d516e2 R08: ffff8103ff143070 R09: 0000000000001000
R10: 0000000000000001 R11: 0000000000000001 R12: ffff81017c6e86c0
R13: 0000000000000000 R14: ffff81017c6e8500 R15: 000000007c6e16df
FS: 00002b890bd3e840(0000) GS:ffff8103ff415640(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000000e8 CR3: 0000000170e26000 CR4: 00000000000006e0
Process dumpcap (pid: 14110, threadinfo ffff8102f1abc000, task ffff81037ec6c7a0)
Stack: 0000000000000246 ffff81011c2b3e40 ffff810372a2e9c0 000000000000012c
 ffff81017c6e8520 000000017c6e86c0 ffff810300000006 ffff81017c6e8500
 0000000000000018 ffff8102812942a0 000000000000040c ffffffff800c69b5
Call Trace:
 <IRQ>
 [<ffffffff800c69b5>] free_pages_bulk+0x1f0/0x268
 [<ffffffff80148c7f>] deadline_init_queue+0xdc/0x11b
 [<ffffffff80088b7f>] elf_core_dump+0xc1c/0xc2c
 [<ffffffff881fbd0e>] :bnx2:bnx2_poll+0xdf/0x209
 [<ffffffff8000c845>] net_rx_action+0xac/0x1e0
 [<ffffffff8001235a>] __do_softirq+0x89/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006cb14>] do_softirq+0x2c/0x85
 [<ffffffff8006c99c>] do_IRQ+0xec/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI>
 [<ffffffff8000e38f>] mark_page_accessed+0x2/0x68
 [<ffffffff8000b3e2>] __find_get_block+0x15c/0x16c
 [<ffffffff800076c0>] find_get_page+0x21/0x51
 [<ffffffff80019a8a>] __getblk+0x1d/0x236
 [<ffffffff8804daa9>] :ext3:__ext3_get_inode_loc+0x12f/0x2f9
 [<ffffffff8804dca7>] :ext3:ext3_reserve_inode_write+0x23/0x90
 [<ffffffff8804dd35>] :ext3:ext3_mark_inode_dirty+0x21/0x3c
 [<ffffffff88050c8a>] :ext3:ext3_dirty_inode+0x63/0x7b
 [<ffffffff80013b89>] __mark_inode_dirty+0x29/0x16e
 [<ffffffff8804e04f>] :ext3:ext3_generic_write_end+0x3e/0x46
 [<ffffffff8804ff70>] :ext3:ext3_ordered_write_end+0xb3/0x116
 [<ffffffff8000fd42>] generic_file_buffered_write+0x1cc/0x675
 [<ffffffff8001651f>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff8002157b>] generic_file_aio_write+0x65/0xc1
 [<ffffffff8804c1b6>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff8001812f>] do_sync_write+0xc7/0x104
 [<ffffffff8009f7b6>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800420a5>] do_ioctl+0x21/0x6b
 [<ffffffff80016927>] vfs_write+0xce/0x174
 [<ffffffff800171df>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Code: 49 8b 85 e8 00 00 00 66 83 78 06 00 74 25 8b 40 04 8d 54 05
RIP [<ffffffff881fa7c1>] :bnx2:bnx2_poll_work+0x90/0x1227
 RSP <ffff81037f8d3d10>

Version-Release number of selected component (if applicable):
5.4

How reproducible:
Every time on x86_64 and i386.

Steps to Reproduce:
1. run Tier1 test suite.
2. wait
3.

Actual results:
see above

Expected results:
no panic

Additional info:
The RHEL5 patch being posted is:

--- linux-2.6.18.noarch/drivers/net/bnx2.c.orig
+++ linux-2.6.18.noarch/drivers/net/bnx2.c
@@ -2750,6 +2750,7 @@ bnx2_get_hw_tx_cons(struct bnx2_napi *bn
 	/* Tell compiler that status block fields can change. */
 	barrier();
 	cons = *bnapi->hw_tx_cons_ptr;
+	barrier();
 	if (unlikely((cons & MAX_TX_DESC_CNT) == MAX_TX_DESC_CNT))
 		cons++;
 	return cons;
@@ -3031,6 +3032,7 @@ bnx2_get_hw_rx_cons(struct bnx2_napi *bn
 	/* Tell compiler that status block fields can change. */
 	barrier();
 	cons = *bnapi->hw_rx_cons_ptr;
+	barrier();
 	if (unlikely((cons & MAX_RX_DESC_CNT) == MAX_RX_DESC_CNT))
 		cons++;
 	return cons;
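For readers unfamiliar with barrier(), here is a minimal user-space sketch of what the second barrier in that patch appears intended to do: keep the compiler from going back to the status block for a second load after the value has been copied into a local. The barrier() macro below is the usual GCC compiler-only barrier; read_hw_cons() and MAX_DESC_CNT are hypothetical stand-ins for illustration, not the driver's own code.

```c
#include <stdint.h>

/* Compiler-only barrier (GCC/Clang): emits no instruction, but the
 * compiler must assume all memory may have changed across it. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical ring mask, standing in for MAX_TX_DESC_CNT. */
#define MAX_DESC_CNT 0xffu

/* Hypothetical stand-in for bnx2_get_hw_tx_cons(): hw_cons_ptr points
 * at a field that a device may update behind the compiler's back. */
static uint16_t read_hw_cons(const uint16_t *hw_cons_ptr)
{
    uint16_t cons;

    barrier();            /* do not reuse a value loaded earlier */
    cons = *hw_cons_ptr;  /* one load into a local */
    barrier();            /* do not re-read *hw_cons_ptr after this */
    if ((cons & MAX_DESC_CNT) == MAX_DESC_CNT)
        cons++;           /* step past the last slot of the page */
    return cons;
}
```

Without the second barrier a compiler is in principle free to evaluate the test and the return value from two separate loads of the same field, which could disagree if the device writes in between. This is only a compiler-level guarantee and does not order anything on the CPU or bus.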
Unfortunately, the patch provided in comment #3 does not fix the problem. The system under test still panicked.
This is most likely caused by a NULL skb when we are handling the tx interrupt in bnx2_tx_int(). A similar issue was reported upstream a while ago, as shown in the thread below.

http://marc.info/?t=121362387400001&r=1&w=2

This issue was ultimately fixed by the patch below. Does RHEL5.4 have this patch?

69747650c814a8a79fef412c7416adf823293a3e
pkt_sched: Fix return value corruption in HTB and TBF.

This problem was only seen when using HTB or TBF qdisc though.
(In reply to comment #6)
> This is most likely caused by NULL skb when we are handling tx interrupt in
> bnx2_tx_int(). A similar issue was reported upstream a while ago shown by the
> thread below.
>
> http://marc.info/?t=121362387400001&r=1&w=2
>
> This issue was ultimately fixed by the patch below. Does RHEL5.4 have this
> patch?

The first two chunks look like they're in RHEL5, but the latter one is not.

P.

>
> 69747650c814a8a79fef412c7416adf823293a3e
> pkt_sched: Fix return value corruption in HTB and TBF.
>
> This problem was only seen when using HTB or TBF qdisc though.
An x86_64 kernel rpm that has the patch referenced in comment #6 can be found on my people page. An i686 version will be there as soon as it finishes building.

See http://people.redhat.com/jfeeney/.bz526481
------- Comment From kumarr.com 2009-10-29 17:09 EDT------- Mirroring over to IBM
I am available to test the kernel listed in comment #8 but it's not clear to me from the bug which Broadcom adapter the problem was discovered on. Does it only appear on the IBM Ghidorah? Peter
Any Broadcom bnx2 NIC can potentially encounter this problem, because the driver relies on nr_frags in the SKB not changing once the SKB has been queued for transmission. The problem is only known to occur when using the HTB or TBF qdisc.
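To make the nr_frags dependency concrete, here is a small user-space sketch, assuming a ring where each packet takes one head slot plus one slot per fragment. Every name in it (fake_skb, tx_slot, tx_ring, tx_queue, tx_complete_live, tx_complete_snapshot) is hypothetical and only models the idea; it is not the bnx2 code and not the fix that was shipped.

```c
#include <stddef.h>

#define RING_SIZE 8u            /* hypothetical ring size */

struct fake_skb {               /* stand-in for struct sk_buff */
    unsigned int nr_frags;
};

struct tx_slot {                /* stand-in for a software TX descriptor */
    struct fake_skb *skb;       /* set only on the head slot of a packet */
    unsigned int nr_frags;      /* snapshot taken at queue time */
};

struct tx_ring {
    struct tx_slot slot[RING_SIZE];
    unsigned int prod;          /* next slot to fill */
    unsigned int cons;          /* next slot to complete */
};

/* Queue a packet: one head slot plus one slot per fragment. */
static void tx_queue(struct tx_ring *r, struct fake_skb *skb)
{
    struct tx_slot *head = &r->slot[r->prod % RING_SIZE];

    head->skb = skb;
    head->nr_frags = skb->nr_frags;
    r->prod += 1 + skb->nr_frags;
}

/* Fragile completion: trusts the live skb->nr_frags value. */
static struct fake_skb *tx_complete_live(struct tx_ring *r)
{
    struct tx_slot *head = &r->slot[r->cons % RING_SIZE];
    struct fake_skb *skb = head->skb;

    if (skb == NULL)
        return NULL;            /* cons is not on a head slot */
    head->skb = NULL;
    r->cons += 1 + skb->nr_frags;
    return skb;
}

/* Defensive completion: uses the count recorded at queue time. */
static struct fake_skb *tx_complete_snapshot(struct tx_ring *r)
{
    struct tx_slot *head = &r->slot[r->cons % RING_SIZE];
    struct fake_skb *skb = head->skb;

    if (skb == NULL)
        return NULL;
    head->skb = NULL;
    r->cons += 1 + head->nr_frags;
    return skb;
}
```

If a packet is queued with two fragments and its nr_frags is later changed to one, tx_complete_live() advances the consumer index by two slots instead of three and the next completion lands on a fragment slot whose skb pointer is NULL. The sketch returns NULL at that point; code that dereferenced the pointer without checking would fault, which is consistent with the NULL skb suspected in comment #6.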
Patch posted on 11/20/09 at 2:40 PM EDT.
in kernel-2.6.18-175.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
*** Bug 523873 has been marked as a duplicate of this bug. ***
------- Comment From kumarr.com 2010-03-02 15:38 EDT-------
(In reply to comment #4)
> I am available to test the kernel listed in comment #8 but it's not clear to me
> from the bug which Broadcom adapter the problem was discovered on. Does it only
> appear on the IBM Ghidorah?
>
> Peter

Peter,

Can you please verify this fix?

Thanks
------- Comment From coschult.com 2010-03-03 17:07 EDT------- What kind of test is the nfs test in Tier1? Would fstress be a sufficient test for verifying this fix?
With regard to comment #21, running fstress would provide a level of sanity checking for the fix. I would appreciate knowing the results from its test run.
------- Comment From coschult.com 2010-03-05 18:52 EDT-------
I ran bonnie++ (instead of fstress) for several hours with no problems. It looks like the fix is good. This is the command I used:

bonnie++ -d /test_dir/bonnie -s 10000 -n 5 -x 10 -u corinna -b -r 5000
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html