Bug 565964 - [Broadcom 5.5 bug] tg3: 5717 and 57765 asic revs can panic under load
Summary: [Broadcom 5.5 bug] tg3: 5717 and 57765 asic revs can panic under load
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 5.5
Assignee: Andy Gospodarek
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 533941
 
Reported: 2010-02-16 20:00 UTC by Matt Carlson
Modified: 2010-03-30 07:09 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-03-30 07:09:28 UTC
Target Upstream Version:
Embargoed:


Attachments
fix submitted for review (7.24 KB, patch)
2010-02-22 21:36 UTC, Jarod Wilson
tg3-fix-panic-under-load.patch (8.06 KB, patch)
2010-02-23 17:44 UTC, Andy Gospodarek


Links
System: Red Hat Product Errata
ID: RHSA-2010:0178
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.5 kernel security and bug fix update
Last Updated: 2010-03-29 12:18:21 UTC

Description Matt Carlson 2010-02-16 20:00:10 UTC
Description of problem:

The following 3 commit IDs do a better job of explaining the problem than I can here.  Very briefly, it is possible for one MSI-X vector to receive enough traffic that it will overwrite data needed by another MSI-X vector's receive path.

Commit IDs
===========
e4af1af900328e4aa71cd5df75bb22669ab11522
e92967bfb1f4fa7da7c425df9239c4bb615dec30
f89f38b8ec3171664314669a1396ab70b43e8961

Comment 1 Andrius Benokraitis 2010-02-16 21:47:16 UTC
Matt, we are at the very end of the RHEL 5.5 development cycle - no guarantees on this making it.

Comment 2 Matt Carlson 2010-02-16 21:53:44 UTC
Understood.  At least the problem is documented and trackable.

Comment 4 John Feeney 2010-02-17 16:16:53 UTC
An RPM with this fix can be found on my people page. See
http://people.redhat.com/jfeeney/.rhel5-tg3
Testing feedback would be appreciated.

Comment 6 RHEL Program Management 2010-02-17 18:44:55 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 Matt Carlson 2010-02-18 18:06:51 UTC
The rx side locks up under mild stress.  Can I see the sources?

Comment 10 Andrius Benokraitis 2010-02-18 18:11:15 UTC
Unfortunately, John is on PTO until next week. I'm not sure whether anyone is filling in for him.

Comment 11 Andy Gospodarek 2010-02-18 19:05:08 UTC
(In reply to comment #9)
> The rx side locks up under mild stress.  Can I see the sources?    

Unfortunately I do not have access to the source as John is out of town.  Is there any interesting output we might be able to use for debugging?

Comment 12 Matt Carlson 2010-02-18 19:42:24 UTC
Well, I have a couple of observations, but I'm not sure where they are leading me.

If I turn TSO off, the problem goes away.  However, I don't think the problem is with TSO.  The problem these patches are trying to solve is brought about by large volumes of traffic.  Changing the TSO setting alters the traffic flow seen by the hardware.  So on the surface, the behavior looks similar to the behavior of the problem we were trying to solve.

I have the 57765 connected back-to-back with another machine.  On the remote machine, I see packets arriving and responses leaving.  On the local side, I see the tx and rx stat counters incrementing, but obviously the traffic gets lost somewhere on the rx side.

rxbds_empty isn't increasing.  If only the first two patches were applied, you would see the hardware being starved of buffers, which would have made this counter increment.  So this doesn't appear to hint at a cause yet.

In the back of my head, I'm wondering if it is O.K. in Red Hat's pseudo-NAPI scheme for one NAPI instance to schedule another NAPI instance.  Testing so far suggests it is O.K., but this is where my mind runs when I encounter a problem.

All in all, I suspect a backporting problem, but I'm still trying to guess where the problem might be.

Comment 13 Jarod Wilson 2010-02-22 21:36:51 UTC
Created attachment 395580 [details]
fix submitted for review

Attached is the patch that was submitted internally for review, which I'd assume matches what John had in his test build...

Comment 14 Andrius Benokraitis 2010-02-22 22:20:33 UTC
Matt @ BRCM: Based on the issues you found late last week, we are going to hold off on including this bugzilla until the issues are resolved.

Jarod's attachment should be the source you were looking for, since John is on vacation.

Comment 15 Matt Carlson 2010-02-23 00:58:16 UTC
Actually, the behavior we see now is still better than a crash.  The device stall can be quickly cleared with an 'ifconfig eth0 down && ifconfig eth0 up'.  But I think I found the problem.

@@ -4435,18 +4442,19 @@ next_pkt_nopost:
 		tpr->rx_std_prod_idx = std_prod_idx % TG3_RX_RING_SIZE;
 		tpr->rx_jmb_prod_idx = jmb_prod_idx % TG3_RX_JUMBO_RING_SIZE;
 
-		netif_rx_schedule(tp->napi[0].dummy_netdev);
+		if (tnapi != &tp->napi[1])
+			netif_rx_schedule(tp->napi[0].dummy_netdev);

I think this should be:

+			netif_rx_schedule(tp->napi[1].dummy_netdev);
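
For readers following the one-line change above, here is a minimal user-space sketch of the corrected scheduling decision. It is not the driver code: struct model_tg3, struct model_napi, model_schedule(), model_rx_done() and MODEL_NUM_VECTORS are hypothetical stand-ins, and model_schedule() only records that a poll was requested instead of calling netif_rx_schedule(). The assumption, read from the condition in the patch rather than stated in this thread, is that vector 1's poll routine is the one that has to run after another vector updates the producer indexes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NUM_VECTORS 4   /* hypothetical vector count for this model */

/* Hypothetical stand-in for the driver's per-vector NAPI state. */
struct model_napi {
    bool scheduled;           /* set when a poll has been requested */
};

/* Hypothetical stand-in for the per-device state. */
struct model_tg3 {
    struct model_napi napi[MODEL_NUM_VECTORS];
};

/* Stand-in for netif_rx_schedule(): it only records the request. */
static void model_schedule(struct model_napi *n)
{
    n->scheduled = true;
}

/*
 * Tail of the rx path once the producer indexes have been updated.
 * Any vector other than napi[1] asks napi[1] to run. The broken backport
 * asked napi[0] instead, so napi[1] was never woken by the other vectors.
 */
static void model_rx_done(struct model_tg3 *tp, struct model_napi *tnapi)
{
    assert(tp != NULL && tnapi != NULL);

    if (tnapi != &tp->napi[1])
        model_schedule(&tp->napi[1]);
}
```

In this model, a call such as model_rx_done(&tp, &tp.napi[2]) marks napi[1] as scheduled and leaves napi[0] untouched, while a call made from napi[1] itself schedules nothing.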

Comment 17 Andy Gospodarek 2010-02-23 16:14:48 UTC
(In reply to comment #15)
> Actually, the behavior we see now is still better than a crash.  The device
> stall can be quickly cleared with an 'ifconfig eth0 down && ifconfig eth0 up'. 
> But I think I found the problem.
> 
> @@ -4435,18 +4442,19 @@ next_pkt_nopost:
>    tpr->rx_std_prod_idx = std_prod_idx % TG3_RX_RING_SIZE;
>    tpr->rx_jmb_prod_idx = jmb_prod_idx % TG3_RX_JUMBO_RING_SIZE;
> 
> -  netif_rx_schedule(tp->napi[0].dummy_netdev);
> +  if (tnapi != &tp->napi[1])
> +   netif_rx_schedule(tp->napi[0].dummy_netdev);
> 
> I think this should be :
> 
> +   netif_rx_schedule(tp->napi[1].dummy_netdev);    

That looks correct to me, Matt.  I will respin the patch and attach it.  If I make them against the latest kernels here:

http://people.redhat.com/jwilson/el5/

will you be able to test them or should I spin some test kernels for you?

Comment 18 Andy Gospodarek 2010-02-23 17:44:13 UTC
Created attachment 395777 [details]
tg3-fix-panic-under-load.patch

I agree with Matt's assessment. This is the patch I would propose in place of the one John originally posted.

Comment 19 Matt Carlson 2010-02-23 17:55:36 UTC
I think I should be able to test against the kernels at jwilson, as long as I have a source RPM to extract the tg3 driver from.  It looks like it is there.  

Which test kernel at jwilson should I apply this patch against?

Comment 20 Andy Gospodarek 2010-02-23 18:52:21 UTC
Matt, they were made against 2.6.18-189.el5:

http://people.redhat.com/jwilson/el5/189.el5/

Comment 21 Matt Carlson 2010-02-23 20:17:38 UTC
The fix is looking good.  I'm going to let the machine run in my home-grown stress environment for an hour or two before I give the official thumbs-up though.

Comment 22 Andy Gospodarek 2010-02-23 20:28:23 UTC
Thanks, Matt.  I'm glad you were able to get it going (I'm not surprised). :)

Sometimes it takes a while to get an official build out of here, so I'm glad you were able to get it built and kick off a test run.  As long as this looks good by tomorrow morning I don't see any problem getting this included in 5.5.

Comment 23 Matt Carlson 2010-02-24 01:15:20 UTC
I'm satisfied that this patch fixes the problem.  My test ran for long enough without stalls or crashes.  I'll submit my test version to our QA dept for more testing.

Comment 24 Andy Gospodarek 2010-02-24 14:25:07 UTC
Thanks, Matt.  I will submit this for inclusion in RHEL5.5.

Comment 26 Andy Gospodarek 2010-02-24 15:32:20 UTC
Matt, my test kernels finished, so if you want to pass them on to others on your team feel free.  They are available here:

http://people.redhat.com/agospoda/#rhel5

Comment 27 Ben 2010-02-25 04:14:03 UTC
Following comment 23, we verified the fix on the 57765 with the test version that Matt gave us. The load/unload and copy/compare stress tests pass.

Comment 28 Ben 2010-02-25 10:26:44 UTC
Following comment 26, this test kernel (2.6.18.190.el5_x86_64) still fails on the Broadcom 57765. Pings fail because Rx packets are dropped. Checking the statistics with ethtool -S shows rxbds_empty=1 and rx_discard increasing.
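
The counter check described above can be automated by comparing two samples of the statistics text. The following is a rough sketch, assuming the text is laid out as indented "name: value" lines in the way ethtool -S prints them. stat_lookup() and stat_increased() are hypothetical helpers written for this illustration, and the counter names are used as they are written in this comment.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Find one counter in text made of "name: value" lines.
 * Returns 0 and stores the value on success, -1 if the counter is absent.
 */
static int stat_lookup(const char *text, const char *name,
                       unsigned long long *value)
{
    size_t name_len;
    const char *line = text;

    assert(text != NULL && name != NULL && value != NULL);
    name_len = strlen(name);

    while (*line != '\0') {
        /* Skip the indentation in front of the counter name. */
        const char *p = line + strspn(line, " \t");

        /* Require an exact name followed by ':' and a number. */
        if (strncmp(p, name, name_len) == 0 && p[name_len] == ':' &&
            sscanf(p + name_len + 1, "%llu", value) == 1)
            return 0;

        /* Advance to the start of the next line, if there is one. */
        line = strchr(line, '\n');
        if (line == NULL)
            break;
        line++;
    }
    return -1;
}

/* Nonzero when the named counter grew between the two samples. */
static int stat_increased(const char *before, const char *after,
                          const char *name)
{
    unsigned long long a, b;

    if (stat_lookup(before, name, &a) != 0 ||
        stat_lookup(after, name, &b) != 0)
        return 0;
    return b > a;
}
```

Given two captures taken a few seconds apart, stat_increased(before, after, "rx_discard") reports whether drops are still accumulating, and the same call with "rxbds_empty" shows whether the hardware is still running out of receive buffers.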

Comment 29 Andy Gospodarek 2010-02-25 18:29:25 UTC
Ben, I'm having a hard time understanding the problems you are seeing.

Can you clarify exactly which kernel you were using?  (2.6.18-190.el5 will not have the fix for this issue, but 2.6.18-190.el5.gtest.85 will.)

Comment 30 Matt Carlson 2010-02-25 20:41:00 UTC
Ah.  That may be Ben's problem.  I just tested 2.6.18-190.el5.gtest.85.  It works fine for me.

Comment 31 Andy Gospodarek 2010-02-25 21:41:35 UTC
Thanks, Matt!

Comment 32 Ben 2010-02-26 01:04:58 UTC
Sorry, that was my mistake: I used the wrong test kernel. I am now testing with 2.6.18-190.el5.gtest.85 and it works well.

Comment 33 Andy Gospodarek 2010-02-26 01:15:23 UTC
No problem, Ben.  Glad this issue is resolved with my test kernels.

Comment 35 Jarod Wilson 2010-03-03 15:45:21 UTC
in kernel-2.6.18-191.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.

Comment 37 Jeff Leu 2010-03-10 03:26:46 UTC
For kernel 2.6.18-191.el5, QA has verified the driver load/unload and IPv6 Checksum Offload tests. Both work well.

Comment 39 errata-xmlrpc 2010-03-30 07:09:28 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html

