Bug 565964

Summary:

[Broadcom 5.5 bug] tg3: 5717 and 57765 asic revs can panic under load

Product:

Red Hat Enterprise Linux 5

Reporter:

Matt Carlson <mcarlson>

Component:

kernel

Assignee:

Andy Gospodarek <agospoda>

Status:

CLOSED ERRATA

QA Contact:

Red Hat Kernel QE team <kernel-qe>

Severity:

high

Docs Contact:

Priority:

high

Version:

5.5

CC:

agospoda, andriusb, arindam.nath, benlu, bzeranski, cward, emcnabb, ericlrl, gideonn, gloriach, henry.su, jfeeney, jleu, peterm, rdoty, shane.huang

Target Milestone:

Keywords:

OtherQA

Target Release:

5.5

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2010-03-30 07:09:28 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

533941

Attachments:

Description	Flags
fix submitted for review	none
tg3-fix-panic-under-load.patch	none

Description Matt Carlson 2010-02-16 20:00:10 UTC

Description of problem:

The following 3 commit IDs do a better job of explaining the problem than I can here.  Very briefly, it is possible for one MSI-X vector to receive enough traffic that it will overwrite data needed by another MSI-X vector's receive path.

Commit IDs
===========
e4af1af900328e4aa71cd5df75bb22669ab11522
e92967bfb1f4fa7da7c425df9239c4bb615dec30
f89f38b8ec3171664314669a1396ab70b43e8961

Comment 1 Andrius Benokraitis 2010-02-16 21:47:16 UTC

Matt, we are at the very end of the RHEL 5.5 development cycle - no guarantees on this making it.

Comment 2 Matt Carlson 2010-02-16 21:53:44 UTC

Understood.  At least the problem is documented and trackable.

Comment 4 John Feeney 2010-02-17 16:16:53 UTC

An RPM with this fix can be found on my people page. See
http://people.redhat.com/jfeeney/.rhel5-tg3
Testing feedback would be appreciated.

Comment 6 RHEL Program Management 2010-02-17 18:44:55 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 Matt Carlson 2010-02-18 18:06:51 UTC

The rx side locks up under mild stress.  Can I see the sources?

Comment 10 Andrius Benokraitis 2010-02-18 18:11:15 UTC

Unfortunately John is on PTO until next week. I'm not sure whether anyone is filling in for him.

Comment 11 Andy Gospodarek 2010-02-18 19:05:08 UTC

(In reply to comment #9)
> The rx side locks up under mild stress.  Can I see the sources?    

Unfortunately I do not have access to the source as John is out of town.  Is there any interesting output we might be able to use for debugging?

Comment 12 Matt Carlson 2010-02-18 19:42:24 UTC

Well.  I have a couple of observations, but I'm not sure where they are leading me.

If I turn TSO off, the problem goes away.  However, I don't think the problem is with TSO.  The problem these patches are trying to solve is brought about by large volumes of traffic.  Changing the TSO setting alters the traffic flow seen by the hardware.  So on the surface, the behavior looks similar to the behavior of the problem we were trying to solve.

I have the 57765 connected back-to-back with another machine.  On the remote machine, I see packets arriving and responses leaving.  On the local side, I see the tx and rx stat counters incrementing, but obviously the traffic gets lost somewhere on the rx side.

rxbds_empty isn't increasing.  If only the first two patches were applied, you would see the hardware being starved of buffers, which would have made this counter increment.  So this doesn't appear to hint at a cause yet.

In the back of my head, I'm wondering if it is O.K. in Red Hat's pseudo-NAPI scheme for one NAPI instance to schedule another NAPI instance.  Testing so far suggests it is O.K., but this is where my mind runs when I encounter a problem.

All in all, I suspect a backporting problem, but I'm still trying to guess where the problem might be.
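The counter check described above (tx and rx counters moving while rxbds_empty stays flat) can be scripted by comparing two `ethtool -S` snapshots. The following is only a minimal sketch: the helper names are hypothetical, the sample text is made up for illustration rather than captured from the 57765, and it assumes the usual `name: value` layout of `ethtool -S` output.

```python
def parse_ethtool_stats(text):
    """Turn 'name: value' lines into a dict, skipping headers such as 'NIC statistics:'."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def counter_deltas(before, after, names):
    """Return how much each named counter grew between two snapshots."""
    return {n: after.get(n, 0) - before.get(n, 0) for n in names}


# Illustrative snapshots only; real values come from running `ethtool -S eth0` twice.
BEFORE = """NIC statistics:
     rx_ucast_packets: 1000
     rxbds_empty: 0
     rx_discards: 5
"""
AFTER = """NIC statistics:
     rx_ucast_packets: 1500
     rxbds_empty: 0
     rx_discards: 5
"""

deltas = counter_deltas(
    parse_ethtool_stats(BEFORE),
    parse_ethtool_stats(AFTER),
    ["rx_ucast_packets", "rxbds_empty", "rx_discards"],
)
print(deltas)
```

A growing rxbds_empty delta would point at buffer starvation; a flat one, as reported here, suggests looking elsewhere.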

Comment 13 Jarod Wilson 2010-02-22 21:36:51 UTC

Created attachment 395580 [details]
fix submitted for review

Attached is the patch that was submitted internally for review, which I'd assume matches what John had in his test build...

Comment 14 Andrius Benokraitis 2010-02-22 22:20:33 UTC

Matt @ BRCM: Based on the issues you found late last week, we are going to hold off on including this bugzilla until the issues are resolved.

Jarod's attachment should be the source you were looking for, since John is on vacation.

Comment 15 Matt Carlson 2010-02-23 00:58:16 UTC

Actually, the behavior we see now is still better than a crash.  The device stall can be quickly cleared with an 'ifconfig eth0 down && ifconfig eth0 up'.  But I think I found the problem.

@@ -4435,18 +4442,19 @@ next_pkt_nopost:
 		tpr->rx_std_prod_idx = std_prod_idx % TG3_RX_RING_SIZE;
 		tpr->rx_jmb_prod_idx = jmb_prod_idx % TG3_RX_JUMBO_RING_SIZE;
 
-		netif_rx_schedule(tp->napi[0].dummy_netdev);
+		if (tnapi != &tp->napi[1])
+			netif_rx_schedule(tp->napi[0].dummy_netdev);

I think this should be:

+			netif_rx_schedule(tp->napi[1].dummy_netdev);
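Why the index matters can be shown with a toy model. This is not the tg3 driver: it is a user-space sketch that assumes, as an unverified simplification of the commits listed in the description, that napi[1] is the instance responsible for handing consumed rx buffers back to the hardware. All names (`ToyNic`, `delivered`, `REFILL_OWNER`) are hypothetical.

```python
from dataclasses import dataclass, field

REFILL_OWNER = 1  # assumption: napi[1] returns recycled buffers to the hardware


@dataclass
class ToyNic:
    hw_buffers: int                 # buffers the hardware can still fill
    recycled: int = 0               # buffers consumed by rx vectors, not yet returned
    scheduled: set = field(default_factory=set)

    def rx(self, vector, packets, kick):
        """Receive up to `packets` on `vector`, then schedule napi[kick]."""
        taken = min(packets, self.hw_buffers)
        self.hw_buffers -= taken
        self.recycled += taken
        # Mirrors the guard in the patch: only vectors other than napi[1] kick.
        if taken and vector != REFILL_OWNER:
            self.scheduled.add(kick)
        return taken

    def poll(self):
        """Run the scheduled instances; only the refill owner returns buffers."""
        if REFILL_OWNER in self.scheduled:
            self.hw_buffers += self.recycled
            self.recycled = 0
        self.scheduled.clear()


def delivered(kick, rounds=10, burst=4, ring=8):
    """Total packets received when rx vector 2 schedules napi[kick] after each burst."""
    nic = ToyNic(hw_buffers=ring)
    total = 0
    for _ in range(rounds):
        total += nic.rx(vector=2, packets=burst, kick=kick)
        nic.poll()
    return total


print(delivered(kick=0), delivered(kick=1))
```

In this model, kicking napi[0] lets the ring drain once and then receive nothing more, which resembles the rx stall seen with the first backport, while kicking napi[1] keeps traffic flowing.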

Comment 17 Andy Gospodarek 2010-02-23 16:14:48 UTC

(In reply to comment #15)
> Actually, the behavior we see now is still better than a crash.  The device
> stall can be quickly cleared with an 'ifconfig eth0 down && ifconfig eth0 up'. 
> But I think I found the problem.
> 
> @@ -4435,18 +4442,19 @@ next_pkt_nopost:
>    tpr->rx_std_prod_idx = std_prod_idx % TG3_RX_RING_SIZE;
>    tpr->rx_jmb_prod_idx = jmb_prod_idx % TG3_RX_JUMBO_RING_SIZE;
> 
> -  netif_rx_schedule(tp->napi[0].dummy_netdev);
> +  if (tnapi != &tp->napi[1])
> +   netif_rx_schedule(tp->napi[0].dummy_netdev);
> 
> I think this should be :
> 
> +   netif_rx_schedule(tp->napi[1].dummy_netdev);    

That looks correct to me, Matt.  I will respin the patch and attach it.  If I make them against the latest kernels here:

http://people.redhat.com/jwilson/el5/

will you be able to test them or should I spin some test kernels for you?

Comment 18 Andy Gospodarek 2010-02-23 17:44:13 UTC

Created attachment 395777 [details]
tg3-fix-panic-under-load.patch

I agree with Matt's assessment; this is the patch I would propose in place of the one John originally posted.

Comment 19 Matt Carlson 2010-02-23 17:55:36 UTC

I think I should be able to test against the kernels at jwilson, as long as I have a source RPM to extract the tg3 driver from.  It looks like it is there.  

Which test kernel at jwilson should I apply this patch against?

Comment 20 Andy Gospodarek 2010-02-23 18:52:21 UTC

Matt, they were made against 2.6.18-189.el5:

http://people.redhat.com/jwilson/el5/189.el5/

Comment 21 Matt Carlson 2010-02-23 20:17:38 UTC

The fix is looking good.  I'm going to let the machine run in my home-grown stress environment for an hour or two before I give the official thumbs-up though.

Comment 22 Andy Gospodarek 2010-02-23 20:28:23 UTC

Thanks, Matt.  I'm glad you were able to get it going (I'm not surprised). :)

Sometimes it takes a while to get an official build out of here, so I'm glad you were able to get it built and kick off a test run.  As long as this looks good by tomorrow morning I don't see any problem getting this included in 5.5.

Comment 23 Matt Carlson 2010-02-24 01:15:20 UTC

I'm satisfied that this patch fixes the problem.  My test ran for long enough without stalls or crashes.  I'll submit my test version to our QA dept for more testing.

Comment 24 Andy Gospodarek 2010-02-24 14:25:07 UTC

Thanks, Matt.  I will submit this for inclusion in RHEL5.5.

Comment 26 Andy Gospodarek 2010-02-24 15:32:20 UTC

Matt, my test kernels finished, so if you want to pass them on to others on your team feel free.  They are available here:

http://people.redhat.com/agospoda/#rhel5

Comment 27 Ben 2010-02-25 04:14:03 UTC

Following comment 23, I verified the fix on the 57765 with the test version that Matt gave us. The load/unload and copy/compare stress tests pass.

Comment 28 Ben 2010-02-25 10:26:44 UTC

Following comment 26, this test kernel (2.6.18.190.el5_x86_64) still fails on the Broadcom 57765. Ping fails because Rx packets are dropped. Checking the statistics with ethtool -S shows rxbds_empty=1 and rx_discard increasing.

Comment 29 Andy Gospodarek 2010-02-25 18:29:25 UTC

Ben, I'm having a hard time understanding the problems you are seeing.

Can you clarify exactly which kernel you were using?  (2.6.18-190.el5 will not have the fix for this issue, but 2.6.18-190.el5.gtest.85 will.)

Comment 30 Matt Carlson 2010-02-25 20:41:00 UTC

Ah.  That may be Ben's problem.  I just tested 2.6.18-190.el5.gtest.85.  It works fine for me.

Comment 31 Andy Gospodarek 2010-02-25 21:41:35 UTC

Thanks, Matt!

Comment 32 Ben 2010-02-26 01:04:58 UTC

Sorry, it was my mistake: I used the wrong test kernel. I am now testing with 2.6.18-190.el5.gtest.85 and it works well.

Comment 33 Andy Gospodarek 2010-02-26 01:15:23 UTC

No problem, Ben.  Glad this issue is resolved with my test kernels.

Comment 35 Jarod Wilson 2010-03-03 15:45:21 UTC

in kernel-2.6.18-191.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.
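Since an earlier test run in this bug used a kernel without the patch, it may help to check the running kernel string before verifying. The sketch below is hypothetical and encodes only what this bug states: the fix is in kernel-2.6.18-191.el5 and in the 2.6.18-190.el5.gtest.85 test build. It assumes later builds keep the fix and knows nothing about other test builds.

```python
import re

FIXED_BUILD = 191                     # kernel-2.6.18-191.el5, per comment 35
KNOWN_TEST_SUFFIXES = {".gtest.85"}   # 2.6.18-190.el5.gtest.85, per comment 29

# Matches strings such as 2.6.18-191.el5 or 2.6.18-190.el5.gtest.85.
_RELEASE = re.compile(r"^2\.6\.18-(\d+)(?:\.\d+)*\.el5(.*)$")


def has_tg3_fix(release):
    """Return True if the release string names a kernel this bug says carries the fix."""
    m = _RELEASE.match(release)
    if m is None:
        return False  # not a RHEL 5 kernel string this sketch understands
    build, suffix = int(m.group(1)), m.group(2)
    return build >= FIXED_BUILD or suffix in KNOWN_TEST_SUFFIXES


for release in ("2.6.18-190.el5", "2.6.18-190.el5.gtest.85", "2.6.18-191.el5"):
    print(release, has_tg3_fix(release))
```

On the machine under test, the string to pass in is the output of `uname -r`, or `platform.release()` from Python.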

Comment 37 Jeff Leu 2010-03-10 03:26:46 UTC

For kernel 2.6.18-191.el5, QA has verified the driver load/unload and IPv6 Checksum Offload tests. Both work well.

Comment 39 errata-xmlrpc 2010-03-30 07:09:28 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html