Bug 630124 - Detect and recover from cxgb3 adapter parity errors
Summary: Detect and recover from cxgb3 adapter parity errors
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: 5.6
Assignee: Doug Ledford
QA Contact: Network QE
URL:
Whiteboard:
Duplicates: 630123
Depends On:
Blocks: Chelsio5.6FT 557597 630978 631547
Reported: 2010-09-03 18:20 UTC by kxie
Modified: 2013-01-11 03:15 UTC
CC List: 18 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 631547
Environment:
Last Closed: 2011-01-13 21:15:34 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
IBM Linux Technology Center 66969 0 None None None Never
Red Hat Product Errata RHSA-2011:0017 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update 2011-01-13 10:37:42 UTC

Description kxie 2010-09-03 18:20:24 UTC
These patches implement code to detect and recover from T3 adapter parity errors. 

Here are the patches for the T3 parity detection/recovery code which Chelsio sent to David Miller, the maintainer of the Linux networking subsystem:

[PATCH net-next 0/4] cxgb3: new code to deal with adapter parity errors (http://www.spinics.net/lists/netdev/msg139772.html)

[PATCH net-next 1/4] cxgb3: Add register bit definition for Fatal Parity Error.
(http://www.spinics.net/lists/netdev/msg139773.html)

[PATCH net-next 2/4] cxgb3: Set FATALPERREN.
(http://www.spinics.net/lists/netdev/msg139774.html)

[PATCH net-next 3/4] cxgb3: Leave interrupts for fatal errors asserted in common code.
(http://www.spinics.net/lists/netdev/msg139775.html)

[PATCH net-next 4/4] cxgb3: Avoid flush_workqueue() deadlock.
(http://www.spinics.net/lists/netdev/msg139776.html)

  Here's David Miller's acceptance of the patches:

Re: [PATCH net-next 0/4] cxgb3: new code to deal with adapter parity errors
(http://www.spinics.net/lists/netdev/msg139871.html)

  As soon as these appear in David's "netdev" git repository Chelsio will send out the git change set commit IDs.

Comment 1 John Jarvis 2010-09-03 18:57:16 UTC
IBM is signed up to test and provide feedback.

Comment 2 RHEL Program Management 2010-09-03 18:59:14 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 3 Larry Troan 2010-09-03 19:17:04 UTC
*** Bug 630123 has been marked as a duplicate of this bug. ***

Comment 5 Anthony Liguori 2010-09-03 20:27:06 UTC
Patch 4/4 does not apply to the RHEL5.6 kernel.  Please port the patch to http://people.redhat.com/jwilson/el5/215.el5/src/kernel-2.6.18-215.el5.src.rpm

The hunk that fails is removing a flush_workqueue() call from cxgb_down().  The call is still present but the function has changed in that there is now a t3_sge_stop() right before the flush_workqueue() call.

It needs someone more comfortable with the driver to resolve the conflict as it might not be a simple remove anymore.

Comment 6 Bob Dugan 2010-09-03 20:39:52 UTC
[[This is Casey Leedom using Bob Dugan's account because I don't have a login. -- Casey]]

  Yeah, that function has changed several times in the past so I'm not too surprised that it didn't patch 100% cleanly.  Basically, the call to flush_workqueue() just needs to be conditioned on the new function parameter "on_wq" -- i.e.:

    if (!on_wq)
        flush_workqueue(...);

The problem is that the error recovery code is itself running on the "cxgb3 Work Queue" which that call is trying to flush. This causes the Linux kernel to deadlock waiting for all the threads to terminate ... including the thread which is waiting for the Work Queue to flush!  (It would be Really Nice(tm) if Linux had a Work Queue function that would "flush everyone _except me_".  Oh well.)

  In any case, I can do the change for you or review it.  My email address is leedom for direct communications or we can depend on Bob to throw things at me ... :-)

Casey
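
[Editor's illustration] The conditional flush Casey describes above can be sketched in plain user-space C. This is only a toy model under stated assumptions, not the driver code: struct fake_workqueue, fake_flush_workqueue() and sketch_cxgb_down() are hypothetical names invented for this example, and the "deadlock" is modelled as a -1 return value rather than a real hang. Only the on_wq parameter and the "if (!on_wq)" guard come from the comment itself.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a kernel work queue (not a real kernel type). */
struct fake_workqueue {
    int  flush_calls;  /* number of successful flushes */
    bool in_worker;    /* true while the caller is running on this queue */
};

/*
 * Toy model of flushing a work queue. Flushing from a work item that runs
 * on the same queue would wait on itself, so the model reports that case
 * as -1 instead of hanging.
 */
static int fake_flush_workqueue(struct fake_workqueue *wq)
{
    if (wq->in_worker)
        return -1;     /* would deadlock: the worker waits for itself */
    wq->flush_calls++;
    return 0;
}

/*
 * Sketch of the shutdown path with the new "on_wq" parameter: the flush is
 * skipped when the caller is already running on the work queue.
 */
static int sketch_cxgb_down(struct fake_workqueue *wq, bool on_wq)
{
    if (!on_wq)
        return fake_flush_workqueue(wq);
    return 0;          /* on the work queue: nothing to flush safely */
}
```

In this model, a normal shutdown calls sketch_cxgb_down(&wq, false) and flushes once, while the error recovery path, which runs on the queue, calls sketch_cxgb_down(&wq, true) and returns without touching the flush.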

Comment 8 Doug Ledford 2010-09-03 21:22:00 UTC
Corrected patches have been posted internally for review and possible inclusion in the rhel5.6 and rhel5.5.z stream kernels.

Comment 9 Larry Troan 2010-09-04 13:14:56 UTC
Minor glitch; the build just finished a few minutes ago. Thanks to Doug and Peter. The ppc64 kernel is available on Peter's people page:

            http://people.redhat.com/peterm/.bz630124/

Believe this is the kernel you are interested in validating. Let me know if otherwise.

Please report test results here as soon as available.

Comment 10 Larry Troan 2010-09-04 13:16:19 UTC
Adding to above status: kernel-2.6.18-215.el5.bz630124.ppc64.rpm

Comment 11 Bob Dugan 2010-09-04 14:21:08 UTC
Thank you, Red Hat team, from the Chelsio team!  Yes, the PPC64 kernel is one kernel we would like to verify.  However, we would also like to verify the x86 and x86_64 versions.  Let us know when those can be made available.

Comment 12 Larry Troan 2010-09-04 14:26:51 UTC
Peter will move the x86 and x86_64 kernels over when he can get to a computer terminal. The primary focus has been on the ppc64 for IBM to test.

Comment 13 Peter Martuccelli 2010-09-04 14:28:57 UTC
Bob, the x86 and x86_64 kernels are now available in the same download directory that Larry directed you to for the PPC kernel.  Please have Chelsio test this weekend and update the BZ with the test results.

Comment 15 Wen xiong 2010-09-07 12:45:30 UTC
IBM has successfully verified the ppc images with a BAD adapter. We also finished a full regression test on a GOOD adapter with the same images. All of the test cases work fine, and the stress test ran for 60 hours without failing.

Thanks for your help!

Comment 21 Jarod Wilson 2010-09-10 21:41:18 UTC
in kernel-2.6.18-219.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 23 Larry Troan 2010-09-12 14:50:41 UTC
Per the request in comment #13, Chelsio responded on Friday:
> Hi Larry,
> 
> Sorry for the late reply.  To answer your questions:
> 
> > * Does Chelsio have test results from the x86 and x86_64 kernels?
> Our internal QA has passed on both these architectures.
> 
> > * Do you plan to test Itanium (ia64)?
> Yes, we do plan to test on Itanium but the urgency is lower on this
> architecture.

Comment 25 errata-xmlrpc 2011-01-13 21:15:34 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html