These patches implement code to detect and recover from T3 adapter parity errors. Here are the patches for the T3 parity detection/recovery code which Chelsio sent to David Miller, the maintainer of the Linux networking subsystem:

[PATCH net-next 0/4] cxgb3: new code to deal with adapter parity errors (http://www.spinics.net/lists/netdev/msg139772.html)
[PATCH net-next 1/4] cxgb3: Add register bit definition for Fatal Parity Error. (http://www.spinics.net/lists/netdev/msg139773.html)
[PATCH net-next 2/4] cxgb3: Set FATALPERREN. (http://www.spinics.net/lists/netdev/msg139774.html)
[PATCH net-next 3/4] cxgb3: Leave interrupts for fatal errors asserted in common code. (http://www.spinics.net/lists/netdev/msg139775.html)
[PATCH net-next 4/4] cxgb3: Avoid flush_workqueue() deadlock. (http://www.spinics.net/lists/netdev/msg139776.html)

Here's David Miller's acceptance of the patches:

Re: [PATCH net-next 0/4] cxgb3: new code to deal with adapter parity errors (http://www.spinics.net/lists/netdev/msg139871.html)

As soon as these appear in David's "netdev" git repository, Chelsio will send out the git change set commit IDs.
IBM is signed up to test and provide feedback.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
*** Bug 630123 has been marked as a duplicate of this bug. ***
Patch 4/4 does not apply to the RHEL5.6 kernel. Please port the patch to
http://people.redhat.com/jwilson/el5/215.el5/src/kernel-2.6.18-215.el5.src.rpm

The hunk that fails is removing a flush_workqueue() call from cxgb_down(). The call is still present but the function has changed in that there is now a t3_sge_stop() right before the flush_workqueue() call. It needs someone more comfortable with the driver to resolve the conflict as it might not be a simple remove anymore.
[[This is Casey Leedom using Bob Dugan's account because I don't have a login. -- Casey]]

Yeah, that function has changed several times in the past so I'm not too surprised that it didn't patch 100% cleanly. Basically, the call to flush_workqueue() just needs to be conditioned on the new function parameter "on_wq" -- i.e.:

    if (!on_wq)
        flush_workqueue(...);

The problem is that the error recovery code is running on the "cxgb3 Work Queue" that that call is trying to flush, which causes the Linux kernel to deadlock waiting for all the threads to terminate ... including the thread which is waiting for the Work Queue to flush! (It would be Really Nice(tm) if Linux had a Work Queue function that would "flush everyone _except me_". Oh well.)

In any case, I can do the change for you or review it. My email address is leedom for direct communications or we can depend on Bob to throw things at me ... :-)

Casey
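For readers following along, here is a minimal userspace model of the conditional flush Casey describes. It is only a sketch, not the actual cxgb3 patch: fake_workqueue, fake_flush_workqueue() and model_cxgb_down() are hypothetical stand-ins invented for illustration, and only the "on_wq" parameter name and the "if (!on_wq)" guard come from the comment above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's work queue (not a kernel type). */
struct fake_workqueue {
    int flush_calls;    /* number of times a flush was requested */
};

/* Hypothetical stand-in for flush_workqueue(); it only counts calls.
 * In the kernel, the real call waits for queued work to finish, which
 * is why calling it from a work item on the same queue can deadlock. */
static void fake_flush_workqueue(struct fake_workqueue *wq)
{
    assert(wq != NULL);
    wq->flush_calls++;
}

/* Models the shape of the fixed cxgb_down(): on_wq is true when the
 * caller is already running on the work queue (the recovery path). */
static void model_cxgb_down(struct fake_workqueue *wq, bool on_wq)
{
    /* ... the rest of the teardown is elided in this model ... */
    if (!on_wq)
        fake_flush_workqueue(wq);   /* safe: caller is not a work item */
}
```

In this model, a call with on_wq set to true (the error recovery path) skips the flush entirely, while a normal shutdown call with on_wq set to false still flushes exactly once.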
Corrected patches have been posted internally for review and possible inclusion in the rhel5.6 and rhel5.5.z stream kernels.
Minor glitch; the build just finished a few minutes ago. Thanks to Doug and Peter. The ppc64 kernel is available on the people page: http://people.redhat.com/peterm/.bz630124/ I believe this is the kernel you are interested in validating. Let me know if otherwise. Please report test results here as soon as they are available.
Adding to above status: kernel-2.6.18-215.el5.bz630124.ppc64.rpm
Thank you, Red Hat team, from the Chelsio team! Yes, the PPC64 kernel is one kernel we would like to verify. However, we would also like to verify the x86 and x86_64 versions. Let us know when those can be made available.
Peter will move the x86 and x86_64 kernels over when he can get to a computer terminal. The primary focus has been on the ppc64 for IBM to test.
Bob, the x86 and x86_64 kernels are now available in the same download directory that Larry directed you to for the PPC kernel. Please have Chelsio test this weekend and update the BZ with the test results.
IBM has successfully verified the ppc images with the BAD adapter. We also finished a full regression test on the GOOD adapter with the same images. All test cases work fine, and the stress test ran for 60 hours without failing. Thanks for your help!
in kernel-2.6.18-219.el5

You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
Per the request in comment #13, Chelsio responded Friday:

> Hi Larry,
>
> Sorry for the late reply. To answer your questions:
>
> > * Does Chelsio have test results from the x86 and x86_64 kernels?
> Our internal QA has passed on both these architectures.
>
> > * Do you plan to test Itanium (ia64)?
> Yes, we do plan to test on Itanium but the urgency is lower on this
> architecture.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0017.html