Red Hat Bugzilla – Bug 660680
iw_cxgb3 advertises incorrect max cq depth causing stalls on large MPI clusters
Last modified: 2011-05-19 08:37:41 EDT
+++ This bug was initially created as a clone of Bug #628223 +++

Description of problem:
iw_cxgb3 advertises a max cq depth of 256K entries, but the T3 HW only supports a max depth of 64K entries. This causes MPI applications to stall when running in large cluster configurations (like 256NP 32 node clusters). The fix is a 1 liner to drop the max advertised cq depth to 65536. The fix has been posted to linux-rdma@vger.kernel.org and I'll provide a patch to this bug. This is urgent as it breaks large cluster operation.

--- Additional comment from swise@opengridcomputing.com on 2010-08-28 13:50:18 EDT ---
Created attachment 441721
Advertise the correct max cq depth for T3 devices.

This was submitted to linux-rdma@vger.kernel.org today.

--- Additional comment from pm-rhel@redhat.com on 2010-12-07 05:26:33 EST ---
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.6 and Red Hat does not plan to fix this issue in the currently developed update. Contact your manager or support representative in case you need to escalate this bug.

--- Additional comment from swise@opengridcomputing.com on 2010-12-07 10:11:03 EST ---
But it's a 1 line fix! Low risk. High yield.

--- Additional comment from swise@opengridcomputing.com on 2010-12-07 10:18:54 EST ---
Sites like University of Wisconsin and Purdue have large (128+ core) MPI clusters and will see this problem if you don't ship this fix.
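For readers who want the mismatch in the description spelled out, here is a minimal sketch in C. It is not the iw_cxgb3 code: the struct, the constant names, and both functions are hypothetical, and only the two depths (256K advertised, 64K supported) come from this report. It assumes a consumer that sizes its CQ from the advertised limit. How the real driver and hardware react to an oversized CQ is not covered here.

```c
#include <assert.h>
#include <stddef.h>

/* Values taken from the bug description; the names are made up. */
#define HW_MAX_CQ_DEPTH      65536   /* 64K entries the T3 HW supports   */
#define OLD_ADVERTISED_DEPTH 262144  /* 256K entries wrongly advertised  */

/* Hypothetical stand-in for the attributes a device reports. */
struct dev_attr {
    int max_cqe;  /* advertised maximum CQ depth */
};

/* Consumer side: clamp the wanted depth to what the device advertises. */
int pick_cq_depth(const struct dev_attr *attr, int wanted)
{
    assert(attr != NULL);
    return wanted < attr->max_cqe ? wanted : attr->max_cqe;
}

/* Returns 1 if the depth is within what the hardware can hold, else 0. */
int fits_hardware(int depth)
{
    return depth > 0 && depth <= HW_MAX_CQ_DEPTH;
}
```

With max_cqe set to OLD_ADVERTISED_DEPTH, a consumer asking for 200000 entries keeps that value, and fits_hardware() rejects it. With max_cqe set to HW_MAX_CQ_DEPTH, the same request is clamped to 65536 and fits. That is the effect the one-line change is after: the advertised limit and the hardware limit agree.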
I cloned 628223 to track this issue for rhel6.
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux. If you would like it considered as an exception in the current release, please ask your support representative.
This request was erroneously denied for the current release of Red Hat Enterprise Linux. The error has been fixed and this request has been re-proposed for the current release.
Fixed for 6.1
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
@Steve, Could you confirm test commitment that when 6.1 beta is out you will test this bug and post test feedback? Thanks.
(In reply to comment #7)
> @Steve,
> Could you confirm test commitment that when 6.1 beta is out you will test this
> bug and post test feedback?
>
> Thanks.

Definitely. Point me at the kernel images when they're ready, and I'll verify the fix. Thanks.
Patch(es) available on kernel-2.6.32-112.el6
Where can I pull this kernel? Thanks!
~~ Partners and Customers ~~ This bug was included in RHEL 6.1 Beta. Please confirm the status of this request as soon as possible. If you're having problems accessing 6.1 bits, are delayed in your test execution or find in testing that the request was not addressed adequately, please let us know. Thanks!
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html