Bug 660680 - iw_cxgb3 advertises incorrect max cq depth causing stalls on large MPI clusters
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Doug Ledford
QA Contact: Network QE
Keywords: OtherQA
Reported: 2010-12-07 10:27 EST by Steve Wise
Modified: 2011-05-19 08:37 EDT
Fixed In Version: kernel-2.6.32-112.el6
Doc Type: Bug Fix
Clone Of: 628223
Last Closed: 2011-05-19 08:37:41 EDT
External Trackers
Tracker: Red Hat Product Errata
ID: RHSA-2011:0542
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 6.1 kernel security, bug fix and enhancement update
Last Updated: 2011-05-19 07:58:07 EDT

Description Steve Wise 2010-12-07 10:27:53 EST
+++ This bug was initially created as a clone of Bug #628223 +++

Description of problem:

iw_cxgb3 advertises a max cq depth of 256K entries, but the T3 HW only supports a max depth of 64K entries. This causes MPI applications to stall when running in large cluster configurations (such as 256 NP on 32-node clusters).

The fix is a one-liner that drops the max advertised cq depth to 65536. It has been posted to linux-rdma@vger.kernel.org, and I'll attach a patch to this bug.

This is urgent as it breaks large cluster operation.
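
The mismatch described above can be sketched with a small model. This is only an illustration, not the driver code or the posted patch: the function names, the constant names, and the requested depth of 100000 are hypothetical, and the assumption that a consumer sizes its CQ by trusting the advertised limit is not stated in this report.

```python
# Illustrative model only; names are hypothetical, not taken from iw_cxgb3.
T3_HW_MAX_CQ_DEPTH = 64 * 1024    # 65536 entries: T3 hardware limit per this report
OLD_ADVERTISED_MAX = 256 * 1024   # 262144 entries: value the driver advertised


def pick_cq_depth(wanted: int, advertised_max: int) -> int:
    """Size a CQ the way a consumer might: cap the request at the advertised limit."""
    return min(wanted, advertised_max)


def fits_hardware(depth: int) -> bool:
    """True if the hardware can actually back a CQ of this depth."""
    return depth <= T3_HW_MAX_CQ_DEPTH


wanted = 100_000  # hypothetical depth requested by a large MPI job

before = pick_cq_depth(wanted, OLD_ADVERTISED_MAX)   # limit advertised before the fix
after = pick_cq_depth(wanted, T3_HW_MAX_CQ_DEPTH)    # limit advertised after the fix

print(before, fits_hardware(before))
print(after, fits_hardware(after))
```

With the old advertised limit, the chosen depth exceeds what the hardware supports. With the corrected limit, the request is capped at a depth the hardware can back.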



--- Additional comment from swise@opengridcomputing.com on 2010-08-28 13:50:18 EDT ---

Created attachment 441721
Advertise the correct max cq depth for T3 devices.

This was submitted to linux-rdma@vger.kernel.org today.

--- Additional comment from pm-rhel@redhat.com on 2010-12-07 05:26:33 EST ---

This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.6, and Red Hat does not plan to fix this issue in the currently developed update.

Contact your manager or support representative in case you need to escalate this bug.

--- Additional comment from swise@opengridcomputing.com on 2010-12-07 10:11:03 EST ---

But it's a one-line fix! Low risk. High yield.

--- Additional comment from swise@opengridcomputing.com on 2010-12-07 10:18:54 EST ---

Sites like the University of Wisconsin and Purdue have large (128+ core) MPI clusters and will see this problem if you don't ship this fix.
Comment 1 Steve Wise 2010-12-07 10:28:30 EST
I cloned bug 628223 to track this issue for RHEL 6.
Comment 3 RHEL Product and Program Management 2011-01-06 23:32:42 EST
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.
Comment 4 Suzanne Yeghiayan 2011-01-07 11:16:57 EST
This request was erroneously denied for the current release of Red Hat
Enterprise Linux.  The error has been fixed and this request has been
re-proposed for the current release.
Comment 5 Doug Ledford 2011-01-14 13:12:41 EST
Fixed for 6.1
Comment 6 RHEL Product and Program Management 2011-01-17 18:21:48 EST
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.
Comment 7 Hushan Jia 2011-01-20 23:02:17 EST
@Steve,
Could you confirm that you will test this fix once the 6.1 beta is out and post your test feedback?

Thanks.
Comment 8 Steve Wise 2011-01-20 23:13:51 EST
(In reply to comment #7)
> @Steve,
> Could you confirm that you will test this fix once the 6.1 beta is out and
> post your test feedback?
> 
> Thanks.

Definitely.

Point me at the kernel images when they're ready, and I'll verify the fix.

Thanks.
Comment 9 Aristeu Rozanski 2011-02-03 11:34:22 EST
Patch(es) available on kernel-2.6.32-112.el6
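
One way to check whether a RHEL 6 kernel is at or past the build named above is to compare the build number in its release string against 112. The following is a minimal sketch, not an official check: the helper names are hypothetical, the parsing only understands `2.6.32-N...el6` style strings, and it does not account for fixes backported into earlier z-stream builds.

```python
import platform
import re
from typing import Optional

FIXED_BUILD = 112  # from "kernel-2.6.32-112.el6" in this bug


def el6_build_number(release: str) -> Optional[int]:
    """Return N from a '2.6.32-N[.x.y].el6...' release string, else None."""
    m = re.match(r"^2\.6\.32-(\d+)(?:\.[\d.]+)?\.el6", release)
    return int(m.group(1)) if m else None


def has_cq_depth_fix(release: str) -> Optional[bool]:
    """True/False for recognised RHEL 6 kernels, None when the string is not one."""
    build = el6_build_number(release)
    if build is None:
        return None
    return build >= FIXED_BUILD


# Check the running kernel; the result depends on the machine this runs on.
print(has_cq_depth_fix(platform.release()))
```

A result of `None` means the release string was not recognised as a RHEL 6 2.6.32 kernel, so the sketch makes no claim either way.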
Comment 11 Steve Wise 2011-02-03 11:40:55 EST
Where can I pull this kernel?  

Thanks!
Comment 13 Chris Ward 2011-04-06 07:03:30 EDT
~~ Partners and Customers ~~

This bug was included in RHEL 6.1 Beta. Please confirm the status of this request as soon as possible.

If you're having problems accessing 6.1 bits, are delayed in your test execution or find in testing that the request was not addressed adequately, please let us know.

Thanks!
Comment 14 errata-xmlrpc 2011-05-19 08:37:41 EDT
An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html
