Bug 451164

Summary: Firmware error with MT25204 Infiniband HCAs
Product: Red Hat Enterprise Linux 4 Reporter: Gurhan Ozen <gozen>
Component: openib Assignee: Doug Ledford <dledford>
Status: CLOSED ERRATA QA Contact:
Severity: high Docs Contact:
Priority: high    
Version: 4.7 CC: ddomingo, jburke, mgahagan, peterm, riek, rlerch
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Hardware testing for the Mellanox MT25204 has revealed that an internal error occurs under certain high-load conditions. When the ib_mthca driver reports a catastrophic error on this hardware, it is usually related to an insufficient completion queue depth relative to the number of outstanding work requests generated by the user application. Although the driver will reset the hardware and recover from such an event, all existing connections at the time of the error will be lost. This generally results in a segmentation fault in the user application. Further, if opensm is running at the time the error occurs, then you need to manually restart it in order to resume proper operation.
Story Points: ---
Clone Of:
: 488813 509904 Environment:
Last Closed: 2009-05-18 20:35:27 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 251934    
Bug Blocks: 458752, 488813, 509904    

Comment 1 Don Domingo 2008-06-13 12:19:18 UTC
added to RHEL4.7 release notes under "Known Issues":

<quote>
Hardware testing for the Mellanox MT25204 has revealed that an internal error
occurs under certain high-load conditions. When the ib_mthca driver reports a
catastrophic error on this hardware, it is usually related to an insufficient
completion queue depth relative to the number of outstanding work requests
generated by the user application.

Although the driver will reset the hardware and recover from such an event, all
existing connections at the time of the error will be lost. This generally
results in a segmentation fault in the user application. Further, if opensm is
running at the time the error occurs, then you need to manually restart it in
order to resume proper operation.
</quote>

please advise if any further revisions are required. thanks!
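
To illustrate the queue-depth relationship described in the note above, here is a minimal sketch of a sizing check. It rests on assumptions that are not stated in this report: that every work request is signaled (produces exactly one completion), and that the completion queue should hold at least the sum of the send and receive queue depths of every queue pair attached to it. In libibverbs terms, the depth is the `cqe` argument to `ibv_create_cq` and the per-queue-pair limits are `max_send_wr` and `max_recv_wr`. The helper names and the workload figures are hypothetical.

```python
def required_cq_depth(queue_pairs):
    """Return a conservative completion queue depth.

    queue_pairs: iterable of (max_send_wr, max_recv_wr) tuples, one per
    queue pair that reports completions to the same completion queue.
    Assumption: every work request is signaled, so each outstanding
    request may need one completion queue entry.
    """
    return sum(send + recv for send, recv in queue_pairs)


def cq_is_deep_enough(cq_depth, queue_pairs):
    """True if cq_depth covers the worst-case number of outstanding requests."""
    return cq_depth >= required_cq_depth(queue_pairs)


# Hypothetical workload: 4 queue pairs, each with 128 send and 128 receive slots.
qps = [(128, 128)] * 4
print(required_cq_depth(qps))        # 1024
print(cq_is_deep_enough(512, qps))   # False
print(cq_is_deep_enough(1024, qps))  # True
```

An application that cannot afford a queue this deep can instead cap the number of requests it keeps outstanding so that the total stays below the depth it requested.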

Comment 3 RHEL Program Management 2008-09-05 17:25:13 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 7 Don Domingo 2008-10-05 23:56:34 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
Hardware testing for the Mellanox MT25204 has revealed that an internal error
occurs under certain high-load conditions. When the ib_mthca driver reports a
catastrophic error on this hardware, it is usually related to an insufficient
completion queue depth relative to the number of outstanding work requests
generated by the user application.

Although the driver will reset the hardware and recover from such an event, all
existing connections at the time of the error will be lost. This generally
results in a segmentation fault in the user application. Further, if opensm is
running at the time the error occurs, then you need to manually restart it in
order to resume proper operation.

Comment 8 Peter Martuccelli 2008-10-07 13:37:56 UTC
Dev ACK for release note.

Comment 15 errata-xmlrpc 2009-05-18 20:35:27 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1022.html