Bug 251934
| Summary: | Firmware error with MT25204 Infiniband HCAs | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Gurhan Ozen <gozen> |
| Component: | kernel | Assignee: | Doug Ledford <dledford> |
| Status: | CLOSED WONTFIX | QA Contact: | Martin Jenner <mjenner> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 5.1 | CC: | ddomingo, dzickus, jburke, peterm, riek, rlerch, syeghiay |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: |
(all architectures)
Hardware testing for the Mellanox MT25204 has revealed that an internal error occurs under certain high-load conditions. When the ib_mthca driver reports a catastrophic error on this hardware, it is usually related to an insufficient completion queue depth relative to the number of outstanding work requests generated by the user application.
Although the driver will reset the hardware and recover from such an event, all existing connections at the time of the error will be lost. This generally results in a segmentation fault in the user application. Further, if opensm is running at the time the error occurs, then you need to manually restart it in order to resume proper operation.
|
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2008-11-06 19:07:47 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 222082, 391221, 451164, 454962 | | |
| Attachments: |
|
||||||
|
Description
Gurhan Ozen
2007-08-13 15:28:00 UTC
Created attachment 161181 [details]
test script
By the way this is only happening with MT25204 cards. I also have MT25208 and MT23108, both of which don't give this issue. So I downloaded the firmware fw-25204-1_2_000-MHES14-XT.bin from the Mellanox website and burned it with the "mstflint -d mthca0 -i fw-25204-1_2_000-MHES14-XT.bin bb" command, and the same issue can still be seen.

This appears to possibly be a known issue in the firmware on this card itself. However, it appears to also possibly be a known issue in the firmware as it relates to handling a programming bug in the user's software. Specifically, when the user space software posts more requests to the work queue than there are open spots in the completion queue, and enough requests complete to overflow the completion queue, then this particular firmware gives a catastrophic error while the firmware on later cards handles the situation gracefully.

Gurhan, can you identify what test you are running that is causing this problem? If so, does changing the depth of tx/rx queues in the test solve the problem?

(In reply to comment #4)
> This appears to possibly be a known issue in the firmware on this card itself.
> However, it appears to also possibly be a known issue in the firmware as it
> relates to handling a programming bug in the user's software. Specifically,
> when the user space software posts more requests to the work queue than there
> are open spots in the completion queue, and enough requests complete to overflow
> the completion queue, then this particular firmware gives a catastrophic error
> while the firmware on later cards handles the situation gracefully.
>
> Gurhan, can you identify what test you are running that is causing this problem?
> If so, does changing the depth of tx/rx queues in the test solve the problem?

Ahh, identifying them will be tough if not impossible. I actually tried it before opening this bug report, but wasn't successful.
I am just running openib.exp, which executes different IB utilities one by one, and while that script is running the firmware error happens at some random point. It is not consistent which program this happens on :( Will try it again anyway.

We've opted not to treat this as a blocker bug because the card does not become unusable when this happens. Instead, the driver resets the card, reinitializes the card, then continues on working. The current transfer will be interrupted, but subsequent transfers will work fine. In addition, this is likely a symptom of a programming error on the part of the user space software, in that this condition is most likely triggered whenever the number of completion queue entries available is smaller than the total number of outstanding work requests, resulting in the firmware completing more work requests than it has completion queue slots. When it overflows the completion queue, it develops this error. Later versions of the mthca hardware handle this more gracefully, but it is still a user programming bug in that there are always supposed to be sufficient completion queue slots to process all outstanding work requests.

I'm not sure if this is worth a release note, or if we should just file this away on kbase in case someone hits this error.

When it happens, if opensmd is running on the host, then opensmd stalls and it has to be restarted too. We might want to add that to the kbase article (or release note if we opt to) as well.

The opensm issue is one that's unlikely to get hit in the field. Usually, most HPC clusters avoid running opensm on a machine that's heavily loaded with other traffic, and only when loaded with traffic does this problem show up. So, if for no other reason than people tend to separate management nodes and compute nodes in terms of roles, this shouldn't be a problem.
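
For reference, the sizing rule described in the comments above (enough completion queue slots for every outstanding work request) can be sketched as a small model. This is only an illustration under stated assumptions: every work request is assumed to generate a completion, all queue pairs are assumed to share one completion queue, and the function names `min_cq_depth` and `overflowed_completions` are made up for this sketch. In a libibverbs program the inputs would roughly correspond to the `max_send_wr` and `max_recv_wr` values requested for each queue pair, and the result to the `cqe` argument of `ibv_create_cq`.

```python
def min_cq_depth(queue_pairs):
    """Return the smallest CQ depth that can hold one completion per WR.

    queue_pairs is an iterable of (max_send_wr, max_recv_wr) tuples, one
    per queue pair that reports completions to the same completion queue.
    Assumption: every posted work request produces a completion.
    """
    return sum(send_wr + recv_wr for send_wr, recv_wr in queue_pairs)


def overflowed_completions(cq_depth, outstanding_wrs):
    """Model the worst case: every outstanding WR completes before the
    application polls the CQ. Returns how many completions had no free slot.
    """
    queued = 0
    overflowed = 0
    for _ in range(outstanding_wrs):
        if queued < cq_depth:
            queued += 1       # completion fits in a free CQ slot
        else:
            overflowed += 1   # no slot left: the overflow case from this bug
    return overflowed


# Hypothetical application with two queue pairs, 64 send + 64 receive WRs each.
qps = [(64, 64), (64, 64)]
depth = min_cq_depth(qps)
print(depth)                               # 256
print(overflowed_completions(128, 256))    # 128 completions would not fit
print(overflowed_completions(depth, 256))  # 0
```

With the undersized queue in the second call, half of the completions have nowhere to go, which is the condition the comments above describe as triggering the error on MT25204 firmware.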
not sure if i got this correctly, please advise if RHEL5.1 release note update quoted below is correct:

<quote>
Running an openib application on a system equipped with an MT25204 will result in an error if the opensmd daemon is also running. When this occurs, simply restart opensmd.
</quote>

please advise. thanks!

Actually, I think that's more confusing than anything ;-) The real problem is that openmpi applications will see random segfaults and programs terminating unexpectedly if they are run on a system with the MT25204 hardware and the hardware generates a catastrophic error event. If the user should experience their application crashing with a segfault, then they should check the output of dmesg to see if their hardware generated a catastrophic error. If it did, this is an indication that the user level application tried to queue too many commands to fit into the work completion queue, and a minor modification to the program to increase the size of the completion queue should solve the problem. In the event that the application segfaults, and the card generates this error, then if opensm was running on that machine, it will need to be restarted after the error event in order to continue managing the InfiniBand fabric.

Something like this might work:

Mellanox MT25204 hardware has been found to experience an internal error under certain high load conditions. If the ib_mthca driver reports a catastrophic error on this hardware, it is usually related to an insufficient completion queue depth relative to the number of outstanding work requests generated by the user application. Although the driver will reset the card and recover from the error, all existing connections are lost at the time of the error, generally resulting in a segfault in the user application. Additionally, if opensm is running at the time the error occurs, then it will have to be restarted in order to resume proper operation.
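
The dmesg check suggested above could be scripted along these lines. This is a rough sketch, not a supported tool: the exact wording of the ib_mthca message is not quoted anywhere in this bug, so the pattern below (a line mentioning both `ib_mthca` and "catastrophic error") is an assumption, and the sample log text is made up purely for illustration.

```python
import re

# Assumed pattern: the real driver message wording is not quoted in this bug,
# so this only looks for "ib_mthca" followed by "catastrophic error".
CATAS_RE = re.compile(r"ib_mthca.*catastrophic error", re.IGNORECASE)


def find_catastrophic_errors(dmesg_text):
    """Return the lines of dmesg_text that look like an ib_mthca
    catastrophic error report."""
    return [line for line in dmesg_text.splitlines() if CATAS_RE.search(line)]


# Hypothetical sample input, NOT real driver output.
sample = (
    "eth0: link up\n"
    "ib_mthca 0000:01:00.0: Catastrophic error detected (made-up sample line)\n"
    "ib0: enabling connected mode\n"
)

hits = find_catastrophic_errors(sample)
for line in hits:
    print(line)
if hits:
    print("Check the application's completion queue depth; "
          "restart opensm if it was running on this host.")
```

In practice the text would come from the output of `dmesg` captured after the application segfaults, rather than from a hard-coded string.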
thanks Doug, revising as follows:

<quote>
Hardware testing for the Mellanox MT25204 has revealed that an internal error occurs under certain high-load conditions. When the ib_mthca driver reports a catastrophic error on this hardware, it is usually related to an insufficient completion queue depth relative to the number of outstanding work requests generated by the user application. Although the driver will reset the hardware and recover from such an event, all existing connections are lost at the time of the error. This generally results in a segmentation fault in the user application. Further, if opensm is running at the time the error occurs, then it will have to be manually restarted in order to resume proper operation.
</quote>

*** Bug 276111 has been marked as a duplicate of this bug. ***

This request was evaluated by Red Hat Product Management for inclusion, but this component is not scheduled to be updated in the current Red Hat Enterprise Linux release. This request will be reviewed for a future Red Hat Enterprise Linux release.

adding same release note to "Known Issues" of RHEL5.2. please advise if resolved so we can document as such. thanks!

Hi, the RHEL5.2 release notes will be dropped to translation on April 15, 2008, at which point no further additions or revisions will be entertained. a mockup of the RHEL5.2 release notes can be viewed at the following link: http://intranet.corp.redhat.com/ic/intranet/RHEL5u2relnotesmockup.html

please use the aforementioned link to verify if your bugzilla is already in the release notes (if it needs to be). each item in the release notes contains a link to its original bug; as such, you can search through the release notes by bug number.

Cheers,
Don

Tracking this bug for the Red Hat Enterprise Linux 5.3 Release Notes.

This Release Note is currently located in the Known Issues section.

Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?"
and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

The current update for rhel5.3 may help with this, but since I can't reproduce it I'll need Gurhan to let me know. |