Description of problem:

I get the following error on MT25204 cards when running the openib test suite:

Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: Catastrophic error detected: internal error
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[00]: 000d0000
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[01]: 00000000
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[02]: 00000000
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[03]: 00000000
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[04]: 00000000
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[05]: 00127ee4
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[06]: ffffffff
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[07]: 00000000
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[08]: 00000000
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[09]: 00000000
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[0a]: 00000000
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[0b]: 00000000
Aug 4 20:58:08 dell-pe1950-02 avahi-daemon[3262]: Interface ib0.IPv6 no longer relevant for mDNS.
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[0c]: 00000000
Aug 4 20:58:08 dell-pe1950-02 avahi-daemon[3262]: Leaving mDNS multicast group on interface ib0.IPv6 with address fe80::202:c902:20:f2ad.
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[0d]: 00000000
Aug 4 20:58:08 dell-pe1950-02 avahi-daemon[3262]: Interface ib0.IPv4 no longer relevant for mDNS.
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[0e]: 00000000
Aug 4 20:58:08 dell-pe1950-02 avahi-daemon[3262]: Leaving mDNS multicast group on interface ib0.IPv4 with address 192.168.1.10.
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[0f]: 00000000
Aug 4 20:58:08 dell-pe1950-02 avahi-daemon[3262]: Withdrawing address record for fe80::202:c902:20:f2ad on ib0.
Aug 4 20:58:08 dell-pe1950-02 avahi-daemon[3262]: Withdrawing address record for 192.168.1.10 on ib0.
Aug 4 20:58:09 dell-pe1950-02 kernel: ACPI: PCI interrupt for device 0000:0c:00.0 disabled
Aug 4 20:58:09 dell-pe1950-02 kernel: ib_mthca: Initializing 0000:0c:00.0
Aug 4 20:58:09 dell-pe1950-02 kernel: PCI: Enabling device 0000:0c:00.0 (0140 -> 0142)
Aug 4 20:58:09 dell-pe1950-02 kernel: ACPI: PCI Interrupt 0000:0c:00.0[A] -> GSI 16 (level, low) -> IRQ 169
Aug 4 20:58:11 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: HCA FW version 1.0.700 is old (1.1.0 is current).
Aug 4 20:58:11 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: If you have problems, try updating your HCA FW.

This only happens on MT25204 cards. The card recovers itself nicely unless the box happens to be running the opensmd daemon, in which case it cannot recover. The error occurs every time my test script runs ibv_rc_pingpong, yet running ibv_rc_pingpong manually does not reproduce it. Maybe things need to happen in a specific sequence?

I'll attach the script to this BZ. Change the line "set pass password" to contain the password of the remote machine you'll be running the script against, and invoke it with:

./openib.exp <remote_host_name>

Make sure that both machines have IB adapters and are in the same IB subnet. The machine you run the script on must have an MT25204 card. If you don't have one, I can let you use the machines in the Westford lab that do.
Version-Release number of selected component (if applicable):

[root@dell-pe1950-02 ~]# rpm -qa | egrep "openib|mthca"
libmthca-1.0.4-1.el5
libmthca-devel-1.0.4-1.el5
openib-perftest-1.2-1.el5
libmthca-devel-1.0.4-1.el5
openib-tvflash-0.9.2-1.el5
openib-mstflint-1.2-1.el5
openib-diags-1.2.7-1.el5
openib-1.2-1.el5
openib-srptools-0.0.6-1.el5

# modinfo ib_mthca
filename:       /lib/modules/2.6.18-38.el5/kernel/drivers/infiniband/hw/mthca/ib_mthca.ko
version:        0.08
license:        Dual BSD/GPL
description:    Mellanox InfiniBand HCA low-level driver
author:         Roland Dreier
srcversion:     D945E6E42EECF8618BF3854
alias:          pci:v00001867d00005E8Csv*sd*bc*sc*i*
alias:          pci:v000015B3d00005E8Csv*sd*bc*sc*i*
alias:          pci:v00001867d00006274sv*sd*bc*sc*i*
alias:          pci:v000015B3d00006274sv*sd*bc*sc*i*
alias:          pci:v00001867d00006282sv*sd*bc*sc*i*
alias:          pci:v000015B3d00006282sv*sd*bc*sc*i*
alias:          pci:v00001867d00006278sv*sd*bc*sc*i*
alias:          pci:v000015B3d00006278sv*sd*bc*sc*i*
alias:          pci:v00001867d00005A44sv*sd*bc*sc*i*
alias:          pci:v000015B3d00005A44sv*sd*bc*sc*i*
depends:        ib_mad,ib_core
vermagic:       2.6.18-38.el5 SMP mod_unload gcc-4.1
parm:           catas_reset_disable:disable reset on catastrophic event if nonzero (int)
parm:           qos_support:Enable QoS support if > 0 (int)
parm:           fw_cmd_doorbell:post FW commands through doorbell page if nonzero (and supported by FW) (int)
parm:           debug_level:Enable debug tracing if > 0 (int)
parm:           msi_x:attempt to use MSI-X if nonzero (int)
parm:           msi:attempt to use MSI if nonzero (int)
parm:           tune_pci:increase PCI burst from the default set by BIOS if nonzero (int)
parm:           num_qp:maximum number of QPs per HCA (int)
parm:           rdb_per_qp:number of RDB buffers per QP (int)
parm:           num_cq:maximum number of CQs per HCA (int)
parm:           num_mcg:maximum number of multicast groups per HCA (int)
parm:           num_mpt:maximum number of memory protection table entries per HCA (int)
parm:           num_mtt:maximum number of memory translation table segments per HCA (int)
parm:           num_udav:maximum number of UD address vectors per HCA (int)
parm:           fmr_reserved_mtts:number of memory translation table segments reserved for FMR (int)
module_sig:     883f35046b38fd66346bc3bfc3c5df311211d009f5de2c99c70471b88f091278de56ebb575e1c904309f6f94576d4d6efff6c12589cf7422fc58cdce2d

How reproducible:
Every time.

Steps to Reproduce:
1. Grab openib.exp from this bug report.
2. Run it with ./openib.exp <remote_machine>
3. Watch /var/log/messages

Actual results:

Expected results:

Additional info:
Created attachment 161181 [details] test script
By the way, this is only happening with MT25204 cards. I also have MT25208 and MT23108 cards, neither of which shows this issue.
So I downloaded the firmware fw-25204-1_2_000-MHES14-XT.bin from the Mellanox website and burned it with the command "mstflint -d mthca0 -i fw-25204-1_2_000-MHES14-XT.bin bb", but the same issue can still be seen.
This appears to be a known issue in the firmware on this card, specifically in how the firmware handles a programming bug in the user's software. When the user space software posts more requests to the work queues than there are open slots in the completion queue, and enough requests complete to overflow the completion queue, this particular firmware raises a catastrophic error, while the firmware on later cards handles the situation gracefully.

Gurhan, can you identify which test you are running that is causing this problem? If so, does changing the depth of the tx/rx queues in the test solve the problem?
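For illustration, here is a minimal libibverbs sketch of the failure pattern described above. This is hypothetical code with made-up queue depths, not code from the actual test suite: a single CQ serves both the send and receive queues of a QP but is sized smaller than the total number of work requests the QP can have outstanding, so draining the work queues can overflow the CQ.

#include <infiniband/verbs.h>

/* Hypothetical example: the CQ has room for only 64 completions... */
struct ibv_qp *create_overcommitted_qp(struct ibv_context *ctx,
                                       struct ibv_pd *pd)
{
    struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0);
    if (!cq)
        return NULL;

    /* ...but it serves send and receive queues that together allow up
     * to 128 + 128 = 256 outstanding work requests.  If enough of them
     * complete before the application polls the CQ, the CQ overflows,
     * which on this firmware reportedly ends in a catastrophic error. */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = {
            .max_send_wr  = 128,
            .max_recv_wr  = 128,
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
        .qp_type = IBV_QPT_RC,
    };
    return ibv_create_qp(pd, &attr);
}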
(In reply to comment #4)
> This appears to be a known issue in the firmware on this card,
> specifically in how the firmware handles a programming bug in the
> user's software. When the user space software posts more requests to
> the work queues than there are open slots in the completion queue,
> and enough requests complete to overflow the completion queue, this
> particular firmware raises a catastrophic error, while the firmware
> on later cards handles the situation gracefully.
>
> Gurhan, can you identify which test you are running that is causing
> this problem? If so, does changing the depth of the tx/rx queues in
> the test solve the problem?

Ahh, identifying it will be tough, if not impossible. I actually tried before opening this bug report but wasn't successful. I am just running openib.exp, which executes the different IB utilities one by one, and the firmware error happens at some random point while the script is running. It is not consistent about which program it happens on. :( I will try it again anyway.
We've opted not to treat this as a blocker bug because the card does not become unusable when this happens: the driver resets the card, reinitializes it, and continues working. The current transfer is interrupted, but subsequent transfers work fine. In addition, this is likely a symptom of a programming error in the user space software. The condition is most likely triggered whenever the number of completion queue entries available is smaller than the total number of outstanding work requests, so the firmware completes more work requests than it has completion queue slots and overflows the completion queue, producing this error. Later versions of the mthca hardware handle this more gracefully, but it is still a user programming bug: there are always supposed to be sufficient completion queue slots to cover all outstanding work requests. I'm not sure if this is worth a release note, or if we should just file this away on kbase in case someone hits this error.
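One way to honor that rule without enlarging the CQ is to bound the number of outstanding work requests in the application itself. The following is a rough sketch under assumed names and depths (none of this comes from the test suite): completions are reaped with ibv_poll_cq() before more work is posted, so the in-flight count never exceeds the CQ capacity.

#include <infiniband/verbs.h>

/* Post 'total' copies of a prepared send work request while keeping at
 * most 'cq_depth' of them outstanding; 'cq' must have been created
 * with at least 'cq_depth' entries.  Illustrative sketch only. */
static int send_bounded(struct ibv_qp *qp, struct ibv_cq *cq,
                        struct ibv_send_wr *wr, int total, int cq_depth)
{
    int posted = 0, completed = 0;
    struct ibv_wc wc[16];

    while (completed < total) {
        /* Post while there is work left and the CQ has headroom. */
        while (posted < total && posted - completed < cq_depth) {
            struct ibv_send_wr *bad;
            if (ibv_post_send(qp, wr, &bad))
                return -1;
            posted++;
        }
        /* Drain completions to free CQ slots before posting more. */
        int n = ibv_poll_cq(cq, 16, wc);
        if (n < 0)
            return -1;
        completed += n;
    }
    return 0;
}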
When it happens, if opensmd is running on the host, then opensmd stalls and has to be restarted too. We might want to add that to the kbase article (or to the release note, if we opt for one) as well.
The opensm issue is one that's unlikely to be hit in the field. Most HPC clusters avoid running opensm on a machine that's heavily loaded with other traffic, and the problem only shows up under heavy load. So, if for no other reason than that people tend to separate management nodes from compute nodes, this shouldn't be a problem.
Not sure if I got this correctly, so please advise if the RHEL5.1 release note update quoted below is correct:

<quote>
Running an openib application on a system equipped with an MT25204 will result in an error if the opensmd daemon is also running. When this occurs, simply restart opensmd.
</quote>

Thanks!
Actually, I think that's more confusing than anything ;-)

The real problem is that openmpi applications will see random segfaults and programs terminating unexpectedly if they are run on a system with the MT25204 hardware and the hardware generates a catastrophic error event. If users see their application crash with a segfault, they should check the output of dmesg to see if the hardware generated a catastrophic error. If it did, this is an indication that the user level application tried to queue more commands than fit into the work completion queue, and a minor modification to the program to increase the size of the completion queue should solve the problem.

In the event that the application segfaults and the card generates this error, then if opensm was running on that machine, it will need to be restarted after the error event in order to continue managing the InfiniBand fabric.
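The "minor modification" mentioned above would look something like this in a libibverbs program; the function name and depths are illustrative assumptions, not the application's actual code. The point is simply that the CQ is created with at least one entry per work request that can be outstanding at once.

#include <infiniband/verbs.h>

#define SEND_DEPTH 128
#define RECV_DEPTH 128

/* Size the CQ to cover every work request that can be outstanding at
 * once; the verbs layer may round the count up to the next size the
 * device supports.  Illustrative sketch only. */
struct ibv_cq *create_adequate_cq(struct ibv_context *ctx)
{
    return ibv_create_cq(ctx, SEND_DEPTH + RECV_DEPTH, NULL, NULL, 0);
}

The same SEND_DEPTH and RECV_DEPTH values would then be passed as max_send_wr and max_recv_wr when creating the QP that feeds this CQ.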
Something like this might work: Mellanox MT25204 hardware has been found to experience an internal error under certain high load conditions. If the ib_mthca driver reports a catastrophic error on this hardware, it is usually related to an insufficient completion queue depth relative to the number of outstanding work requests generated by the user application. Although the driver will reset the card and recover from the error, all existing connections are lost at the time of the error, generally resulting in a segfault in the user application. Additionally, if opensm is running at the time the error occurs, then it will have to be restarted in order to resume proper operation.
thanks Doug, revising as follows: <quote> Hardware testing for the Mellanox MT25204 has revealed that an internal error occurs under certain high-load conditions. When the ib_mthca driver reports a catastrophic error on this hardware, it is usually related to an insufficient completion queue depth relative to the number of outstanding work requests generated by the user application. Although the driver will reset the hardware and recover from such an event, all existing connections are lost at the time of the error. This generally results in a segmentation fault in the user application. Further, if opensm is running at the time the error occurs, then it will have to be manually restarted in order to resume proper operation. </quote>
*** Bug 276111 has been marked as a duplicate of this bug. ***
This request was evaluated by Red Hat Product Management for inclusion, but this component is not scheduled to be updated in the current Red Hat Enterprise Linux release. This request will be reviewed for a future Red Hat Enterprise Linux release.
Adding the same release note to the "Known Issues" section of RHEL5.2. Please advise once this is resolved so we can document it as such. Thanks!
Hi,

The RHEL5.2 release notes will be dropped to translation on April 15, 2008, at which point no further additions or revisions will be entertained.

A mockup of the RHEL5.2 release notes can be viewed at the following link:

http://intranet.corp.redhat.com/ic/intranet/RHEL5u2relnotesmockup.html

Please use the aforementioned link to verify whether your bugzilla is already in the release notes (if it needs to be). Each item in the release notes contains a link to its original bug; as such, you can search through the release notes by bug number.

Cheers,
Don
Tracking this bug for the Red Hat Enterprise Linux 5.3 Release Notes. This Release Note is currently located in the Known Issues section.
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
The current update for RHEL5.3 may help with this, but since I can't reproduce the problem, I'll need Gurhan to let me know.