Red Hat Bugzilla – Bug 279861
OpenMPI issues when both nodes are QLogic PCIe cards
Last modified: 2013-11-03 20:33:59 EST
Description of problem:
When running the mpitests-IMB_MPI1 testsuite from the mpitests package over 2
nodes, where both nodes have InfiniPath_QLE7140 HCAs, the program crashes with
the following error:
error posting send request errno says Invalid argument
dell-pe1950-03.rhts.boston.redhat.com to: dell-pe1950-02.rhts.boston.redhat.com
error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id
173782656 opcode 1
The InfiniBand retry count between two MPI processes has been
exceeded. "Retry count" is defined in the InfiniBand spec 1.2:
The total number of times that the sender wishes the receiver to
retry timeout, packet sequence, etc. errors before posting a
completion error.
This error typically means that there is something awry within the
InfiniBand fabric itself. You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.
Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:
* btl_openib_ib_retry_count - The number of times the sender will
attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
to 10). The actual timeout value used is calculated as:
4.096 microseconds * (2^btl_openib_ib_timeout)
See the InfiniBand spec 1.2 (section 12.7.34) for more details.
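The effect of btl_openib_ib_timeout is easy to see by evaluating the formula quoted above. A minimal sketch (it only evaluates the formula from the message; nothing is queried from the fabric):

```python
# Evaluate the ACK timeout formula quoted in the Open MPI error text:
#   timeout = 4.096 microseconds * (2 ** btl_openib_ib_timeout)

def ack_timeout_seconds(ib_timeout: int) -> float:
    """Effective local ACK timeout for a given btl_openib_ib_timeout value."""
    return 4.096e-6 * (2 ** ib_timeout)

if __name__ == "__main__":
    for val in (10, 14, 20):
        ms = ack_timeout_seconds(val) * 1000
        print(f"btl_openib_ib_timeout={val}: {ms:.3f} ms")
    # The default of 10 gives ~4.194 ms per retry interval; each +1
    # doubles it, so 20 stretches it to roughly 4.29 seconds.
```

These parameters would be passed on the command line, e.g. `mpirun --mca btl_openib_ib_timeout 14 ...`, though as the closing comment notes, raising them only papers over a fabric problem rather than fixing it.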
mpirun noticed that job rank 1 with PID 4844 on node
dell-pe1950-02.rhts.boston.redhat.com exited on signal 15 (Terminated).
Version-Release number of selected component (if applicable):
# rpm -qa | egrep "openib|openmpi|mpitests"
# modinfo ib_ipath
description: QLogic InfiniPath driver
author: QLogic <email@example.com>
vermagic: 2.6.18-43.el5 SMP mod_unload gcc-4.1
parm: qp_table_size:QP table size (uint)
parm: lkey_table_size:LKEY table size in bits (2^n, 1 <= n <= 23) (uint)
parm: max_pds:Maximum number of protection domains to support (uint)
parm: max_ahs:Maximum number of address handles to support (uint)
parm: max_cqes:Maximum number of completion queue entries to support
parm: max_cqs:Maximum number of completion queues to support (uint)
parm: max_qp_wrs:Maximum number of QP WRs to support (uint)
parm: max_qps:Maximum number of QPs to support (uint)
parm: max_sges:Maximum number of SGEs to support (uint)
parm: max_mcast_grps:Maximum number of multicast groups to support (uint)
parm: max_mcast_qp_attached:Maximum number of attached QPs to support
parm: max_srqs:Maximum number of SRQs to support (uint)
parm: max_srq_sges:Maximum number of SRQ SGEs to support (uint)
parm: max_srq_wrs:Maximum number of SRQ WRs support (uint)
parm: ib_ipath_disable_sma:Disable the SMA
parm: cfgports:Set max number of ports to use (ushort)
parm: kpiobufs:Set number of PIO buffers for driver
parm: debug:mask for debug prints (uint)
Steps to Reproduce:
1. Have 2 nodes with QLogic PCIe cards (I used InfiniPath_QLE7140).
2. Build and install mpitests.
3. Run mpitests-IMB_MPI1.
It turns out that this problem, specifically the segfault, wasn't related to
having two ipath cards. Rather, one of the ipath cards was running at a scant
2% of the overall speed of the IB fabric, causing excessive retries that
closed the connection. The mpitest program wasn't built to deal with
connections going away unexpectedly, and segfaulted as a result. As such, I'm
closing this as NOTABUG. The problem went away when the test network was
updated to remove the slow connection that was disrupting the IB fabric.
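A laggard link like the one described can be spotted by comparing per-port link rates across the fabric. A minimal sketch, assuming rate strings in the format the kernel reports under /sys/class/infiniband/<hca>/ports/<n>/rate (e.g. "2.5 Gb/sec (1X)"); the host/port labels below are made-up examples, not from this bug:

```python
# Flag IB ports whose link rate is far below the fastest port observed.
# Rate strings mimic /sys/class/infiniband/*/ports/*/rate output; the
# host/port labels are hypothetical.

def rate_gbps(rate_str: str) -> float:
    """Parse a sysfs-style rate string such as '10 Gb/sec (4X)'."""
    return float(rate_str.split()[0])

def slow_ports(rates: dict, threshold: float = 0.5) -> list:
    """Return ports running below `threshold` of the fastest link seen."""
    best = max(rate_gbps(r) for r in rates.values())
    return [p for p, r in rates.items() if rate_gbps(r) < threshold * best]

if __name__ == "__main__":
    observed = {
        "node-a ports/1": "10 Gb/sec (4X)",
        "node-b ports/1": "10 Gb/sec (4X)",
        "node-c ports/1": "2.5 Gb/sec (1X)",  # the kind of slow link that triggers retries
    }
    print(slow_ports(observed))  # → ['node-c ports/1']
```

The same check can be done by hand with `ibstatus` on each node and eyeballing the "rate:" lines; the point is simply that one underperforming port drags down RC connections that traverse it, producing the RETRY EXCEEDED errors above.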