Red Hat Bugzilla – Bug 1408316
openmpi hfi_wait_for_device causes 15s delay
Last modified: 2017-08-17 11:11:56 EDT
Created attachment 1234881: C MPI source for hello-world

Description of problem:
MPI programs take 15s to start execution while each process waits for the /dev/hfi1_0 device.

Version-Release number of selected component (if applicable):
openmpi-1.10.3-3.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
See the attached C and Fortran90 sources.
1. module load mpi/openmpi-x86_64
2. Compile either the Fortran or the C "hello world" program:
   2a. mpicc mpi-hello.c -o mpi-hello-c
   2b. mpifort mpi-hello.f90 -o mpi-hello-f
3. time mpirun -np 4 mpi-hello-c
   time mpirun -np 4 mpi-hello-f

Actual results:
spud.cam.nist.gov.15211hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
spud.cam.nist.gov.15212hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
spud.cam.nist.gov.15214hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
spud.cam.nist.gov.15213hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
Hello world from process 1 of 4
Hello world from process 2 of 4
Hello world from process 0 of 4
Hello world from process 3 of 4

real 0m15.270s
user 0m0.088s
sys 0m0.069s

Expected results:
No errors, execution time < 1s

Additional info:
After sudo modprobe hfi1:

$ time mpirun -np 4 mpi-hello-c
spud.cam.nist.gov.15358hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
spud.cam.nist.gov.15359hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
--------------------------------------------------------------------------
[[18561,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: spud

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
spud.cam.nist.gov.15361hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
spud.cam.nist.gov.15360hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
Hello world from process 2 of 4
Hello world from process 0 of 4
Hello world from process 3 of 4
Hello world from process 1 of 4
[spud.cam.nist.gov:15356] 3 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[spud.cam.nist.gov:15356] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

real 0m15.269s
user 0m0.091s
sys 0m0.075s
Created attachment 1234882: Fortran90 MPI source for hello-world
Forgot to mention this is a bug submitted by a community user using a CentOS 7.3 system. Thank you!
This is not an openmpi issue. It is a libfabric and libpsm2 issue.

[root@rdma-dev-02 ~]$ fi_info
rdma-dev-02.3246hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
verbs: IB-0x80fe version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_RC
verbs: IB-0x80fe version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_IB_RDM
UDP: UDP-IP version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
sockets: IP version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
sockets: IP version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
sockets: IP version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
[root@rdma-dev-02 ~]$ rpm -q libfabric libpsm2 openmpi
libfabric-1.3.0-3.el7.x86_64
libpsm2-10.2.33-1.el7.x86_64
package openmpi is not installed
Hi Chris,

Could you please update libfabric to libfabric-1.4.0 and try again? Please download the SRPM from the following link. You will need to rebuild it with the rpmbuild tool.

https://koji.fedoraproject.org/koji/packageinfo?packageID=20963

[root@rdma-dev-02 tmp]$ rpm -qf $(which fi_info)
libfabric-1.4.0-1.el7.x86_64
[root@rdma-dev-02 tmp]$ time fi_info
provider: verbs fabric: IB-0x80fe domain: mlx5_0 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_RC
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: sockets fabric: 10.16.40.0/24 domain: lom_1 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 10.16.40.0/24 domain: lom_1 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 10.16.40.0/24 domain: lom_1 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.0.0/24 domain: mlx5_ib0 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.0.0/24 domain: mlx5_ib0 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.0.0/24 domain: mlx5_ib0 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.1.0/24 domain: mlx5_ib1 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.1.0/24 domain: mlx5_ib1 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.1.0/24 domain: mlx5_ib1 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.2.0/24 domain: mlx5_ib0.8002 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.2.0/24 domain: mlx5_ib0.8002 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.2.0/24 domain: mlx5_ib0.8002 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.4.0/24 domain: mlx5_ib0.8004 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.4.0/24 domain: mlx5_ib0.8004 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.4.0/24 domain: mlx5_ib0.8004 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.6.0/24 domain: mlx5_ib0.8006 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.6.0/24 domain: mlx5_ib0.8006 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.6.0/24 domain: mlx5_ib0.8006 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.3.0/24 domain: mlx5_ib1.8003 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.3.0/24 domain: mlx5_ib1.8003 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.3.0/24 domain: mlx5_ib1.8003 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.5.0/24 domain: mlx5_ib1.8005 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.5.0/24 domain: mlx5_ib1.8005 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.5.0/24 domain: mlx5_ib1.8005 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.7.0/24 domain: mlx5_ib1.8007 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.7.0/24 domain: mlx5_ib1.8007 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.7.0/24 domain: mlx5_ib1.8007 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 127.0.0.0/8 domain: lo version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 127.0.0.0/8 domain: lo version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 127.0.0.0/8 domain: lo version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP

real 0m0.245s
user 0m0.005s
sys 0m0.019s
[root@rdma-dev-02 tmp]$
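The rebuild step requested in this comment can be sketched as follows. This is only a minimal sketch, assuming the SRPM has already been downloaded from the Koji page linked above. The file name libfabric-1.4.0-1.src.rpm and the helper name rebuild_srpm are hypothetical placeholders, not names taken from this bug.

```shell
#!/bin/sh
# Hypothetical file name: substitute the SRPM actually downloaded from Koji.
SRPM="libfabric-1.4.0-1.src.rpm"

# rebuild_srpm FILE: build binary RPMs from a source RPM.
# "rpmbuild --rebuild" installs the source package and builds its binaries.
rebuild_srpm() {
    rpmbuild --rebuild "$1"
}

# Example invocation, left commented out so nothing is built by accident:
# rebuild_srpm "$SRPM"
```

With a default rpmbuild configuration the resulting packages usually land under ~/rpmbuild/RPMS/ (an assumption about the local %_topdir setting), from where they can be installed with yum or rpm.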
This bug could be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1354417
*** Bug 1354417 has been marked as a duplicate of this bug. ***
Confirmed - updating to libfabric-1.4.0-1.el7.centos.x86_64 resolves the issue. Thank you!

[schanzle@spud src]$ rpm -q libfabric
libfabric-1.4.0-1.el7.centos.x86_64
[schanzle@spud src]$ time mpirun -np 4 mpi-hello-c
Hello world from process 0 of 4
Hello world from process 1 of 4
Hello world from process 2 of 4
Hello world from process 3 of 4

real 0m0.262s
user 0m0.067s
sys 0m0.081s
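To check whether a given system still carries an affected libfabric, something like the following could be used. It is a rough sketch that assumes only what this thread reports (1.3.0 shows the delay, 1.4.0 does not). The helper names version_at_least and libfabric_has_fix are made up for illustration, and the comparison relies on the version sort of GNU sort.

```shell
#!/bin/sh
# Assumption taken from this thread: 1.3.0 shows the delay, 1.4.0 does not.
FIXED_VERSION="1.4.0"

# version_at_least A B: succeed if version A >= version B.
# Relies on "sort -V" (version sort) from GNU coreutils.
version_at_least() {
    lowest="$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)"
    [ "$lowest" = "$2" ]
}

# libfabric_has_fix VERSION: succeed if VERSION is expected to include the fix.
libfabric_has_fix() {
    version_at_least "$1" "$FIXED_VERSION"
}

# Report on the installed package, but only where rpm and libfabric exist.
if command -v rpm >/dev/null 2>&1 &&
   installed="$(rpm -q --queryformat '%{VERSION}' libfabric 2>/dev/null)"; then
    if libfabric_has_fix "$installed"; then
        echo "libfabric $installed: expected to include the fix"
    else
        echo "libfabric $installed: may show the 15 s hfi_wait_for_device delay"
    fi
fi
```

The version threshold is a heuristic drawn from the reports here, not from a changelog; a distribution could also backport the fix into an older version number.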
This bug is still on RHEL7 with libfabric-1.3 libfabric-1.3.0-3.el7.x86_64, and the link does not show a more recent version for RHEL7, only for Fedora
(In reply to Alexandre Strube from comment #9)
> This bug is still on RHEL7 with libfabric-1.3 libfabric-1.3.0-3.el7.x86_64,
> and the link does not show a more recent version for RHEL7, only for Fedora

This bug will be fixed for RHEL-7.4.
It looks like the upstream fix is at https://github.com/ofiwg/libfabric/commit/31384811a549cb7c3c7f8fba6f326a1850a8a5b1
Is there a workaround for this with 1.3.0?
(In reply to Orion Poplawski from comment #12)
> Is there a workaround for this with 1.3.0?

No.
Reproducer:

[root@rdma-qe-06 ~]$ rpm -qf $(which fi_info)
libfabric-1.3.0-3.el7.x86_64
[root@rdma-qe-06 ~]$ fi_info
rdma-qe-06.56339hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
verbs: IB-0x80fe version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_RC
verbs: IB-0x80fe version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_IB_RDM
UDP: UDP-IP version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
sockets: IP version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
sockets: IP version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
sockets: IP version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP

Verification:

[root@rdma-qe-06 ~]$ rpm -q libfabric
libfabric-1.4.1-1.el7.x86_64
[root@rdma-qe-06 ~]$ time fi_info
provider: verbs fabric: IB-0x80fe domain: mlx5_0 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_RC
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: UDP-IP domain: udp version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: sockets fabric: 10.16.40.0/24 domain: lom_1 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 10.16.40.0/24 domain: lom_1 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 10.16.40.0/24 domain: lom_1 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.0.0/24 domain: mlx5_ib0 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.0.0/24 domain: mlx5_ib0 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.0.0/24 domain: mlx5_ib0 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.1.0/24 domain: mlx5_ib1 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.1.0/24 domain: mlx5_ib1 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.1.0/24 domain: mlx5_ib1 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.2.0/24 domain: mlx5_ib0.8002 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.2.0/24 domain: mlx5_ib0.8002 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.2.0/24 domain: mlx5_ib0.8002 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.6.0/24 domain: mlx5_ib0.8006 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.6.0/24 domain: mlx5_ib0.8006 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.6.0/24 domain: mlx5_ib0.8006 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.4.0/24 domain: mlx5_ib0.8004 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.4.0/24 domain: mlx5_ib0.8004 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.4.0/24 domain: mlx5_ib0.8004 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.5.0/24 domain: mlx5_ib1.8005 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.5.0/24 domain: mlx5_ib1.8005 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.5.0/24 domain: mlx5_ib1.8005 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.3.0/24 domain: mlx5_ib1.8003 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.3.0/24 domain: mlx5_ib1.8003 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.3.0/24 domain: mlx5_ib1.8003 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.7.0/24 domain: mlx5_ib1.8007 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.7.0/24 domain: mlx5_ib1.8007 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.31.7.0/24 domain: mlx5_ib1.8007 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 127.0.0.0/8 domain: lo version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 127.0.0.0/8 domain: lo version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 127.0.0.0/8 domain: lo version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP

real 0m0.247s
user 0m0.001s
sys 0m0.026s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:2011
(In reply to Orion Poplawski from comment #12)
> Is there a workaround for this with 1.3.0?

Try using "--mca pml ob1 --mca btl self,tcp"; this forces Open MPI to use the TCP interface from the very beginning.
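Applied to the reproducer from the description, the suggested options could be wrapped as shown below. This is only an illustrative sketch: the function name mpirun_tcp is hypothetical, and whether the options avoid the delay on a particular system has not been verified here.

```shell
#!/bin/sh
# mpirun_tcp ARGS...: run mpirun with the MCA options suggested in this comment,
# selecting the ob1 PML and restricting the BTLs to self and tcp.
# The function name is hypothetical; the options are quoted from the comment.
mpirun_tcp() {
    mpirun --mca pml ob1 --mca btl self,tcp "$@"
}

# Example invocation, mirroring the reproducer from the description
# (commented out because it needs Open MPI and the compiled program):
# mpirun_tcp -np 4 mpi-hello-c
```

Open MPI also reads MCA parameters from environment variables prefixed with OMPI_MCA_, so exporting OMPI_MCA_pml=ob1 and OMPI_MCA_btl=self,tcp should be an equivalent way to set the same options without changing each command line.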