Bug 1878829 - [RHEL8.3] all MVAPICH2 benchmarks fail with RC 1 when run with "mpirun_rsh" on certain RDMA HCAs
Summary: [RHEL8.3] all MVAPICH2 benchmarks fail with RC 1 when run with "mpirun_rsh" on certain RDMA HCAs
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: mvapich2
Version: 8.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 8.0
Assignee: Honggang LI
QA Contact: Infiniband QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-14 15:33 UTC by Brian Chae
Modified: 2020-11-14 09:04 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-10 03:05:44 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
client test log for mvapich2 where all mpirun_rsh benchmarks failed (345.67 KB, text/plain)
2020-09-14 15:33 UTC, Brian Chae

Description Brian Chae 2020-09-14 15:33:36 UTC
Created attachment 1714821 [details]
client test log for mvapich2 where all mpirun_rsh benchmarks failed

Description of problem:

All MVAPICH2 benchmarks fail with RC 1 when run with "mpirun_rsh"; the error messages for each benchmark are shown below:


+ [20-09-04 18:26:59] timeout --preserve-status --kill-after=5m 3m mpirun_rsh -np 2 -hostfile /root/hfile_one_core mpitests-IMB-MPI1 Allgatherv -time 1.5
[src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:1672] Could not modify qpto RTR
[src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:1672] Could not modify qpto RTR
#------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2019 Update 6, MPI-1 part    
#------------------------------------------------------------
# Date                  : Fri Sep  4 18:27:00 2020
# Machine               : x86_64
# System                : Linux
# Release               : 4.18.0-234.el8.x86_64
# Version               : #1 SMP Thu Aug 20 10:25:32 EDT 2020
# MPI Version           : 3.1
# MPI Thread Environment: 


# Calling sequence was: 

# mpitests-IMB-MPI1 Allgatherv -time 1.5 

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT 
# MPI_Op                         :   MPI_SUM  
# 
# 

# List of Benchmarks to run:

# Allgatherv
[rdma-dev-22.lab.bos.redhat.com:mpispawn_1][report_error] connect() failed: Connection refused (111)
[rdma-dev-21.lab.bos.redhat.com:mpispawn_0][report_error] connect() failed: Connection refused (111)
[rdma-dev-22.lab.bos.redhat.com:mpirun_rsh][signal_processor] Caught signal 15, killing job


*** This issue has been observed only on the following RDMA lab hosts:

  - rdma-virt-02/03
    PowerEdge R430 - mlx5 MT27700 CX-4 ib0/ib1
  - rdma-dev-21/22
    PowerEdge R630 - mlx5 MT27700 CX-4 ib0/ib1
  - rdma-virt-00/01
    mlx4 MT27520 CX-3Pro ib0/ib1



*** No such issues were observed, and all MVAPICH2 benchmarks ran successfully on the following RDMA hosts:

   - rdma-qe-06/rdma-qe-07
     mlx5 MT27600 CIB ib0/ib1

   - rdma-dev-10/11
     mlx4 MT27500 CX-3 ib0/ib1

   - rdma-dev-00/01
     mlx4 MT27500 CX-3 ib0/ib1

   - rdma-perf-00/01
     mlx4 MT27500 CX-3 ib0








Version-Release number of selected component (if applicable):

DISTRO=RHEL-8.3.0-20200825.0
+ [20-09-04 14:32:38] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.3 Beta (Ootpa)
+ [20-09-04 14:32:38] uname -a
Linux rdma-dev-22.lab.bos.redhat.com 4.18.0-234.el8.x86_64 #1 SMP Thu Aug 20 10:25:32 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
+ [20-09-04 14:32:38] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-234.el8.x86_64 root=/dev/mapper/rhel_rdma--dev--22-root ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=/dev/mapper/rhel_rdma--dev--22-swap rd.lvm.lv=rhel_rdma-dev-22/root rd.lvm.lv=rhel_rdma-dev-22/swap console=ttyS1,115200n81
+ [20-09-04 14:32:38] rpm -q rdma-core linux-firmware
rdma-core-29.0-3.el8.x86_64
linux-firmware-20200619-99.git3890db36.el8.noarch
+ [20-09-04 14:32:38] tail /sys/class/infiniband/mlx5_0/fw_ver /sys/class/infiniband/mlx5_1/fw_ver /sys/class/infiniband/mlx5_2/fw_ver
==> /sys/class/infiniband/mlx5_0/fw_ver <==
12.16.1020

==> /sys/class/infiniband/mlx5_1/fw_ver <==
12.23.1020

==> /sys/class/infiniband/mlx5_2/fw_ver <==
12.23.1020
+ [20-09-04 14:32:38] lspci
+ [20-09-04 14:32:38] grep -i -e ethernet -e infiniband -e omni -e ConnectX
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
04:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
82:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
82:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]



Installed:
  mpitests-mvapich2-5.6.2-1.el8.x86_64        mvapich2-2.3.3-1.el8.x86_64       



How reproducible:

100% on the HCAs/RDMA hosts specified above


Steps to Reproduce:
1. On the client host

timeout --preserve-status --kill-after=5m 3m mpirun_rsh -np 2 -hostfile /root/hfile_one_core mpitests-IMB-MPI1 Sendrecv -time 1.5
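
For reference, /root/hfile_one_core is the MVAPICH2 hostfile passed to mpirun_rsh. Its exact contents are not captured in this report; a minimal sketch, assuming the standard one-host-per-line format and the two nodes that appear in the log output above, would be:

  rdma-dev-21.lab.bos.redhat.com
  rdma-dev-22.lab.bos.redhat.com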
  


Actual results:


[src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:1672] Could not modify qpto RTR
[src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:1672] Could not modify qpto RTR
#------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2019 Update 6, MPI-1 part    
#------------------------------------------------------------
# Date                  : Fri Sep  4 18:14:56 2020
# Machine               : x86_64
# System                : Linux
# Release               : 4.18.0-234.el8.x86_64
# Version               : #1 SMP Thu Aug 20 10:25:32 EDT 2020
# MPI Version           : 3.1
# MPI Thread Environment: 


# Calling sequence was: 

# mpitests-IMB-MPI1 Sendrecv -time 1.5 

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT 
# MPI_Op                         :   MPI_SUM  
# 
# 

# List of Benchmarks to run:

# Sendrecv
[rdma-dev-22.lab.bos.redhat.com:mpispawn_1][report_error] connect() failed: Connection refused (111)
[rdma-dev-21.lab.bos.redhat.com:mpispawn_0][read_size] Unexpected End-Of-File on file descriptor 6. MPI process died?
[rdma-dev-21.lab.bos.redhat.com:mpispawn_0][read_size] Unexpected End-Of-File on file descriptor 6. MPI process died?
[rdma-dev-21.lab.bos.redhat.com:mpispawn_0][handle_mt_peer] Error while reading PMI socket. MPI process died?
[rdma-dev-21.lab.bos.redhat.com:mpispawn_0][report_error] connect() failed: Connection refused (111)
[rdma-dev-22.lab.bos.redhat.com:mpirun_rsh][signal_processor] Caught signal 15, killing job
[rdma-dev-22.lab.bos.redhat.com:mpirun_rsh][signal_processor] Caught signal 15, killing job



Expected results:

+ [20-09-05 11:36:03] timeout --preserve-status --kill-after=5m 3m mpirun_rsh -np 2 -hostfile /root/hfile_one_core mpitests-IMB-MPI1 Sendrecv -time 1.5
#------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2019 Update 6, MPI-1 part    
#------------------------------------------------------------
# Date                  : Sat Sep  5 11:36:04 2020
# Machine               : x86_64
# System                : Linux
# Release               : 4.18.0-234.el8.x86_64
# Version               : #1 SMP Thu Aug 20 10:25:32 EDT 2020
# MPI Version           : 3.1
# MPI Thread Environment: 


# Calling sequence was: 

# mpitests-IMB-MPI1 Sendrecv -time 1.5 

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT 
# MPI_Op                         :   MPI_SUM  
# 
# 

# List of Benchmarks to run:

# Sendrecv

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv 
# #processes = 2 
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000         1.79         1.79         1.79         0.00
            1         1000         1.98         1.98         1.98         1.01
            2         1000         1.90         1.90         1.90         2.11
            4         1000         1.90         1.90         1.90         4.21
            8         1000         1.92         1.92         1.92         8.34
           16         1000         1.91         1.91         1.91        16.76
           32         1000         1.96         1.96         1.96        32.62
           64         1000         2.02         2.02         2.02        63.50
          128         1000         2.11         2.11         2.11       121.34
          256         1000         2.92         2.92         2.92       175.32
          512         1000         3.09         3.09         3.09       331.33
         1024         1000         3.48         3.48         3.48       588.27
         2048         1000         4.38         4.39         4.39       933.99
         4096         1000         6.31         6.31         6.31      1297.48
         8192         1000         9.94         9.95         9.94      1647.44
        16384         1000        11.63        11.63        11.63      2818.16
        32768         1000        16.19        16.19        16.19      4048.87
        65536          640        19.85        19.85        19.85      6604.17
       131072          320        31.45        31.45        31.45      8334.96
       262144          160        52.86        52.87        52.87      9916.96
       524288           80        97.16        97.16        97.16     10791.76
      1048576           40       187.76       187.77       187.76     11168.93
      2097152           20       388.82       388.86       388.84     10786.13
      4194304           10       808.84       809.53       809.18     10362.36


# All processes entering MPI_Finalize

[1] Failed to dealloc pd (Device or resource busy)
[0] Failed to dealloc pd (Device or resource busy)
[0] 16 at [0x000055b56e58fce0], src/mpid/ch3/src/mpid_rma.c[182]
[0] 8 at [0x000055b56e5fe9c0], src/mpi/comm/create_2level_comm.c[1058]
[0] 24 at [0x000055b56e5fad60], src/mpi/group/grouputil.c[74]
[0] 8 at [0x000055b56e5fef00], src/mpi/comm/create_2level_comm.c[1016]
[0] 56 at [0x000055b56e5fc260], src/mpi/coll/ch3_shmem_coll.c[4040]
[0] 24 at [0x000055b56e5fb7e0], src/mpi/group/grouputil.c[74]
[0] 8 at [0x000055b56e5fcf80], src/mpi/comm/create_2level_comm.c[743]
[0] 8 at [0x000055b56e5fa2e0], src/util/procmap/local_proc.c[93]
[0] 8 at [0x000055b56e5fa820], src/util/procmap/local_proc.c[92]
[0] 16 at [0x000055b56e5eae90], src/mpi/group/grouputil.c[74]
[0] 8 at [0x000055b56e5fdf40], src/util/procmap/local_proc.c[93]
[0] 8 at [0x000055b56e5fe480], src/util/procmap/local_proc.c[92]
[0] 1024 at [0x000055b56e050f50], src/mpi/coll/ch3_shmem_coll.c[4783]
[0] 8 at [0x000055b56e24ede0], src/mpi/coll/ch3_shmem_coll.c[4779]
[0] 312 at [0x000055b56e050d70], src/mpi/coll/ch3_shmem_coll.c[4732]
[0] 208 at [0x000055b56e050c00], src/mpi/coll/ch3_shmem_coll.c[4682]
[0] 8 at [0x000055b56e24ed30], src/mpi/comm/create_2level_comm.c[1607]
[0] 8 at [0x000055b56e24ec80], src/mpi/comm/create_2level_comm.c[1599]
[0] 8 at [0x000055b56e24ef40], src/util/procmap/local_proc.c[93]
[0] 8 at [0x000055b56e24ee90], src/util/procmap/local_proc.c[92]
[0] 16 at [0x000055b56e58fe40], src/mpi/group/grouputil.c[74]
[0] 24 at [0x000055b56e58ffa0], src/mpi/group/grouputil.c[74]
[0] 8 at [0x000055b56e58fef0], src/mpi/comm/create_2level_comm.c[1502]
[0] 8 at [0x000055b56e58fd90], src/mpi/comm/create_2level_comm.c[1478]
[0] 24 at [0x000055b56e0499e0], src/mpi/group/grouputil.c[74]
[0] 8 at [0x000055b56e24e980], src/mpid/ch3/src/mpid_rma.c[182]
[0] 8 at [0x000055b56e24e8d0], src/mpid/ch3/src/mpid_rma.c[182]
[0] 8 at [0x000055b56e24e820], src/mpid/ch3/src/mpid_rma.c[182]
[0] 8 at [0x000055b56e02d6c0], src/mpid/ch3/src/mpid_rma.c[182]
[0] 8 at [0x000055b56e02e200], src/mpid/ch3/src/mpid_rma.c[182]
[0] 8 at [0x000055b56e02e020], src/mpid/ch3/src/mpid_rma.c[182]
[0] 504 at [0x000055b56e051920], src/mpi/comm/commutil.c[328]
[0] 32 at [0x000055b56e02e9e0], src/mpid/ch3/src/mpid_vc.c[110]
[1] 8 at [0x000055cc306d2990], src/mpi/comm/create_2level_comm.c[1058]
[1] 24 at [0x000055cc306d2ed0], src/mpi/group/grouputil.c[74]
[1] 8 at [0x000055cc306d3e90], src/mpi/comm/create_2level_comm.c[1016]
[1] 56 at [0x000055cc306d43d0], src/mpi/coll/ch3_shmem_coll.c[4023]
[1] 24 at [0x000055cc306d5630], src/mpi/group/grouputil.c[74]
[1] 8 at [0x000055cc30128300], src/mpi/comm/create_2level_comm.c[743]
[1] 8 at [0x000055cc306d3410], src/util/procmap/local_proc.c[93]
[1] 8 at [0x000055cc306d3950], src/util/procmap/local_proc.c[92]
[1] 24 at [0x000055cc306d4910], src/mpid/ch3/src/mpid_vc.c[110]
[1] 16 at [0x000055cc306d7b80], src/mpi/group/grouputil.c[74]
[1] 8 at [0x000055cc306d6b30], src/util/procmap/local_proc.c[93]
[1] 8 at [0x000055cc306d7070], src/util/procmap/local_proc.c[92]
[1] 1024 at [0x000055cc30129c00], src/mpi/coll/ch3_shmem_coll.c[4783]
[1] 8 at [0x000055cc30122630], src/mpi/coll/ch3_shmem_coll.c[4779]
[1] 312 at [0x000055cc30122450], src/mpi/coll/ch3_shmem_coll.c[4732]
[1] 208 at [0x000055cc301222e0], src/mpi/coll/ch3_shmem_coll.c[4682]
[1] 8 at [0x000055cc30327d40], src/mpi/comm/create_2level_comm.c[1607]
[1] 8 at [0x000055cc30327f80], src/mpi/comm/create_2level_comm.c[1599]
[1] 8 at [0x000055cc30122230], src/util/procmap/local_proc.c[93]
[1] 8 at [0x000055cc30122180], src/util/procmap/local_proc.c[92]
[1] 24 at [0x000055cc30327ec0], src/mpid/ch3/src/mpid_vc.c[110]
[1] 16 at [0x000055cc30668ef0], src/mpi/group/grouputil.c[74]
[1] 24 at [0x000055cc30327c80], src/mpi/group/grouputil.c[74]
[1] 8 at [0x000055cc30668fa0], src/mpi/comm/create_2level_comm.c[1502]
[1] 8 at [0x000055cc30668e40], src/mpi/comm/create_2level_comm.c[1478]
[1] 24 at [0x000055cc3011d790], src/mpi/group/grouputil.c[74]
[1] 8 at [0x000055cc30327980], src/mpid/ch3/src/mpid_rma.c[182]
[1] 8 at [0x000055cc303278d0], src/mpid/ch3/src/mpid_rma.c[182]
[1] 8 at [0x000055cc30327820], src/mpid/ch3/src/mpid_rma.c[182]
[1] 8 at [0x000055cc301066c0], src/mpid/ch3/src/mpid_rma.c[182]
[1] 8 at [0x000055cc30107200], src/mpid/ch3/src/mpid_rma.c[182]
[1] 8 at [0x000055cc30107020], src/mpid/ch3/src/mpid_rma.c[182]
[1] 504 at [0x000055cc3012a920], src/mpi/comm/commutil.c[328]
[1] 32 at [0x000055cc301079e0], src/mpid/ch3/src/mpid_vc.c[110]


Additional info:

Comment 1 Honggang LI 2020-09-15 10:50:43 UTC
(In reply to Brian Chae from comment #0)

> *** This issue has been observed only on the following RDMA lab hosts.
> 
>   - rdma-virt-02/03
>     PowerEdge R430 - mlx5 MT27700 CX-4 ib0/ib1

I can't get access to rdma-virt-02/03 at the moment.

>   - rdma-dev-21/22
>     PowerEdge R630 - mlx5 MT27700 CX-4 ib0/ib1
>   - rdma-virt-00/01
>     mlx4 MT27520 CX-3Pro ib0/ib1

I checked those machines. They have both RoCE and IB HCAs.

To run mvapich2 over RoCE, please set the two environment variables MV2_IBA_HCA and MV2_USE_RoCE. This is a known issue.

[root@rdma-dev-21 ~]$ grep -i distro /etc/motd
                           DISTRO=RHEL-8.4.0-20200914.n.0
      Job Whiteboard: Reserve Workflow provision of distro RHEL-8.4.0-20200914.n.0 on a specific system for 86400 seconds
[root@rdma-dev-21 ~]$ 
[root@rdma-dev-21 ~]$ rpm -q mvapich2
mvapich2-2.3.3-1.el8.x86_64
[root@rdma-dev-21 ~]$ 
[root@rdma-dev-21 ~]$ mpirun_rsh -np 2 -hostfile /root/hfile_one_core MV2_IBA_HCA=mlx5_0 MV2_USE_RoCE=1  mpitests-IMB-MPI1 Allgatherv
#------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2019 Update 6, MPI-1 part    
#------------------------------------------------------------
# Date                  : Tue Sep 15 06:49:54 2020
# Machine               : x86_64
# System                : Linux
# Release               : 4.18.0-235.el8.x86_64
# Version               : #1 SMP Thu Sep 3 10:48:30 EDT 2020
# MPI Version           : 3.1
# MPI Thread Environment: 


# Calling sequence was: 

# mpitests-IMB-MPI1 Allgatherv 

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT 
# MPI_Op                         :   MPI_SUM  
# 
# 

# List of Benchmarks to run:

# Allgatherv

#----------------------------------------------------------------
# Benchmarking Allgatherv 
# #processes = 2 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.05         0.06         0.06
            1         1000         1.23         2.25         1.74
            2         1000         1.46         2.03         1.74
            4         1000         0.67         2.80         1.74
            8         1000         0.67         2.80         1.74
           16         1000         0.68         2.81         1.74
           32         1000         1.70         1.80         1.75
           64         1000         1.36         2.18         1.77
          128         1000         1.79         2.45         2.12
          256         1000         1.18         3.26         2.22
          512         1000         1.38         3.31         2.34
         1024         1000         1.27         3.36         2.32
         2048         1000         2.21         2.84         2.52
         4096         1000         2.45         4.34         3.39
         8192         1000         3.70         5.82         4.76
        16384         1000         9.18         9.51         9.35
        32768         1000        12.49        12.55        12.52
        65536          640        19.07        19.58        19.32
       131072          320        36.16        36.47        36.32
       262144          160        66.81        67.42        67.11
       524288           80       125.33       127.12       126.23
      1048576           40       467.54       468.40       467.97
      2097152           20       953.67       954.22       953.95
      4194304           10      2076.20      2077.48      2076.84


# All processes entering MPI_Finalize

[1] Failed to dealloc pd (Device or resource busy)
[0] Failed to dealloc pd (Device or resource busy)
[0] 16 at [0x000055a3a45fe5a0], src/mpid/ch3/src/mpid_rma.c[182]
[0] 56 at [0x000055a3a4ce1ec0], src/mpi/coll/ch3_shmem_coll.c[4040]
[0] 8 at [0x000055a3a4ce0720], src/mpi/comm/create_2level_comm.c[1058]
[0] 24 at [0x000055a3a4ce3ed0], src/mpi/group/grouputil.c[74]
[0] 8 at [0x000055a3a4c87fc0], src/mpi/comm/create_2level_comm.c[1016]
[0] 24 at [0x000055a3a4cdf760], src/mpi/group/grouputil.c[74]
[0] 8 at [0x000055a3a4ce1980], src/mpi/comm/create_2level_comm.c[743]
[0] 8 at [0x000055a3a4ce33c0], src/util/procmap/local_proc.c[93]
[0] 8 at [0x000055a3a4cdece0], src/util/procmap/local_proc.c[92]
[0] 16 at [0x000055a3a4c40620], src/mpi/group/grouputil.c[74]
[0] 8 at [0x000055a3a4ce2400], src/util/procmap/local_proc.c[93]
[0] 8 at [0x000055a3a4ce2940], src/util/procmap/local_proc.c[92]
[0] 1024 at [0x000055a3a45f9920], src/mpi/coll/ch3_shmem_coll.c[4783]
[0] 8 at [0x000055a3a45fe040], src/mpi/coll/ch3_shmem_coll.c[4779]
[0] 312 at [0x000055a3a45fe3c0], src/mpi/coll/ch3_shmem_coll.c[4732]
[0] 208 at [0x000055a3a45fe250], src/mpi/coll/ch3_shmem_coll.c[4682]
[0] 8 at [0x000055a3a45fdf90], src/mpi/comm/create_2level_comm.c[1607]
[0] 8 at [0x000055a3a4601cd0], src/mpi/comm/create_2level_comm.c[1599]
[0] 8 at [0x000055a3a45fe1a0], src/util/procmap/local_proc.c[93]
[0] 8 at [0x000055a3a45fe0f0], src/util/procmap/local_proc.c[92]
[0] 16 at [0x000055a3a4601ab0], src/mpi/group/grouputil.c[74]
[0] 24 at [0x000055a3a4601c10], src/mpi/group/grouputil.c[74]
[0] 8 at [0x000055a3a4601b60], src/mpi/comm/create_2level_comm.c[1502]
[0] 8 at [0x000055a3a4601a00], src/mpi/comm/create_2level_comm.c[1478]
[0] 24 at [0x000055a3a4c87ce0], src/mpi/group/grouputil.c[74]
[0] 8 at [0x000055a3a45fb430], src/mpid/ch3/src/mpid_rma.c[182]
[0] 8 at [0x000055a3a4601f70], src/mpid/ch3/src/mpid_rma.c[182]
[0] 8 at [0x000055a3a4601ec0], src/mpid/ch3/src/mpid_rma.c[182]
[0] 8 at [0x000055a3a4601e10], src/mpid/ch3/src/mpid_rma.c[182]
[0] 8 at [0x000055a3a45fc590], src/mpid/ch3/src/mpid_rma.c[182]
[0] 8 at [0x000055a3a45f60a0], src/mpid/ch3/src/mpid_rma.c[182]
[0] 504 at [0x000055a3a4778cd0], src/mpi/comm/commutil.c[328]
[0] 32 at [0x000055a3a45d90b0], src/mpid/ch3/src/mpid_vc.c[110]
[1] 56 at [0x000056210d2d85b0], src/mpi/coll/ch3_shmem_coll.c[4023]
[1] 8 at [0x000056210d2d9570], src/mpi/comm/create_2level_comm.c[1058]
[1] 24 at [0x000056210cc19f70], src/mpi/group/grouputil.c[74]
[1] 8 at [0x000056210d2dd7a0], src/mpi/comm/create_2level_comm.c[1016]
[1] 24 at [0x000056210d2db790], src/mpi/group/grouputil.c[74]
[1] 8 at [0x000056210d2d9ab0], src/mpi/comm/create_2level_comm.c[743]
[1] 8 at [0x000056210d2dc750], src/util/procmap/local_proc.c[93]
[1] 8 at [0x000056210d2dcc90], src/util/procmap/local_proc.c[92]
[1] 24 at [0x000056210d2d9030], src/mpid/ch3/src/mpid_vc.c[110]
[1] 16 at [0x000056210cbf0450], src/mpi/group/grouputil.c[74]
[1] 8 at [0x000056210d2dbcd0], src/util/procmap/local_proc.c[93]
[1] 8 at [0x000056210d2dc210], src/util/procmap/local_proc.c[92]
[1] 1024 at [0x000056210cf367e0], src/mpi/coll/ch3_shmem_coll.c[4783]
[1] 8 at [0x000056210cbf3e80], src/mpi/coll/ch3_shmem_coll.c[4779]
[1] 312 at [0x000056210cbf3ca0], src/mpi/coll/ch3_shmem_coll.c[4732]
[1] 208 at [0x000056210cbf3b30], src/mpi/coll/ch3_shmem_coll.c[4682]
[1] 8 at [0x000056210cbfbb70], src/mpi/comm/create_2level_comm.c[1607]
[1] 8 at [0x000056210cbf3920], src/mpi/comm/create_2level_comm.c[1599]
[1] 8 at [0x000056210cbf3a80], src/util/procmap/local_proc.c[93]
[1] 8 at [0x000056210cbf39d0], src/util/procmap/local_proc.c[92]
[1] 24 at [0x000056210cbfbcf0], src/mpid/ch3/src/mpid_vc.c[110]
[1] 16 at [0x000056210d281fa0], src/mpi/group/grouputil.c[74]
[1] 24 at [0x000056210cbfbab0], src/mpi/group/grouputil.c[74]
[1] 8 at [0x000056210cbfba00], src/mpi/comm/create_2level_comm.c[1502]
[1] 8 at [0x000056210d281ef0], src/mpi/comm/create_2level_comm.c[1478]
[1] 24 at [0x000056210cbf84a0], src/mpi/group/grouputil.c[74]
[1] 8 at [0x000056210cbf5430], src/mpid/ch3/src/mpid_rma.c[182]
[1] 8 at [0x000056210cbfbf70], src/mpid/ch3/src/mpid_rma.c[182]
[1] 8 at [0x000056210cbfbec0], src/mpid/ch3/src/mpid_rma.c[182]
[1] 8 at [0x000056210cbfbe10], src/mpid/ch3/src/mpid_rma.c[182]
[1] 8 at [0x000056210cbf6590], src/mpid/ch3/src/mpid_rma.c[182]
[1] 8 at [0x000056210cbf00a0], src/mpid/ch3/src/mpid_rma.c[182]
[1] 504 at [0x000056210cd72cd0], src/mpi/comm/commutil.c[328]
[1] 32 at [0x000056210cbd30b0], src/mpid/ch3/src/mpid_vc.c[110]

Comment 4 Honggang LI 2020-10-10 03:05:44 UTC
To run mvapich2 on a system with multiple HCAs or a RoCE HCA, the parameters MV2_IBA_HCA and MV2_USE_RoCE are needed.
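
For reference, a minimal sketch of the resulting invocation, modeled on the command shown in comment 1 (the HCA name mlx5_0 and the hostfile path are specific to this report's test hosts and may differ on other systems):

  mpirun_rsh -np 2 -hostfile /root/hfile_one_core \
      MV2_IBA_HCA=mlx5_0 MV2_USE_RoCE=1 \
      mpitests-IMB-MPI1 Allgatherv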

