Bug 1744780
| Summary: | ucx-enabled openmpi test with mpirun fails causing Segmentation fault | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Afom T. Michael <tmichael> |
| Component: | ucx | Assignee: | Jonathan Toppins <jtoppins> |
| Status: | CLOSED ERRATA | QA Contact: | Afom T. Michael <tmichael> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 8.1 | CC: | anto.trande, areber, jarod, rdma-dev-team, yosefe, zguo |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ |
| Target Release: | 8.2 | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Last Closed: | 2020-04-28 15:34:54 UTC | Type: | Bug |
| Bug Blocks: | 1708794 | ||
On RHEL-8.0.0, the same tests pass both on hosts with hfi1 & bnxt_re.
openmpi & mpitests-openmpi versions:
Package           RHEL-8.0.0  RHEL-8.1.0
----------------  ----------  ----------
openmpi           3.1.2-5     4.0.1-3
mpitests-openmpi  5.4.2-4     5.4.2-4
I am testing openmpi for [Bug 1731749] "The libfabric update to 1.8.0 in 8.1 breaks mpirun with containers", to verify that no regression was caused by the libfabric update. The test passed with openmpi-4.0.1-2.el8.x86_64 but failed with openmpi-4.0.1-3.el8.x86_64, both with libfabric-1.8.0-2.el8.x86_64, so it would be an openmpi regression, not libfabric.

(In reply to zguo from comment #2)
> I am testing openmpi for [Bug 1731749] The libfabric update to 1.8.0 in 8.1
> breaks mpirun with containers, to verify no regression was caused by
> libfabric update. Test passed with openmpi-4.0.1-2.el8.x86_64 while failed
> with openmpi-4.0.1-3.el8.x86_64, both with libfabric-1.8.0-2.el8.x86_64, so
> it would be an openmpi regression, not libfabric.

The only difference between 4.0.1-2 and -3 is the enablement of UCX support. Afom, can you add 4.0.1-2 to the matrix in comment #1?

Afom and I have collaborated to test out some things, and with an update from ucx 1.4.0 to 1.5.2, the segmentation fault goes away, so this bug is getting reassigned to the ucx component. Not sure if a full update to 1.5.2 is warranted, but it seems we need at least some fixes backported.

(In reply to Jarod Wilson from comment #6)
> Afom and I have collaborated to test out some things, and with an update
> from ucx 1.4.0 to 1.5.2, the segmentation fault goes away, so this bug is
> getting reassigned to the ucx component. Not sure if a full update to 1.5.2
> is warranted, but it seems we need at least some fixes backported.

When the branch for 8.2 becomes available I will post the v1.5.2 package. A backport is not practical, at least for solving bz1717018 in RHEL-8, because getting even that fix would require backporting a 20-30 commit feature series too. And I am not doing that as a patch-stack.

FYI this is not a regression as the v1.4.0 version of UCX was the only version ever released in RHEL so there is nothing to regress to.

FYI until qa_ack is provided I won't be able to commit the update.
(In reply to Jonathan Toppins from comment #9)
> FYI this is not a regression as the v1.4.0 version of UCX was the only
> version ever released in RHEL so there is nothing to regress to.

Ah. This was originally filed against OpenMPI, where it can be considered a regression, since the prior non-UCX-enabled openmpi didn't crash like this. But something Afom noted: "By default, for Open MPI 4.0 and later, infiniband ports on a device are not used by default. The intent is to use UCX for these devices. You can override this policy by setting the btl_openib_allow_ib MCA parameter to true." So it looks like we can get the prior behavior with that added param, and release-note this for 8.1 if a fix isn't feasible.

I also hit this on a test VM running the ring example from Open MPI.

Using 'mpirun --mca btl_openib_allow_ib true --allow-run-as-root -np 4 /tmp/ring' I still get a segfault. The VM has just one simple ethernet device.

Is there any way I can run my test with openmpi-4.0.1-3.el8.x86_64? For now I just downgraded to openmpi-4.0.1-2.el8.x86_64 taken directly from brew.

(In reply to Adrian Reber from comment #12)
> I also hit this on a test VM running the ring example from Open MPI.
>
> Using 'mpirun --mca btl_openib_allow_ib true --allow-run-as-root -np 4
> /tmp/ring' I still get a segfault. The VM has just one simple ethernet
> device.
>
> Any way I can run my test with openmpi-4.0.1-3.el8.x86_64. For now I just
> downgraded to openmpi-4.0.1-2.el8.x86_64 taken directly from brew.

Yeah, the way to fix it is to drop support for UCX in openmpi until 8.2.

Which version of UCX is used? This issue and https://bugzilla.redhat.com/show_bug.cgi?id=1717018 are fixed in v1.5.2 and above.

Moving to verified since the test on 4.18.0-167.el8.x86_64 with the packages shown below passes. The test was performed on the same hosts where the issue was initially seen.
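The version threshold mentioned in this thread (the fix is reported to be in UCX v1.5.2 and above) could be checked with something like the sketch below. It is only an illustration: the helper names `parse_version` and `ucx_has_fix` are hypothetical, and it assumes the package string starts its version with a plain dotted number, as in the `rpm -qa` output.

```python
import re

# Assumption taken from the comments above: the fix is in UCX v1.5.2 and later.
FIXED_IN = (1, 5, 2)


def parse_version(text):
    """Return the first dotted numeric version found in `text` as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def ucx_has_fix(package):
    """True if the UCX version embedded in `package` is at least FIXED_IN."""
    return parse_version(package) >= FIXED_IN


if __name__ == "__main__":
    # Illustrative inputs: the two versions discussed and the verified package.
    for pkg in ("ucx-1.4.0", "ucx-1.5.2", "ucx-1.6.1-1.el8.x86_64"):
        print(pkg, ucx_has_fix(pkg))
```

This compares only the upstream version, not the RPM release field, so it would not distinguish a rebuilt package carrying backported patches.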
[root@rdma-dev-26 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 Beta (Ootpa)
[root@rdma-dev-26 ~]$ uname -r
4.18.0-167.el8.x86_64
[root@rdma-dev-26 ~]$ rpm -qa | egrep 'rdma|openmpi|ucx|verbs'
libibverbs-26.0-7.el8.x86_64
mpitests-openmpi-5.4.2-4.el8.x86_64
rdma-core-devel-26.0-7.el8.x86_64
libibverbs-utils-26.0-7.el8.x86_64
librdmacm-26.0-7.el8.x86_64
ucx-1.6.1-1.el8.x86_64
openmpi-4.0.2-2.el8.x86_64
librdmacm-utils-26.0-7.el8.x86_64
rdma-core-26.0-7.el8.x86_64
[root@rdma-dev-26 ~]$ ibstatus
Infiniband device 'bnxt_re0' port 1 status:
default gid: fe80:0000:0000:0000:020a:f7ff:feea:cd90
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 100 Gb/sec (4X EDR)
link_layer: Ethernet
[root@rdma-dev-26 ~]$ timeout 3m /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include bnxt_re0:1 -mca mtl '^psm2,psm,ofi' -mca btl '^openib,usnic' -hostfile /root/hfile_one_core -np 2 /usr/lib64/openmpi/bin/mpitests-IMB-MPI1 PingPong
#------------------------------------------------------------
# Intel (R) MPI Benchmarks 2018 Update 1, MPI-1 part
#------------------------------------------------------------
# Date : Fri Jan 24 13:57:28 2020
# Machine : x86_64
# System : Linux
# Release : 4.18.0-167.el8.x86_64
# Version : #1 SMP Sun Dec 15 01:24:23 UTC 2019
# MPI Version : 3.1
# MPI Thread Environment:
# Calling sequence was:
# /usr/lib64/openmpi/bin/mpitests-IMB-MPI1 PingPong
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# PingPong
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 9.43 0.00
1 1000 9.45 0.11
2 1000 9.40 0.21
4 1000 9.43 0.42
8 1000 9.43 0.85
16 1000 9.42 1.70
32 1000 9.46 3.38
64 1000 9.49 6.75
128 1000 9.60 13.33
256 1000 10.10 25.35
512 1000 10.01 51.16
1024 1000 10.23 100.10
2048 1000 10.82 189.23
4096 1000 12.13 337.81
8192 1000 21.25 385.55
16384 1000 37.76 433.87
32768 1000 44.80 731.39
65536 640 55.83 1173.82
131072 320 83.72 1565.56
262144 160 169.54 1546.17
524288 80 289.37 1811.85
1048576 40 526.45 1991.79
2097152 20 1000.64 2095.82
4194304 10 1986.67 2111.23
# All processes entering MPI_Finalize
[root@rdma-dev-26 ~]$ echo $?
0
[root@rdma-dev-26 ~]$
[root@rdma-dev-26 ~]$ timeout 3m /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include bnxt_re0:1 -mca mtl '^psm2,psm,ofi' -mca btl '^openib,usnic' -hostfile /root/hfile_one_core -np 2 /usr/lib64/openmpi/bin/mpitests-IMB-IO S_Read_indv
#------------------------------------------------------------
# Intel (R) MPI Benchmarks 2018 Update 1, MPI-IO part
#------------------------------------------------------------
# Date : Fri Jan 24 13:59:14 2020
# Machine : x86_64
# System : Linux
# Release : 4.18.0-167.el8.x86_64
# Version : #1 SMP Sun Dec 15 01:24:23 UTC 2019
# MPI Version : 3.1
# MPI Thread Environment:
# Calling sequence was:
# /usr/lib64/openmpi/bin/mpitests-IMB-IO S_Read_indv
# Minimum io portion in bytes: 0
# Maximum io portion in bytes: 4194304
#
#
#
# List of Benchmarks to run:
# S_Read_Indv
#---------------------------------------------------
# Benchmarking S_Read_Indv
# #processes = 1
# ( 1 additional process waiting in MPI_Barrier)
#---------------------------------------------------
#
# MODE: AGGREGATE
#
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.01 0.00
1 1000 1.22 0.82
2 1000 1.22 1.63
4 1000 1.22 3.28
8 1000 1.23 6.53
16 1000 1.22 13.08
32 1000 1.23 26.06
64 1000 1.24 51.42
128 1000 1.24 103.28
256 1000 1.24 206.42
512 1000 1.25 409.69
1024 1000 1.28 800.05
2048 1000 1.36 1508.93
4096 1000 1.50 2736.56
8192 1000 1.84 4447.77
16384 1000 2.79 5871.00
32768 1000 4.87 6724.44
65536 640 8.62 7603.50
131072 320 16.62 7887.74
262144 160 33.77 7763.28
524288 80 67.59 7757.43
1048576 40 130.44 8038.47
2097152 20 274.23 7647.34
4194304 10 567.47 7391.17
# All processes entering MPI_Finalize
[root@rdma-dev-26 ~]$ echo $?
0
[root@rdma-dev-26 ~]$ timeout 3m /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include bnxt_re0:1 -mca mtl '^psm2,psm,ofi' -mca btl '^openib,usnic' -hostfile /root/hfile_one_core -np 2 /usr/lib64/openmpi/bin/mpitests-IMB-EXT Window
#------------------------------------------------------------
# Intel (R) MPI Benchmarks 2018 Update 1, MPI-2 part
#------------------------------------------------------------
# Date : Fri Jan 24 13:59:22 2020
# Machine : x86_64
# System : Linux
# Release : 4.18.0-167.el8.x86_64
# Version : #1 SMP Sun Dec 15 01:24:23 UTC 2019
# MPI Version : 3.1
# MPI Thread Environment:
# Calling sequence was:
# /usr/lib64/openmpi/bin/mpitests-IMB-EXT Window
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# Window
#----------------------------------------------------------------
# Benchmarking Window
# #processes = 2
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 100 262.09 262.10 262.09
4 100 260.52 260.53 260.52
8 100 263.02 263.18 263.10
16 100 263.31 263.41 263.36
32 100 263.64 263.94 263.79
64 100 262.22 262.52 262.37
128 100 263.29 263.33 263.31
256 100 261.69 261.70 261.69
512 100 261.79 262.01 261.90
1024 100 263.12 263.14 263.13
2048 100 262.97 262.98 262.98
4096 100 261.92 262.05 261.98
8192 100 261.05 261.15 261.10
16384 100 262.22 262.23 262.22
32768 100 261.73 261.84 261.78
65536 100 261.61 261.73 261.67
131072 100 261.57 261.59 261.58
262144 100 263.26 263.29 263.28
524288 80 263.14 263.16 263.15
1048576 40 260.18 260.45 260.31
2097152 20 260.28 260.29 260.29
4194304 10 261.04 262.29 261.67
# All processes entering MPI_Finalize
[root@rdma-dev-26 ~]$ echo $?
0
[root@rdma-dev-26 ~]$ timeout 3m /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include bnxt_re0:1 -mca mtl '^psm2,psm,ofi' -mca btl '^openib,usnic' -hostfile /root/hfile_one_core -np 2 /usr/lib64/openmpi/bin/mpitests-osu_get_bw
# OSU MPI_Get Bandwidth Test v5.4.1
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Size Bandwidth (MB/s)
1 0.23
2 0.94
4 1.87
8 3.77
16 7.57
32 15.03
64 29.83
128 58.66
256 116.64
512 227.98
1024 437.24
2048 776.50
4096 1145.15
8192 1415.28
16384 1829.16
32768 2147.24
65536 2250.84
131072 1830.33
262144 1786.59
524288 1782.16
1048576 1784.25
2097152 1784.00
4194304 1782.62
[root@rdma-dev-26 ~]$ echo $?
0
[root@rdma-dev-26 ~]$
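The four mpirun invocations above differ only in the benchmark binary and its argument. As a rough sketch (the function name `build_mpirun` and its defaults are hypothetical; the options themselves are copied from the commands in this session), the argument list could be assembled like this. The sketch only prints the command and does not run it.

```python
import shlex

BINDIR = "/usr/lib64/openmpi/bin"


def build_mpirun(binary, *bench_args, device="bnxt_re0:1",
                 hostfile="/root/hfile_one_core", nprocs=2):
    """Assemble the argument list for one benchmark run (nothing is executed)."""
    # MCA parameters exactly as they appear in the commands above.
    mca = [
        ("btl_openib_warn_nonexistent_if", "0"),
        ("btl_openib_if_include", device),
        ("mtl", "^psm2,psm,ofi"),
        ("btl", "^openib,usnic"),
    ]
    argv = ["timeout", "3m", f"{BINDIR}/mpirun",
            "--allow-run-as-root", "--map-by", "node"]
    for key, value in mca:
        argv += ["-mca", key, value]
    argv += ["-hostfile", hostfile, "-np", str(nprocs)]
    argv += [f"{BINDIR}/{binary}", *bench_args]
    return argv


if __name__ == "__main__":
    runs = [
        ("mpitests-IMB-MPI1", "PingPong"),
        ("mpitests-IMB-IO", "S_Read_indv"),
        ("mpitests-IMB-EXT", "Window"),
        ("mpitests-osu_get_bw",),
    ]
    for run in runs:
        # shlex.join quotes the '^...' exclusion lists for the shell.
        print(shlex.join(build_mpirun(*run)))
```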
Test results for sanity on rdma-dev-26:
4.18.0-167.el8.x86_64, bnxt, roce, & bnxt_re0
Result | Status | Test
---------+--------+------------------------------------
PASS | 0 | load module bnxt_re
PASS | 0 | load module bnxt_en
PASS | 0 | ping 172.31.40.126
PASS | 0 | ping6 bnxt_roce/fe80::20a:f7ff:feea:cd90
PASS | 0 | ibstatus reported expected HCA rate
PASS | 0 | vlan bnxt_roce.81 create/delete
PASS | 0 | /usr/sbin/ibstat
PASS | 0 | /usr/sbin/ibstatus
PASS | 0 | systemctl start srp_daemon.service
SKIP | 777 | ibsrpdm
PASS | 0 | systemctl stop srp_daemon
PASS | 0 | client pings server
PASS | 0 | openmpi mpitests-IMB-MPI1 PingPong
PASS | 0 | openmpi mpitests-IMB-IO S_Read_indv
PASS | 0 | openmpi mpitests-IMB-EXT Window
PASS | 0 | openmpi mpitests-osu_get_bw
PASS | 0 | ip multicast addr
PASS | 0 | rping
PASS | 0 | rcopy
PASS | 0 | ib_read_bw
PASS | 0 | ib_send_bw
PASS | 0 | ib_write_bw
PASS | 0 | iser login
PASS | 0 | mount /dev/sdb /iser
PASS | 0 | iser write 1K
PASS | 0 | iser write 1M
PASS | 0 | iser write 1G
PASS | 0 | nfsordma mount
PASS | 0 | nfsordma write 1K
PASS | 0 | nfsordma write 1M
PASS | 0 | nfsordma write 1G
Test results for mpi/openmpi on rdma-dev-26:
4.18.0-167.el8.x86_64, bnxt, roce, & bnxt_re0
Result | Status | Test
---------+--------+------------------------------------
PASS | 0 | openmpi IMB-MPI1 PingPong mpirun one_core
PASS | 0 | openmpi IMB-MPI1 PingPing mpirun one_core
PASS | 0 | openmpi IMB-MPI1 Sendrecv mpirun one_core
PASS | 0 | openmpi IMB-MPI1 Exchange mpirun one_core
PASS | 0 | openmpi IMB-MPI1 Bcast mpirun one_core
PASS | 0 | openmpi IMB-MPI1 Allgather mpirun one_core
PASS | 0 | openmpi IMB-MPI1 Allgatherv mpirun one_core
PASS | 0 | openmpi IMB-MPI1 Gather mpirun one_core
PASS | 0 | openmpi IMB-MPI1 Gatherv mpirun one_core
PASS | 0 | openmpi IMB-MPI1 Scatter mpirun one_core
PASS | 0 | openmpi IMB-MPI1 Scatterv mpirun one_core
PASS | 0 | openmpi IMB-MPI1 Alltoall mpirun one_core
PASS | 0 | openmpi IMB-MPI1 Alltoallv mpirun one_core
PASS | 0 | openmpi IMB-MPI1 Reduce mpirun one_core
PASS | 0 | openmpi IMB-MPI1 Reduce_scatter mpirun one_core
PASS | 0 | openmpi IMB-MPI1 Allreduce mpirun one_core
PASS | 0 | openmpi IMB-MPI1 Barrier mpirun one_core
PASS | 0 | openmpi IMB-IO S_Write_indv mpirun one_core
PASS | 0 | openmpi IMB-IO S_Read_indv mpirun one_core
PASS | 0 | openmpi IMB-IO S_Write_expl mpirun one_core
PASS | 0 | openmpi IMB-IO S_Read_expl mpirun one_core
PASS | 0 | openmpi IMB-IO P_Write_indv mpirun one_core
PASS | 0 | openmpi IMB-IO P_Read_indv mpirun one_core
PASS | 0 | openmpi IMB-IO P_Write_expl mpirun one_core
PASS | 0 | openmpi IMB-IO P_Read_expl mpirun one_core
PASS | 0 | openmpi IMB-IO P_Write_shared mpirun one_core
PASS | 0 | openmpi IMB-IO P_Read_shared mpirun one_core
PASS | 0 | openmpi IMB-IO P_Write_priv mpirun one_core
PASS | 0 | openmpi IMB-IO P_Read_priv mpirun one_core
PASS | 0 | openmpi IMB-IO C_Write_indv mpirun one_core
PASS | 0 | openmpi IMB-IO C_Read_indv mpirun one_core
PASS | 0 | openmpi IMB-IO C_Write_expl mpirun one_core
PASS | 0 | openmpi IMB-IO C_Read_expl mpirun one_core
PASS | 0 | openmpi IMB-IO C_Write_shared mpirun one_core
PASS | 0 | openmpi IMB-IO C_Read_shared mpirun one_core
PASS | 0 | openmpi IMB-EXT Window mpirun one_core
PASS | 0 | openmpi IMB-EXT Unidir_Put mpirun one_core
PASS | 0 | openmpi IMB-EXT Unidir_Get mpirun one_core
PASS | 0 | openmpi IMB-EXT Bidir_Get mpirun one_core
PASS | 0 | openmpi IMB-EXT Bidir_Put mpirun one_core
PASS | 0 | openmpi IMB-EXT Accumulate mpirun one_core
PASS | 0 | openmpi IMB-NBC Ibcast mpirun one_core
PASS | 0 | openmpi IMB-NBC Iallgather mpirun one_core
PASS | 0 | openmpi IMB-NBC Iallgatherv mpirun one_core
PASS | 0 | openmpi IMB-NBC Igather mpirun one_core
PASS | 0 | openmpi IMB-NBC Igatherv mpirun one_core
PASS | 0 | openmpi IMB-NBC Iscatter mpirun one_core
PASS | 0 | openmpi IMB-NBC Iscatterv mpirun one_core
PASS | 0 | openmpi IMB-NBC Ialltoall mpirun one_core
PASS | 0 | openmpi IMB-NBC Ialltoallv mpirun one_core
PASS | 0 | openmpi IMB-NBC Ireduce mpirun one_core
PASS | 0 | openmpi IMB-NBC Ireduce_scatter mpirun one_core
PASS | 0 | openmpi IMB-NBC Iallreduce mpirun one_core
PASS | 0 | openmpi IMB-NBC Ibarrier mpirun one_core
PASS | 0 | openmpi IMB-RMA Unidir_put mpirun one_core
PASS | 0 | openmpi IMB-RMA Unidir_get mpirun one_core
PASS | 0 | openmpi IMB-RMA Bidir_put mpirun one_core
PASS | 0 | openmpi IMB-RMA Bidir_get mpirun one_core
PASS | 0 | openmpi IMB-RMA One_put_all mpirun one_core
PASS | 0 | openmpi IMB-RMA One_get_all mpirun one_core
PASS | 0 | openmpi IMB-RMA All_put_all mpirun one_core
PASS | 0 | openmpi IMB-RMA All_get_all mpirun one_core
PASS | 0 | openmpi IMB-RMA Put_local mpirun one_core
PASS | 0 | openmpi IMB-RMA Put_all_local mpirun one_core
PASS | 0 | openmpi IMB-RMA Exchange_put mpirun one_core
PASS | 0 | openmpi IMB-RMA Exchange_get mpirun one_core
PASS | 0 | openmpi IMB-RMA Accumulate mpirun one_core
PASS | 0 | openmpi IMB-RMA Get_accumulate mpirun one_core
PASS | 0 | openmpi IMB-RMA Fetch_and_op mpirun one_core
PASS | 0 | openmpi IMB-RMA Compare_and_swap mpirun one_core
PASS | 0 | openmpi IMB-RMA Get_local mpirun one_core
PASS | 0 | openmpi IMB-RMA Get_all_local mpirun one_core
PASS | 0 | openmpi OSU acc_latency mpirun one_core
PASS | 0 | openmpi OSU allgather mpirun one_core
PASS | 0 | openmpi OSU allgatherv mpirun one_core
PASS | 0 | openmpi OSU allreduce mpirun one_core
PASS | 0 | openmpi OSU alltoall mpirun one_core
PASS | 0 | openmpi OSU alltoallv mpirun one_core
PASS | 0 | openmpi OSU barrier mpirun one_core
PASS | 0 | openmpi OSU bcast mpirun one_core
PASS | 0 | openmpi OSU bibw mpirun one_core
PASS | 0 | openmpi OSU bw mpirun one_core
PASS | 0 | openmpi OSU cas_latency mpirun one_core
PASS | 0 | openmpi OSU fop_latency mpirun one_core
PASS | 0 | openmpi OSU gather mpirun one_core
PASS | 0 | openmpi OSU gatherv mpirun one_core
PASS | 0 | openmpi OSU get_acc_latency mpirun one_core
PASS | 0 | openmpi OSU get_bw mpirun one_core
PASS | 0 | openmpi OSU get_latency mpirun one_core
PASS | 0 | openmpi OSU hello mpirun one_core
PASS | 0 | openmpi OSU iallgather mpirun one_core
PASS | 0 | openmpi OSU iallgatherv mpirun one_core
PASS | 0 | openmpi OSU ialltoall mpirun one_core
PASS | 0 | openmpi OSU ialltoallv mpirun one_core
PASS | 0 | openmpi OSU ialltoallw mpirun one_core
PASS | 0 | openmpi OSU ibarrier mpirun one_core
PASS | 0 | openmpi OSU ibcast mpirun one_core
PASS | 0 | openmpi OSU igather mpirun one_core
PASS | 0 | openmpi OSU igatherv mpirun one_core
PASS | 0 | openmpi OSU init mpirun one_core
PASS | 0 | openmpi OSU iscatter mpirun one_core
PASS | 0 | openmpi OSU iscatterv mpirun one_core
PASS | 0 | openmpi OSU latency mpirun one_core
PASS | 0 | openmpi OSU mbw_mr mpirun one_core
PASS | 0 | openmpi OSU multi_lat mpirun one_core
PASS | 0 | openmpi OSU put_bibw mpirun one_core
PASS | 0 | openmpi OSU put_bw mpirun one_core
PASS | 0 | openmpi OSU put_latency mpirun one_core
PASS | 0 | openmpi OSU reduce mpirun one_core
PASS | 0 | openmpi OSU reduce_scatter mpirun one_core
PASS | 0 | openmpi OSU scatter mpirun one_core
PASS | 0 | openmpi OSU scatterv mpirun one_core
PASS | 0 | NON-ROOT IMB-MPI1 PingPong
Checking for failures and known issues:
no test failures
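For anyone post-processing result tables like the two above, a minimal sketch of a parser follows. It assumes the "Result | Status | Test" layout shown here; the function names are hypothetical, and treating SKIP as a non-failure is an assumption of this sketch, not something the test harness is known to do.

```python
from collections import Counter


def parse_results(text):
    """Parse 'Result | Status | Test' rows into (result, status, test) tuples."""
    rows = []
    for line in text.splitlines():
        parts = [part.strip() for part in line.split("|", 2)]
        # Skip the header, the '---+---' separator and any free text.
        if len(parts) != 3 or not parts[1].isdigit():
            continue
        rows.append((parts[0], int(parts[1]), parts[2]))
    return rows


def summarize(rows):
    """Count results and list tests that neither passed nor were skipped."""
    counts = Counter(result for result, _, _ in rows)
    failures = [test for result, _, test in rows
                if result not in ("PASS", "SKIP")]  # assumption: SKIP is benign
    return counts, failures


if __name__ == "__main__":
    sample = """\
Result | Status | Test
---------+--------+------------------------------------
PASS | 0 | load module bnxt_re
PASS | 0 | systemctl start srp_daemon.service
SKIP | 777 | ibsrpdm
PASS | 0 | openmpi mpitests-IMB-MPI1 PingPong
"""
    counts, failures = summarize(parse_results(sample))
    print(dict(counts), failures or "no test failures")
```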
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:1590
Description of problem:
Running RHEL-8.1.0 Snapshot-2, the sanity test of openmpi fails with "Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x153918978768)". So far, I have seen this on hosts with hfi1 & bnxt_re HCAs.

Version-Release number of selected component (if applicable):
DISTRO=RHEL-8.1.0-20190820.3
4.18.0-135.el8.x86_64
mpitests-openmpi-5.4.2-4.el8.x86_64
openmpi-4.0.1-3.el8.x86_64
rdma-core-22.3-1.el8.x86_64

$ lspci | grep -i -e ethernet -e infiniband -e omni
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
03:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
03:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
04:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57454 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb Ethernet (rev 01)
$ ibstat
CA 'bnxt_re0'
CA type: Broadcom NetXtreme-C/E RoCE Driver HCA
Number of ports: 1
Firmware version: 212.0.106.0
Hardware version: 0x14e4
Node GUID: 0x020af7fffeeacd90
System image GUID: 0x020af7fffeeacd90
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x001d0000
Port GUID: 0x020af7fffeeacd90
Link layer: Ethernet
$ ibstatus
Infiniband device 'bnxt_re0' port 1 status:
default gid: fe80:0000:0000:0000:020a:f7ff:feea:cd90
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 100 Gb/sec (4X EDR)
link_layer: Ethernet
$ ip addr show
[...snip...]
3: bnxt_roce: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
link/ether 00:0a:f7:ea:cd:90 brd ff:ff:ff:ff:ff:ff
inet 172.31.40.126/24 brd 172.31.40.255 scope global dynamic noprefixroute bnxt_roce
valid_lft 3209sec preferred_lft 3209sec
inet6 fe80::20a:f7ff:feea:cd90/64 scope link noprefixroute
valid_lft forever preferred_lft forever
[...snip...]
$

How reproducible:
Always

Steps to Reproduce:
1. Execute /usr/lib64/openmpi/bin/mpirun with the arguments shown below on hosts with a bnxt_re or hfi1 HCA. Or just run our sanity test script.

Actual results:
timeout 3m /usr/lib64/openmpi/bin/mpirun --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include bnxt_re0:1 -mca mtl '^psm2,psm,ofi' -mca btl '^openib,usnic' -hostfile /root/hfile_one_core -np 2 /usr/lib64/openmpi/bin/mpitests-IMB-MPI1 PingPong
[rdma-dev-25:24279:0:24279] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x153918978768)
==== backtrace ====
0 /lib64/libucs.so.0(+0x18bb0) [0x15391830bbb0]
1 /lib64/libucs.so.0(+0x18d8a) [0x15391830bd8a]
2 /lib64/libuct.so.0(+0x1655b) [0x15391354f55b]
3 /lib64/ld-linux-x86-64.so.2(+0xfd0a) [0x15392558cd0a]
4 /lib64/ld-linux-x86-64.so.2(+0xfe0a) [0x15392558ce0a]
5 /lib64/ld-linux-x86-64.so.2(+0x13def) [0x153925590def]
6 /lib64/libc.so.6(_dl_catch_exception+0x77) [0x153924da7ab7]
7 /lib64/ld-linux-x86-64.so.2(+0x1365e) [0x15392559065e]
8 /lib64/libdl.so.2(+0x11ba) [0x1539245011ba]
9 /lib64/libc.so.6(_dl_catch_exception+0x77) [0x153924da7ab7]
10 /lib64/libc.so.6(_dl_catch_error+0x33) [0x153924da7b53]
11 /lib64/libdl.so.2(+0x1939) [0x153924501939]
12 /lib64/libdl.so.2(dlopen+0x4a) [0x15392450125a]
13 /usr/lib64/openmpi/lib/libopen-pal.so.40(+0x6df05) [0x153924771f05]
14 /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_component_repository_open+0x206) [0x15392474fb16]
15 /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_component_find+0x35a) [0x15392474ea5a]
16 /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_components_register+0x2e) [0x15392475a3ce]
17 /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_register+0x252) [0x15392475a8b2]
18 /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_open+0x15) [0x15392475a915]
19 /usr/lib64/openmpi/lib/libmpi.so.40(ompi_mpi_init+0x674) [0x1539252a3494]
20 /usr/lib64/openmpi/lib/libmpi.so.40(MPI_Init+0x72) [0x1539252d36b2]
21 /usr/lib64/openmpi/bin/mpitests-IMB-MPI1(+0x2a66) [0x559af99eba66]
22 /lib64/libc.so.6(__libc_start_main+0xf3) [0x153924c92873]
23 /usr/lib64/openmpi/bin/mpitests-IMB-MPI1(+0x318e) [0x559af99ec18e]
===================
[rdma-dev-26:24261:0:24261] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7feace7a2768)
==== backtrace ====
0 /lib64/libucs.so.0(+0x18bb0) [0x7feace135bb0]
1 /lib64/libucs.so.0(+0x18d8a) [0x7feace135d8a]
2 /lib64/libuct.so.0(+0x1655b) [0x7feacd46555b]
3 /lib64/ld-linux-x86-64.so.2(+0xfd0a) [0x7feadbf56d0a]
4 /lib64/ld-linux-x86-64.so.2(+0xfe0a) [0x7feadbf56e0a]
5 /lib64/ld-linux-x86-64.so.2(+0x13def) [0x7feadbf5adef]
6 /lib64/libc.so.6(_dl_catch_exception+0x77) [0x7feadb771ab7]
7 /lib64/ld-linux-x86-64.so.2(+0x1365e) [0x7feadbf5a65e]
8 /lib64/libdl.so.2(+0x11ba) [0x7feadaecb1ba]
9 /lib64/libc.so.6(_dl_catch_exception+0x77) [0x7feadb771ab7]
10 /lib64/libc.so.6(_dl_catch_error+0x33) [0x7feadb771b53]
11 /lib64/libdl.so.2(+0x1939) [0x7feadaecb939]
12 /lib64/libdl.so.2(dlopen+0x4a) [0x7feadaecb25a]
13 /usr/lib64/openmpi/lib/libopen-pal.so.40(+0x6df05) [0x7feadb13bf05]
14 /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_component_repository_open+0x206) [0x7feadb119b16]
15 /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_component_find+0x35a) [0x7feadb118a5a]
16 /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_components_register+0x2e) [0x7feadb1243ce]
17 /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_register+0x252) [0x7feadb1248b2]
18 /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_open+0x15) [0x7feadb124915]
19 /usr/lib64/openmpi/lib/libmpi.so.40(ompi_mpi_init+0x674) [0x7feadbc6d494]
20 /usr/lib64/openmpi/lib/libmpi.so.40(MPI_Init+0x72) [0x7feadbc9d6b2]
21 /usr/lib64/openmpi/bin/mpitests-IMB-MPI1(+0x2a66) [0x55da14413a66]
22 /lib64/libc.so.6(__libc_start_main+0xf3) [0x7feadb65c873]
23 /usr/lib64/openmpi/bin/mpitests-IMB-MPI1(+0x318e) [0x55da1441418e]
===================
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 24279 on node 172.31.45.125 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
+ [19-08-20 16:48:58] RQA_check_result -r 139 -t 'openmpi mpitests-IMB-MPI1 PingPong'

Expected results:
The command completes successfully and the test passes.

Additional info:
Similar tests on hosts with cxgb4, mlx4 roce, mlx5 ib, and mlx5 roce passed. Tests with the "mpitests-IMB-IO S_Read_indv", "mpitests-IMB-EXT Window", & "mpitests-osu_get_bw" args failed in a similar way. I'll try to reproduce this on RHEL-8.0 since there was a dependency failure issue with previous RHEL-8.1.0 builds.
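The value 139 passed to RQA_check_result in the log above is consistent with the usual shell convention of reporting "128 + signal number" for a process killed by a signal (139 = 128 + 11, and signal 11 is SIGSEGV). The sketch below decodes such a status. It assumes, without confirmation from the report, that `-r` carries the command's exit status; `decode_exit_status` is a hypothetical helper name.

```python
import signal


def decode_exit_status(status):
    """Describe a shell-style exit status.

    By convention, shells report 128 + N when a command is killed by signal N.
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name} ({signum})"
    return f"exited with status {status}"


if __name__ == "__main__":
    print(decode_exit_status(139))  # killed by SIGSEGV (11)
    print(decode_exit_status(0))    # exited with status 0
```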