Bug 1810189 - [RHEL7.8] Segmentation fault while running openmpi3 IMB-EXT Accumulate mpirun one_core test on QIB
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: ucx
Version: 7.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Jonathan Toppins
QA Contact: Brian Chae
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-04 16:49 UTC by Brian Chae
Modified: 2021-01-27 17:23 UTC (History)
4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-27 17:23:00 UTC
Target Upstream Version:


Attachments

Description Brian Chae 2020-03-04 16:49:56 UTC
Description of problem:

A consistent segmentation fault occurs while running the openmpi3 IMB-EXT Accumulate mpirun one_core test.

Hosts on which the tests ran: 
rdma-dev-06 (server) / rdma-dev-07 (client)

Image used:
RHEL-7.8-20200225.1 Server x86_64

Version-Release number of selected component (if applicable):

+ [20-03-02 06:58:14] cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.8 (Maipo)
+ [20-03-02 06:58:14] uname -a
Linux rdma-dev-07.lab.bos.redhat.com 3.10.0-1127.el7.x86_64 #1 SMP Tue Feb 18 16:39:12 EST 2020 x86_64 x86_64 x86_64 GNU/Linux
+ [20-03-02 06:58:14] cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-1127.el7.x86_64 root=UUID=3f0496ee-241f-4be8-9dee-9bbf9fcdbee4 ro console=tty0 rd_NO_PLYMOUTH amd_iommu=on crashkernel=auto console=ttyS1,115200
+ [20-03-02 06:58:14] rpm -q rdma-core linux-firmware
rdma-core-22.4-1.el7.x86_64
linux-firmware-20191203-76.gite8a0f4c.el7.noarch

OPENMPI
=======

================================================================================
 Package                Arch        Version            Repository          Size
================================================================================
Installing:
 mpitests-openmpi3      x86_64      5.4.2-1.el7        beaker-Server      328 k
 openmpi3               x86_64      3.1.3-2.el7        beaker-Server      2.9 M

Transaction Summary
================================================================================
Install  2 Packages

Total download size: 3.3 M
Installed size: 12 M
Downloading packages:
--------------------------------------------------------------------------------
Total                                               31 MB/s | 3.3 MB  00:00     
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : openmpi3-3.1.3-2.el7.x86_64                                  1/2 
  Installing : mpitests-openmpi3-5.4.2-1.el7.x86_64                         2/2 
  Verifying  : openmpi3-3.1.3-2.el7.x86_64                                  1/2 
  Verifying  : mpitests-openmpi3-5.4.2-1.el7.x86_64                         2/2 

Installed:
  mpitests-openmpi3.x86_64 0:5.4.2-1.el7      openmpi3.x86_64 0:3.1.3-2.el7     


How reproducible:

Always reproducible - 2 out of 2 runs

Steps to Reproduce:
1. In an RDMA client-server topology, with a QIB device on both hosts, install the OPENMPI packages listed above

2. Add the server and client HCAs' IP addresses to /root/hfile_one_core
3. Run the following mpirun command

timeout 5m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include qib0:1 -mca mtl '^psm2,ofi' -mca btl openib,self mpitests-IMB-EXT Accumulate -time 1.5
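The steps above can be sketched as a small script. This is a hypothetical reproduction helper, not part of the original report: the hostfile entries are placeholder IPs (172.31.0.107 is rdma-dev-07's qib_ib0 address from the `ip addr` output below; 172.31.0.106 is an assumed address for rdma-dev-06), and the mpirun command is only echoed so the sketch is safe to run on a machine without the QIB fabric.

```shell
#!/bin/sh
# Hypothetical repro sketch; IPs and hostfile path are assumptions.
HOSTFILE=/tmp/hfile_one_core
printf '172.31.0.106\n172.31.0.107\n' > "$HOSTFILE"

# Build the failing command line (same MCA flags as the report).
CMD="timeout 5m mpirun -hostfile $HOSTFILE -np 2 --allow-run-as-root \
--map-by node -mca btl_openib_warn_nonexistent_if 0 \
-mca btl_openib_if_include qib0:1 -mca mtl '^psm2,ofi' \
-mca btl openib,self mpitests-IMB-EXT Accumulate -time 1.5"

# Echo rather than execute, so the sketch works without an IB setup;
# replace 'echo' with 'eval' on the actual test hosts.
echo "$CMD"
```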

Actual results:


#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 2018 Update 1, MPI-2 part    
#------------------------------------------------------------
# Date                  : Mon Mar  2 04:20:33 2020
# Machine               : x86_64
# System                : Linux
# Release               : 3.10.0-1127.el7.x86_64
# Version               : #1 SMP Tue Feb 18 16:39:12 EST 2020
# MPI Version           : 3.1
# MPI Thread Environment: 


# Calling sequence was: 

# mpitests-IMB-EXT Accumulate -time 1.5

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM  
#
#

# List of Benchmarks to run:

# Accumulate
[rdma-dev-07:45404:0:45404] Caught signal 11 (Segmentation fault: <unknown si_code> at address 0x4e)
==== backtrace ====
    0  /lib64/libucs.so.0(+0x17970) [0x7f91046bb970]
    1  /lib64/libucs.so.0(+0x17b22) [0x7f91046bbb22]
    2  /lib64/libc.so.6(+0x36400) [0x7f91193d3400]
    3  /usr/lib64/openmpi3/lib/openmpi/mca_btl_openib.so(mca_btl_openib_get+0x131) [0x7f91063a2b71]
    4  /usr/lib64/openmpi3/lib/openmpi/mca_osc_rdma.so(ompi_osc_get_data_blocking+0x1bc) [0x7f91027a8c0c]
    5  /usr/lib64/openmpi3/lib/openmpi/mca_osc_rdma.so(+0x11703) [0x7f91027b2703]
    6  /usr/lib64/openmpi3/lib/openmpi/mca_osc_rdma.so(ompi_osc_rdma_accumulate+0x187) [0x7f91027b7587]
    7  /usr/lib64/openmpi3/lib/libmpi.so.40(PMPI_Accumulate+0x2d3) [0x7f9119a19a43]
    8  mpitests-IMB-EXT() [0x4088bd]
    9  mpitests-IMB-EXT() [0x4063c8]
   10  mpitests-IMB-EXT() [0x402016]
   11  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f91193bf555]
   12  mpitests-IMB-EXT() [0x402286]
===================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node rdma-dev-07 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
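Since the backtrace above resolves only `mca_btl_openib_get` and a few exported symbols, a symbolized core dump would help pinpoint the faulting line. The following triage sketch is an assumption on my part, not something the reporter ran; package names follow RHEL 7 debuginfo conventions.

```shell
# Hypothetical triage sketch: enable core dumps before re-running the
# failing mpirun, then symbolize the crash once debuginfo is installed.
ulimit -c unlimited     # allow the crashing rank to write a core file
ulimit -c               # confirm; prints "unlimited"

# With debuginfo installed (e.g. debuginfo-install openmpi3 ucx),
# a full backtrace of the mca_btl_openib_get frame can be read with:
#   gdb mpitests-IMB-EXT core.<pid> -batch -ex 'bt full'
```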

Expected results:

Normal benchmark output with statistics, and no segmentation fault

Additional info:

QIB and interface info:

+ [20-03-02 04:09:49] ibstatus
Infiniband device 'qib0' port 1 status:
	default gid:	 fe80:0000:0000:0000:0011:7500:006f:33a2
	base lid:	 0x1b
	sm lid:		 0xd
	state:		 4: ACTIVE
	phys state:	 5: LinkUp
	rate:		 40 Gb/sec (4X QDR)
	link_layer:	 InfiniBand

Infiniband device 'qib0' port 2 status:
	default gid:	 fe80:0000:0000:0001:0011:7500:006f:33a3
	base lid:	 0x14
	sm lid:		 0x1
	state:		 4: ACTIVE
	phys state:	 5: LinkUp
	rate:		 40 Gb/sec (4X QDR)
	link_layer:	 InfiniBand

+ [20-03-02 04:09:49] ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: lom_1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 14:18:77:32:68:05 brd ff:ff:ff:ff:ff:ff
    inet 10.16.45.178/21 brd 10.16.47.255 scope global noprefixroute dynamic lom_1
       valid_lft 80925sec preferred_lft 80925sec
    inet6 2620:52:0:102f:1618:77ff:fe32:6805/64 scope global noprefixroute dynamic 
       valid_lft 2591917sec preferred_lft 604717sec
    inet6 fe80::1618:77ff:fe32:6805/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
3: lom_2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 14:18:77:32:68:06 brd ff:ff:ff:ff:ff:ff
4: qib_ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc pfifo_fast state UP group default qlen 256
    link/infiniband 80:00:00:03:fe:80:00:00:00:00:00:00:00:11:75:00:00:6f:33:a2 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 172.31.0.107/24 brd 172.31.0.255 scope global noprefixroute dynamic qib_ib0
       valid_lft 2947sec preferred_lft 2947sec
    inet6 fe80::211:7500:6f:33a2/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
5: qib_ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP group default qlen 256
    link/infiniband 80:00:00:05:fe:80:00:00:00:00:00:01:00:11:75:00:00:6f:33:a3 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 172.31.1.107/24 brd 172.31.1.255 scope global noprefixroute dynamic qib_ib1
       valid_lft 2886sec preferred_lft 2886sec
    inet6 fe80::211:7500:6f:33a3/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
6: qib_ib0.8004@qib_ib0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc pfifo_fast state LOWERLAYERDOWN group default qlen 256
    link/infiniband 80:00:00:07:fe:80:00:00:00:00:00:00:00:11:75:00:00:6f:33:a2 brd 00:ff:ff:ff:ff:12:40:1b:80:04:00:00:00:00:00:00:ff:ff:ff:ff
7: qib_ib0.8010@qib_ib0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc pfifo_fast state LOWERLAYERDOWN group default qlen 256
    link/infiniband 80:00:00:09:fe:80:00:00:00:00:00:00:00:11:75:00:00:6f:33:a2 brd 00:ff:ff:ff:ff:12:40:1b:80:10:00:00:00:00:00:00:ff:ff:ff:ff
8: qib_ib0.8008@qib_ib0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc pfifo_fast state LOWERLAYERDOWN group default qlen 256
    link/infiniband 80:00:00:0b:fe:80:00:00:00:00:00:00:00:11:75:00:00:6f:33:a2 brd 00:ff:ff:ff:ff:12:40:1b:80:08:00:00:00:00:00:00:ff:ff:ff:ff
9: qib_ib0.8002@qib_ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc pfifo_fast state UP group default qlen 256
    link/infiniband 80:00:00:0d:fe:80:00:00:00:00:00:00:00:11:75:00:00:6f:33:a2 brd 00:ff:ff:ff:ff:12:40:1b:80:02:00:00:00:00:00:00:ff:ff:ff:ff
    inet 172.31.2.107/24 brd 172.31.2.255 scope global noprefixroute dynamic qib_ib0.8002
       valid_lft 2881sec preferred_lft 2881sec
    inet6 fe80::211:7500:6f:33a2/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
10: qib_ib0.8014@qib_ib0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc pfifo_fast state LOWERLAYERDOWN group default qlen 256
    link/infiniband 80:00:00:0f:fe:80:00:00:00:00:00:00:00:11:75:00:00:6f:33:a2 brd 00:ff:ff:ff:ff:12:40:1b:80:14:00:00:00:00:00:00:ff:ff:ff:ff
11: qib_ib0.8012@qib_ib0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc pfifo_fast state LOWERLAYERDOWN group default qlen 256
    link/infiniband 80:00:00:11:fe:80:00:00:00:00:00:00:00:11:75:00:00:6f:33:a2 brd 00:ff:ff:ff:ff:12:40:1b:80:12:00:00:00:00:00:00:ff:ff:ff:ff
12: qib_ib0.8016@qib_ib0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc pfifo_fast state LOWERLAYERDOWN group default qlen 256
    link/infiniband 80:00:00:13:fe:80:00:00:00:00:00:00:00:11:75:00:00:6f:33:a2 brd 00:ff:ff:ff:ff:12:40:1b:80:16:00:00:00:00:00:00:ff:ff:ff:ff
13: qib_ib0.8006@qib_ib0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc pfifo_fast state LOWERLAYERDOWN group default qlen 256
    link/infiniband 80:00:00:15:fe:80:00:00:00:00:00:00:00:11:75:00:00:6f:33:a2 brd 00:ff:ff:ff:ff:12:40:1b:80:06:00:00:00:00:00:00:ff:ff:ff:ff
14: qib_ib1.8013@qib_ib1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc pfifo_fast state LOWERLAYERDOWN group default qlen 256
    link/infiniband 80:00:00:17:fe:80:00:00:00:00:00:01:00:11:75:00:00:6f:33:a3 brd 00:ff:ff:ff:ff:12:40:1b:80:13:00:00:00:00:00:00:ff:ff:ff:ff
15: qib_ib1.8011@qib_ib1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc pfifo_fast state LOWERLAYERDOWN group default qlen 256
    link/infiniband 80:00:00:19:fe:80:00:00:00:00:00:01:00:11:75:00:00:6f:33:a3 brd 00:ff:ff:ff:ff:12:40:1b:80:11:00:00:00:00:00:00:ff:ff:ff:ff
16: qib_ib1.8009@qib_ib1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc pfifo_fast state LOWERLAYERDOWN group default qlen 256
    link/infiniband 80:00:00:1b:fe:80:00:00:00:00:00:01:00:11:75:00:00:6f:33:a3 brd 00:ff:ff:ff:ff:12:40:1b:80:09:00:00:00:00:00:00:ff:ff:ff:ff
17: qib_ib1.8003@qib_ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP group default qlen 256
    link/infiniband 80:00:00:1d:fe:80:00:00:00:00:00:01:00:11:75:00:00:6f:33:a3 brd 00:ff:ff:ff:ff:12:40:1b:80:03:00:00:00:00:00:00:ff:ff:ff:ff
    inet 172.31.3.107/24 brd 172.31.3.255 scope global noprefixroute dynamic qib_ib1.8003
       valid_lft 3058sec preferred_lft 3058sec
    inet6 fe80::211:7500:6f:33a3/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
18: qib_ib1.8005@qib_ib1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc pfifo_fast state LOWERLAYERDOWN group default qlen 256
    link/infiniband 80:00:00:1f:fe:80:00:00:00:00:00:01:00:11:75:00:00:6f:33:a3 brd 00:ff:ff:ff:ff:12:40:1b:80:05:00:00:00:00:00:00:ff:ff:ff:ff
19: qib_ib1.8007@qib_ib1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc pfifo_fast state LOWERLAYERDOWN group default qlen 256

Comment 2 Brian Chae 2020-03-04 17:42:47 UTC
Additional info on the same crash: the following tests also failed with the same segmentation faults.


+ [20-02-05 12:41:54] timeout 5m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include mlx4_0:1 -mca mtl '^psm2,psm,ofi' -mca btl openib,self mpitests-IMB-RMA Get_local -time 1.5
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 2018 Update 1, MPI-RMA part  
#------------------------------------------------------------
# Date                  : Wed Feb  5 12:41:55 2020
# Machine               : x86_64
# System                : Linux
# Release               : 3.10.0-1126.el7.x86_64
# Version               : #1 SMP Mon Feb 3 15:30:44 EST 2020
# MPI Version           : 3.1
# MPI Thread Environment: 


# Calling sequence was: 

# mpitests-IMB-RMA Get_local -time 1.5

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM  
#
#

# List of Benchmarks to run:

# Get_local

#---------------------------------------------------
# Benchmarking Get_local 
# #processes = 2 
#---------------------------------------------------
#
#    MODE: NON-AGGREGATE 
#
       #bytes #repetitions      t[usec]   Mbytes/sec
            0          100         0.13         0.00
            1          100         2.04         0.49
            2          100         2.06         0.97
            4          100         2.02         1.98
            8          100         2.02         3.97
           16          100         2.01         7.96
           32          100         2.04        15.68
           64          100         2.03        31.45
          128          100         2.12        60.52
          256          100         2.34       109.42
          512          100         2.52       203.21
         1024          100         2.95       347.14
         2048          100         3.89       526.76
         4096          100         4.60       889.62
         8192          100         6.14      1334.88
        16384          100         9.24      1773.60
        32768          100        12.71      2578.04
        65536          100        22.10      2964.92
       131072          100        40.79      3213.02
       262144          100        78.20      3352.07
       524288           80       153.18      3422.63
      1048576           40       302.86      3462.29
      2097152           20       602.23      3482.29
      4194304           10      1201.08      3492.11
[rdma-qe-11:17520:0:17520] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
    0  /lib64/libucs.so.0(+0x17970) [0x7f1d5f846970]
    1  /lib64/libucs.so.0(+0x17b22) [0x7f1d5f846b22]
    2  /usr/lib64/openmpi3/lib/openmpi/mca_btl_openib.so(mca_btl_openib_get+0x131) [0x7f1d6152db71]
    3  /usr/lib64/openmpi3/lib/openmpi/mca_osc_rdma.so(+0x6c59) [0x7f1d5e3a7c59]
    4  /usr/lib64/openmpi3/lib/openmpi/mca_osc_rdma.so(ompi_osc_rdma_get+0x4a9) [0x7f1d5e3aabc9]
    5  /usr/lib64/openmpi3/lib/libmpi.so.40(PMPI_Get+0x4e) [0x7f1d74d43cce]
    6  mpitests-IMB-RMA() [0x409c88]
    7  mpitests-IMB-RMA() [0x40695f]
    8  mpitests-IMB-RMA() [0x402426]
    9  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1d746e9555]
   10  mpitests-IMB-RMA() [0x402696]
===================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node rdma-qe-11 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------




================================================================================================================================================




+ [20-02-05 12:21:32] timeout 5m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include mlx4_0:1 -mca mtl '^psm2,psm,ofi' -mca btl openib,self mpitests-IMB-RMA Unidir_get -time 1.5
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 2018 Update 1, MPI-RMA part  
#------------------------------------------------------------
# Date                  : Wed Feb  5 12:21:33 2020
# Machine               : x86_64
# System                : Linux
# Release               : 3.10.0-1126.el7.x86_64
# Version               : #1 SMP Mon Feb 3 15:30:44 EST 2020
# MPI Version           : 3.1
# MPI Thread Environment: 


# Calling sequence was: 

# mpitests-IMB-RMA Unidir_get -time 1.5

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM  
#
#

# List of Benchmarks to run:

# Unidir_get

#---------------------------------------------------
# Benchmarking Unidir_get 
# #processes = 2 
#---------------------------------------------------
#
#    MODE: NON-AGGREGATE 
#
       #bytes #repetitions      t[usec]   Mbytes/sec
            0          100         0.12         0.00
            1          100         2.02         0.50
            2          100         2.03         0.99
            4          100         2.01         1.99
            8          100         2.02         3.95
           16          100         2.05         7.80
           32          100         2.05        15.63
           64          100         2.03        31.50
          128          100         2.11        60.75
          256          100         2.92        87.73
          512          100         2.49       205.36
         1024          100         2.94       348.10
         2048          100         3.92       522.04
         4096          100         4.63       885.14
         8192          100         6.07      1350.22
        16384          100         9.14      1792.27
        32768          100        12.76      2568.10
        65536          100        22.05      2972.35
       131072          100        40.84      3209.45
       262144          100        78.21      3352.01
       524288           80       153.18      3422.69
      1048576           40       302.81      3462.87
      2097152           20       602.34      3481.68
      4194304           10      1200.77      3493.01

#---------------------------------------------------
# Benchmarking Unidir_get 
# #processes = 2 
#---------------------------------------------------
#
#    MODE: AGGREGATE 
#
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         0.02         0.00
            1         1000         0.65         1.54
            2         1000         0.65         3.06
            4         1000         0.65         6.12
            8         1000         0.65        12.31
           16         1000         0.65        24.54
[rdma-qe-11:6543 :0:6543] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x67aa56e88)
==== backtrace ====
    0  /lib64/libucs.so.0(+0x17970) [0x7fa63a350970]
    1  /lib64/libucs.so.0(+0x17b22) [0x7fa63a350b22]
    2  /usr/lib64/openmpi3/lib/openmpi/mca_osc_rdma.so(+0x267f7) [0x7fa638ed17f7]
    3  /usr/lib64/openmpi3/lib/openmpi/mca_osc_rdma.so(ompi_osc_rdma_peer_lookup+0x9e) [0x7fa638ed1c2e]
    4  /usr/lib64/openmpi3/lib/openmpi/mca_osc_rdma.so(ompi_osc_rdma_lock_atomic+0xbab) [0x7fa638ece0db]
    5  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Win_lock+0xdb) [0x7fa64f7d530b]
    6  mpitests-IMB-RMA() [0x40974a]
    7  mpitests-IMB-RMA() [0x4053ce]
    8  mpitests-IMB-RMA() [0x4022f9]
    9  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7fa64f178555]
   10  mpitests-IMB-RMA() [0x402696]
===================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node rdma-qe-11 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

These crashes occurred on the mlx4 device (ib0).

image: RHEL-7.8-20200205.2 Server x86_64

+ [20-02-05 12:01:49] cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.8 Beta (Maipo)
+ [20-02-05 12:01:49] uname -a
Linux rdma-qe-11.lab.bos.redhat.com 3.10.0-1126.el7.x86_64 #1 SMP Mon Feb 3 15:30:44 EST 2020 x86_64 x86_64 x86_64 GNU/Linux
+ [20-02-05 12:01:49] cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-1126.el7.x86_64 root=UUID=93421318-e9ed-42ea-a391-62075904d777 ro intel_idle.max_cstate=0 intremap=no_x2apic_optout processor.max_cstate=0 console=tty0 rd_NO_PLYMOUTH rd.driver.blacklist=csiostor crashkernel=auto console=ttyS1,115200n81
+ [20-02-05 12:01:49] rpm -q rdma-core linux-firmware
rdma-core-22.4-1.el7.x86_64
linux-firmware-20191203-76.gite8a0f4c.el7.noarch
+ [20-02-05 12:01:49] tail /sys/class/infiniband/mlx4_0/fw_ver
2.9.8350



Test results for mpi/openmpi3 on rdma-qe-11:
3.10.0-1126.el7.x86_64, mlx4, ib0, & mlx4_0

Comment 6 John W. Linville 2021-01-27 17:23:00 UTC
RHEL7 is now well into Maintenance Support 2 Phase, and RHEL 7.9 was the last minor release of RHEL 7. Only critical bugs are still eligible for fixes in RHEL 7. This bug will be closed.

