Bug 2083222
| Summary: | [RHEL8.7] all mvapich2 benchmarks fail with "Error creating SRQ" message when tested on CXGB4 IW device | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Brian Chae <bchae> |
| Component: | mvapich2 | Assignee: | Kamal Heib <kheib> |
| Status: | ASSIGNED --- | QA Contact: | Infiniband QE <infiniband-qe> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.7 | CC: | hwkernel-mgr, kheib, rdma-dev-team |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Brian Chae
2022-05-09 13:43:48 UTC
On the RHEL-8.9 build RHEL-8.9.0-20230718.23, during the CTC#2 test run, mvapich2 running on the CXGB4 iWARP device failed in the following ways.

==============================================

1. Some benchmarks passed.

2. A lot of benchmarks failed with RC139 or RC255, as in the following:

```
+ [23-07-20 16:39:11] timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 mpitests-IMB-MPI1 Sendrecv -time 1.5
#----------------------------------------------------------------
# Intel(R) MPI Benchmarks 2021.3, MPI-1 part
#----------------------------------------------------------------
# Date : Thu Jul 20 16:39:12 2023
# Machine : x86_64
# System : Linux
# Release : 4.18.0-502.el8.x86_64
# Version : #1 SMP Tue Jul 11 12:32:03 EDT 2023
# MPI Version : 3.1
# MPI Thread Environment:
# Calling sequence was:
# mpitests-IMB-MPI1 Sendrecv -time 1.5
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# Sendrecv
[rdma-perf-07.rdma.lab.eng.rdu2.redhat.com:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 62623 RUNNING AT 172.31.50.187
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0.lab.eng.rdu2.redhat.com] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911): assert (!closed) failed
[proxy:0:0.lab.eng.rdu2.redhat.com] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0.lab.eng.rdu2.redhat.com] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
+ [23-07-20 16:39:12] __MPI_check_result 139 mpitests-mvapich2 IMB-MPI1 Sendrecv mpirun /root/hfile_one_core
+ [23-07-20 16:39:12] '[' 6 -ne 6 ']'
+ [23-07-20 16:39:12] local status=139
+ [23-07-20 16:39:12] local pkg=mvapich2
+ [23-07-20 16:39:12] local benchmark=IMB-MPI1
++ [23-07-20 16:39:12] basename Sendrecv
+ [23-07-20 16:39:12] local app=Sendrecv
+ [23-07-20 16:39:12] app=Sendrecv
+ [23-07-20 16:39:12] local cmd=mpirun
++ [23-07-20 16:39:12] basename /root/hfile_one_core
+ [23-07-20 16:39:12] local hfile=hfile_one_core
+ [23-07-20 16:39:12] hfile=one_core
+ [23-07-20 16:39:12] RQA_check_result -r 139 -t 'mvapich2 IMB-MPI1 Sendrecv mpirun one_core'
+ [23-07-20 16:39:12] local test_pass=0
+ [23-07-20 16:39:12] local test_skip=777
+ [23-07-20 16:39:12] test 4 -gt 0
+ [23-07-20 16:39:12] case $1 in
+ [23-07-20 16:39:12] local rc=139
+ [23-07-20 16:39:12] shift
+ [23-07-20 16:39:12] shift
+ [23-07-20 16:39:12] test 2 -gt 0
+ [23-07-20 16:39:12] case $1 in
+ [23-07-20 16:39:12] local 'msg=mvapich2 IMB-MPI1 Sendrecv mpirun one_core'
+ [23-07-20 16:39:12] shift
+ [23-07-20 16:39:12] shift
+ [23-07-20 16:39:12] test 0 -gt 0
+ [23-07-20 16:39:12] '[' -z 139 -o -z 'mvapich2 IMB-MPI1 Sendrecv mpirun one_core' ']'
+ [23-07-20 16:39:12] '[' -z /tmp/tmp.aqaE5El2H8/results_mpi-mvapich2.txt ']'
+ [23-07-20 16:39:12] '[' -z /tmp/tmp.aqaE5El2H8/results_mpi-mvapich2.txt ']'
+ [23-07-20 16:39:12] '[' 139 -eq 0 ']'
+ [23-07-20 16:39:12] '[' 139 -eq 777 ']'
+ [23-07-20 16:39:12] local test_result=FAIL
+ [23-07-20 16:39:12] export result=FAIL
+ [23-07-20 16:39:12] result=FAIL
+ [23-07-20 16:39:12] [[ ! -z '' ]]
+ [23-07-20 16:39:12] printf '%10s | %6s | %s\n' FAIL 139 'mvapich2 IMB-MPI1 Sendrecv mpirun one_core'
+ [23-07-20 16:39:12] set +x
---
- TEST RESULT FOR mvapich2
- Test: mvapich2 IMB-MPI1 Sendrecv mpirun one_core
- Result: FAIL
- Return: 139
---
+ [23-07-20 16:39:12] timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 mpitests-IMB-MPI1 Exchange -time 1.5
#----------------------------------------------------------------
# Intel(R) MPI Benchmarks 2021.3, MPI-1 part
#----------------------------------------------------------------
# Date : Thu Jul 20 16:39:13 2023
# Machine : x86_64
# System : Linux
# Release : 4.18.0-502.el8.x86_64
# Version : #1 SMP Tue Jul 11 12:32:03 EDT 2023
# MPI Version : 3.1
# MPI Thread Environment:
# Calling sequence was:
# mpitests-IMB-MPI1 Exchange -time 1.5
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# Exchange
[rdma-perf-07.rdma.lab.eng.rdu2.redhat.com:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 62684 RUNNING AT 172.31.50.187
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0.lab.eng.rdu2.redhat.com] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911): assert (!closed) failed
[proxy:0:0.lab.eng.rdu2.redhat.com] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0.lab.eng.rdu2.redhat.com] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
[mpiexec.lab.eng.rdu2.redhat.com] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec.lab.eng.rdu2.redhat.com] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec.lab.eng.rdu2.redhat.com] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec.lab.eng.rdu2.redhat.com] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion
+ [23-07-20 16:39:13] __MPI_check_result 255 mpitests-mvapich2 IMB-MPI1 Exchange mpirun /root/hfile_one_core
+ [23-07-20 16:39:13] '[' 6 -ne 6 ']'
+ [23-07-20 16:39:13] local status=255
+ [23-07-20 16:39:13] local pkg=mvapich2
+ [23-07-20 16:39:13] local benchmark=IMB-MPI1
++ [23-07-20 16:39:13] basename Exchange
+ [23-07-20 16:39:13] local app=Exchange
+ [23-07-20 16:39:13] app=Exchange
+ [23-07-20 16:39:13] local cmd=mpirun
++ [23-07-20 16:39:13] basename /root/hfile_one_core
+ [23-07-20 16:39:13] local hfile=hfile_one_core
+ [23-07-20 16:39:13] hfile=one_core
+ [23-07-20 16:39:13] RQA_check_result -r 255 -t 'mvapich2 IMB-MPI1 Exchange mpirun one_core'
+ [23-07-20 16:39:13] local test_pass=0
+ [23-07-20 16:39:13] local test_skip=777
+ [23-07-20 16:39:13] test 4 -gt 0
+ [23-07-20 16:39:13] case $1 in
+ [23-07-20 16:39:13] local rc=255
+ [23-07-20 16:39:13] shift
+ [23-07-20 16:39:13] shift
+ [23-07-20 16:39:13] test 2 -gt 0
+ [23-07-20 16:39:13] case $1 in
+ [23-07-20 16:39:13] local 'msg=mvapich2 IMB-MPI1 Exchange mpirun one_core'
+ [23-07-20 16:39:13] shift
+ [23-07-20 16:39:13] shift
+ [23-07-20 16:39:13] test 0 -gt 0
+ [23-07-20 16:39:13] '[' -z 255 -o -z 'mvapich2 IMB-MPI1 Exchange mpirun one_core' ']'
+ [23-07-20 16:39:13] '[' -z /tmp/tmp.aqaE5El2H8/results_mpi-mvapich2.txt ']'
+ [23-07-20 16:39:13] '[' -z /tmp/tmp.aqaE5El2H8/results_mpi-mvapich2.txt ']'
+ [23-07-20 16:39:13] '[' 255 -eq 0 ']'
+ [23-07-20 16:39:13] '[' 255 -eq 777 ']'
+ [23-07-20 16:39:13] local test_result=FAIL
+ [23-07-20 16:39:13] export result=FAIL
+ [23-07-20 16:39:13] result=FAIL
+ [23-07-20 16:39:13] [[ ! -z '' ]]
+ [23-07-20 16:39:13] printf '%10s | %6s | %s\n' FAIL 255 'mvapich2 IMB-MPI1 Exchange mpirun one_core'
+ [23-07-20 16:39:13] set +x
---
- TEST RESULT FOR mvapich2
- Test: mvapich2 IMB-MPI1 Exchange mpirun one_core
- Result: FAIL
- Return: 255
---
```

3. A lot of benchmarks run with the "mpirun_rsh" command failed with RC1:

```
+ [23-07-20 18:07:02] timeout --preserve-status --kill-after=5m 3m mpirun_rsh -np 2 -hostfile /root/hfile_one_core mpitests-IMB-MPI1 PingPong -time 1.5
#----------------------------------------------------------------
# Intel(R) MPI Benchmarks 2021.3, MPI-1 part
#----------------------------------------------------------------
# Date : Thu Jul 20 18:07:03 2023
# Machine : x86_64
# System : Linux
# Release : 4.18.0-502.el8.x86_64
# Version : #1 SMP Tue Jul 11 12:32:03 EDT 2023
# MPI Version : 3.1
# MPI Thread Environment:
# Calling sequence was:
# mpitests-IMB-MPI1 PingPong -time 1.5
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# PingPong
[rdma-perf-07.rdma.lab.eng.rdu2.redhat.com:mpirun_rsh][signal_processor] Caught signal 15, killing job
+ [23-07-20 18:10:03] __MPI_check_result 1 mpitests-mvapich2 IMB-MPI1 PingPong mpirun_rsh /root/hfile_one_core
+ [23-07-20 18:10:03] '[' 6 -ne 6 ']'
+ [23-07-20 18:10:03] local status=1
+ [23-07-20 18:10:03] local pkg=mvapich2
+ [23-07-20 18:10:03] local benchmark=IMB-MPI1
++ [23-07-20 18:10:03] basename PingPong
+ [23-07-20 18:10:03] local app=PingPong
+ [23-07-20 18:10:03] app=PingPong
+ [23-07-20 18:10:03] local cmd=mpirun_rsh
++ [23-07-20 18:10:03] basename /root/hfile_one_core
+ [23-07-20 18:10:03] local hfile=hfile_one_core
+ [23-07-20 18:10:03] hfile=one_core
+ [23-07-20 18:10:03] RQA_check_result -r 1 -t 'mvapich2 IMB-MPI1 PingPong mpirun_rsh one_core'
+ [23-07-20 18:10:03] local test_pass=0
+ [23-07-20 18:10:03] local test_skip=777
+ [23-07-20 18:10:03] test 4 -gt 0
+ [23-07-20 18:10:03] case $1 in
+ [23-07-20 18:10:03] local rc=1
+ [23-07-20 18:10:03] shift
+ [23-07-20 18:10:03] shift
+ [23-07-20 18:10:03] test 2 -gt 0
+ [23-07-20 18:10:03] case $1 in
+ [23-07-20 18:10:03] local 'msg=mvapich2 IMB-MPI1 PingPong mpirun_rsh one_core'
+ [23-07-20 18:10:03] shift
+ [23-07-20 18:10:03] shift
+ [23-07-20 18:10:03] test 0 -gt 0
+ [23-07-20 18:10:03] '[' -z 1 -o -z 'mvapich2 IMB-MPI1 PingPong mpirun_rsh one_core' ']'
+ [23-07-20 18:10:03] '[' -z /tmp/tmp.aqaE5El2H8/results_mpi-mvapich2.txt ']'
+ [23-07-20 18:10:03] '[' -z /tmp/tmp.aqaE5El2H8/results_mpi-mvapich2.txt ']'
+ [23-07-20 18:10:03] '[' 1 -eq 0 ']'
+ [23-07-20 18:10:03] '[' 1 -eq 777 ']'
+ [23-07-20 18:10:03] local test_result=FAIL
+ [23-07-20 18:10:03] export result=FAIL
+ [23-07-20 18:10:03] result=FAIL
+ [23-07-20 18:10:03] [[ ! -z '' ]]
+ [23-07-20 18:10:03] printf '%10s | %6s | %s\n' FAIL 1 'mvapich2 IMB-MPI1 PingPong mpirun_rsh one_core'
+ [23-07-20 18:10:03] set +x
---
- TEST RESULT FOR mvapich2
- Test: mvapich2 IMB-MPI1 PingPong mpirun_rsh one_core
- Result: FAIL
- Return: 1
---
```

===========================

Clients: rdma-perf-07
Servers: rdma-perf-06

DISTRO=RHEL-8.9.0-20230718.23

```
+ [23-07-20 16:35:59] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.9 Beta (Ootpa)
+ [23-07-20 16:35:59] uname -a
Linux rdma-perf-07.rdma.lab.eng.rdu2.redhat.com 4.18.0-502.el8.x86_64 #1 SMP Tue Jul 11 12:32:03 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
+ [23-07-20 16:35:59] cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-502.el8.x86_64 root=UUID=2ac8b670-3cca-4590-aa27-ac48d0577a07 ro crashkernel=auto resume=UUID=607a6a26-7f13-4052-ad45-2a030dc48592 console=ttyS0,115200n81
+ [23-07-20 16:35:59] rpm -q rdma-core linux-firmware
rdma-core-46.0-1.el8.1.x86_64
linux-firmware-20230711-116.gitd3f66064.el8.noarch
+ [23-07-20 16:35:59] tail /sys/class/infiniband/bnxt_re0/fw_ver /sys/class/infiniband/bnxt_re1/fw_ver /sys/class/infiniband/cxgb4_0/fw_ver /sys/class/infiniband/hfi1_0/fw_ver /sys/class/infiniband/mlx5_0/fw_ver /sys/class/infiniband/mlx5_1/fw_ver
==> /sys/class/infiniband/bnxt_re0/fw_ver <==
214.0.189.0
==> /sys/class/infiniband/bnxt_re1/fw_ver <==
214.0.189.0
==> /sys/class/infiniband/cxgb4_0/fw_ver <==
1.27.3.0
==> /sys/class/infiniband/hfi1_0/fw_ver <==
1.27.0
==> /sys/class/infiniband/mlx5_0/fw_ver <==
16.24.1000
==> /sys/class/infiniband/mlx5_1/fw_ver <==
16.24.1000
+ [23-07-20 16:35:59] lspci
+ [23-07-20 16:35:59] grep -i -e ethernet -e infiniband -e omni -e ConnectX
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
19:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57412 NetXtreme-E 10Gb RDMA Ethernet Controller (rev 01)
19:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57412 NetXtreme-E 10Gb RDMA Ethernet Controller (rev 01)
5e:00.0 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
5e:00.1 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
5e:00.2 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
5e:00.3 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
5e:00.4 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
af:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
af:00.1 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
d8:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 11)
```
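The "Error creating SRQ" message in the summary suggests that MVAPICH2's verbs-level setup fails when it asks the cxgb4 provider for a shared receive queue. The following is a minimal standalone probe of that call path, not MVAPICH2's actual code; the device name cxgb4_0 and the queue sizes are illustrative assumptions. It reports the SRQ limits the device advertises and then attempts an ibv_create_srq() against it:

```c
/*
 * srq_probe.c - hedged sketch: query SRQ limits on the cxgb4 device and
 * try to create one SRQ via libibverbs.  Device name and sizes are
 * assumptions, not values taken from MVAPICH2.
 *
 * Build: gcc -o srq_probe srq_probe.c -libverbs
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **list = ibv_get_device_list(&num);
    struct ibv_context *ctx = NULL;

    /* Open the Chelsio iWARP device; adjust the name if the unit differs. */
    for (int i = 0; i < num; i++) {
        if (strcmp(ibv_get_device_name(list[i]), "cxgb4_0") == 0) {
            ctx = ibv_open_device(list[i]);
            break;
        }
    }
    if (!ctx) {
        fprintf(stderr, "cxgb4_0 not found or could not be opened\n");
        return 1;
    }

    /* Report the SRQ capabilities the provider advertises. */
    struct ibv_device_attr dev_attr;
    if (ibv_query_device(ctx, &dev_attr) == 0)
        printf("max_srq=%d max_srq_wr=%d max_srq_sge=%d\n",
               dev_attr.max_srq, dev_attr.max_srq_wr, dev_attr.max_srq_sge);

    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    if (!pd) {
        fprintf(stderr, "ibv_alloc_pd failed\n");
        return 1;
    }

    /* Try a modest SRQ, roughly the size an MPI library might request. */
    struct ibv_srq_init_attr srq_init = {
        .attr = { .max_wr = 512, .max_sge = 1 },
    };
    struct ibv_srq *srq = ibv_create_srq(pd, &srq_init);
    if (!srq)
        fprintf(stderr, "ibv_create_srq failed: %s\n", strerror(errno));
    else {
        printf("SRQ created successfully\n");
        ibv_destroy_srq(srq);
    }

    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(list);
    return srq ? 0 : 1;
}
```

If this probe fails in the same way, the issue would point at the cxgb4 provider or firmware rather than at mvapich2 itself; if it succeeds, comparing the attributes MVAPICH2 actually requests (and, assuming the installed build honors the MV2_USE_SRQ runtime parameter, retesting with MV2_USE_SRQ=0) may help isolate whether SRQ creation is what triggers the benchmark failures.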