Bug 2149874

Summary: [RHEL9.2] - some openmpi benchmarks time-out with return code of 1 when executed on CXGB4 devices
Product: Red Hat Enterprise Linux 9
Reporter: Brian Chae <bchae>
Component: openmpi
Assignee: Kamal Heib <kheib>
Status: CLOSED DUPLICATE
QA Contact: Infiniband QE <infiniband-qe>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 9.2
CC: kheib, rdma-dev-team
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Brian Chae 2022-12-01 09:29:37 UTC
Description of problem:

Some of the OPENMPI benchmarks time out with a return code of 1 when run on CXGB4 devices.
The failed benchmarks are as follows:

      FAIL |      1 | openmpi IMB-IO P_Write_indv mpirun one_core
      FAIL |      1 | openmpi IMB-IO P_Write_expl mpirun one_core
      FAIL |      1 | openmpi IMB-IO P_Write_shared mpirun one_core
      FAIL |      1 | openmpi IMB-IO P_Write_priv mpirun one_core
      FAIL |      1 | openmpi IMB-IO C_Write_indv mpirun one_core
      FAIL |      1 | openmpi IMB-IO C_Write_expl mpirun one_core
      FAIL |      1 | openmpi IMB-IO C_Write_shared mpirun one_core
      FAIL |      1 | openmpi OSU get_acc_latency mpirun one_core
      FAIL |      1 | openmpi OSU mbw_mr mpirun one_core
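
For triage, the failure rows above can be grouped by benchmark suite. The following is a minimal sketch, not part of the test harness: the function names `parse_result_rows` and `failures_by_suite` are hypothetical, and the row layout (`<mpi> <suite> <benchmark> <launcher> <mode>`) is an assumption inferred from the rows quoted in this report.

```python
from collections import defaultdict

# The nine failure rows quoted in this report, copied verbatim.
REPORT_ROWS = """\
      FAIL |      1 | openmpi IMB-IO P_Write_indv mpirun one_core
      FAIL |      1 | openmpi IMB-IO P_Write_expl mpirun one_core
      FAIL |      1 | openmpi IMB-IO P_Write_shared mpirun one_core
      FAIL |      1 | openmpi IMB-IO P_Write_priv mpirun one_core
      FAIL |      1 | openmpi IMB-IO C_Write_indv mpirun one_core
      FAIL |      1 | openmpi IMB-IO C_Write_expl mpirun one_core
      FAIL |      1 | openmpi IMB-IO C_Write_shared mpirun one_core
      FAIL |      1 | openmpi OSU get_acc_latency mpirun one_core
      FAIL |      1 | openmpi OSU mbw_mr mpirun one_core
"""


def parse_result_rows(text):
    """Return (status, return_code, suite, benchmark) tuples, one per result row."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Skip header, separator, and blank lines: a result row has three
        # columns and a numeric return code in the middle column.
        if len(parts) != 3 or not parts[1].isdigit():
            continue
        status, rc, test = parts
        words = test.split()
        # Assumed layout: "<mpi> <suite> <benchmark> <launcher> <mode>"
        if len(words) < 3:
            continue
        rows.append((status, int(rc), words[1], words[2]))
    return rows


def failures_by_suite(rows):
    """Group the benchmark names of FAIL rows under their suite name."""
    grouped = defaultdict(list)
    for status, _rc, suite, bench in rows:
        if status == "FAIL":
            grouped[suite].append(bench)
    return dict(grouped)


if __name__ == "__main__":
    grouped = failures_by_suite(parse_result_rows(REPORT_ROWS))
    for suite, benches in grouped.items():
        print(f"{suite}: {len(benches)} failed: {', '.join(benches)}")
```

Grouped this way, the list splits into seven IMB-IO failures, all of them write benchmarks, and two OSU failures (get_acc_latency and mbw_mr).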

This issue appears to occur consistently on the following host pairs:

a. rdma-qe-12 (cxgb4 T5 iw 40) / rdma-perf-06 (cxgb4 T6 iw 100)

   beaker job : https://beaker.engineering.redhat.com/jobs/7293293

b. rdma-dev-13 (cxgb4 T6 iw 100) / rdma-perf-06 (cxgb4 T6 iw 100)

   beaker job : https://beaker.engineering.redhat.com/jobs/7292260


Version-Release number of selected component (if applicable):

Clients: rdma-perf-06
Servers: rdma-qe-12

DISTRO=RHEL-9.2.0-20221129.2

+ [22-11-30 18:38:52] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.2 Beta (Plow)

+ [22-11-30 18:38:52] uname -a
Linux rdma-perf-06.rdma.lab.eng.rdu2.redhat.com 5.14.0-202.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Nov 28 08:49:47 EST 2022 x86_64 x86_64 x86_64 GNU/Linux

+ [22-11-30 18:38:52] cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.14.0-202.el9.x86_64 root=UUID=60790874-ea0a-4a35-8447-d83f2475913b ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=08d83c36-2fab-45c6-a375-8bb16849b90a console=ttyS0,115200n81

+ [22-11-30 18:38:52] rpm -q rdma-core linux-firmware
rdma-core-41.0-3.el9.x86_64
linux-firmware-20221012-128.el9.noarch

+ [22-11-30 18:38:52] tail /sys/class/infiniband/cxgb4_0/fw_ver /sys/class/infiniband/hfi1_0/fw_ver /sys/class/infiniband/mlx5_0/fw_ver /sys/class/infiniband/mlx5_1/fw_ver /sys/class/infiniband/qedr0/fw_ver /sys/class/infiniband/qedr1/fw_ver
==> /sys/class/infiniband/cxgb4_0/fw_ver <==
1.27.0.0

==> /sys/class/infiniband/hfi1_0/fw_ver <==
1.27.0

==> /sys/class/infiniband/mlx5_0/fw_ver <==
20.99.5392

==> /sys/class/infiniband/mlx5_1/fw_ver <==
20.99.5392

==> /sys/class/infiniband/qedr0/fw_ver <==
8.59.1.0

==> /sys/class/infiniband/qedr1/fw_ver <==
8.59.1.0

+ [22-11-30 18:38:52] lspci
+ [22-11-30 18:38:52] grep -i -e ethernet -e infiniband -e omni -e ConnectX
19:00.0 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
19:00.1 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
19:00.2 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
19:00.3 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
5e:00.0 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
5e:00.1 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
5e:00.2 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
5e:00.3 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
5e:00.4 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
af:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
af:00.1 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
d8:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 11)
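
When comparing a failing host pair against a passing one, it can help to reduce the `tail .../fw_ver` output above to a device-to-firmware map. The sketch below is only an illustration: the function name `parse_fw_versions` is hypothetical, and it assumes the `==> path <==` headers that `tail` prints when given several files, as seen in the log above.

```python
import re

# fw_ver output for rdma-perf-06, copied from the log in this report.
TAIL_OUTPUT = """\
==> /sys/class/infiniband/cxgb4_0/fw_ver <==
1.27.0.0

==> /sys/class/infiniband/hfi1_0/fw_ver <==
1.27.0

==> /sys/class/infiniband/mlx5_0/fw_ver <==
20.99.5392

==> /sys/class/infiniband/mlx5_1/fw_ver <==
20.99.5392

==> /sys/class/infiniband/qedr0/fw_ver <==
8.59.1.0

==> /sys/class/infiniband/qedr1/fw_ver <==
8.59.1.0
"""

# Matches the per-file header line and captures the RDMA device name.
HEADER = re.compile(r"^==> /sys/class/infiniband/([^/]+)/fw_ver <==$")


def parse_fw_versions(text):
    """Map each RDMA device name to the first non-blank line after its header."""
    versions = {}
    device = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            device = match.group(1)
        elif line and device is not None:
            versions[device] = line
            device = None  # ignore any further lines until the next header
    return versions


if __name__ == "__main__":
    for dev, fw in sorted(parse_fw_versions(TAIL_OUTPUT).items()):
        print(f"{dev}: {fw}")
```

Running the same function over the log of the peer host would give two dictionaries that can be compared directly.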

How reproducible:

100% with the above combinations of RDMA hosts

Steps to Reproduce:

1. Refer to the Beaker job output from the client hosts mentioned above.

Actual results:


Expected results:


Additional info:

However, with the following CXGB4 host combinations, all OPENMPI benchmarks passed:

a. mpi suite over rdma-iw-cxgb pool[ RHEL-9.2.0-20221129.2: rdma-perf-06/07 - mpich2,openmpi ]

   beaker job : https://beaker.engineering.redhat.com/jobs/7291986 

b. mpi suite over rdma-iw-cxgb pool[ RHEL-9.2.0-20221129.2: rdma-dev-13/rdma-qe-12 - mpich2,openmpi ]  - J:7293324

mpi/openmpi test results on rdma-dev-13/rdma-qe-12 & Beaker job J:7293324:
5.14.0-202.el9.x86_64, rdma-core-41.0-3.el9, cxgb4, iw, T520-CR & cxgb4_0
    Result | Status | Test
  ---------+--------+------------------------------------
Checking for failures and known issues:
  no test failures

  beaker job : https://beaker.engineering.redhat.com/jobs/7293324

Comment 1 Brian Chae 2023-07-10 13:47:10 UTC

*** This bug has been marked as a duplicate of bug 2149873 ***