Bug 2149873 - [RHEL9.2] - some openmpi benchmarks time-out with return code of 1 when executed on CXGB4 devices
Summary: [RHEL9.2] - some openmpi benchmarks time-out with return code of 1 when executed on CXGB4 devices
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: openmpi
Version: 9.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Kamal Heib
QA Contact: Infiniband QE
URL:
Whiteboard:
Duplicates: 2149874, 2149878
Depends On:
Blocks:
 
Reported: 2022-12-01 09:27 UTC by Brian Chae
Modified: 2023-07-10 13:48 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:


Attachments: none


Links:
  Red Hat Issue Tracker RHELPLAN-141004 (last updated 2022-12-01 09:43:34 UTC)

Description Brian Chae 2022-12-01 09:27:06 UTC
Description of problem:

Some of the OPENMPI benchmarks time out with a return code of 1 when run on CXGB4 devices.
The failing benchmarks are listed below; a minimal reproduction sketch follows the list:

      FAIL |      1 | openmpi IMB-IO P_Write_indv mpirun one_core
      FAIL |      1 | openmpi IMB-IO P_Write_expl mpirun one_core
      FAIL |      1 | openmpi IMB-IO P_Write_shared mpirun one_core
      FAIL |      1 | openmpi IMB-IO P_Write_priv mpirun one_core
      FAIL |      1 | openmpi IMB-IO C_Write_indv mpirun one_core
      FAIL |      1 | openmpi IMB-IO C_Write_expl mpirun one_core
      FAIL |      1 | openmpi IMB-IO C_Write_shared mpirun one_core
      FAIL |      1 | openmpi OSU get_acc_latency mpirun one_core
      FAIL |      1 | openmpi OSU mbw_mr mpirun one_core
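
For reference, here is a minimal sketch of invoking the failing benchmarks by hand. The host pair is the server/client pair from this report, but the mpirun options and binary locations are assumptions based on standard Intel MPI Benchmarks (IMB) and OSU micro-benchmarks usage; the exact command lines used by the RDMA test suite are not shown in this report.

      # Minimal reproduction sketch -- mpirun options are assumed,
      # not taken from the test suite.
      HOSTS=rdma-qe-12,rdma-perf-06   # server,client pair from this report

      # Intel MPI Benchmarks I/O test (P_Write_indv is one of the failures):
      mpirun -np 2 --host "$HOSTS" IMB-IO P_Write_indv

      # OSU micro-benchmarks (mbw_mr and get_acc_latency also fail):
      mpirun -np 2 --host "$HOSTS" osu_mbw_mr
      mpirun -np 2 --host "$HOSTS" osu_get_acc_latency

A run counts as failing when the benchmark hangs and is timed out with return code 1, as in the table above.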

This issue is consistently reproducible with the following host combinations:

a. rdma-qe-12 (cxgb4 t5 iw 40) / rdma-perf-06 (cxgb4 T6 iw 100)

   beaker job : https://beaker.engineering.redhat.com/jobs/7293293

b. rdma-dev-13 (cxgb4 t6 iw 100) / rdma-perf-06 (cxgb4 T6 iw 100)

   https://beaker.engineering.redhat.com/jobs/7292260 


Version-Release number of selected component (if applicable):

Clients: rdma-perf-06
Servers: rdma-qe-12

DISTRO=RHEL-9.2.0-20221129.2

+ [22-11-30 18:38:52] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.2 Beta (Plow)

+ [22-11-30 18:38:52] uname -a
Linux rdma-perf-06.rdma.lab.eng.rdu2.redhat.com 5.14.0-202.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Nov 28 08:49:47 EST 2022 x86_64 x86_64 x86_64 GNU/Linux

+ [22-11-30 18:38:52] cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.14.0-202.el9.x86_64 root=UUID=60790874-ea0a-4a35-8447-d83f2475913b ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=08d83c36-2fab-45c6-a375-8bb16849b90a console=ttyS0,115200n81

+ [22-11-30 18:38:52] rpm -q rdma-core linux-firmware
rdma-core-41.0-3.el9.x86_64
linux-firmware-20221012-128.el9.noarch

+ [22-11-30 18:38:52] tail /sys/class/infiniband/cxgb4_0/fw_ver /sys/class/infiniband/hfi1_0/fw_ver /sys/class/infiniband/mlx5_0/fw_ver /sys/class/infiniband/mlx5_1/fw_ver /sys/class/infiniband/qedr0/fw_ver /sys/class/infiniband/qedr1/fw_ver
==> /sys/class/infiniband/cxgb4_0/fw_ver <==
1.27.0.0

==> /sys/class/infiniband/hfi1_0/fw_ver <==
1.27.0

==> /sys/class/infiniband/mlx5_0/fw_ver <==
20.99.5392

==> /sys/class/infiniband/mlx5_1/fw_ver <==
20.99.5392

==> /sys/class/infiniband/qedr0/fw_ver <==
8.59.1.0

==> /sys/class/infiniband/qedr1/fw_ver <==
8.59.1.0

+ [22-11-30 18:38:52] lspci
+ [22-11-30 18:38:52] grep -i -e ethernet -e infiniband -e omni -e ConnectX
19:00.0 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
19:00.1 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
19:00.2 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
19:00.3 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
5e:00.0 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
5e:00.1 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
5e:00.2 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
5e:00.3 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
5e:00.4 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
af:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
af:00.1 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
d8:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 11)
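
As a quick sanity check that the Chelsio adapter is registered with the RDMA subsystem, the standard rdma-core utilities can be used (a minimal sketch; rdma-core-41.0-3 is installed per the log above, and the expected output is an assumption):

      # Confirm the cxgb4 iWARP device is visible to the RDMA core
      ibv_devices               # should list cxgb4_0
      ibv_devinfo -d cxgb4_0    # port state should be PORT_ACTIVE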

How reproducible:

100% reproducible with the above combinations of RDMA hosts.

Steps to Reproduce:

1. Refer to the beaker job outputs on the client hosts mentioned above, or run the benchmarks by hand as in the reproduction sketch under "Description of problem".

Actual results:

The listed IMB-IO and OSU benchmarks hang and time out with return code 1.

Expected results:

All openmpi benchmarks complete and pass, as they do on the host combinations listed under "Additional info".
Additional info:

However, with the following CXGB4 host combinations, ALL OPENMPI benchmarks PASSED:

a. mpi suite over rdma-iw-cxgb pool[ RHEL-9.2.0-20221129.2: rdma-perf-06/07 - mpich2,openmpi ]

   beaker job : https://beaker.engineering.redhat.com/jobs/7291986 

b. mpi suite over rdma-iw-cxgb pool[ RHEL-9.2.0-20221129.2: rdma-dev-13/rdma-qe-12 - mpich2,openmpi ]  - J:7293324

mpi/openmpi test results on rdma-dev-13/rdma-qe-12 & Beaker job J:7293324:
5.14.0-202.el9.x86_64, rdma-core-41.0-3.el9, cxgb4, iw, T520-CR & cxgb4_0
    Result | Status | Test
  ---------+--------+------------------------------------
Checking for failures and known issues:
  no test failures

  beaker job : https://beaker.engineering.redhat.com/jobs/7293324

Comment 1 Jakub Jelen 2022-12-01 10:05:31 UTC
I do not see how this is related to the opensc package. Was the component wrongly selected and should this be reported against openmpi? I will reassign. If I am wrong, please find the right component.

Comment 2 Brian Chae 2023-07-10 13:47:10 UTC
*** Bug 2149874 has been marked as a duplicate of this bug. ***

Comment 3 Brian Chae 2023-07-10 13:48:03 UTC
*** Bug 2149878 has been marked as a duplicate of this bug. ***

