Bug 2064273 - [RHEL9.0] 2 openmpi benchmarks fail consistently on QEDE IW device
Summary: [RHEL9.0] 2 openmpi benchmarks fail consistently on QEDE IW device
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: openmpi
Version: 9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Kamal Heib
QA Contact: Infiniband QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-15 13:11 UTC by Brian Chae
Modified: 2023-08-16 07:28 UTC (History)
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-115606 0 None None None 2022-03-15 13:12:49 UTC

Description Brian Chae 2022-03-15 13:11:20 UTC
Description of problem:

The following openmpi benchmarks fail consistently on the QEDE iWARP (IW) device:
      FAIL |      1 | openmpi OSU acc_latency mpirun one_core
      FAIL |      1 | openmpi OSU get_acc_latency mpirun one_core


Version-Release number of selected component (if applicable):

DISTRO=RHEL-9.0.0-20220313.2

+ [22-03-14 13:15:47] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.0 Beta (Plow)

+ [22-03-14 13:15:47] uname -a
Linux rdma-dev-03.rdma.lab.eng.rdu2.redhat.com 5.14.0-70.1.1.el9.x86_64 #1 SMP PREEMPT Tue Mar 8 22:22:02 EST 2022 x86_64 x86_64 x86_64 GNU/Linux

+ [22-03-14 13:15:47] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-70.1.1.el9.x86_64 root=UUID=2076c1cf-ae89-4a0a-be94-8b47702b363e ro console=tty0 rd_NO_PLYMOUTH intel_iommu=on iommu=on crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=d37d3c68-e4f7-4218-84fd-c3feacdef6fa console=ttyS1,115200

+ [22-03-14 13:15:47] rpm -q rdma-core linux-firmware
rdma-core-37.2-1.el9.x86_64
linux-firmware-20220209-125.el9.noarch

+ [22-03-14 13:15:47] tail /sys/class/infiniband/qedr0/fw_ver /sys/class/infiniband/qedr1/fw_ver
==> /sys/class/infiniband/qedr0/fw_ver <==
8.42.2.0

==> /sys/class/infiniband/qedr1/fw_ver <==
8.42.2.0
+ [22-03-14 13:15:47] lspci
+ [22-03-14 13:15:47] grep -i -e ethernet -e infiniband -e omni -e ConnectX
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
08:00.0 Ethernet controller: QLogic Corp. FastLinQ QL45000 Series 25GbE Controller (rev 10)
08:00.1 Ethernet controller: QLogic Corp. FastLinQ QL45000 Series 25GbE Controller (rev 10)


How reproducible:
100%

Steps to Reproduce:
1. Bring up the RDMA hosts mentioned above with the RHEL 9.0 build.
2. Set up the RDMA hosts for the openmpi benchmark tests.
3. Run the failing benchmark commands below on the client (a sample hostfile layout is sketched after the commands):

a) 
timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include qedr1:1 -mca mtl '^psm2,psm,ofi' -mca btl '^openib' --mca mtl_base_verbose 100 --mca btl_openib_verbose 100 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=qede_iw --mca osc_ucx_verbose 100 --mca pml_ucx_verbose 100 /usr/lib64/openmpi/bin/mpitests-osu_acc_latency


b)
timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include qedr1:1 -mca mtl '^psm2,psm,ofi' -mca btl '^openib' --mca mtl_base_verbose 100 --mca btl_openib_verbose 100 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=qede_iw --mca osc_ucx_verbose 100 --mca pml_ucx_verbose 100 /usr/lib64/openmpi/bin/mpitests-osu_get_acc_latency
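
A note on the hostfile (illustration only): the contents of /root/hfile_one_core were not captured here. For a "one_core" run across the two hosts involved, a typical Open MPI hostfile would look like the following, though the exact file used in these runs may differ:

rdma-dev-02.rdma.lab.eng.rdu2.redhat.com slots=1
rdma-dev-03.rdma.lab.eng.rdu2.redhat.com slots=1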


Actual results:

a)
+ [22-03-14 13:54:10] timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include qedr1:1 -mca mtl '^psm2,psm,ofi' -mca btl '^openib' --mca mtl_base_verbose 100 --mca btl_openib_verbose 100 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=qede_iw --mca osc_ucx_verbose 100 --mca pml_ucx_verbose 100 /usr/lib64/openmpi/bin/mpitests-osu_acc_latency
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:57556] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:57556] pml_ucx.c:289 mca_pml_ucx_init
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:57556] pml_ucx.c:114 Pack remote worker address, size 38
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:57556] pml_ucx.c:114 Pack local worker address, size 141
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:57556] pml_ucx.c:351 created ucp context 0x55ce8e5bbd80, worker 0x55ce8e644cb0
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:53906] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:53906] pml_ucx.c:289 mca_pml_ucx_init
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:53906] pml_ucx.c:114 Pack remote worker address, size 38
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:53906] pml_ucx.c:114 Pack local worker address, size 141
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:53906] pml_ucx.c:351 created ucp context 0x55d9370c6c30, worker 0x55d93714fb60
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:57556] pml_ucx.c:182 Got proc 0 address, size 141
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:57556] pml_ucx.c:411 connecting to proc. 0
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:53906] pml_ucx.c:182 Got proc 1 address, size 141
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:53906] pml_ucx.c:411 connecting to proc. 1
# OSU MPI_Accumulate latency Test v5.8
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Size          Latency (us)
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:53906] pml_ucx.c:182 Got proc 0 address, size 38
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:53906] pml_ucx.c:411 connecting to proc. 0
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:57556] pml_ucx.c:182 Got proc 1 address, size 38
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:57556] pml_ucx.c:411 connecting to proc. 1
1                    3825.37
2                    3825.72
4                    3825.25
8                    3825.56
+ [22-03-14 13:57:14] __MPI_check_result 1 mpitests-openmpi OSU /usr/lib64/openmpi/bin/mpitests-osu_acc_latency mpirun /root/hfile_one_core



b)
+ [22-03-14 14:00:32] timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include qedr1:1 -mca mtl '^psm2,psm,ofi' -mca btl '^openib' --mca mtl_base_verbose 100 --mca btl_openib_verbose 100 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=qede_iw --mca osc_ucx_verbose 100 --mca pml_ucx_verbose 100 /usr/lib64/openmpi/bin/mpitests-osu_get_acc_latency
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 95
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:58268] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:54393] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:58268] pml_ucx.c:289 mca_pml_ucx_init
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:58268] pml_ucx.c:114 Pack remote worker address, size 38
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:58268] pml_ucx.c:114 Pack local worker address, size 141
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:58268] pml_ucx.c:351 created ucp context 0x562449807d90, worker 0x562449890cc0
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:54393] pml_ucx.c:289 mca_pml_ucx_init
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:54393] pml_ucx.c:114 Pack remote worker address, size 38
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:54393] pml_ucx.c:114 Pack local worker address, size 141
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:54393] pml_ucx.c:351 created ucp context 0x5557b6d21c30, worker 0x5557b6daab60
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:58268] pml_ucx.c:182 Got proc 0 address, size 141
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:58268] pml_ucx.c:411 connecting to proc. 0
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:54393] pml_ucx.c:182 Got proc 1 address, size 141
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:54393] pml_ucx.c:411 connecting to proc. 1
# OSU MPI_Get_accumulate latency Test v5.8
# Window creation: MPI_Win_create
# Synchronization: MPI_Win_lock/unlock
# Size          Latency (us)
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:54393] pml_ucx.c:182 Got proc 0 address, size 38
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:54393] pml_ucx.c:411 connecting to proc. 0
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:58268] pml_ucx.c:182 Got proc 1 address, size 38
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:58268] pml_ucx.c:411 connecting to proc. 1
1                    4933.29
2                    4934.98
4                    4939.33
mpirun: Forwarding signal 18 to job
+ [22-03-14 14:03:36] __MPI_check_result 1 mpitests-openmpi OSU /usr/lib64/openmpi/bin/mpitests-osu_get_acc_latency mpirun /root/hfile_one_core



Expected results:

Both benchmarks should complete normally and report statistics for all message sizes.

Additional info:
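
In both runs, the repeated "[create_qp:2752]create qp: failed on ibv_cmd_create_qp with 22/95" messages show the qedr userspace provider rejecting QP creation with error 22 (EINVAL) or 95 (EOPNOTSUPP) while UCX probes the device. The benchmarks still start, but the observed latencies (~3.8-4.9 ms) suggest a fallback to a much slower path, and the 3-minute timeout kills them before all message sizes are reported.

To check whether QP creation fails on these hosts outside of Open MPI/UCX, a minimal standalone libibverbs program such as the sketch below could be tried. This is an illustration only: the file name, build command, and QP capability values are assumptions and may need to be adjusted to match what UCX actually requests from the qedr device.

/*
 * qp_create_test.c - hypothetical reproducer sketch (not from the failing runs).
 * Attempts to create an RC QP on a named RDMA device directly through
 * libibverbs, bypassing Open MPI/UCX, and prints the errno on failure.
 *
 * Build/run (assumed):  gcc qp_create_test.c -o qp_create_test -libverbs
 *                       ./qp_create_test qedr1
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <infiniband/verbs.h>

int main(int argc, char **argv)
{
    const char *want = argc > 1 ? argv[1] : "qedr1"; /* device name from the failing runs */
    int num = 0;
    struct ibv_device **list = ibv_get_device_list(&num);
    struct ibv_context *ctx = NULL;

    if (!list) {
        perror("ibv_get_device_list");
        return 1;
    }
    for (int i = 0; i < num; i++) {
        if (!strcmp(ibv_get_device_name(list[i]), want)) {
            ctx = ibv_open_device(list[i]);
            break;
        }
    }
    if (!ctx) {
        fprintf(stderr, "device %s not found or could not be opened\n", want);
        return 1;
    }

    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
    if (!pd || !cq) {
        perror("ibv_alloc_pd/ibv_create_cq");
        return 1;
    }

    /* QP capabilities are guesses in the rough range a UCX RC endpoint might
     * ask for; vary these to find which values qedr rejects. */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = {
            .max_send_wr     = 256,
            .max_recv_wr     = 256,
            .max_send_sge    = 4,
            .max_recv_sge    = 4,
            .max_inline_data = 128,
        },
        .qp_type = IBV_QPT_RC,
    };

    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    if (!qp) {
        /* On the affected hosts this would be expected to print errno 22
         * (EINVAL) or 95 (EOPNOTSUPP), matching the mpirun output above. */
        fprintf(stderr, "ibv_create_qp failed: errno %d (%s)\n",
                errno, strerror(errno));
        return 1;
    }

    printf("QP %u created successfully\n", qp->qp_num);
    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(list);
    return 0;
}

If such a standalone test reproduces the errno 22/95 failures, the problem is likely in the qedr kernel driver or the libqedr provider rather than in openmpi itself; if it succeeds for all reasonable cap values, the issue is more likely in how UCX configures QPs for the iWARP device.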

