Bug 2064309

Summary: [RHEL8.6] 2 openmpi benchmarks fail consistently on QEDE IW device
Product: Red Hat Enterprise Linux 8
Component: openmpi
Version: 8.6
Status: ASSIGNED
Severity: unspecified
Priority: unspecified
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Reporter: Brian Chae <bchae>
Assignee: Kamal Heib <kheib>
QA Contact: Infiniband QE <infiniband-qe>
CC: kheib, rdma-dev-team
Type: Bug

Description Brian Chae 2022-03-15 14:14:40 UTC
Description of problem:

The following openmpi benchmarks fail consistently on the QEDE iWARP (IW) device:
      FAIL |      1 | openmpi OSU acc_latency mpirun one_core
      FAIL |      1 | openmpi OSU get_acc_latency mpirun one_core

Version-Release number of selected component (if applicable):


DISTRO=RHEL-8.6.0-20220308.2

+ [22-03-09 14:32:59] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.6 Beta (Ootpa)

+ [22-03-09 14:32:59] uname -a
Linux rdma-dev-03.rdma.lab.eng.rdu2.redhat.com 4.18.0-369.el8.x86_64 #1 SMP Mon Feb 21 10:56:06 EST 2022 x86_64 x86_64 x86_64 GNU/Linux

+ [22-03-09 14:32:59] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-369.el8.x86_64 root=UUID=c5d529c7-6a64-4977-919f-0a74fd1e8ea4 ro console=tty0 rd_NO_PLYMOUTH intel_iommu=on iommu=on crashkernel=auto resume=UUID=fc27848f-6563-403d-913d-57cde4a66bc5 console=ttyS1,115200

+ [22-03-09 14:32:59] rpm -q rdma-core linux-firmware
rdma-core-37.2-1.el8.x86_64
linux-firmware-20220210-106.git6342082c.el8.noarch

+ [22-03-09 14:32:59] tail /sys/class/infiniband/qedr0/fw_ver /sys/class/infiniband/qedr1/fw_ver
==> /sys/class/infiniband/qedr0/fw_ver <==
8. 42. 2. 0

==> /sys/class/infiniband/qedr1/fw_ver <==
8. 42. 2. 0
+ [22-03-09 14:32:59] lspci
+ [22-03-09 14:32:59] grep -i -e ethernet -e infiniband -e omni -e ConnectX
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
08:00.0 Ethernet controller: QLogic Corp. FastLinQ QL45000 Series 25GbE Controller (rev 10)
08:00.1 Ethernet controller: QLogic Corp. FastLinQ QL45000 Series 25GbE Controller (rev 10)


How reproducible:
100%

Steps to Reproduce:
1. Bring up the RDMA hosts mentioned above with the RHEL8.6 build.
2. Set up the RDMA hosts for openmpi benchmark tests.
3. Run the failing benchmark commands on the client as follows (a pre-run sanity check is sketched after the commands):

a) 
timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include qedr1:1 -mca mtl '^psm2,psm,ofi' -mca btl '^openib' --mca mtl_base_verbose 100 --mca btl_openib_verbose 100 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=qede_iw --mca osc_ucx_verbose 100 --mca pml_ucx_verbose 100 /usr/lib64/openmpi/bin/mpitests-osu_acc_latency


b)
timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include qedr1:1 -mca mtl '^psm2,psm,ofi' -mca btl '^openib' --mca mtl_base_verbose 100 --mca btl_openib_verbose 100 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=qede_iw --mca osc_ucx_verbose 100 --mca pml_ucx_verbose 100 /usr/lib64/openmpi/bin/mpitests-osu_get_acc_latency
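
Before running the benchmarks, it may help to confirm that the qedr1 port is active and that UCX detects the qede iWARP transport. A minimal sketch, assuming libibverbs-utils and the ucx package are installed on both hosts:

ibv_devinfo -d qedr1                 # port state, firmware version, transport type
ucx_info -d | grep -B 2 -A 8 qede    # UCX transports/devices detected for the qede NIC

If ucx_info does not list the qede_iw transport here, the mpirun failures below are expected regardless of the benchmark used.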


Actual results:

a)
+ [22-03-09 15:05:36] timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include qedr1:1 -mca mtl '^psm2,psm,ofi' -mca btl '^openib' --mca mtl_base_verbose 100 --mca btl_openib_verbose 100 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=qede_iw --mca osc_ucx_verbose 100 --mca pml_ucx_verbose 100 /usr/lib64/openmpi/bin/mpitests-osu_acc_latency
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:68871] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:69994] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:68871] pml_ucx.c:289 mca_pml_ucx_init
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:68871] pml_ucx.c:114 Pack remote worker address, size 38
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:68871] pml_ucx.c:114 Pack local worker address, size 141
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:68871] pml_ucx.c:351 created ucp context 0x564ade796090, worker 0x564ade7bb330
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:69994] pml_ucx.c:289 mca_pml_ucx_init
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:69994] pml_ucx.c:114 Pack remote worker address, size 38
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:69994] pml_ucx.c:114 Pack local worker address, size 141
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:69994] pml_ucx.c:351 created ucp context 0x555a682da060, worker 0x555a682ff300
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:69994] pml_ucx.c:182 Got proc 0 address, size 141
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:69994] pml_ucx.c:411 connecting to proc. 0
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:68871] pml_ucx.c:182 Got proc 1 address, size 141
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:68871] pml_ucx.c:411 connecting to proc. 1
# OSU MPI_Accumulate latency Test v5.8
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Size          Latency (us)
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:68871] pml_ucx.c:182 Got proc 0 address, size 38
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:68871] pml_ucx.c:411 connecting to proc. 0
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:69994] pml_ucx.c:182 Got proc 1 address, size 38
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:69994] pml_ucx.c:411 connecting to proc. 1
1                    2569.98
2                    2569.98
4                    2569.98
8                    2569.98
16                   2569.98
32                   2569.98
+ [22-03-09 15:08:40] __MPI_check_result 1 mpitests-openmpi OSU /usr/lib64/openmpi/bin/mpitests-osu_acc_latency mpirun /root/hfile_one_core


b)
+ [22-03-09 15:05:36] timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include qedr1:1 -mca mtl '^psm2,psm,ofi' -mca btl '^openib' --mca mtl_base_verbose 100 --mca btl_openib_verbose 100 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=qede_iw --mca osc_ucx_verbose 100 --mca pml_ucx_verbose 100 /usr/lib64/openmpi/bin/mpitests-osu_acc_latency
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:68871] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:69994] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:68871] pml_ucx.c:289 mca_pml_ucx_init
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:68871] pml_ucx.c:114 Pack remote worker address, size 38
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:68871] pml_ucx.c:114 Pack local worker address, size 141
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:68871] pml_ucx.c:351 created ucp context 0x564ade796090, worker 0x564ade7bb330
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:69994] pml_ucx.c:289 mca_pml_ucx_init
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:69994] pml_ucx.c:114 Pack remote worker address, size 38
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:69994] pml_ucx.c:114 Pack local worker address, size 141
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:69994] pml_ucx.c:351 created ucp context 0x555a682da060, worker 0x555a682ff300
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:69994] pml_ucx.c:182 Got proc 0 address, size 141
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:69994] pml_ucx.c:411 connecting to proc. 0
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:68871] pml_ucx.c:182 Got proc 1 address, size 141
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:68871] pml_ucx.c:411 connecting to proc. 1
# OSU MPI_Accumulate latency Test v5.8
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Size          Latency (us)
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:68871] pml_ucx.c:182 Got proc 0 address, size 38
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:68871] pml_ucx.c:411 connecting to proc. 0
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:69994] pml_ucx.c:182 Got proc 1 address, size 38
[rdma-dev-03.rdma.lab.eng.rdu2.redhat.com:69994] pml_ucx.c:411 connecting to proc. 1
1                    2569.98
2                    2569.98
4                    2569.98
8                    2569.98
16                   2569.98
32                   2569.98
+ [22-03-09 15:08:40] __MPI_check_result 1 mpitests-openmpi OSU /usr/lib64/openmpi/bin/mpitests-osu_acc_latency mpirun /root/hfile_one_core

Expected results:
Both benchmarks complete normally and print latency statistics for every message size.

Additional info:
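In both runs the repeated create_qp failures return error 22 (EINVAL), with a few 95 (EOPNOTSUPP), which suggests the qedr driver is rejecting the QP attributes requested by UCX. The benchmark then appears to stall after the 32-byte message size and is killed by the 3-minute timeout (run started 15:05:36, result checked at 15:08:40).

To isolate QP creation and connection setup from the MPI/UCX stack, a plain rdma_cm ping test can be run between the two hosts. A minimal sketch, assuming rping from librdmacm-utils is installed and <server_ip> is a placeholder for the address configured on the qedr1 port (iWARP connections must be established through rdma_cm, so rping exercises the same QP-creation path):

# on rdma-dev-02 (server)
rping -s -a <server_ip> -v

# on rdma-dev-03 (client)
rping -c -a <server_ip> -v -C 3

If rping succeeds while the UCX-driven runs keep failing, the problem is likely in the specific QP attributes UCX requests for the qede_iw transport; note that both failing benchmarks exercise RMA atomics (MPI_Accumulate / MPI_Get_accumulate).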