Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets there.

Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against the components "kernel", "kernel-rt", and "kpatch" are migrated only if still in "NEW" or "ASSIGNED".

If you cannot log in to RH Jira, please consult article #7032570. Failing that, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user-management inquiry; the e-mail creates a ServiceNow ticket with Red Hat.

Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and tagged with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", will have a little "two-footprint" icon next to it, and will direct you to the "RHEL project" in Red Hat Jira (issue links are of the form "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). The same link will be available in a blue banner at the top of the page informing you that the bug has been migrated.

Bug 1974937

Summary: [RHEL9.0-BETA] two openmpi latency benchmarks failed with time-outs displaying significant latency on QEDR IW device
Product: Red Hat Enterprise Linux 9
Reporter: Brian Chae <bchae>
Component: openmpi
Assignee: Nobody <nobody>
Status: CLOSED WONTFIX
QA Contact: Brian Chae <bchae>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 9.0
CC: mchopra, palok, pkushwaha, rdma-dev-team
Target Milestone: beta
Keywords: Regression
Target Release: ---
Flags: pm-rhel: mirror+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 2092512 (view as bug list)
Environment:
Last Closed: 2022-12-22 07:27:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2092512

Description Brian Chae 2021-06-22 19:38:26 UTC
Description of problem:

The two openmpi benchmarks shown below failed with timeouts, showing significantly higher latency values than the RHEL 8.4 results when run on a QEDR iWARP device.

mpitests-osu_get_acc_latency mpirun
mpitests-osu_acc_latency mpirun 

This is a regression from RHEL 8.4.


Version-Release number of selected component (if applicable):


DISTRO=RHEL-9.0.0-20210614.6

+ [21-06-15 23:51:03] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.0 Beta (Plow)

+ [21-06-15 23:51:03] uname -a
Linux rdma-dev-03.lab.bos.redhat.com 5.13.0-0.rc4.33.el9.x86_64 #1 SMP Wed Jun 2 19:15:08 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

+ [21-06-15 23:51:03] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.13.0-0.rc4.33.el9.x86_64 root=UUID=1946e1d7-f4ad-42b2-9ece-3264f30c47c5 ro console=tty0 rd_NO_PLYMOUTH intel_iommu=on iommu=on resume=UUID=87cc0df2-b6b2-4e4e-97ca-54cc90a3242b console=ttyS1,115200

+ [21-06-15 23:51:03] rpm -q rdma-core linux-firmware
rdma-core-34.0-4.el9.x86_64
linux-firmware-20210315-120.el9.noarch

+ [21-06-15 23:51:03] tail /sys/class/infiniband/qedr0/fw_ver /sys/class/infiniband/qedr1/fw_ver
==> /sys/class/infiniband/qedr0/fw_ver <==
8.42.2.0

==> /sys/class/infiniband/qedr1/fw_ver <==
8.42.2.0
+ [21-06-15 23:51:03] lspci
+ [21-06-15 23:51:03] grep -i -e ethernet -e infiniband -e omni -e ConnectX
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
08:00.0 Ethernet controller: QLogic Corp. FastLinQ QL45000 Series 25GbE Controller (rev 10)
08:00.1 Ethernet controller: QLogic Corp. FastLinQ QL45000 Series 25GbE Controller (rev 10)


Installed:
  mpitests-openmpi-5.7-2.el9.x86_64          openmpi-4.1.0-6.el9.x86_64         


RDMA hosts tested on:

Clients: rdma-dev-03
Servers: rdma-dev-02


How reproducible:

100%


Steps to Reproduce:
1. With the above build and packages, boot the RDMA server and client hosts with the QEDR iWARP device.

2. Issue the following two benchmark commands on the client:

timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include qedr1:1 -mca mtl '^psm2,psm,ofi' -mca btl '^openib' -mca pml ucx -x UCX_NET_DEVICES=qede_iw /usr/lib64/openmpi/bin/mpitests-osu_acc_latency

timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include qedr1:1 -mca mtl '^psm2,psm,ofi' -mca btl '^openib' -mca pml ucx -x UCX_NET_DEVICES=qede_iw /usr/lib64/openmpi/bin/mpitests-osu_get_acc_latency



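For quick triage, output from either benchmark can be piped through a small check that flags message sizes whose latency exceeds a chosen cutoff. This is a minimal sketch, not part of the original test harness; the 500 us threshold and the sample input are illustrative assumptions (RHEL 8.4 small-message latency was roughly 113 us, while the failing runs report over 2000 us):

```shell
#!/bin/sh
# Illustrative regression check for OSU latency output.
# The threshold is an assumed cutoff, not a value from the test harness.
THRESHOLD_US=500

parse_latency() {
    # OSU result lines look like "<size> <latency_us>"; "#" lines are headers.
    awk -v max="$THRESHOLD_US" '
        /^#/ { next }
        NF == 2 { if ($2 + 0 > max)
                      printf "size %s: %.2f us exceeds %s us\n", $1, $2, max }
    '
}

# Example against captured output instead of a live mpirun:
printf '# header\n1 2055.99\n2 113.62\n' | parse_latency
```

Piping a live run through `parse_latency` (e.g. `mpirun ... mpitests-osu_acc_latency | parse_latency`) would print only the out-of-range message sizes.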

Actual results:

# OSU MPI_Accumulate latency Test v5.7
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Size          Latency (us)
1                    2055.99
2                    2056.01
4                    2055.99
8                    2055.99
16                   2055.99
32                   2055.99
64                   2055.99
128                  2055.99
mpirun: Forwarding signal 18 to job
[1623817938.406761] [rdma-dev-02:50681:0]           sock.c:451  UCX  ERROR recv(fd=46) failed: Connection reset by peer
+ [21-06-16 00:32:20] __MPI_check_result 1 mpitests-openmpi OSU /usr/lib64/openmpi/bin/mpitests-osu_acc_latency mpirun /root/hfile_one_core


# OSU MPI_Get_accumulate latency Test v5.7
# Window creation: MPI_Win_create
# Synchronization: MPI_Win_lock/unlock
# Size          Latency (us)
1                    3751.91
2                    3755.30
4                    3751.97
8                    3749.09
mpirun: Forwarding signal 18 to job
[1623818306.290564] [rdma-dev-02:51309:0]           sock.c:451  UCX  ERROR recv(fd=46) failed: Connection reset by peer




Expected results:

Based on RHEL8.4 results:




# OSU MPI_Accumulate latency Test v5.7
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Size          Latency (us)
1                     113.62
2                     112.55
4                     112.75
8                     112.62
16                    112.28
32                    112.61
64                    113.36
128                   114.21
256                   114.96
512                   115.08
1024                  115.97
2048                  120.05
4096                  127.22
8192                  199.89
16384                 245.07
32768                 296.35
65536                 333.01
131072                451.64
262144                655.62
524288               1134.11
1048576              1999.18
2097152              3613.38
4194304              6917.08
+ [21-06-22 12:13:57] __MPI_check_result 0 mpitests-openmpi OSU /usr/lib64/openmpi/bin/mpitests-osu_acc_latency mpirun /root/hfile_one_core




# OSU MPI_Get_accumulate latency Test v5.7
# Window creation: MPI_Win_create
# Synchronization: MPI_Win_lock/unlock
# Size          Latency (us)
1                     209.78
2                     202.78
4                     196.63
8                     196.05
16                    239.80
32                    196.06
64                    232.56
128                   196.13
256                   218.99
512                   220.77
1024                  196.88
2048                  208.63
4096                  230.35
8192                  294.08
16384                 392.08
32768                 396.60
65536                 457.26
131072                564.21
262144                814.36
524288               1373.00
1048576              2415.52
2097152              4394.46
4194304              8461.62
+ [21-06-22 12:15:21] __MPI_check_result 0 mpitests-openmpi OSU /usr/lib64/openmpi/bin/mpitests-osu_get_acc_latency mpirun /root/hfile_one_core





Additional info:

Comment 2 Prabhakar 2022-02-22 10:01:47 UTC
Thanks for raising this issue with us.

Can you please help me with the following queries:
A) Is this the first time this test was run with the driver?
B) If no, can you please provide the driver version or kernel version with which it passed?

Comment 3 Brian Chae 2022-02-22 13:30:32 UTC
(In reply to Prabhakar from comment #2)
> Thanks for raising this issue with us.
> 
> Can you please help me with the following queries:
> A) Is this the first time this test was run with the driver?

[bchae] Maybe...

> B) If no, can you please provide the driver version or kernel version
> with which it passed?

It passed with the RHEL 8.5 build, and the package info is:


DISTRO=RHEL-8.5.0

+ [22-02-22 07:34:04] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.5 (Ootpa)

+ [22-02-22 07:34:04] uname -a
Linux rdma-dev-03.rdma.lab.eng.rdu2.redhat.com 4.18.0-348.el8.x86_64 #1 SMP Mon Oct 4 12:17:22 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

+ [22-02-22 07:34:04] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-348.el8.x86_64 root=/dev/mapper/rhel_rdma--dev--03-root ro console=tty0 rd_NO_PLYMOUTH intel_iommu=on iommu=on crashkernel=auto resume=/dev/mapper/rhel_rdma--dev--03-swap rd.lvm.lv=rhel_rdma-dev-03/root rd.lvm.lv=rhel_rdma-dev-03/swap console=ttyS1,115200

+ [22-02-22 07:34:04] rpm -q rdma-core linux-firmware
rdma-core-35.0-1.el8.x86_64
linux-firmware-20210702-103.gitd79c2677.el8.noarch

+ [22-02-22 07:34:04] tail /sys/class/infiniband/qedr0/fw_ver /sys/class/infiniband/qedr1/fw_ver
==> /sys/class/infiniband/qedr0/fw_ver <==
8.42.2.0

==> /sys/class/infiniband/qedr1/fw_ver <==
8.42.2.0

+ [22-02-22 07:34:04] lspci
+ [22-02-22 07:34:04] grep -i -e ethernet -e infiniband -e omni -e ConnectX
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
08:00.0 Ethernet controller: QLogic Corp. FastLinQ QL45000 Series 25GbE Controller (rev 10)
08:00.1 Ethernet controller: QLogic Corp. FastLinQ QL45000 Series 25GbE Controller (rev 10)



Installed:
  mpitests-openmpi-5.7-2.el8.x86_64          openmpi-4.1.1-2.el8.x86_64         
  openmpi-devel-4.1.1-2.el8.x86_64          


UCX version 1.10.1

Comment 5 RHEL Program Management 2022-12-22 07:27:59 UTC
After evaluating this issue, we have no plans to address it further or fix it in an upcoming release; therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, the bug can be reopened.