Bug 1960078 - [RHEL-8.5][RDMA] update OSU Micro-Benchmarks 5.7.1
Summary: [RHEL-8.5][RDMA] update OSU Micro-Benchmarks 5.7.1
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: mpitests
Version: 8.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: beta
Target Release: ---
Assignee: Honggang LI
QA Contact: Brian Chae
URL:
Whiteboard:
Depends On: 1971771
Blocks:
Reported: 2021-05-13 01:19 UTC by Honggang LI
Modified: 2021-11-10 08:22 UTC (History)

Fixed In Version: mpitests-5.7-2.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-09 19:41:50 UTC
Type: Bug
Target Upstream Version:



Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2021:4412 0 None None None 2021-11-09 19:42:00 UTC

Description Honggang LI 2021-05-13 01:19:59 UTC
Description of problem:


Version-Release number of selected component (if applicable):
OSU Micro-Benchmarks 5.7.1

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
OSU Micro Benchmarks v5.7.1 (05/11/2021)

* New Features & Enhancements
    - Add support to send and receive data from different buffers for
      osu_latency, osu_bw, osu_bibw, and osu_mbw_mr
    - Enhance support for CUDA managed memory benchmarks
        - Thanks to Ian Karlin and Nathan Hanford @LLNL for the feedback
    - Add support to print minimum and maximum communication times for
      non-blocking benchmarks

* Bug Fixes (since v5.7)
    - Update README file with updated description for osu_latency_mp
        - Thanks to Honggang Li @RedHat for the suggestion
    - Fix error in setting benchmark name in osu_allgatherv.c and osu_allgatherv.c
        - Thanks to Brandon Cook @LBL for the report

Comment 4 Brian Chae 2021-07-06 08:43:05 UTC
All OPENMPI and MVAPICH2 benchmarks were tested as part of the RHEL8.5 CTC#1 test cycle.

1. build and packages


DISTRO=RHEL-8.5.0-20210609.n.3

+ [21-06-14 06:53:38] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.5 Beta (Ootpa)

+ [21-06-14 06:53:38] uname -a
Linux rdma-dev-20.lab.bos.redhat.com 4.18.0-310.el8.x86_64 #1 SMP Thu May 27 14:56:02 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

+ [21-06-14 06:53:38] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-310.el8.x86_64 root=UUID=6754d3e0-4dac-4c9a-9f19-2c0d10402740 ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=UUID=d2356100-f7ee-485c-9db4-ddd3ac4bfc43 console=ttyS1,115200n81

+ [21-06-14 06:53:38] rpm -q rdma-core linux-firmware
rdma-core-35.0-1.el8.x86_64
linux-firmware-20201218-102.git05789708.el8.noarch


MVAPICH2:
=========

Installed:
  mpitests-mvapich2-5.7-2.el8.x86_64         mvapich2-2.3.6-1.el8.x86_64 

OPENMPI
=======

Installed:
  mpitests-openmpi-5.7-2.el8.x86_64          openmpi-4.1.1-1.el8.x86_64         
  openmpi-devel-4.1.1-1.el8.x86_64   

2. Tested HCAs

MLX4 IB, MLX4 ROCE, MLX5 IB, MLX5 ROCE, BNXT ROCE, CXGB4 IW, HFI OPA

3. Result

With MVAPICH2, the OSU benchmarks ran and reported the following version:

# OSU MPI_Accumulate latency Test v5.7.1

However, with OPENMPI tests, all benchmarks failed with the following error messages, including the OSU benchmarks.
[1623669085.731067] [rdma-virt-03:71896:0]    ucp_context.c:1533 UCX  WARN  UCP version is incompatible, required: 1.10, actual: 1.9 (release 0 /lib64/libucp.so.0)
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      rdma-virt-02
  Framework: pml
--------------------------------------------------------------------------
[rdma-virt-02.lab.bos.redhat.com:71157] PML ucx cannot be selected


Refer to the following Bugzilla:

https://bugzilla.redhat.com/show_bug.cgi?id=1971771


Even though the OSU micro-benchmarks were tested with MVAPICH2, all of them failed with OPENMPI, with messages that seem to suggest a problem with the OPENMPI package, mpitests-5.7-2.el8.
Setting this bugzilla back to Assigned state.
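
The UCX warning quoted in this comment carries both the required and the actual UCP version. As a minimal sketch (not part of the original test harness; the function name `parse_ucx_mismatch` is hypothetical), the two versions can be pulled out of such a log line and compared numerically, which matters because 1.10 is newer than 1.9 even though it sorts lower as a string:

```python
import re

# Pattern for the UCX warning text as it appears in the log above.
UCX_WARN = re.compile(
    r"UCP version is incompatible, required: (\d+)\.(\d+), actual: (\d+)\.(\d+)"
)


def parse_ucx_mismatch(line):
    """Return ((req_major, req_minor), (act_major, act_minor)) or None.

    Hypothetical helper for illustration; not taken from any test suite.
    """
    m = UCX_WARN.search(line)
    if m is None:
        return None
    req_major, req_minor, act_major, act_minor = map(int, m.groups())
    return (req_major, req_minor), (act_major, act_minor)


# The warning line quoted in this comment.
log_line = (
    "[1623669085.731067] [rdma-virt-03:71896:0]    ucp_context.c:1533 UCX  WARN  "
    "UCP version is incompatible, required: 1.10, actual: 1.9 "
    "(release 0 /lib64/libucp.so.0)"
)

result = parse_ucx_mismatch(log_line)
if result is not None:
    required, actual = result
    # Integer tuples compare component-wise, so (1, 9) < (1, 10) holds.
    print(f"required={required} actual={actual} too_old={actual < required}")
```

Comparing integer tuples rather than the raw strings avoids misreading 1.9 as newer than 1.10.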

Comment 6 Honggang LI 2021-07-08 01:45:21 UTC
(In reply to Brian Chae from comment #4)

> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1971771
> 
> 
> Even through OSU micro-benchmarks tested with MVAPICH2, all of them failed
> with message that seem to suggest problem with OPENMPI package,
> mpitests-5.7-2.el8.
> Setting this bugzilla back to Assigned state.

I will revert the bad commit for openmpi. After that, please re-run mpitests.

https://bugzilla.redhat.com/show_bug.cgi?id=1971771#c17
https://bugzilla.redhat.com/show_bug.cgi?id=1980171

Comment 9 Honggang LI 2021-07-22 00:10:04 UTC
(In reply to Honggang LI from comment #6)
> (In reply to Brian Chae from comment #4)
> 
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=1971771
> > 
> > 
> > Even through OSU micro-benchmarks tested with MVAPICH2, all of them failed
> > with message that seem to suggest problem with OPENMPI package,
> > mpitests-5.7-2.el8.
> > Setting this bugzilla back to Assigned state.
> 
> I will revert the bad commit for openmpi. And then please re-run mpitests.
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1971771#c17

I have reverted the bad commit for openmpi.

> https://bugzilla.redhat.com/show_bug.cgi?id=1980171

Alaa provided a scratch build for this bug. I confirmed it works for us.

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=38162372

As all blocking issues for this bug have been addressed, I am moving this bug back to 'ON_QA' status.

Comment 10 Brian Chae 2021-07-26 17:57:00 UTC
Sanity tests on MLX4 IB and MLX4 ROCE showed that the openmpi tests were successful.

Test results for sanity on rdma-virt-01:
4.18.0-323.el8.x86_64, rdma-core-35.0-1.el8, mlx4, ib0, & mlx4_0
    Result | Status | Test
  ---------+--------+------------------------------------
      PASS |      0 | load module mlx4_ib
      PASS |      0 | load module mlx4_en
      PASS |      0 | load module mlx4_core
      PASS |      0 | enable opensm
      PASS |      0 | restart opensm
      PASS |      0 | osmtest -f c -g 0xe41d2d03001d6791
      PASS |      0 | ibstatus reported expected HCA rate
      PASS |      0 | pkey mlx4_ib0.8080 create/delete
      PASS |      0 | /usr/sbin/ibstat
      PASS |      0 | /usr/sbin/ibstatus
      PASS |      0 | systemctl start srp_daemon.service
      PASS |      0 | /usr/sbin/ibsrpdm -vc
      PASS |      0 | systemctl stop srp_daemon
      PASS |      0 | ping self - 172.31.0.201
      PASS |      0 | ping6 self - fe80::e61d:2d03:1d:6791%mlx4_ib0
      PASS |      0 | /usr/share/pmix/test/pmix_test
      PASS |      0 | ping server - 172.31.0.200
      PASS |      0 | ping6 server - fe80::e61d:2d03:1d:6701%mlx4_ib0
      PASS |      0 | openmpi mpitests-IMB-MPI1 PingPong
      PASS |      0 | openmpi mpitests-IMB-IO S_Read_indv
      PASS |      0 | openmpi mpitests-IMB-EXT Window
      PASS |      0 | openmpi mpitests-osu_get_bw
      PASS |      0 | ip multicast addr
      PASS |      0 | rping
      PASS |      0 | rcopy
      PASS |      0 | ib_read_bw
      PASS |      0 | ib_send_bw
      PASS |      0 | ib_write_bw
      PASS |      0 | iser login
      PASS |      0 | mount /dev/sdb /iser
      PASS |      0 | iser write 1K
      PASS |      0 | iser write 1M
      PASS |      0 | iser write 1G
      PASS |      0 | nfsordma mount - XFS_EXT
      PASS |      0 | nfsordma - wrote [5KB, 5MB, 5GB in 1KB, 1MB, 1GB bs]
      PASS |      0 | nfsordma umount - XFS_EXT
      PASS |      0 | nfsordma mount - RAMDISK
      PASS |      0 | nfsordma - wrote [5KB, 5MB, 5GB in 1KB, 1MB, 1GB bs]
      PASS |      0 | nfsordma umount - RAMDISK

Checking for failures and known issues:
  no test failures


Installed:
  mpitests-openmpi-5.7-2.el8.x86_64          openmpi-4.1.1-2.el8.x86_64         
  openmpi-devel-4.1.1-2.el8.x86_64
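
A results table in the "Result | Status | Test" layout shown above can be checked mechanically. The following is a rough sketch only, assuming that exact three-column layout; the function name `summarize_results` is hypothetical and is not the tool that produced the report:

```python
def summarize_results(table_text):
    """Count rows per result value and list the tests that did not PASS.

    Hypothetical helper; assumes 'Result | Status | Test' rows as above.
    """
    counts = {}
    failures = []
    for line in table_text.splitlines():
        # Split into at most three fields so a '|' inside a test name survives.
        parts = [p.strip() for p in line.split("|", 2)]
        if len(parts) != 3:
            continue  # separator line or unrelated text
        result, status, test = parts
        if not status.lstrip("-").isdigit():
            continue  # header row: the Status column is not a number
        counts[result] = counts.get(result, 0) + 1
        if result != "PASS":
            failures.append(test)
    return counts, failures


# A few rows copied from the table in this comment.
sample = """\
    Result | Status | Test
  ---------+--------+------------------------------------
      PASS |      0 | openmpi mpitests-IMB-MPI1 PingPong
      PASS |      0 | openmpi mpitests-osu_get_bw
      PASS |      0 | ib_send_bw
"""

counts, failures = summarize_results(sample)
print(counts, failures)
```

An empty failure list corresponds to the "no test failures" line in the report.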

Comment 11 Brian Chae 2021-07-28 14:52:16 UTC
The verification was conducted as follows:

1. build & packages

DISTRO=RHEL-8.5.0-20210721.n.0

+ [21-07-22 09:45:51] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.5 Beta (Ootpa)

+ [21-07-22 09:45:51] uname -a
Linux rdma-virt-00.lab.bos.redhat.com 4.18.0-323.el8.x86_64 #1 SMP Wed Jul 14 12:52:14 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

+ [21-07-22 09:45:51] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-323.el8.x86_64 root=UUID=5caaadea-6f56-4bcd-a27c-8f062f6a5665 ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=UUID=723711e4-2262-4b49-bb3b-8c9cffa5d35e console=ttyS1,115200n81

+ [21-07-22 09:45:51] rpm -q rdma-core linux-firmware
rdma-core-35.0-1.el8.x86_64
linux-firmware-20201218-102.git05789708.el8.noarch


openmpi package:
Installed:
  mpitests-openmpi-5.7-2.el8.x86_64          openmpi-4.1.1-2.el8.x86_64         
  openmpi-devel-4.1.1-2.el8.x86_64          

mvapich2 package:
Installed:
  mpitests-mvapich2-5.7-2.el8.x86_64         mvapich2-2.3.6-1.el8.x86_64        

2. HCAs tested

OPENMPI : MLX4 IB0, MLX4 ROCE, MLX5 IB0, MLX5 ROCE, BNXT ROCE, QEDR IW, HFI OPA0
mvapich2 : MLX4 IB0, MLX5 IB0

3. Results

a. All openmpi benchmarks passed on the MLX4 IB0, MLX4 ROCE, BNXT ROCE, and QEDR IW devices.
    Some of the benchmarks failed on the MLX5 IB/ROCE devices; new bugzillas will be filed for these issues.
b. All mvapich2 benchmarks passed on MLX4 IB0; however, MLX4 IB on rdma-perf-00/01 had failures [these are known issues from RHEL8.4].
   All mvapich2 benchmarks on MLX5 IB0 failed; these are also known issues with bugzillas filed on RHEL8.4.

So, this bug will be declared verified.

Comment 13 errata-xmlrpc 2021-11-09 19:41:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RDMA stack bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4412
