Bug 1960078
| Summary: | [RHEL-8.5][RDMA] update OSU Micro-Benchmarks 5.7.1 | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Honggang LI <honli> |
| Component: | mpitests | Assignee: | Honggang LI <honli> |
| Status: | CLOSED ERRATA | QA Contact: | Brian Chae <bchae> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 8.5 | CC: | bchae, linville, rdma-dev-team |
| Target Milestone: | beta | Keywords: | Triaged |
| Target Release: | --- | Flags: | pm-rhel: mirror+ |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | mpitests-5.7-2.el8 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-11-09 19:41:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1971771 | ||
| Bug Blocks: | |||
ALL OPENMPI and MVAPICH2 benchmarks tested as part of RHEL8.5 CTC#1 test cycle.
1. build and packages
DISTRO=RHEL-8.5.0-20210609.n.3
+ [21-06-14 06:53:38] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.5 Beta (Ootpa)
+ [21-06-14 06:53:38] uname -a
Linux rdma-dev-20.lab.bos.redhat.com 4.18.0-310.el8.x86_64 #1 SMP Thu May 27 14:56:02 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
+ [21-06-14 06:53:38] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-310.el8.x86_64 root=UUID=6754d3e0-4dac-4c9a-9f19-2c0d10402740 ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=UUID=d2356100-f7ee-485c-9db4-ddd3ac4bfc43 console=ttyS1,115200n81
+ [21-06-14 06:53:38] rpm -q rdma-core linux-firmware
rdma-core-35.0-1.el8.x86_64
linux-firmware-20201218-102.git05789708.el8.noarch
MVAPICH2:
=========
Installed:
mpitests-mvapich2-5.7-2.el8.x86_64 mvapich2-2.3.6-1.el8.x86_64
OPENMPI
=======
Installed:
mpitests-openmpi-5.7-2.el8.x86_64 openmpi-4.1.1-1.el8.x86_64
openmpi-devel-4.1.1-1.el8.x86_64
2. Tested HCAs
MLX4 IB, MLX4 ROCE, MLX5 IB, MLX5 ROCE, BNXT ROCE, CXGB4 IW, HFI OPA
3. Result
With MVAPICH2, the OSU benchmarks ran and reported the following version:
# OSU MPI_Accumulate latency Test v5.7.1
However, with OPENMPI, all benchmarks, including the OSU benchmarks, failed with the following error messages.
[1623669085.731067] [rdma-virt-03:71896:0] ucp_context.c:1533 UCX WARN UCP version is incompatible, required: 1.10, actual: 1.9 (release 0 /lib64/libucp.so.0)
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.
This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.
Host: rdma-virt-02
Framework: pml
--------------------------------------------------------------------------
[rdma-virt-02.lab.bos.redhat.com:71157] PML ucx cannot be selected
Refer to the following Bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1971771

Even though the OSU micro-benchmarks were tested with MVAPICH2, all of them failed
with messages that seem to suggest a problem with the OPENMPI package,
mpitests-5.7-2.el8.
Setting this bugzilla back to Assigned state.

(In reply to Brian Chae from comment #4)
> https://bugzilla.redhat.com/show_bug.cgi?id=1971771
>
> Even though the OSU micro-benchmarks were tested with MVAPICH2, all of them failed
> with messages that seem to suggest a problem with the OPENMPI package,
> mpitests-5.7-2.el8.
> Setting this bugzilla back to Assigned state.

I will revert the bad commit for openmpi. And then please re-run mpitests.
https://bugzilla.redhat.com/show_bug.cgi?id=1971771#c17
https://bugzilla.redhat.com/show_bug.cgi?id=1980171

(In reply to Honggang LI from comment #6)
> (In reply to Brian Chae from comment #4)
> > https://bugzilla.redhat.com/show_bug.cgi?id=1971771
> >
> > Even though the OSU micro-benchmarks were tested with MVAPICH2, all of them failed
> > with messages that seem to suggest a problem with the OPENMPI package,
> > mpitests-5.7-2.el8.
> > Setting this bugzilla back to Assigned state.
>
> I will revert the bad commit for openmpi. And then please re-run mpitests.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1971771#c17

I have reverted the bad commit for openmpi.

> https://bugzilla.redhat.com/show_bug.cgi?id=1980171

Alaa provided a scratch build for this bug. I confirmed it works for us.
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=38162372

As all blocker issues of this bug have been addressed, I am moving this bug to 'ON_QA' status again.

Sanity tests on MLX4 IB and MLX4 ROCE showed openmpi tests were successful.
Test results for sanity on rdma-virt-01:
4.18.0-323.el8.x86_64, rdma-core-35.0-1.el8, mlx4, ib0, & mlx4_0
Result | Status | Test
---------+--------+------------------------------------
PASS | 0 | load module mlx4_ib
PASS | 0 | load module mlx4_en
PASS | 0 | load module mlx4_core
PASS | 0 | enable opensm
PASS | 0 | restart opensm
PASS | 0 | osmtest -f c -g 0xe41d2d03001d6791
PASS | 0 | ibstatus reported expected HCA rate
PASS | 0 | pkey mlx4_ib0.8080 create/delete
PASS | 0 | /usr/sbin/ibstat
PASS | 0 | /usr/sbin/ibstatus
PASS | 0 | systemctl start srp_daemon.service
PASS | 0 | /usr/sbin/ibsrpdm -vc
PASS | 0 | systemctl stop srp_daemon
PASS | 0 | ping self - 172.31.0.201
PASS | 0 | ping6 self - fe80::e61d:2d03:1d:6791%mlx4_ib0
PASS | 0 | /usr/share/pmix/test/pmix_test
PASS | 0 | ping server - 172.31.0.200
PASS | 0 | ping6 server - fe80::e61d:2d03:1d:6701%mlx4_ib0
PASS | 0 | openmpi mpitests-IMB-MPI1 PingPong
PASS | 0 | openmpi mpitests-IMB-IO S_Read_indv
PASS | 0 | openmpi mpitests-IMB-EXT Window
PASS | 0 | openmpi mpitests-osu_get_bw
PASS | 0 | ip multicast addr
PASS | 0 | rping
PASS | 0 | rcopy
PASS | 0 | ib_read_bw
PASS | 0 | ib_send_bw
PASS | 0 | ib_write_bw
PASS | 0 | iser login
PASS | 0 | mount /dev/sdb /iser
PASS | 0 | iser write 1K
PASS | 0 | iser write 1M
PASS | 0 | iser write 1G
PASS | 0 | nfsordma mount - XFS_EXT
PASS | 0 | nfsordma - wrote [5KB, 5MB, 5GB in 1KB, 1MB, 1GB bs]
PASS | 0 | nfsordma umount - XFS_EXT
PASS | 0 | nfsordma mount - RAMDISK
PASS | 0 | nfsordma - wrote [5KB, 5MB, 5GB in 1KB, 1MB, 1GB bs]
PASS | 0 | nfsordma umount - RAMDISK
Checking for failures and known issues:
no test failures
Installed:
mpitests-openmpi-5.7-2.el8.x86_64 openmpi-4.1.1-2.el8.x86_64
openmpi-devel-4.1.1-2.el8.x86_64
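
For anyone re-checking a sanity report like the one above by script, the following is a minimal sketch (not the test harness's own tooling) that reads the `Result | Status | Test` rows and lists anything that is not a clean pass. The function names are hypothetical, and the `FAIL` and `SKIP` labels are assumptions, since only `PASS` rows appear in this report.

```python
def parse_results(text):
    """Return (result, status, test) tuples from 'Result | Status | Test' rows."""
    rows = []
    for line in text.splitlines():
        # Split on the first two pipes only, so a test name may itself contain '|'.
        parts = [p.strip() for p in line.split("|", 2)]
        if len(parts) != 3:
            continue  # the '-----+-----' separator and free text have no such shape
        result, status, test = parts
        # Assumed result labels; this also skips the 'Result | Status | Test' header.
        if result not in ("PASS", "FAIL", "SKIP"):
            continue
        try:
            rows.append((result, int(status), test))
        except ValueError:
            continue  # non-numeric status column: not a result row
    return rows


def failed_tests(rows):
    """Rows that are not PASS with status 0."""
    return [r for r in rows if r[0] != "PASS" or r[1] != 0]


sample = """\
Result | Status | Test
---------+--------+------------------------------------
PASS | 0 | openmpi mpitests-IMB-MPI1 PingPong
PASS | 0 | openmpi mpitests-osu_get_bw
"""

if __name__ == "__main__":
    rows = parse_results(sample)
    print(len(rows), "rows parsed;", len(failed_tests(rows)), "failures")
```

An empty list from `failed_tests` corresponds to the "no test failures" line in the report.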
The verification was conducted as follows:
1. build & packages
DISTRO=RHEL-8.5.0-20210721.n.0
+ [21-07-22 09:45:51] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.5 Beta (Ootpa)
+ [21-07-22 09:45:51] uname -a
Linux rdma-virt-00.lab.bos.redhat.com 4.18.0-323.el8.x86_64 #1 SMP Wed Jul 14 12:52:14 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
+ [21-07-22 09:45:51] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-323.el8.x86_64 root=UUID=5caaadea-6f56-4bcd-a27c-8f062f6a5665 ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=UUID=723711e4-2262-4b49-bb3b-8c9cffa5d35e console=ttyS1,115200n81
+ [21-07-22 09:45:51] rpm -q rdma-core linux-firmware
rdma-core-35.0-1.el8.x86_64
linux-firmware-20201218-102.git05789708.el8.noarch
openmpi package:
Installed:
mpitests-openmpi-5.7-2.el8.x86_64 openmpi-4.1.1-2.el8.x86_64
openmpi-devel-4.1.1-2.el8.x86_64
mvapich2 package:
Installed:
mpitests-mvapich2-5.7-2.el8.x86_64 mvapich2-2.3.6-1.el8.x86_64
2. HCAs tested
OPENMPI : MLX4 IB0, MLX4 ROCE, MLX5 IB0, MLX5 ROCE, BNXT ROCE, QEDR IW, HFI OPA0
mvapich2 : MLX4 IB0, MLX5 IB0
3. Results
a. All openmpi benchmarks passed on all MLX4 IB0, MLX4 ROCE, BNXT ROCE, and QEDR IW devices.
Some of the benchmarks failed on the MLX5 IB/ROCE devices; new bugzillas will be filed for these issues.
b. All mvapich2 benchmarks passed on MLX4 IB0; however, MLX4 IB on rdma-perf-00/01 had failures [ these are known issues from RHEL8.4 ]
All mvapich2 benchmarks on MLX5 IB0 failed; these are also known issues with bz filed on RHEL8.4
So, this bug is declared verified.
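
The OPENMPI failures in the first test round came with a UCX warning that names a required and an actual UCP version (1.10 versus 1.9). When scanning run logs for that condition, a sketch like the one below can pull the two versions out as integer tuples; the function name and the regular expression are illustrative assumptions built only from the warning text quoted in this bug, not part of UCX or Open MPI.

```python
import re

# Pattern modelled on the warning text quoted in this bug report.
UCP_WARN = re.compile(
    r"UCP version is incompatible, required: (\d+)\.(\d+), actual: (\d+)\.(\d+)"
)


def ucp_versions(line):
    """Return ((req_major, req_minor), (act_major, act_minor)), or None if no warning."""
    m = UCP_WARN.search(line)
    if m is None:
        return None
    required = (int(m.group(1)), int(m.group(2)))
    actual = (int(m.group(3)), int(m.group(4)))
    return required, actual


log_line = (
    "[1623669085.731067] [rdma-virt-03:71896:0] ucp_context.c:1533 UCX WARN "
    "UCP version is incompatible, required: 1.10, actual: 1.9 "
    "(release 0 /lib64/libucp.so.0)"
)

if __name__ == "__main__":
    required, actual = ucp_versions(log_line)
    # Integer tuples order 1.9 before 1.10; comparing the strings would not.
    print("installed UCP is older than required:", actual < required)
```

Comparing the versions as integer tuples matters here because, as plain strings, "1.9" sorts after "1.10".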
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (RDMA stack bug fix and enhancement update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2021:4412
Description of problem:

Version-Release number of selected component (if applicable):
OSU Micro-Benchmarks 5.7.1

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
OSU Micro Benchmarks v5.7.1 (05/11/2021)
* New Features & Enhancements
    - Add support to send and receive data from different buffers for osu_latency, osu_bw, osu_bibw, and osu_mbw_mr
    - Enhance support for CUDA managed memory benchmarks
        - Thanks to Ian Karlin and Nathan Hanford @LLNL for the feedback
    - Add support to print minimum and maximum communication times for non-blocking benchmarks
* Bug Fixes (since v5.7)
    - Update README file with updated description for osu_latency_mp
        - Thanks to Honggang Li @RedHat for the suggestion
    - Fix error in setting benchmark name in osu_allgatherv.c and osu_allgatherv.c
        - Thanks to Brandon Cook @LBL for the report