Bug 1991185
| Summary: | [RHEL9.0] All ucx_info commands for endpoint config fail on MLX5 IB and ROCE devices | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Brian Chae <bchae> |
| Component: | ucx | Assignee: | Jonathan Toppins <jtoppins> |
| Status: | CLOSED ERRATA | QA Contact: | Brian Chae <bchae> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 9.0 | CC: | cwei, rdma-dev-team |
| Target Milestone: | rc | Keywords: | Regression, Triaged |
| Target Release: | 9.0 | Flags: | pm-rhel: mirror+ |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | ucx-1.11.2-2.el9 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-05-17 15:53:57 UTC | Type: | Bug |
Is this still a problem given the information in the similar RHEL-8.5 bug?
Tested with the latest RHEL-9.0 build, but the issue still persists.
Test results for ucx/ucx/ on rdma-dev-20:
5.14.0-11.el9.x86_64, rdma-core-37.1-1.el9, mlx5, ib0, & mlx5_2
Result | Status | Test
---------+--------+------------------------------------
PASS | 0 | install ucx
PASS | 0 | ucx version info
PASS | 0 | ucx build info
PASS | 0 | ucx system info
PASS | 0 | ucx device info
PASS | 0 | ucx transport info - posix
PASS | 0 | ucx transport info - self
PASS | 0 | ucx transport info - sysv
PASS | 0 | ucx transport info - tcp
PASS | 0 | ucx configuration info
PASS | 0 | ucp context info for a
PASS | 0 | ucp worker info for a
FAIL | 242 | ucp endpoint config for a
PASS | 0 | ucp context info for r
PASS | 0 | ucp worker info for r
FAIL | 242 | ucp endpoint config for r
PASS | 0 | ucp context info for t
PASS | 0 | ucp worker info for t
FAIL | 242 | ucp endpoint config for t
PASS | 0 | ucp context info for w
PASS | 0 | ucp worker info for w
FAIL | 242 | ucp endpoint config for w
PASS | 0 | ucp context info for ae
PASS | 0 | ucp worker info for ae
FAIL | 242 | ucp endpoint config for ae
PASS | 0 | ucp context info for re
PASS | 0 | ucp worker info for re
FAIL | 242 | ucp endpoint config for re
PASS | 0 | ucp context info for te
PASS | 0 | ucp worker info for te
FAIL | 242 | ucp endpoint config for te
PASS | 0 | ucp context info for we
PASS | 0 | ucp worker info for we
FAIL | 242 | ucp endpoint config for we
PASS | 0 | ucx type and struct info
FAIL | 255 | ucx_perftest am_lat
FAIL | 255 | ucx_perftest put_lat
FAIL | 255 | ucx_perftest add_lat
FAIL | 255 | ucx_perftest fadd
FAIL | 255 | ucx_perftest cswap
FAIL | 255 | ucx_perftest am_bw
FAIL | 255 | ucx_perftest put_bw
FAIL | 255 | ucx_perftest add_mr
PASS | 0 | ucx_perftest tag_lat
PASS | 0 | ucx_perftest tag_bw
PASS | 0 | ucx_perftest ucp_put_lat
PASS | 0 | ucx_perftest ucp_put_bw
FAIL | 143 | ucx_perftest ucp_get
PASS | 0 | openmpi setup
PASS | 0 | openmpi built with ucx
FAIL | 16 | openmpi ucx osu_bw
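A results table in this "Result | Status | Test" layout can be tallied by return code to see which failure classes dominate. The following is a minimal sketch, not part of the original test harness: the function name `summarize_failures` is hypothetical, and the sample rows fed to it are a small subset copied from the table above.

```shell
#!/bin/sh
# Sketch: count FAIL rows per return code in a "Result | Status | Test" table.
# summarize_failures is a hypothetical helper, not part of the rdma test suite.
summarize_failures() {
    awk -F'|' '
        $1 ~ /FAIL/ {
            gsub(/ /, "", $2)   # strip padding around the status code
            count[$2]++
        }
        END { for (rc in count) print rc, count[rc] }
    ' | sort -n
}

# A few rows taken from the table above; feed the full log the same way.
summarize_failures <<'EOF'
PASS | 0 | ucp worker info for a
FAIL | 242 | ucp endpoint config for a
FAIL | 242 | ucp endpoint config for r
FAIL | 255 | ucx_perftest am_lat
FAIL | 143 | ucx_perftest ucp_get
FAIL | 16 | openmpi ucx osu_bw
EOF
```

Each output line is a return code followed by the number of FAIL rows that carried it, sorted numerically by return code.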
1. Build and packages tested
DISTRO=RHEL-9.0.0-20211104.5
+ [21-11-05 07:10:57] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.0 Beta (Plow)
+ [21-11-05 07:10:57] uname -a
Linux rdma-dev-20.rdma.lab.eng.rdu2.redhat.com 5.14.0-11.el9.x86_64 #1 SMP Thu Oct 28 18:29:41 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
+ [21-11-05 07:10:57] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-11.el9.x86_64 root=/dev/mapper/rhel_rdma--dev--20-root ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=/dev/mapper/rhel_rdma--dev--20-swap rd.lvm.lv=rhel_rdma-dev-20/root rd.lvm.lv=rhel_rdma-dev-20/swap console=ttyS1,115200n81
+ [21-11-05 07:10:57] rpm -q rdma-core linux-firmware
rdma-core-37.1-1.el9.x86_64
linux-firmware-20210919-122.el9.noarch
+ [21-11-05 07:10:57] tail /sys/class/infiniband/mlx5_2/fw_ver /sys/class/infiniband/mlx5_3/fw_ver /sys/class/infiniband/mlx5_bond_0/fw_ver
==> /sys/class/infiniband/mlx5_2/fw_ver <==
12.28.2006
==> /sys/class/infiniband/mlx5_3/fw_ver <==
12.28.2006
==> /sys/class/infiniband/mlx5_bond_0/fw_ver <==
14.31.1014
Installed:
mpitests-openmpi-5.7-4.el9.x86_64 openmpi-4.1.1-4.el9.x86_64
openmpi-devel-4.1.1-4.el9.x86_64
Package ucx-1.11.2-1.el9.x86_64 is already installed. <<<=============================================
2. HW tested: MLX5 IB
Servers: rdma-dev-19
Clients: rdma-dev-20
The verification test was conducted as follows:
1. build and packages
DISTRO=RHEL-9.0.0-20220103.2
+ [22-01-04 08:04:53] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.0 Beta (Plow)
+ [22-01-04 08:04:53] uname -a
Linux rdma-dev-20.rdma.lab.eng.rdu2.redhat.com 5.14.0-39.el9.x86_64 #1 SMP PREEMPT Fri Dec 24 00:07:58 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
+ [22-01-04 08:04:53] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-39.el9.x86_64 root=/dev/mapper/rhel_rdma--dev--20-root ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=/dev/mapper/rhel_rdma--dev--20-swap rd.lvm.lv=rhel_rdma-dev-20/root rd.lvm.lv=rhel_rdma-dev-20/swap console=ttyS1,115200n81
+ [22-01-04 08:04:53] rpm -q rdma-core linux-firmware
rdma-core-37.1-1.el9.x86_64
linux-firmware-20211027-123.el9.noarch
Installed:
ucx-cma-1.11.2-2.el9.x86_64 ucx-ib-1.11.2-2.el9.x86_64
ucx-rdmacm-1.11.2-2.el9.x86_64
Package ucx-1.11.2-2.el9.x86_64 is already installed.
2. all UCP tests passed
PASS | 0 | ucp context info for a
PASS | 0 | ucp worker info for a
PASS | 0 | ucp endpoint config for a
PASS | 0 | ucp context info for r
PASS | 0 | ucp worker info for r
PASS | 0 | ucp endpoint config for r
PASS | 0 | ucp context info for t
PASS | 0 | ucp worker info for t
PASS | 0 | ucp endpoint config for t
PASS | 0 | ucp context info for m
PASS | 0 | ucp worker info for m
PASS | 0 | ucp endpoint config for m
PASS | 0 | ucp context info for ae
PASS | 0 | ucp worker info for ae
PASS | 0 | ucp endpoint config for ae
PASS | 0 | ucp context info for re
PASS | 0 | ucp worker info for re
PASS | 0 | ucp endpoint config for re
PASS | 0 | ucp context info for te
PASS | 0 | ucp worker info for te
PASS | 0 | ucp endpoint config for te
PASS | 0 | ucp context info for me
PASS | 0 | ucp worker info for me
PASS | 0 | ucp endpoint config for me
PASS | 0 | ucp context info for aw
PASS | 0 | ucp worker info for aw
PASS | 0 | ucp endpoint config for aw
PASS | 0 | ucp context info for rw
PASS | 0 | ucp worker info for rw
PASS | 0 | ucp endpoint config for rw
PASS | 0 | ucp context info for tw
PASS | 0 | ucp worker info for tw
PASS | 0 | ucp endpoint config for tw
PASS | 0 | ucp context info for mw
PASS | 0 | ucp worker info for mw
PASS | 0 | ucp endpoint config for mw
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (new packages: RDMA stack), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2022:3950
Description of problem:
With build RHEL-9.0.0-20210805.7, all ucp endpoint config commands failed on MLX5 IB and ROCE devices with a return code of 242, as follows:
FAIL | 242 | ucp endpoint config for a
FAIL | 242 | ucp endpoint config for r
FAIL | 242 | ucp endpoint config for t
FAIL | 242 | ucp endpoint config for w
FAIL | 242 | ucp endpoint config for ae
FAIL | 242 | ucp endpoint config for re
FAIL | 242 | ucp endpoint config for te
FAIL | 242 | ucp endpoint config for we
The commands that failed are:
ucx_info -u a -e -n 256
ucx_info -u r -e -n 256
ucx_info -u t -e -n 256
ucx_info -u w -e -n 256
ucx_info -u ae -e -n 256
ucx_info -u re -e -n 256
ucx_info -u te -e -n 256
ucx_info -u we -e -n 256
Version-Release number of selected component (if applicable):
DISTRO=RHEL-9.0.0-20210805.7
+ [21-08-06 12:33:39] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.0 Beta (Plow)
+ [21-08-06 12:33:39] uname -a
Linux rdma-dev-22.lab.bos.redhat.com 5.14.0-0.rc4.35.el9.x86_64 #1 SMP Tue Aug 3 13:02:44 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
+ [21-08-06 12:33:39] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-0.rc4.35.el9.x86_64 root=/dev/mapper/rhel_rdma--dev--22-root ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=/dev/mapper/rhel_rdma--dev--22-swap rd.lvm.lv=rhel_rdma-dev-22/root rd.lvm.lv=rhel_rdma-dev-22/swap console=ttyS1,115200n81
+ [21-08-06 12:33:39] rpm -q rdma-core linux-firmware
rdma-core-35.0-2.el9.x86_64
linux-firmware-20210315-120.el9.noarch
Installed:
mpitests-openmpi-5.7-3.el9.x86_64 openmpi-4.1.1-3.el9.x86_64
openmpi-devel-4.1.1-3.el9.x86_64
Package ucx-1.10.1-2.el9.x86_64 is already installed.
How reproducible:
100%
Steps to Reproduce:
1. With the above packages, run the following commands on the server and then on the client, one at a time
2. Set UCX_TLS=rc UCX_NET_DEVICES=<hca_id>:<port_id> (for example, mlx5_1:1)
3. Run:
ucx_info -u a -e -n 256
ucx_info -u r -e -n 256
ucx_info -u t -e -n 256
ucx_info -u w -e -n 256
ucx_info -u ae -e -n 256
ucx_info -u re -e -n 256
ucx_info -u te -e -n 256
ucx_info -u we -e -n 256
Actual results:
All of the above commands failed with a return code of 242 on both the server side and the client side
Expected results:
All of the commands complete with a return code of 0
Additional info:
When the above failed "ucx_info" commands were run with RHEL-9.0.0-20210607.0, they all passed:
PASS | 0 | ucp endpoint config for a
PASS | 0 | ucp endpoint config for r
PASS | 0 | ucp endpoint config for t
PASS | 0 | ucp endpoint config for w
PASS | 0 | ucp endpoint config for ae
PASS | 0 | ucp endpoint config for re
PASS | 0 | ucp endpoint config for te
PASS | 0 | ucp endpoint config for we
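The reproduction steps can be scripted so that every feature string is exercised in one pass. The following is a rough sketch rather than the actual QA test script: the function names and the `DEVICE` and `FEATURES` variables are hypothetical, and `mlx5_1:1` is only the example device from the steps. As written it only prints the command lines; calling `run_ucx_endpoint_cmds` assumes `ucx_info` is installed and the named device exists on the host.

```shell
#!/bin/sh
# Sketch: generate or run the ucx_info endpoint-config commands from this report.
# DEVICE is a placeholder; replace it with a real <hca_id>:<port_id> pair.
DEVICE="${DEVICE:-mlx5_1:1}"
# Feature strings taken from the failing commands in the report.
FEATURES="a r t w ae re te we"

# Print one command line per feature string; nothing is executed here.
ucx_endpoint_cmds() {
    for f in $FEATURES; do
        echo "UCX_TLS=rc UCX_NET_DEVICES=$DEVICE ucx_info -u $f -e -n 256"
    done
}

# Run each command and report its return code (requires ucx_info in PATH).
run_ucx_endpoint_cmds() {
    for f in $FEATURES; do
        UCX_TLS=rc UCX_NET_DEVICES="$DEVICE" ucx_info -u "$f" -e -n 256 >/dev/null 2>&1
        echo "rc=$? ucp endpoint config for $f"
    done
}

# Dry run: list the commands without touching any RDMA device.
ucx_endpoint_cmds
```

On an affected build, each line printed by `run_ucx_endpoint_cmds` would be expected to show the return code 242 described in this report, and 0 once the fix is installed.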