Bug 1991185

Summary: [RHEL9.0] All ucx_info commands for endpoint config fail on MLX5 IB and RoCE devices
Product: Red Hat Enterprise Linux 9
Component: ucx
Version: 9.0
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: Brian Chae <bchae>
Assignee: Jonathan Toppins <jtoppins>
QA Contact: Brian Chae <bchae>
CC: cwei, rdma-dev-team
Keywords: Regression, Triaged
Flags: pm-rhel: mirror+
Target Milestone: rc
Target Release: 9.0
Fixed In Version: ucx-1.11.2-2.el9
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2022-05-17 15:53:57 UTC

Description Brian Chae 2021-08-08 00:43:35 UTC
Description of problem:

With build RHEL-9.0.0-20210805.7, all ucp endpoint config commands failed on MLX5 IB and RoCE devices with return code 242:

      FAIL |    242 | ucp endpoint config for a
      FAIL |    242 | ucp endpoint config for r
      FAIL |    242 | ucp endpoint config for t
      FAIL |    242 | ucp endpoint config for w
      FAIL |    242 | ucp endpoint config for ae
      FAIL |    242 | ucp endpoint config for re
      FAIL |    242 | ucp endpoint config for te
      FAIL |    242 | ucp endpoint config for we

The commands that failed are:

ucx_info -u a -e -n 256
ucx_info -u r -e -n 256
ucx_info -u t -e -n 256
ucx_info -u w -e -n 256
ucx_info -u ae -e -n 256
ucx_info -u re -e -n 256
ucx_info -u te -e -n 256
ucx_info -u we -e -n 256



Version-Release number of selected component (if applicable):

DISTRO=RHEL-9.0.0-20210805.7

+ [21-08-06 12:33:39] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.0 Beta (Plow)

+ [21-08-06 12:33:39] uname -a
Linux rdma-dev-22.lab.bos.redhat.com 5.14.0-0.rc4.35.el9.x86_64 #1 SMP Tue Aug 3 13:02:44 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

+ [21-08-06 12:33:39] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-0.rc4.35.el9.x86_64 root=/dev/mapper/rhel_rdma--dev--22-root ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=/dev/mapper/rhel_rdma--dev--22-swap rd.lvm.lv=rhel_rdma-dev-22/root rd.lvm.lv=rhel_rdma-dev-22/swap console=ttyS1,115200n81

+ [21-08-06 12:33:39] rpm -q rdma-core linux-firmware
rdma-core-35.0-2.el9.x86_64
linux-firmware-20210315-120.el9.noarch


Installed:
  mpitests-openmpi-5.7-3.el9.x86_64          openmpi-4.1.1-3.el9.x86_64         
  openmpi-devel-4.1.1-3.el9.x86_64          


Package ucx-1.10.1-2.el9.x86_64 is already installed.


How reproducible:

100%

Steps to Reproduce:
1. With the above packages installed, run the following commands on the server, then on the client, one at a time.

2. Set the transport and device in the environment:
   UCX_TLS=rc
   UCX_NET_DEVICES=<hca_id>:<port_id>, for example mlx5_1:1

3. ucx_info -u a -e -n 256
   ucx_info -u r -e -n 256
   ucx_info -u t -e -n 256
   ucx_info -u w -e -n 256
   ucx_info -u ae -e -n 256
   ucx_info -u re -e -n 256
   ucx_info -u te -e -n 256
   ucx_info -u we -e -n 256
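
The steps above can be sketched as a small shell loop. This is only an illustration, not the actual test harness: the function name, the DRY_RUN switch, and the default device mlx5_1:1 are assumptions, and the output format merely imitates the result tables in this report. It runs on the local host only, so it would need to be run once on the server and once on the client.

```shell
# Sketch of the reproduction loop (not the real test suite).
# Assumptions: run_endpoint_checks, DRY_RUN and the default device are
# hypothetical names/values chosen for this example.
export UCX_TLS=rc
export UCX_NET_DEVICES="${UCX_NET_DEVICES:-mlx5_1:1}"   # <hca_id>:<port_id>

# Feature combinations from the report, passed to `ucx_info -u`.
FEATURES="a r t w ae re te we"

run_endpoint_checks() {
    local f rc
    for f in $FEATURES; do
        if [ -n "${DRY_RUN:-}" ]; then
            # Only print the command; useful on hosts without UCX installed.
            echo "ucx_info -u $f -e -n 256"
            continue
        fi
        ucx_info -u "$f" -e -n 256 >/dev/null 2>&1
        rc=$?
        if [ "$rc" -eq 0 ]; then
            printf '      PASS | %6d | ucp endpoint config for %s\n' "$rc" "$f"
        else
            printf '      FAIL | %6d | ucp endpoint config for %s\n' "$rc" "$f"
        fi
    done
}
```

Calling `run_endpoint_checks` runs all eight commands and prints one result row each; `DRY_RUN=1 run_endpoint_checks` only lists the commands.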


Actual results:

All of the above commands failed with return code 242 on both the server and the client.

Expected results:

All of the commands complete with return code 0.

Additional info:

The same "ucx_info" commands passed when run with RHEL-9.0.0-20210607.0:

      PASS |      0 | ucp endpoint config for a
      PASS |      0 | ucp endpoint config for r
      PASS |      0 | ucp endpoint config for t
      PASS |      0 | ucp endpoint config for w
      PASS |      0 | ucp endpoint config for ae
      PASS |      0 | ucp endpoint config for re
      PASS |      0 | ucp endpoint config for te
      PASS |      0 | ucp endpoint config for we

Comment 1 Jonathan Toppins 2021-08-16 17:38:07 UTC
Is this still a problem given the information in the similar RHEL-8.5 bug?

Comment 6 Brian Chae 2021-11-05 11:32:39 UTC
Tested with the latest RHEL-9.0 build; the issue still persists.

Test results for ucx/ucx/ on rdma-dev-20:
5.14.0-11.el9.x86_64, rdma-core-37.1-1.el9, mlx5, ib0, & mlx5_2
    Result | Status | Test
  ---------+--------+------------------------------------
      PASS |      0 | install ucx
      PASS |      0 | ucx version info
      PASS |      0 | ucx build info
      PASS |      0 | ucx system info
      PASS |      0 | ucx device info
      PASS |      0 | ucx transport info - posix
      PASS |      0 | ucx transport info - self
      PASS |      0 | ucx transport info - sysv
      PASS |      0 | ucx transport info - tcp
      PASS |      0 | ucx configuration info
      PASS |      0 | ucp context info for a
      PASS |      0 | ucp worker info for a
      FAIL |    242 | ucp endpoint config for a
      PASS |      0 | ucp context info for r
      PASS |      0 | ucp worker info for r
      FAIL |    242 | ucp endpoint config for r
      PASS |      0 | ucp context info for t
      PASS |      0 | ucp worker info for t
      FAIL |    242 | ucp endpoint config for t
      PASS |      0 | ucp context info for w
      PASS |      0 | ucp worker info for w
      FAIL |    242 | ucp endpoint config for w
      PASS |      0 | ucp context info for ae
      PASS |      0 | ucp worker info for ae
      FAIL |    242 | ucp endpoint config for ae
      PASS |      0 | ucp context info for re
      PASS |      0 | ucp worker info for re
      FAIL |    242 | ucp endpoint config for re
      PASS |      0 | ucp context info for te
      PASS |      0 | ucp worker info for te
      FAIL |    242 | ucp endpoint config for te
      PASS |      0 | ucp context info for we
      PASS |      0 | ucp worker info for we
      FAIL |    242 | ucp endpoint config for we
      PASS |      0 | ucx type and struct info
      FAIL |    255 | ucx_perftest am_lat
      FAIL |    255 | ucx_perftest put_lat
      FAIL |    255 | ucx_perftest add_lat
      FAIL |    255 | ucx_perftest fadd
      FAIL |    255 | ucx_perftest cswap
      FAIL |    255 | ucx_perftest am_bw
      FAIL |    255 | ucx_perftest put_bw
      FAIL |    255 | ucx_perftest add_mr
      PASS |      0 | ucx_perftest tag_lat
      PASS |      0 | ucx_perftest tag_bw
      PASS |      0 | ucx_perftest ucp_put_lat
      PASS |      0 | ucx_perftest ucp_put_bw
      FAIL |    143 | ucx_perftest ucp_get
      PASS |      0 | openmpi setup
      PASS |      0 | openmpi built with ucx
      FAIL |     16 | openmpi ucx osu_bw

1. Build and packages tested


DISTRO=RHEL-9.0.0-20211104.5

+ [21-11-05 07:10:57] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.0 Beta (Plow)

+ [21-11-05 07:10:57] uname -a
Linux rdma-dev-20.rdma.lab.eng.rdu2.redhat.com 5.14.0-11.el9.x86_64 #1 SMP Thu Oct 28 18:29:41 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
+ [21-11-05 07:10:57] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-11.el9.x86_64 root=/dev/mapper/rhel_rdma--dev--20-root ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=/dev/mapper/rhel_rdma--dev--20-swap rd.lvm.lv=rhel_rdma-dev-20/root rd.lvm.lv=rhel_rdma-dev-20/swap console=ttyS1,115200n81

+ [21-11-05 07:10:57] rpm -q rdma-core linux-firmware
rdma-core-37.1-1.el9.x86_64
linux-firmware-20210919-122.el9.noarch

+ [21-11-05 07:10:57] tail /sys/class/infiniband/mlx5_2/fw_ver /sys/class/infiniband/mlx5_3/fw_ver /sys/class/infiniband/mlx5_bond_0/fw_ver
==> /sys/class/infiniband/mlx5_2/fw_ver <==
12.28.2006

==> /sys/class/infiniband/mlx5_3/fw_ver <==
12.28.2006

==> /sys/class/infiniband/mlx5_bond_0/fw_ver <==
14.31.1014



Installed:
  mpitests-openmpi-5.7-4.el9.x86_64          openmpi-4.1.1-4.el9.x86_64         
  openmpi-devel-4.1.1-4.el9.x86_64    

Package ucx-1.11.2-1.el9.x86_64 is already installed.                  <<<=============================================


2. HW tested: MLX5 IB
Servers: rdma-dev-19
Clients: rdma-dev-20

Comment 13 Brian Chae 2022-01-04 14:17:29 UTC
The verification test was conducted as follows:

1. build and packages

DISTRO=RHEL-9.0.0-20220103.2

+ [22-01-04 08:04:53] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.0 Beta (Plow)

+ [22-01-04 08:04:53] uname -a
Linux rdma-dev-20.rdma.lab.eng.rdu2.redhat.com 5.14.0-39.el9.x86_64 #1 SMP PREEMPT Fri Dec 24 00:07:58 EST 2021 x86_64 x86_64 x86_64 GNU/Linux

+ [22-01-04 08:04:53] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-39.el9.x86_64 root=/dev/mapper/rhel_rdma--dev--20-root ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=/dev/mapper/rhel_rdma--dev--20-swap rd.lvm.lv=rhel_rdma-dev-20/root rd.lvm.lv=rhel_rdma-dev-20/swap console=ttyS1,115200n81

+ [22-01-04 08:04:53] rpm -q rdma-core linux-firmware
rdma-core-37.1-1.el9.x86_64
linux-firmware-20211027-123.el9.noarch

Installed:
  ucx-cma-1.11.2-2.el9.x86_64              ucx-ib-1.11.2-2.el9.x86_64          
  ucx-rdmacm-1.11.2-2.el9.x86_64 

Package ucx-1.11.2-2.el9.x86_64 is already installed.

2. all UCP tests passed

      PASS |      0 | ucp context info for a
      PASS |      0 | ucp worker info for a
      PASS |      0 | ucp endpoint config for a
      PASS |      0 | ucp context info for r
      PASS |      0 | ucp worker info for r
      PASS |      0 | ucp endpoint config for r
      PASS |      0 | ucp context info for t
      PASS |      0 | ucp worker info for t
      PASS |      0 | ucp endpoint config for t
      PASS |      0 | ucp context info for m
      PASS |      0 | ucp worker info for m
      PASS |      0 | ucp endpoint config for m
      PASS |      0 | ucp context info for ae
      PASS |      0 | ucp worker info for ae
      PASS |      0 | ucp endpoint config for ae
      PASS |      0 | ucp context info for re
      PASS |      0 | ucp worker info for re
      PASS |      0 | ucp endpoint config for re
      PASS |      0 | ucp context info for te
      PASS |      0 | ucp worker info for te
      PASS |      0 | ucp endpoint config for te
      PASS |      0 | ucp context info for me
      PASS |      0 | ucp worker info for me
      PASS |      0 | ucp endpoint config for me
      PASS |      0 | ucp context info for aw
      PASS |      0 | ucp worker info for aw
      PASS |      0 | ucp endpoint config for aw
      PASS |      0 | ucp context info for rw
      PASS |      0 | ucp worker info for rw
      PASS |      0 | ucp endpoint config for rw
      PASS |      0 | ucp context info for tw
      PASS |      0 | ucp worker info for tw
      PASS |      0 | ucp endpoint config for tw
      PASS |      0 | ucp context info for mw
      PASS |      0 | ucp worker info for mw
      PASS |      0 | ucp endpoint config for mw

Comment 16 errata-xmlrpc 2022-05-17 15:53:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (new packages: RDMA stack), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:3950