
Bug 1948337

Summary: [RHEL9.0] All ucx_perftests fail with segmentation fault when tested on MLX5 IB and ROCE devices
Product: Red Hat Enterprise Linux 9
Reporter: Brian Chae <bchae>
Component: ucx
Assignee: Jonathan Toppins <jtoppins>
Status: CLOSED ERRATA
QA Contact: Brian Chae <bchae>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 9.0
CC: mschmidt, rdma-dev-team
Target Milestone: beta
Keywords: Triaged
Target Release: 9.0
Flags: pm-rhel: mirror+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ucx-1.11.2-1.el9
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-17 15:53:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Brian Chae 2021-04-12 00:20:34 UTC
Description of problem:
All of the ucx_perftest tests fail with a segmentation fault, as shown below:

      FAIL |    139 | ucx_perftest am_lat
      FAIL |    139 | ucx_perftest put_lat
      FAIL |    139 | ucx_perftest add_lat
      FAIL |    139 | ucx_perftest fadd
      FAIL |    139 | ucx_perftest cswap
      FAIL |    139 | ucx_perftest am_bw
      FAIL |    139 | ucx_perftest put_bw
      FAIL |    139 | ucx_perftest add_mr
      FAIL |    139 | ucx_perftest tag_lat
      FAIL |    139 | ucx_perftest tag_bw
      FAIL |    139 | ucx_perftest ucp_put_lat
      FAIL |    139 | ucx_perftest ucp_put_bw
      FAIL |    139 | ucx_perftest ucp_get

[rdma-virt-03:105091:0:105091] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 105091) ====
 0  /lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x7fce7e3615c4]
 1  /lib64/libucs.so.0(+0x2916d) [0x7fce7e36416d]
 2  /lib64/libucs.so.0(+0x2934a) [0x7fce7e36434a]
 3  /lib64/libpthread.so.0(+0x13a00) [0x7fce7e32da00]
 4  ucx_perftest(+0x6114) [0x5573b9097114]
 5  ucx_perftest(+0xc8d1) [0x5573b909d8d1]
 6  ucx_perftest(+0x59ed) [0x5573b90969ed]
 7  /lib64/libc.so.6(__libc_start_main+0xd5) [0x7fce7e110b75]
 8  ucx_perftest(+0x603e) [0x5573b909703e]
=================================
timeout: the monitored command dumped core
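
The 139 in the second column of the table above is consistent with the usual shell convention of reporting death by signal as 128 plus the signal number (139 = 128 + 11, i.e. SIGSEGV, matching "Caught signal 11"). A minimal sketch of that decoding follows. The function name decode_status is hypothetical and not part of ucx or the test harness, and the signal numbers assume Linux on x86_64.

```shell
#!/bin/sh
# decode_status: hypothetical helper, not part of ucx or the test harness.
# Interprets an exit status using the 128+N convention for signal deaths.
# Signal numbers below assume Linux on x86_64.
decode_status() {
    status=$1
    if [ "$status" -gt 128 ] && [ "$status" -lt 193 ]; then
        sig=$((status - 128))
        case $sig in
            6)  name=SIGABRT ;;
            9)  name=SIGKILL ;;
            11) name=SIGSEGV ;;
            15) name=SIGTERM ;;
            *)  name=unknown ;;
        esac
        echo "killed by signal $sig ($name)"
    elif [ "$status" -eq 0 ]; then
        echo "success"
    else
        echo "exited with status $status"
    fi
}
```

For example, `decode_status 139` prints `killed by signal 11 (SIGSEGV)`.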



Version-Release number of selected component (if applicable):


DISTRO=RHEL-9.0.0-20210330.8

+ [21-04-11 18:06:38] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.0 Beta (Plow)

+ [21-04-11 18:06:38] uname -a
Linux rdma-virt-03.lab.bos.redhat.com 5.11.0-2.el9.x86_64 #1 SMP Wed Mar 10 14:55:23 EST 2021 x86_64 x86_64 x86_64 GNU/Linux

+ [21-04-11 18:06:38] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.11.0-2.el9.x86_64 root=UUID=95261748-5608-45f0-8d17-51b97e1a6d1f ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH resume=UUID=49bd0668-6d8a-4ad9-bfe9-f3e96bf403ce console=ttyS1,115200n81

+ [21-04-11 18:06:38] rpm -q rdma-core linux-firmware
rdma-core-34.0-2.el9.x86_64
linux-firmware-20210208-118.el9.noarch

+ [21-04-11 18:06:38] tail /sys/class/infiniband/mlx5_0/fw_ver /sys/class/infiniband/mlx5_1/fw_ver /sys/class/infiniband/mlx5_bond_0/fw_ver
==> /sys/class/infiniband/mlx5_0/fw_ver <==
12.25.1020

==> /sys/class/infiniband/mlx5_1/fw_ver <==
12.25.1020

==> /sys/class/infiniband/mlx5_bond_0/fw_ver <==
14.27.1016
+ [21-04-11 18:06:38] lspci
+ [21-04-11 18:06:38] grep -i -e ethernet -e infiniband -e omni -e ConnectX
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
03:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
03:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
04:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
04:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
05:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
05:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]

Tested RDMA hosts:

Clients: rdma-virt-03
Servers: rdma-virt-02

How reproducible:

100%


Steps to Reproduce:

server side of link
===================

8: mlx5_ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:11:0f:fe:80:00:00:00:00:00:00:e4:1d:2d:03:00:e7:0f:f6 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    altname ibp4s0f0
    inet 172.31.0.202/24 brd 172.31.0.255 scope global dynamic noprefixroute mlx5_ib0
       valid_lft 2384sec preferred_lft 2384sec
    inet6 fe80::e61d:2d03:e7:ff6/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever


1. Bring up the RDMA hosts with the above build.
2. On the server host, issue the following command:

timeout --preserve-status --kill-after=5m 3m ucx_perftest -d mlx5_0:1 -t am_lat -x rc -c 1


3. On the client host, issue the following command:

timeout --preserve-status --kill-after=5m 3m ucx_perftest -d mlx5_0:1 -t am_lat -x rc -c 1 172.31.0.202
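
The server and client commands above differ only in the trailing server address, so both can be produced by one small helper. This is a sketch only: perftest_cmd is a hypothetical name, and the options are copied from the am_lat commands in this report. Other tests in the suite may need different options, which this report does not show.

```shell
#!/bin/sh
# perftest_cmd: hypothetical helper that prints the command line used in
# the reproduction steps. It only prints the command; it does not run it.
# Usage: perftest_cmd TEST [SERVER_IP]
#   without SERVER_IP -> server-side command
#   with SERVER_IP    -> client-side command
perftest_cmd() {
    test_name=$1
    server_ip=${2:-}
    # Options taken verbatim from the am_lat commands in this report.
    cmd="timeout --preserve-status --kill-after=5m 3m ucx_perftest -d mlx5_0:1 -t $test_name -x rc -c 1"
    if [ -n "$server_ip" ]; then
        cmd="$cmd $server_ip"
    fi
    echo "$cmd"
}
```

Running `perftest_cmd am_lat` prints the server-side command from step 2, and `perftest_cmd am_lat 172.31.0.202` prints the client-side command from step 3.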



Actual results:

Both the server and client hosts produce the following output from the ucx_perftest commands above:

[rdma-virt-03:105091:0:105091] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 105091) ====
 0  /lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x7fce7e3615c4]
 1  /lib64/libucs.so.0(+0x2916d) [0x7fce7e36416d]
 2  /lib64/libucs.so.0(+0x2934a) [0x7fce7e36434a]
 3  /lib64/libpthread.so.0(+0x13a00) [0x7fce7e32da00]
 4  ucx_perftest(+0x6114) [0x5573b9097114]
 5  ucx_perftest(+0xc8d1) [0x5573b909d8d1]
 6  ucx_perftest(+0x59ed) [0x5573b90969ed]
 7  /lib64/libc.so.6(__libc_start_main+0xd5) [0x7fce7e110b75]
 8  ucx_perftest(+0x603e) [0x5573b909703e]
=================================
timeout: the monitored command dumped core
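
Most frames in the backtrace are unsymbolized and appear only as a binary name plus an offset, such as ucx_perftest(+0x6114). The sketch below turns those frames into addr2line invocations (addr2line is part of binutils; -f prints function names, -e selects the binary). The helper name frames_to_addr2line is hypothetical. It assumes matching debuginfo is installed for useful output, and a bare binary name such as ucx_perftest would need to be replaced with its full path before running the printed commands.

```shell
#!/bin/sh
# frames_to_addr2line: hypothetical helper.
# Reads backtrace lines on stdin and, for each frame of the form
#   N  BINARY(+0xOFFSET) [0xADDRESS]
# prints an addr2line command for that binary and offset.
# Frames that already carry a symbol name (e.g. ucs_handle_error+0x2a4)
# are skipped. The commands are printed, not executed.
frames_to_addr2line() {
    sed -n 's/^ *[0-9][0-9]*  *\([^ (]*\)(+\(0x[0-9a-f]*\)).*/addr2line -f -e \1 \2/p'
}
```

Fed the frame ` 4  ucx_perftest(+0x6114) [0x5573b9097114]`, it prints `addr2line -f -e ucx_perftest 0x6114`.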





Expected results:

Normal performance stats


Additional info:

Also tested on rdma-qe-06 (server) / rdma-qe-07 (client), with exactly the same segmentation fault as shown above.

Comment 1 Jonathan Toppins 2021-05-25 17:20:50 UTC
Can you try this scratch build?

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=36974513

Comment 2 Brian Chae 2021-06-30 19:07:11 UTC
(In reply to Jonathan Toppins from comment #1)
> Can you try this scratch build?
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=36974513

Jon, I just saw this request...
I will get it tested and will post the result as soon as possible.

-Brian

Comment 3 Brian Chae 2021-07-01 09:22:37 UTC
(In reply to Brian Chae from comment #2)
> (In reply to Jonathan Toppins from comment #1)
> > Can you try this scratch build?
> > 
> > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=36974513
> 
> Jon, I just have saw this request...
> I will get it tested and will post the result as soon as possible.
> 
> -Brian

Jon, could you create this scratch build one more time? I think the specified build no longer exists.

-Brian

Comment 4 Michal Schmidt 2021-08-27 13:22:31 UTC
ucx was updated in bug 1858571 (currently ON_QA).
Does the testing for bug 1858571 cover the ucx_perftests that this BZ is about? If it works, maybe this can be closed as a duplicate of the ucx update bug?

Comment 6 Brian Chae 2021-11-08 11:57:59 UTC
Setting this bugzilla back to Assigned.

Comment 9 Brian Chae 2021-11-17 20:46:52 UTC
Honggang, as of the RHEL-9.0.0-20211116.6 build, the installed ucx package is still "ucx-1.10.1-3.el9.x86_64". Since this bugzilla is in the ON_QA state, I would expect the version to be "ucx-1.11.2-1.el9".
Am I missing something? Or should this bugzilla still be in the Assigned state?

-Brian

Comment 10 Honggang LI 2021-11-18 05:42:33 UTC
(In reply to Brian Chae from comment #9)
> Honggang, as of RHEL-9.0.0-20211116.6 build, the ucx still has Package
> "ucx-1.10.1-3.el9.x86_64". Since this bugzilla is in ON_QA state, I would
> expect the version to be "ucx-1.11.2-1.el9".
> Am I missing something? Or Should this be bugzilla still be in Assigned
> state?

http://download.eng.bos.redhat.com/rhel-9/composes/RHEL-9/RHEL-9.0.0-20211117.d.7/compose/AppStream/x86_64/os/Packages/ucx-1.11.2-1.el9.x86_64.rpm

It is available in RHEL-9.0.0-20211117.d.7 .
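
A quick way to check whether a host already carries the fixed package is to compare the installed version-release against the "Fixed In Version" value. The sketch below uses GNU `sort -V` as an approximation of RPM version ordering. That is an assumption which holds for the simple version strings in this report but is not a full implementation of RPM's comparison rules. The helper name version_at_least is hypothetical.

```shell
#!/bin/sh
# version_at_least: hypothetical helper.
# Succeeds (exit 0) if version $1 sorts at or after version $2 under
# GNU `sort -V` ordering, which approximates RPM version comparison
# for simple strings such as 1.11.2-1.el9.
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}
```

On a test host this could be used as `version_at_least "$(rpm -q --qf '%{VERSION}-%{RELEASE}' ucx)" 1.11.2-1.el9 && echo fixed`. With the versions mentioned in this thread, 1.10.1-3.el9 fails the check and 1.11.2-1.el9 passes it.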

Comment 11 Brian Chae 2021-11-30 15:11:03 UTC
Verification was done as follows:

1. build and packages


Clients: rdma-dev-22
Servers: rdma-dev-21

DISTRO=RHEL-9.0.0-20211129.2

+ [21-11-30 08:28:49] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.0 Beta (Plow)

+ [21-11-30 08:28:49] uname -a
Linux rdma-dev-22.rdma.lab.eng.rdu2.redhat.com 5.14.0-21.el9.x86_64 #1 SMP Thu Nov 25 21:41:11 EST 2021 x86_64 x86_64 x86_64 GNU/Linux

+ [21-11-30 08:28:49] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-21.el9.x86_64 root=/dev/mapper/rhel_rdma--dev--22-root ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=/dev/mapper/rhel_rdma--dev--22-swap rd.lvm.lv=rhel_rdma-dev-22/root rd.lvm.lv=rhel_rdma-dev-22/swap console=ttyS1,115200n81

+ [21-11-30 08:28:49] rpm -q rdma-core linux-firmware
rdma-core-37.1-1.el9.x86_64
linux-firmware-20211027-123.el9.noarch


Package ucx-1.11.2-2.el9.x86_64 is already installed.



2. Tested HW

MLX5 IB : rdma-dev-19/20 & rdma-dev-21/22 pairs
MLX5 ROCE : rdma-dev-19/20 & rdma-dev-21/22 pairs

3. Results

      FAIL |    255 | ucx_perftest am_lat
      FAIL |    255 | ucx_perftest am_bw
      PASS |      0 | ucx_perftest tag_lat
      PASS |      0 | ucx_perftest tag_bw
      PASS |      0 | ucx_perftest ucp_put_lat
      PASS |      0 | ucx_perftest ucp_put_bw
      FAIL |    143 | ucx_perftest ucp_get

These failures will be investigated, and separate bug reports will be filed as necessary.
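
For triage, a results table in the "STATUS | exit code | test" layout used above can be summarized with a short awk filter. The sketch below is illustrative only: summarize_results is a hypothetical name, and it assumes the three-column, pipe-separated layout shown in this comment. Note that none of the remaining failures reports 139. If the same timeout wrapper from the reproduction steps was used, 143 would be consistent with SIGTERM (128 + 15), for example from the 3-minute timeout, while 255 is an ordinary non-zero exit status.

```shell
#!/bin/sh
# summarize_results: hypothetical helper.
# Reads rows of the form "STATUS | EXIT_CODE | TEST" on stdin, prints one
# line per failing test with its exit status, then a pass/fail total.
summarize_results() {
    awk -F'|' '
        NF >= 3 {
            # Trim surrounding whitespace from the three columns.
            for (i = 1; i <= 3; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            if ($1 == "PASS") pass++
            if ($1 == "FAIL") { fail++; print "FAIL " $3 " (status " $2 ")" }
        }
        END { printf "pass=%d fail=%d\n", pass, fail }
    '
}
```

Applied to the seven rows in this comment, the final line of output is `pass=4 fail=3`.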

Comment 14 errata-xmlrpc 2022-05-17 15:53:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (new packages: RDMA stack), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:3950