Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1950272

Summary: [RHEL8.4] when qperf tests are run on ALL ROCE devices, segfault core dumps generated in the server host
Product: Red Hat Enterprise Linux 8
Reporter: Brian Chae <bchae>
Component: qperf
Assignee: Nobody <nobody>
Status: CLOSED WONTFIX
QA Contact: Brian Chae <bchae>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 8.4
CC: dledford, gcase, hwkernel-mgr, rdma-dev-team
Target Milestone: beta
Keywords: Triaged
Target Release: ---
Flags: pm-rhel: mirror+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 2049647, 2050979 (view as bug list)
Environment:
Last Closed: 2022-11-01 07:29:01 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1903942, 2049647, 2050979

Description Brian Chae 2021-04-16 09:37:19 UTC
Description of problem:

After running "qperf" tests on the BXNT ROCE device, "segfault" core dumps are generated on the server host machine only (NOT on the client).


Version-Release number of selected component (if applicable):


DISTRO=RHEL-8.4.0-20210409.0

+ [21-04-15 11:36:55] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)

+ [21-04-15 11:36:55] uname -a
Linux rdma-qe-24.lab.bos.redhat.com 4.18.0-304.el8.x86_64 #1 SMP Tue Apr 6 05:19:59 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

+ [21-04-15 11:36:55] cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-304.el8.x86_64 root=UUID=6688985c-7e30-4ed4-9cbe-02f0fd718035 ro crashkernel=auto resume=UUID=a94dfd2d-37b1-4c4f-8257-b7862835c0b7 console=ttyS0,115200n81

+ [21-04-15 11:36:55] rpm -q rdma-core linux-firmware
rdma-core-32.0-4.el8.x86_64

linux-firmware-20201218-102.git05789708.el8.noarch

+ [21-04-15 11:36:55] tail /sys/class/infiniband/bnxt_re0/fw_ver /sys/class/infiniband/bnxt_re1/fw_ver /sys/class/infiniband/bnxt_re2/fw_ver /sys/class/infiniband/bnxt_re3/fw_ver
==> /sys/class/infiniband/bnxt_re0/fw_ver <==
20.8.30.0

==> /sys/class/infiniband/bnxt_re1/fw_ver <==
20.8.30.0

==> /sys/class/infiniband/bnxt_re2/fw_ver <==
216.0.51.0

==> /sys/class/infiniband/bnxt_re3/fw_ver <==
216.0.51.0

+ [21-04-15 11:36:55] lspci
+ [21-04-15 11:36:55] grep -i -e ethernet -e infiniband -e omni -e ConnectX
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
1a:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)
1a:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)
5e:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
5e:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)

+ [21-04-15 11:36:57] rpm -q qperf
qperf-0.4.11-1.el8.x86_64

RDMA HOSTS tested on:


Clients: rdma-qe-25
Servers: rdma-qe-24




How reproducible:

100%

Steps to Reproduce:
1. Bring up both server and client hosts with the above build
2. on the server host, initiate qperf test with issuing the following command
qperf

3. on the client host, run the following qperf tests

qperf -v -i bnxt_re3:1 -cm 1 172.31.45.24 conf
qperf -v -i bnxt_re3:1 -cm 1 172.31.45.24 rc_bi_bw
qperf -v -i bnxt_re3:1 -cm 1 172.31.45.24 rc_bw
qperf -v -i bnxt_re3:1 -cm 1 172.31.45.24 rc_lat
qperf -v -i bnxt_re3:1 -cm 1 172.31.45.24 rc_rdma_read_bw
qperf -v -i bnxt_re3:1 -cm 1 172.31.45.24 rc_rdma_read_lat
qperf -v -i bnxt_re3:1 -cm 1 172.31.45.24 rc_rdma_write_bw
qperf -v -i bnxt_re3:1 -cm 1 172.31.45.24 rc_rdma_write_lat
qperf -v -i bnxt_re3:1 -cm 1 172.31.45.24 rc_rdma_write_poll_lat
qperf -v -i bnxt_re3:1 -cm 1 172.31.45.24 quit




Actual results:

ALL tests passed successfully, but core dumps and "segfault" messages were found on the server after the tests, as follows:
[root@rdma-qe-24 qperf]$ ls -lrt /var/lib/systemd/coredump/
total 1188
-rw-r-----. 1 root root 190852 Apr 16 05:13 core.qperf.0.4010cc350f7a49b49f7d2c6c63e64afb.159750.1618564410000000.lz4
-rw-r-----. 1 root root 152357 Apr 16 05:13 core.qperf.0.4010cc350f7a49b49f7d2c6c63e64afb.159761.1618564413000000.lz4
-rw-r-----. 1 root root 126924 Apr 16 05:13 core.qperf.0.4010cc350f7a49b49f7d2c6c63e64afb.159775.1618564415000000.lz4
-rw-r-----. 1 root root 125671 Apr 16 05:13 core.qperf.0.4010cc350f7a49b49f7d2c6c63e64afb.159789.1618564417000000.lz4
-rw-r-----. 1 root root 125666 Apr 16 05:13 core.qperf.0.4010cc350f7a49b49f7d2c6c63e64afb.159802.1618564420000000.lz4
-rw-r-----. 1 root root 152380 Apr 16 05:13 core.qperf.0.4010cc350f7a49b49f7d2c6c63e64afb.159812.1618564422000000.lz4
-rw-r-----. 1 root root 127966 Apr 16 05:13 core.qperf.0.4010cc350f7a49b49f7d2c6c63e64afb.159825.1618564424000000.lz4
-rw-r-----. 1 root root 164217 Apr 16 05:13 core.qperf.0.4010cc350f7a49b49f7d2c6c63e64afb.159835.1618564427000000.lz4


/proc/kmsg:<6>[47873.093880] qperf[159750]: segfault at 0 ip 0000146ffcdac7f4 sp 00007fff5c1beea8 error 4 in libibverbs.so.1.11.32.0[146ffcd95000+1e000]
/proc/kmsg:<6>[47875.439271] qperf[159761]: segfault at 0 ip 0000146ffcdac7f4 sp 00007fff5c1beeb8 error 4 in libibverbs.so.1.11.32.0[146ffcd95000+1e000]
/proc/kmsg:<6>[47877.769555] qperf[159775]: segfault at 0 ip 0000146ffcdac7f4 sp 00007fff5c1cae48 error 4 in libibverbs.so.1.11.32.0[146ffcd95000+1e000]
/proc/kmsg:<6>[47880.115045] qperf[159789]: segfault at 0 ip 0000146ffcdac7f4 sp 00007fff5c1caed8 error 4 in libibverbs.so.1.11.32.0[146ffcd95000+1e000]
/proc/kmsg:<6>[47882.439115] qperf[159802]: segfault at 0 ip 0000146ffcdac7f4 sp 00007fff5c1caed8 error 4 in libibverbs.so.1.11.32.0[146ffcd95000+1e000]
/proc/kmsg:<6>[47884.775459] qperf[159812]: segfault at 0 ip 0000146ffcdac7f4 sp 00007fff5c1beeb8 error 4 in libibverbs.so.1.11.32.0[146ffcd95000+1e000]
/proc/kmsg:<6>[47887.100425] qperf[159825]: segfault at 0 ip 0000146ffcdac7f4 sp 00007fff5c1cae48 error 4 in libibverbs.so.1.11.32.0[146ffcd95000+1e000]
/proc/kmsg:<6>[47889.426829] qperf[159835]: segfault at 0 ip 0000146ffcdac7f4 sp 00007fff5c1cae48 error 4 in libibverbs.so.1.11.32.0[146ffcd95000+1e000]





Expected results:

NO CORE DUMPS and NO segfaults with any of the qperf tests


Additional info:

Comment 1 Brian Chae 2021-04-16 10:18:22 UTC
debug info from core dumps...
[root@rdma-qe-24 coredump]$ gdb core.qperf.0.4010cc350f7a49b49f7d2c6c63e64afb.159750.1618564410000000
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-15.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
[New LWP 159750]
Reading symbols from /usr/bin/qperf...Reading symbols from /usr/lib/debug/usr/bin/qperf-0.4.11-1.el8.x86_64.debug...done.
done.
BFD: reopening /dev/infiniband/uverbs3: Illegal seek

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `qperf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  get_priv (ctx=<optimized out>) at libibverbs/ibverbs.h:87
87              return &get_priv(ctx)->ops;
(gdb) where
#0  get_priv (ctx=<optimized out>) at libibverbs/ibverbs.h:87
#1  get_ops (ctx=<optimized out>) at libibverbs/ibverbs.h:87
#2  __ibv_destroy_qp_1_1 (qp=0x0) at libibverbs/verbs.c:670
#3  0x0000146ffcb826a1 in rdma_destroy_qp (id=0x561c1b7c94f0) at librdmacm/cma.c:1709
#4  0x0000561c1a267c37 in cm_close (dev=0x7fff5c1beee0) at rdma.c:1940
#5  rd_close (dev=dev@entry=0x7fff5c1beee0) at rdma.c:1602
#6  0x0000561c1a267f44 in rd_bi_bw (transport=<optimized out>) at rdma.c:1027
#7  0x0000561c1a2608e3 in server () at qperf.c:1376
warning: (Internal error: pc 0x561c1a25e2e3 in read in CU, but not in symtab.)
warning: (Internal error: pc 0x561c1a25e2e3 in read in CU, but not in symtab.)
warning: (Internal error: pc 0x561c1a25dfb0 in read in CU, but not in symtab.)
warning: (Internal error: pc 0x561c1a25e2e3 in read in CU, but not in symtab.)
warning: (Internal error: pc 0x561c1a25e2e3 in read in CU, but not in symtab.)
warning: (Internal error: pc 0x561c1a25e2e3 in read in CU, but not in symtab.)
#8  0x0000561c1a25e2e4 in main () at qperf.c:758
warning: (Internal error: pc 0x561c1a25e2e3 in read in CU, but not in symtab.)
(gdb) 

All of the 8 core dumps point to the same place.

Comment 7 Brian Chae 2021-05-31 11:34:27 UTC
As more tests were performed on this issue, it became evident that it applies to ALL ROCE HCAs - that is, MLX4 ROCE, MLX5 ROCE, and BNXT ROCE.
This bugzilla's summary was changed to reflect that.

Comment 8 Brian Chae 2021-05-31 11:37:40 UTC
Additional info: This issue has been observed in RHEL8.4 as well as RHEL8.5 builds on all ROCE HCAs.

Comment 10 Brian Chae 2021-06-20 15:18:57 UTC
Honggang, this scratch build fails all tests, with the server side rejecting the qperf client as follows:


==============================================================================================================
 Package       Architecture   Version                 Repository                                         Size
==============================================================================================================
Upgrading:
 qperf         x86_64         0.4.11-3.el8            brew-task-repo-qperf-0.4.11-3.el8-scratch          68 k

Transaction Summary
==============================================================================================================
Upgrade  1 Package

Total download size: 68 k
Downloading Packages:
qperf-0.4.11-3.el8.x86_64.rpm                                                  76 kB/s |  68 kB     00:00
--------------------------------------------------------------------------------------------------------------
Total                                                                          76 kB/s |  68 kB     00:00
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                                      1/1
  Upgrading        : qperf-0.4.11-3.el8.x86_64                                                            1/2
  Cleanup          : qperf-0.4.11-2.el8.x86_64                                                            2/2
  Running scriptlet: qperf-0.4.11-2.el8.x86_64                                                            2/2
  Verifying        : qperf-0.4.11-3.el8.x86_64                                                            1/2
  Verifying        : qperf-0.4.11-2.el8.x86_64                                                            2/2
Installed products updated.

Upgraded:
  qperf-0.4.11-3.el8.x86_64

Complete!


SERVER

+ [21-06-20 07:16:34] qperf
rdma_disconnect failed: Invalid argument
rdma_disconnect failed: Invalid argument
rdma_disconnect failed: Invalid argument
rdma_disconnect failed: Invalid argument
rdma_disconnect failed: Invalid argument
rdma_disconnect failed: Invalid argument
rdma_disconnect failed: Invalid argument
rdma_disconnect failed: Invalid argument
rdma_disconnect failed: Invalid argument
rdma_disconnect failed: Invalid argument
rdma_disconnect failed: Invalid argument
rdma_disconnect failed: Invalid argument

CLIENT 

+ [21-06-20 07:16:47] for qperf_test in $QPERF_TESTS
+ [21-06-20 07:16:47] qperf -v -i mlx4_1:1 -cm 1 172.31.45.200 rc_rdma_read_lat
server: rdma_disconnect failed: Invalid argument <<<=======================================

+ [21-06-20 07:17:06] cat /tmp/tmp.m6NI1PvSJU/results_qperf.txt
Test results for qperf on rdma-virt-01:
4.18.0-314.el8.x86_64, rdma-core-35.0-1.el8, mlx4, roce.45, & mlx4_1
    Result | Status | Test
  ---------+--------+------------------------------------
      PASS |      0 | ping server
      PASS |      0 | conf
      FAIL |      1 | rc_bi_bw
      FAIL |      1 | rc_bw
      FAIL |      1 | rc_lat
      FAIL |      1 | rc_rdma_read_bw
      FAIL |      1 | rc_rdma_read_lat
      FAIL |      1 | rc_rdma_write_bw
      FAIL |      1 | rc_rdma_write_lat
      FAIL |      1 | rc_rdma_write_poll_lat
      FAIL |      1 | rc_compare_swap_mr
      FAIL |      1 | rc_fetch_add_mr
      FAIL |      1 | ver_rc_compare_swap
      FAIL |      1 | ver_rc_fetch_add
      PASS |      0 | quit

This causes the qperf tests on ALL ROCE devices to fail as above.
It is not clear whether there is an issue with the new qperf package or whether this fix introduced the new failure.

Comment 17 RHEL Program Management 2022-11-01 07:29:01 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release; therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, the bug can be reopened.