Bug 1950272
| Summary: | [RHEL8.4] when qperf tests are run on ALL ROCE devices, segfault core dumps generated in the server host | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Brian Chae <bchae> | |
| Component: | qperf | Assignee: | Nobody <nobody> | |
| Status: | CLOSED WONTFIX | QA Contact: | Brian Chae <bchae> | |
| Severity: | unspecified | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 8.4 | CC: | dledford, gcase, hwkernel-mgr, rdma-dev-team | |
| Target Milestone: | beta | Keywords: | Triaged | |
| Target Release: | --- | Flags: | pm-rhel: mirror+ | |
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 2049647 2050979 | Environment: | ||
| Last Closed: | 2022-11-01 07:29:01 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1903942, 2049647, 2050979 | |||
Description
Brian Chae
2021-04-16 09:37:19 UTC
debug info from core dumps...

[root@rdma-qe-24 coredump]$ gdb core.qperf.0.4010cc350f7a49b49f7d2c6c63e64afb.159750.1618564410000000
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-15.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
[New LWP 159750]
Reading symbols from /usr/bin/qperf...Reading symbols from /usr/lib/debug/usr/bin/qperf-0.4.11-1.el8.x86_64.debug...done.
done.
BFD: reopening /dev/infiniband/uverbs3: Illegal seek
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `qperf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 get_priv (ctx=<optimized out>) at libibverbs/ibverbs.h:87
87 return &get_priv(ctx)->ops;
(gdb) where
#0 get_priv (ctx=<optimized out>) at libibverbs/ibverbs.h:87
#1 get_ops (ctx=<optimized out>) at libibverbs/ibverbs.h:87
#2 __ibv_destroy_qp_1_1 (qp=0x0) at libibverbs/verbs.c:670
#3 0x0000146ffcb826a1 in rdma_destroy_qp (id=0x561c1b7c94f0) at librdmacm/cma.c:1709
#4 0x0000561c1a267c37 in cm_close (dev=0x7fff5c1beee0) at rdma.c:1940
#5 rd_close (dev=dev@entry=0x7fff5c1beee0) at rdma.c:1602
#6 0x0000561c1a267f44 in rd_bi_bw (transport=<optimized out>) at rdma.c:1027
#7 0x0000561c1a2608e3 in server () at qperf.c:1376
warning: (Internal error: pc 0x561c1a25e2e3 in read in CU, but not in symtab.)
warning: (Internal error: pc 0x561c1a25e2e3 in read in CU, but not in symtab.)
warning: (Internal error: pc 0x561c1a25dfb0 in read in CU, but not in symtab.)
warning: (Internal error: pc 0x561c1a25e2e3 in read in CU, but not in symtab.)
warning: (Internal error: pc 0x561c1a25e2e3 in read in CU, but not in symtab.)
warning: (Internal error: pc 0x561c1a25e2e3 in read in CU, but not in symtab.)
#8 0x0000561c1a25e2e4 in main () at qperf.c:758
warning: (Internal error: pc 0x561c1a25e2e3 in read in CU, but not in symtab.)
(gdb)

All of the 8 core dumps point to the same place.

As more tests were performed on this issue, it became evident that this applies to ALL ROCE HCAs, that is, MLX4 ROCE, MLX5 ROCE, and BNXT ROCE. This bugzilla's headline was changed to reflect this.

Additional info: This issue has been observed in RHEL8.4 as well as RHEL8.5 builds on all ROCE HCAs.

Honggang, with this scratch build all tests fail, with the server side rejecting the qperf client, as follows:
==============================================================================================================
Package Architecture Version Repository Size
==============================================================================================================
Upgrading:
qperf x86_64 0.4.11-3.el8 brew-task-repo-qperf-0.4.11-3.el8-scratch 68 k
Transaction Summary
==============================================================================================================
Upgrade 1 Package
Total download size: 68 k
Downloading Packages:
qperf-0.4.11-3.el8.x86_64.rpm 76 kB/s | 68 kB 00:00
--------------------------------------------------------------------------------------------------------------
Total 76 kB/s | 68 kB 00:00
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Upgrading : qperf-0.4.11-3.el8.x86_64 1/2
Cleanup : qperf-0.4.11-2.el8.x86_64 2/2
Running scriptlet: qperf-0.4.11-2.el8.x86_64 2/2
Verifying : qperf-0.4.11-3.el8.x86_64 1/2
Verifying : qperf-0.4.11-2.el8.x86_64 2/2
Installed products updated.
Upgraded:
qperf-0.4.11-3.el8.x86_64
Complete!
SERVER
+ [21-06-20 07:16:34] qperf
rdma_disconnect failed: Invalid argument
rdma_disconnect failed: Invalid argument
rdma_disconnect failed: Invalid argument
rdma_disconnect failed: Invalid argument
rdma_disconnect failed: Invalid argument
rdma_disconnect failed: Invalid argument
rdma_disconnect failed: Invalid argument
rdma_disconnect failed: Invalid argument
rdma_disconnect failed: Invalid argument
rdma_disconnect failed: Invalid argument
rdma_disconnect failed: Invalid argument
rdma_disconnect failed: Invalid argument
CLIENT
+ [21-06-20 07:16:47] for qperf_test in $QPERF_TESTS
+ [21-06-20 07:16:47] qperf -v -i mlx4_1:1 -cm 1 172.31.45.200 rc_rdma_read_lat
server: rdma_disconnect failed: Invalid argument <<<=======================================
+ [21-06-20 07:17:06] cat /tmp/tmp.m6NI1PvSJU/results_qperf.txt
Test results for qperf on rdma-virt-01:
4.18.0-314.el8.x86_64, rdma-core-35.0-1.el8, mlx4, roce.45, & mlx4_1
Result | Status | Test
---------+--------+------------------------------------
PASS | 0 | ping server
PASS | 0 | conf
FAIL | 1 | rc_bi_bw
FAIL | 1 | rc_bw
FAIL | 1 | rc_lat
FAIL | 1 | rc_rdma_read_bw
FAIL | 1 | rc_rdma_read_lat
FAIL | 1 | rc_rdma_write_bw
FAIL | 1 | rc_rdma_write_lat
FAIL | 1 | rc_rdma_write_poll_lat
FAIL | 1 | rc_compare_swap_mr
FAIL | 1 | rc_fetch_add_mr
FAIL | 1 | ver_rc_compare_swap
FAIL | 1 | ver_rc_fetch_add
PASS | 0 | quit
This will cause qperf tests on ALL ROCE devices to fail as above.
Not sure whether there is an issue with the new qperf package, or whether this fix may have contributed to the new issue above.
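
For triage, the pass/fail table printed by the client can be tallied mechanically. The following is a minimal sketch, assuming the `Result | Status | Test` layout shown in this comment; `parse_results` and `failed_tests` are hypothetical helper names and are not part of qperf or of the test harness that produced the log.

```python
import re

# Rows copied from the results table in this comment (mlx4_1, roce.45).
RESULTS = """\
Result | Status | Test
---------+--------+------------------------------------
PASS | 0 | ping server
PASS | 0 | conf
FAIL | 1 | rc_bi_bw
FAIL | 1 | rc_bw
FAIL | 1 | rc_lat
FAIL | 1 | rc_rdma_read_bw
FAIL | 1 | rc_rdma_read_lat
FAIL | 1 | rc_rdma_write_bw
FAIL | 1 | rc_rdma_write_lat
FAIL | 1 | rc_rdma_write_poll_lat
FAIL | 1 | rc_compare_swap_mr
FAIL | 1 | rc_fetch_add_mr
FAIL | 1 | ver_rc_compare_swap
FAIL | 1 | ver_rc_fetch_add
PASS | 0 | quit
"""

# A data row is "<PASS|FAIL> | <exit status> | <test name>".
# The header and the dashed separator do not match and are skipped.
ROW = re.compile(r"^\s*(PASS|FAIL)\s*\|\s*(\d+)\s*\|\s*(.+?)\s*$")


def parse_results(text):
    """Return a list of (result, status, test_name) tuples."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append((match.group(1), int(match.group(2)), match.group(3)))
    return rows


def failed_tests(rows):
    """Return the names of all rows whose result is FAIL."""
    return [name for result, _status, name in rows if result == "FAIL"]


rows = parse_results(RESULTS)
failures = failed_tests(rows)
print(f"{len(failures)} of {len(rows)} steps failed")
for name in failures:
    print(f"  {name}")
```

On the table above, every RDMA test step fails while the non-RDMA steps (`ping server`, `conf`, `quit`) pass. The number of failed steps also matches the number of `rdma_disconnect failed: Invalid argument` lines in the server log.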
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.