Bug 1948337
| Summary: | [RHEL9.0] All ucx_perftests fail with segmentation fault when tested on MLX5 IB and ROCE devices | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Brian Chae <bchae> |
| Component: | ucx | Assignee: | Jonathan Toppins <jtoppins> |
| Status: | CLOSED ERRATA | QA Contact: | Brian Chae <bchae> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 9.0 | CC: | mschmidt, rdma-dev-team |
| Target Milestone: | beta | Keywords: | Triaged |
| Target Release: | 9.0 | Flags: | pm-rhel: mirror+ |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | ucx-1.11.2-1.el9 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-05-17 15:53:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Can you try this scratch build?
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=36974513

(In reply to Jonathan Toppins from comment #1)
> Can you try this scratch build?
>
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=36974513

Jon, I just saw this request...
I will get it tested and will post the result as soon as possible.

-Brian

(In reply to Brian Chae from comment #2)
> (In reply to Jonathan Toppins from comment #1)
> > Can you try this scratch build?
> >
> > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=36974513
>
> Jon, I just saw this request...
> I will get it tested and will post the result as soon as possible.
>
> -Brian

Jon, could you create this scratch build one more time? I think the specified build no longer exists.

-Brian

ucx was updated in bug 1858571 (currently ON_QA). Does the testing for bug 1858571 cover the ucx_perftests that this BZ is about? If it works, maybe this can be closed as a duplicate of the ucx update bug?

Setting this bugzilla back to Assigned.

Honggang, as of the RHEL-9.0.0-20211116.6 build, ucx is still at package "ucx-1.10.1-3.el9.x86_64". Since this bugzilla is in ON_QA state, I would expect the version to be "ucx-1.11.2-1.el9".
Am I missing something? Or should this bugzilla still be in Assigned state?

-Brian

(In reply to Brian Chae from comment #9)
> Honggang, as of the RHEL-9.0.0-20211116.6 build, ucx is still at package
> "ucx-1.10.1-3.el9.x86_64". Since this bugzilla is in ON_QA state, I would
> expect the version to be "ucx-1.11.2-1.el9".
> Am I missing something? Or should this bugzilla still be in Assigned
> state?

http://download.eng.bos.redhat.com/rhel-9/composes/RHEL-9/RHEL-9.0.0-20211117.d.7/compose/AppStream/x86_64/os/Packages/ucx-1.11.2-1.el9.x86_64.rpm

It is available in RHEL-9.0.0-20211117.d.7.

The verification was done as follows:
1. build and packages
Clients: rdma-dev-22
Servers: rdma-dev-21
DISTRO=RHEL-9.0.0-20211129.2
+ [21-11-30 08:28:49] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.0 Beta (Plow)
+ [21-11-30 08:28:49] uname -a
Linux rdma-dev-22.rdma.lab.eng.rdu2.redhat.com 5.14.0-21.el9.x86_64 #1 SMP Thu Nov 25 21:41:11 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
+ [21-11-30 08:28:49] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-21.el9.x86_64 root=/dev/mapper/rhel_rdma--dev--22-root ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=/dev/mapper/rhel_rdma--dev--22-swap rd.lvm.lv=rhel_rdma-dev-22/root rd.lvm.lv=rhel_rdma-dev-22/swap console=ttyS1,115200n81
+ [21-11-30 08:28:49] rpm -q rdma-core linux-firmware
rdma-core-37.1-1.el9.x86_64
linux-firmware-20211027-123.el9.noarch
Package ucx-1.11.2-2.el9.x86_64 is already installed.
2. Tested HW
MLX5 IB : rdma-dev-19/20 & rdma-dev-21/22 pairs
MLX5 ROCE : rdma-dev-19/20 & rdma-dev-21/22 pairs
3. Results
FAIL | 255 | ucx_perftest am_lat
FAIL | 255 | ucx_perftest am_bw
PASS | 0 | ucx_perftest tag_lat
PASS | 0 | ucx_perftest tag_bw
PASS | 0 | ucx_perftest ucp_put_lat
PASS | 0 | ucx_perftest ucp_put_bw
FAIL | 143 | ucx_perftest ucp_get
These failures will be investigated, and separate bug reports will be filed as necessary.
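For readers triaging the status column in these result tables, here is a minimal sketch of how the numeric codes can be decoded. It assumes the test harness reports the plain shell exit status, which this report does not state. The helper name `describe_exit` is hypothetical. It relies on the common shell convention that a status of 128 + N means the process was terminated by signal N.

```python
import signal


def describe_exit(status: int) -> str:
    """Describe a shell-style exit status in the range 0-255.

    Assumption: the harness reports the raw shell exit status, so
    values above 128 are read as "terminated by signal (status - 128)".
    """
    if status == 0:
        return "success"
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # No such signal number, so treat it as an ordinary exit code.
            return f"exit code {status}"
        return f"killed by {name} (signal {signum})"
    return f"exit code {status}"


# Status codes that appear in this bug report.
for code in (139, 143, 255):
    print(code, "->", describe_exit(code))
```

Under that assumption, 139 matches the "Caught signal 11" segmentation fault in the original description. 143 would be consistent with the test being terminated by SIGTERM, possibly from the 3-minute `timeout` wrapper, since `--preserve-status` passes the command's own status through. 255 is a generic failure code and needs the test log to interpret.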
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (new packages: RDMA stack), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2022:3950
Description of problem:

All of the ucx perftests fail with a segmentation fault, as shown below:

FAIL | 139 | ucx_perftest am_lat
FAIL | 139 | ucx_perftest put_lat
FAIL | 139 | ucx_perftest add_lat
FAIL | 139 | ucx_perftest fadd
FAIL | 139 | ucx_perftest cswap
FAIL | 139 | ucx_perftest am_bw
FAIL | 139 | ucx_perftest put_bw
FAIL | 139 | ucx_perftest add_mr
FAIL | 139 | ucx_perftest tag_lat
FAIL | 139 | ucx_perftest tag_bw
FAIL | 139 | ucx_perftest ucp_put_lat
FAIL | 139 | ucx_perftest ucp_put_bw
FAIL | 139 | ucx_perftest ucp_get

[rdma-virt-03:105091:0:105091] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 105091) ====
0 /lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x7fce7e3615c4]
1 /lib64/libucs.so.0(+0x2916d) [0x7fce7e36416d]
2 /lib64/libucs.so.0(+0x2934a) [0x7fce7e36434a]
3 /lib64/libpthread.so.0(+0x13a00) [0x7fce7e32da00]
4 ucx_perftest(+0x6114) [0x5573b9097114]
5 ucx_perftest(+0xc8d1) [0x5573b909d8d1]
6 ucx_perftest(+0x59ed) [0x5573b90969ed]
7 /lib64/libc.so.6(__libc_start_main+0xd5) [0x7fce7e110b75]
8 ucx_perftest(+0x603e) [0x5573b909703e]
=================================
timeout: the monitored command dumped core

Version-Release number of selected component (if applicable):

DISTRO=RHEL-9.0.0-20210330.8
+ [21-04-11 18:06:38] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.0 Beta (Plow)
+ [21-04-11 18:06:38] uname -a
Linux rdma-virt-03.lab.bos.redhat.com 5.11.0-2.el9.x86_64 #1 SMP Wed Mar 10 14:55:23 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
+ [21-04-11 18:06:38] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.11.0-2.el9.x86_64 root=UUID=95261748-5608-45f0-8d17-51b97e1a6d1f ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH resume=UUID=49bd0668-6d8a-4ad9-bfe9-f3e96bf403ce console=ttyS1,115200n81
+ [21-04-11 18:06:38] rpm -q rdma-core linux-firmware
rdma-core-34.0-2.el9.x86_64
linux-firmware-20210208-118.el9.noarch
+ [21-04-11 18:06:38] tail /sys/class/infiniband/mlx5_0/fw_ver /sys/class/infiniband/mlx5_1/fw_ver /sys/class/infiniband/mlx5_bond_0/fw_ver
==> /sys/class/infiniband/mlx5_0/fw_ver <==
12.25.1020
==> /sys/class/infiniband/mlx5_1/fw_ver <==
12.25.1020
==> /sys/class/infiniband/mlx5_bond_0/fw_ver <==
14.27.1016
+ [21-04-11 18:06:38] lspci
+ [21-04-11 18:06:38] grep -i -e ethernet -e infiniband -e omni -e ConnectX
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
03:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
03:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
04:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
04:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
05:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
05:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]

Tested RDMA hosts:
Clients: rdma-virt-03
Servers: rdma-virt-02

How reproducible:
100%

Steps to Reproduce:

server side of link
===================
8: mlx5_ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:11:0f:fe:80:00:00:00:00:00:00:e4:1d:2d:03:00:e7:0f:f6 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    altname ibp4s0f0
    inet 172.31.0.202/24 brd 172.31.0.255 scope global dynamic noprefixroute mlx5_ib0
       valid_lft 2384sec preferred_lft 2384sec
    inet6 fe80::e61d:2d03:e7:ff6/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

1. Bring up the RDMA hosts with the above build.
2. On the server host, issue the following command:
timeout --preserve-status --kill-after=5m 3m ucx_perftest -d mlx5_0:1 -t am_lat -x rc -c 1
3. On the client host, issue the following command:
timeout --preserve-status --kill-after=5m 3m ucx_perftest -d mlx5_0:1 -t am_lat -x rc -c 1 172.31.0.202

Actual results:
Both the server and client hosts running the above ucx_perftest commands produce the following output:

[rdma-virt-03:105091:0:105091] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 105091) ====
0 /lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x7fce7e3615c4]
1 /lib64/libucs.so.0(+0x2916d) [0x7fce7e36416d]
2 /lib64/libucs.so.0(+0x2934a) [0x7fce7e36434a]
3 /lib64/libpthread.so.0(+0x13a00) [0x7fce7e32da00]
4 ucx_perftest(+0x6114) [0x5573b9097114]
5 ucx_perftest(+0xc8d1) [0x5573b909d8d1]
6 ucx_perftest(+0x59ed) [0x5573b90969ed]
7 /lib64/libc.so.6(__libc_start_main+0xd5) [0x7fce7e110b75]
8 ucx_perftest(+0x603e) [0x5573b909703e]
=================================
timeout: the monitored command dumped core

Expected results:
Normal performance stats

Additional info:
Also tested on rdma-qe-06 (server) / rdma-qe-07 (client) with exactly the same segmentation fault as shown above.