Bug 2126094
| Summary: | [RHEL9.1] UCX fails in "openmpi ucx osu_bw" test with core file when tested on MLX5 ROCE / IB devices | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Brian Chae <bchae> |
| Component: | ucx | Assignee: | Michal Schmidt <mschmidt> |
| Status: | CLOSED MIGRATED | QA Contact: | Afom T. Michael <tmichael> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 9.1 | CC: | rdma-dev-team, zguo |
| Target Milestone: | rc | Keywords: | MigratedToJIRA, Regression |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-09-21 14:45:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2091421 | | |
During 9.3 CTC#1 testing with RHEL-9.3.0-20230521.45, the same issue was observed on MLX5 IB.
ucx/ucx test results on rdma-perf-02/rdma-perf-03, Beaker job J:7886459:
kernel 5.14.0-316.el9.x86_64, rdma-core-44.0-2.el9, mlx5, ib0, ConnectX-5, mlx5_0
Result | Status | Test
---------+--------+------------------------------------
FAIL | 135 | openmpi ucx osu_bw
+ [23-05-25 21:00:55] timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl '^vader,tcp,openib' -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=mlx5_0:1 mpitests-osu_bw
[rdma-perf-03:238169] *** Process received signal ***
[rdma-perf-03:238169] Signal: Bus error (7)
[rdma-perf-03:238169] Signal code: Non-existant physical address (2)
[rdma-perf-03:238169] Failing at address: 0x7f2e949b4000
[rdma-perf-03:238169] [ 0] /lib64/libc.so.6(+0x54df0)[0x7f2e9ea54df0]
[rdma-perf-03:238169] [ 1] /lib64/libc.so.6(+0x2b580)[0x7f2e9ea2b580]
[rdma-perf-03:238169] [ 2] /lib64/libfabric.so.1(+0x5db231)[0x7f2e9c7db231]
[rdma-perf-03:238169] [ 3] /lib64/libfabric.so.1(+0x5d13ec)[0x7f2e9c7d13ec]
[rdma-perf-03:238169] [ 4] /lib64/libfabric.so.1(+0x5d49f9)[0x7f2e9c7d49f9]
[rdma-perf-03:238169] [ 5] /lib64/libfabric.so.1(+0x5fa95b)[0x7f2e9c7fa95b]
[rdma-perf-03:238169] [ 6] /lib64/libfabric.so.1(+0x59fa11)[0x7f2e9c79fa11]
[rdma-perf-03:238169] [ 7] /usr/lib64/openmpi/lib/openmpi/mca_btl_ofi.so(+0x6cdf)[0x7f2e9d4fdcdf]
[rdma-perf-03:238169] [ 8] /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_btl_base_select+0x112)[0x7f2e9e903e62]
[rdma-perf-03:238169] [ 9] /usr/lib64/openmpi/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x18)[0x7f2e9d507188]
[rdma-perf-03:238169] [10] /usr/lib64/openmpi/lib/libmpi.so.40(mca_bml_base_init+0x94)[0x7f2e9ed07c94]
[rdma-perf-03:238169] [11] /usr/lib64/openmpi/lib/libmpi.so.40(ompi_mpi_init+0x664)[0x7f2e9ed47cb4]
[rdma-perf-03:238169] [12] /usr/lib64/openmpi/lib/libmpi.so.40(MPI_Init+0x72)[0x7f2e9ece7482]
[rdma-perf-03:238169] [13] mpitests-osu_bw(+0x25d5)[0x55d47cae15d5]
[rdma-perf-03:238169] [14] /lib64/libc.so.6(+0x3feb0)[0x7f2e9ea3feb0]
[rdma-perf-03:238169] [15] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f2e9ea3ff60]
[rdma-perf-03:238169] [16] mpitests-osu_bw(+0x3fa5)[0x55d47cae2fa5]
[rdma-perf-03:238169] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node rdma-perf-03 exited on signal 7 (Bus error).
--------------------------------------------------------------------------
+ [23-05-25 21:01:00] RQA_check_result -r 135 -t 'openmpi ucx osu_bw'
+ [23-05-25 21:01:00] local test_pass=0
+ [23-05-25 21:01:00] local test_skip=777
+ [23-05-25 21:01:00] test 4 -gt 0
+ [23-05-25 21:01:00] case $1 in
+ [23-05-25 21:01:00] local rc=135
+ [23-05-25 21:01:00] shift
+ [23-05-25 21:01:00] shift
+ [23-05-25 21:01:00] test 2 -gt 0
+ [23-05-25 21:01:00] case $1 in
+ [23-05-25 21:01:00] local 'msg=openmpi ucx osu_bw'
+ [23-05-25 21:01:00] shift
+ [23-05-25 21:01:00] shift
+ [23-05-25 21:01:01] test 0 -gt 0
+ [23-05-25 21:01:01] '[' -z 135 -o -z 'openmpi ucx osu_bw' ']'
+ [23-05-25 21:01:01] '[' -z /tmp/tmp.dqjGJ3nlpa/results_ucx-ucx-.txt ']'
+ [23-05-25 21:01:01] '[' -z /tmp/tmp.dqjGJ3nlpa/results_ucx-ucx-.txt ']'
+ [23-05-25 21:01:01] '[' 135 -eq 0 ']'
+ [23-05-25 21:01:01] '[' 135 -eq 777 ']'
+ [23-05-25 21:01:01] local test_result=FAIL
+ [23-05-25 21:01:01] export result=FAIL
+ [23-05-25 21:01:01] result=FAIL
+ [23-05-25 21:01:01] [[ ! -z '' ]]
+ [23-05-25 21:01:01] printf '%10s | %6s | %s\n' FAIL 135 'openmpi ucx osu_bw'
+ [23-05-25 21:01:01] set +x
---
- TEST RESULT FOR ucx
- Test: openmpi ucx osu_bw
- Result: FAIL
- Return: 135
---
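For readers unfamiliar with the test harness, the xtrace above (the "+ [timestamp]" lines) shows roughly what the result-recording helper does. The following bash sketch is a reconstruction inferred only from that trace; the function name RQA_check_result appears in the log, but the exact body, the PASS/SKIP labels, and the RESULT_FILE variable are assumptions.

```sh
# Reconstruction sketch of the harness helper, inferred from the xtrace above.
# Only the name RQA_check_result comes from the log; the PASS/SKIP labels, the
# RESULT_FILE variable, and the tee call are assumptions.
RQA_check_result() {
    local test_pass=0
    local test_skip=777
    local rc msg
    while test $# -gt 0; do
        case $1 in
            -r) rc=$2;  shift; shift ;;
            -t) msg=$2; shift; shift ;;
             *) shift ;;
        esac
    done
    [ -z "$rc" -o -z "$msg" ] && return 1   # both options are required

    local test_result
    if [ "$rc" -eq "$test_pass" ]; then
        test_result=PASS
    elif [ "$rc" -eq "$test_skip" ]; then
        test_result=SKIP
    else
        test_result=FAIL
    fi
    export result=$test_result
    # Same format string as seen in the trace.
    printf '%10s | %6s | %s\n' "$test_result" "$rc" "$msg" | tee -a "$RESULT_FILE"
}
```

Invoked as in the trace, e.g. RQA_check_result -r 135 -t 'openmpi ucx osu_bw', which yields the FAIL line shown in the results table.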
Clients: rdma-perf-03
Servers: rdma-perf-02
DISTRO=RHEL-9.3.0-20230521.45
+ [23-05-25 20:43:24] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.3 Beta (Plow)
+ [23-05-25 20:43:24] uname -a
Linux rdma-perf-03.rdma.lab.eng.rdu2.redhat.com 5.14.0-316.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Fri May 19 13:18:40 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
+ [23-05-25 20:43:24] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-316.el9.x86_64 root=UUID=54370a14-3dd8-4131-8ec8-1c6724815ff7 ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH intel_idle.max_cstate=0 intremap=no_x2apic_optout processor.max_cstate=0 reboot=acpi crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=1b4ab1f2-c70d-4217-ac30-d241c6ade7f9 console=ttyS1,115200n81
+ [23-05-25 20:43:24] rpm -q rdma-core linux-firmware
rdma-core-44.0-2.el9.x86_64
linux-firmware-20230404-134.el9.noarch
+ [23-05-25 20:43:24] tail /sys/class/infiniband/mlx5_0/fw_ver /sys/class/infiniband/mlx5_1/fw_ver
==> /sys/class/infiniband/mlx5_0/fw_ver <==
16.33.1048
==> /sys/class/infiniband/mlx5_1/fw_ver <==
16.33.1048
+ [23-05-25 20:43:24] lspci
+ [23-05-25 20:43:24] grep -i -e ethernet -e infiniband -e omni -e ConnectX
03:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
03:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
03:00.2 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
03:00.3 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
07:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
07:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
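In both the 9.3 backtrace above and the original 9.1 report below, the SIGBUS originates inside libfabric, reached through mca_btl_ofi.so during mca_btl_base_select(), i.e. before the requested UCX PML handles any traffic. A hedged diagnostic sketch follows; it is not part of the original report, the mpirun options are standard Open MPI syntax, and the idea that excluding the OFI BTL avoids the crash is an assumption drawn only from the backtrace.

```sh
# Diagnostic sketch (assumption, not from the report): exclude the OFI BTL in
# addition to the BTLs the test already excludes, to check whether the SIGBUS
# is triggered by libfabric provider probing rather than by UCX itself.
timeout --preserve-status --kill-after=5m 3m \
    mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node \
        -mca btl '^vader,tcp,openib,ofi' \
        -mca pml ucx -mca osc ucx \
        -x UCX_NET_DEVICES=mlx5_0:1 \
        mpitests-osu_bw

# Optionally list the libfabric providers visible on the node; fi_info ships
# with libfabric, but its presence on the test hosts is an assumption.
fi_info -l
```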
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there. Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it and begin with "RHEL-" followed by an integer. You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like: "Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.
Description of problem:

The "openmpi ucx osu_bw" test fails during the UCX test when run on ALL variants of MLX5 ROCE HCAs. This is a regression relative to the RHEL-9.1.0-20220524.0 build from the CTC#1 test cycle; it also affects CTC#2 (that build no longer exists).

Version-Release number of selected component (if applicable):

Clients: rdma-dev-22
Servers: rdma-dev-21
DISTRO=RHEL-9.1.0-20220910.0
+ [22-09-11 20:02:57] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.1 Beta (Plow)
+ [22-09-11 20:02:57] uname -a
Linux rdma-dev-22.rdma.lab.eng.rdu2.redhat.com 5.14.0-162.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Sep 5 10:44:43 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
+ [22-09-11 20:02:57] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-162.el9.x86_64 root=UUID=376371e8-0b44-45c2-8687-191dbb3737bc ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=beb6c243-17c9-4210-ba33-d2c0b4062b8a console=ttyS1,115200n81
+ [22-09-11 20:02:57] rpm -q rdma-core linux-firmware
rdma-core-41.0-3.el9.x86_64
linux-firmware-20220708-127.el9.noarch
+ [22-09-11 20:02:57] tail /sys/class/infiniband/mlx5_0/fw_ver /sys/class/infiniband/mlx5_1/fw_ver /sys/class/infiniband/mlx5_2/fw_ver
==> /sys/class/infiniband/mlx5_0/fw_ver <==
12.28.2006
==> /sys/class/infiniband/mlx5_1/fw_ver <==
12.28.2006
==> /sys/class/infiniband/mlx5_2/fw_ver <==
12.28.2006
+ [22-09-11 20:02:57] lspci
+ [22-09-11 20:02:57] grep -i -e ethernet -e infiniband -e omni -e ConnectX
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
04:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
82:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
82:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

Installed:
ucx-cma-1.13.0-1.el9.x86_64
ucx-ib-1.13.0-1.el9.x86_64
ucx-rdmacm-1.13.0-1.el9.x86_64

+ [22-09-11 20:15:38] timeout --preserve-status --kill-after=5m 3m ompi_info --parsable
+ [22-09-11 20:15:38] grep ucx
mca:osc:ucx:version:"mca:2.1.0"
mca:osc:ucx:version:"api:3.0.0"
mca:osc:ucx:version:"component:4.1.1"
mca:pml:ucx:version:"mca:2.1.0"
mca:pml:ucx:version:"api:2.0.0"
mca:pml:ucx:version:"component:4.1.1"

How reproducible:
100%

Steps to Reproduce:
1. Install RHEL-9.1.0-20220910.0 on any of rdma-dev-19/20, rdma-dev-21/22, rdma-perf-02/03, rdma-virt-02/03 for ROCE
2. Install and execute the kernel-kernel-infiniband-ucx test script
3. Watch the ucx result on the client side

Actual results:

+ [22-09-11 20:19:09] timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl '^vader,tcp,openib' -mca btl_openib_cpc_include rdmacm -mca btl_openib_receive_queues P,65536,256,192,128 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=mlx5_0:1 mpitests-osu_bw
[rdma-dev-22:262261] *** Process received signal ***
[rdma-dev-22:262261] Signal: Bus error (7)
[rdma-dev-22:262261] Signal code: Non-existant physical address (2)
[rdma-dev-22:262261] Failing at address: 0x7fef511b6000
[rdma-dev-22:262261] [ 0] /lib64/libc.so.6(+0x54d90)[0x7fef5b310d90]
[rdma-dev-22:262261] [ 1] /lib64/libc.so.6(+0xc290a)[0x7fef5b37e90a]
[rdma-dev-22:262261] [ 2] /lib64/libfabric.so.1(+0x7836e4)[0x7fef593ef6e4]
[rdma-dev-22:262261] [ 3] /lib64/libfabric.so.1(+0x787ebf)[0x7fef593f3ebf]
[rdma-dev-22:262261] [ 4] /lib64/libfabric.so.1(+0x770299)[0x7fef593dc299]
[rdma-dev-22:262261] [ 5] /lib64/libfabric.so.1(+0x7707ed)[0x7fef593dc7ed]
[rdma-dev-22:262261] [ 6] /lib64/libfabric.so.1(+0x753e5d)[0x7fef593bfe5d]
[rdma-dev-22:262261] [ 7] /lib64/libfabric.so.1(+0x747bff)[0x7fef593b3bff]
[rdma-dev-22:262261] [ 8] /usr/lib64/openmpi/lib/openmpi/mca_btl_ofi.so(+0x6cdf)[0x7fef59562cdf]
[rdma-dev-22:262261] [ 9] /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_btl_base_select+0x112)[0x7fef5b1bae62]
[rdma-dev-22:262261] [10] /usr/lib64/openmpi/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x18)[0x7fef5956c188]
[rdma-dev-22:262261] [11] /usr/lib64/openmpi/lib/libmpi.so.40(mca_bml_base_init+0x94)[0x7fef5b56cc94]
[rdma-dev-22:262261] [12] /usr/lib64/openmpi/lib/libmpi.so.40(ompi_mpi_init+0x664)[0x7fef5b5accb4]
[rdma-dev-22:262261] [13] /usr/lib64/openmpi/lib/libmpi.so.40(MPI_Init+0x72)[0x7fef5b54c482]
[rdma-dev-22:262261] [14] mpitests-osu_bw(+0x25d5)[0x559f152565d5]
[rdma-dev-22:262261] [15] /lib64/libc.so.6(+0x3feb0)[0x7fef5b2fbeb0]
[rdma-dev-22:262261] [16] /lib64/libc.so.6(__libc_start_main+0x80)[0x7fef5b2fbf60]
[rdma-dev-22:262261] [17] mpitests-osu_bw(+0x3fa5)[0x559f15257fa5]
[rdma-dev-22:262261] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node rdma-dev-22 exited on signal 7 (Bus error).
--------------------------------------------------------------------------
+ [22-09-11 20:19:13] RQA_check_result -r 135 -t 'openmpi ucx osu_bw'

Also, a core file was detected afterwards:

Sun 2022-09-11 20:19:10 EDT  262261  0  0  SIGBUS  none  /usr/lib64/openmpi/bin/mpitests-osu_bw  n/a

Expected results:

Result from RHEL-9.1.0-20220524.0:

+ [22-09-11 19:14:39] timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl '^vader,tcp,openib' -mca btl_openib_cpc_include rdmacm -mca btl_openib_receive_queues P,65536,256,192,128 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=mlx5_1:1 mpitests-osu_bw
# OSU MPI Bandwidth Test v5.8
# Size      Bandwidth (MB/s)
1                       4.98
2                      10.21
4                      20.73
8                      41.60
16                     83.18
32                    153.85
64                    187.22
128                   371.50
256                   703.36
512                  1112.31
1024                 2091.28
2048                 2793.22
4096                 4250.75
8192                 9652.65
16384                9312.98
32768               11362.33
65536               11865.93
131072              12003.34
262144              12084.48
524288              12140.72
1048576             12187.05
2097152             12207.17
4194304             12219.12
+ [22-09-11 19:14:45] RQA_check_result -r 0 -t 'openmpi ucx osu_bw'

Also, there should be NO CORE generated.

Additional info:
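The failing runs also leave a core dump behind (see the SIGBUS entry under Actual results). A minimal sketch, assuming the test hosts use systemd-coredump as that listing suggests, of how the core could be pulled up for a symbolic backtrace; the PID comes from the 9.1 trace, and readable symbols would require the matching debuginfo packages:

```sh
# Assumes systemd-coredump captured the dump, as the coredumpctl-style listing
# above suggests. Debuginfo for openmpi, libfabric, and ucx is needed for a
# useful backtrace.
coredumpctl list mpitests-osu_bw    # locate the core entry for the crashed test
coredumpctl info 262261             # show metadata for that PID
coredumpctl gdb 262261              # open the core in gdb, then run: bt full
```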