Bug 2126094

Summary: [RHEL9.1] UCX fails in "openmpi ucx osu_bw" test with core file when tested on MLX5 ROCE / IB devices
Product: Red Hat Enterprise Linux 9
Reporter: Brian Chae <bchae>
Component: ucx
Assignee: Michal Schmidt <mschmidt>
Status: CLOSED MIGRATED
QA Contact: Afom T. Michael <tmichael>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 9.1
CC: rdma-dev-team, zguo
Target Milestone: rc
Keywords: MigratedToJIRA, Regression
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-09-21 14:45:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2091421

Description Brian Chae 2022-09-12 12:04:00 UTC
Description of problem:

"openmpi ucx osu_bw" test fails during UCX test when tested on ALL variants of MLX5 ROCE HCA.

This is a regression compared with the RHEL-9.1.0-20220524.0 build used for the CTC#1 test cycle, and also compared with the CTC#2 build (which no longer exists).


Version-Release number of selected component (if applicable):

Clients: rdma-dev-22
Servers: rdma-dev-21

DISTRO=RHEL-9.1.0-20220910.0

+ [22-09-11 20:02:57] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.1 Beta (Plow)

+ [22-09-11 20:02:57] uname -a
Linux rdma-dev-22.rdma.lab.eng.rdu2.redhat.com 5.14.0-162.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Sep 5 10:44:43 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

+ [22-09-11 20:02:57] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-162.el9.x86_64 root=UUID=376371e8-0b44-45c2-8687-191dbb3737bc ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=beb6c243-17c9-4210-ba33-d2c0b4062b8a console=ttyS1,115200n81

+ [22-09-11 20:02:57] rpm -q rdma-core linux-firmware
rdma-core-41.0-3.el9.x86_64
linux-firmware-20220708-127.el9.noarch

+ [22-09-11 20:02:57] tail /sys/class/infiniband/mlx5_0/fw_ver /sys/class/infiniband/mlx5_1/fw_ver /sys/class/infiniband/mlx5_2/fw_ver
==> /sys/class/infiniband/mlx5_0/fw_ver <==
12.28.2006

==> /sys/class/infiniband/mlx5_1/fw_ver <==
12.28.2006

==> /sys/class/infiniband/mlx5_2/fw_ver <==
12.28.2006

+ [22-09-11 20:02:57] lspci
+ [22-09-11 20:02:57] grep -i -e ethernet -e infiniband -e omni -e ConnectX
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
04:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
82:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
82:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

Installed:
  ucx-cma-1.13.0-1.el9.x86_64              ucx-ib-1.13.0-1.el9.x86_64          
  ucx-rdmacm-1.13.0-1.el9.x86_64   



+ [22-09-11 20:15:38] timeout --preserve-status --kill-after=5m 3m ompi_info --parsable
+ [22-09-11 20:15:38] grep ucx
mca:osc:ucx:version:"mca:2.1.0"
mca:osc:ucx:version:"api:3.0.0"
mca:osc:ucx:version:"component:4.1.1"
mca:pml:ucx:version:"mca:2.1.0"
mca:pml:ucx:version:"api:2.0.0"
mca:pml:ucx:version:"component:4.1.1"

       

How reproducible:
100%

Steps to Reproduce:
1. Install RHEL-9.1.0-20220910.0 on any of 
   rdma-dev-19/20, rdma-dev-21/22, rdma-perf-02/03, rdma-virt-02/03 for ROCE
2. Install & execute kernel-kernel-infiniband-ucx test script
3. Watch the ucx result on the client side (a minimal manual reproduction sketch follows these steps)
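
For reference, a minimal manual reproduction sketch, assuming the same hostfile path (/root/hfile_one_core) and RoCE device (mlx5_0) as the failing run below; adjust both for the system under test:

  # Hedged sketch, not the exact harness invocation: run the OSU bandwidth test
  # with the UCX pml/osc over port 1 of mlx5_0, one rank per node.
  mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node \
         -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=mlx5_0:1 mpitests-osu_bw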

Actual results:
+ [22-09-11 20:19:09] timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl '^vader,tcp,openib' -mca btl_openib_cpc_include rdmacm -mca btl_openib_receive_queues P,65536,256,192,128 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=mlx5_0:1 mpitests-osu_bw
[rdma-dev-22:262261] *** Process received signal ***
[rdma-dev-22:262261] Signal: Bus error (7)
[rdma-dev-22:262261] Signal code: Non-existant physical address (2)
[rdma-dev-22:262261] Failing at address: 0x7fef511b6000
[rdma-dev-22:262261] [ 0] /lib64/libc.so.6(+0x54d90)[0x7fef5b310d90]
[rdma-dev-22:262261] [ 1] /lib64/libc.so.6(+0xc290a)[0x7fef5b37e90a]
[rdma-dev-22:262261] [ 2] /lib64/libfabric.so.1(+0x7836e4)[0x7fef593ef6e4]
[rdma-dev-22:262261] [ 3] /lib64/libfabric.so.1(+0x787ebf)[0x7fef593f3ebf]
[rdma-dev-22:262261] [ 4] /lib64/libfabric.so.1(+0x770299)[0x7fef593dc299]
[rdma-dev-22:262261] [ 5] /lib64/libfabric.so.1(+0x7707ed)[0x7fef593dc7ed]
[rdma-dev-22:262261] [ 6] /lib64/libfabric.so.1(+0x753e5d)[0x7fef593bfe5d]
[rdma-dev-22:262261] [ 7] /lib64/libfabric.so.1(+0x747bff)[0x7fef593b3bff]
[rdma-dev-22:262261] [ 8] /usr/lib64/openmpi/lib/openmpi/mca_btl_ofi.so(+0x6cdf)[0x7fef59562cdf]
[rdma-dev-22:262261] [ 9] /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_btl_base_select+0x112)[0x7fef5b1bae62]
[rdma-dev-22:262261] [10] /usr/lib64/openmpi/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x18)[0x7fef5956c188]
[rdma-dev-22:262261] [11] /usr/lib64/openmpi/lib/libmpi.so.40(mca_bml_base_init+0x94)[0x7fef5b56cc94]
[rdma-dev-22:262261] [12] /usr/lib64/openmpi/lib/libmpi.so.40(ompi_mpi_init+0x664)[0x7fef5b5accb4]
[rdma-dev-22:262261] [13] /usr/lib64/openmpi/lib/libmpi.so.40(MPI_Init+0x72)[0x7fef5b54c482]
[rdma-dev-22:262261] [14] mpitests-osu_bw(+0x25d5)[0x559f152565d5]
[rdma-dev-22:262261] [15] /lib64/libc.so.6(+0x3feb0)[0x7fef5b2fbeb0]
[rdma-dev-22:262261] [16] /lib64/libc.so.6(__libc_start_main+0x80)[0x7fef5b2fbf60]
[rdma-dev-22:262261] [17] mpitests-osu_bw(+0x3fa5)[0x559f15257fa5]
[rdma-dev-22:262261] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node rdma-dev-22 exited on signal 7 (Bus error).
--------------------------------------------------------------------------
+ [22-09-11 20:19:13] RQA_check_result -r 135 -t 'openmpi ucx osu_bw'

Also, a core file was detected afterwards.

Sun 2022-09-11 20:19:10 EDT 262261   0   0 SIGBUS none     /usr/lib64/openmpi/bin/mpitests-osu_bw    n/a
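
To inspect that core, something along these lines should work (a sketch assuming systemd-coredump collected it, as the listing above suggests; PID 262261 is taken from that listing):

  # Show metadata for the captured core, then open it in gdb for a symbolic backtrace.
  coredumpctl info 262261
  coredumpctl gdb 262261
  # Inside gdb, "bt" should show the faulting frames in libfabric / mca_btl_ofi.so.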



Expected results:

Result from RHEL-9.1.0-20220524.0

+ [22-09-11 19:14:39] timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl '^vader,tcp,openib' -mca btl_openib_cpc_include rdmacm -mca btl_openib_receive_queues P,65536,256,192,128 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=mlx5_1:1 mpitests-osu_bw
# OSU MPI Bandwidth Test v5.8
# Size      Bandwidth (MB/s)
1                       4.98
2                      10.21
4                      20.73
8                      41.60
16                     83.18
32                    153.85
64                    187.22
128                   371.50
256                   703.36
512                  1112.31
1024                 2091.28
2048                 2793.22
4096                 4250.75
8192                 9652.65
16384                9312.98
32768               11362.33
65536               11865.93
131072              12003.34
262144              12084.48
524288              12140.72
1048576             12187.05
2097152             12207.17
4194304             12219.12
+ [22-09-11 19:14:45] RQA_check_result -r 0 -t 'openmpi ucx osu_bw'

Also, no core file should be generated.

Additional info:

Comment 1 Brian Chae 2023-05-31 13:24:37 UTC
During 9.3 CTC#1 testing with RHEL-9.3.0-20230521.45, the same issue was observed on MLX5 IB.


ucx/ucx/ test results on rdma-perf-02/rdma-perf-03 & Beaker job J:7886459:
5.14.0-316.el9.x86_64, rdma-core-44.0-2.el9, mlx5, ib0, ConnectX-5 & mlx5_0
    Result | Status | Test
  ---------+--------+------------------------------------
      FAIL |    135 | openmpi ucx osu_bw


+ [23-05-25 21:00:55] timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl '^vader,tcp,openib' -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=mlx5_0:1 mpitests-osu_bw
[rdma-perf-03:238169] *** Process received signal ***
[rdma-perf-03:238169] Signal: Bus error (7)
[rdma-perf-03:238169] Signal code: Non-existant physical address (2)
[rdma-perf-03:238169] Failing at address: 0x7f2e949b4000
[rdma-perf-03:238169] [ 0] /lib64/libc.so.6(+0x54df0)[0x7f2e9ea54df0]
[rdma-perf-03:238169] [ 1] /lib64/libc.so.6(+0x2b580)[0x7f2e9ea2b580]
[rdma-perf-03:238169] [ 2] /lib64/libfabric.so.1(+0x5db231)[0x7f2e9c7db231]
[rdma-perf-03:238169] [ 3] /lib64/libfabric.so.1(+0x5d13ec)[0x7f2e9c7d13ec]
[rdma-perf-03:238169] [ 4] /lib64/libfabric.so.1(+0x5d49f9)[0x7f2e9c7d49f9]
[rdma-perf-03:238169] [ 5] /lib64/libfabric.so.1(+0x5fa95b)[0x7f2e9c7fa95b]
[rdma-perf-03:238169] [ 6] /lib64/libfabric.so.1(+0x59fa11)[0x7f2e9c79fa11]
[rdma-perf-03:238169] [ 7] /usr/lib64/openmpi/lib/openmpi/mca_btl_ofi.so(+0x6cdf)[0x7f2e9d4fdcdf]
[rdma-perf-03:238169] [ 8] /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_btl_base_select+0x112)[0x7f2e9e903e62]
[rdma-perf-03:238169] [ 9] /usr/lib64/openmpi/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x18)[0x7f2e9d507188]
[rdma-perf-03:238169] [10] /usr/lib64/openmpi/lib/libmpi.so.40(mca_bml_base_init+0x94)[0x7f2e9ed07c94]
[rdma-perf-03:238169] [11] /usr/lib64/openmpi/lib/libmpi.so.40(ompi_mpi_init+0x664)[0x7f2e9ed47cb4]
[rdma-perf-03:238169] [12] /usr/lib64/openmpi/lib/libmpi.so.40(MPI_Init+0x72)[0x7f2e9ece7482]
[rdma-perf-03:238169] [13] mpitests-osu_bw(+0x25d5)[0x55d47cae15d5]
[rdma-perf-03:238169] [14] /lib64/libc.so.6(+0x3feb0)[0x7f2e9ea3feb0]
[rdma-perf-03:238169] [15] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f2e9ea3ff60]
[rdma-perf-03:238169] [16] mpitests-osu_bw(+0x3fa5)[0x55d47cae2fa5]
[rdma-perf-03:238169] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node rdma-perf-03 exited on signal 7 (Bus error).
--------------------------------------------------------------------------
+ [23-05-25 21:01:00] RQA_check_result -r 135 -t 'openmpi ucx osu_bw'
+ [23-05-25 21:01:00] local test_pass=0
+ [23-05-25 21:01:00] local test_skip=777
+ [23-05-25 21:01:00] test 4 -gt 0
+ [23-05-25 21:01:00] case $1 in
+ [23-05-25 21:01:00] local rc=135
+ [23-05-25 21:01:00] shift
+ [23-05-25 21:01:00] shift
+ [23-05-25 21:01:00] test 2 -gt 0
+ [23-05-25 21:01:00] case $1 in
+ [23-05-25 21:01:00] local 'msg=openmpi ucx osu_bw'
+ [23-05-25 21:01:00] shift
+ [23-05-25 21:01:00] shift
+ [23-05-25 21:01:01] test 0 -gt 0
+ [23-05-25 21:01:01] '[' -z 135 -o -z 'openmpi ucx osu_bw' ']'
+ [23-05-25 21:01:01] '[' -z /tmp/tmp.dqjGJ3nlpa/results_ucx-ucx-.txt ']'
+ [23-05-25 21:01:01] '[' -z /tmp/tmp.dqjGJ3nlpa/results_ucx-ucx-.txt ']'
+ [23-05-25 21:01:01] '[' 135 -eq 0 ']'
+ [23-05-25 21:01:01] '[' 135 -eq 777 ']'
+ [23-05-25 21:01:01] local test_result=FAIL
+ [23-05-25 21:01:01] export result=FAIL
+ [23-05-25 21:01:01] result=FAIL
+ [23-05-25 21:01:01] [[ ! -z '' ]]
+ [23-05-25 21:01:01] printf '%10s | %6s | %s\n' FAIL 135 'openmpi ucx osu_bw'
+ [23-05-25 21:01:01] set +x
---
- TEST RESULT FOR ucx
-   Test:   openmpi ucx osu_bw
-   Result: FAIL
-   Return: 135
---



Clients: rdma-perf-03
Servers: rdma-perf-02

DISTRO=RHEL-9.3.0-20230521.45

+ [23-05-25 20:43:24] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.3 Beta (Plow)

+ [23-05-25 20:43:24] uname -a
Linux rdma-perf-03.rdma.lab.eng.rdu2.redhat.com 5.14.0-316.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Fri May 19 13:18:40 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

+ [23-05-25 20:43:24] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-316.el9.x86_64 root=UUID=54370a14-3dd8-4131-8ec8-1c6724815ff7 ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH intel_idle.max_cstate=0 intremap=no_x2apic_optout processor.max_cstate=0 reboot=acpi crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=1b4ab1f2-c70d-4217-ac30-d241c6ade7f9 console=ttyS1,115200n81

+ [23-05-25 20:43:24] rpm -q rdma-core linux-firmware
rdma-core-44.0-2.el9.x86_64
linux-firmware-20230404-134.el9.noarch

+ [23-05-25 20:43:24] tail /sys/class/infiniband/mlx5_0/fw_ver /sys/class/infiniband/mlx5_1/fw_ver
==> /sys/class/infiniband/mlx5_0/fw_ver <==
16.33.1048

==> /sys/class/infiniband/mlx5_1/fw_ver <==
16.33.1048
+ [23-05-25 20:43:24] lspci
+ [23-05-25 20:43:24] grep -i -e ethernet -e infiniband -e omni -e ConnectX
03:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
03:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
03:00.2 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
03:00.3 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
07:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
07:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

Comment 2 RHEL Program Management 2023-09-21 14:40:24 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 3 RHEL Program Management 2023-09-21 14:45:28 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated.  Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.