Bug 1902855
| Summary: | [RHEL8.4] performance degradation with "ib_send_lat RC" test when tested on mlx5 MT27700 CX-4 ROCE device | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Brian Chae <bchae> |
| Component: | perftest | Assignee: | Honggang LI <honli> |
| Status: | CLOSED ERRATA | QA Contact: | Brian Chae <bchae> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 8.4 | CC: | dledford, knweiss, mstowe, rdma-dev-team, tmichael |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | 8.4 | Flags: | pm-rhel: mirror+ |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | perftest-4.4-8.el8 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-05-18 14:45:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1903942 | ||
[root@rdma-dev-21 ~]$ lspci -nn | grep 8086:6f0
00:00.0 Host bridge [0600]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DMI2 [8086:6f00] (rev 01)
00:01.0 PCI bridge [0604]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 1 [8086:6f02] (rev 01)
00:02.0 PCI bridge [0604]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 2 [8086:6f04] (rev 01)
00:03.0 PCI bridge [0604]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 3 [8086:6f08] (rev 01)
00:03.1 PCI bridge [0604]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 3 [8086:6f09] (rev 01)
80:01.0 PCI bridge [0604]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 1 [8086:6f02] (rev 01)
80:03.0 PCI bridge [0604]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 3 [8086:6f08] (rev 01)

https://lore.kernel.org/patchwork/patch/820922/

The CPU of rdma-dev-21/22 is not PCIe Relaxed Ordering compliant, so please run perftest with '--disable_pcie_relaxed'.

[root@rdma-dev-21 ~]$ ib_send_lat --disable_pcie_relaxed -a -c RC -d mlx5_0 -i 1 -F -R
<snip>
PCIe relax order: OFF <====
<snip>
[root@rdma-dev-22 ~]$ ib_send_lat --disable_pcie_relaxed -a -c RC -d mlx5_0 -i 1 -F -R 172.31.45.121

The perftest was re-tested with the latest build, RHEL-8.4.0-20210205.n.0, on mlx5 MT27700 CX-4 ROCE device.
o RDMA lab hosts
rdma-dev-21(server) / 22(client) host pair.
o Build info
DISTRO=RHEL-8.4.0-20210205.n.0
+ [21-02-05 06:32:51] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 Beta (Ootpa)
+ [21-02-05 06:32:51] uname -a
Linux rdma-dev-22.lab.bos.redhat.com 4.18.0-282.el8.x86_64 #1 SMP Tue Feb 2 14:09:52 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
+ [21-02-05 06:32:51] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-282.el8.x86_64 root=/dev/mapper/rhel_rdma--dev--22-root ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=/dev/mapper/rhel_rdma--dev--22-swap rd.lvm.lv=rhel_rdma-dev-22/root rd.lvm.lv=rhel_rdma-dev-22/swap console=ttyS1,115200n81
+ [21-02-05 06:32:51] rpm -q rdma-core linux-firmware
rdma-core-32.0-4.el8.x86_64
linux-firmware-20201218-102.git05789708.el8.noarch
+ [21-02-05 06:32:51] tail /sys/class/infiniband/mlx5_0/fw_ver /sys/class/infiniband/mlx5_1/fw_ver /sys/class/infiniband/mlx5_2/fw_ver
==> /sys/class/infiniband/mlx5_0/fw_ver <==
12.28.1002
==> /sys/class/infiniband/mlx5_1/fw_ver <==
12.28.1002
==> /sys/class/infiniband/mlx5_2/fw_ver <==
12.28.1002
+ [21-02-05 06:32:51] lspci
+ [21-02-05 06:32:51] grep -i -e ethernet -e infiniband -e omni -e ConnectX
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
04:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
82:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
82:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
o Test result
Test results for perftest on rdma-dev-22:
4.18.0-282.el8.x86_64, rdma-core-32.0-4.el8, mlx5, roce.45, & mlx5_0
Result | Status | Test
---------+--------+------------------------------------
PASS | 0 | ib_atomic_bw RC
PASS | 0 | ib_atomic_lat RC
PASS | 0 | ib_read_bw RC
PASS | 0 | ib_read_lat RC
PASS | 0 | ib_send_bw RC
PASS | 0 | ib_send_lat RC
PASS | 0 | ib_write_bw RC
PASS | 0 | ib_write_lat RC
PASS | 0 | raw_ethernet_bw RC
PASS | 0 | raw_ethernet_lat RC
Checking for failures and known issues:
no test failures
o ib_send_lat perftest result, showing the performance data
+ [21-02-05 06:34:37] timeout 3m ib_send_lat -a -c RC -d mlx5_0 -i 1 -F -R 172.31.45.121 <<<=============
---------------------------------------------------------------------------------------
Send Latency Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: Unsupported
ibv_wr* API : ON
TX depth : 1
Mtu : 4096[B]
Link type : Ethernet
GID index : 7
Max inline data : 236[B]
rdma_cm QPs : ON
Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0111 PSN 0x1be678
GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:31:45:122
remote address: LID 0000 QPN 0x0111 PSN 0x88219a
GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:31:40:121
---------------------------------------------------------------------------------------
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec] t_avg[usec] t_stdev[usec] 99% percentile[usec] 99.9% percentile[usec]
2 1000 1.18 1.97 1.22 1.22 0.02 1.31 1.97
4 1000 1.18 2.09 1.22 1.22 0.04 1.27 2.09
8 1000 1.18 2.14 1.22 1.22 0.04 1.27 2.14
16 1000 1.17 2.05 1.22 1.22 0.04 1.27 2.05
32 1000 1.18 3.22 1.22 1.22 0.07 1.28 3.22
64 1000 1.25 2.43 1.28 1.29 0.04 1.38 2.43
128 1000 1.26 2.25 1.30 1.31 0.04 1.37 2.25
256 1000 1.63 2.91 1.67 1.68 0.06 1.83 2.91
512 1000 1.70 3.00 1.75 1.76 0.06 1.95 3.00
1024 1000 1.82 3.19 1.88 1.91 0.08 2.08 3.19
2048 1000 2.05 2.38 2.11 2.12 0.04 2.30 2.38
4096 1000 2.53 3.21 2.58 2.60 0.06 2.74 3.21
8192 1000 2.89 4.05 2.95 2.97 0.07 3.17 4.05
16384 1000 3.56 5.01 3.65 3.72 0.13 4.15 5.01
32768 1000 4.91 6.11 5.02 5.09 0.16 5.59 6.11
65536 1000 8.01 9.37 8.26 8.28 0.13 8.65 9.37
131072 1000 17.81 19.05 18.17 18.19 0.13 18.51 19.05
262144 1000 28.55 29.96 29.18 29.20 0.32 29.85 29.96
524288 1000 50.20 55.86 52.25 52.61 1.03 55.51 55.86
1048576 1000 92.93 97.86 94.78 94.62 0.79 96.69 97.86
2097152 1000 178.64 184.00 180.88 181.31 1.38 183.81 184.00
4194304 1000 349.77 356.88 350.70 351.28 1.41 356.23 356.88 <<<============
8388608 1000 692.43 699.28 694.66 694.76 1.25 698.49 699.28 <<<============
---------------------------------------------------------------------------------------
Now, the above "ib_send_lat" perftest shows that performance is on par with the RHEL8.3 perftest results.
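The "on par" statement can be checked numerically. The following is a minimal sketch, not part of the original test harness: it parses the two marked rows from the retest output above and compares their t_avg values against the t_avg values from the "Expected results" run in the bug description, which is taken here as the baseline (an assumption). The names `parse_row`, `RETEST_ROWS`, and `BASELINE_T_AVG` are hypothetical.

```python
# Column names follow the perftest header line; the short keys are hypothetical.
COLUMNS = ("bytes", "iterations", "t_min", "t_max", "t_typical",
           "t_avg", "t_stdev", "p99", "p99_9")


def parse_row(line):
    """Parse one ib_send_lat result row, ignoring trailing markers such as '<<<==='."""
    fields = line.split()[:len(COLUMNS)]
    values = [int(f) if i < 2 else float(f) for i, f in enumerate(fields)]
    return dict(zip(COLUMNS, values))


# Rows copied from the retest output above (RHEL-8.4.0-20210205.n.0).
RETEST_ROWS = [
    "4194304 1000 349.77 356.88 350.70 351.28 1.41 356.23 356.88 <<<============",
    "8388608 1000 692.43 699.28 694.66 694.76 1.25 698.49 699.28 <<<============",
]

# t_avg[usec] from the "Expected results" run in the bug description
# (assumed baseline for this comparison).
BASELINE_T_AVG = {4194304: 351.83, 8388608: 694.71}

changes = {}
for line in RETEST_ROWS:
    row = parse_row(line)
    baseline = BASELINE_T_AVG[row["bytes"]]
    changes[row["bytes"]] = (row["t_avg"] - baseline) / baseline
    print(f"{row['bytes']} bytes: t_avg {row['t_avg']} usec, "
          f"{changes[row['bytes']]:+.2%} vs baseline")
```

Both relative differences come out well under one percent, which is consistent with the conclusion stated in the comment.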
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (RDMA stack bug fix and enhancement update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2021:1594

Question: So if I get this right, --disable_pcie_relaxed is an option that (manually) works around the issue in perftest(!) on affected platforms. However, what about other InfiniBand-using software with similar traffic patterns? Is every other program supposed to introduce such an option, too? I would appreciate it if someone could explain the situation.
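For anyone trying to tell whether a host is among the "affected platforms" mentioned in the question, the lspci check from the first comment can be scripted. This is a minimal sketch, assuming that matching the same `8086:6f0` ID prefix used in that grep is an adequate filter; the authoritative list of device IDs is in the linked kernel patch, not here. The function name `find_devices` and the last sample line are hypothetical, and the sketch does not answer the question about other software.

```python
import re

# Matches a [vendor:device] ID pair such as [8086:6f02]; class codes like
# [0604] contain no colon and are therefore not matched.
ID_RE = re.compile(r"\[([0-9a-f]{4}:[0-9a-f]{4})\]")


def find_devices(lspci_text, id_prefix="8086:6f0"):
    """Return (slot, device class, id) for lspci -nn lines whose ID starts with id_prefix."""
    hits = []
    for line in lspci_text.splitlines():
        ids = ID_RE.findall(line)
        if not ids or not ids[-1].startswith(id_prefix):
            continue
        slot, _, rest = line.partition(" ")
        device_class = rest.split(" [", 1)[0]
        hits.append((slot, device_class, ids[-1]))
    return hits


# First three lines are copied from the lspci -nn output in this report;
# the last line is a made-up non-matching entry for illustration only.
SAMPLE = """\
00:00.0 Host bridge [0600]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DMI2 [8086:6f00] (rev 01)
00:01.0 PCI bridge [0604]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 1 [8086:6f02] (rev 01)
80:03.0 PCI bridge [0604]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 3 [8086:6f08] (rev 01)
ff:00.0 Example device [0000]: hypothetical non-matching entry [1234:5678]
"""

hits = find_devices(SAMPLE)
root_ports = [h for h in hits if h[1] == "PCI bridge"]
for slot, device_class, dev_id in root_ports:
    print(f"{slot}: {device_class} {dev_id}")
```

On a live system the text would come from running `lspci -nn` rather than from an embedded sample.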
Description of problem:
The "ib_send_lat RC" perftest at data sizes of 4194304 bytes and 8388608 bytes had a test time increase of about 250-fold when compared with the same test on other ROCE devices, such as the mlx4 CX-3. The same holds when compared with the same "ib_send_lat RC" test on the same mlx5 CX-4 with RHEL-8.3.
The RDMA lab hosts with this issue are rdma-dev-21 and rdma-dev-22.
This is a performance degradation on RHEL-8.4 and a regression from RHEL-8.3.

Version-Release number of selected component (if applicable):
DISTRO=RHEL-8.4.0-20201128.n.0
+ [20-11-30 13:49:07] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 Beta (Ootpa)
+ [20-11-30 13:49:07] uname -a
Linux rdma-dev-22.lab.bos.redhat.com 4.18.0-254.el8.x86_64 #1 SMP Thu Nov 26 08:47:50 EST 2020 x86_64 x86_64 x86_64 GNU/Linux
+ [20-11-30 13:49:07] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-254.el8.x86_64 root=/dev/mapper/rhel_rdma--dev--22-root ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=/dev/mapper/rhel_rdma--dev--22-swap rd.lvm.lv=rhel_rdma-dev-22/root rd.lvm.lv=rhel_rdma-dev-22/swap console=ttyS1,115200n81
+ [20-11-30 13:49:07] rpm -q rdma-core linux-firmware
rdma-core-32.0-1.el8.x86_64
linux-firmware-20201022-100.gitdae4b4cd.el8.noarch
+ [20-11-30 13:49:07] tail /sys/class/infiniband/mlx5_0/fw_ver /sys/class/infiniband/mlx5_1/fw_ver /sys/class/infiniband/mlx5_2/fw_ver
==> /sys/class/infiniband/mlx5_0/fw_ver <==
12.28.1002
==> /sys/class/infiniband/mlx5_1/fw_ver <==
12.28.1002
==> /sys/class/infiniband/mlx5_2/fw_ver <==
12.28.1002
+ [20-11-30 13:49:07] lspci
+ [20-11-30 13:49:07] grep -i -e ethernet -e infiniband -e omni -e ConnectX
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
04:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
82:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
82:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
+ [20-11-30 13:49:07] lscpu
rdma-core-32.0-1.el8

How reproducible:
100%

Steps to Reproduce:
Device info
============
CA 'mlx5_0'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.28.1002
    Hardware version: 0
    Node GUID: 0x248a07030056b834
    System image GUID: 0x248a07030056b834
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x268a07fffe56b834
        Link layer: Ethernet
6: mlx5_roce: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether 24:8a:07:56:b8:34 brd ff:ff:ff:ff:ff:ff
    inet 172.31.40.122/24 brd 172.31.40.255 scope global dynamic noprefixroute mlx5_roce
       valid_lft 3427sec preferred_lft 3427sec
    inet6 fe80::268a:7ff:fe56:b834/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
23: mlx5_roce.45@mlx5_roce: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether 24:8a:07:56:b8:34 brd ff:ff:ff:ff:ff:ff
    inet 172.31.45.122/24 brd 172.31.45.255 scope global dynamic noprefixroute mlx5_roce.45
       valid_lft 3427sec preferred_lft 3427sec
    inet6 fe80::268a:7ff:fe56:b834/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
24: mlx5_roce.43@mlx5_roce: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether 24:8a:07:56:b8:34 brd ff:ff:ff:ff:ff:ff
    inet 172.31.43.122/24 brd 172.31.43.255 scope global dynamic noprefixroute mlx5_roce.43
       valid_lft 3427sec preferred_lft 3427sec
    inet6 fe80::268a:7ff:fe56:b834/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

1. Bring up two RDMA hosts with the above software/build
2. Issue the following perftest command on the server host
[root@rdma-dev-21 ~]$ timeout 10m ib_send_lat -a -c RC -d mlx5_0 -i 1 -F -R
3. Issue the following perftest command on the client host
[root@rdma-dev-22 ~]$ timeout 10m ib_send_lat -a -c RC -d mlx5_0 -i 1 -F -R 172.31.45.121

Actual results:
[root@rdma-dev-22 ~]$ timeout 10m ib_send_lat -a -c RC -d mlx5_0 -i 1 -F -R 172.31.45.121
---------------------------------------------------------------------------------------
Send Latency Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 1
Mtu : 4096[B]
Link type : Ethernet
GID index : 7
Max inline data : 236[B]
rdma_cm QPs : ON
Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x011c PSN 0x2abeb5
GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:31:40:122
remote address: LID 0000 QPN 0x011c PSN 0x2e2cd7
GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:31:40:121
---------------------------------------------------------------------------------------
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec] t_avg[usec] t_stdev[usec] 99% percentile[usec] 99.9% percentile[usec]
2 1000 1.18 3.09 1.23 1.23 0.04 1.31 3.09
4 1000 1.18 2.41 1.22 1.22 0.05 1.31 2.41
8 1000 1.18 2.22 1.22 1.22 0.05 1.28 2.22
16 1000 1.18 2.00 1.22 1.22 0.04 1.28 2.00
32 1000 1.18 3.23 1.22 1.23 0.06 1.31 3.23
64 1000 1.26 2.25 1.29 1.29 0.04 1.36 2.25
128 1000 1.28 3.31 1.32 1.32 0.07 1.39 3.31
256 1000 1.64 3.27 1.68 1.68 0.04 1.83 3.27
512 1000 1.71 2.88 1.75 1.77 0.06 1.93 2.88
1024 1000 1.84 2.37 1.89 1.92 0.07 2.06 2.37
2048 1000 2.09 3.00 2.13 2.15 0.05 2.34 3.00
4096 1000 2.58 3.48 2.62 2.63 0.04 2.82 3.48
8192 1000 3.00 3.29 3.04 3.06 0.05 3.25 3.29
16384 1000 3.77 4.64 3.89 3.90 0.10 4.18 4.64
32768 1000 5.33 7.43 5.42 5.48 0.13 5.85 7.43
65536 1000 8.47 10.06 8.71 8.72 0.15 9.02 10.06
131072 1000 18.81 19.86 19.09 19.11 0.14 19.51 19.86
262144 1000 31.27 32.49 31.80 31.81 0.21 32.29 32.49
524288 1000 56.04 57.71 56.63 56.65 0.27 57.50 57.71
1048576 1000 105.53 107.63 106.12 106.15 0.29 107.10 107.63
2097152 1000 204.46 206.53 205.12 205.16 0.33 206.09 206.53
4194304 1000 402.73 2150785.13 493.85 97111.29 427875.91 2147478.09 2150785.13
8388608 1000 950.45 2137753.19 1112.66 39726.68 274756.53 2100862.37 2137753.19
---------------------------------------------------------------------------------------
Normally, the above test takes less than 3 minutes; with the severe performance degradation at the 4194304-byte and 8388608-byte data sizes, the test time increased to 6 minutes.

Expected results:
+ [20-11-27 07:54:07] timeout 3m ib_send_lat -a -c RC -d mlx5_0 -i 1 -F -R 172.31.45.121
---------------------------------------------------------------------------------------
Send Latency Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 1
Mtu : 4096[B]
Link type : Ethernet
GID index : 7
Max inline data : 236[B]
rdma_cm QPs : ON
Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0112 PSN 0x1e5128
GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:31:45:122
remote address: LID 0000 QPN 0x0112 PSN 0x7854d
GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:31:45:121
---------------------------------------------------------------------------------------
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec] t_avg[usec] t_stdev[usec] 99% percentile[usec] 99.9% percentile[usec]
2 1000 1.18 2.86 1.23 1.23 0.05 1.31 2.86
4 1000 1.18 2.70 1.22 1.22 0.04 1.28 2.70
8 1000 1.18 2.43 1.23 1.23 0.04 1.28 2.43
16 1000 1.19 2.16 1.22 1.22 0.04 1.28 2.16
32 1000 1.18 2.07 1.23 1.23 0.04 1.29 2.07
64 1000 1.26 2.37 1.29 1.31 0.07 1.55 2.37
128 1000 1.27 2.14 1.31 1.31 0.03 1.37 2.14
256 1000 1.63 3.32 1.68 1.68 0.04 1.82 3.32
512 1000 1.71 2.85 1.76 1.77 0.05 1.92 2.85
1024 1000 1.83 2.74 1.89 1.92 0.08 2.08 2.74
2048 1000 2.07 2.96 2.12 2.13 0.05 2.31 2.96
4096 1000 2.53 3.59 2.59 2.60 0.05 2.74 3.59
8192 1000 2.90 3.41 2.96 2.98 0.07 3.20 3.41
16384 1000 3.57 4.69 3.68 3.73 0.13 4.16 4.69
32768 1000 4.92 6.43 5.06 5.12 0.17 5.63 6.43
65536 1000 8.02 9.59 8.26 8.26 0.15 8.57 9.59
131072 1000 17.87 19.14 18.18 18.20 0.12 18.51 19.14
262144 1000 28.60 30.65 29.20 29.19 0.28 29.74 30.65
524288 1000 50.06 54.90 50.69 51.15 1.14 54.12 54.90
1048576 1000 92.81 99.26 95.02 95.22 1.70 98.74 99.26
2097152 1000 178.71 184.45 183.33 183.32 0.48 184.19 184.45
4194304 1000 349.79 355.46 351.75 351.83 1.29 355.16 355.46
8388608 1000 692.37 699.42 694.63 694.71 1.23 698.25 699.42
---------------------------------------------------------------------------------------

Additional info:
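The size of the regression can be worked out from the two result tables in this description. The sketch below is illustrative only: it divides the t_avg value from "Actual results" by the t_avg value from "Expected results" for the two affected message sizes. Using t_avg as the comparison metric is an assumption (the description does not say which column its fold figure was derived from), and the names used are hypothetical.

```python
# t_avg[usec] values copied from the "Actual results" and "Expected results"
# tables in this description, for the two affected message sizes.
ACTUAL_T_AVG = {4194304: 97111.29, 8388608: 39726.68}
EXPECTED_T_AVG = {4194304: 351.83, 8388608: 694.71}


def slowdown_factors(actual, expected):
    """Return actual/expected t_avg ratio per message size."""
    return {size: actual[size] / expected[size] for size in actual}


factors = slowdown_factors(ACTUAL_T_AVG, EXPECTED_T_AVG)
for size, factor in sorted(factors.items()):
    print(f"{size} bytes: t_avg is about {factor:.0f}x the expected value")
```

Measured by t_avg, the 4194304-byte case is the one close to the fold increase quoted in the description; the 8388608-byte case shows a smaller but still large ratio.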