Bug 1927794

Summary: [RHEL8.4] pyverbs-tests failed with 5 errors when tested on MLX5 ROCE with bonding
Product: Red Hat Enterprise Linux 8
Reporter: Brian Chae <bchae>
Component: rdma-core
Assignee: Nobody <nobody>
Status: CLOSED WONTFIX
QA Contact: Brian Chae <bchae>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 8.4
CC: hwkernel-mgr, linville, rdma-dev-team
Target Milestone: rc
Keywords: Triaged
Target Release: 8.4
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-08-11 07:28:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1903942

Description Brian Chae 2021-02-11 14:35:43 UTC
Description of problem:

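When run on MLX5 ROCE with "bond" and/or "team" interfaces, pyverbs-tests failed with 5 errors. All five tests failed at the same point: modifying the QP state to RTR returned errno 101 (Network is unreachable). This was discovered while running pyverbs-tests for the bug verification of bz1907377.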

Version-Release number of selected component (if applicable):

DISTRO=RHEL-8.4.0-20210210.n.0

Red Hat Enterprise Linux release 8.4 Beta (Ootpa)
Linux rdma-dev-20.lab.bos.redhat.com 4.18.0-284.el8.x86_64 #1 SMP Mon Feb 8 05:01:40 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-284.el8.x86_64 root=/dev/mapper/rhel_rdma--dev--20-root ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=/dev/mapper/rhel_rdma--dev--20-swap rd.lvm.lv=rhel_rdma-dev-20/root rd.lvm.lv=rhel_rdma-dev-20/swap console=ttyS1,115200n81

rdma-core-32.0-4.el8.x86_64

linux-firmware-20201218-102.git05789708.el8.noarch
==> /sys/class/infiniband/mlx5_2/fw_ver <==
12.23.1020

==> /sys/class/infiniband/mlx5_3/fw_ver <==
12.23.1020

==> /sys/class/infiniband/mlx5_bond_0/fw_ver <==
14.25.1020
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
04:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
04:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
82:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
82:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

installed:
  python3-pyverbs-32.0-4.el8.x86_64

RDMA hosts
  Clients: rdma-dev-20
  Servers: rdma-dev-19

How reproducible:
100%

Steps to Reproduce:

1. With the above RHEL8.4 build, install the following packages on both server and client hosts

    python3-pyverbs-32.0-4.el8.x86_64  


2. Execute the pyverbs tests

    ./run_tests.py -v --dev $HCA_ID
        
         <HCA_ID: mlx5_bond_0>

Actual results:

======================================================================
ERROR: test_xrc_traffic_cq_ex (tests.test_cqex.CqExTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/tmp.Quf1VN9J6O/rdma-core/tests/test_cqex.py", line 77, in test_xrc_traffic_cq_ex
    client, server = self.create_players('xrc')
  File "/tmp/tmp.Quf1VN9J6O/rdma-core/tests/test_cqex.py", line 62, in create_players
    client.pre_run(server.psns, server.qps_num)
  File "/tmp/tmp.Quf1VN9J6O/rdma-core/tests/base.py", line 605, in pre_run
    self.to_rts()
  File "/tmp/tmp.Quf1VN9J6O/rdma-core/tests/base.py", line 593, in to_rts
    self.rqp_lst[i].to_rts(qp_attr)
  File "qp.pyx", line 1122, in pyverbs.qp.QP.to_rts
  File "qp.pyx", line 1108, in pyverbs.qp.QP.to_rtr
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to modify QP state to RTR. Errno: 101, Network is unreachable

======================================================================
ERROR: test_rc_modify_lag_port (tests.test_mlx5_lag_affinity.LagPortTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/tmp.Quf1VN9J6O/rdma-core/tests/test_mlx5_lag_affinity.py", line 82, in test_rc_modify_lag_port
    self.create_players(RCResources)
  File "/tmp/tmp.Quf1VN9J6O/rdma-core/tests/test_mlx5_lag_affinity.py", line 76, in create_players
    self.client.pre_run(self.server.psns, self.server.qps_num)
  File "/tmp/tmp.Quf1VN9J6O/rdma-core/tests/base.py", line 474, in pre_run
    self.to_rts()
  File "/tmp/tmp.Quf1VN9J6O/rdma-core/tests/base.py", line 463, in to_rts
    self.qps[i].to_rts(attr)
  File "qp.pyx", line 1122, in pyverbs.qp.QP.to_rts
  File "qp.pyx", line 1108, in pyverbs.qp.QP.to_rtr
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to modify QP state to RTR. Errno: 101, Network is unreachable

======================================================================
ERROR: test_odp_sync_prefetch_rc_traffic (tests.test_odp.OdpTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/tmp.Quf1VN9J6O/rdma-core/tests/test_odp.py", line 153, in test_odp_sync_prefetch_rc_traffic
    prefetch_advice=advice)
  File "/tmp/tmp.Quf1VN9J6O/rdma-core/tests/test_odp.py", line 94, in create_players
    client.pre_run(server.psns, server.qps_num)
  File "/tmp/tmp.Quf1VN9J6O/rdma-core/tests/base.py", line 474, in pre_run
    self.to_rts()
  File "/tmp/tmp.Quf1VN9J6O/rdma-core/tests/base.py", line 463, in to_rts
    self.qps[i].to_rts(attr)
  File "qp.pyx", line 1122, in pyverbs.qp.QP.to_rts
  File "qp.pyx", line 1108, in pyverbs.qp.QP.to_rtr
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to modify QP state to RTR. Errno: 101, Network is unreachable

======================================================================
ERROR: test_mem_align_srq_excq_rc_traffic (tests.test_parent_domain.ParentDomainTrafficTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/tmp.Quf1VN9J6O/rdma-core/tests/test_parent_domain.py", line 186, in test_mem_align_srq_excq_rc_traffic
    alloc_func=mem_align_allocator, free_func=free_func)
  File "/tmp/tmp.Quf1VN9J6O/rdma-core/tests/test_parent_domain.py", line 155, in create_players
    self.client.pre_run(self.server.psns, self.server.qps_num)
  File "/tmp/tmp.Quf1VN9J6O/rdma-core/tests/base.py", line 474, in pre_run
    self.to_rts()
  File "/tmp/tmp.Quf1VN9J6O/rdma-core/tests/base.py", line 463, in to_rts
    self.qps[i].to_rts(attr)
  File "qp.pyx", line 1122, in pyverbs.qp.QP.to_rts
  File "qp.pyx", line 1108, in pyverbs.qp.QP.to_rtr
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to modify QP state to RTR. Errno: 101, Network is unreachable

======================================================================
ERROR: test_qp_ex_rc_send (tests.test_qpex.QpExTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/tmp.Quf1VN9J6O/rdma-core/tests/test_qpex.py", line 190, in test_qp_ex_rc_send
    client, server = self.create_players('rc_send')
  File "/tmp/tmp.Quf1VN9J6O/rdma-core/tests/test_qpex.py", line 180, in create_players
    client.pre_run(server.psns, server.qps_num)
  File "/tmp/tmp.Quf1VN9J6O/rdma-core/tests/base.py", line 474, in pre_run
    self.to_rts()
  File "/tmp/tmp.Quf1VN9J6O/rdma-core/tests/base.py", line 463, in to_rts
    self.qps[i].to_rts(attr)
  File "qp.pyx", line 1122, in pyverbs.qp.QP.to_rts
  File "qp.pyx", line 1108, in pyverbs.qp.QP.to_rtr
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to modify QP state to RTR. Errno: 101, Network is unreachable

----------------------------------------------------------------------
Ran 161 tests in 7.710s

FAILED (errors=5, skipped=61)
---
- TEST RESULT FOR rdma-core
-   Test:   Run pyverbs tests
-   Result: FAIL
-   Return: 1
---
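
To check quickly that every error in a run like the one above shares the same cause, the unittest output can be summarized with a short script. This is only a sketch: the function name summarize_failures and the SAMPLE_LOG excerpt (two of the five errors, with the tracebacks shortened) are illustrative and are not part of rdma-core or pyverbs.

```python
import re
from collections import Counter

# Matches the unittest section header, e.g.
# "ERROR: test_qp_ex_rc_send (tests.test_qpex.QpExTestCase)"
ERROR_HEADER = re.compile(r"^(?:ERROR|FAIL): (\S+) \((\S+)\)$")
# Matches the errno reported by pyverbs, e.g. "Errno: 101, Network is unreachable"
ERRNO_LINE = re.compile(r"Errno: (\d+), (.+)$")

# Shortened excerpt of the log in this report (assumption: the real log is
# read from a file captured from ./run_tests.py -v).
SAMPLE_LOG = """\
======================================================================
ERROR: test_xrc_traffic_cq_ex (tests.test_cqex.CqExTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "qp.pyx", line 1108, in pyverbs.qp.QP.to_rtr
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to modify QP state to RTR. Errno: 101, Network is unreachable

======================================================================
ERROR: test_qp_ex_rc_send (tests.test_qpex.QpExTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "qp.pyx", line 1108, in pyverbs.qp.QP.to_rtr
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to modify QP state to RTR. Errno: 101, Network is unreachable
"""


def summarize_failures(log_text):
    """Return (test, test_class, errno, message) for each errored test in the log."""
    results = []
    current = None  # (test, test_class) of the section being read
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        header = ERROR_HEADER.match(line)
        if header:
            current = (header.group(1), header.group(2))
            continue
        errno_match = ERRNO_LINE.search(line)
        if errno_match and current is not None:
            results.append((current[0], current[1],
                            int(errno_match.group(1)),
                            errno_match.group(2).strip()))
            current = None
    return results


if __name__ == "__main__":
    failures = summarize_failures(SAMPLE_LOG)
    for test, test_class, code, message in failures:
        print(f"{test_class}.{test}: errno {code} ({message})")
    # Count how many errors share each (errno, message) pair.
    print(Counter((code, message) for _, _, code, message in failures))
```

Run against the full log, a single (errno, message) pair in the counter would confirm that all five errors come from the same QP transition failure.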


Expected results:

Normal pyverbs execution with all tests passing.

Additional info:

Running pyverbs-tests on MLX5 ROCE without bonding showed no such issues.

Results on the RDMA host pair rdma-dev-21 / rdma-dev-22:

Ran 161 tests in 8.028s

OK (skipped=54)
---
- TEST RESULT FOR rdma-core
-   Test:   Run pyverbs tests
-   Result: PASS
-   Return: 0
---
/mnt/tests/kernel/infiniband/pyverbs-tests
---
- TEST RESULT FOR pyverbs-tests
-   Test:   Remove temp directory
-   Result: PASS
-   Return: 0
---

test_xrc_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok


test_raw_modify_lag_port (tests.test_mlx5_lag_affinity.LagPortTestCase) ... skipped 'Set LAG affinity is not supported on this device'
test_rc_modify_lag_port (tests.test_mlx5_lag_affinity.LagPortTestCase) ... skipped 'Set LAG affinity is not supported on this device'
test_ud_modify_lag_port (tests.test_mlx5_lag_affinity.LagPortTestCase) ... skipped 'Set LAG affinity is not supported on this device'

test_odp_async_prefetch_rc_traffic (tests.test_odp.OdpTestCase) ... ok
test_odp_implicit_async_prefetch_rc_traffic (tests.test_odp.OdpTestCase) ... ok
test_odp_implicit_rc_traffic (tests.test_odp.OdpTestCase) ... ok
test_odp_implicit_sync_prefetch_rc_traffic (tests.test_odp.OdpTestCase) ... ok
test_odp_prefetch_async_no_page_fault_rc_traffic (tests.test_odp.OdpTestCase) ... skipped 'Advise MR with flags (0) and advice (2) is not supported'
test_odp_prefetch_sync_no_page_fault_rc_traffic (tests.test_odp.OdpTestCase) ... skipped 'Advise MR with flags (1) and advice (2) is not supported'
test_odp_rc_huge_traffic (tests.test_odp.OdpTestCase) ... skipped 'There are no huge pages of size 2M allocated'
test_odp_rc_huge_user_addr_traffic (tests.test_odp.OdpTestCase) ... skipped 'There are no huge pages of size 2M allocated'
test_odp_rc_srq_traffic (tests.test_odp.OdpTestCase) ... ok
test_odp_rc_traffic (tests.test_odp.OdpTestCase) ... ok
test_odp_sync_prefetch_rc_traffic (tests.test_odp.OdpTestCase) ... ok
test_odp_ud_traffic (tests.test_odp.OdpTestCase) ... ok
test_odp_xrc_traffic (tests.test_odp.OdpTestCase) ... ok


test_default_allocators_rc_traffic (tests.test_parent_domain.ParentDomainTrafficTest) ... ok
test_huge_page_traffic (tests.test_parent_domain.ParentDomainTrafficTest) ... skipped 'There are no huge pages of size 2M allocated'
test_mem_align_rc_traffic (tests.test_parent_domain.ParentDomainTrafficTest) ... ok
test_mem_align_srq_excq_rc_traffic (tests.test_parent_domain.ParentDomainTrafficTest) ... ok
test_mem_align_ud_traffic (tests.test_parent_domain.ParentDomainTrafficTest) ... ok
test_without_allocators_rc_traffic (tests.test_parent_domain.ParentDomainTrafficTest) ... ok


test_qp_ex_rc_atomic_cmp_swp (tests.test_qpex.QpExTestCase) ... ok
test_qp_ex_rc_atomic_fetch_add (tests.test_qpex.QpExTestCase) ... ok
test_qp_ex_rc_bind_mw (tests.test_qpex.QpExTestCase) ... ok
test_qp_ex_rc_rdma_read (tests.test_qpex.QpExTestCase) ... ok
test_qp_ex_rc_rdma_write (tests.test_qpex.QpExTestCase) ... ok
test_qp_ex_rc_rdma_write_imm (tests.test_qpex.QpExTestCase) ... ok
test_qp_ex_rc_send (tests.test_qpex.QpExTestCase) ... ok
test_qp_ex_rc_send_imm (tests.test_qpex.QpExTestCase) ... ok
test_qp_ex_ud_send (tests.test_qpex.QpExTestCase) ... ok
test_qp_ex_ud_send_imm (tests.test_qpex.QpExTestCase) ... ok
test_qp_ex_xrc_send (tests.test_qpex.QpExTestCase) ... ok
test_qp_ex_xrc_send_imm (tests.test_qpex.QpExTestCase) ... ok

So the above errors on rdma-dev-19/20 appear to be unique to the bonding HCA.
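
The comparison between the two host pairs can also be done mechanically. The sketch below looks up each test that errored on the bonded device in the verbose output of the non-bonded run. The helper names (errored_tests, parse_statuses, cross_check) are illustrative, and the two inputs are excerpts copied from this report rather than complete logs.

```python
import re

# "ERROR: test_name (module.Class)" headers from the failing run.
ERROR_HEADER = re.compile(r"^ERROR: (\S+) \((\S+)\)$")
# "test_name (module.Class) ... status" lines from a verbose run.
VERBOSE_LINE = re.compile(r"^(\S+) \((\S+)\) \.\.\. (.+)$")

# Error headers from the bonded run (rdma-dev-19/20, mlx5_bond_0).
BONDED_ERRORS = """\
ERROR: test_xrc_traffic_cq_ex (tests.test_cqex.CqExTestCase)
ERROR: test_rc_modify_lag_port (tests.test_mlx5_lag_affinity.LagPortTestCase)
ERROR: test_odp_sync_prefetch_rc_traffic (tests.test_odp.OdpTestCase)
ERROR: test_mem_align_srq_excq_rc_traffic (tests.test_parent_domain.ParentDomainTrafficTest)
ERROR: test_qp_ex_rc_send (tests.test_qpex.QpExTestCase)
"""

# Matching verbose lines from the non-bonded run (rdma-dev-21/22).
NON_BONDED_VERBOSE = """\
test_xrc_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok
test_rc_modify_lag_port (tests.test_mlx5_lag_affinity.LagPortTestCase) ... skipped 'Set LAG affinity is not supported on this device'
test_odp_sync_prefetch_rc_traffic (tests.test_odp.OdpTestCase) ... ok
test_mem_align_srq_excq_rc_traffic (tests.test_parent_domain.ParentDomainTrafficTest) ... ok
test_qp_ex_rc_send (tests.test_qpex.QpExTestCase) ... ok
"""


def errored_tests(log_text):
    """Return (test, test_class) keys for every ERROR header in the log."""
    keys = []
    for line in log_text.splitlines():
        match = ERROR_HEADER.match(line.strip())
        if match:
            keys.append((match.group(1), match.group(2)))
    return keys


def parse_statuses(verbose_log):
    """Map (test, test_class) to the status text of a verbose unittest run."""
    statuses = {}
    for line in verbose_log.splitlines():
        match = VERBOSE_LINE.match(line.strip())
        if match:
            statuses[(match.group(1), match.group(2))] = match.group(3)
    return statuses


def cross_check(error_log, reference_verbose_log):
    """For each errored test, report its status in the reference run."""
    reference = parse_statuses(reference_verbose_log)
    return {key: reference.get(key, "not found in reference run")
            for key in errored_tests(error_log)}


if __name__ == "__main__":
    for (test, test_class), status in cross_check(BONDED_ERRORS, NON_BONDED_VERBOSE).items():
        print(f"{test_class}.{test}: {status}")
```

On these excerpts, four of the five tests are reported as ok on the non-bonded pair, while test_rc_modify_lag_port was skipped there because LAG affinity is not supported on that device, so that test was only ever exercised on the bonded HCA.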

Comment 5 RHEL Program Management 2022-08-11 07:28:06 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.