Bug 2212555

Summary: [RHEL-9.3] all test cases in perftest failed with "Unexpected CM event bl blka 7" error on iRDMA RoCE devices
Product: Red Hat Enterprise Linux 9 Reporter: Brian Chae <bchae>
Component: perftest Assignee: Kamal Heib <kheib>
Status: CLOSED NOTABUG QA Contact: Brian Chae <bchae>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 9.3 CC: dacampbe, dledford, hwkernel-mgr, rdma-dev-team, zguo
Target Milestone: rc Keywords: Regression
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-06-26 14:23:26 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Brian Chae 2023-06-05 21:11:09 UTC
Description of problem:

All perftest test cases failed with the following error:

Unexpected CM event bl blka 7
 Unable to perform rdma_client function
 Unable to init the socket connection

and a return code of 1.
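
The trailing number in the error message is the numeric rdma_cm event type. As a minimal sketch, assuming the numbering follows the order of enum rdma_cm_event_type in rdma-core's rdma_cma.h (the helper name cm_event_name is hypothetical and only the first few values are listed), it can be decoded like this:

```shell
# Hypothetical helper: map an rdma_cm event number to its enum name.
# Assumes the enum order from rdma-core's <rdma/rdma_cma.h>.
cm_event_name() {
    case "$1" in
        0)  echo RDMA_CM_EVENT_ADDR_RESOLVED ;;
        1)  echo RDMA_CM_EVENT_ADDR_ERROR ;;
        2)  echo RDMA_CM_EVENT_ROUTE_RESOLVED ;;
        3)  echo RDMA_CM_EVENT_ROUTE_ERROR ;;
        4)  echo RDMA_CM_EVENT_CONNECT_REQUEST ;;
        5)  echo RDMA_CM_EVENT_CONNECT_RESPONSE ;;
        6)  echo RDMA_CM_EVENT_CONNECT_ERROR ;;
        7)  echo RDMA_CM_EVENT_UNREACHABLE ;;
        8)  echo RDMA_CM_EVENT_REJECTED ;;
        9)  echo RDMA_CM_EVENT_ESTABLISHED ;;
        10) echo RDMA_CM_EVENT_DISCONNECTED ;;
        *)  echo "unlisted event ($1)" ;;
    esac
}

# Decode the event number reported in this bug.
cm_event_name 7
```

Under that assumption, event 7 means the client was told the remote side was unreachable during connection setup, rather than having been explicitly rejected.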


perftest test results on rdma-qe-38/rdma-qe-39 & Beaker job J:7886455:
5.14.0-316.el9.x86_64, rdma-core-44.0-2.el9, i40e, roce.45, E810-XXV & irdma0
    Result | Status | Test
  ---------+--------+------------------------------------
      FAIL |      1 | ib_read_bw RC
      FAIL |      1 | ib_read_lat RC
      FAIL |      1 | ib_send_bw RC
      FAIL |      1 | ib_send_lat RC
      FAIL |      1 | ib_write_bw RC
      FAIL |      1 | ib_write_lat RC

This is a regression from RHEL-9.2.0-20230228.28.



Version-Release number of selected component (if applicable):

Clients: rdma-qe-39
Servers: rdma-qe-38

DISTRO=RHEL-9.3.0-20230521.45

+ [23-05-25 12:23:55] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.3 Beta (Plow)

+ [23-05-25 12:23:55] uname -a
Linux rdma-qe-39.rdma.lab.eng.rdu2.redhat.com 5.14.0-316.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Fri May 19 13:18:40 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

+ [23-05-25 12:23:55] cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.14.0-316.el9.x86_64 root=UUID=d310605f-8ec9-46e6-ae94-14a823879fce ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=8089ef9b-fb2f-45c9-861e-a5781cbf8740 console=ttyS0,115200n81

+ [23-05-25 12:23:55] rpm -q rdma-core linux-firmware
rdma-core-44.0-2.el9.x86_64
linux-firmware-20230404-134.el9.noarch

+ [23-05-25 12:23:55] tail /sys/class/infiniband/irdma0/fw_ver /sys/class/infiniband/irdma1/fw_ver
==> /sys/class/infiniband/irdma0/fw_ver <==
1.57

==> /sys/class/infiniband/irdma1/fw_ver <==
1.57

+ [23-05-25 12:23:55] lspci
+ [23-05-25 12:23:55] grep -i -e ethernet -e infiniband -e omni -e ConnectX
41:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
41:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
c1:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
c1:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe

+ [23-05-25 12:23:55] rpm -q perftest
perftest-4.5.0.20-4.el9.x86_64


How reproducible:

100%

Steps to Reproduce:
1. On the server host, issue the following perftest commands:

timeout 3m ib_read_bw -a -c RC -d irdma0 -i 1 -F -R
timeout 3m ib_read_lat -a -c RC -d irdma0 -i 1 -F -R
timeout 3m ib_send_bw -a -c RC -d irdma0 -i 1 -F -R 
timeout 3m ib_send_lat -a -c RC -d irdma0 -i 1 -F -R 
timeout 3m ib_write_bw -a -c RC -d irdma0 -i 1 -F -R 
timeout 3m ib_write_lat -a -c RC -d irdma0 -i 1 -F -R 

2. On the client host, issue the following perftest commands:

timeout 3m ib_read_bw -a -c RC -d irdma0 -i 1 -F -R 172.31.45.38
timeout 3m ib_read_lat -a -c RC -d irdma0 -i 1 -F -R 172.31.45.38
timeout 3m ib_send_bw -a -c RC -d irdma0 -i 1 -F -R 172.31.45.38
timeout 3m ib_send_lat -a -c RC -d irdma0 -i 1 -F -R 172.31.45.38
timeout 3m ib_write_bw -a -c RC -d irdma0 -i 1 -F -R 172.31.45.38
timeout 3m ib_write_lat -a -c RC -d irdma0 -i 1 -F -R 172.31.45.38
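
The two command lists above differ only in the trailing server address, so they can be generated from one loop. This is a rough sketch, not the QE test script itself: the function name perftest_cmds is hypothetical, and it only prints the command lines so they can be reviewed before being run on each host.

```shell
# Hypothetical helper: print the six perftest command lines from the steps above.
# $1 = RDMA device (e.g. irdma0)
# $2 = server address (omit on the server host)
perftest_cmds() {
    dev=$1
    server=$2
    for t in ib_read_bw ib_read_lat ib_send_bw ib_send_lat ib_write_bw ib_write_lat; do
        # ${server:+ $server} appends the address only when one was given.
        echo "timeout 3m $t -a -c RC -d $dev -i 1 -F -R${server:+ $server}"
    done
}

# Server side:
perftest_cmds irdma0
# Client side:
perftest_cmds irdma0 172.31.45.38
```

Piping the output to sh would execute the commands in sequence; the server-side commands must already be waiting before each client command starts.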

Actual results:

All of the above perftest commands resulted in the same errors:

+ [23-05-25 12:23:59] timeout 3m ib_read_bw -a -c RC -d irdma0 -i 1 -F -R 172.31.45.38
Unexpected CM event bl blka 7
 Unable to perform rdma_client function
 Unable to init the socket connection
+ [23-05-25 12:23:59] RQA_check_result -r 1 -t 'ib_read_bw RC'


+ [23-05-25 12:27:30] timeout 3m ib_read_lat -a -c RC -d irdma0 -i 1 -F -R 172.31.45.38
Unexpected CM event bl blka 7
 Unable to perform rdma_client function
 Unable to init the socket connection
+ [23-05-25 12:27:30] RQA_check_result -r 1 -t 'ib_read_lat RC'



+ [23-05-25 12:29:59] timeout 3m ib_send_bw -a -c RC -d irdma0 -i 1 -F -R 172.31.45.38
Unexpected CM event bl blka 7
 Unable to perform rdma_client function
 Unable to init the socket connection
+ [23-05-25 12:29:59] RQA_check_result -r 1 -t 'ib_send_bw RC'


The same failure repeats through the last test in the sequence:


+ [23-05-25 12:39:30] timeout 3m ib_write_lat -a -c RC -d irdma0 -i 1 -F -R 172.31.45.38
Unexpected CM event bl blka 7
 Unable to perform rdma_client function
 Unable to init the socket connection
+ [23-05-25 12:39:30] RQA_check_result -r 1 -t 'ib_write_lat RC'
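
The RQA_check_result lines in the log carry the return code (-r) and test name (-t) that end up in the result table. A small sketch of how those lines could be turned back into table rows follows; it assumes every result line has exactly the form shown in the excerpt above, and the function name summarize_results is hypothetical, not part of the test harness.

```shell
# Hypothetical helper: read a harness log on stdin and print result-table rows.
# Assumes result lines look like: RQA_check_result -r <rc> -t '<test name>'
summarize_results() {
    sed -n "s/.*RQA_check_result -r \([0-9][0-9]*\) -t '\(.*\)'.*/\1 \2/p" |
    while read -r rc name; do
        # A return code of 0 is a pass; anything else is a failure.
        if [ "$rc" -eq 0 ]; then result=PASS; else result=FAIL; fi
        printf '%10s | %6s | %s\n' "$result" "$rc" "$name"
    done
}

# Example with one line taken from the log above.
printf '%s\n' "+ [23-05-25 12:23:59] RQA_check_result -r 1 -t 'ib_read_bw RC'" | summarize_results
```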



Expected results:



perftest test results on rdma-dev-30/rdma-dev-31 & Beaker job J:7586502:
5.14.0-283.el9.x86_64, rdma-core-44.0-2.el9, i40e, iw, E810-C & irdma1
    Result | Status | Test
  ---------+--------+------------------------------------
      PASS |      0 | ib_read_bw RC
      PASS |      0 | ib_read_lat RC
      PASS |      0 | ib_send_bw RC
      FAIL |    135 | ib_send_lat RC
      PASS |      0 | ib_write_bw RC
      FAIL |    135 | ib_write_lat RC



https://beaker-archive.hosts.prod.psi.bos.redhat.com/beaker-logs/2023/03/75865/7586502/13484472/157019398/results_perftest.txt


Additional info:

Comment 2 Brian Chae 2023-06-26 14:23:26 UTC
Per Afom,

Hello all,

Similar to RHEL-8.9, RHEL-9.3 perftest on iRDMA & perftest-23.04.0.0.23-1.el9 is also passing with the "limit inline data size" update to our test script. I am going to update the BZ with this, and we can probably close the BZ.
perftest test results on rdma-qe-38/rdma-qe-39 & Beaker job J:7981566:
5.14.0-327.el9.x86_64, rdma-core-46.0-1.el9, i40e, iw, E810-XXV & irdma1
    Result | Status | Test
  ---------+--------+------------------------------------
      PASS |      0 | ib_read_bw RC
      PASS |      0 | ib_read_lat RC
      PASS |      0 | ib_send_bw RC
      PASS |      0 | ib_send_lat RC
      PASS |      0 | ib_write_bw RC
      PASS |      0 | ib_write_lat RC
Checking for failures and known issues:
  no test failures

perftest test results on rdma-qe-38/rdma-qe-39 & Beaker job J:7981566:
5.14.0-327.el9.x86_64, rdma-core-46.0-1.el9, i40e, roce.45, E810-XXV & irdma0
    Result | Status | Test
  ---------+--------+------------------------------------
      PASS |      0 | ib_read_bw RC
      PASS |      0 | ib_read_lat RC
      PASS |      0 | ib_send_bw RC
      PASS |      0 | ib_send_lat RC
      PASS |      0 | ib_write_bw RC
      PASS |      0 | ib_write_lat RC
Checking for failures and known issues:
  no test failures
Thanks,
Afom

So, we can close this bug...