Bug 1468996 - ib_write_bw failed over ConnectX-4 Lx/ROCE
Summary: ib_write_bw failed over ConnectX-4 Lx/ROCE
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: perftest
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Jarod Wilson
QA Contact: Infiniband QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-07-10 08:23 UTC by zguo
Modified: 2021-01-17 22:12 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-26 13:50:53 UTC
Target Upstream Version:
Embargoed:



Description zguo 2017-07-10 08:23:44 UTC
Description of problem:
[root@rdma-virt-02 ~]$ ib_write_bw -c RC -d mlx5_2

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : mlx5_2
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 2
 Max inline data : 0[B]
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x02ff PSN 0xf115ef RKey 0x008458 VAddr 0x002b5eb80d2000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:31:43:92
 remote address: LID 0000 QPN 0x0301 PSN 0xd1ed0b RKey 0x00641b VAddr 0x002b766ae6f000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:31:40:93
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
ethernet_read_keys: Couldn't read remote address
 Unable to read to socket/rdam_cm
 Failed to exchange data between server and clients
[root@rdma-virt-03 ~]$ timeout 3m ib_write_bw 172.31.40.92 -c RC -d mlx5_2
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : mlx5_2
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 2
 Max inline data : 0[B]
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0301 PSN 0xd1ed0b RKey 0x00641b VAddr 0x002b766ae6f000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:31:40:93
 remote address: LID 0000 QPN 0x02ff PSN 0xf115ef RKey 0x008458 VAddr 0x002b5eb80d2000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:31:43:92
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 Completion with error at client
 Failed status 12: wr_id 0 syndrom 0x81
scnt=128, ccnt=0
 Failed to complete run_iter_bw function successfully
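
The GID lines in the output above are printed as 16 decimal bytes, and an IPv4-mapped RoCE GID carries the interface's IPv4 address in its last four bytes. The following Python sketch (not part of the original report) decodes them. The helper names `gid_to_ipv4` and `same_network` are hypothetical, and the prefix lengths passed to `same_network` are assumptions, since the report does not state the netmask.

```python
import ipaddress


def gid_to_ipv4(gid):
    """Decode a perftest-style GID (16 colon-separated decimal bytes).

    Returns the embedded IPv4 address if the GID is IPv4-mapped
    (::ffff:a.b.c.d), otherwise None.
    """
    octets = [int(part) for part in gid.strip().split(":")]
    if len(octets) != 16 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a 16-byte GID: {gid!r}")
    addr = ipaddress.IPv6Address(bytes(octets))
    return addr.ipv4_mapped  # None unless the GID is ::ffff:a.b.c.d


def same_network(a, b, prefix):
    """True if both addresses fall in the same network for the given prefix.

    The prefix length is an assumption supplied by the caller.
    """
    net_a = ipaddress.ip_network(f"{a}/{prefix}", strict=False)
    net_b = ipaddress.ip_network(f"{b}/{prefix}", strict=False)
    return net_a == net_b


# GIDs copied from the failing run above.
local = gid_to_ipv4("00:00:00:00:00:00:00:00:00:00:255:255:172:31:43:92")
remote = gid_to_ipv4("00:00:00:00:00:00:00:00:00:00:255:255:172:31:40:93")
print(local, remote)  # 172.31.43.92 172.31.40.93
```

For the failing run, the two GIDs decode to 172.31.43.92 and 172.31.40.93, while the client command targets 172.31.40.92. The report does not establish whether this is related to the failure.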

Version-Release number of selected component (if applicable):
[root@rdma-virt-02 ~]$ ethtool -i mlx5_roce
driver: mlx5_core
version: 3.0-1 (January 2015)
firmware-version: 14.18.1000
expansion-rom-version: 
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
[root@rdma-virt-02 ~]$ ibstat mlx5_2
CA 'mlx5_2'
	CA type: MT4117
	Number of ports: 1
	Firmware version: 14.18.1000
	Hardware version: 0
	Node GUID: 0xe41d2d0300fda72a
	System image GUID: 0xe41d2d0300fda72a
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 40
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x04010000
		Port GUID: 0xe61d2dfffefda72a
		Link layer: Ethernet
[root@rdma-virt-02 ~]$ lspci | grep Mell
04:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
04:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
05:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
05:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
[root@rdma-virt-02 ~]$ rpm -q perftest
perftest-3.4-1.el7.x86_64 

[root@rdma-virt-03 ~]$ uname -r
3.10.0-693.el7.x86_64
[root@rdma-virt-03 ~]$ ethtool -i mlx5_roce
driver: mlx5_core
version: 3.0-1 (January 2015)
firmware-version: 14.18.1000
expansion-rom-version: 
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
[root@rdma-virt-03 ~]$ ibstat mlx5_2
CA 'mlx5_2'
	CA type: MT4117
	Number of ports: 1
	Firmware version: 14.18.1000
	Hardware version: 0
	Node GUID: 0xe41d2d0300fda736
	System image GUID: 0xe41d2d0300fda736
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 40
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x04010000
		Port GUID: 0xe61d2dfffefda736
		Link layer: Ethernet

[root@rdma-virt-03 ~]$ lspci | grep Mell
04:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
04:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
05:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
05:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]


How reproducible:


Steps to Reproduce:
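1. On the server (rdma-virt-02), run: ib_write_bw -c RC -d mlx5_2
2. On the client (rdma-virt-03), run: timeout 3m ib_write_bw 172.31.40.92 -c RC -d mlx5_2
3. Check the client output for a completion error.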

Actual results:
 Completion with error at client
 Failed status 12: wr_id 0 syndrom 0x81
scnt=128, ccnt=0
 Failed to complete run_iter_bw function successfully
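
The `Failed status 12` value is the work-completion status reported through libibverbs. Assuming the standard `enum ibv_wc_status` numbering from `<infiniband/verbs.h>`, status 12 corresponds to `IBV_WC_RETRY_EXC_ERR` (transport retry counter exceeded). The Python sketch below, which is not part of the original report and uses hypothetical helper names, parses the perftest error line and maps the status. The vendor-specific syndrome (0x81) is left undecoded.

```python
import re

# Names of enum ibv_wc_status values 0..21, in declaration order.
# Entries added by newer rdma-core releases are omitted here.
IBV_WC_STATUS = [
    "IBV_WC_SUCCESS",
    "IBV_WC_LOC_LEN_ERR",
    "IBV_WC_LOC_QP_OP_ERR",
    "IBV_WC_LOC_EEC_OP_ERR",
    "IBV_WC_LOC_PROT_ERR",
    "IBV_WC_WR_FLUSH_ERR",
    "IBV_WC_MW_BIND_ERR",
    "IBV_WC_BAD_RESP_ERR",
    "IBV_WC_LOC_ACCESS_ERR",
    "IBV_WC_REM_INV_REQ_ERR",
    "IBV_WC_REM_ACCESS_ERR",
    "IBV_WC_REM_OP_ERR",
    "IBV_WC_RETRY_EXC_ERR",
    "IBV_WC_RNR_RETRY_EXC_ERR",
    "IBV_WC_LOC_RDD_VIOL_ERR",
    "IBV_WC_REM_INV_RD_REQ_ERR",
    "IBV_WC_REM_ABORT_ERR",
    "IBV_WC_INV_EECN_ERR",
    "IBV_WC_INV_EEC_STATE_ERR",
    "IBV_WC_FATAL_ERR",
    "IBV_WC_RESP_TIMEOUT_ERR",
    "IBV_WC_GENERAL_ERR",
]

# Matches the perftest line as it appears in this report (including "syndrom").
_FAILURE_LINE = re.compile(
    r"Failed status (\d+): wr_id (\d+) syndrom (0x[0-9a-fA-F]+)"
)


def wc_status_name(status):
    """Map a numeric completion status to its enum name."""
    if 0 <= status < len(IBV_WC_STATUS):
        return IBV_WC_STATUS[status]
    return f"unknown status {status}"


def parse_failure(line):
    """Return (status, wr_id, syndrome) from a perftest failure line, or None."""
    match = _FAILURE_LINE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), int(match.group(3), 16)


status, wr_id, syndrome = parse_failure(" Failed status 12: wr_id 0 syndrom 0x81")
print(wc_status_name(status), hex(syndrome))  # IBV_WC_RETRY_EXC_ERR 0x81
```

A retry-exceeded completion generally means the requester received no acknowledgement from the remote side, which fits with `scnt=128, ccnt=0` (128 work requests posted, none completed).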

Expected results:
ib_write_bw runs successfully.

Additional info:
The issue can also be reproduced on RHEL 7.3 with kernel 3.10.0-514.el7.x86_64.

Comment 4 Mike Stowell 2017-10-26 13:50:53 UTC
I cannot reproduce this on the same hosts, same perftest, same kernel, and same firmware.  Closing as NOTABUG.


[root@rdma-virt-02 ~]$ ib_write_bw -c RC -d mlx5_2

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : mlx5_2
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 2
 Max inline data : 0[B]
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x035b PSN 0xb4be04 RKey 0x044fb2 VAddr 0x002ae0d7ed3000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:31:45:92
 remote address: LID 0000 QPN 0x035a PSN 0x313d6d RKey 0x04194a VAddr 0x002aed4b283000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:31:45:93
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      5000             1163.91            1163.91		   0.018623
---------------------------------------------------------------------------------------

[root@rdma-virt-03 ~]$ timeout 3m ib_write_bw 172.31.40.92 -c RC -d mlx5_2
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : mlx5_2
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 2
 Max inline data : 0[B]
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x035a PSN 0x313d6d RKey 0x04194a VAddr 0x002aed4b283000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:31:45:93
 remote address: LID 0000 QPN 0x035b PSN 0xb4be04 RKey 0x044fb2 VAddr 0x002ae0d7ed3000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:31:45:92
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      5000             1163.91            1163.91		   0.018623
---------------------------------------------------------------------------------------


Info:
[root@rdma-virt-02 ~]$ ethtool -i mlx5_roce
driver: mlx5_core
version: 3.0-1 (January 2015)
firmware-version: 14.18.1000
expansion-rom-version: 
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
[root@rdma-virt-02 ~]$ ibstat mlx5_2
CA 'mlx5_2'
	CA type: MT4117
	Number of ports: 1
	Firmware version: 14.18.1000
	Hardware version: 0
	Node GUID: 0xe41d2d0300fda72a
	System image GUID: 0xe41d2d0300fda72a
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 40
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x04010000
		Port GUID: 0xe61d2dfffefda72a
		Link layer: Ethernet
[root@rdma-virt-02 ~]$ lspci | grep Mell
04:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
04:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
05:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
05:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
[root@rdma-virt-02 ~]$ rpm -q perftest
perftest-3.4-1.el7.x86_64
[root@rdma-virt-02 ~]$ uname -r
3.10.0-693.el7.x86_64

Comment 5 Hamed 2021-01-13 18:58:22 UTC
Hello @zguo,


I'd like to ask if you could please share the solution which you came up with. I've encountered the exact same error during this test.

I'm using two ConnectX-4 Lx cards (installed on separate machines in the same LAN).

 
Any tips, advice would be much appreciated!


Best regards,
Hamed

Comment 6 Don Dutile (Red Hat) 2021-01-13 19:06:44 UTC
(In reply to Hamed from comment #5)
> Hello @zguo,
> 
> 
> I'd like to ask if you could please share the solution which you came up
> with. I've encountered the exact same error during this test.
> 
> I'm using two ConnectX-4 Lx cards (installed on separate machines in the
> same LAN).
> 
>  
> Any tips, advice would be much appreciated!
> 
> 
> Best regards,
> Hamed


zguo offline atm.
Try updating perftest to the latest release.
That's all we did to not see the error any longer, and thus, closed the bz.

Comment 7 zguo 2021-01-14 02:49:13 UTC
(In reply to Don Dutile (Red Hat) from comment #6)
> (In reply to Hamed from comment #5)
> > Hello @zguo,
> > 
> > 
> > I'd like to ask if you could please share the solution which you came up
> > with. I've encountered the exact same error during this test.
> > 
> > I'm using two ConnectX-4 Lx cards (installed on separate machines in the
> > same LAN).
> > 
> >  
> > Any tips, advice would be much appreciated!
> > 
> > 
> > Best regards,
> > Hamed
> 
> 
> zguo offline atm.
> Try updating perftest to the latest release.
> That's all we did to not see the error any longer, and thus, closed the bz.

Thanks Don.

Hi Hamed,

What I can suggest is to make sure that:

1) the server's ConnectX-4 Lx can ping the client's ConnectX-4 Lx successfully
2) you are using the latest perftest
3) the command parameters are correct
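
The first two checks in the list above can be scripted. The Python sketch below is one possible way to do it, not something taken from the thread: the function names `build_checks` and `run_checks` are hypothetical, and it assumes `ping` and `rpm` are available on the host. `rpm -q perftest` only prints the installed version, so comparing it against the latest release remains a manual step, as does reviewing the command parameters.

```python
import shlex
import subprocess


def build_checks(peer_ip):
    """Return (description, argv) pairs for the connectivity and version checks."""
    return [
        ("peer reachable", ["ping", "-c", "3", peer_ip]),
        ("installed perftest version", ["rpm", "-q", "perftest"]),
    ]


def run_checks(checks):
    """Run each command, print its result, and return True if all exited with 0.

    Raises FileNotFoundError if a command is not installed.
    """
    all_ok = True
    for description, argv in checks:
        result = subprocess.run(argv, capture_output=True, text=True)
        ok = result.returncode == 0
        all_ok = all_ok and ok
        print(f"[{'ok' if ok else 'FAIL'}] {description}: {shlex.join(argv)}")
        if result.stdout.strip():
            print(result.stdout.strip())
    return all_ok
```

On the client, `run_checks(build_checks("172.31.40.92"))` would ping the server address used in this report and print the installed perftest package.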

Comment 8 Hamed 2021-01-17 22:12:57 UTC
Hi Don, zguo,

Your advice is incredibly helpful and appreciated.
Thanks for your prompt replies!

