Bug 1468996

Summary: ib_write_bw failed over ConnectX-4 Lx/ROCE
Product: Red Hat Enterprise Linux 7 Reporter: zguo <zguo>
Component: perftest Assignee: Jarod Wilson <jarod>
Status: CLOSED NOTABUG QA Contact: Infiniband QE <infiniband-qe>
Severity: high Docs Contact:
Priority: unspecified    
Version: 7.4 CC: abeausol, bhu, ddutile, dledford, h.roudbari, kheib, mstowell, rdma-dev-team, salmy
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2017-10-26 13:50:53 UTC Type: Bug

Description zguo 2017-07-10 08:23:44 UTC
Description of problem:
[root@rdma-virt-02 ~]$ ib_write_bw -c RC -d mlx5_2

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : mlx5_2
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 2
 Max inline data : 0[B]
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x02ff PSN 0xf115ef RKey 0x008458 VAddr 0x002b5eb80d2000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:31:43:92
 remote address: LID 0000 QPN 0x0301 PSN 0xd1ed0b RKey 0x00641b VAddr 0x002b766ae6f000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:31:40:93
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
ethernet_read_keys: Couldn't read remote address
 Unable to read to socket/rdam_cm
 Failed to exchange data between server and clients
[root@rdma-virt-03 ~]$ timeout 3m ib_write_bw 172.31.40.92 -c RC -d mlx5_2
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : mlx5_2
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 2
 Max inline data : 0[B]
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0301 PSN 0xd1ed0b RKey 0x00641b VAddr 0x002b766ae6f000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:31:40:93
 remote address: LID 0000 QPN 0x02ff PSN 0xf115ef RKey 0x008458 VAddr 0x002b5eb80d2000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:31:43:92
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 Completion with error at client
 Failed status 12: wr_id 0 syndrom 0x81
scnt=128, ccnt=0
 Failed to complete run_iter_bw function successfully

Version-Release number of selected component (if applicable):
[root@rdma-virt-02 ~]$ ethtool -i mlx5_roce
driver: mlx5_core
version: 3.0-1 (January 2015)
firmware-version: 14.18.1000
expansion-rom-version: 
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
[root@rdma-virt-02 ~]$ ibstat mlx5_2
CA 'mlx5_2'
	CA type: MT4117
	Number of ports: 1
	Firmware version: 14.18.1000
	Hardware version: 0
	Node GUID: 0xe41d2d0300fda72a
	System image GUID: 0xe41d2d0300fda72a
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 40
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x04010000
		Port GUID: 0xe61d2dfffefda72a
		Link layer: Ethernet
[root@rdma-virt-02 ~]$ lspci | grep Mell
04:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
04:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
05:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
05:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
[root@rdma-virt-02 ~]$ rpm -q perftest
perftest-3.4-1.el7.x86_64 

[root@rdma-virt-03 ~]$ uname -r
3.10.0-693.el7.x86_64
[root@rdma-virt-03 ~]$ ethtool -i mlx5_roce
driver: mlx5_core
version: 3.0-1 (January 2015)
firmware-version: 14.18.1000
expansion-rom-version: 
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
[root@rdma-virt-03 ~]$ ibstat mlx5_2
CA 'mlx5_2'
	CA type: MT4117
	Number of ports: 1
	Firmware version: 14.18.1000
	Hardware version: 0
	Node GUID: 0xe41d2d0300fda736
	System image GUID: 0xe41d2d0300fda736
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 40
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x04010000
		Port GUID: 0xe61d2dfffefda736
		Link layer: Ethernet

[root@rdma-virt-03 ~]$ lspci | grep Mell
04:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
04:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
05:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
05:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]


How reproducible:


Steps to Reproduce:
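1. On the server (rdma-virt-02), run: ib_write_bw -c RC -d mlx5_2
2. On the client (rdma-virt-03), run: timeout 3m ib_write_bw 172.31.40.92 -c RC -d mlx5_2
3. Check the output on both hosts.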

Actual results:
 Completion with error at client
 Failed status 12: wr_id 0 syndrom 0x81
scnt=128, ccnt=0
 Failed to complete run_iter_bw function successfully
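
A note for readers decoding the failure above: the number that perftest prints after "Failed status" is the numeric ibv_wc_status value from libibverbs. Status 12 is IBV_WC_RETRY_EXC_ERR (transport retry counter exceeded), which means the requester stopped retrying because the remote side never acknowledged its packets. The Python sketch below maps the status number to its enum name. The helper name decode_failure is hypothetical, and the vendor syndrome (0x81) is only parsed, not interpreted.

```python
import re

# Numeric values of enum ibv_wc_status as declared in libibverbs' verbs.h.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    6: "IBV_WC_MW_BIND_ERR",
    7: "IBV_WC_BAD_RESP_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
    14: "IBV_WC_LOC_RDD_VIOL_ERR",
    15: "IBV_WC_REM_INV_RD_REQ_ERR",
    16: "IBV_WC_REM_ABORT_ERR",
    17: "IBV_WC_INV_EECN_ERR",
    18: "IBV_WC_INV_EEC_STATE_ERR",
    19: "IBV_WC_FATAL_ERR",
    20: "IBV_WC_RESP_TIMEOUT_ERR",
    21: "IBV_WC_GENERAL_ERR",
}

# Matches the perftest line exactly as printed, including its "syndrom" spelling.
_FAILURE_RE = re.compile(
    r"Failed status (\d+): wr_id (\d+) syndrom (0x[0-9a-fA-F]+)"
)


def decode_failure(line):
    """Return (status_name, wr_id, vendor_syndrome) for a perftest failure line."""
    match = _FAILURE_RE.search(line)
    if match is None:
        raise ValueError("not a perftest 'Failed status' line: %r" % line)
    status = int(match.group(1))
    name = IBV_WC_STATUS.get(status, "unknown status %d" % status)
    return name, int(match.group(2)), int(match.group(3), 16)


if __name__ == "__main__":
    print(decode_failure(" Failed status 12: wr_id 0 syndrom 0x81"))
```

A retry-exceeded completion is consistent with RDMA packets not reaching the peer, but on its own it does not identify the cause.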

Expected results:
ib_write_bw runs successfully.

Additional info:
The issue can also be reproduced on RHEL 7.3 with kernel 3.10.0-514.el7.x86_64.

Comment 4 Mike Stowell 2017-10-26 13:50:53 UTC
I cannot reproduce this on the same hosts, same perftest, same kernel, and same firmware.  Closing as NOTABUG.


[root@rdma-virt-02 ~]$ ib_write_bw -c RC -d mlx5_2

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : mlx5_2
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 2
 Max inline data : 0[B]
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x035b PSN 0xb4be04 RKey 0x044fb2 VAddr 0x002ae0d7ed3000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:31:45:92
 remote address: LID 0000 QPN 0x035a PSN 0x313d6d RKey 0x04194a VAddr 0x002aed4b283000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:31:45:93
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      5000             1163.91            1163.91		   0.018623
---------------------------------------------------------------------------------------

[root@rdma-virt-03 ~]$ timeout 3m ib_write_bw 172.31.40.92 -c RC -d mlx5_2
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : mlx5_2
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 2
 Max inline data : 0[B]
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x035a PSN 0x313d6d RKey 0x04194a VAddr 0x002aed4b283000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:31:45:93
 remote address: LID 0000 QPN 0x035b PSN 0xb4be04 RKey 0x044fb2 VAddr 0x002ae0d7ed3000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:172:31:45:92
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      5000             1163.91            1163.91		   0.018623
---------------------------------------------------------------------------------------
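
As a sanity check on the successful run above, the reported message rate can be recomputed from the reported bandwidth and message size. The Python sketch below assumes that perftest's "MB" means 2^20 bytes, which is the assumption that reproduces the printed MsgRate. The function names are hypothetical.

```python
MIB = 1 << 20  # assumption: perftest reports bandwidth in units of 2**20 bytes


def msg_rate_mpps(bw_mb_per_sec, msg_size_bytes):
    """Messages per second, in millions, implied by a bandwidth and message size."""
    return bw_mb_per_sec * MIB / msg_size_bytes / 1e6


def bw_gbit_per_sec(bw_mb_per_sec):
    """Convert a bandwidth in MB/sec (2**20 bytes) to decimal gigabits per second."""
    return bw_mb_per_sec * MIB * 8 / 1e9


if __name__ == "__main__":
    # Values taken from the result row: 65536 bytes, 1163.91 MB/sec average.
    print("MsgRate [Mpps]: %.6f" % msg_rate_mpps(1163.91, 65536))
    print("Bandwidth [Gbit/s]: %.2f" % bw_gbit_per_sec(1163.91))
```

With the values from the result row, this gives a message rate that rounds to the printed 0.018623 Mpps and a bandwidth of roughly 9.76 Gbit/s.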


Info:
[root@rdma-virt-02 ~]$ ethtool -i mlx5_roce
driver: mlx5_core
version: 3.0-1 (January 2015)
firmware-version: 14.18.1000
expansion-rom-version: 
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
[root@rdma-virt-02 ~]$ ibstat mlx5_2
CA 'mlx5_2'
	CA type: MT4117
	Number of ports: 1
	Firmware version: 14.18.1000
	Hardware version: 0
	Node GUID: 0xe41d2d0300fda72a
	System image GUID: 0xe41d2d0300fda72a
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 40
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x04010000
		Port GUID: 0xe61d2dfffefda72a
		Link layer: Ethernet
[root@rdma-virt-02 ~]$ lspci | grep Mell
04:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
04:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
05:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
05:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
[root@rdma-virt-02 ~]$ rpm -q perftest
perftest-3.4-1.el7.x86_64
[root@rdma-virt-02 ~]$ uname -r
3.10.0-693.el7.x86_64

Comment 5 Hamed 2021-01-13 18:58:22 UTC
Hello @zguo,


I'd like to ask if you could please share the solution which you came up with. I've encountered the exact same error during this test.

I'm using two ConnectX-4 Lx cards (installed on separate machines in the same LAN).

 
Any tips or advice would be much appreciated!


Best regards,
Hamed

Comment 6 Don Dutile (Red Hat) 2021-01-13 19:06:44 UTC
(In reply to Hamed from comment #5)

zguo is offline at the moment.
Try updating perftest to the latest release.
That is all we did to stop seeing the error, which is why we closed the BZ.

Comment 7 zguo 2021-01-14 02:49:13 UTC
(In reply to Don Dutile (Red Hat) from comment #6)

Thanks Don.

Hi Hamed,

What I can suggest is to make sure that:

1) the server's ConnectX-4 Lx port can ping the client's ConnectX-4 Lx port successfully
2) you are using the latest perftest
3) the command parameters are correct
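
To work out which addresses to ping for the first check, the GID lines printed by ib_write_bw can be converted back to IPv4 addresses: on RoCE with an IPv4-based GID, the last four decimal fields are the interface address. The Python sketch below does that conversion and, optionally, checks whether two GIDs share a subnet. The function names are hypothetical, and the /24 prefix length is an assumption, since the netmask of the test network does not appear in this report.

```python
import ipaddress


def gid_to_ipv4(gid):
    """Convert a perftest-style GID (16 decimal fields) to an IPv4 address.

    Only IPv4-mapped GIDs (ten zero bytes, then 255:255, then the address)
    are accepted.
    """
    fields = [int(field) for field in gid.strip().split(":")]
    if len(fields) != 16:
        raise ValueError("expected 16 colon-separated fields: %r" % gid)
    if fields[:10] != [0] * 10 or fields[10:12] != [255, 255]:
        raise ValueError("not an IPv4-mapped GID: %r" % gid)
    return ipaddress.IPv4Address(bytes(fields[12:]))


def same_subnet(gid_a, gid_b, prefix_len=24):
    """True if both GIDs map into the same IPv4 network (prefix_len is assumed)."""
    network = ipaddress.ip_network(
        "%s/%d" % (gid_to_ipv4(gid_a), prefix_len), strict=False
    )
    return gid_to_ipv4(gid_b) in network


if __name__ == "__main__":
    # GIDs copied from the failing run in the description.
    server = "00:00:00:00:00:00:00:00:00:00:255:255:172:31:43:92"
    client = "00:00:00:00:00:00:00:00:00:00:255:255:172:31:40:93"
    print(gid_to_ipv4(server), gid_to_ipv4(client), same_subnet(server, client))
```

In the failing run the two GIDs map to 172.31.43.92 and 172.31.40.93, while in the successful run in comment 4 they map to 172.31.45.92 and 172.31.45.93. Whether that difference matters depends on the actual netmask and routing, so treat the subnet check as a hint about what to ping rather than as a diagnosis.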

Comment 8 Hamed 2021-01-17 22:12:57 UTC
Hi Don, zguo,

Your advice is incredibly helpful and appreciated.
Thanks for your prompt replies!