Bug 1798814

Summary: [rhel-7.6.z] dat_ia_close() does not release the virtual function contexts for Mellanox ROCE ports
Product: Red Hat Enterprise Linux 7 Reporter: Honggang LI <honli>
Component: dapl Assignee: Honggang LI <honli>
Status: CLOSED CANTFIX QA Contact: Infiniband QE <infiniband-qe>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 7.6 CC: alex.osadchyy, bchae, ddutile, dledford, honli, mschmidt, rdma-dev-team, tborcin, zguo
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1784193 Environment:
Last Closed: 2020-02-12 08:30:26 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1784193    
Bug Blocks:    

Description Honggang LI 2020-02-06 04:37:09 UTC
+++ This bug was initially created as a clone of Bug #1784193 +++

Description of problem:
Sequential execution of uDAPL API calls that open and close a RoCE port breaks after 28 iterations. This indicates that the close call does not actually release the connection. Tested and observed on IBM Z (s390x). However, the connection leak does not seem to be architecture specific and should also exist on x86.

A similar test was performed with verbs API calls, using ibv_open_device / ibv_close_device. No error was observed in 60 iterations.

Version-Release number of selected component (if applicable):
dapl 2.1.5-2.el7

How reproducible:
Always. The uDAPL code below fails after 28 open/close iterations.
 
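  for( int i = 0 ; i < 60 ; i++ )
  {
     DAT_IA_HANDLE  iaHandle  = DAT_HANDLE_NULL;
     DAT_EVD_HANDLE evdHandle = DAT_HANDLE_NULL;
     cout << "open number " << i << endl ;
     // Open the interface adapter and check the result before closing it.
     if (DAT_SUCCESS != (status = dat_ia_open(gDevName, SVR_EVD_QLEN, &evdHandle, &iaHandle) ))
     {
         printError("dat_ia_open", status);
         return 1;
     }
     // Close gracefully; this should release everything the open acquired.
     if (DAT_SUCCESS != (status = dat_ia_close(iaHandle, DAT_CLOSE_GRACEFUL_FLAG) ))
     {
         printError("dat_ia_close", status);
         return 1;
     }
  }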
 
 
./UdaplUtility ofa-v2-roe0
open number 0
open number 1
open number 2
open number 3
...
open number 27
open number 28
open number 29
host1:CMA:747b:a4377720: 3452 us(3452 us):  open_hca: rdma_bind ERR No such device. Is enP303p0s0.66 configured as IPoIB?
failure: dat_ia_open 0x120000

Steps to Reproduce:
1. Start the process
2. Open ROCE port via dat_ia_open() call
3. Close ROCE port via dat_ia_close() call
4. Repeat steps 2 and 3 for a total of 60 iterations

Actual results:
The uDAPL code fails after 28 open/close iterations.

Expected results:
Since the connection is closed, there should be no limit on how many consecutive open/close cycles can be executed successfully.

Additional info:

Comment 2 Honggang LI 2020-02-06 04:39:28 UTC
IBM asks for fixes for rhel-7.7 and rhel-7.6. This bug is opened for rhel-7.6.z.

Comment 3 Michal Schmidt 2020-02-12 08:30:26 UTC
To fix this issue in 7.6.z we must follow the z-stream process from bug 1784193.
This BZ is an improper clone. Closing.