Bug 1784193 - dat_ia_close() does not release the virtual function contexts for Mellanox ROCE ports
Summary: dat_ia_close() does not release the virtual function contexts for Mellanox RO...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: dapl
Version: 7.6
Hardware: s390x
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Honggang LI
QA Contact: Brian Chae
URL:
Whiteboard:
Depends On:
Blocks: 1798812 1798814
 
Reported: 2019-12-16 22:35 UTC by alex.osadchyy@ibm.com
Modified: 2020-11-11 12:09 UTC
CC List: 7 users

Fixed In Version: dapl-2.1.5-3.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1798812 1798814
Environment:
Last Closed: 2020-03-31 20:11:37 UTC
Target Upstream Version:
Embargoed:


Attachments: none

Links:
Red Hat Product Errata RHBA-2020:1192 (last updated 2020-03-31 20:11:53 UTC)

Description alex.osadchyy@ibm.com 2019-12-16 22:35:10 UTC
Description of problem:
Sequential execution of uDAPL API calls to open and close a RoCE port breaks after 28 iterations. This indicates that the close call does not actually release the connection. Tested and observed on IBM Z (s390x); however, the connection leak does not appear to be architecture-specific and should exist on x86 as well.

A similar test was performed with Verbs API calls using ibv_open_device()/ibv_close_device(); no error was observed over 60 iterations.
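For reference, a minimal sketch (not the original test program) of the kind of Verbs-level open/close loop described above; it assumes the first device returned by ibv_get_device_list() is the RoCE adapter under test:

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
	int i, num = 0;
	struct ibv_device **list = ibv_get_device_list(&num);

	if (!list || num == 0) {
		fprintf(stderr, "no RDMA devices found\n");
		return 1;
	}
	for (i = 0; i < 60; i++) {
		/* open and immediately close the device context */
		struct ibv_context *ctx = ibv_open_device(list[0]);
		if (!ctx) {
			fprintf(stderr, "ibv_open_device failed at iteration %d\n", i);
			return 1;
		}
		ibv_close_device(ctx);
	}
	ibv_free_device_list(list);
	printf("60 open/close iterations completed without error\n");
	return 0;
}

Build with gcc -libverbs; unlike the uDAPL loop, this completes all iterations.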

Version-Release number of selected component (if applicable):
dapl 2.1.5-2.el7

How reproducible:
UDAPL code fails after 28 open/close iterations
 
  for( int i = 0 ; i < 60 ; i++ )
  {
     DAT_IA_HANDLE  iaHandle = DAT_HANDLE_NULL;
     DAT_EVD_HANDLE evdHandle   = DAT_HANDLE_NULL;
     cout << "open number " << i << endl ;
     status = dat_ia_open(gDevName, SVR_EVD_QLEN, &evdHandle, &iaHandle);
     if (DAT_SUCCESS != (status = dat_ia_close(iaHandle, DAT_CLOSE_GRACEFUL_FLAG) ))
     {
         printError("dat_ia_close", status);
         return 1;
     }
  }
 
 
./UdaplUtility ofa-v2-roe0
open number 0
open number 1
open number 2
open number 3
...
open number 27
open number 28
open number 29
host1:CMA:747b:a4377720: 3452 us(3452 us):  open_hca: rdma_bind ERR No such device. Is enP303p0s0.66 configured as IPoIB?
failure: dat_ia_open 0x120000

Steps to Reproduce:
1. Start the process
2. Open ROCE port via dat_ia_open() call
3. Close ROCE port via dat_ia_close() call
4. Repeat #2 for 60 times

Actual results:
UDAPL code fails after 28 open/close iterations

Expected results:
Since the connection is closed, there should be no limit on how many consecutive open/close calls can be executed successfully.

Additional info:

Comment 5 Honggang LI 2019-12-18 02:50:21 UTC
Hi, Alex
Is this a Mellanox-hardware-specific bug? Thanks

Comment 6 alex.osadchyy@ibm.com 2019-12-18 05:22:16 UTC
Hi Hong, 
As stated above, the same open/close sequence works fine using the Verbs API. I read that as the Mellanox drivers working fine; the middle layer, uDAPL, must not be handling the close calls properly the way Verbs does.
Alex

Comment 7 Honggang LI 2019-12-18 12:20:38 UTC
(In reply to alex.osadchyy from comment #0)
Hi,

>   for( int i = 0 ; i < 60 ; i++ )
>   {
>      DAT_IA_HANDLE  iaHandle = DAT_HANDLE_NULL;
>      DAT_EVD_HANDLE evdHandle   = DAT_HANDLE_NULL;
>      cout << "open number " << i << endl ;
>      status = dat_ia_open(gDevName, SVR_EVD_QLEN, &evdHandle, &iaHandle);
                            ^^^^^^^^  ^^^^^^^^^^^^^^
I'm trying to reproduce this issue, but I don't know how these two symbols are defined.

>      if (DAT_SUCCESS != (status = dat_ia_close(iaHandle,
> DAT_CLOSE_GRACEFUL_FLAG) ))
>      {
>          printError("dat_ia_close", status);
>          return 1;
>      }
>   }
>  
>  
> ./UdaplUtility ofa-v2-roe0

So please provide all source files of 'UdaplUtility'.

And please upload the sosreport file generated on the machine you used to reproduce this issue.

Thanks

Comment 8 alex.osadchyy@ibm.com 2019-12-24 03:24:37 UTC
The code is a simple test for dat_ia_open(). The parameters depend on the machine and the Mellanox adapters, e.g.:

gDevName is the device name you specified in your dat.conf
e.g.
#define gDevName "ofa-v2-mlx4_0-1"
#define SVR_EVD_QLEN 8

Any standard uDAPL example adapted to your environment should work here.
e.g. ref https://www.mail-archive.com/general@lists.openfabrics.org/msg25610.html

Interface definitions:
https://docs.oracle.com/cd/E19253-01/816-5172/6mbb7btjf/index.html
https://linux.die.net/man/5/dat.conf

Comment 9 Honggang LI 2019-12-26 13:39:06 UTC
Confirmed: there is a file handle leak in the dapl cma provider.


[root@rdma-dev-19 test]$ ibstat
CA 'mlx5_2'
	CA type: MT4115
	Number of ports: 1
	Firmware version: 12.23.1020
	Hardware version: 0
	Node GUID: 0x248a07030049d338
	System image GUID: 0x248a07030049d338
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 13
		LMC: 0
		SM lid: 1
		Capability mask: 0x2659e848
		Port GUID: 0x248a07030049d338
		Link layer: InfiniBand
CA 'mlx5_3'
	CA type: MT4115
	Number of ports: 1
	Firmware version: 12.23.1020
	Hardware version: 0
	Node GUID: 0x248a07030049d339
	System image GUID: 0x248a07030049d338
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 40
		Base lid: 38
		LMC: 1
		SM lid: 36
		Capability mask: 0x2659e848
		Port GUID: 0x248a07030049d339
		Link layer: InfiniBand
CA 'mlx5_bond_0'
	CA type: MT4117
	Number of ports: 1
	Firmware version: 14.23.1020
	Hardware version: 0
	Node GUID: 0x7cfe900300cb743a
	System image GUID: 0x7cfe900300cb743a
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 10
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x7efe90fffecb743a
		Link layer: Ethernet


[root@rdma-dev-19 test]$ cat test.sh
#!/bin/bash
set -x

export DAT_OVERRIDE=/root/test/dat.conf

cat > ${DAT_OVERRIDE} << 'EOF'
OpenIB-cma u2.0 nonthreadsafe default libdaplcma.so.1 dapl.2.0 "mlx5_bond_roce 0" ""
ofa-v2-cma-roe-mlx5_bond_roce u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "mlx5_bond_roce 0" ""
ofa-v2-cma-roe-mlx5_ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "mlx5_ib0 0" ""
EOF

cat ${DAT_OVERRIDE}

cat test.c

rm -f test.exe

if [ ! -e test.exe ]; then
	gcc -ldat2 -Wall -Werror -g -o test.exe test.c
fi

if [ "x$2" = "xdebug" ]; then
	export DAPL_DBG_DEST=0x0001
	export DAPL_DBG_TYPE=0xffffffff
	export DAPL_DBG_LEVEL=0xffff

	export DAT_DBG_TYPE_ENV=0xffff
	export DAT_DBG_TYPE=0xff
	export DAT_DBG_DEST=0x1
fi

ulimit -n 50
./test.exe ofa-v2-cma-roe-mlx5_bond_roce 8
./test.exe ofa-v2-cma-roe-mlx5_ib0 8

[root@rdma-dev-19 test]$ 


[root@rdma-dev-19 test]$ sh test.sh 
+ export DAT_OVERRIDE=/root/test/dat.conf
+ DAT_OVERRIDE=/root/test/dat.conf
+ cat
+ cat /root/test/dat.conf
OpenIB-cma u2.0 nonthreadsafe default libdaplcma.so.1 dapl.2.0 "mlx5_bond_roce 0" ""
ofa-v2-cma-roe-mlx5_bond_roce u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "mlx5_bond_roce 0" ""
ofa-v2-cma-roe-mlx5_ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "mlx5_ib0 0" ""
+ cat test.c
#include <stdio.h>
#include <stdlib.h>
#include <dat2/udat.h>

int main(int argc, char **argv)
{
	DAT_IA_HANDLE  iaHandle;
	DAT_EVD_HANDLE evdHandle;
	DAT_RETURN status;
	DAT_NAME_PTR gDevName;
	DAT_COUNT SVR_EVD_QLEN;

	int i;

	if (argc != 3)
		return -1;

	gDevName = argv[1];
	SVR_EVD_QLEN = atoi(argv[2]);

	for(i = 0 ; i < 400 ; i++ ) {
		iaHandle = DAT_HANDLE_NULL;
		evdHandle   = DAT_HANDLE_NULL;
		printf("open number %d\n", i);
		status = dat_ia_open(gDevName, SVR_EVD_QLEN, &evdHandle, &iaHandle);
		if (DAT_SUCCESS != status) {
			printf("dat_ia_open status = %u\n", status);
			return 1;
		}
		if (DAT_SUCCESS != (status = dat_ia_close(iaHandle, DAT_CLOSE_GRACEFUL_FLAG) )) {
			printf("dat_ia_close status = %u\n", status);
			return 1;
		}
	}

	//for(;;);
	
	return 0;
}
+ rm -f test.exe
+ '[' '!' -e test.exe ']'
+ gcc -ldat2 -Wall -Werror -g -o test.exe test.c
+ '[' x = xdebug ']'
+ ulimit -n 50
+ ./test.exe ofa-v2-cma-roe-mlx5_bond_roce 8
open number 0
open number 1
open number 2
open number 3
open number 4
open number 5
open number 6
open number 7
open number 8
open number 9
open number 10
open number 11
open number 12
open number 13
open number 14
open number 15
open number 16
open number 17
open number 18
open number 19
open number 20
open number 21
rdma-dev-19.lab.bos.redhat.com:CMA:2112e:24ea4640: 1735 us(1735 us):  open_hca: ibv_create_comp_channel ERR Too many open files
dat_ia_open status = 262144
+ ./test.exe ofa-v2-cma-roe-mlx5_ib0 8
open number 0
open number 1
open number 2
open number 3
open number 4
open number 5
open number 6
open number 7
open number 8
open number 9
open number 10
open number 11
open number 12
open number 13
open number 14
open number 15
open number 16
open number 17
open number 18
open number 19
open number 20
open number 21
rdma-dev-19.lab.bos.redhat.com:CMA:21170:6c834640: 2294 us(2294 us):  open_hca: ibv_create_comp_channel ERR Too many open files
dat_ia_open status = 262144


[root@rdma-dev-19 test]$  ps -ef | grep test.exe
root     135815 135807 90 08:35 pts/1    00:00:24 ./test.exe ofa-v2-cma-roe-mlx5_ib0 8

[root@rdma-dev-19 fd]$ ls -l | head
total 0
lrwx------. 1 root root 64 Dec 26 08:36 0 -> /dev/pts/1
lrwx------. 1 root root 64 Dec 26 08:36 1 -> /dev/pts/1
lrwx------. 1 root root 64 Dec 26 08:36 10 -> /dev/infiniband/uverbs2
lrwx------. 1 root root 64 Dec 26 08:36 100 -> /dev/infiniband/uverbs2
lr-x------. 1 root root 64 Dec 26 08:36 101 -> anon_inode:[infinibandevent]
lrwx------. 1 root root 64 Dec 26 08:36 102 -> /dev/infiniband/uverbs2
lr-x------. 1 root root 64 Dec 26 08:36 103 -> anon_inode:[infinibandevent]
lrwx------. 1 root root 64 Dec 26 08:36 104 -> /dev/infiniband/uverbs2
lr-x------. 1 root root 64 Dec 26 08:36 105 -> anon_inode:[infinibandevent]

The '/dev/infiniband/uverbs2' and 'anon_inode:[infinibandevent]' file handles are leaked.
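A minimal sketch (not part of the reproducer above, assuming /proc is mounted) of how the growth can be checked from inside the test loop: count the entries of /proc/self/fd before and after each dat_ia_open()/dat_ia_close() pair.

#include <stdio.h>
#include <dirent.h>

/* Return the number of file descriptors currently open in this process.
 * Note: the DIR stream itself uses one fd, so the count includes it. */
static int count_open_fds(void)
{
	DIR *d = opendir("/proc/self/fd");
	struct dirent *e;
	int n = 0;

	if (!d)
		return -1;
	while ((e = readdir(d)) != NULL)
		if (e->d_name[0] != '.')
			n++;
	closedir(d);
	return n;
}

With the leak, the count grows by two per iteration (one /dev/infiniband/uverbsX descriptor plus one anon_inode:[infinibandevent] descriptor), which is consistent with 'ulimit -n 50' in test.sh making ibv_create_comp_channel fail with "Too many open files" after roughly 21 successful iterations, as in the output above.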

Comment 10 alex.osadchyy@ibm.com 2019-12-26 19:11:32 UTC
Thank you for reproducing and confirming the issue. From our product's point of view, we request the fix for RHEL 7.6 and up, not only RHEL 8+.

Comment 11 Honggang LI 2019-12-30 03:26:17 UTC
(In reply to alex.osadchyy from comment #10)
> Thank you for reproducing and confirming the issue.  From our product point
> of view, request the fix for RHEL 7.6 and up, not only RHEL 8+.

dapl is deprecated upstream, so Red Hat removed it from the RHEL 8 distribution.

Please see thread "[Ofa_boardplus] OFA Repo and Maintainer List cleanup" for details.

https://lists.openfabrics.org/pipermail/ofa_boardplus/2018-May/thread.html

Comment 12 alex.osadchyy@ibm.com 2019-12-30 22:52:34 UTC
Understood about RHEL 8+. We will need to plan an alternative with the IBM Db2 pureScale team. For this issue, can the fix be done in RHEL 7.6 and later 7.x?

Comment 13 Honggang LI 2019-12-31 01:16:00 UTC
(In reply to alex.osadchyy from comment #12)
> For this issue, can the fix be done in RHEL 7.6 and later 7.x?

I'm still debugging this issue, so I don't have an answer at this point.

Besides the file handle leak, there are MANY resource leaks in dapl.
I have run covscan on dapl; if you want the details of the results,
please let me know and I will upload them for you.

Comment 14 Honggang LI 2020-01-02 06:11:16 UTC
(In reply to Honggang LI from comment #9)

> lrwx------. 1 root root 64 Dec 26 08:36 10 -> /dev/infiniband/uverbs2
> lrwx------. 1 root root 64 Dec 26 08:36 100 -> /dev/infiniband/uverbs2
> lr-x------. 1 root root 64 Dec 26 08:36 101 -> anon_inode:[infinibandevent]
> lrwx------. 1 root root 64 Dec 26 08:36 102 -> /dev/infiniband/uverbs2
> lr-x------. 1 root root 64 Dec 26 08:36 103 -> anon_inode:[infinibandevent]
> lrwx------. 1 root root 64 Dec 26 08:36 104 -> /dev/infiniband/uverbs2
> lr-x------. 1 root root 64 Dec 26 08:36 105 -> anon_inode:[infinibandevent]
> 
> The 'dev/infiniband/uverbs2' and 'anon_inode:[infinibandevent]' file
> handlers are leaked.

The /dev/infiniband/uverbsX files are opened in this call path. It is not an mlx5 RoCE-specific
issue, as it can be reproduced with qib/mlx5/hfi1 devices.

#0  verbs_open_device (device=0x60dc60, private_data=private_data@entry=0x0) at /usr/src/debug/rdma-core-22.4/libibverbs/device.c:320
#1  0x00002aaaab8f98e7 in __ibv_open_device_1_1 (device=<optimized out>) at /usr/src/debug/rdma-core-22.4/libibverbs/device.c:356
#2  0x00002aaaabb0a3af in ucma_open_device (guid=12686471929343054080) at /usr/src/debug/rdma-core-22.4/librdmacm/cma.c:258
#3  ucma_init_device (cma_dev=cma_dev@entry=0x605ea0) at /usr/src/debug/rdma-core-22.4/librdmacm/cma.c:276
#4  0x00002aaaabb0a659 in ucma_init_device (cma_dev=0x605ea0) at /usr/src/debug/rdma-core-22.4/librdmacm/cma.c:407
#5  ucma_get_device (id_priv=id_priv@entry=0x606c30, guid=<optimized out>) at /usr/src/debug/rdma-core-22.4/librdmacm/cma.c:402
#6  0x00002aaaabb0a7d6 in ucma_query_addr (id=id@entry=0x606c30) at /usr/src/debug/rdma-core-22.4/librdmacm/cma.c:647
#7  0x00002aaaabb0ac30 in rdma_bind_addr2 (id=0x606c30, addr=<optimized out>, addrlen=<optimized out>) at /usr/src/debug/rdma-core-22.4/librdmacm/
cma.c:817
#8  0x00002aaaabb0b063 in rdma_bind_addr (id=0x606c30, addr=addr@entry=0x606848) at /usr/src/debug/rdma-core-22.4/librdmacm/cma.c:834
#9  0x00002aaaab4c30fd in dapls_ib_open_hca (hca_name=0x60c4d0 "hfi1_opa0", hca_ptr=hca_ptr@entry=0x606800, flags=flags@entry=DAPL_OPEN_NORMAL) at
 dapl/openib_cma/device.c:314
#10 0x00002aaaab4b812e in dapl_ia_open (name=<optimized out>, async_evd_qlen=8, async_evd_handle_ptr=0x7fffffffde88, ia_handle_ptr=0x7fffffffde90)
 at dapl/common/dapl_ia_open.c:135
#11 0x00002aaaaacd13d6 in dat_ia_openv (name=0x7fffffffe2bb "ofa-v2-cma-roe-hfi1_opa0", async_event_qlen=8, async_event_handle=0x7fffffffde88, ia_
handle=0x7fffffffde90, dapl_major=2, dapl_minor=<optimized out>, thread_safety=DAT_FALSE) at dat/udat/udat.c:210
#12 0x00000000004006e8 in main (argc=3, argv=0x7fffffffdf98) at test.c:26


> Except the file handle leak, there are MANY resource leak in dapl.

I also confirmed that most of the resource leaks are in the dapl test programs, not in the dapl libraries.
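For reference, a condensed sketch of the librdmacm calls at the top of this path (error handling trimmed; the IPv4 address parameter is an assumption, any RDMA-capable address on the host will do). The rdma_bind_addr() is what opens the /dev/infiniband/uverbsX context via ucma_get_device()/ibv_open_device():

#include <string.h>
#include <arpa/inet.h>
#include <rdma/rdma_cma.h>

/* Bind to an RDMA-capable IPv4 address and tear everything down again.
 * The bind resolves a verbs device via ucma_get_device(), i.e. the
 * call path shown in the backtrace above. */
static int bind_once(const char *ip)
{
	struct rdma_event_channel *ch;
	struct rdma_cm_id *id = NULL;
	struct sockaddr_in sin;
	int ret = -1;

	ch = rdma_create_event_channel();
	if (!ch)
		return -1;

	if (rdma_create_id(ch, &id, NULL, RDMA_PS_TCP) == 0) {
		memset(&sin, 0, sizeof(sin));
		sin.sin_family = AF_INET;
		if (inet_pton(AF_INET, ip, &sin.sin_addr) == 1)
			ret = rdma_bind_addr(id, (struct sockaddr *)&sin);
		rdma_destroy_id(id);
	}
	rdma_destroy_event_channel(ch);
	return ret;
}

Link with -lrdmacm.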

Comment 16 Honggang LI 2020-01-03 11:37:20 UTC
Hi, Alex

http://people.redhat.com/honli/dapl/

This workaround fixes the /dev/infiniband/uverbsX file leak issue for me.

Could you please test it?

Thanks

Comment 17 alex.osadchyy@ibm.com 2020-01-03 17:17:13 UTC
Hi Hong,
I will need a build for s390x (IBM LinuxONE server hardware) in order to test. You can get a free RHEL virtual server instance and make the build here: https://developer.ibm.com/linuxone/
Thanks

Comment 18 Honggang LI 2020-01-04 00:07:09 UTC
(In reply to alex.osadchyy from comment #17)
> Hi Hong,
> I will need a build for s390x(IBM LinuxONE server hardware) in order to

http://people.redhat.com/honli/dapl/s390x/

I built dapl for s390x with our internal Brew system and uploaded it to our HTTP server.
Please download and test it.

Comment 19 alex.osadchyy@ibm.com 2020-01-08 17:53:21 UTC
I installed the patch on my systems and confirmed the fix works. Can you share the plans for including it in RHEL 7.6 and 7.7?

Comment 20 Honggang LI 2020-01-13 02:03:13 UTC
(In reply to alex.osadchyy from comment #19)
> I installed the patch on my systems and confirmed the fix works.   Can you
> share the plans of including it in RHEL 7.6 and 7.7?

Red Hat QE confirmed that the workaround for this bug has passed our internal dapl regression test suite.

Before we fix this for RHEL 7.6 and RHEL 7.7 via z-stream, this issue must be fixed for RHEL 7.8.

I have set the 'blocker?' flag for the RHEL 7.8 request. The management team will review it.

For the RHEL 7.8 blocker request and the RHEL 7.6/7.7 z-stream requests, we need business justification, so please provide it ASAP.

Thanks

Comment 22 alex.osadchyy@ibm.com 2020-01-13 19:08:52 UTC
Regarding the justification: this bug blocks the enablement of IBM Db2 pureScale v11.5 on the IBM Z platform. The IBM Db2 pureScale environment helps reduce the risk and cost associated with growing a distributed database solution by providing extreme capacity and application transparency. The Db2 pureScale environment is designed for continuous availability and is capable of exceeding even the strictest industry standards.

Hong,
Could you also confirm the following 2 items:
1. When will RHEL 7.8 become GA?
2. Is the current dapl patch at GA level?
Thanks

Comment 23 Honggang LI 2020-01-14 12:09:43 UTC
(In reply to alex.osadchyy from comment #22)
> Regarding the justification. This bug blocks the enablement of IBM Db2
> pureScale v11.5 on IBM Z platform. The IBM Db2 pureScale environment might
> help reduce the risk and cost associated with growing a distributed database
> solution by providing extreme capacity and application transparency. The Db2
> pureScale environment is designed for continuous availability and is capable
> of exceeding even the strictest industry standard.
> 
> Hong,
> Could you also confirm the following 2 items:
> 1. When RHEL-7.8 will become GA?

The RHEL schedule is confidential; I cannot discuss it in public. Please ask your IBM
partner manager, who may be able to share that information with you.

> 2. Is current dapl patch a GA level? 

To be honest, I'm not 100% sure. I submitted the workaround to the upstream dapl maintainer,
arlin.r.davis, to discuss this issue, but I did not get a reply. I think
my email was ignored because dapl is dead upstream.

I compared runs of our internal dapl regression test suite, which is based on dapl-utils,
with and without the workaround; no obvious regression was introduced by the
workaround. The test suite has been run over all the RDMA hardware we have, which includes
InfiniBand/OPA/iWARP/RoCE devices.

If you want a GA-level test, you should test this with Db2 yourselves; we
don't have Db2 available to test this workaround against.

You said "This bug blocks the enablement of IBM Db2 pureScale v11.5 on IBM Z platform.";
are you positive that this issue is the ONLY one blocking Db2 pureScale?

It does not seem like a good idea to build a new program or feature on a dead upstream library.

Comment 24 alex.osadchyy@ibm.com 2020-01-14 22:46:24 UTC
Thanks.
Regarding the GA level of the dapl patch: can you publish the actual modification so we can try to rebuild and review the code? Even if there is no response/acceptance from the maintainer, there may be a way to make a pull request or create a branch against the dapl code. That way it will be accessible and whoever needs the fix will be able to reference it.

We're considering alternatives, but that is not an immediate solution.

Comment 25 Honggang LI 2020-01-15 02:53:48 UTC
(In reply to alex.osadchyy from comment #24)
> Thanks
> Regarding GA level of the  dapl patch. Can you publish the actual
> modification so we can try to re-build/review the code? 

Sure, I can publish the actual patch. The problem is: to whom, or to which mailing list, should the patch be sent?

> Perhaps if there is
> no response/acceptance from the maintainer. There is a way to make a pull
> request or create a branch against the dapl code. 

Unfortunately, the upstream git repo can only be accessed via a web browser. That means
we can't clone it or file a pull request.

https://www.openfabrics.org/downloads/dapl/

There is a URL, pointing to the dapl git repo, in the last line of that web page.

tmp]$ git clone http://git.openfabrics.org/~ardavis/dapl.git
Cloning into 'dapl'...
fatal: repository 'http://git.openfabrics.org/~ardavis/dapl.git/' not found

tmp]$ git clone https://git.openfabrics.org/~ardavis/dapl.git
Cloning into 'dapl'...
fatal: repository 'https://git.openfabrics.org/~ardavis/dapl.git/' not found


> That way it will be
> accessible and whoever needs the fix will be able to reference. 
> 
> We're considering alternatives, but it's not an immediate solution.

Comment 27 alex.osadchyy@ibm.com 2020-01-17 20:56:12 UTC
Are there any Red Hat mailing lists where you can publish? There are some active mailing lists on openfabrics.org. Previously, dapl discussions were in the "general" section, but it is archived now. Perhaps publishing to any active list there would still work.
https://lists.openfabrics.org/mailman/listinfo

Comment 28 Honggang LI 2020-01-21 12:09:12 UTC
(In reply to alex.osadchyy from comment #27)
> Are there any RedHat mailing lists where you can publish?  There are some
> active mailing lists on openfabrics.org. Previously dapl discussions were in
> the general section. But it's archived now. Perhaps publishing to any active
> list there would still work. 
> https://lists.openfabrics.org/mailman/listinfo

I have scanned ALL thread subjects of all the open mailing lists. It seems such a mailing
list is not the right channel for upstreaming the patch.

1) As you said, they are archived now; this is because the upstream development work
has been migrated to GitHub. Most of the mailing lists are really quiet.

2) The resource leak is a general problem. It is not OFED-specific, so it would likely be
ignored if raised there.

3) Very few dapl-related questions, no more than 5, have been asked on those mailing
lists, and most of them were ignored. The last one was opened in 2014.

4) I tried to subscribe to two mailing lists but never got a reply after 24+ hours.

As the source of the resource leak is in librdmacm, I will try to ask on the RDMA mailing
list <linux-rdma.org>.

Comment 29 Honggang LI 2020-01-22 15:14:47 UTC
I managed to reproduce this without dapl. That means the resource leak is a librdmacm issue.


[root@rdma-dev-00 cm2]$ sh build.sh 
+ rm -f libofa.so libofa.o
+ gcc -fPIC -g -c -o libofa.o libofa.c
+ gcc -shared -fPIC -g -Wl,-init,test_init -Wl,-fini,test_fini -lrdmacm -o libofa.so libofa.o
+ gcc -ldl -g -o test.exe test.c
+ ip addr show mlx4_ib0
+ grep -w inet
    inet 172.31.0.230/24 brd 172.31.0.255 scope global dynamic noprefixroute mlx4_ib0
+ ./test.exe 172.31.0.230
dlopen librdamcm.so done
dlopen librdamcm.so done
dlopen librdamcm.so done
dlopen librdamcm.so done
=== ls -l /proc/20221/fd
total 0
lrwx------. 1 root root 64 Jan 22 10:10 0 -> /dev/pts/1
lrwx------. 1 root root 64 Jan 22 10:10 1 -> /dev/pts/1
lrwx------. 1 root root 64 Jan 22 10:10 10 -> /dev/infiniband/uverbs0      <--- leak
lr-x------. 1 root root 64 Jan 22 10:10 11 -> 'anon_inode:[infinibandevent]'
lrwx------. 1 root root 64 Jan 22 10:10 2 -> /dev/pts/1
lrwx------. 1 root root 64 Jan 22 10:10 4 -> /dev/infiniband/uverbs0       <--  leak
lr-x------. 1 root root 64 Jan 22 10:10 5 -> 'anon_inode:[infinibandevent]'
lrwx------. 1 root root 64 Jan 22 10:10 6 -> /dev/infiniband/uverbs0       <-- leak
lr-x------. 1 root root 64 Jan 22 10:10 7 -> 'anon_inode:[infinibandevent]'
lrwx------. 1 root root 64 Jan 22 10:10 8 -> /dev/infiniband/uverbs0       <-- leak
lr-x------. 1 root root 64 Jan 22 10:10 9 -> 'anon_inode:[infinibandevent]'
[root@rdma-dev-00 cm2]$ 


[root@rdma-dev-00 cm2]$ cat build.sh 
#!/bin/bash
set -x

rm -f libofa.so libofa.o
gcc -fPIC -g -c -o libofa.o libofa.c
gcc -shared -fPIC -g -Wl,-init,test_init -Wl,-fini,test_fini -lrdmacm -o libofa.so libofa.o

gcc -ldl -g -o test.exe test.c

ip addr show mlx4_ib0 | grep -w inet

#LD_DEBUG=libs ./test.exe 172.31.0.230
./test.exe 172.31.0.230


[root@rdma-dev-00 cm2]$ cat libofa.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>	/* for memset() */
#include <dlfcn.h>
#include <arpa/inet.h>
#include <rdma/rdma_cma.h>
#include <limits.h>

#include <sys/types.h>
#include <unistd.h>

static void *handle;
void test_init(void)
{

	handle = dlopen("/usr/lib64/librdmacm.so", RTLD_NOW | RTLD_GLOBAL);
	if (!handle)
		printf("dlopen /usr/lib64/librdmacm.so failed\n");
	else
		printf("dlopen librdamcm.so done\n");
}

void test_fini(void)
{
	if (handle)
		dlclose(handle);
	handle = NULL;
}

void test(char *ipoib_ip)
{
#if 1
	int ret;
	struct rdma_cm_id *id;
	struct sockaddr_in ipoib_addr;
	struct rdma_event_channel *ch;
	void *handle;

	memset(&ipoib_addr, 0, sizeof(ipoib_addr));

	ipoib_addr.sin_family = AF_INET;
	ipoib_addr.sin_port = 5555;

#if 1
	ret = inet_pton(AF_INET, ipoib_ip, (void *)&(ipoib_addr.sin_addr));

	if (ret != 1)
		printf("inet_pton failed\n");
#else	
	ipoib_addr.sin_addr.s_addr=htonl(INADDR_ANY);
#endif	

	ch = rdma_create_event_channel();
	if (ch == NULL)
		printf("rdma_create_event_channel failed\n");

	ret = rdma_create_id(ch, &id, NULL, RDMA_PS_TCP);
	if (ret != 0)
		printf("rdma_create_id failed\n");

	ret = rdma_bind_addr(id, (struct sockaddr *) &ipoib_addr);

	if (ret != 0)
		printf("rdma_bind_addr failed\n");

#if DEBUG
	printf("befora call rdma_destroy_id\n");
	getchar();
#endif

	ret = rdma_destroy_id(id);
	if (ret != 0)
		printf("rdma_destroy_id failed\n");

#if DEBUG
	printf("before call rdma_destroy_event_channel\n");
	getchar();
#endif

	rdma_destroy_event_channel(ch);
#if DEBUG
	printf("after call rdma_destroy_event_channle\n");
	getchar();
#endif

#else
	printf("xxx %s:%s\n", __FILE__, __func__);
#endif	
}


[root@rdma-dev-00 cm2]$ cat test.c
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <string.h>

typedef void (* DUMMY_TEST_FUNC) (char *);

int main(int argc, char **argv)
{
	DUMMY_TEST_FUNC sym;
	void *handle;
	int i;
	pid_t cpid, ppid;
	int wstatus;
	char path[128];

	if (argc != 2) {
		printf("usage: %s IPoIB_IP_ADDR\n", argv[0]);
		return 1;
	}
	
	for (i = 0; i < 4; i++) {
		handle = dlopen("./libofa.so", RTLD_NOW | RTLD_GLOBAL);
		sym = dlsym(handle, "test");
		sym(argv[1]);
		dlclose(handle);
	}

	cpid = fork();

	if (cpid == 0) { /* child */
		ppid = getppid();
		memset(path, 0, 128);
		sprintf(path, "/proc/%d/fd", ppid);
		printf("=== ls -l %s\n", path);
		execl("/usr/bin/ls", "/usr/bin/ls", "-l", path, (char *)NULL);
	} else {
		waitpid(cpid, &wstatus, 0);
	}
	return 0;
}

Comment 30 Honggang LI 2020-01-31 03:40:53 UTC
https://www.spinics.net/lists/linux-rdma/msg88399.html

librdmacm was designed to be loaded once and unloaded only when the process exits.
We will apply the dlclose workaround to dapl.
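A minimal sketch of one way such a workaround could look (an illustration only; the actual patch applied to dapl may differ): pin librdmacm in the process with RTLD_NODELETE so that later dlclose() calls can never actually unload it, matching the load-once design described above.

#define _GNU_SOURCE	/* for RTLD_NODELETE on glibc */
#include <dlfcn.h>
#include <stdio.h>

/* Illustrative only: keep librdmacm resident for the lifetime of the
 * process.  The library path is taken from the reproducer above; the
 * real dapl change is not reproduced here. */
static void pin_librdmacm(void)
{
	void *h = dlopen("/usr/lib64/librdmacm.so",
			 RTLD_NOW | RTLD_GLOBAL | RTLD_NODELETE);

	if (!h)
		fprintf(stderr, "failed to pin librdmacm: %s\n", dlerror());
	/* Intentionally never dlclose(h): librdmacm is designed to be
	 * loaded once and released only at process exit. */
}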

Comment 35 errata-xmlrpc 2020-03-31 20:11:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1192

Comment 36 Michal Schmidt 2020-11-11 12:09:55 UTC
Clearing the "7.6.z?", "7.7.z?" request flags. The bug does not appear to meet the EUS inclusion criteria.

