Bug 1784193
| Summary: | dat_ia_close() does not release the virtual function contexts for Mellanox ROCE ports | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | alex.osadchyy <alex.osadchyy> | |
| Component: | dapl | Assignee: | Honggang LI <honli> | |
| Status: | CLOSED ERRATA | QA Contact: | Brian Chae <bchae> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 7.6 | CC: | bchae, ddutile, dledford, honli, mschmidt, rdma-dev-team, tborcin | |
| Target Milestone: | rc | |||
| Target Release: | --- | |||
| Hardware: | s390x | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | dapl-2.1.5-3.el7 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1798812 1798814 (view as bug list) | Environment: | ||
| Last Closed: | 2020-03-31 20:11:37 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1798812, 1798814 | |||
Hi, Alex

Is it a Mellanox hardware specific bug?

Thanks

Hi Hong,

As stated above, the same open/close sequence works fine using the Verbs API, so I read that as the Mellanox drivers working fine. The middle layer, UDAPL, must not be handling the close calls properly the way the verbs path does.

Alex

(In reply to alex.osadchyy from comment #0)

Hi,

> for( int i = 0 ; i < 60 ; i++ )
> {
>     DAT_IA_HANDLE iaHandle = DAT_HANDLE_NULL;
>     DAT_EVD_HANDLE evdHandle = DAT_HANDLE_NULL;
>     cout << "open number " << i << endl ;
>     status = dat_ia_open(gDevName, SVR_EVD_QLEN, &evdHandle, &iaHandle);

I'm trying to reproduce this issue, but I don't know how the two symbols gDevName and SVR_EVD_QLEN are defined.

>     if (DAT_SUCCESS != (status = dat_ia_close(iaHandle, DAT_CLOSE_GRACEFUL_FLAG) ))
>     {
>         printError("dat_ia_close", status);
>         return 1;
>     }
> }
>
> ./UdaplUtility ofa-v2-roe0

So please provide all source files of 'UdaplUtility', and please upload the sosreport file generated on the machine you used to reproduce this issue.

Thanks

The code is a simple test for dat_ia_open. The parameters depend on the machine and the Mellanox adapters. gDevName is the device name you specified in your dat.conf, e.g.

#define gDevName "ofa-v2-mlx4_0-1"
#define SVR_EVD_QLEN 8

Any standard UDAPL example adapted to the environment you have should work here, e.g.
https://www.mail-archive.com/general@lists.openfabrics.org/msg25610.html

Interface definitions:
https://docs.oracle.com/cd/E19253-01/816-5172/6mbb7btjf/index.html
https://linux.die.net/man/5/dat.conf

Confirmed there is a file handle leak in the dapl cma provider.
[root@rdma-dev-19 test]$ ibstat
CA 'mlx5_2'
CA type: MT4115
Number of ports: 1
Firmware version: 12.23.1020
Hardware version: 0
Node GUID: 0x248a07030049d338
System image GUID: 0x248a07030049d338
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 13
LMC: 0
SM lid: 1
Capability mask: 0x2659e848
Port GUID: 0x248a07030049d338
Link layer: InfiniBand
CA 'mlx5_3'
CA type: MT4115
Number of ports: 1
Firmware version: 12.23.1020
Hardware version: 0
Node GUID: 0x248a07030049d339
System image GUID: 0x248a07030049d338
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 38
LMC: 1
SM lid: 36
Capability mask: 0x2659e848
Port GUID: 0x248a07030049d339
Link layer: InfiniBand
CA 'mlx5_bond_0'
CA type: MT4117
Number of ports: 1
Firmware version: 14.23.1020
Hardware version: 0
Node GUID: 0x7cfe900300cb743a
System image GUID: 0x7cfe900300cb743a
Port 1:
State: Active
Physical state: LinkUp
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x7efe90fffecb743a
Link layer: Ethernet
[root@rdma-dev-19 test]$ cat test.sh
#!/bin/bash
set -x
export DAT_OVERRIDE=/root/test/dat.conf
cat > ${DAT_OVERRIDE} << 'EOF'
OpenIB-cma u2.0 nonthreadsafe default libdaplcma.so.1 dapl.2.0 "mlx5_bond_roce 0" ""
ofa-v2-cma-roe-mlx5_bond_roce u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "mlx5_bond_roce 0" ""
ofa-v2-cma-roe-mlx5_ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "mlx5_ib0 0" ""
EOF
cat ${DAT_OVERRIDE}
cat test.c
rm -f test.exe
if [ ! -e test.exe ]; then
gcc -ldat2 -Wall -Werror -g -o test.exe test.c
fi
if [ "x$2" = "xdebug" ]; then
export DAPL_DBG_DEST=0x0001
export DAPL_DBG_TYPE=0xffffffff
export DAPL_DBG_LEVEL=0xffff
export DAT_DBG_TYPE_ENV=0xffff
export DAT_DBG_TYPE=0xff
export DAT_DBG_DEST=0x1
fi
ulimit -n 50
./test.exe ofa-v2-cma-roe-mlx5_bond_roce 8
./test.exe ofa-v2-cma-roe-mlx5_ib0 8
[root@rdma-dev-19 test]$
[root@rdma-dev-19 test]$ sh test.sh
+ export DAT_OVERRIDE=/root/test/dat.conf
+ DAT_OVERRIDE=/root/test/dat.conf
+ cat
+ cat /root/test/dat.conf
OpenIB-cma u2.0 nonthreadsafe default libdaplcma.so.1 dapl.2.0 "mlx5_bond_roce 0" ""
ofa-v2-cma-roe-mlx5_bond_roce u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "mlx5_bond_roce 0" ""
ofa-v2-cma-roe-mlx5_ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "mlx5_ib0 0" ""
+ cat test.c
#include <stdio.h>
#include <stdlib.h>
#include <dat2/udat.h>
int main(int argc, char **argv)
{
DAT_IA_HANDLE iaHandle;
DAT_EVD_HANDLE evdHandle;
DAT_RETURN status;
DAT_NAME_PTR gDevName;
DAT_COUNT SVR_EVD_QLEN;
int i;
if (argc != 3)
return -1;
gDevName = argv[1];
SVR_EVD_QLEN = atoi(argv[2]);
for(i = 0 ; i < 400 ; i++ ) {
iaHandle = DAT_HANDLE_NULL;
evdHandle = DAT_HANDLE_NULL;
printf("open number %d\n", i);
status = dat_ia_open(gDevName, SVR_EVD_QLEN, &evdHandle, &iaHandle);
if (DAT_SUCCESS != status) {
printf("dat_ia_open status = %u\n", status);
return 1;
}
if (DAT_SUCCESS != (status = dat_ia_close(iaHandle, DAT_CLOSE_GRACEFUL_FLAG) )) {
printf("dat_ia_close status = %u\n", status);
return 1;
}
}
//for(;;);
return 0;
}
+ rm -f test.exe
+ '[' '!' -e test.exe ']'
+ gcc -ldat2 -Wall -Werror -g -o test.exe test.c
+ '[' x = xdebug ']'
+ ulimit -n 50
+ ./test.exe ofa-v2-cma-roe-mlx5_bond_roce 8
open number 0
open number 1
open number 2
open number 3
open number 4
open number 5
open number 6
open number 7
open number 8
open number 9
open number 10
open number 11
open number 12
open number 13
open number 14
open number 15
open number 16
open number 17
open number 18
open number 19
open number 20
open number 21
rdma-dev-19.lab.bos.redhat.com:CMA:2112e:24ea4640: 1735 us(1735 us): open_hca: ibv_create_comp_channel ERR Too many open files
dat_ia_open status = 262144
+ ./test.exe ofa-v2-cma-roe-mlx5_ib0 8
open number 0
open number 1
open number 2
open number 3
open number 4
open number 5
open number 6
open number 7
open number 8
open number 9
open number 10
open number 11
open number 12
open number 13
open number 14
open number 15
open number 16
open number 17
open number 18
open number 19
open number 20
open number 21
rdma-dev-19.lab.bos.redhat.com:CMA:21170:6c834640: 2294 us(2294 us): open_hca: ibv_create_comp_channel ERR Too many open files
dat_ia_open status = 262144
[root@rdma-dev-19 test]$ ps -ef | grep test.exe
root 135815 135807 90 08:35 pts/1 00:00:24 ./test.exe ofa-v2-cma-roe-mlx5_ib0 8
[root@rdma-dev-19 fd]$ ls -l | head
total 0
lrwx------. 1 root root 64 Dec 26 08:36 0 -> /dev/pts/1
lrwx------. 1 root root 64 Dec 26 08:36 1 -> /dev/pts/1
lrwx------. 1 root root 64 Dec 26 08:36 10 -> /dev/infiniband/uverbs2
lrwx------. 1 root root 64 Dec 26 08:36 100 -> /dev/infiniband/uverbs2
lr-x------. 1 root root 64 Dec 26 08:36 101 -> anon_inode:[infinibandevent]
lrwx------. 1 root root 64 Dec 26 08:36 102 -> /dev/infiniband/uverbs2
lr-x------. 1 root root 64 Dec 26 08:36 103 -> anon_inode:[infinibandevent]
lrwx------. 1 root root 64 Dec 26 08:36 104 -> /dev/infiniband/uverbs2
lr-x------. 1 root root 64 Dec 26 08:36 105 -> anon_inode:[infinibandevent]
The '/dev/infiniband/uverbs2' and 'anon_inode:[infinibandevent]' file descriptors are leaked.
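To make the leak easier to watch from inside the reproducer itself, a helper like the one below could be called before and after each dat_ia_open()/dat_ia_close() pair. This is only a sketch: count_open_fds() is a hypothetical helper that is not part of the original test program, and it assumes a Linux /proc filesystem.

#include <dirent.h>
#include <stdio.h>

/* Hypothetical helper (not part of the original reproducer): count the
   entries under /proc/self/fd so the caller can watch the descriptor
   count grow across dat_ia_open()/dat_ia_close() iterations. */
static int count_open_fds(void)
{
    DIR *dir = opendir("/proc/self/fd");
    struct dirent *ent;
    int count = 0;

    if (!dir)
        return -1;
    while ((ent = readdir(dir)) != NULL)
        if (ent->d_name[0] != '.') /* skip "." and ".." */
            count++;
    closedir(dir);
    return count - 1; /* exclude the descriptor used by opendir() itself */
}

In the failing runs above the count would climb steadily, consistent with the listing showing one uverbs descriptor and one infinibandevent descriptor left behind per iteration.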
Thank you for reproducing and confirming the issue. From our product point of view, we request the fix for RHEL 7.6 and up, not only RHEL 8+.

(In reply to alex.osadchyy from comment #10)
> Thank you for reproducing and confirming the issue. From our product point
> of view, we request the fix for RHEL 7.6 and up, not only RHEL 8+.

dapl is deprecated upstream, so Red Hat removed it from the RHEL 8 distribution. Please see the thread "[Ofa_boardplus] OFA Repo and Maintainer List cleanup" for details.
https://lists.openfabrics.org/pipermail/ofa_boardplus/2018-May/thread.html

Understood about RHEL 8+. We will need to plan an alternative with the IBM DB2 PureScale team. For this issue, can the fix be done in RHEL 7.6 and later 7.x?

(In reply to alex.osadchyy from comment #12)
> For this issue, can the fix be done in RHEL 7.6 and later 7.x?

I'm still debugging this issue, so I don't have an answer at this point.

Apart from the file handle leak, there are MANY resource leaks in dapl. I have run covscan for dapl; if you want the details of the results, please let me know and I will upload them for you.

(In reply to Honggang LI from comment #9)
> lrwx------. 1 root root 64 Dec 26 08:36 10 -> /dev/infiniband/uverbs2
> lrwx------. 1 root root 64 Dec 26 08:36 100 -> /dev/infiniband/uverbs2
> lr-x------. 1 root root 64 Dec 26 08:36 101 -> anon_inode:[infinibandevent]
> lrwx------. 1 root root 64 Dec 26 08:36 102 -> /dev/infiniband/uverbs2
> lr-x------. 1 root root 64 Dec 26 08:36 103 -> anon_inode:[infinibandevent]
> lrwx------. 1 root root 64 Dec 26 08:36 104 -> /dev/infiniband/uverbs2
> lr-x------. 1 root root 64 Dec 26 08:36 105 -> anon_inode:[infinibandevent]
>
> The '/dev/infiniband/uverbs2' and 'anon_inode:[infinibandevent]' file
> descriptors are leaked.

The /dev/infiniband/uverbsX files are opened in the following call path. It is not an mlx5 RoCE specific issue, as it can be reproduced with qib/mlx5/hfi1 devices.
#0  verbs_open_device (device=0x60dc60, private_data=private_data@entry=0x0) at /usr/src/debug/rdma-core-22.4/libibverbs/device.c:320
#1  0x00002aaaab8f98e7 in __ibv_open_device_1_1 (device=<optimized out>) at /usr/src/debug/rdma-core-22.4/libibverbs/device.c:356
#2  0x00002aaaabb0a3af in ucma_open_device (guid=12686471929343054080) at /usr/src/debug/rdma-core-22.4/librdmacm/cma.c:258
#3  ucma_init_device (cma_dev=cma_dev@entry=0x605ea0) at /usr/src/debug/rdma-core-22.4/librdmacm/cma.c:276
#4  0x00002aaaabb0a659 in ucma_init_device (cma_dev=0x605ea0) at /usr/src/debug/rdma-core-22.4/librdmacm/cma.c:407
#5  ucma_get_device (id_priv=id_priv@entry=0x606c30, guid=<optimized out>) at /usr/src/debug/rdma-core-22.4/librdmacm/cma.c:402
#6  0x00002aaaabb0a7d6 in ucma_query_addr (id=id@entry=0x606c30) at /usr/src/debug/rdma-core-22.4/librdmacm/cma.c:647
#7  0x00002aaaabb0ac30 in rdma_bind_addr2 (id=0x606c30, addr=<optimized out>, addrlen=<optimized out>) at /usr/src/debug/rdma-core-22.4/librdmacm/cma.c:817
#8  0x00002aaaabb0b063 in rdma_bind_addr (id=0x606c30, addr=addr@entry=0x606848) at /usr/src/debug/rdma-core-22.4/librdmacm/cma.c:834
#9  0x00002aaaab4c30fd in dapls_ib_open_hca (hca_name=0x60c4d0 "hfi1_opa0", hca_ptr=hca_ptr@entry=0x606800, flags=flags@entry=DAPL_OPEN_NORMAL) at dapl/openib_cma/device.c:314
#10 0x00002aaaab4b812e in dapl_ia_open (name=<optimized out>, async_evd_qlen=8, async_evd_handle_ptr=0x7fffffffde88, ia_handle_ptr=0x7fffffffde90) at dapl/common/dapl_ia_open.c:135
#11 0x00002aaaaacd13d6 in dat_ia_openv (name=0x7fffffffe2bb "ofa-v2-cma-roe-hfi1_opa0", async_event_qlen=8, async_event_handle=0x7fffffffde88, ia_handle=0x7fffffffde90, dapl_major=2, dapl_minor=<optimized out>, thread_safety=DAT_FALSE) at dat/udat/udat.c:210
#12 0x00000000004006e8 in main (argc=3, argv=0x7fffffffdf98) at test.c:26

> Apart from the file handle leak, there are MANY resource leaks in dapl.

I also confirmed that most of the resource leaks are in the dapl test programs, not in the dapl libraries.

Hi, Alex

http://people.redhat.com/honli/dapl/

This workaround fixes the /dev/infiniband/uverbsX file leak issue for me. Could you please test it?

Thanks

Hi Hong,

I will need a build for s390x (IBM LinuxONE server hardware) in order to test. You can get a free instance of a RHEL virtual server and make the build here: https://developer.ibm.com/linuxone/

Thanks

(In reply to alex.osadchyy from comment #17)
> Hi Hong,
> I will need a build for s390x (IBM LinuxONE server hardware) in order to

http://people.redhat.com/honli/dapl/s390x/

I built dapl for s390x with our internal brew system and uploaded it to our HTTP server. Please download and test it.

I installed the patch on my systems and confirmed the fix works. Can you share the plans for including it in RHEL 7.6 and 7.7?

(In reply to alex.osadchyy from comment #19)
> I installed the patch on my systems and confirmed the fix works. Can you
> share the plans for including it in RHEL 7.6 and 7.7?

Red Hat QE confirmed that the workaround for this bug passed our internal dapl regression test suite. Before we fix this for RHEL-7.6 and RHEL-7.7 via z-stream, this issue must be fixed for RHEL-7.8. I have set the 'blocker?' flag for the RHEL-7.8 request; the management team will review it.

For the RHEL-7.8 blocker request and the RHEL-7.6/7.7 z-stream requests, we need a business justification, so please provide one ASAP.

Thanks

Regarding the justification: this bug blocks the enablement of IBM Db2 pureScale v11.5 on the IBM Z platform.
The IBM Db2 pureScale environment might help reduce the risk and cost associated with growing a distributed database solution by providing extreme capacity and application transparency. The Db2 pureScale environment is designed for continuous availability and is capable of exceeding even the strictest industry standards.

Hong,
Could you also confirm the following two items:
1. When will RHEL-7.8 become GA?
2. Is the current dapl patch at GA level?

Thanks

(In reply to alex.osadchyy from comment #22)
> Regarding the justification: this bug blocks the enablement of IBM Db2
> pureScale v11.5 on the IBM Z platform. The IBM Db2 pureScale environment
> might help reduce the risk and cost associated with growing a distributed
> database solution by providing extreme capacity and application transparency.
> The Db2 pureScale environment is designed for continuous availability and is
> capable of exceeding even the strictest industry standards.
>
> Hong,
> Could you also confirm the following two items:
> 1. When will RHEL-7.8 become GA?

The RHEL schedule is confidential; I cannot discuss it in public. Please ask your IBM partner manager, who may know how to get you that information.

> 2. Is the current dapl patch at GA level?

To be honest, I'm not 100% sure. I submitted the workaround to the upstream dapl maintainer, arlin.r.davis, to discuss this issue, but I did not get a reply; I think my email was ignored because dapl is dead upstream.

I compared our internal dapl regression test suite, which is based on dapl-utils, with and without the workaround; no obvious regression was introduced by the workaround. The test suite was run over all the RDMA hardware we have, which includes InfiniBand/OPA/iWARP/RoCE devices. If you want a GA-level test, you should test this with DB2 yourself; we don't have DB2 to test this workaround with.

As you said "This bug blocks the enablement of IBM Db2 pureScale v11.5 on IBM Z platform.", are you positively sure this issue is the ONLY issue blocking Db2 pureScale? It does not seem like a good idea to build a new program/feature on top of a dead upstream library.

Thanks

Regarding the GA level of the dapl patch: can you publish the actual modification so we can try to re-build/review the code? Perhaps, if there is no response/acceptance from the maintainer, there is a way to make a pull request or create a branch against the dapl code. That way it will be accessible and whoever needs the fix will be able to reference it.

We're considering alternatives, but it's not an immediate solution.

(In reply to alex.osadchyy from comment #24)
> Thanks
> Regarding the GA level of the dapl patch: can you publish the actual
> modification so we can try to re-build/review the code?

Sure, I can publish the actual patch. The problem is: send the patch to whom, or to which mailing list?

> Perhaps, if there is no response/acceptance from the maintainer, there is a
> way to make a pull request or create a branch against the dapl code.

Unfortunately, the upstream git repo can only be accessed via a web browser. That means we can't clone it or file a pull request.

https://www.openfabrics.org/downloads/dapl/

There is a URL, which points to the dapl git repo, in the last line of that web page.

tmp]$ git clone http://git.openfabrics.org/~ardavis/dapl.git
Cloning into 'dapl'...
fatal: repository 'http://git.openfabrics.org/~ardavis/dapl.git/' not found
tmp]$ git clone https://git.openfabrics.org/~ardavis/dapl.git
Cloning into 'dapl'...
fatal: repository 'https://git.openfabrics.org/~ardavis/dapl.git/' not found

> That way it will be accessible and whoever needs the fix will be able to
> reference it.
>
> We're considering alternatives, but it's not an immediate solution.

Are there any Red Hat mailing lists where you can publish? There are some active mailing lists on openfabrics.org. Previously dapl discussions were in the general section, but it's archived now. Perhaps publishing to any active list there would still work.
https://lists.openfabrics.org/mailman/listinfo

(In reply to alex.osadchyy from comment #27)
> Are there any Red Hat mailing lists where you can publish? There are some
> active mailing lists on openfabrics.org. Previously dapl discussions were in
> the general section, but it's archived now. Perhaps publishing to any active
> list there would still work.
> https://lists.openfabrics.org/mailman/listinfo

I scanned ALL the thread subjects of all the open mailing lists. It seems those mailing lists are not the right channel for upstreaming the patch.

1) As you said, they are archived now; this is because upstream development work has migrated to GitHub. Most of the mailing lists are really quiet.
2) The resource leak is a general problem. It is not OFED specific, so it is likely to be ignored there, as previous attempts were.
3) Very few dapl-related questions, no more than 5, have been asked on those mailing lists, and most of them were ignored. The last one was opened in 2014.
4) I tried to subscribe to two of the mailing lists but never got a reply after 24+ hours.

As the source of the resource leak is in librdmacm, I will try asking on the RDMA mailing list <linux-rdma.org>.

I managed to reproduce this without dapl. That means the resource leak is a librdmacm issue.
[root@rdma-dev-00 cm2]$ sh build.sh
+ rm -f libofa.so libofa.o
+ gcc -fPIC -g -c -o libofa.o libofa.c
+ gcc -shared -fPIC -g -Wl,-init,test_init -Wl,-fini,test_fini -lrdmacm -o libofa.so libofa.o
+ gcc -ldl -g -o test.exe test.c
+ ip addr show mlx4_ib0
+ grep -w inet
inet 172.31.0.230/24 brd 172.31.0.255 scope global dynamic noprefixroute mlx4_ib0
+ ./test.exe 172.31.0.230
dlopen librdmacm.so done
dlopen librdmacm.so done
dlopen librdmacm.so done
dlopen librdmacm.so done
=== ls -l /proc/20221/fd
total 0
lrwx------. 1 root root 64 Jan 22 10:10 0 -> /dev/pts/1
lrwx------. 1 root root 64 Jan 22 10:10 1 -> /dev/pts/1
lrwx------. 1 root root 64 Jan 22 10:10 10 -> /dev/infiniband/uverbs0 <--- leak
lr-x------. 1 root root 64 Jan 22 10:10 11 -> 'anon_inode:[infinibandevent]'
lrwx------. 1 root root 64 Jan 22 10:10 2 -> /dev/pts/1
lrwx------. 1 root root 64 Jan 22 10:10 4 -> /dev/infiniband/uverbs0 <-- leak
lr-x------. 1 root root 64 Jan 22 10:10 5 -> 'anon_inode:[infinibandevent]'
lrwx------. 1 root root 64 Jan 22 10:10 6 -> /dev/infiniband/uverbs0 <-- leak
lr-x------. 1 root root 64 Jan 22 10:10 7 -> 'anon_inode:[infinibandevent]'
lrwx------. 1 root root 64 Jan 22 10:10 8 -> /dev/infiniband/uverbs0 <-- leak
lr-x------. 1 root root 64 Jan 22 10:10 9 -> 'anon_inode:[infinibandevent]'
[root@rdma-dev-00 cm2]$
[root@rdma-dev-00 cm2]$ cat build.sh
#!/bin/bash
set -x
rm -f libofa.so libofa.o
gcc -fPIC -g -c -o libofa.o libofa.c
gcc -shared -fPIC -g -Wl,-init,test_init -Wl,-fini,test_fini -lrdmacm -o libofa.so libofa.o
gcc -ldl -g -o test.exe test.c
ip addr show mlx4_ib0 | grep -w inet
#LD_DEBUG=libs ./test.exe 172.31.0.230
./test.exe 172.31.0.230
[root@rdma-dev-00 cm2]$ cat libofa.c
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>
#include <arpa/inet.h>
#include <rdma/rdma_cma.h>
#include <limits.h>
#include <sys/types.h>
#include <unistd.h>
static void *handle;
void test_init(void)
{
handle = dlopen("/usr/lib64/librdmacm.so", RTLD_NOW | RTLD_GLOBAL);
if (!handle)
printf("dlopen /usr/lib64/librdmacm.so failed\n");
else
printf("dlopen librdamcm.so done\n");
}
void test_fini(void)
{
if (handle)
dlclose(handle);
handle = NULL;
}
void test(char *ipoib_ip)
{
#if 1
int ret;
struct rdma_cm_id *id;
struct sockaddr_in ipoib_addr;
struct rdma_event_channel *ch;
void *handle;
memset(&ipoib_addr, 0, sizeof(ipoib_addr));
ipoib_addr.sin_family = AF_INET;
ipoib_addr.sin_port = 5555;
#if 1
ret = inet_pton(AF_INET, ipoib_ip, (void *)&(ipoib_addr.sin_addr));
if (ret != 1)
printf("inet_pton failed\n");
#else
ipoib_addr.sin_addr.s_addr=htonl(INADDR_ANY);
#endif
ch = rdma_create_event_channel();
if (ch == NULL)
printf("rdma_create_event_channel failed\n");
ret = rdma_create_id(ch, &id, NULL, RDMA_PS_TCP);
if (ret != 0)
printf("rdma_create_id failed\n");
ret = rdma_bind_addr(id, (struct sockaddr *) &ipoib_addr);
if (ret != 0)
printf("rdma_bind_addr failed\n");
#if DEBUG
printf("befora call rdma_destroy_id\n");
getchar();
#endif
ret = rdma_destroy_id(id);
if (ret != 0)
printf("rdma_destroy_id failed\n");
#if DEBUG
printf("before call rdma_destroy_event_channel\n");
getchar();
#endif
rdma_destroy_event_channel(ch);
#if DEBUG
printf("after call rdma_destroy_event_channle\n");
getchar();
#endif
#else
printf("xxx %s:%s\n", __FILE__, __func__);
#endif
}
[root@rdma-dev-00 cm2]$ cat test.c
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <string.h>
typedef void (* DUMMY_TEST_FUNC) (char *);
int main(int argc, char **argv)
{
DUMMY_TEST_FUNC sym;
void *handle;
int i;
pid_t cpid, ppid;
int wstatus;
char path[128];
if (argc != 2) {
printf("usage: %s IPoIB_IP_ADDR\n", argv[0]);
return 1;
}
for (i = 0; i < 4; i++) {
handle = dlopen("./libofa.so", RTLD_NOW | RTLD_GLOBAL);
sym = dlsym(handle, "test");
sym(argv[1]);
dlclose(handle);
}
cpid = fork();
if (cpid == 0) { /* child */
ppid = getppid();
memset(path, 0, 128);
sprintf(path, "/proc/%d/fd", ppid);
printf("=== ls -l %s\n", path);
execl("/usr/bin/ls", "/usr/bin/ls", "-l", path, (char *)NULL);
} else {
waitpid(cpid, &wstatus, 0);
}
return 0;
}
https://www.spinics.net/lists/linux-rdma/msg88399.html

librdmacm was designed to be loaded once and only unloaded when the process exits. Will apply the dlclose workaround for dapl.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1192

Clearing the "7.6.z?", "7.7.z?" request flags. The bug does not appear to meet the EUS inclusion criteria.
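For context on the "dlclose workaround" mentioned above: because librdmacm's per-load initialization is what opens the /dev/infiniband/uverbsX descriptors, one way to stop them accumulating is to make sure the library is loaded only once and stays loaded until process exit. The sketch below only illustrates that idea; pin_librdmacm() is a hypothetical name, and this is not claimed to be the actual change shipped in dapl-2.1.5-3.el7.

#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: take an extra dlopen() reference on librdmacm and
   never drop it, so a later dlclose() of a DAPL provider cannot unload the
   library, and a subsequent open cannot re-run its initialization (which
   is where the uverbs descriptors are opened). */
static void pin_librdmacm(void)
{
    static void *handle;

    if (handle)
        return; /* already pinned */
    handle = dlopen("librdmacm.so.1", RTLD_NOW | RTLD_GLOBAL);
    if (!handle)
        fprintf(stderr, "pin_librdmacm: dlopen failed: %s\n", dlerror());
    /* handle is intentionally never passed to dlclose(). */
}

Such a helper would be called once from the provider's initialization path, and the program linked with -ldl.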
Description of problem:
Sequential execution of UDAPL API calls - open/close of a RoCE port - breaks after 28 iterations. This indicates that the close call does not actually release the connection. Tested and observed on IBM Z (s390x); however, the connection leak does not appear to be architecture specific and must exist on x86 as well. A similar test was performed with verbs API calls using ibv_open_device / ibv_close_device; no error was observed with 60 iterations.

Version-Release number of selected component (if applicable):
dapl 2.1.5-2.el7

How reproducible:
UDAPL code fails after 28 open/close iterations

for( int i = 0 ; i < 60 ; i++ )
{
    DAT_IA_HANDLE iaHandle = DAT_HANDLE_NULL;
    DAT_EVD_HANDLE evdHandle = DAT_HANDLE_NULL;
    cout << "open number " << i << endl ;
    status = dat_ia_open(gDevName, SVR_EVD_QLEN, &evdHandle, &iaHandle);
    if (DAT_SUCCESS != (status = dat_ia_close(iaHandle, DAT_CLOSE_GRACEFUL_FLAG) ))
    {
        printError("dat_ia_close", status);
        return 1;
    }
}

./UdaplUtility ofa-v2-roe0
open number 0
open number 1
open number 2
open number 3
...
open number 27
open number 28
open number 29
host1:CMA:747b:a4377720: 3452 us(3452 us): open_hca: rdma_bind ERR No such device. Is enP303p0s0.66 configured as IPoIB?
failure: dat_ia_open 0x120000

Steps to Reproduce:
1. Start the process
2. Open the RoCE port via a dat_ia_open() call
3. Close the RoCE port via a dat_ia_close() call
4. Repeat from step 2, 60 times

Actual results:
UDAPL code fails after 28 open/close iterations

Expected results:
Since the connection is closed, there should be no limit on how many consecutive open/close iterations can be executed successfully.

Additional info:
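For comparison, here is a minimal sketch of the verbs-level loop referred to above (ibv_open_device / ibv_close_device), which did not show the problem. It is an illustrative reconstruction, not the reporter's actual UdaplUtility code; it simply opens and closes the first device returned by ibv_get_device_list(), and can be built with something like "gcc -o verbs_loop verbs_loop.c -libverbs".

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int i, num;
    struct ibv_device **list = ibv_get_device_list(&num);

    if (!list || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }
    for (i = 0; i < 60; i++) {
        /* Open and immediately close the device; unlike the UDAPL loop
           above, this does not accumulate file descriptors. */
        struct ibv_context *ctx = ibv_open_device(list[0]);
        if (!ctx) {
            fprintf(stderr, "ibv_open_device failed at iteration %d\n", i);
            ibv_free_device_list(list);
            return 1;
        }
        ibv_close_device(ctx);
    }
    ibv_free_device_list(list);
    printf("60 open/close iterations completed\n");
    return 0;
}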