Bug 983091

Summary: Memory leak in 389-ds-base 1.2.11.15
Product: Red Hat Enterprise Linux 6 Reporter: Rich Megginson <rmeggins>
Component: 389-ds-baseAssignee: Rich Megginson <rmeggins>
Status: CLOSED ERRATA QA Contact: Sankar Ramalingam <sramling>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.4CC: drewskiwooskie, hgraham, jgalipea, jrusnack, nhosoi, nkinder, rmeggins, rvdwees
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 389-ds-base-1.2.11.15-22.el6 Doc Type: Bug Fix
Doc Text:
Cause: Opening and closing many hundreds of connections to the directory server per minute for a long period of time. Consequence: The directory server memory growth will be unbounded until killed by the OOM killer. Fix: The directory server was not properly cleaning up and freeing the memory used by old connections. The fix is to clean up this memory properly. Result: The directory server memory usage should level off and not increase any more once the entry/dn/db caches are full.
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-11-21 21:10:39 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Rich Megginson 2013-07-10 13:21:48 UTC
This bug is created as a clone of upstream ticket:
https://fedorahosted.org/389/ticket/47428

We are experiencing a memory leak when 389 (compiled with openldap support) runs for a very long period of time with many searches and binds.  Running valgrind (two valgrind traces attached), I see the leak appears in liblber.  Specifically, connection_table_new calls ber_sockbuf_alloc, but each new connection also calls ber_sockbuf_add_io which appears to malloc memory.  ber_sockbuf_remove_io is never called, and when 389 shuts down it calls ber_sockbuf_free.  I hackily added a ber_sockbuf_remove_io call to connection_cleanup in connection.c and the memory leak no longer shows.  The attached diff is meant to illustrate the change, definitely not the best way to fix the problem.

3266.log - a run with about 8k searches/binds
3479.log - a run with about 16k searches/binds

Note the:
==3479== 683,920 bytes in 17,098 blocks are still reachable in loss record 3,066 of 3,069
==3479==    at 0x4C279EE: malloc (vg_replace_malloc.c:270)
==3479==    by 0x5EDBFCB: ber_memalloc_x (in /lib64/liblber-2.4.so.2.5.6)
==3479==    by 0x5EDE7C1: ber_sockbuf_add_io (in /lib64/liblber-2.4.so.2.5.6)
==3479==    by 0x418A68: slapd_daemon (daemon.c:2636)
==3479==    by 0x41F16E: main (main.c:1253)

4904.log - run of about 16k connections, no longer containing the above trace
sockbuf.patch - diff adding the remove_io call.

Comment 1 Sankar Ramalingam 2013-07-15 16:38:47 UTC
Is it possible to reproduce this issue on the QE side? If yes, could you add the steps in the bugzilla?

Comment 2 Rich Megginson 2013-07-23 18:02:35 UTC
Steps to reproduce:

1) run top to monitor directory server memory usage
2) run dbmon.sh to monitor server cache usage
3) run ldapsearch/ldclt to open/close connections as quickly as possible

Once cache usage levels off (you don't see any changes to cache usage as shown by dbmon.sh) then monitor the mem% as shown by top.  You should not see the mem% of ns-slapd increase after that.

Comment 6 Ján Rusnačko 2013-10-17 15:37:52 UTC
(In reply to Rich Megginson from comment #2)
> Steps to reproduce:
> 
> 1) run top to monitor directory server memory usage
> 2) run dbmon.sh to monitor server cache usage
> 3) run ldapsearch/ldclt to open/close connections as quickly as possible
> 
> Once cache usage levels off (you don't see any changes to cache usage as
> shown by dbmon.sh) then monitor the mem% as shown by top.  You should not
> see the mem% of ns-slapd increase after that.
Where can I find dbmon.sh ? It`s not installed with DS out of the box. Thank you.

Comment 7 Rich Megginson 2013-10-17 15:44:30 UTC
https://github.com/richm/scripts/wiki/dbmon.sh

Comment 8 Ján Rusnačko 2013-10-18 07:50:39 UTC
With 389-ds-base-1.2.11.15-11.el6.x86_64 and RHEL7 client with ldclt:

[root@localhost ~]# ldclt -h 192.168.122.135 -p 389 -D "cn=tuser1,ou=people,dc=example,dc=com" -w Secret123 -b "ou=people,dc=example,dc=com" -f "cn=tuser2" -e bindeach -e bindonly -e esearch -n 8
ldclt version 4.23
ldclt[2326]: Starting at Fri Oct 18 08:44:24 2013

ldclt[2326]: Average rate: 1598.12/thr  (1278.50/sec), total:  12785
ldclt[2326]: Average rate: 1620.62/thr  (1296.50/sec), total:  12965
ldclt[2326]: Average rate: 1619.88/thr  (1295.90/sec), total:  12959
ldclt[2326]: Average rate: 1589.38/thr  (1271.50/sec), total:  12715
ldclt[2326]: Average rate: 1632.25/thr  (1305.80/sec), total:  13058
ldclt[2326]: Average rate: 1549.62/thr  (1239.70/sec), total:  12397
ldclt[2326]: Average rate: 1562.88/thr  (1250.30/sec), total:  12503
ldclt[2326]: Average rate: 1660.12/thr  (1328.10/sec), total:  13281
^C
ldclt[2326]: Global average rate: 12832.88/thr  (1283.29/sec), total: 102663
ldclt[2326]: Global number times "no activity" reports: never
ldclt[2326]: Global no error occurs during this session.
Catch SIGINT - exit...
ldclt[2326]: Ending at Fri Oct 18 08:45:51 2013
ldclt[2326]: Exit status 0 - No problem during execution.

[jrusnack@dstet ~]$ while true; do sudo pmap -x 2527 | awk '/total/ { print $4 "K" }'; sleep 1; done
39604K
39740K
39804K
39828K
39900K
39972K
40032K
40164K
40248K

Cache haven`t changed (as there was almost nothing to cache).

After upgrade with 389-ds-base-1.2.11.15-29.el6.x86_64:

[root@localhost ~]# ldclt -h 192.168.122.135 -p 389 -D "cn=tuser1,ou=people,dc=example,dc=com" -w Secret123 -b "ou=people,dc=example,dc=com" -f "cn=tuser2" -e bindeach -e bindonly -e esearch -n 8
ldclt version 4.23
ldclt[2338]: Starting at Fri Oct 18 08:48:40 2013

ldclt[2338]: Average rate: 1554.62/thr  (1243.70/sec), total:  12437
ldclt[2338]: Average rate: 1507.75/thr  (1206.20/sec), total:  12062
ldclt[2338]: Average rate: 1551.25/thr  (1241.00/sec), total:  12410
^C
ldclt[2338]: Global average rate: 4613.62/thr  (1230.30/sec), total:  36909
ldclt[2338]: Global number times "no activity" reports: never
ldclt[2338]: Global no error occurs during this session.
Catch SIGINT - exit...
ldclt[2338]: Ending at Fri Oct 18 08:49:20 2013
ldclt[2338]: Exit status 0 - No problem during execution.

[jrusnack@dstet ~]$ while true; do sudo pmap -x 3752 | awk '/total/ { print $4 "K" }'; sleep 1; done
50500K
50500K
50500K
50500K
50500K
50500K
50500K
50500K
50500K
50500K

Verified.

Comment 9 errata-xmlrpc 2013-11-21 21:10:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1653.html