Bug 983091 - Memory leak in 389-ds-base 1.2.11.15
Memory leak in 389-ds-base 1.2.11.15
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: 389-ds-base (Show other bugs)
6.4
Unspecified Unspecified
unspecified Severity unspecified
: rc
: ---
Assigned To: Rich Megginson
Sankar Ramalingam
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2013-07-10 09:21 EDT by Rich Megginson
Modified: 2013-11-21 16:10 EST (History)
8 users (show)

See Also:
Fixed In Version: 389-ds-base-1.2.11.15-22.el6
Doc Type: Bug Fix
Doc Text:
Cause: Opening and closing many hundreds of connections to the directory server per minute for a long period of time. Consequence: The directory server memory growth will be unbounded until killed by the OOM killer. Fix: The directory server was not properly cleaning up and freeing the memory used by old connections. The fix is to clean up this memory properly. Result: The directory server memory usage should level off and not increase any more once the entry/dn/db caches are full.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-11-21 16:10:39 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Rich Megginson 2013-07-10 09:21:48 EDT
This bug is created as a clone of upstream ticket:
https://fedorahosted.org/389/ticket/47428

We are experiencing a memory leak when 389 (compiled with openldap support) runs for a very long period of time with many searches and binds.  Running valgrind (two valgrind traces attached), I see the leak appears in liblber.  Specifically, connection_table_new calls ber_sockbuf_alloc, but each new connection also calls ber_sockbuf_add_io which appears to malloc memory.  ber_sockbuf_remove_io is never called, and when 389 shuts down it calls ber_sockbuf_free.  I hackily added a ber_sockbuf_remove_io call to connection_cleanup in connection.c and the memory leak no longer shows.  The attached diff is meant to illustrate the change, definitely not the best way to fix the problem.

3266.log - a run with about 8k searches/binds
3479.log - a run with about 16k searches/binds

Note the:
==3479== 683,920 bytes in 17,098 blocks are still reachable in loss record 3,066 of 3,069
==3479==    at 0x4C279EE: malloc (vg_replace_malloc.c:270)
==3479==    by 0x5EDBFCB: ber_memalloc_x (in /lib64/liblber-2.4.so.2.5.6)
==3479==    by 0x5EDE7C1: ber_sockbuf_add_io (in /lib64/liblber-2.4.so.2.5.6)
==3479==    by 0x418A68: slapd_daemon (daemon.c:2636)
==3479==    by 0x41F16E: main (main.c:1253)

4904.log - run of about 16k connections, no longer containing the above trace
sockbuf.patch - diff adding the remove_io call.
Comment 1 Sankar Ramalingam 2013-07-15 12:38:47 EDT
Is it possible to reproduce this issue on the QE side? If yes, could you add the steps in the bugzilla?
Comment 2 Rich Megginson 2013-07-23 14:02:35 EDT
Steps to reproduce:

1) run top to monitor directory server memory usage
2) run dbmon.sh to monitor server cache usage
3) run ldapsearch/ldclt to open/close connections as quickly as possible

Once cache usage levels off (you don't see any changes to cache usage as shown by dbmon.sh) then monitor the mem% as shown by top.  You should not see the mem% of ns-slapd increase after that.
Comment 6 Ján Rusnačko 2013-10-17 11:37:52 EDT
(In reply to Rich Megginson from comment #2)
> Steps to reproduce:
> 
> 1) run top to monitor directory server memory usage
> 2) run dbmon.sh to monitor server cache usage
> 3) run ldapsearch/ldclt to open/close connections as quickly as possible
> 
> Once cache usage levels off (you don't see any changes to cache usage as
> shown by dbmon.sh) then monitor the mem% as shown by top.  You should not
> see the mem% of ns-slapd increase after that.
Where can I find dbmon.sh ? It`s not installed with DS out of the box. Thank you.
Comment 7 Rich Megginson 2013-10-17 11:44:30 EDT
https://github.com/richm/scripts/wiki/dbmon.sh
Comment 8 Ján Rusnačko 2013-10-18 03:50:39 EDT
With 389-ds-base-1.2.11.15-11.el6.x86_64 and RHEL7 client with ldclt:

[root@localhost ~]# ldclt -h 192.168.122.135 -p 389 -D "cn=tuser1,ou=people,dc=example,dc=com" -w Secret123 -b "ou=people,dc=example,dc=com" -f "cn=tuser2" -e bindeach -e bindonly -e esearch -n 8
ldclt version 4.23
ldclt[2326]: Starting at Fri Oct 18 08:44:24 2013

ldclt[2326]: Average rate: 1598.12/thr  (1278.50/sec), total:  12785
ldclt[2326]: Average rate: 1620.62/thr  (1296.50/sec), total:  12965
ldclt[2326]: Average rate: 1619.88/thr  (1295.90/sec), total:  12959
ldclt[2326]: Average rate: 1589.38/thr  (1271.50/sec), total:  12715
ldclt[2326]: Average rate: 1632.25/thr  (1305.80/sec), total:  13058
ldclt[2326]: Average rate: 1549.62/thr  (1239.70/sec), total:  12397
ldclt[2326]: Average rate: 1562.88/thr  (1250.30/sec), total:  12503
ldclt[2326]: Average rate: 1660.12/thr  (1328.10/sec), total:  13281
^C
ldclt[2326]: Global average rate: 12832.88/thr  (1283.29/sec), total: 102663
ldclt[2326]: Global number times "no activity" reports: never
ldclt[2326]: Global no error occurs during this session.
Catch SIGINT - exit...
ldclt[2326]: Ending at Fri Oct 18 08:45:51 2013
ldclt[2326]: Exit status 0 - No problem during execution.

[jrusnack@dstet ~]$ while true; do sudo pmap -x 2527 | awk '/total/ { print $4 "K" }'; sleep 1; done
39604K
39740K
39804K
39828K
39900K
39972K
40032K
40164K
40248K

Cache haven`t changed (as there was almost nothing to cache).

After upgrade with 389-ds-base-1.2.11.15-29.el6.x86_64:

[root@localhost ~]# ldclt -h 192.168.122.135 -p 389 -D "cn=tuser1,ou=people,dc=example,dc=com" -w Secret123 -b "ou=people,dc=example,dc=com" -f "cn=tuser2" -e bindeach -e bindonly -e esearch -n 8
ldclt version 4.23
ldclt[2338]: Starting at Fri Oct 18 08:48:40 2013

ldclt[2338]: Average rate: 1554.62/thr  (1243.70/sec), total:  12437
ldclt[2338]: Average rate: 1507.75/thr  (1206.20/sec), total:  12062
ldclt[2338]: Average rate: 1551.25/thr  (1241.00/sec), total:  12410
^C
ldclt[2338]: Global average rate: 4613.62/thr  (1230.30/sec), total:  36909
ldclt[2338]: Global number times "no activity" reports: never
ldclt[2338]: Global no error occurs during this session.
Catch SIGINT - exit...
ldclt[2338]: Ending at Fri Oct 18 08:49:20 2013
ldclt[2338]: Exit status 0 - No problem during execution.

[jrusnack@dstet ~]$ while true; do sudo pmap -x 3752 | awk '/total/ { print $4 "K" }'; sleep 1; done
50500K
50500K
50500K
50500K
50500K
50500K
50500K
50500K
50500K
50500K

Verified.
Comment 9 errata-xmlrpc 2013-11-21 16:10:39 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1653.html

Note You need to log in before you can comment on or make changes to this bug.