Created attachment 493053 [details]
valgrind results

Description of problem:
valgrind output for a GSSAPI request against an MMR-enabled server shows memory leaks?

Version-Release number of selected component (if applicable):
389-ds-base-1.2.8.2-1.el5

How reproducible:
Run the server under valgrind (per the instructions in bug 694195).

Steps to Reproduce:
1. Set up a pair of LDAP servers with MMR between them (plain auth via plain LDAP for that). SELinux is disabled, avoiding the memory leak from that:
   # getenforce
   Disabled
2. Launch the server under valgrind.
3. Hit the server with a single read-only GSSAPI request.

Actual results:
See the attachment for the valgrind output.

Expected results:
No reported memory leaks from valgrind? (Or is the output normal?)
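For reference, a rough sketch of the sort of invocation used for step 2. The exact procedure is whatever bug 694195 describes; the instance name (slapd-master1), pid file path, and log path below are assumptions for illustration only:

# Stop the instance, then run ns-slapd in the foreground under valgrind.
# -D points at the instance config directory, -i at its pid file,
# -d 0 keeps it in the foreground with no extra debug output.
service dirsrv stop master1
valgrind --leak-check=full --num-callers=40 \
         --log-file=/var/log/dirsrv/valgrind-slapd.log \
         /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-master1 \
                            -i /var/run/dirsrv/slapd-master1.pid -d 0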
Created attachment 493054 [details]
valgrind results from single GSSAPI write that was replicated to other master
Comment on attachment 493054 [details]
valgrind results from single GSSAPI write that was replicated to other master

This looks ok. Not sure why, but libdb gets confused about some of the contents of the buffers and thinks they are uninitialized. We probably don't memset(something, 0, sizeof(something)). These are harmless.

The memory leaks, while they appear daunting, are all from init or start functions, meaning they are memory the server allocates once at startup for permanent, global data and simply expects exit() to return to the OS. That is fine, except valgrind doesn't know that.
Comment on attachment 493053 [details]
valgrind results

These are all startup/init memory "leaks" too, nothing to worry about.
Hmm, well, even with selinux Disabled, ns-slapd (389-ds-base-1.2.8.2-1.el5) eventually runs out of memory (all 3G for a 32-bit process) and dies:

[27/Apr/2011:15:52:25 +0000] memory allocator - calloc of 1026 elems of 4 bytes failed; OS error 12 (Cannot allocate memory)
The server has probably allocated all available virtual memory. To solve this problem, make more virtual memory available to your server, or reduce one or more of the following server configuration settings:
  nsslapd-cachesize        (Database Settings - Maximum entries in cache)
  nsslapd-cachememsize     (Database Settings - Memory available for cache)
  nsslapd-dbcachesize      (LDBM Plug-in Settings - Maximum cache size)
  nsslapd-import-cachesize (LDBM Plug-in Settings - Import cache size).
Can't recover; calling exit(1).

I'll try another valgrind run.
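For what it's worth, the settings named in that message can also be read back from the running server rather than from dse.ldif. A sketch, assuming OpenLDAP client tools, a Directory Manager bind, and the default userRoot backend name:

# Global LDBM caches (db cache, import cache):
ldapsearch -x -D "cn=Directory Manager" -W \
    -b "cn=config,cn=ldbm database,cn=plugins,cn=config" \
    nsslapd-dbcachesize nsslapd-import-cachesize

# Per-backend entry/DN caches for userRoot:
ldapsearch -x -D "cn=Directory Manager" -W \
    -b "cn=userRoot,cn=ldbm database,cn=plugins,cn=config" \
    nsslapd-cachesize nsslapd-cachememsize nsslapd-dncachememsize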
Created attachment 496407 [details]
valgrind (briefly) from production server

Valgrind result from running a production server briefly under valgrind.

Tracking memory usage over time on the two servers points to the memory leak being proportional to the number of clients connected: that is, the rate of leak is constant until one of the servers runs out of memory and fails, sending more clients to the other server, whose rate of leak then increases to some new constant slope.
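For anyone reproducing this, a minimal sketch of one way to sample ns-slapd memory usage over time (the interval and output path here are arbitrary choices, not what was necessarily used above):

# Append a timestamped RSS/VSZ sample (in KiB) for ns-slapd every 5 minutes.
while true; do
    echo "$(date '+%Y-%m-%d %H:%M:%S') $(ps -o rss=,vsz= -C ns-slapd)" \
        >> /var/log/ns-slapd-mem.log
    sleep 300
done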
(In reply to comment #6)
> Created attachment 496407 [details]
> valgrind (briefly) from production server

These are interesting. When you do a search and the entry is not in the entry cache, the server reads the entry into the entry cache. The entry structure is quite complex, with lots of nested lists and substructures, so that's what all of the mallocs triggered by op_shared_search are. Not sure if these are normal or if something else is going on.

What are your nsslapd-cachememsize settings? This looks like a 32-bit machine. How much RAM do you have?
Cache settings are at the as-shipped defaults.

# grep memsize /etc/dirsrv/slapd-*/dse.ldif
/etc/dirsrv/slapd-cfg1/dse.ldif:nsslapd-cachememsize: 10485760
/etc/dirsrv/slapd-cfg1/dse.ldif:nsslapd-dncachememsize: 10485760
/etc/dirsrv/slapd-cfg1/dse.ldif:nsslapd-cachememsize: 10485760
/etc/dirsrv/slapd-cfg1/dse.ldif:nsslapd-dncachememsize: 10485760
/etc/dirsrv/slapd-master1/dse.ldif:nsslapd-cachememsize: 10485760
/etc/dirsrv/slapd-master1/dse.ldif:nsslapd-dncachememsize: 10485760

# free -m
             total       used       free     shared    buffers     cached
Mem:          3032       1538       1494          0        186        439

Of which ns-slapd will inch up to ~2943 MB and then die.
What is the size of your /var/lib/dirsrv/slapd-master1/db/userRoot/id2entry.db4 and your /var/lib/dirsrv/slapd-master1/db/userRoot/entryrdn.db4?
# ls -l /var/lib/dirsrv/slapd-master1/db/userRoot/id2entry.db4 /var/lib/dirsrv/slapd-master1/db/userRoot/entryrdn.db4
-rw------- 1 nobody nobody  1384448 May  3 03:43 /var/lib/dirsrv/slapd-master1/db/userRoot/entryrdn.db4
-rw------- 1 nobody nobody 13533184 May  3 03:43 /var/lib/dirsrv/slapd-master1/db/userRoot/id2entry.db4
(In reply to comment #10)
> # ls -l /var/lib/dirsrv/slapd-master1/db/userRoot/id2entry.db4 /var/lib/dirsrv/slapd-master1/db/userRoot/entryrdn.db4
> -rw------- 1 nobody nobody  1384448 May  3 03:43 /var/lib/dirsrv/slapd-master1/db/userRoot/entryrdn.db4
> -rw------- 1 nobody nobody 13533184 May  3 03:43 /var/lib/dirsrv/slapd-master1/db/userRoot/id2entry.db4

Could be a memory leak which occurs when entries are removed from the cache. To see if this is the issue, and to improve performance in general, you should increase the nsslapd-cachememsize for your userRoot database to at least 2 times the size of your userRoot/id2entry.db4 file. Then monitor the cache usage to make sure you eventually get to near 100 percent cache hits. Note that the cache hit ratio starts out at 0: when the server starts, the cache is empty, and entries are pulled in as search requests come in. If the cache is large enough to hold all entries, eventually all entries will be in the cache and the cache hit ratio will approach 100%.

http://docs.redhat.com/docs/en-US/Red_Hat_Directory_Server/8.2/html-single/Administration_Guide/index.html#Monitoring_Server_and_Database_Activity-Monitoring_Server_Activity
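A sketch of checking that from the command line; the backend monitor entry exposes counters along these lines (assuming the default userRoot backend, OpenLDAP client tools, and a Directory Manager bind):

# Entry cache statistics for the userRoot backend.
ldapsearch -x -D "cn=Directory Manager" -W \
    -b "cn=monitor,cn=userRoot,cn=ldbm database,cn=plugins,cn=config" -s base \
    entrycachehits entrycachetries entrycachehitratio \
    currententrycachesize maxentrycachesize currententrycachecount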
Increased nsslapd-cachememsize on one of the servers to 41943040; the entry cache size is reported as 18041151, and it now has about double the hit ratio of the non-increased server. Too soon to tell if the memory usage has changed.

Checked the backups of the previous fedora-ds 1.0.4 servers: they had the default cachememsize of 10485760, userRoot/id2entry.db4 files in the 81477632 size range, and no memory leak. (But they're shut down, so I can't observe run-time stats on them.)
Could be a memory leak bug introduced in 389 1.1 or so, which is why you don't see it in 1.0.4. Note that if you are using replication, you may want to increase the cache size to 3 * id2entry to handle the additional overhead.
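A sketch of applying that change online with ldapmodify, assuming the userRoot backend and OpenLDAP client tools; 41943040 bytes is a little over 3x the 13533184-byte id2entry.db4 from comment 10. Depending on the server version, a restart may be needed before the new size takes effect.

ldapmodify -x -D "cn=Directory Manager" -W <<EOF
dn: cn=userRoot,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-cachememsize
nsslapd-cachememsize: 41943040
EOF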
Memory growth has flattened off, so I applied the increased size (4 * default, which is slightly more than 3 * id2entry). A warning in the error log might be in order if this setting needs to be increased; or could the server tune itself?
(In reply to comment #14)
> Memory growth has flattened off, so I applied the increased size (4 * default,
> which is slightly more than 3 * id2entry). A warning in the error log might be
> in order if this setting needs to be increased; or could the server tune itself?

A warning if cache usage is exceeded? That's a good idea. Auto-tuning - how would that work, given RAM constraints?
(In reply to comment #7)
> (In reply to comment #6)
> > Created attachment 496407 [details]
> > valgrind (briefly) from production server
>
> These are interesting. When you do a search and the entry is not in the entry
> cache, the server reads the entry into the entry cache. The entry structure is
> quite complex, with lots of nested lists and substructures, so that's what all
> of the mallocs triggered by op_shared_search are. Not sure if these are normal
> or if something else is going on.

This is normal. We don't clear the caches when shutting down the server. We do have code to clear them in dblayer_instance_close, but it is not enabled by default; to enable it, you have to rebuild DS with the _USE_VALGRIND macro defined. The code is used just for debugging purposes:

int dblayer_instance_close(backend *be)
{
    DB *pDB = NULL;
    int return_value = 0;
    DB_ENV *env = 0;
    ldbm_instance *inst = (ldbm_instance *)be->be_instance_info;

    if (NULL == inst)
        return -1;

#if defined(_USE_VALGRIND)
    /* When running a memory leak checking tool (e.g., valgrind),
     * enabling this code reduces the noise. */
    LDAPDebug1Arg(LDAP_DEBUG_ANY, "%s: Cleaning up entry cache\n",
                  inst->inst_name);
    cache_clear(&inst->inst_cache, CACHE_TYPE_ENTRY);
    LDAPDebug1Arg(LDAP_DEBUG_ANY, "%s: Cleaning up dn cache\n",
                  inst->inst_name);
    cache_clear(&inst->inst_dncache, CACHE_TYPE_DN);
#endif
Upstream ticket: https://fedorahosted.org/389/ticket/51
This was fixed in 389-ds-base-1.2.10.1-1.fc17. Closing.