Created attachment 493053 [details]
valgrind results

Description of problem:
valgrind output for a GSSAPI request against an MMR-enabled server shows memory leaks?

Version-Release number of selected component (if applicable):
389-ds-base-1.2.8.2-1.el5

How reproducible:
Run the server under valgrind (per the instructions in bug 694195).

Steps to Reproduce:
1. Set up a pair of LDAP servers with MMR between them (plain auth via plain LDAP for that). SELinux is disabled, avoiding the memory leak from that:
   # getenforce
   Disabled
2. Launch the server under valgrind.
3. Hit the server with a single read-only GSSAPI request.

Actual results:
See the attachment for the valgrind output.

Expected results:
No reported memory leaks from valgrind? (Or is the output normal?)
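For reference, a rough sketch of the sort of invocation used for step 2. The exact procedure is whatever bug 694195 describes; the instance name (slapd-master1), pid file path, and log path below are assumptions for illustration only:

# Stop the instance, then run ns-slapd in the foreground under valgrind.
# -D points at the instance config directory, -i at its pid file,
# -d 0 keeps it in the foreground with no extra debug output.
service dirsrv stop master1
valgrind --leak-check=full --num-callers=40 \
         --log-file=/var/log/dirsrv/valgrind-slapd.log \
         /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-master1 \
                            -i /var/run/dirsrv/slapd-master1.pid -d 0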
Created attachment 493054 [details]
valgrind results from single GSSAPI write that was replicated to other master
Comment on attachment 493054 [details]
valgrind results from single GSSAPI write that was replicated to other master

This looks ok. Not sure why, but libdb gets confused about some of the contents of the buffers and thinks they are uninitialized. We probably don't memset(something, 0, sizeof(something)). These are harmless.

The memory leaks, while they appear daunting, are all from init or start functions, meaning they are memory the server allocates once at startup for permanent, global data and simply expects exit() to return to the OS. That is fine, except valgrind doesn't know that.
Comment on attachment 493053 [details]
valgrind results

These are all startup/init memory "leaks" too, nothing to worry about.
Hmm, well, even with selinux Disabled, ns-slapd (389-ds-base-1.2.8.2-1.el5) eventually runs out of memory (all 3G for a 32-bit process) and dies:

[27/Apr/2011:15:52:25 +0000] memory allocator - calloc of 1026 elems of 4 bytes failed; OS error 12 (Cannot allocate memory)
The server has probably allocated all available virtual memory. To solve this problem, make more virtual memory available to your server, or reduce one or more of the following server configuration settings:
  nsslapd-cachesize        (Database Settings - Maximum entries in cache)
  nsslapd-cachememsize     (Database Settings - Memory available for cache)
  nsslapd-dbcachesize      (LDBM Plug-in Settings - Maximum cache size)
  nsslapd-import-cachesize (LDBM Plug-in Settings - Import cache size).
Can't recover; calling exit(1).

I'll try another valgrind run.
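For what it's worth, the settings named in that message can also be read back from the running server rather than from dse.ldif. A sketch, assuming OpenLDAP client tools, a Directory Manager bind, and the default userRoot backend name:

# Global LDBM caches (db cache, import cache):
ldapsearch -x -D "cn=Directory Manager" -W \
    -b "cn=config,cn=ldbm database,cn=plugins,cn=config" \
    nsslapd-dbcachesize nsslapd-import-cachesize

# Per-backend entry/DN caches for userRoot:
ldapsearch -x -D "cn=Directory Manager" -W \
    -b "cn=userRoot,cn=ldbm database,cn=plugins,cn=config" \
    nsslapd-cachesize nsslapd-cachememsize nsslapd-dncachememsize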
Created attachment 496407 [details]
valgrind (briefly) from production server

Valgrind result from running a production server briefly under valgrind.

Tracking memory usage over time on the two servers points to the memory leak being proportional to the number of clients connected: that is, the rate of leak is constant until one of the servers runs out of memory and fails, sending more clients to the other server, whose rate of leak then increases to some new constant slope.
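For anyone reproducing this, a minimal sketch of one way to sample ns-slapd memory usage over time (the interval and output path here are arbitrary choices, not what was necessarily used above):

# Append a timestamped RSS/VSZ sample (in KiB) for ns-slapd every 5 minutes.
while true; do
    echo "$(date '+%Y-%m-%d %H:%M:%S') $(ps -o rss=,vsz= -C ns-slapd)" \
        >> /var/log/ns-slapd-mem.log
    sleep 300
done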
(In reply to comment #6)
> Created attachment 496407 [details]
> valgrind (briefly) from production server

These are interesting. When you do a search and the entry is not in the entry cache, the server reads the entry into the entry cache. The entry structure is quite complex, with lots of nested lists and substructures, so that's what all of the mallocs triggered by op_shared_search are. Not sure if these are normal or if something else is going on.

What are your nsslapd-cachememsize settings? This looks like a 32-bit machine. How much RAM do you have?
Cache settings are at the as-shipped defaults.

# grep memsize /etc/dirsrv/slapd-*/dse.ldif
/etc/dirsrv/slapd-cfg1/dse.ldif:nsslapd-cachememsize: 10485760
/etc/dirsrv/slapd-cfg1/dse.ldif:nsslapd-dncachememsize: 10485760
/etc/dirsrv/slapd-cfg1/dse.ldif:nsslapd-cachememsize: 10485760
/etc/dirsrv/slapd-cfg1/dse.ldif:nsslapd-dncachememsize: 10485760
/etc/dirsrv/slapd-master1/dse.ldif:nsslapd-cachememsize: 10485760
/etc/dirsrv/slapd-master1/dse.ldif:nsslapd-dncachememsize: 10485760

# free -m
             total       used       free     shared    buffers     cached
Mem:          3032       1538       1494          0        186        439

Of which ns-slapd will inch up to ~2943 MB and then die.
What is the size of your /var/lib/dirsrv/slapd-master1/db/userRoot/id2entry.db4 and your /var/lib/dirsrv/slapd-master1/db/userRoot/entryrdn.db4?
# ls -l /var/lib/dirsrv/slapd-master1/db/userRoot/id2entry.db4 /var/lib/dirsrv/slapd-master1/db/userRoot/entryrdn.db4
-rw------- 1 nobody nobody  1384448 May  3 03:43 /var/lib/dirsrv/slapd-master1/db/userRoot/entryrdn.db4
-rw------- 1 nobody nobody 13533184 May  3 03:43 /var/lib/dirsrv/slapd-master1/db/userRoot/id2entry.db4
(In reply to comment #10)
> # ls -l /var/lib/dirsrv/slapd-master1/db/userRoot/id2entry.db4 /var/lib/dirsrv/slapd-master1/db/userRoot/entryrdn.db4
> -rw------- 1 nobody nobody  1384448 May  3 03:43 /var/lib/dirsrv/slapd-master1/db/userRoot/entryrdn.db4
> -rw------- 1 nobody nobody 13533184 May  3 03:43 /var/lib/dirsrv/slapd-master1/db/userRoot/id2entry.db4

Could be a memory leak which occurs when entries are removed from the cache. To see if this is the issue, and to improve performance in general, you should increase the nsslapd-cachememsize for your userRoot database to at least 2 times the size of your userRoot/id2entry.db4 file. Then monitor the cache usage to make sure you eventually get to near 100 percent cache hits. Note that the cache hit ratio starts out at 0: when the server starts, the cache is empty, and entries are pulled in as search requests come in. If the cache is large enough to hold all entries, eventually all entries will be in the cache and the cache hit ratio will approach 100%.

http://docs.redhat.com/docs/en-US/Red_Hat_Directory_Server/8.2/html-single/Administration_Guide/index.html#Monitoring_Server_and_Database_Activity-Monitoring_Server_Activity
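A sketch of checking that from the command line; the backend monitor entry exposes counters along these lines (assuming the default userRoot backend, OpenLDAP client tools, and a Directory Manager bind):

# Entry cache statistics for the userRoot backend.
ldapsearch -x -D "cn=Directory Manager" -W \
    -b "cn=monitor,cn=userRoot,cn=ldbm database,cn=plugins,cn=config" -s base \
    entrycachehits entrycachetries entrycachehitratio \
    currententrycachesize maxentrycachesize currententrycachecount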
Increased nsslapd-cachememsize on one of the servers to 41943040; the entry cache size is reported as 18041151, and it now has about double the hit ratio of the non-increased server. Too soon to tell if the memory usage has changed.

Checked the backups of the previous fedora-ds 1.0.4 servers: they had the default cachememsize of 10485760, userRoot/id2entry.db4 files in the 81477632 size range, and no memory leak. (But they're shut down, so I can't observe run-time stats on them.)
Could be a memory leak bug introduced in 389 1.1 or so, which is why you don't see it in 1.0.4. Note that if you are using replication, you may want to increase the cache size to 3 * id2entry to handle the additional overhead.
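A sketch of applying that change online with ldapmodify, assuming the userRoot backend and OpenLDAP client tools; 41943040 bytes is a little over 3x the 13533184-byte id2entry.db4 from comment 10. Depending on the server version, a restart may be needed before the new size takes effect.

ldapmodify -x -D "cn=Directory Manager" -W <<EOF
dn: cn=userRoot,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-cachememsize
nsslapd-cachememsize: 41943040
EOF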
Memory growth has flattened off, so I applied the increased size (4 * default, which is slightly more than 3 * id2entry). A warning in the error log might be in order if this setting needs to be increased; or could the server tune itself?
(In reply to comment #14)
> Memory growth has flattened off, so I applied the increased size (4 * default,
> which is slightly more than 3 * id2entry). A warning in the error log might be
> in order if this setting needs to be increased; or could the server tune itself?

A warning if cache usage is exceeded? That's a good idea. Auto-tuning - how would that work, given RAM constraints?
(In reply to comment #7)
> (In reply to comment #6)
> > Created attachment 496407 [details]
> > valgrind (briefly) from production server
>
> These are interesting. When you do a search and the entry is not in the entry
> cache, the server reads the entry into the entry cache. The entry structure is
> quite complex, with lots of nested lists and substructures, so that's what all
> of the mallocs triggered by op_shared_search are. Not sure if these are normal
> or if something else is going on.

This is normal. We don't clear the caches when shutting down the server. We do have code to clear them in dblayer_instance_close, but it is not enabled by default; to enable it, you have to rebuild DS with the _USE_VALGRIND macro defined. The code is used just for debugging purposes:

int dblayer_instance_close(backend *be)
{
    DB *pDB = NULL;
    int return_value = 0;
    DB_ENV *env = 0;
    ldbm_instance *inst = (ldbm_instance *)be->be_instance_info;

    if (NULL == inst)
        return -1;

#if defined(_USE_VALGRIND)
    /* When running a memory leak checking tool (e.g., valgrind),
     * enabling this code reduces the noise. */
    LDAPDebug1Arg(LDAP_DEBUG_ANY, "%s: Cleaning up entry cache\n",
                  inst->inst_name);
    cache_clear(&inst->inst_cache, CACHE_TYPE_ENTRY);
    LDAPDebug1Arg(LDAP_DEBUG_ANY, "%s: Cleaning up dn cache\n",
                  inst->inst_name);
    cache_clear(&inst->inst_dncache, CACHE_TYPE_DN);
#endif
Upstream ticket: https://fedorahosted.org/389/ticket/51
This was fixed in 389-ds-base-1.2.10.1-1.fc17. Closing.