Bug 1138745

Summary: Memory leak during Reliab15 execution
Product: Red Hat Enterprise Linux 6 Reporter: Viktor Ashirov <vashirov>
Component: 389-ds-baseAssignee: Noriko Hosoi <nhosoi>
Status: CLOSED ERRATA QA Contact: Viktor Ashirov <vashirov>
Severity: high Docs Contact:
Priority: high    
Version: 6.6CC: drieden, jgalipea, jkurik, jsvarova, nhosoi, nkinder, rmeggins, salmy, tlavigne
Target Milestone: rcKeywords: Regression, TestBlocker, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 389-ds-base-1.2.11.15-46.el6 Doc Type: Bug Fix
Doc Text:
Bug fixes for replication conflict resolution (BZ#1080185) introduced a memory leak bug, which increased the size of the Directory Server. With this update, the memory leak code has been fixed, and the size of the Directory Servers in the replication topology is now stable under the stress.
Story Points: ---
Clone Of:
: 1147092 1147479 (view as bug list) Environment:
Last Closed: 2015-07-22 06:35:14 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1147479    
Attachments:
Description Flags
valgrind-20140905-094633.out
none
pmap.out
none
valgrind-20140905-104243.out
none
valgrind-20140909-064929.out.bz2
none
valgrind-20140909-065958.out.bz2
none
stacktrace.1410856678.txt
none
stacktrace.1411061804.txt
none
valgrind-M1-20141006-183551.out none

Description Viktor Ashirov 2014-09-05 14:27:31 UTC
Description of problem:
Latest Reliab15 run shows that ns-slapd process takes up significant amount of memory during execution. On 32-bit systems it causes ns-slapd to crash with following error in logs:

[04/Sep/2014:03:08:00 -0400] - ber_printf failed
[04/Sep/2014:03:08:00 -0400] replication - conn=14107 op=28145 An error occurred while decoding the replication update control- Modify
[04/Sep/2014:03:08:00 -0400] memory allocator - malloc of 3597 bytes failed; OS error 12 (Cannot allocate memory)
The server has probably allocated all available virtual memory. To solve
this problem, make more virtual memory available to your server, or reduce
one or more of the following server configuration settings:
  nsslapd-cachesize        (Database Settings - Maximum entries in cache)
  nsslapd-cachememsize     (Database Settings - Memory available for cache)
  nsslapd-dbcachesize      (LDBM Plug-in Settings - Maximum cache size)
  nsslapd-import-cachesize (LDBM Plug-in Settings - Import cache size).
Can't recover; calling exit(1).

Version-Release number of selected component (if applicable):
389-ds-base-libs-1.2.11.15-42.el6.x86_64
389-ds-base-1.2.11.15-42.el6.x86_64
and
389-ds-base-libs-1.2.11.15-42.el6.i686
389-ds-base-1.2.11.15-42.el6.i686

Comment 1 Viktor Ashirov 2014-09-05 14:29:14 UTC
Created attachment 934821 [details]
valgrind-20140905-094633.out

Output of valgrind after running for 20 minutes

Comment 2 Viktor Ashirov 2014-09-05 14:33:22 UTC
Created attachment 934823 [details]
pmap.out

Output of pmap from a x86_64-system with 32GB of RAM. ns-slapd process takes up more than 11GB of memory.

Comment 3 Rich Megginson 2014-09-05 14:37:52 UTC
Comment on attachment 934821 [details]
valgrind-20140905-094633.out

All of these leaks are one time only startup related leaks.  You may have to run the server for several hours under valgrind with the reliab15 test in order for any leaks to show up.

Comment 5 Viktor Ashirov 2014-09-05 14:45:23 UTC
Rich, thank you for clarification. I restarted server under valgrind, will post results after ns-slapd crashes.

Comment 6 Noriko Hosoi 2014-09-05 17:15:10 UTC
Hi Viktor, when you run valgrind, could you set this environment variable USE_VALGRIND in your ns-slapd start up script?  (It might be already set.  If so, that's great.)
export USE_VALGRIND=1

If this variable is set, all the entries in the entry caches are freed before the shutdown.

(But of course, if you wait until the server's crash due to the memory leak, this setting does not matter...)

Comment 7 Viktor Ashirov 2014-09-06 08:03:05 UTC
Created attachment 934983 [details]
valgrind-20140905-104243.out

Hi Noriko, sorry, didn't see your comment before I started valgrind :( USE_VALGRIND wasn't set in my start up script. 
Server didn't crash this time, since execution of the test came to an end before size of a process reached 4GB.

Anyway, valgrind output is attached. Please tell me, if it's useful. If not, I will restart Reliab15 from the beginning under valgrind with USE_VALGRIND=1. 

Thank you!

Comment 8 Viktor Ashirov 2014-09-09 18:15:15 UTC
Created attachment 935852 [details]
valgrind-20140909-064929.out.bz2

Hi Noriko, 

this is the output of ns-slapd running under valgrind with USE_VALGRIND=1 on i386 system. It crashed on memory allocation. 

389-ds-base-debuginfo-1.2.11.15-42.el6.i686
389-ds-base-libs-1.2.11.15-43.el6.i686
389-ds-base-1.2.11.15-43.el6.i686

Comment 9 Rich Megginson 2014-09-09 18:25:09 UTC
out of memory:

[09/Sep/2014:12:22:13 -0400] - libdb: malloc: 8264: Cannot allocate memory
[09/Sep/2014:12:22:13 -0400] entryrdn-index - _entryrdn_put_data: Adding the child link (C3) failed: Cannot allocate memory (12)
[09/Sep/2014:12:22:13 -0400] - adding (adding tombstone entryrdn nsuniqueid=3ddad631-383911e4-a20cac43-0b85da4d,uid=emailperson_7529,l=my_l_0,ou=my_ou_0,o=my_suffix.com) failed, err=12 Cannot allocate memory
[09/Sep/2014:12:22:13 -0400] - libdb: malloc: 32000: Cannot allocate memory
[09/Sep/2014:12:22:13 -0400] - libdb: PANIC: Cannot allocate memory
[09/Sep/2014:12:22:13 -0400] - Serious Error---Failed in dblayer_txn_abort, err=-30974 (DB_RUNRECOVERY: Fatal error, run database recovery)
[09/Sep/2014:12:22:13 -0400] memory allocator - malloc of 5759 bytes failed; OS error 0 (Success)
The server has probably allocated all available virtual memory. To solve
this problem, make more virtual memory available to your server, or reduce
one or more of the following server configuration settings:
  nsslapd-cachesize        (Database Settings - Maximum entries in cache)
  nsslapd-cachememsize     (Database Settings - Memory available for cache)
  nsslapd-dbcachesize      (LDBM Plug-in Settings - Maximum cache size)
  nsslapd-import-cachesize (LDBM Plug-in Settings - Import cache size).
Can't recover; calling exit(1).

And then many valgrind errors.  I'm assuming that the OOM caused the errors.  Unfortunately, this caused no leaks to show up.  In order to catch this leak with valgrind, we are going to have to let the server run for some time, but not long enough to run into the OOM, then do a graceful shutdown of the server (e.g. service dirsrv stop), in order for valgrind to give us a leak report.

Comment 10 Viktor Ashirov 2014-09-09 19:15:31 UTC
Created attachment 935900 [details]
valgrind-20140909-065958.out.bz2

This is the valgrind output after graceful shutdown from x86_64 DS instance that was running the same amount of time as previous one.

Comment 11 Rich Megginson 2014-09-09 19:32:37 UTC
Comment on attachment 935900 [details]
valgrind-20140909-065958.out.bz2

The memory growth is occurring here:
==31331== 229,576,917 (2,908,032 direct, 226,668,885 indirect) bytes in 30,292 blocks are definitely lost in loss record 2,102 of 2,105
==31331==    at 0x4A057BB: calloc (vg_replace_malloc.c:593)
==31331==    by 0x3F42A4BB4B: slapi_ch_calloc (in /usr/lib64/dirsrv/libslapd.so.0.0.0)
==31331==    by 0xBE16622: backentry_init (backentry.c:92)
==31331==    by 0xBE28CDC: id2entry (id2entry.c:417)
==31331==    by 0xBE6FE70: uniqueid2entry (uniqueid2entry.c:86)
==31331==    by 0xBE28501: find_entry_internal (findentry.c:228)
==31331==    by 0xBE4AECC: ldbm_back_delete (ldbm_delete.c:190)
==31331==    by 0x3F42A4F732: ??? (in /usr/lib64/dirsrv/libslapd.so.0.0.0)
==31331==    by 0x3F42A4FA70: do_delete (in /usr/lib64/dirsrv/libslapd.so.0.0.0)

and here

==31331== 314,754,403 (2,284,512 direct, 312,469,891 indirect) bytes in 23,797 blocks are definitely lost in loss record 2,103 of 2,105
==31331==    at 0x4A057BB: calloc (vg_replace_malloc.c:593)
==31331==    by 0x3F42A4BB4B: slapi_ch_calloc (in /usr/lib64/dirsrv/libslapd.so.0.0.0)
==31331==    by 0xBE16622: backentry_init (backentry.c:92)
==31331==    by 0xBE40005: ldbm_back_add (ldbm_add.c:519)
==31331==    by 0x3F42A434AD: ??? (in /usr/lib64/dirsrv/libslapd.so.0.0.0)
==31331==    by 0x3F42A448D3: do_add (in /usr/lib64/dirsrv/libslapd.so.0.0.0)

This looks like entry cache - entries are created in add by backentry_init then added to the entry cache - similarly, when an entry is already in the database, but not in the cache, it is read in by id2entry and cached.

However, if USE_VALGRIND=1 is set in the environment, then the cache should be deleted upon shutdown, and there should not be any leaks reported due to backentry_init.  Was USE_VALGRIND=1 set in the environment?

Comment 12 Viktor Ashirov 2014-09-09 19:41:59 UTC
I replaced line in /usr/sbin/start-dirsrv 

cd $SERVERBIN_DIR; ./ns-slapd -D $CONFIG_DIR -i $PIDFILE -w $STARTPIDFILE "$@"

with

export USE_VALGRIND=1
cd $SERVERBIN_DIR; valgrind --num-callers=32 --tool=memcheck --leak-check=full --leak-resolution=high ./ns-slapd -D $CONFIG_DIR -i $PIDFILE -w $STARTPIDFILE "$@" -d 0 >>  /tmp/valgrind-`date +"%Y%m%d-%H%M%S"`.out 2>&1 &

Comment 13 Rich Megginson 2014-09-09 19:45:40 UTC
Then either the cache cleanup code in dblayer_instance_close() is not being called, or there is a really bad leak :-(

Comment 14 Noriko Hosoi 2014-09-09 19:57:16 UTC
Viktor,

Do you see these messages in your errors log?
[..] - userRoot: Cleaning up entry cache
[..] - userRoot: Cleaning up dn cache

Comment 15 Viktor Ashirov 2014-09-09 20:06:27 UTC
Noriko, 

yes, attached valgrind output contains it: 
[09/Sep/2014:14:54:05 -0400] - userRoot: Cleaning up entry cache
[09/Sep/2014:14:54:05 -0400] - userRoot: Cleaning up dn cache

Comment 16 Rich Megginson 2014-09-10 01:57:43 UTC
This looks like a really bad leak.  A DS dev should work on this ASAP.

Comment 19 Viktor Ashirov 2014-09-16 09:07:34 UTC
Created attachment 937913 [details]
stacktrace.1410856678.txt

rpm -qa | grep 389
389-ds-base-debuginfo-1.2.11.15-46.el6.x86_64
389-ds-base-libs-1.2.11.15-46.el6.x86_64
389-ds-base-1.2.11.15-46.el6.x86_64

9 out of 10 servers crashed after running for 20 minutes. Stacktrace is attached.

Comment 22 Viktor Ashirov 2014-09-18 18:05:08 UTC
Created attachment 939003 [details]
stacktrace.1411061804.txt

One x86_64 server crashed after 2 hours of running reliab15 with scratch build

rpm -qa | grep 389
389-ds-base-libs-1.2.11.15-99.el6.x86_64
389-ds-base-1.2.11.15-99.el6.x86_64
389-ds-base-debuginfo-1.2.11.15-99.el6.x86_64

Comment 23 Noriko Hosoi 2014-09-23 00:41:50 UTC
Steps to verify using valgrind.

1. Set up 2-way MMR
2. Restart the servers under valgrind.  Something like this.  (The command line is based upon the one in /usr/sbin/start-dirsrv file.)
  +  export USE_VALGRIND=1
  +  cd /usr/sbin; valgrind --tool=memcheck --leak-check=yes --num-callers=64 --log-file=/var/tmp/slapd.vg.$$ ./ns-slapd -D $CONFIG_DIR -i $PIDFILE -w $STARTPIDFILE "$@" -d 0
3. Add an entry, then delete it.
4. Shutdown the servers.
5. If ldbm_back_delete is not found in the leaked stacktraces in the log-file /, the bug is verified.

Comment 25 Noriko Hosoi 2014-09-26 21:50:54 UTC
Note: dist-git rhel-6.7 is not ready yet.
The patches haven't been pushed to dist-git rhel-6.7.

Comment 28 Viktor Ashirov 2014-10-06 17:43:44 UTC
Created attachment 944320 [details]
valgrind-M1-20141006-183551.out

Hi Noriko, 

on affected build I was able to find ldbm_back_delete in logs using steps in c23 only after multiple adds/deletes (in my case 100 using ldclt), not just one. 

With build 389-ds-base-1.2.11.15-47.el6 there is no mentions of ldbm_back_delete in valgrind log (attached). 

Also, Reliab15 finished with good results (no memory leak), hence I mark this bug as VERIFIED.

Comment 29 Noriko Hosoi 2014-10-06 17:54:36 UTC
(In reply to Viktor Ashirov from comment #28)
> Created attachment 944320 [details]
> valgrind-M1-20141006-183551.out
> 
> Hi Noriko, 
> 
> on affected build I was able to find ldbm_back_delete in logs using steps in
> c23 only after multiple adds/deletes (in my case 100 using ldclt), not just
> one. 
> 
> With build 389-ds-base-1.2.11.15-47.el6 there is no mentions of
> ldbm_back_delete in valgrind log (attached). 

Viktor, thank you so much for running the version with valgrind and sharing the output with us.  I also scanned your output and see no trace of leaks.  All are from the initialization or acl cache.

Comment 30 Sankar Ramalingam 2014-10-07 07:26:03 UTC
Changing the status to MODIFIED since this is RHEL6.7 bugzilla. The Z-stream clone for this bugzilla is bug #1147479. 
Viktor, please add relevant comments to the Z-stream bugzilla and mark it as Verified.

Comment 34 Noriko Hosoi 2015-04-09 17:22:19 UTC
*** Bug 1200554 has been marked as a duplicate of this bug. ***

Comment 36 errata-xmlrpc 2015-07-22 06:35:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1326.html