Bug 1138745
Summary: Memory leak during Reliab15 execution

Product: Red Hat Enterprise Linux 6
Component: 389-ds-base
Version: 6.6
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Viktor Ashirov <vashirov>
Assignee: Noriko Hosoi <nhosoi>
QA Contact: Viktor Ashirov <vashirov>
CC: drieden, jgalipea, jkurik, jsvarova, nhosoi, nkinder, rmeggins, salmy, tlavigne
Target Milestone: rc
Target Release: ---
Keywords: Regression, TestBlocker, ZStream
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 389-ds-base-1.2.11.15-46.el6
Doc Type: Bug Fix
Doc Text:
Bug fixes for replication conflict resolution (BZ#1080185) introduced a memory leak, which increased the memory footprint of the Directory Server process. With this update, the leak has been fixed, and the memory usage of the Directory Servers in the replication topology now remains stable under stress.
Clones: 1147092 1147479 (view as bug list)
Bug Blocks: 1147479
Last Closed: 2015-07-22 06:35:14 UTC
Type: Bug
Description
Viktor Ashirov
2014-09-05 14:27:31 UTC
Created attachment 934821 [details]
valgrind-20140905-094633.out
Output of valgrind after running for 20 minutes
Created attachment 934823 [details]
pmap.out
Output of pmap from an x86_64 system with 32 GB of RAM. The ns-slapd process takes up more than 11 GB of memory.
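(Aside, not part of the original report.) A lightweight way to take the same kind of measurement without pmap is to sample the resident set size from /proc. This is a sketch assuming a Linux /proc filesystem; the PID defaults to the current shell here, but against a real instance it would be the ns-slapd PID, e.g. $(pidof ns-slapd).

```shell
#!/bin/sh
# Sketch: sample a process's resident set size to watch for steady growth.
# Assumes a Linux /proc filesystem. PID defaults to the current shell;
# in practice it would be $(pidof ns-slapd).
PID=${1:-$$}
rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$PID/status")
echo "RSS of PID $PID: $rss_kb kB"
```

Running this in a loop (e.g. every minute) and plotting the values makes a leak of this magnitude obvious long before the OOM killer gets involved.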
Comment on attachment 934821 [details]
valgrind-20140905-094633.out
All of these leaks are one time only startup related leaks. You may have to run the server for several hours under valgrind with the reliab15 test in order for any leaks to show up.
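(Aside, not part of the original comment.) One way to act on this advice is to compare the "definitely lost" totals of a short run and a long run: one-time startup leaks report roughly the same total regardless of runtime, while a real leak grows with load. A minimal sketch, using two fabricated summary lines for illustration:

```shell
#!/bin/sh
# Sketch: distinguishing one-time startup leaks from a growing leak by
# comparing the "definitely lost" totals of a short run and a long run.
# The two summary lines below are fabricated for illustration.
short_lost=$(echo "==100== definitely lost: 2,048 bytes in 12 blocks" |
    sed 's/.*definitely lost: \([0-9,]*\) bytes.*/\1/' | tr -d ,)
long_lost=$(echo "==200== definitely lost: 2,908,032 bytes in 30,292 blocks" |
    sed 's/.*definitely lost: \([0-9,]*\) bytes.*/\1/' | tr -d ,)
# A total that keeps climbing with runtime points at a leak on a hot path,
# not a one-time startup allocation.
if [ "$long_lost" -gt "$short_lost" ]; then
    echo "lost bytes grew with runtime: likely a real leak"
fi
```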
Rich, thank you for the clarification. I restarted the server under valgrind; I will post results after ns-slapd crashes.

Hi Viktor, when you run valgrind, could you set the environment variable USE_VALGRIND in your ns-slapd start-up script? (It might be already set. If so, that's great.)

    export USE_VALGRIND=1

If this variable is set, all the entries in the entry caches are freed before the shutdown. (But of course, if you wait until the server crashes due to the memory leak, this setting does not matter...)

Created attachment 934983 [details]
valgrind-20140905-104243.out
Hi Noriko, sorry, I didn't see your comment before I started valgrind :( USE_VALGRIND wasn't set in my start-up script.
The server didn't crash this time, since the test run came to an end before the size of the process reached 4 GB.
Anyway, the valgrind output is attached. Please tell me if it's useful. If not, I will restart Reliab15 from the beginning under valgrind with USE_VALGRIND=1.
Thank you!
Created attachment 935852 [details]
valgrind-20140909-064929.out.bz2

Hi Noriko, this is the output of ns-slapd running under valgrind with USE_VALGRIND=1 on an i386 system. It crashed on memory allocation.

    389-ds-base-debuginfo-1.2.11.15-42.el6.i686
    389-ds-base-libs-1.2.11.15-43.el6.i686
    389-ds-base-1.2.11.15-43.el6.i686

Out of memory:

    [09/Sep/2014:12:22:13 -0400] - libdb: malloc: 8264: Cannot allocate memory
    [09/Sep/2014:12:22:13 -0400] entryrdn-index - _entryrdn_put_data: Adding the child link (C3) failed: Cannot allocate memory (12)
    [09/Sep/2014:12:22:13 -0400] - adding (adding tombstone entryrdn nsuniqueid=3ddad631-383911e4-a20cac43-0b85da4d,uid=emailperson_7529,l=my_l_0,ou=my_ou_0,o=my_suffix.com) failed, err=12 Cannot allocate memory
    [09/Sep/2014:12:22:13 -0400] - libdb: malloc: 32000: Cannot allocate memory
    [09/Sep/2014:12:22:13 -0400] - libdb: PANIC: Cannot allocate memory
    [09/Sep/2014:12:22:13 -0400] - Serious Error---Failed in dblayer_txn_abort, err=-30974 (DB_RUNRECOVERY: Fatal error, run database recovery)
    [09/Sep/2014:12:22:13 -0400] memory allocator - malloc of 5759 bytes failed; OS error 0 (Success) The server has probably allocated all available virtual memory. To solve this problem, make more virtual memory available to your server, or reduce one or more of the following server configuration settings: nsslapd-cachesize (Database Settings - Maximum entries in cache) nsslapd-cachememsize (Database Settings - Memory available for cache) nsslapd-dbcachesize (LDBM Plug-in Settings - Maximum cache size) nsslapd-import-cachesize (LDBM Plug-in Settings - Import cache size). Can't recover; calling exit(1).

And then many valgrind errors.

I'm assuming that the OOM caused the errors. Unfortunately, this means no leaks showed up. In order to catch this leak with valgrind, we are going to have to let the server run for some time, but not long enough to run into the OOM, then do a graceful shutdown of the server (e.g. service dirsrv stop), in order for valgrind to give us a leak report.

Created attachment 935900 [details]
valgrind-20140909-065958.out.bz2

This is the valgrind output after a graceful shutdown of an x86_64 DS instance that was running for the same amount of time as the previous one.

Comment on attachment 935900 [details]
valgrind-20140909-065958.out.bz2

The memory growth is occurring here:

    ==31331== 229,576,917 (2,908,032 direct, 226,668,885 indirect) bytes in 30,292 blocks are definitely lost in loss record 2,102 of 2,105
    ==31331==    at 0x4A057BB: calloc (vg_replace_malloc.c:593)
    ==31331==    by 0x3F42A4BB4B: slapi_ch_calloc (in /usr/lib64/dirsrv/libslapd.so.0.0.0)
    ==31331==    by 0xBE16622: backentry_init (backentry.c:92)
    ==31331==    by 0xBE28CDC: id2entry (id2entry.c:417)
    ==31331==    by 0xBE6FE70: uniqueid2entry (uniqueid2entry.c:86)
    ==31331==    by 0xBE28501: find_entry_internal (findentry.c:228)
    ==31331==    by 0xBE4AECC: ldbm_back_delete (ldbm_delete.c:190)
    ==31331==    by 0x3F42A4F732: ??? (in /usr/lib64/dirsrv/libslapd.so.0.0.0)
    ==31331==    by 0x3F42A4FA70: do_delete (in /usr/lib64/dirsrv/libslapd.so.0.0.0)

and here:

    ==31331== 314,754,403 (2,284,512 direct, 312,469,891 indirect) bytes in 23,797 blocks are definitely lost in loss record 2,103 of 2,105
    ==31331==    at 0x4A057BB: calloc (vg_replace_malloc.c:593)
    ==31331==    by 0x3F42A4BB4B: slapi_ch_calloc (in /usr/lib64/dirsrv/libslapd.so.0.0.0)
    ==31331==    by 0xBE16622: backentry_init (backentry.c:92)
    ==31331==    by 0xBE40005: ldbm_back_add (ldbm_add.c:519)
    ==31331==    by 0x3F42A434AD: ??? (in /usr/lib64/dirsrv/libslapd.so.0.0.0)
    ==31331==    by 0x3F42A448D3: do_add (in /usr/lib64/dirsrv/libslapd.so.0.0.0)

This looks like the entry cache: entries are created in add by backentry_init, then added to the entry cache; similarly, when an entry is already in the database but not in the cache, it is read in by id2entry and cached.
However, if USE_VALGRIND=1 is set in the environment, then the cache should be deleted upon shutdown, and there should not be any leaks reported due to backentry_init. Was USE_VALGRIND=1 set in the environment?

I replaced this line in /usr/sbin/start-dirsrv:

    cd $SERVERBIN_DIR; ./ns-slapd -D $CONFIG_DIR -i $PIDFILE -w $STARTPIDFILE "$@"

with:

    export USE_VALGRIND=1
    cd $SERVERBIN_DIR; valgrind --num-callers=32 --tool=memcheck --leak-check=full --leak-resolution=high ./ns-slapd -D $CONFIG_DIR -i $PIDFILE -w $STARTPIDFILE "$@" -d 0 >> /tmp/valgrind-`date +"%Y%m%d-%H%M%S"`.out 2>&1 &

Then either the cache cleanup code in dblayer_instance_close() is not being called, or there is a really bad leak :-(

Viktor, do you see these messages in your errors log?

    [..] - userRoot: Cleaning up entry cache
    [..] - userRoot: Cleaning up dn cache

Noriko, yes, the attached valgrind output contains them:

    [09/Sep/2014:14:54:05 -0400] - userRoot: Cleaning up entry cache
    [09/Sep/2014:14:54:05 -0400] - userRoot: Cleaning up dn cache

This looks like a really bad leak. A DS dev should work on this ASAP.

Created attachment 937913 [details]
stacktrace.1410856678.txt
rpm -qa | grep 389
389-ds-base-debuginfo-1.2.11.15-46.el6.x86_64
389-ds-base-libs-1.2.11.15-46.el6.x86_64
389-ds-base-1.2.11.15-46.el6.x86_64
9 out of 10 servers crashed after running for 20 minutes. Stacktrace is attached.
Created attachment 939003 [details]
stacktrace.1411061804.txt
One x86_64 server crashed after 2 hours of running reliab15 with a scratch build.
rpm -qa | grep 389
389-ds-base-libs-1.2.11.15-99.el6.x86_64
389-ds-base-1.2.11.15-99.el6.x86_64
389-ds-base-debuginfo-1.2.11.15-99.el6.x86_64
Steps to verify using valgrind:

1. Set up 2-way MMR.
2. Restart the servers under valgrind. Something like this (the command line is based upon the one in the /usr/sbin/start-dirsrv file):

    + export USE_VALGRIND=1
    + cd /usr/sbin; valgrind --tool=memcheck --leak-check=yes --num-callers=64 --log-file=/var/tmp/slapd.vg.$$ ./ns-slapd -D $CONFIG_DIR -i $PIDFILE -w $STARTPIDFILE "$@" -d 0

3. Add an entry, then delete it.
4. Shut down the servers.
5. If ldbm_back_delete is not found in the leaked stack traces in the log file, the bug is verified.

Note: dist-git rhel-6.7 is not ready yet. The patches haven't been pushed to dist-git rhel-6.7.

Created attachment 944320 [details]
valgrind-M1-20141006-183551.out
Hi Noriko,
On the affected build, I was able to find ldbm_back_delete in the logs using the steps in c23 only after multiple adds/deletes (in my case, 100 using ldclt), not just one.
With build 389-ds-base-1.2.11.15-47.el6 there is no mention of ldbm_back_delete in the valgrind log (attached).
Also, Reliab15 finished with good results (no memory leak), hence I mark this bug as VERIFIED.
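(Aside, not part of the original comment.) The pass/fail criterion from the verification steps reduces to grepping the valgrind log for ldbm_back_delete frames. A minimal sketch of that check; the log contents below are fabricated for illustration, and a real run would point LOG at the file written by valgrind's --log-file option:

```shell
#!/bin/sh
# Sketch of the verification check: the fix is confirmed when no leak
# stack in the valgrind log goes through ldbm_back_delete.
# The sample log is fabricated; a real run would use /var/tmp/slapd.vg.<pid>.
LOG=/tmp/slapd.vg.sample
cat > "$LOG" <<'EOF'
==31331==    at 0x4A057BB: calloc (vg_replace_malloc.c:593)
==31331==    by 0xBE16622: backentry_init (backentry.c:92)
EOF
if grep -q ldbm_back_delete "$LOG"; then
    echo "FAIL: leak via ldbm_back_delete still present"
else
    echo "PASS: no ldbm_back_delete frames in leak report"
fi
# prints "PASS: no ldbm_back_delete frames in leak report"
```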
(In reply to Viktor Ashirov from comment #28)
> Created attachment 944320 [details]
> valgrind-M1-20141006-183551.out
>
> Hi Noriko,
>
> on affected build I was able to find ldbm_back_delete in logs using steps in
> c23 only after multiple adds/deletes (in my case 100 using ldclt), not just
> one.
>
> With build 389-ds-base-1.2.11.15-47.el6 there is no mentions of
> ldbm_back_delete in valgrind log (attached).

Viktor, thank you so much for running the version with valgrind and sharing the output with us. I also scanned your output and see no trace of leaks. All are from the initialization or the ACL cache.

Changing the status to MODIFIED since this is a RHEL 6.7 bugzilla. The Z-stream clone for this bugzilla is bug #1147479. Viktor, please add the relevant comments to the Z-stream bugzilla and mark it as Verified.

*** Bug 1200554 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1326.html