Bug 666076
Summary: | dirsrv crash (1.2.7.5) with multiple simple paged result searches | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | [Retired] 389 | Reporter: | Adrian Bridgett <adrian> | ||||||||
Component: | Directory Server | Assignee: | Rich Megginson <rmeggins> | ||||||||
Status: | CLOSED ERRATA | QA Contact: | Chandrasekar Kannan <ckannan> | ||||||||
Severity: | medium | Docs Contact: | |||||||||
Priority: | high | ||||||||||
Version: | 1.2.7 | CC: | a3li, andrey.ivanov, benl, jgalipea, jlieskov, ldv, sramling, vdanen | ||||||||
Target Milestone: | --- | ||||||||||
Target Release: | --- | ||||||||||
Hardware: | Unspecified | ||||||||||
OS: | Unspecified | ||||||||||
Whiteboard: | |||||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||||
Doc Text: | Story Points: | --- | |||||||||
Clone Of: | |||||||||||
: | 679495 (view as bug list) | Environment: | |||||||||
Last Closed: | 2011-02-22 17:46:18 UTC | Type: | --- | ||||||||
Regression: | --- | Mount Type: | --- | ||||||||
Documentation: | --- | CRM: | |||||||||
Verified Versions: | Category: | --- | |||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||
Embargoed: | |||||||||||
Bug Depends On: | |||||||||||
Bug Blocks: | 639035, 670914, 679495 | ||||||||||
Attachments: |
|
The search is running into the lookthrough limit and exiting early. I suppose you could work around the problem by increasing your lookthrough limit. I was unable to reproduce by simply making a search hit the limit. The stack trace sure looks like it is doing a free of already freed memory or perhaps uninitialized memory. Looking at the code, I don't know where this could happen. How many entries should be returned from the search? That is, how many match base dc=public,dc=people,dc=ds,dc=example,dc=com scope subtree filter (&(objectClass=inetOrgPerson)(memberOf=*Enterprise*)) ? If you can reproduce it easily, would you be able to run valgrind on this server to get more detailed memory access debugging? It will slow the server somewhat. Hi Rich, thanks for looking at this. We are going to perform a nice big cleanup of LDAP in the not-too-distant future since it's not setup very well ATM. Currently there are 3443 entries returned by the query (a plain ldapsearch is showing success - shouldn't it be reporting that it hit lookthrough limit?) Oddly we also have this set in the DB (would it be useful to attach dse.ldif or a dump of cn=config?): dn: cn=config,cn=ldbm database,cn=plugins,cn=config nsslapd-lookthroughlimit: -1 Under dc=public,dc=people,dc=ds,dc=example,dc=com there are 97617 entries (post-cleanup this will drop probably to those 3443 entries and not much else). We do have scripts which sync other systems to LDAP (as well as syncing other systems from LDAP). One thing I've seen those (legacy) scripts do is update all the attributes (even if they are _already_ set to the correct value) - so there are likely to be lots of updates happening from time to time. We definitely couldn't run valgrind on this box, it's pushed quite hard as it is. I can look at running it on a separate test box and I could run half of those update scripts (the ones that update LDAP - I couldn't run the ones that update other systems from LDAP as we don't have a full test environment yet). It's currently falling over about every two days, looking at exactly when: [root@ds2 core]# ls -l total 687192 -rw------- 1 root nobody 623874048 Jan 6 02:24 core.ns-slapd.99.12161.6.1294280650 -rw------- 1 root nobody 563785728 Jan 4 02:14 core.ns-slapd.99.21142.11.1294107249 -rw------- 1 root nobody 615436288 Jan 3 15:24 core.ns-slapd.99.22195.6.1294068249 -rw------- 1 root nobody 562569216 Jan 3 19:24 core.ns-slapd.99.2607.11.1294082649 [root@ds2 core]# rm core.ns-slapd.99.2607.11.1294082649 core.ns-slapd.99.22195.6 Sadly there's nothing in the update script logs to reveal what might be the trigger. (In reply to comment #3) > Hi Rich, thanks for looking at this. We are going to perform a nice big > cleanup of LDAP in the not-too-distant future since it's not setup very well > ATM. Currently there are 3443 entries returned by the query (a plain > ldapsearch is showing success - shouldn't it be reporting that it hit > lookthrough limit?) It depends on what you mean by "plain ldapsearch". The search filter is the critical factor in determining when the lookthrough limit is hit. If by "plain" you mean "running ldapsearch without specifying a search filter" then the default search filter of "objectclass=*" is used. Are you running the search as directory manager or a regular user? lookthrough limit does not apply to directory manager. > > Oddly we also have this set in the DB (would it be useful to attach dse.ldif or > a dump of cn=config?): > dn: cn=config,cn=ldbm database,cn=plugins,cn=config > nsslapd-lookthroughlimit: -1 This means to use the default value of lookthrough limit which is 5000. > Under dc=public,dc=people,dc=ds,dc=example,dc=com there are 97617 entries > (post-cleanup this will drop probably to those 3443 entries and not much else). So because of the search filter (&(objectClass=inetOrgPerson)(memberOf=*Enterprise*)) it has to look through all (or at least more than 5000) of the 97617 entries to find a match. > > We do have scripts which sync other systems to LDAP (as well as syncing other > systems from LDAP). One thing I've seen those (legacy) scripts do is update > all the attributes (even if they are _already_ set to the correct value) - so > there are likely to be lots of updates happening from time to time. Can you paste excerpts from your directory server access log from around the time of the crash? We can try to see if there are other operations happening at the time of the search. The core dump did not show any operations other than the search request. The access log is buffered by default for performance reasons, so it may be that what we are looking for was not written to disk before the crash. You can turn off access log buffering, but that will also impact performance somewhat. http://docs.redhat.com/docs/en-US/Red_Hat_Directory_Server/8.2/html/Configuration_and_Command-Line_Tool_Reference/Core_Server_Configuration_Reference.html#cnconfig-nsslapd_accesslog_logbuffering_Log_Buffering and see also http://docs.redhat.com/docs/en-US/Red_Hat_Directory_Server/8.2/html-single/Administration_Guide/index.html#using-named-pipe > > We definitely couldn't run valgrind on this box, it's pushed quite hard as it > is. I can look at running it on a separate test box and I could run half of > those update scripts (the ones that update LDAP - I couldn't run the ones that > update other systems from LDAP as we don't have a full test environment yet). > > It's currently falling over about every two days, looking at exactly when: > [root@ds2 core]# ls -l > total 687192 > -rw------- 1 root nobody 623874048 Jan 6 02:24 > core.ns-slapd.99.12161.6.1294280650 > -rw------- 1 root nobody 563785728 Jan 4 02:14 > core.ns-slapd.99.21142.11.1294107249 > -rw------- 1 root nobody 615436288 Jan 3 15:24 > core.ns-slapd.99.22195.6.1294068249 > -rw------- 1 root nobody 562569216 Jan 3 19:24 > core.ns-slapd.99.2607.11.1294082649 > [root@ds2 core]# rm core.ns-slapd.99.2607.11.1294082649 > core.ns-slapd.99.22195.6 > > Sadly there's nothing in the update script logs to reveal what might be the > trigger. The directory server access log (sans buffering) could tell us what's going on. Do all of these cores have the same thread stack traces? If you like, you can attach the full stack traces from all of these cores to this bug. I'm was running ldapsearch without any filters, not as the directory owner but I an in the ldap administrators group. Thanks for the note about lookthroughlimit - I'm very surprised that's not biting us elsewhere. Annoyingly I even have a comment that says "-1 = unlimited" so it's as if I've read it somewhere but apparently not. I looked at two or three stack traces and they looked identical. I'll go through the access logs next time it crashes. I have a small script I used last time to figure out what was "in flight" at the time of the crash. If nothing shows up I'll turn buffering off and keep a close eye on it! Thanks (In reply to comment #5) > I'm was running ldapsearch without any filters, not as the directory owner but > I an in the ldap administrators group. Ok. So lookthrough limit will apply in that case. > Thanks for the note about lookthroughlimit - I'm very surprised that's not > biting us elsewhere. Annoyingly I even have a comment that says "-1 = > unlimited" so it's as if I've read it somewhere but apparently not. > > I looked at two or three stack traces and they looked identical. I'll go > through the access logs next time it crashes. I have a small script I used > last time to figure out what was "in flight" at the time of the crash. If > nothing shows up I'll turn buffering off and keep a close eye on it! Thanks. > > Thanks Incidentally I found where I'd got the -1=unlimited from - it's here (4.4.1.1) nsLookthroughLimit cn=config,cn=ldbm database,cn=plugins,cn=config -1 to maximum 32-bit integer in entries (where -1 is unlimited) (default=5000) http://docs.redhat.com/docs/en-US/Red_Hat_Directory_Server/8.2/html/Configuration_and_Command-Line_Tool_Reference/Database_Plug_in_Attributes.html nsslapd-lookthroughlimit seems to be only in RHDS 7.1. It's just crashed, the restart script is taking a copy of access now: $ grep err access.apb |grep -v err=0 [07/Jan/2011:14:58:55 +0000] conn=42065 op=0 RESULT err=32 tag=97 nentries=0 etime=0 [07/Jan/2011:14:58:55 +0000] conn=42066 op=0 RESULT err=32 tag=97 nentries=0 etime=0 [07/Jan/2011:14:58:56 +0000] conn=42070 op=0 RESULT err=32 tag=97 nentries=0 etime=0 [07/Jan/2011:14:58:56 +0000] conn=42071 op=0 RESULT err=32 tag=97 nentries=0 etime=0 [07/Jan/2011:14:58:56 +0000] conn=42072 op=0 RESULT err=49 tag=97 nentries=0 etime=0 in order: [07/Jan/2011:14:58:55 +0000] conn=42065 fd=152 slot=152 connection from 94.229.175.69 to 94.229.175.61 [07/Jan/2011:14:58:55 +0000] conn=42065 op=0 BIND dn="uid=letuan,dc=staff,dc=people,dc=ds,dc=example,dc=com" method=128 version=3 [07/Jan/2011:14:58:55 +0000] conn=42065 op=0 RESULT err=32 tag=97 nentries=0 etime=0 [07/Jan/2011:14:58:55 +0000] conn=42065 op=-1 fd=152 closed - B1 [07/Jan/2011:14:58:55 +0000] conn=42066 fd=152 slot=152 connection from 94.229.175.69 to 94.229.175.61 [07/Jan/2011:14:58:55 +0000] conn=42066 op=0 BIND dn="uid=letuan,dc=vendor,dc=people,dc=ds,dc=example,dc=com" method=128 version=3 [07/Jan/2011:14:58:55 +0000] conn=42066 op=0 RESULT err=32 tag=97 nentries=0 etime=0 [07/Jan/2011:14:58:55 +0000] conn=42066 op=-1 fd=152 closed - B1 [07/Jan/2011:14:58:56 +0000] conn=42070 fd=152 slot=152 connection from 94.229.175.69 to 94.229.175.61 [07/Jan/2011:14:58:56 +0000] conn=42070 op=0 BIND dn="uid=sumerjabri,dc=staff,dc=people,dc=ds,dc=example,dc=com" method=128 version=3 [07/Jan/2011:14:58:56 +0000] conn=42070 op=0 RESULT err=32 tag=97 nentries=0 etime=0 [07/Jan/2011:14:58:56 +0000] conn=42070 op=-1 fd=152 closed - B1 [07/Jan/2011:14:58:56 +0000] conn=42071 fd=161 slot=161 connection from 94.229.175.69 to 94.229.175.61 [07/Jan/2011:14:58:56 +0000] conn=42071 op=0 BIND dn="uid=sumerjabri,dc=vendor,dc=people,dc=ds,dc=example,dc=com" method=128 version=3 [07/Jan/2011:14:58:56 +0000] conn=42071 op=0 RESULT err=32 tag=97 nentries=0 etime=0 [07/Jan/2011:14:58:56 +0000] conn=42071 op=-1 fd=161 closed - B1 [07/Jan/2011:14:58:56 +0000] conn=42072 fd=167 slot=167 connection from 94.229.175.69 to 94.229.175.61 [07/Jan/2011:14:58:56 +0000] conn=42072 op=0 BIND dn="uid=sumerjabri,dc=public,dc=people,dc=ds,dc=example,dc=com" method=128 version=3 [07/Jan/2011:14:58:56 +0000] conn=42072 op=0 RESULT err=49 tag=97 nentries=0 etime=0 [07/Jan/2011:14:58:56 +0000] conn=42072 op=-1 fd=167 closed - B1 I've also run my script which prints out all in-flight connections and nothing shows up there. Looks like I'll have to turn log buffering off. short backtrace looks the same: Program terminated with signal 6, Aborted. #0 0x00000032a2630265 in raise () from /lib64/libc.so.6 (gdb) bt #0 0x00000032a2630265 in raise () from /lib64/libc.so.6 #1 0x00000032a2631d10 in abort () from /lib64/libc.so.6 #2 0x00000032a266a84b in __libc_message () from /lib64/libc.so.6 #3 0x00000032a267230f in _int_free () from /lib64/libc.so.6 #4 0x00000032a267276b in free () from /lib64/libc.so.6 #5 0x00002b708b120af6 in slapi_ch_free (ptr=0x490b7cf0) at ldap/servers/slapd/ch_malloc.c:363 #6 0x00002b708ec04a65 in idl_free (idl=0x6363c50) at ldap/servers/slapd/back-ldbm/idl_common.c:94 #7 0x00002b708ec2ce2e in delete_search_result_set (sr=0x490b7d80) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1490 #8 0x00002b708ec2d0e8 in ldbm_back_next_search_entry_ext (pb=0x2aaab12f9540, use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1236 #9 0x00002b708b158e21 in iterate (pb=0x2aaab12f9540, be=0x3bfb1c0, send_result=1, pnentries=0x490ba6a4, pagesize=500, pr_statp=0x490ba698) at ldap/servers/slapd/opshared.c:1130 #10 0x00002b708b159352 in send_results_ext (pb=0x2aaab12f9540, send_result=1, nentries=0x490ba6a4, pagesize=500, pr_stat=0x490ba698) at ldap/servers/slapd/opshared.c:1515 #11 0x00002b708b159c9d in op_shared_search (pb=0x2aaab12f9540, send_result=1) at ldap/servers/slapd/opshared.c:591 #12 0x000000000042513f in do_search (pb=0x2aaab12f9540) at ldap/servers/slapd/search.c:381 ---Type <return> to continue, or q <return> to quit--- #13 0x00000000004135ee in connection_threadmain () at ldap/servers/slapd/connection.c:597 #14 0x00000032a4a284ad in ?? () from /usr/lib64/libnspr4.so #15 0x00000032a2e0673d in start_thread () from /lib64/libpthread.so.0 #16 0x00000032a26d3f6d in clone () from /lib64/libc.so.6 Meh, I forgot to paste in the error log for that latest crash: [07/Jan/2011:14:44:10 +0000] - get_ldapmessage_controls failed: 12 (Unavailable critical extension) (op=Abandon) [07/Jan/2011:14:49:10 +0000] - get_ldapmessage_controls failed: 12 (Unavailable critical extension) (op=Abandon) [07/Jan/2011:14:59:11 +0000] - PR_Write(154) Netscape Portable Runtime error -59 61 (TCP connection reset by peer.) 389-Directory/1.2.7.5 B2010.350.198 ds2.man.alfresco.com:636 (/etc/dirsrv/slapd-ds2) [07/Jan/2011:14:59:13 +0000] - 389-Directory/1.2.7.5 B2010.350.198 starting up [07/Jan/2011:14:59:13 +0000] - Detected Disorderly Shutdown last time Directory Server was running, recovering database. http://markmail.org/message/jyi43y7q2iuo6f7m suggests that the critical extension error would not be responsible for the crash though. That PR_Write error I've not seen in earlier logs either. looks like the crash is caused by a simple paged results search. That's a good clue. The stuff from the access log isn't much help - we know the crash is occurring due to a search request, not a bind request. I think you've done enough for now, thanks. I'll try to reproduce using a simple paged search. I'll let you know if I need any more information. I am still unable to reproduce the circumstances of the crash. I am very puzzled about this from the original full stack trace: #8 0x00002b712e9640e8 in ldbm_back_next_search_entry_ext ( pb=0x2aaaad28aef0, use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1236 be = 0x101cf270 inst = 0x10247fb0 li = 0x10141ea0 scope = 2 managedsait = 0 attr = 0x0 filter = 0x2aaaacb61280 base = 0x2aaaac485830 "dc=public,dc=people,dc=ds,dc=example,dc=com" sr = 0x2aaab0c15780 id = <value optimized out> e = 0x2aaaad0555a0 nentries = 348 curtime = <value optimized out> stoptime = 1293565146 optime = 1293561546 tlimit = 3600 llimit = -1 slimit = 997648 isroot = 0 urls = 0x0 err = 0 basesdn = {flag = 0 '\000', dn = 0x2aaaac485830 "dc=public,dc=people,dc=ds,dc=example,dc=com", ndn = 0x2aaaac485830 "dc=public,dc=people,dc=ds,dc=example,dc=com", ndn_len = 44} target_uniqueid = 0x0 rc = -1 estimate = 0 This is from ldbm_search.c:1236 in the 1.2.7.5 code: /* check lookthrough limit */ if ( llimit != -1 && sr->sr_lookthroughcount >= llimit ) { slapi_send_ldap_result( pb, LDAP_ADMINLIMIT_EXCEEDED, NULL, NULL, nentries, urls ); /* in case paged results, clean up the conn */ pagedresults_set_search_result(pb->pb_conn, NULL); slapi_pblock_set( pb, SLAPI_SEARCH_RESULT_SET, NULL ); slapi_pblock_set( pb, SLAPI_SEARCH_RESULT_SET_SIZE_ESTIMATE, &estimate ); if ( use_extension ) { slapi_pblock_set( pb, SLAPI_SEARCH_RESULT_ENTRY_EXT, NULL ); } slapi_pblock_set( pb, SLAPI_SEARCH_RESULT_ENTRY, NULL ); **** line 1236 delete_search_result_set( &sr ); rc = SLAPI_FAIL_GENERAL; goto bail; } Note that in the stack trace, llimit is -1 - since llimit == -1, the if condition should be FALSE, and this block of code should never be entered. So, one of two things: 1) this is not really 1.2.7.5 - can you confirm with /usr/sbin/ns-slapd -V 2) the stack is being corrupted Could you post a full stack trace from one of the other core dumps? What is your sizelimit? Sure: /usr/sbin/ns-slapd -v 389 Project 389-Directory/1.2.7.5 B2010.350.198 Hmm, the most recent one is different - I'll attach the full version (different backtrace), the short form is: (gdb) bt #0 0x00002aabb7ff83c7 in cache_return (cache=0x1ad5d28, ptr=0x39043b0) at ldap/servers/slapd/back-ldbm/cache.c:1027 #1 0x00002aabb8034455 in ldbm_back_next_search_entry_ext (pb=0x2aaaad365350, use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1171 #2 0x00002aabb455fe21 in iterate (pb=0x2aaaad365350, be=0x1ad71c0, send_result=1, pnentries=0x52ebe6a4, pagesize=500, pr_statp=0x52ebe698) at ldap/servers/slapd/opshared.c:1130 #3 0x00002aabb4560352 in send_results_ext (pb=0x2aaaad365350, send_result=1, nentries=0x52ebe6a4, pagesize=500, pr_stat=0x52ebe698) at ldap/servers/slapd/opshared.c:1515 #4 0x00002aabb4560c9d in op_shared_search (pb=0x2aaaad365350, send_result=1) at ldap/servers/slapd/opshared.c:591 #5 0x000000000042513f in do_search (pb=0x2aaaad365350) at ldap/servers/slapd/search.c:381 #6 0x00000000004135ee in connection_threadmain () at ldap/servers/slapd/connection.c:597 #7 0x00000032a4a284ad in ?? () from /usr/lib64/libnspr4.so #8 0x00000032a2e0673d in start_thread () from /lib64/libpthread.so.0 #9 0x00000032a26d3f6d in clone () from /lib64/libc.so.6 The one before that is like the normal ones (I'll attach that as duplicate backtrace): (gdb) bt #0 0x00000032a2630265 in raise () from /lib64/libc.so.6 #1 0x00000032a2631d10 in abort () from /lib64/libc.so.6 #2 0x00000032a266a84b in __libc_message () from /lib64/libc.so.6 #3 0x00000032a267230f in _int_free () from /lib64/libc.so.6 #4 0x00000032a267276b in free () from /lib64/libc.so.6 #5 0x00002af8c0d48af6 in slapi_ch_free (ptr=0x4e743d80) at ldap/servers/slapd/ch_malloc.c:363 #6 0x00002af8c48550e8 in ldbm_back_next_search_entry_ext (pb=0x2aaaada486a0, use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1236 #7 0x00002af8c0d80e21 in iterate (pb=0x2aaaada486a0, be=0x2ebb1c0, send_result=1, pnentries=0x4e7466a4, pagesize=500, pr_statp=0x4e746698) at ldap/servers/slapd/opshared.c:1130 #8 0x00002af8c0d81352 in send_results_ext (pb=0x2aaaada486a0, send_result=1, nentries=0x4e7466a4, pagesize=500, pr_stat=0x4e746698) at ldap/servers/slapd/opshared.c:1515 #9 0x00002af8c0d81c9d in op_shared_search (pb=0x2aaaada486a0, send_result=1) at ldap/servers/slapd/opshared.c:591 #10 0x000000000042513f in do_search (pb=0x2aaaada486a0) at ldap/servers/slapd/search.c:381 #11 0x00000000004135ee in connection_threadmain () at ldap/servers/slapd/connection.c:597 #12 0x00000032a4a284ad in ?? () from /usr/lib64/libnspr4.so #13 0x00000032a2e0673d in start_thread () from /lib64/libpthread.so.0 ---Type <return> to continue, or q <return> to quit--- #14 0x00000032a26d3f6d in clone () from /lib64/libc.so.6 Created attachment 472754 [details]
different backtrace
Created attachment 472756 [details]
duplicate backtrace
What client program is it that is generating these search requests? There are a huge variety, to get a helpful answer I may well need to turn off log buffering so that I can catch the one that's triggering it. We use quite a range I'm afraid: - RHEL5 boxes with LDAP authentication: -- nss_ldap-253-25.el5_5.3 (just chose one box out of many, they are all rhel5) - Java applications JDK 1.6 (Sun JNDI LDAP service provider) - apache using mod_auth*_ldap (RHEL5 httpd 2.2.3-43.el5_5.3) - bespoke perl code (Net::LDAP v0.39) - openldap client (ldapsearch) - apache directory studio It has to be some application that can * open a connection * issue async operations on the connection * issue Simple Paged Search requests That rules out nss_ldap, mod_auth_ldap, openldap client so, could be JNDI apps, Net::LDAP perl code, and apache directory studio I have been able to reproduce the crash by issuing multiple simultaneous Simple Paged Search requests on the same connection (using python-ldap). I have also found a workaround - just set nsslapd-maxthreadsperconn: 1 in cn=config in your dse.ldif - you have to shutdown the server first before editing dse.ldif. Thanks Rich, I made that change on Friday morning and it's not crashed since. I think the Net::LDAP perl is the most likely culprit TBH as it performs some (horribly inefficient - e.g. modifying attributes and setting them to their current values...) synchronisation from a bespoke system to LDAP. I'll keep an eye on it to see if it crashes but I thought you might appreciate confirmation that the workaround seems to be working here as well. Good to know the workaround works for you - thanks for the update. Please let us know if you see any weird side effects from the use of the workaround. Created attachment 474301 [details]
patch that applies to master
This patch applies to master
Created attachment 474302 [details]
patch that applies to 8.2
This patch applies to 8.2
Adding Dmitry to the bug since ALT Linux ships 389. This issue has the name CVE-2011-0019. Adding Alex to the bug since Gentoo has started to ship 389 as well. An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0293.html To ssh://git.fedorahosted.org/git/389/ds.git ce0dc22..a123b54 master -> master commit a123b5466aae3eac4193451ac587d90ef1b61815 Author: Rich Megginson <rmeggins> Date: Thu Jan 13 13:44:15 2011 -0700 Fix Description: Only allow one simple paged results search per-connection at a time. The new function pagedresults_check_or_set_processing() will check the flag in the connection to see if pagedresults processing is in progress. If so, the function will return True and the search will terminat with LDAP_UNWILLING_TO_PERFORM (as per section 3 in RFC 2696, a server is allowed to return unwillingToPerform if there is a limit to the number of outstanding paged search requests from a given client). The processing flag will be reset once the search result has been sent to the client. Since the problem is multiple threads in the same connection accessing the pagedresults data, the workaround is to just set nsslapd-maxthreadsperconn: 1 in cn=config in dse.ldif. Platforms tested: RHEL6 x86_64 Flag Day: no Doc impact: no To ssh://git.fedorahosted.org/git/389/ds.git 1847a51..5d4895c 389-ds-base-1.2.8 -> 389-ds-base-1.2.8 commit 5d4895c4d6a412476f4b831e1d0654931c3934f9 Author: Rich Megginson <rmeggins> Date: Thu Jan 13 13:44:15 2011 -0700 |
Created attachment 470999 [details] full backtrace Description of problem: slapd crashed Version-Release number of selected component (if applicable): 1.2.7.5-1.el5 (x86_64) How reproducible: just once so far (but has happened on earlier versions) Steps to Reproduce: 1. run slapd 2. wait a day, a week 3. panic Additional info: Here's a short backtrace and full backtrace (attached). May well be a dupe of 656734, but I thought it might be easier to raise separately and merge if needs be rather than put it in and find that was wrong. An automated upgrade (now disabled) upgraded us from 1.2.7.1 to 1.2.7.4 which caused us major problems due to the memory leak. I had to revert to 1.2.7.1 and we did see a few crashes, I've now upgraded to 1.2.7.5 (hoping that perhaps it was some incompatibility between 1.2.7.1 and data touched by 1.2.7.4) however it's just crashed. I do wonder if we need to rebuild the indexes, I've also not run a dbverify yet. Program terminated with signal 6, Aborted. #0 0x00000032a2630265 in raise () from /lib64/libc.so.6 (gdb) bt #0 0x00000032a2630265 in raise () from /lib64/libc.so.6 #1 0x00000032a2631d10 in abort () from /lib64/libc.so.6 #2 0x00000032a266a84b in __libc_message () from /lib64/libc.so.6 #3 0x00000032a267230f in _int_free () from /lib64/libc.so.6 #4 0x00000032a267276b in free () from /lib64/libc.so.6 #5 0x00002b712ae57af6 in slapi_ch_free (ptr=0x4cdddcf0) at ldap/servers/slapd/ch_malloc.c:363 #6 0x00002b712e93ba65 in idl_free (idl=0x2aaab1b29060) at ldap/servers/slapd/back-ldbm/idl_common.c:94 #7 0x00002b712e963e2e in delete_search_result_set (sr=0x4cdddd80) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1490 #8 0x00002b712e9640e8 in ldbm_back_next_search_entry_ext ( pb=0x2aaaad28aef0, use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1236 #9 0x00002b712ae8fe21 in iterate (pb=0x2aaaad28aef0, be=0x101cf270, send_result=1, pnentries=0x4cde06a4, pagesize=500, pr_statp=0x4cde0698) at ldap/servers/slapd/opshared.c:1130 #10 0x00002b712ae90352 in send_results_ext (pb=0x2aaaad28aef0, send_result=1, nentries=0x4cde06a4, pagesize=500, pr_stat=0x4cde0698) at ldap/servers/slapd/opshared.c:1515 #11 0x00002b712ae90c9d in op_shared_search (pb=0x2aaaad28aef0, send_result=1) at ldap/servers/slapd/opshared.c:591 #12 0x000000000042513f in do_search (pb=0x2aaaad28aef0) ---Type <return> to continue, or q <return> to quit--- at ldap/servers/slapd/search.c:381 #13 0x00000000004135ee in connection_threadmain () at ldap/servers/slapd/connection.c:597 #14 0x00000032a4a284ad in ?? () from /usr/lib64/libnspr4.so #15 0x00000032a2e0673d in start_thread () from /lib64/libpthread.so.0 #16 0x00000032a26d3f6d in clone () from /lib64/libc.so.6