Bug 666076

Summary: dirsrv crash (1.2.7.5) with multiple simple paged result searches
Product: [Retired] 389 Reporter: Adrian Bridgett <adrian>
Component: Directory ServerAssignee: Rich Megginson <rmeggins>
Status: CLOSED ERRATA QA Contact: Chandrasekar Kannan <ckannan>
Severity: medium Docs Contact:
Priority: high    
Version: 1.2.7CC: a3li, andrey.ivanov, benl, jgalipea, jlieskov, ldv, sramling, vdanen
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 679495 (view as bug list) Environment:
Last Closed: 2011-02-22 17:46:18 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 639035, 670914, 679495    
Attachments:
Description Flags
full backtrace
none
different backtrace
none
duplicate backtrace none

Description Adrian Bridgett 2010-12-28 20:13:56 UTC
Created attachment 470999 [details]
full backtrace

Description of problem:
slapd crashed

Version-Release number of selected component (if applicable):
1.2.7.5-1.el5 (x86_64)

How reproducible:
just once so far (but has happened on earlier versions)

Steps to Reproduce:
1. run slapd
2. wait a day, a week
3. panic

Additional info:
Here's a short backtrace and full backtrace (attached).  May well be a dupe of 656734, but I thought it might be easier to raise separately and merge if needs be rather than put it in and find that was wrong.

An automated upgrade (now disabled) upgraded us from 1.2.7.1 to 1.2.7.4 which caused us major problems due to the memory leak.  I had to revert to 1.2.7.1 and we did see a few crashes, I've now upgraded to 1.2.7.5 (hoping that perhaps it was some incompatibility between 1.2.7.1 and data touched by 1.2.7.4) however it's just crashed.

I do wonder if we need to rebuild the indexes, I've also not run a dbverify yet.

Program terminated with signal 6, Aborted.
#0  0x00000032a2630265 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00000032a2630265 in raise () from /lib64/libc.so.6
#1  0x00000032a2631d10 in abort () from /lib64/libc.so.6
#2  0x00000032a266a84b in __libc_message () from /lib64/libc.so.6
#3  0x00000032a267230f in _int_free () from /lib64/libc.so.6
#4  0x00000032a267276b in free () from /lib64/libc.so.6
#5  0x00002b712ae57af6 in slapi_ch_free (ptr=0x4cdddcf0)
    at ldap/servers/slapd/ch_malloc.c:363
#6  0x00002b712e93ba65 in idl_free (idl=0x2aaab1b29060)
    at ldap/servers/slapd/back-ldbm/idl_common.c:94
#7  0x00002b712e963e2e in delete_search_result_set (sr=0x4cdddd80)
    at ldap/servers/slapd/back-ldbm/ldbm_search.c:1490
#8  0x00002b712e9640e8 in ldbm_back_next_search_entry_ext (
    pb=0x2aaaad28aef0, use_extension=0)
    at ldap/servers/slapd/back-ldbm/ldbm_search.c:1236
#9  0x00002b712ae8fe21 in iterate (pb=0x2aaaad28aef0, be=0x101cf270, 
    send_result=1, pnentries=0x4cde06a4, pagesize=500, 
    pr_statp=0x4cde0698) at ldap/servers/slapd/opshared.c:1130
#10 0x00002b712ae90352 in send_results_ext (pb=0x2aaaad28aef0, 
    send_result=1, nentries=0x4cde06a4, pagesize=500, 
    pr_stat=0x4cde0698) at ldap/servers/slapd/opshared.c:1515
#11 0x00002b712ae90c9d in op_shared_search (pb=0x2aaaad28aef0, 
    send_result=1) at ldap/servers/slapd/opshared.c:591
#12 0x000000000042513f in do_search (pb=0x2aaaad28aef0)
---Type <return> to continue, or q <return> to quit---
    at ldap/servers/slapd/search.c:381
#13 0x00000000004135ee in connection_threadmain ()
    at ldap/servers/slapd/connection.c:597
#14 0x00000032a4a284ad in ?? () from /usr/lib64/libnspr4.so
#15 0x00000032a2e0673d in start_thread () from /lib64/libpthread.so.0
#16 0x00000032a26d3f6d in clone () from /lib64/libc.so.6

Comment 2 Rich Megginson 2011-01-05 23:48:59 UTC
The search is running into the lookthrough limit and exiting early.  I suppose you could work around the problem by increasing your lookthrough limit.  I was unable to reproduce by simply making a search hit the limit.  The stack trace sure looks like it is doing a free of already freed memory or perhaps uninitialized memory.  Looking at the code, I don't know where this could happen.

How many entries should be returned from the search?  That is, how many match
base dc=public,dc=people,dc=ds,dc=example,dc=com
scope subtree
filter (&(objectClass=inetOrgPerson)(memberOf=*Enterprise*))
?

If you can reproduce it easily, would you be able to run valgrind on this server to get more detailed memory access debugging?  It will slow the server somewhat.

Comment 3 Adrian Bridgett 2011-01-06 08:56:10 UTC
Hi Rich, thanks for looking at this.  We are going to perform a nice big cleanup of LDAP in the not-too-distant future since it's not setup very well ATM.  Currently there are 3443 entries returned by the query (a plain ldapsearch is showing success - shouldn't it be reporting that it hit lookthrough limit?)

Oddly we also have this set in the DB (would it be useful to attach dse.ldif or a dump of cn=config?):
dn: cn=config,cn=ldbm database,cn=plugins,cn=config
nsslapd-lookthroughlimit: -1

Under dc=public,dc=people,dc=ds,dc=example,dc=com there are 97617 entries (post-cleanup this will drop probably to those 3443 entries and not much else).

We do have scripts which sync other systems to LDAP (as well as syncing other systems from LDAP).  One thing I've seen those (legacy) scripts do is update all the attributes (even if they are _already_ set to the correct value) - so there are likely to be lots of updates happening from time to time.

We definitely couldn't run valgrind on this box, it's pushed quite hard as it is.  I can look at running it on a separate test box and I could run half of those update scripts (the ones that update LDAP - I couldn't run the ones that update other systems from LDAP as we don't have a full test environment yet).

It's currently falling over about every two days, looking at exactly when:
[root@ds2 core]# ls -l
total 687192
-rw------- 1 root nobody 623874048 Jan  6 02:24 core.ns-slapd.99.12161.6.1294280650
-rw------- 1 root nobody 563785728 Jan  4 02:14 core.ns-slapd.99.21142.11.1294107249
-rw------- 1 root nobody 615436288 Jan  3 15:24 core.ns-slapd.99.22195.6.1294068249
-rw------- 1 root nobody 562569216 Jan  3 19:24 core.ns-slapd.99.2607.11.1294082649
[root@ds2 core]# rm core.ns-slapd.99.2607.11.1294082649 core.ns-slapd.99.22195.6

Sadly there's nothing in the update script logs to reveal what might be the trigger.

Comment 4 Rich Megginson 2011-01-06 15:46:54 UTC
(In reply to comment #3)
> Hi Rich, thanks for looking at this.  We are going to perform a nice big
> cleanup of LDAP in the not-too-distant future since it's not setup very well
> ATM.  Currently there are 3443 entries returned by the query (a plain
> ldapsearch is showing success - shouldn't it be reporting that it hit
> lookthrough limit?)

It depends on what you mean by "plain ldapsearch".  The search filter is the critical factor in determining when the lookthrough limit is hit.  If by "plain" you mean "running ldapsearch without specifying a search filter" then the default search filter of "objectclass=*" is used.  Are you running the search as directory manager or a regular user?  lookthrough limit does not apply to directory manager.

> 
> Oddly we also have this set in the DB (would it be useful to attach dse.ldif or
> a dump of cn=config?):
> dn: cn=config,cn=ldbm database,cn=plugins,cn=config
> nsslapd-lookthroughlimit: -1

This means to use the default value of lookthrough limit which is 5000.
 
> Under dc=public,dc=people,dc=ds,dc=example,dc=com there are 97617 entries
> (post-cleanup this will drop probably to those 3443 entries and not much else).

So because of the search filter
(&(objectClass=inetOrgPerson)(memberOf=*Enterprise*))
it has to look through all (or at least more than 5000) of the 97617 entries to find a match.

> 
> We do have scripts which sync other systems to LDAP (as well as syncing other
> systems from LDAP).  One thing I've seen those (legacy) scripts do is update
> all the attributes (even if they are _already_ set to the correct value) - so
> there are likely to be lots of updates happening from time to time.

Can you paste excerpts from your directory server access log from around the time of the crash?  We can try to see if there are other operations happening at the time of the search.  The core dump did not show any operations other than the search request.  The access log is buffered by default for performance reasons, so it may be that what we are looking for was not written to disk before the crash.  You can turn off access log buffering, but that will also impact performance somewhat.

http://docs.redhat.com/docs/en-US/Red_Hat_Directory_Server/8.2/html/Configuration_and_Command-Line_Tool_Reference/Core_Server_Configuration_Reference.html#cnconfig-nsslapd_accesslog_logbuffering_Log_Buffering

and see also

http://docs.redhat.com/docs/en-US/Red_Hat_Directory_Server/8.2/html-single/Administration_Guide/index.html#using-named-pipe

> 
> We definitely couldn't run valgrind on this box, it's pushed quite hard as it
> is.  I can look at running it on a separate test box and I could run half of
> those update scripts (the ones that update LDAP - I couldn't run the ones that
> update other systems from LDAP as we don't have a full test environment yet).
> 
> It's currently falling over about every two days, looking at exactly when:
> [root@ds2 core]# ls -l
> total 687192
> -rw------- 1 root nobody 623874048 Jan  6 02:24
> core.ns-slapd.99.12161.6.1294280650
> -rw------- 1 root nobody 563785728 Jan  4 02:14
> core.ns-slapd.99.21142.11.1294107249
> -rw------- 1 root nobody 615436288 Jan  3 15:24
> core.ns-slapd.99.22195.6.1294068249
> -rw------- 1 root nobody 562569216 Jan  3 19:24
> core.ns-slapd.99.2607.11.1294082649
> [root@ds2 core]# rm core.ns-slapd.99.2607.11.1294082649
> core.ns-slapd.99.22195.6
> 
> Sadly there's nothing in the update script logs to reveal what might be the
> trigger.

The directory server access log (sans buffering) could tell us what's going on.

Do all of these cores have the same thread stack traces?  If you like, you can attach the full stack traces from all of these cores to this bug.

Comment 5 Adrian Bridgett 2011-01-06 17:50:31 UTC
I'm was running ldapsearch without any filters, not as the directory owner but I an in the ldap administrators group.  

Thanks for the note about lookthroughlimit - I'm very surprised that's not biting us elsewhere.  Annoyingly I even have a comment that says "-1 = unlimited" so it's as if I've read it somewhere but apparently not. 

I looked at two or three stack traces and they looked identical.  I'll go through the access logs next time it crashes.  I have a small script I used last time to figure out what was "in flight" at the time of the crash.   If nothing shows up I'll turn buffering off and keep a close eye on it!

Thanks

Comment 6 Rich Megginson 2011-01-06 22:10:28 UTC
(In reply to comment #5)
> I'm was running ldapsearch without any filters, not as the directory owner but
> I an in the ldap administrators group.  

Ok.  So lookthrough limit will apply in that case.

> Thanks for the note about lookthroughlimit - I'm very surprised that's not
> biting us elsewhere.  Annoyingly I even have a comment that says "-1 =
> unlimited" so it's as if I've read it somewhere but apparently not. 
> 
> I looked at two or three stack traces and they looked identical.  I'll go
> through the access logs next time it crashes.  I have a small script I used
> last time to figure out what was "in flight" at the time of the crash.   If
> nothing shows up I'll turn buffering off and keep a close eye on it!

Thanks.

> 
> Thanks

Comment 7 Adrian Bridgett 2011-01-07 16:27:56 UTC
Incidentally I found where I'd got the -1=unlimited from - it's here (4.4.1.1)

nsLookthroughLimit cn=config,cn=ldbm database,cn=plugins,cn=config
-1 to maximum 32-bit integer in entries (where -1 is unlimited) (default=5000)

http://docs.redhat.com/docs/en-US/Red_Hat_Directory_Server/8.2/html/Configuration_and_Command-Line_Tool_Reference/Database_Plug_in_Attributes.html

nsslapd-lookthroughlimit seems to be only in RHDS 7.1.

It's just crashed, the restart script is taking a copy of access now:
$ grep err access.apb |grep -v err=0
[07/Jan/2011:14:58:55 +0000] conn=42065 op=0 RESULT err=32 tag=97 nentries=0 etime=0
[07/Jan/2011:14:58:55 +0000] conn=42066 op=0 RESULT err=32 tag=97 nentries=0 etime=0
[07/Jan/2011:14:58:56 +0000] conn=42070 op=0 RESULT err=32 tag=97 nentries=0 etime=0
[07/Jan/2011:14:58:56 +0000] conn=42071 op=0 RESULT err=32 tag=97 nentries=0 etime=0
[07/Jan/2011:14:58:56 +0000] conn=42072 op=0 RESULT err=49 tag=97 nentries=0 etime=0


in order:
[07/Jan/2011:14:58:55 +0000] conn=42065 fd=152 slot=152 connection from 94.229.175.69 to 94.229.175.61
[07/Jan/2011:14:58:55 +0000] conn=42065 op=0 BIND dn="uid=letuan,dc=staff,dc=people,dc=ds,dc=example,dc=com" method=128 version=3
[07/Jan/2011:14:58:55 +0000] conn=42065 op=0 RESULT err=32 tag=97 nentries=0 etime=0
[07/Jan/2011:14:58:55 +0000] conn=42065 op=-1 fd=152 closed - B1

[07/Jan/2011:14:58:55 +0000] conn=42066 fd=152 slot=152 connection from 94.229.175.69 to 94.229.175.61
[07/Jan/2011:14:58:55 +0000] conn=42066 op=0 BIND dn="uid=letuan,dc=vendor,dc=people,dc=ds,dc=example,dc=com" method=128 version=3
[07/Jan/2011:14:58:55 +0000] conn=42066 op=0 RESULT err=32 tag=97 nentries=0 etime=0
[07/Jan/2011:14:58:55 +0000] conn=42066 op=-1 fd=152 closed - B1

[07/Jan/2011:14:58:56 +0000] conn=42070 fd=152 slot=152 connection from 94.229.175.69 to 94.229.175.61
[07/Jan/2011:14:58:56 +0000] conn=42070 op=0 BIND dn="uid=sumerjabri,dc=staff,dc=people,dc=ds,dc=example,dc=com" method=128 version=3
[07/Jan/2011:14:58:56 +0000] conn=42070 op=0 RESULT err=32 tag=97 nentries=0 etime=0
[07/Jan/2011:14:58:56 +0000] conn=42070 op=-1 fd=152 closed - B1

[07/Jan/2011:14:58:56 +0000] conn=42071 fd=161 slot=161 connection from 94.229.175.69 to 94.229.175.61
[07/Jan/2011:14:58:56 +0000] conn=42071 op=0 BIND dn="uid=sumerjabri,dc=vendor,dc=people,dc=ds,dc=example,dc=com" method=128 version=3
[07/Jan/2011:14:58:56 +0000] conn=42071 op=0 RESULT err=32 tag=97 nentries=0 etime=0
[07/Jan/2011:14:58:56 +0000] conn=42071 op=-1 fd=161 closed - B1

[07/Jan/2011:14:58:56 +0000] conn=42072 fd=167 slot=167 connection from 94.229.175.69 to 94.229.175.61
[07/Jan/2011:14:58:56 +0000] conn=42072 op=0 BIND dn="uid=sumerjabri,dc=public,dc=people,dc=ds,dc=example,dc=com" method=128 version=3
[07/Jan/2011:14:58:56 +0000] conn=42072 op=0 RESULT err=49 tag=97 nentries=0 etime=0
[07/Jan/2011:14:58:56 +0000] conn=42072 op=-1 fd=167 closed - B1

I've also run my script which prints out all in-flight connections and nothing shows up there.  Looks like I'll have to turn log buffering off.

short backtrace looks the same:
Program terminated with signal 6, Aborted.
#0  0x00000032a2630265 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00000032a2630265 in raise () from /lib64/libc.so.6
#1  0x00000032a2631d10 in abort () from /lib64/libc.so.6
#2  0x00000032a266a84b in __libc_message () from /lib64/libc.so.6
#3  0x00000032a267230f in _int_free () from /lib64/libc.so.6
#4  0x00000032a267276b in free () from /lib64/libc.so.6
#5  0x00002b708b120af6 in slapi_ch_free (ptr=0x490b7cf0)
    at ldap/servers/slapd/ch_malloc.c:363
#6  0x00002b708ec04a65 in idl_free (idl=0x6363c50)
    at ldap/servers/slapd/back-ldbm/idl_common.c:94
#7  0x00002b708ec2ce2e in delete_search_result_set (sr=0x490b7d80)
    at ldap/servers/slapd/back-ldbm/ldbm_search.c:1490
#8  0x00002b708ec2d0e8 in ldbm_back_next_search_entry_ext (pb=0x2aaab12f9540, 
    use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1236
#9  0x00002b708b158e21 in iterate (pb=0x2aaab12f9540, be=0x3bfb1c0, 
    send_result=1, pnentries=0x490ba6a4, pagesize=500, pr_statp=0x490ba698)
    at ldap/servers/slapd/opshared.c:1130
#10 0x00002b708b159352 in send_results_ext (pb=0x2aaab12f9540, send_result=1, 
    nentries=0x490ba6a4, pagesize=500, pr_stat=0x490ba698)
    at ldap/servers/slapd/opshared.c:1515
#11 0x00002b708b159c9d in op_shared_search (pb=0x2aaab12f9540, send_result=1)
    at ldap/servers/slapd/opshared.c:591
#12 0x000000000042513f in do_search (pb=0x2aaab12f9540)
    at ldap/servers/slapd/search.c:381
---Type <return> to continue, or q <return> to quit---
#13 0x00000000004135ee in connection_threadmain ()
    at ldap/servers/slapd/connection.c:597
#14 0x00000032a4a284ad in ?? () from /usr/lib64/libnspr4.so
#15 0x00000032a2e0673d in start_thread () from /lib64/libpthread.so.0
#16 0x00000032a26d3f6d in clone () from /lib64/libc.so.6

Comment 8 Adrian Bridgett 2011-01-07 16:33:18 UTC
Meh, I forgot to paste in the error log for that latest crash:

[07/Jan/2011:14:44:10 +0000] - get_ldapmessage_controls failed: 12 (Unavailable 
critical extension) (op=Abandon)
[07/Jan/2011:14:49:10 +0000] - get_ldapmessage_controls failed: 12 (Unavailable 
critical extension) (op=Abandon)
[07/Jan/2011:14:59:11 +0000] - PR_Write(154) Netscape Portable Runtime error -59
61 (TCP connection reset by peer.)
        389-Directory/1.2.7.5 B2010.350.198
        ds2.man.alfresco.com:636 (/etc/dirsrv/slapd-ds2)

[07/Jan/2011:14:59:13 +0000] - 389-Directory/1.2.7.5 B2010.350.198 starting up
[07/Jan/2011:14:59:13 +0000] - Detected Disorderly Shutdown last time Directory 
Server was running, recovering database.


http://markmail.org/message/jyi43y7q2iuo6f7m suggests that the critical extension error would not be responsible for the crash though.  That PR_Write error I've not seen in earlier logs either.

Comment 9 Rich Megginson 2011-01-07 21:48:26 UTC
looks like the crash is caused by a simple paged results search.  That's a good clue.

The stuff from the access log isn't much help - we know the crash is occurring due to a search request, not a bind request.

I think you've done enough for now, thanks.  I'll try to reproduce using a simple paged search.  I'll let you know if I need any more information.

Comment 10 Rich Megginson 2011-01-11 01:33:40 UTC
I am still unable to reproduce the circumstances of the crash.  I am very puzzled about this from the original full stack trace:

#8  0x00002b712e9640e8 in ldbm_back_next_search_entry_ext (
    pb=0x2aaaad28aef0, use_extension=0)
    at ldap/servers/slapd/back-ldbm/ldbm_search.c:1236
        be = 0x101cf270
        inst = 0x10247fb0
        li = 0x10141ea0
        scope = 2
        managedsait = 0
        attr = 0x0
        filter = 0x2aaaacb61280
        base = 0x2aaaac485830 "dc=public,dc=people,dc=ds,dc=example,dc=com"
        sr = 0x2aaab0c15780
        id = <value optimized out>
        e = 0x2aaaad0555a0
        nentries = 348
        curtime = <value optimized out>
        stoptime = 1293565146
        optime = 1293561546
        tlimit = 3600
        llimit = -1
        slimit = 997648
        isroot = 0
        urls = 0x0
        err = 0
        basesdn = {flag = 0 '\000', 
          dn = 0x2aaaac485830 "dc=public,dc=people,dc=ds,dc=example,dc=com", 
          ndn = 0x2aaaac485830 "dc=public,dc=people,dc=ds,dc=example,dc=com", ndn_len = 44}
        target_uniqueid = 0x0
        rc = -1
        estimate = 0

This is from ldbm_search.c:1236 in the 1.2.7.5 code:
        /* check lookthrough limit */
        if ( llimit != -1 && sr->sr_lookthroughcount >= llimit )
        {
            slapi_send_ldap_result( pb, LDAP_ADMINLIMIT_EXCEEDED, NULL, NULL, nentries, urls );
            /* in case paged results, clean up the conn */
            pagedresults_set_search_result(pb->pb_conn, NULL);
            slapi_pblock_set( pb, SLAPI_SEARCH_RESULT_SET, NULL );
            slapi_pblock_set( pb, SLAPI_SEARCH_RESULT_SET_SIZE_ESTIMATE, &estimate );
            if ( use_extension ) {
                slapi_pblock_set( pb, SLAPI_SEARCH_RESULT_ENTRY_EXT, NULL );
            } 
            slapi_pblock_set( pb, SLAPI_SEARCH_RESULT_ENTRY, NULL );
**** line 1236            delete_search_result_set( &sr );
            rc = SLAPI_FAIL_GENERAL;
            goto bail;
        }

Note that in the stack trace, llimit is -1 - since llimit == -1, the if condition should be FALSE, and this block of code should never be entered.  So, one of two things:
1) this is not really 1.2.7.5 - can you confirm with
 /usr/sbin/ns-slapd -V
2) the stack is being corrupted

Could you post a full stack trace from one of the other core dumps?
What is your sizelimit?

Comment 11 Adrian Bridgett 2011-01-11 08:35:01 UTC
Sure:

/usr/sbin/ns-slapd -v
389 Project
389-Directory/1.2.7.5 B2010.350.198

Hmm, the most recent one is different - I'll attach the full version (different backtrace), the short form is:

(gdb) bt
#0  0x00002aabb7ff83c7 in cache_return (cache=0x1ad5d28, ptr=0x39043b0)
    at ldap/servers/slapd/back-ldbm/cache.c:1027
#1  0x00002aabb8034455 in ldbm_back_next_search_entry_ext (pb=0x2aaaad365350, 
    use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1171
#2  0x00002aabb455fe21 in iterate (pb=0x2aaaad365350, be=0x1ad71c0, 
    send_result=1, pnentries=0x52ebe6a4, pagesize=500, pr_statp=0x52ebe698)
    at ldap/servers/slapd/opshared.c:1130
#3  0x00002aabb4560352 in send_results_ext (pb=0x2aaaad365350, send_result=1, 
    nentries=0x52ebe6a4, pagesize=500, pr_stat=0x52ebe698)
    at ldap/servers/slapd/opshared.c:1515
#4  0x00002aabb4560c9d in op_shared_search (pb=0x2aaaad365350, send_result=1)
    at ldap/servers/slapd/opshared.c:591
#5  0x000000000042513f in do_search (pb=0x2aaaad365350)
    at ldap/servers/slapd/search.c:381
#6  0x00000000004135ee in connection_threadmain ()
    at ldap/servers/slapd/connection.c:597
#7  0x00000032a4a284ad in ?? () from /usr/lib64/libnspr4.so
#8  0x00000032a2e0673d in start_thread () from /lib64/libpthread.so.0
#9  0x00000032a26d3f6d in clone () from /lib64/libc.so.6

The one before that is like the normal ones (I'll attach that as duplicate backtrace):

(gdb) bt
#0  0x00000032a2630265 in raise () from /lib64/libc.so.6
#1  0x00000032a2631d10 in abort () from /lib64/libc.so.6
#2  0x00000032a266a84b in __libc_message () from /lib64/libc.so.6
#3  0x00000032a267230f in _int_free () from /lib64/libc.so.6
#4  0x00000032a267276b in free () from /lib64/libc.so.6
#5  0x00002af8c0d48af6 in slapi_ch_free (ptr=0x4e743d80)
    at ldap/servers/slapd/ch_malloc.c:363
#6  0x00002af8c48550e8 in ldbm_back_next_search_entry_ext (pb=0x2aaaada486a0, 
    use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1236
#7  0x00002af8c0d80e21 in iterate (pb=0x2aaaada486a0, be=0x2ebb1c0, 
    send_result=1, pnentries=0x4e7466a4, pagesize=500, pr_statp=0x4e746698)
    at ldap/servers/slapd/opshared.c:1130
#8  0x00002af8c0d81352 in send_results_ext (pb=0x2aaaada486a0, send_result=1, 
    nentries=0x4e7466a4, pagesize=500, pr_stat=0x4e746698)
    at ldap/servers/slapd/opshared.c:1515
#9  0x00002af8c0d81c9d in op_shared_search (pb=0x2aaaada486a0, send_result=1)
    at ldap/servers/slapd/opshared.c:591
#10 0x000000000042513f in do_search (pb=0x2aaaada486a0)
    at ldap/servers/slapd/search.c:381
#11 0x00000000004135ee in connection_threadmain ()
    at ldap/servers/slapd/connection.c:597
#12 0x00000032a4a284ad in ?? () from /usr/lib64/libnspr4.so
#13 0x00000032a2e0673d in start_thread () from /lib64/libpthread.so.0
---Type <return> to continue, or q <return> to quit---
#14 0x00000032a26d3f6d in clone () from /lib64/libc.so.6

Comment 12 Adrian Bridgett 2011-01-11 08:35:54 UTC
Created attachment 472754 [details]
different backtrace

Comment 13 Adrian Bridgett 2011-01-11 08:38:08 UTC
Created attachment 472756 [details]
duplicate backtrace

Comment 14 Rich Megginson 2011-01-13 01:09:49 UTC
What client program is it that is generating these search requests?

Comment 15 Adrian Bridgett 2011-01-13 10:55:50 UTC
There are a huge variety, to get a helpful answer I may well need to turn off log buffering so that I can catch the one that's triggering it.

We use quite a range I'm afraid:
- RHEL5 boxes with LDAP authentication:
-- nss_ldap-253-25.el5_5.3  (just chose one box out of many, they are all rhel5)
- Java applications JDK 1.6 (Sun JNDI LDAP service provider)
- apache using mod_auth*_ldap (RHEL5 httpd 2.2.3-43.el5_5.3)
- bespoke perl code (Net::LDAP v0.39)
- openldap client (ldapsearch)
- apache directory studio

Comment 16 Rich Megginson 2011-01-13 18:48:51 UTC
It has to be some application that can
* open a connection
* issue async operations on the connection
* issue Simple Paged Search requests

That rules out nss_ldap, mod_auth_ldap, openldap client

so, could be JNDI apps, Net::LDAP perl code, and apache directory studio

I have been able to reproduce the crash by issuing multiple simultaneous Simple Paged Search requests on the same connection (using python-ldap).  I have also found a workaround - just set nsslapd-maxthreadsperconn: 1 in cn=config in your dse.ldif - you have to shutdown the server first before editing dse.ldif.

Comment 17 Adrian Bridgett 2011-01-17 09:35:21 UTC
Thanks Rich, 

I made that change on Friday morning and it's not crashed since. I think the Net::LDAP perl is the most likely culprit TBH as it performs some (horribly inefficient - e.g. modifying attributes and setting them to their current values...) synchronisation from a bespoke system to LDAP. 

I'll keep an eye on it to see if it crashes but I thought you might appreciate confirmation that the workaround seems to be working here as well.

Comment 18 Rich Megginson 2011-01-18 16:54:20 UTC
Good to know the workaround works for you - thanks for the update.  Please let us know if you see any weird side effects from the use of the workaround.

Comment 24 Rich Megginson 2011-01-19 15:42:42 UTC
Created attachment 474301 [details]
patch that applies to master

This patch applies to master

Comment 25 Rich Megginson 2011-01-19 15:43:21 UTC
Created attachment 474302 [details]
patch that applies to 8.2

This patch applies to 8.2

Comment 33 Vincent Danen 2011-01-19 21:42:54 UTC
Adding Dmitry to the bug since ALT Linux ships 389.

Comment 34 Vincent Danen 2011-01-19 21:44:20 UTC
This issue has the name CVE-2011-0019.

Comment 35 Vincent Danen 2011-01-20 16:25:58 UTC
Adding Alex to the bug since Gentoo has started to ship 389 as well.

Comment 37 errata-xmlrpc 2011-02-22 17:46:18 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0293.html

Comment 38 Rich Megginson 2011-02-22 18:06:06 UTC
To ssh://git.fedorahosted.org/git/389/ds.git
   ce0dc22..a123b54  master -> master
commit a123b5466aae3eac4193451ac587d90ef1b61815
Author: Rich Megginson <rmeggins>
Date:   Thu Jan 13 13:44:15 2011 -0700
    Fix Description: Only allow one simple paged results search per-connection
    at a time.  The new function pagedresults_check_or_set_processing() will
    check the flag in the connection to see if pagedresults processing is in
    progress.  If so, the function will return True and the search will terminat
    with LDAP_UNWILLING_TO_PERFORM (as per section 3 in RFC 2696, a server is
    allowed to return unwillingToPerform if there is a limit to the number of
    outstanding paged search requests from a given client).  The processing
    flag will be reset once the search result has been sent to the client.
    Since the problem is multiple threads in the same connection accessing
    the pagedresults data, the workaround is to just set
     nsslapd-maxthreadsperconn: 1
    in cn=config in dse.ldif.
    Platforms tested: RHEL6 x86_64
    Flag Day: no
    Doc impact: no
To ssh://git.fedorahosted.org/git/389/ds.git
   1847a51..5d4895c  389-ds-base-1.2.8 -> 389-ds-base-1.2.8
commit 5d4895c4d6a412476f4b831e1d0654931c3934f9
Author: Rich Megginson <rmeggins>
Date:   Thu Jan 13 13:44:15 2011 -0700