Bug 751645 - crash when simple paged fails to send entry to client
Summary: crash when simple paged fails to send entry to client
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: 389
Classification: Retired
Component: Directory Server
Version: 1.2.9
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: ---
Assignee: Rich Megginson
QA Contact: Viktor Ashirov
URL:
Whiteboard:
Depends On:
Blocks: 389_1.2.10 752577 752607 797946
TreeView+ depends on / blocked
 
Reported: 2011-11-07 00:45 UTC by Daniel
Modified: 2015-12-07 16:39 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 752577 797946 (view as bug list)
Environment:
Last Closed: 2015-12-07 16:39:10 UTC


Attachments (Terms of Use)
thread apply all bt output (30.16 KB, text/plain)
2011-11-07 00:45 UTC, Daniel
no flags Details
bt full output (11.26 KB, text/plain)
2011-11-07 00:46 UTC, Daniel
no flags Details
thread9 (1.59 KB, text/plain)
2011-11-08 06:06 UTC, Daniel
no flags Details
Thread20 (13.44 KB, text/plain)
2011-11-08 06:07 UTC, Daniel
no flags Details
Thread22 (13.51 KB, text/plain)
2011-11-08 06:07 UTC, Daniel
no flags Details
Thread29 (18.50 KB, text/plain)
2011-11-08 06:07 UTC, Daniel
no flags Details
0001-Bug-751645-crash-when-simple-paged-fails-to-send-ent.patch (2.40 KB, patch)
2011-11-09 20:16 UTC, Rich Megginson
nkinder: review+
nhosoi: review+
Details | Diff

Description Daniel 2011-11-07 00:45:43 UTC
Created attachment 531949 [details]
thread apply all bt output

I am running 389-ds-base-1.2.9.9-1.el5 on RHEL 5.7

Name        : 389-ds-base                  Relocations: (not relocatable)
Version     : 1.2.9.9                           Vendor: Fedora Project
Release     : 1.el5                         Build Date: Fri 02 Sep 2011 06:41:29 AM EST
Install Date: Wed 21 Sep 2011 07:16:41 PM EST      Build Host: x86-03.phx2.fedoraproject.org
Group       : System Environment/Daemons    Source RPM: 389-ds-base-1.2.9.9-1.el5.src.rpm
Size        : 5005087                          License: GPLv2 with exceptions
Signature   : DSA/SHA1, Fri 02 Sep 2011 10:01:24 AM EST, Key ID 119cc036217521f6
Packager    : Fedora Project
URL         : http://port389.org/
Summary     : 389 Directory Server (base)
Description :
389 Directory Server is an LDAPv3 compliant server.  The base package includes
the LDAP server and command line utilities for server administration.

One of our servers has been causes repeated directory server crashes (we have powered it off for now). See stack trace from the core dump below:


#0  0x0000003935830265 in raise () from /lib64/libc.so.6
#1  0x0000003935831d10 in abort () from /lib64/libc.so.6
#2  0x000000393586a99b in __libc_message () from /lib64/libc.so.6
#3  0x000000393587245f in _int_free () from /lib64/libc.so.6
#4  0x00000039358728bb in free () from /lib64/libc.so.6
#5  0x00002b65777757b6 in slapi_ch_free (ptr=0x597e8500) at ldap/servers/slapd/ch_malloc.c:363
#6  0x00002b657ba84ec1 in ldbm_back_next_search_entry_ext (pb=0xf4a9b70, use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1238
#7  0x00002b65777aeb98 in iterate (pb=0xf4a9b70, be=0xe823130, send_result=1, pnentries=0x597ee6a4, pagesize=1000, pr_statp=0x597ee698)
    at ldap/servers/slapd/opshared.c:1153
#8  0x00002b65777af0e2 in send_results_ext (pb=0xf4a9b70, send_result=1, nentries=0x597ee6a4, pagesize=1000, pr_stat=0x597ee698)
    at ldap/servers/slapd/opshared.c:1537
#9  0x00002b65777afab0 in op_shared_search (pb=0xf4a9b70, send_result=1) at ldap/servers/slapd/opshared.c:599
#10 0x000000000042450f in do_search (pb=0xf4a9b70) at ldap/servers/slapd/search.c:393
#11 0x000000000041352c in connection_threadmain () at ldap/servers/slapd/connection.c:611
#12 0x00000039398284ad in PR_JoinThread () from /usr/lib64/libnspr4.so
#13 0x000000393600673d in start_thread () from /lib64/libpthread.so.0

(gdb)  print *pb
$1 = {pb_backend = 0xe823130, pb_conn = 0x2aaaaaab1fa0, pb_op = 0xf48c7e0, pb_plugin = 0xe6fb1a0, pb_opreturn = 0, pb_object = 0x0, pb_destroy_fn = 0,
  pb_requestor_isroot = 0, pb_config_fname = 0x0, pb_config_lineno = 0, pb_config_argc = 0, pb_config_argv = 0x0, pb_target_entry = 0x0,
  pb_existing_dn_entry = 0x0, pb_existing_uniqueid_entry = 0x0, pb_parent_entry = 0x0, pb_newparent_entry = 0x0, pb_pre_op_entry = 0x2aaab066bfc0,
  pb_post_op_entry = 0x0, pb_seq_type = 0, pb_seq_attrname = 0x0, pb_seq_val = 0x0, pb_ldif_file = 0x0, pb_removedupvals = 0, pb_db2index_attrs = 0x0,
  pb_ldif2db_noattrindexes = 0, pb_ldif_printkey = 0, pb_instance_name = 0x0, pb_task = 0x0, pb_task_flags = 0, pb_mr_filter_match_fn = 0,
  pb_mr_filter_index_fn = 0, pb_mr_filter_reset_fn = 0, pb_mr_index_fn = 0, pb_mr_oid = 0x0, pb_mr_type = 0x0, pb_mr_value = 0x0, pb_mr_values = 0x0,
  pb_mr_keys = 0x0, pb_mr_filter_reusable = 0, pb_mr_query_operator = 0, pb_mr_usage = 0, pb_pwd_storage_scheme_user_passwd = 0x0,
  pb_pwd_storage_scheme_db_passwd = 0x0, pb_managedsait = 0, pb_internal_op_result = 0, pb_plugin_internal_search_op_entries = 0x0,
  pb_plugin_internal_search_op_referrals = 0x0, pb_plugin_identity = 0x0, pb_plugin_config_area = 0x0, pb_parent_txn = 0x0, pb_txn = 0x0,
  pb_txn_ruv_mods_fn = 0, pb_dbsize = 0, pb_ldif_files = 0x0, pb_ldif_include = 0x0, pb_ldif_exclude = 0x0, pb_ldif_dump_replica = 0,
  pb_ldif_dump_uniqueid = 0, pb_ldif_generate_uniqueid = 0, pb_ldif_namespaceid = 0x0, pb_ldif_encrypt = 0, pb_operation_notes = 2, pb_slapd_argc = 0,
  pb_slapd_argv = 0x0, pb_slapd_configdir = 0x0, pb_ctrls_arg = 0x0, pb_dse_dont_add_write = 0, pb_dse_add_merge = 0, pb_dse_dont_check_dups = 0,
  pb_dse_is_primary_file = 0, pb_schema_flags = 0, pb_result_code = 0, pb_result_text = 0x0, pb_result_matched = 0x0, pb_nentries = 0, urls = 0x0,
  pb_import_entry = 0x0, pb_import_state = 0, pb_destroy_content = 0, pb_dse_reapply_mods = 0, pb_urp_naming_collision_dn = 0x0,
  pb_urp_tombstone_uniqueid = 0x0, pb_server_running = 0, pb_backend_count = 0, pb_pwpolicy_ctrl = 0, pb_vattr_context = 0x0, pb_substrlens = 0x0,
  pb_plugin_enabled = 0, pb_search_ctrls = 0x0, pb_mr_index_sv_fn = 0}

I can send the core file(s) through but I would like them to be kept confidential. 

Attached is the output from bt full and thread apply all bt 

Regards,
Daniel

Comment 1 Daniel 2011-11-07 00:46:17 UTC
Created attachment 531950 [details]
bt full output

Comment 2 Rich Megginson 2011-11-07 15:20:25 UTC
What is the client?  Looks like the clients are doing multiple simultaneous paged results searches?  The server is 64-bit?  Can you provide the bt full output for threads 29, 22, 20, and 6?

Comment 3 Daniel 2011-11-08 06:05:57 UTC
The client is a rhel 5.7 host running nss_ldap and nscd. Both client and server are 64bit OSes.

Attached at the bt full output for the threads.

Comment 4 Daniel 2011-11-08 06:06:36 UTC
Created attachment 532200 [details]
thread9

Comment 5 Daniel 2011-11-08 06:07:00 UTC
Created attachment 532201 [details]
Thread20

Comment 6 Daniel 2011-11-08 06:07:21 UTC
Created attachment 532202 [details]
Thread22

Comment 7 Daniel 2011-11-08 06:07:45 UTC
Created attachment 532203 [details]
Thread29

Comment 8 Rich Megginson 2011-11-08 17:19:35 UTC
Thanks.  Don't see any smoking gun, although it does appear that the stack frames are corrupted, which can happen in segfaults.

I'd like to find out if the operation has been abandoned.  In gdb, can you do this:

gdb /usr/sbin/ns-slapd /path/to/core.pid
(gdb) up
...
(gdb) up
# keep going up until you are in the function ldbm_back_next_search_entry_ext
(gdb) p pb->pb_op->o_status

You should be in the thread that had the abort, which should be the default thread that gdb puts you in when you start up.

Comment 9 Rich Megginson 2011-11-08 21:56:13 UTC
I've managed to reproduce this.  It's quite difficult.  You have to run both the client and the server in the debugger.  The client does a simple paged result search that should return at least two pages with at least two entries per page.  On the server side, break at the ber_flush() in flush_ber().  Start the client, and request the first page.  On the server, just hit continue until all of the first page of results have been sent to the client.  On the client, request the next page, and break in the function that reads the PDUs for the search entries.  On the server, send at least one entry, and on the client, receive that entry, then quit/exit the client.  This causes a TCP disconnection.  Then continue the server.  You should then see the server abort and raise and core.

What is happening is that the server has two different threads accessing the connection, the thread reading for new operations on the connection, and the thread sending back the search entries/results.  If the reader thread sees the client disconnect first (in connection_read_operation()), it will call disconnect_server_nomutex() which will call pagedresults_cleanup which will free the search results stored in the connection.  The problem is that the writer thread then has a dangling reference to the already freed search results - this is
ldbm_back_next_search_entry_ext()
    back_search_result_set *sr;

The writer thread returns a -1 from ber_flush in flush_ber and sets the SLAPI_OP_STATUS_ABANDONED flag in the operation.  ldbm_back_next_search_entry_ext() sees this and also frees the search request:
        /* check for abandon */
        if ( slapi_op_abandoned( pb ) || (NULL == sr) )
...
            delete_search_result_set( &sr );

Need to figure out some way for this code to know if sr has already been freed.

Comment 10 Rich Megginson 2011-11-08 22:13:31 UTC
Easier steps to reproduce:
load slapd with a 100k entry ldif database.
using ldapsearch, repeatedly issue a search like this:

ldapsearch -xLLL -h localhost -p 1200 -D "cn=directory manager" -w password -s sub -b dc=example,dc=com -E pr=1000/noprompt "objectclass=*" > /dev/null 2>&1 & sleep 1 && kill %1

You have to have the sleep 1 after starting the background process to allow it time to become established, otherwise the kill will happen too soon.  If you do this repeatedly, you will eventually core dump the server.  You will know because ldapsearch will exit like this:
[1]+  Exit 255                ldapsearch ....

which indicates the server is not running.  The trick is to kill the client while it is reading a search entry from the server - it's not easy to get the timing right.

This makes me wonder what the client is doing in the original bug report . . .

Comment 11 Daniel 2011-11-08 22:51:15 UTC
(gdb) up
#6  0x00002b657ba84ec1 in ldbm_back_next_search_entry_ext (pb=0xf4a9b70, use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1238
1238                delete_search_result_set( &sr );
(gdb) p pb->pb_op->o_status
$2 = 1
(gdb)

We are having problems with nscd repeatedly crashing on this client. I am not sure if it's related. I was using the PRNetAddr type in the pb to identify the client. The server kept crashing over and over (we have a restarter script) and the same client ip was the cause in each of the core files.

One thing I have noticed is most of the cores are the same but a couple have a slightly different stack trace. See below:


#0  0x0000003935830265 in raise () from /lib64/libc.so.6
#1  0x0000003935831d10 in abort () from /lib64/libc.so.6
#2  0x000000393586a99b in __libc_message () from /lib64/libc.so.6
#3  0x000000393587245f in _int_free () from /lib64/libc.so.6
#4  0x00000039358728bb in free () from /lib64/libc.so.6
#5  0x00002adb5a1de7b6 in slapi_ch_free (ptr=0x4731d470) at ldap/servers/slapd/ch_malloc.c:363
#6  0x00002adb5e4c32e5 in idl_free (idl=0x2aaaad267560) at ldap/servers/slapd/back-ldbm/idl_common.c:94
#7  0x00002adb5e4ed4ee in delete_search_result_set (sr=0x4731d500) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1552
#8  0x00002adb5e4edec1 in ldbm_back_next_search_entry_ext (pb=0x2aaab0147480, use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1238
#9  0x00002adb5a217b98 in iterate (pb=0x2aaab0147480, be=0x951a130, send_result=1, pnentries=0x473236a4, pagesize=1000, pr_statp=0x47323698)
    at ldap/servers/slapd/opshared.c:1153
#10 0x00002adb5a2180e2 in send_results_ext (pb=0x2aaab0147480, send_result=1, nentries=0x473236a4, pagesize=1000, pr_stat=0x47323698)
    at ldap/servers/slapd/opshared.c:1537
#11 0x00002adb5a218ab0 in op_shared_search (pb=0x2aaab0147480, send_result=1) at ldap/servers/slapd/opshared.c:599
#12 0x000000000042450f in do_search (pb=0x2aaab0147480) at ldap/servers/slapd/search.c:393
#13 0x000000000041352c in connection_threadmain () at ldap/servers/slapd/connection.c:611
#14 0x00000039398284ad in PR_JoinThread () from /usr/lib64/libnspr4.so
#15 0x000000393600673d in start_thread () from /lib64/libpthread.so.0
#16 0x00000039358d40cd in ?? () from /lib64/libc.so.6


It still hits ldbm_back_next_search_entry_ext but then heads into delete_search_result_set. I am assuming the same issue just appearing in a different route in the code?

Comment 12 Rich Megginson 2011-11-09 00:50:09 UTC
(In reply to comment #11)
> (gdb) up
> #6  0x00002b657ba84ec1 in ldbm_back_next_search_entry_ext (pb=0xf4a9b70,
> use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1238
> 1238                delete_search_result_set( &sr );
> (gdb) p pb->pb_op->o_status
> $2 = 1
> (gdb)
> 
> We are having problems with nscd repeatedly crashing on this client. I am not
> sure if it's related. I was using the PRNetAddr type in the pb to identify the
> client. The server kept crashing over and over (we have a restarter script) and
> the same client ip was the cause in each of the core files.
> 
> One thing I have noticed is most of the cores are the same but a couple have a
> slightly different stack trace. See below:
> 
> 
> #0  0x0000003935830265 in raise () from /lib64/libc.so.6
> #1  0x0000003935831d10 in abort () from /lib64/libc.so.6
> #2  0x000000393586a99b in __libc_message () from /lib64/libc.so.6
> #3  0x000000393587245f in _int_free () from /lib64/libc.so.6
> #4  0x00000039358728bb in free () from /lib64/libc.so.6
> #5  0x00002adb5a1de7b6 in slapi_ch_free (ptr=0x4731d470) at
> ldap/servers/slapd/ch_malloc.c:363
> #6  0x00002adb5e4c32e5 in idl_free (idl=0x2aaaad267560) at
> ldap/servers/slapd/back-ldbm/idl_common.c:94
> #7  0x00002adb5e4ed4ee in delete_search_result_set (sr=0x4731d500) at
> ldap/servers/slapd/back-ldbm/ldbm_search.c:1552
> #8  0x00002adb5e4edec1 in ldbm_back_next_search_entry_ext (pb=0x2aaab0147480,
> use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1238
> #9  0x00002adb5a217b98 in iterate (pb=0x2aaab0147480, be=0x951a130,
> send_result=1, pnentries=0x473236a4, pagesize=1000, pr_statp=0x47323698)
>     at ldap/servers/slapd/opshared.c:1153
> #10 0x00002adb5a2180e2 in send_results_ext (pb=0x2aaab0147480, send_result=1,
> nentries=0x473236a4, pagesize=1000, pr_stat=0x47323698)
>     at ldap/servers/slapd/opshared.c:1537
> #11 0x00002adb5a218ab0 in op_shared_search (pb=0x2aaab0147480, send_result=1)
> at ldap/servers/slapd/opshared.c:599
> #12 0x000000000042450f in do_search (pb=0x2aaab0147480) at
> ldap/servers/slapd/search.c:393
> #13 0x000000000041352c in connection_threadmain () at
> ldap/servers/slapd/connection.c:611
> #14 0x00000039398284ad in PR_JoinThread () from /usr/lib64/libnspr4.so
> #15 0x000000393600673d in start_thread () from /lib64/libpthread.so.0
> #16 0x00000039358d40cd in ?? () from /lib64/libc.so.6
> 
> 
> It still hits ldbm_back_next_search_entry_ext but then heads into
> delete_search_result_set. I am assuming the same issue just appearing in a
> different route in the code?

Yes.  I've seen this exact same stack trace too.  It just depends on how far the code gets into freeing the search result set.  It's caused by the same problem.

Comment 13 Rich Megginson 2011-11-09 17:36:55 UTC
RHDS 8.2 does not have the same problem.  In RHDS 8.2, connection.c:disconnect_server_nomutex() does not call pagedresults_cleanup() or do any sort of paged results related cleanup.  All of the cleanup is done in connection_cleanup()

Comment 14 Rich Megginson 2011-11-09 20:16:48 UTC
Created attachment 532659 [details]
0001-Bug-751645-crash-when-simple-paged-fails-to-send-ent.patch

Comment 15 Rich Megginson 2011-11-09 21:47:04 UTC
To ssh://git.fedorahosted.org/git/389/ds.git
   d32deaa..55106fe  master -> master
commit 55106fe77d2c834b0ba866d440bb8ce08c1d01ff
Author: Rich Megginson <rmeggins@redhat.com>
Date:   Wed Nov 9 13:05:39 2011 -0700
    Reviewed by: nkinder,nhosoi (Thanks!)
    Branch: master
    Fix Description: The crash happens when the server is sending back the paged
    result entry responses to the client and there is a problem with the
    connection e.g. the client closes the socket while the server is doing the
    PR_Send/PR_Write on the client socket.  If the reader thread in
    connection_read_operation() sees the close first, it will call
    disconnect_server() to disconnect the socket and cleanup the pagedresult
    structure back_search_result_set in the Connection*.  The problem with this
    is that it leaves a dangling reference to the pagedresult structures in the
    writer thread in ldbm_back_next_search_entry_ext.  When that code sees the
    error from the write, it will also attempt to free the search result, and
    will get an invalid or double free error.  The solution is to not do the
    pagedresults_cleanup in disconnect_server(), but instead allow the writer
    thread to do the cleanup safely.  The connection_cleanup() function will
    call pagedresults_cleanup() to avoid any memory leaks.  The only thing the
    disconnect_server() function needs to do is to reset the c_timelimit to
    avoid the "slapd stops responding" and "simple paged results timeout"
    problems.
    Platforms tested: RHEL6 x86_64
    Flag Day: no
    Doc impact: no

Comment 16 Rich Megginson 2012-01-10 18:12:58 UTC
marking as screened because already flagged for rhel 6.3.0

Comment 17 Rich Megginson 2012-01-10 20:18:05 UTC
Upstream ticket:
https://fedorahosted.org/389/ticket/238

Comment 18 Amita Sharma 2012-04-27 11:02:18 UTC
[root@snmaptest scripts]# ldapsearch -xLLL -h localhost -p 389 -D "cn=directory manager" -w Secret123 -s sub -b dc=example,dc=com -E pr=1000/noprompt "objectclass=*" > /dev/null 2>&1 &
[1] 16058
[root@snmaptest scripts]# kill %1
[root@snmaptest scripts]# ldapsearch -xLLL -h localhost -p 389 -D "cn=directory manager" -w Secret123 -s sub -b dc=example,dc=com -E pr=1000/noprompt "objectclass=*" > /dev/null 2>&1 &

No Crashes Found.
Hence marking as VERIFIED.


Note You need to log in before you can comment on or make changes to this bug.