Created attachment 531949 [details] thread apply all bt output I am running 389-ds-base-1.2.9.9-1.el5 on RHEL 5.7 Name : 389-ds-base Relocations: (not relocatable) Version : 1.2.9.9 Vendor: Fedora Project Release : 1.el5 Build Date: Fri 02 Sep 2011 06:41:29 AM EST Install Date: Wed 21 Sep 2011 07:16:41 PM EST Build Host: x86-03.phx2.fedoraproject.org Group : System Environment/Daemons Source RPM: 389-ds-base-1.2.9.9-1.el5.src.rpm Size : 5005087 License: GPLv2 with exceptions Signature : DSA/SHA1, Fri 02 Sep 2011 10:01:24 AM EST, Key ID 119cc036217521f6 Packager : Fedora Project URL : http://port389.org/ Summary : 389 Directory Server (base) Description : 389 Directory Server is an LDAPv3 compliant server. The base package includes the LDAP server and command line utilities for server administration. One of our servers has been causes repeated directory server crashes (we have powered it off for now). See stack trace from the core dump below: #0 0x0000003935830265 in raise () from /lib64/libc.so.6 #1 0x0000003935831d10 in abort () from /lib64/libc.so.6 #2 0x000000393586a99b in __libc_message () from /lib64/libc.so.6 #3 0x000000393587245f in _int_free () from /lib64/libc.so.6 #4 0x00000039358728bb in free () from /lib64/libc.so.6 #5 0x00002b65777757b6 in slapi_ch_free (ptr=0x597e8500) at ldap/servers/slapd/ch_malloc.c:363 #6 0x00002b657ba84ec1 in ldbm_back_next_search_entry_ext (pb=0xf4a9b70, use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1238 #7 0x00002b65777aeb98 in iterate (pb=0xf4a9b70, be=0xe823130, send_result=1, pnentries=0x597ee6a4, pagesize=1000, pr_statp=0x597ee698) at ldap/servers/slapd/opshared.c:1153 #8 0x00002b65777af0e2 in send_results_ext (pb=0xf4a9b70, send_result=1, nentries=0x597ee6a4, pagesize=1000, pr_stat=0x597ee698) at ldap/servers/slapd/opshared.c:1537 #9 0x00002b65777afab0 in op_shared_search (pb=0xf4a9b70, send_result=1) at ldap/servers/slapd/opshared.c:599 #10 0x000000000042450f in do_search (pb=0xf4a9b70) at ldap/servers/slapd/search.c:393 #11 0x000000000041352c in connection_threadmain () at ldap/servers/slapd/connection.c:611 #12 0x00000039398284ad in PR_JoinThread () from /usr/lib64/libnspr4.so #13 0x000000393600673d in start_thread () from /lib64/libpthread.so.0 (gdb) print *pb $1 = {pb_backend = 0xe823130, pb_conn = 0x2aaaaaab1fa0, pb_op = 0xf48c7e0, pb_plugin = 0xe6fb1a0, pb_opreturn = 0, pb_object = 0x0, pb_destroy_fn = 0, pb_requestor_isroot = 0, pb_config_fname = 0x0, pb_config_lineno = 0, pb_config_argc = 0, pb_config_argv = 0x0, pb_target_entry = 0x0, pb_existing_dn_entry = 0x0, pb_existing_uniqueid_entry = 0x0, pb_parent_entry = 0x0, pb_newparent_entry = 0x0, pb_pre_op_entry = 0x2aaab066bfc0, pb_post_op_entry = 0x0, pb_seq_type = 0, pb_seq_attrname = 0x0, pb_seq_val = 0x0, pb_ldif_file = 0x0, pb_removedupvals = 0, pb_db2index_attrs = 0x0, pb_ldif2db_noattrindexes = 0, pb_ldif_printkey = 0, pb_instance_name = 0x0, pb_task = 0x0, pb_task_flags = 0, pb_mr_filter_match_fn = 0, pb_mr_filter_index_fn = 0, pb_mr_filter_reset_fn = 0, pb_mr_index_fn = 0, pb_mr_oid = 0x0, pb_mr_type = 0x0, pb_mr_value = 0x0, pb_mr_values = 0x0, pb_mr_keys = 0x0, pb_mr_filter_reusable = 0, pb_mr_query_operator = 0, pb_mr_usage = 0, pb_pwd_storage_scheme_user_passwd = 0x0, pb_pwd_storage_scheme_db_passwd = 0x0, pb_managedsait = 0, pb_internal_op_result = 0, pb_plugin_internal_search_op_entries = 0x0, pb_plugin_internal_search_op_referrals = 0x0, pb_plugin_identity = 0x0, pb_plugin_config_area = 0x0, pb_parent_txn = 0x0, pb_txn = 0x0, pb_txn_ruv_mods_fn = 0, pb_dbsize = 0, pb_ldif_files = 0x0, pb_ldif_include = 0x0, pb_ldif_exclude = 0x0, pb_ldif_dump_replica = 0, pb_ldif_dump_uniqueid = 0, pb_ldif_generate_uniqueid = 0, pb_ldif_namespaceid = 0x0, pb_ldif_encrypt = 0, pb_operation_notes = 2, pb_slapd_argc = 0, pb_slapd_argv = 0x0, pb_slapd_configdir = 0x0, pb_ctrls_arg = 0x0, pb_dse_dont_add_write = 0, pb_dse_add_merge = 0, pb_dse_dont_check_dups = 0, pb_dse_is_primary_file = 0, pb_schema_flags = 0, pb_result_code = 0, pb_result_text = 0x0, pb_result_matched = 0x0, pb_nentries = 0, urls = 0x0, pb_import_entry = 0x0, pb_import_state = 0, pb_destroy_content = 0, pb_dse_reapply_mods = 0, pb_urp_naming_collision_dn = 0x0, pb_urp_tombstone_uniqueid = 0x0, pb_server_running = 0, pb_backend_count = 0, pb_pwpolicy_ctrl = 0, pb_vattr_context = 0x0, pb_substrlens = 0x0, pb_plugin_enabled = 0, pb_search_ctrls = 0x0, pb_mr_index_sv_fn = 0} I can send the core file(s) through but I would like them to be kept confidential. Attached is the output from bt full and thread apply all bt Regards, Daniel
Created attachment 531950 [details] bt full output
What is the client? Looks like the clients are doing multiple simultaneous paged results searches? The server is 64-bit? Can you provide the bt full output for threads 29, 22, 20, and 6?
The client is a rhel 5.7 host running nss_ldap and nscd. Both client and server are 64bit OSes. Attached at the bt full output for the threads.
Created attachment 532200 [details] thread9
Created attachment 532201 [details] Thread20
Created attachment 532202 [details] Thread22
Created attachment 532203 [details] Thread29
Thanks. Don't see any smoking gun, although it does appear that the stack frames are corrupted, which can happen in segfaults. I'd like to find out if the operation has been abandoned. In gdb, can you do this: gdb /usr/sbin/ns-slapd /path/to/core.pid (gdb) up ... (gdb) up # keep going up until you are in the function ldbm_back_next_search_entry_ext (gdb) p pb->pb_op->o_status You should be in the thread that had the abort, which should be the default thread that gdb puts you in when you start up.
I've managed to reproduce this. It's quite difficult. You have to run both the client and the server in the debugger. The client does a simple paged result search that should return at least two pages with at least two entries per page. On the server side, break at the ber_flush() in flush_ber(). Start the client, and request the first page. On the server, just hit continue until all of the first page of results have been sent to the client. On the client, request the next page, and break in the function that reads the PDUs for the search entries. On the server, send at least one entry, and on the client, receive that entry, then quit/exit the client. This causes a TCP disconnection. Then continue the server. You should then see the server abort and raise and core. What is happening is that the server has two different threads accessing the connection, the thread reading for new operations on the connection, and the thread sending back the search entries/results. If the reader thread sees the client disconnect first (in connection_read_operation()), it will call disconnect_server_nomutex() which will call pagedresults_cleanup which will free the search results stored in the connection. The problem is that the writer thread then has a dangling reference to the already freed search results - this is ldbm_back_next_search_entry_ext() back_search_result_set *sr; The writer thread returns a -1 from ber_flush in flush_ber and sets the SLAPI_OP_STATUS_ABANDONED flag in the operation. ldbm_back_next_search_entry_ext() sees this and also frees the search request: /* check for abandon */ if ( slapi_op_abandoned( pb ) || (NULL == sr) ) ... delete_search_result_set( &sr ); Need to figure out some way for this code to know if sr has already been freed.
Easier steps to reproduce: load slapd with a 100k entry ldif database. using ldapsearch, repeatedly issue a search like this: ldapsearch -xLLL -h localhost -p 1200 -D "cn=directory manager" -w password -s sub -b dc=example,dc=com -E pr=1000/noprompt "objectclass=*" > /dev/null 2>&1 & sleep 1 && kill %1 You have to have the sleep 1 after starting the background process to allow it time to become established, otherwise the kill will happen too soon. If you do this repeatedly, you will eventually core dump the server. You will know because ldapsearch will exit like this: [1]+ Exit 255 ldapsearch .... which indicates the server is not running. The trick is to kill the client while it is reading a search entry from the server - it's not easy to get the timing right. This makes me wonder what the client is doing in the original bug report . . .
(gdb) up #6 0x00002b657ba84ec1 in ldbm_back_next_search_entry_ext (pb=0xf4a9b70, use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1238 1238 delete_search_result_set( &sr ); (gdb) p pb->pb_op->o_status $2 = 1 (gdb) We are having problems with nscd repeatedly crashing on this client. I am not sure if it's related. I was using the PRNetAddr type in the pb to identify the client. The server kept crashing over and over (we have a restarter script) and the same client ip was the cause in each of the core files. One thing I have noticed is most of the cores are the same but a couple have a slightly different stack trace. See below: #0 0x0000003935830265 in raise () from /lib64/libc.so.6 #1 0x0000003935831d10 in abort () from /lib64/libc.so.6 #2 0x000000393586a99b in __libc_message () from /lib64/libc.so.6 #3 0x000000393587245f in _int_free () from /lib64/libc.so.6 #4 0x00000039358728bb in free () from /lib64/libc.so.6 #5 0x00002adb5a1de7b6 in slapi_ch_free (ptr=0x4731d470) at ldap/servers/slapd/ch_malloc.c:363 #6 0x00002adb5e4c32e5 in idl_free (idl=0x2aaaad267560) at ldap/servers/slapd/back-ldbm/idl_common.c:94 #7 0x00002adb5e4ed4ee in delete_search_result_set (sr=0x4731d500) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1552 #8 0x00002adb5e4edec1 in ldbm_back_next_search_entry_ext (pb=0x2aaab0147480, use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1238 #9 0x00002adb5a217b98 in iterate (pb=0x2aaab0147480, be=0x951a130, send_result=1, pnentries=0x473236a4, pagesize=1000, pr_statp=0x47323698) at ldap/servers/slapd/opshared.c:1153 #10 0x00002adb5a2180e2 in send_results_ext (pb=0x2aaab0147480, send_result=1, nentries=0x473236a4, pagesize=1000, pr_stat=0x47323698) at ldap/servers/slapd/opshared.c:1537 #11 0x00002adb5a218ab0 in op_shared_search (pb=0x2aaab0147480, send_result=1) at ldap/servers/slapd/opshared.c:599 #12 0x000000000042450f in do_search (pb=0x2aaab0147480) at ldap/servers/slapd/search.c:393 #13 0x000000000041352c in connection_threadmain () at ldap/servers/slapd/connection.c:611 #14 0x00000039398284ad in PR_JoinThread () from /usr/lib64/libnspr4.so #15 0x000000393600673d in start_thread () from /lib64/libpthread.so.0 #16 0x00000039358d40cd in ?? () from /lib64/libc.so.6 It still hits ldbm_back_next_search_entry_ext but then heads into delete_search_result_set. I am assuming the same issue just appearing in a different route in the code?
(In reply to comment #11) > (gdb) up > #6 0x00002b657ba84ec1 in ldbm_back_next_search_entry_ext (pb=0xf4a9b70, > use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1238 > 1238 delete_search_result_set( &sr ); > (gdb) p pb->pb_op->o_status > $2 = 1 > (gdb) > > We are having problems with nscd repeatedly crashing on this client. I am not > sure if it's related. I was using the PRNetAddr type in the pb to identify the > client. The server kept crashing over and over (we have a restarter script) and > the same client ip was the cause in each of the core files. > > One thing I have noticed is most of the cores are the same but a couple have a > slightly different stack trace. See below: > > > #0 0x0000003935830265 in raise () from /lib64/libc.so.6 > #1 0x0000003935831d10 in abort () from /lib64/libc.so.6 > #2 0x000000393586a99b in __libc_message () from /lib64/libc.so.6 > #3 0x000000393587245f in _int_free () from /lib64/libc.so.6 > #4 0x00000039358728bb in free () from /lib64/libc.so.6 > #5 0x00002adb5a1de7b6 in slapi_ch_free (ptr=0x4731d470) at > ldap/servers/slapd/ch_malloc.c:363 > #6 0x00002adb5e4c32e5 in idl_free (idl=0x2aaaad267560) at > ldap/servers/slapd/back-ldbm/idl_common.c:94 > #7 0x00002adb5e4ed4ee in delete_search_result_set (sr=0x4731d500) at > ldap/servers/slapd/back-ldbm/ldbm_search.c:1552 > #8 0x00002adb5e4edec1 in ldbm_back_next_search_entry_ext (pb=0x2aaab0147480, > use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1238 > #9 0x00002adb5a217b98 in iterate (pb=0x2aaab0147480, be=0x951a130, > send_result=1, pnentries=0x473236a4, pagesize=1000, pr_statp=0x47323698) > at ldap/servers/slapd/opshared.c:1153 > #10 0x00002adb5a2180e2 in send_results_ext (pb=0x2aaab0147480, send_result=1, > nentries=0x473236a4, pagesize=1000, pr_stat=0x47323698) > at ldap/servers/slapd/opshared.c:1537 > #11 0x00002adb5a218ab0 in op_shared_search (pb=0x2aaab0147480, send_result=1) > at ldap/servers/slapd/opshared.c:599 > #12 0x000000000042450f in do_search (pb=0x2aaab0147480) at > ldap/servers/slapd/search.c:393 > #13 0x000000000041352c in connection_threadmain () at > ldap/servers/slapd/connection.c:611 > #14 0x00000039398284ad in PR_JoinThread () from /usr/lib64/libnspr4.so > #15 0x000000393600673d in start_thread () from /lib64/libpthread.so.0 > #16 0x00000039358d40cd in ?? () from /lib64/libc.so.6 > > > It still hits ldbm_back_next_search_entry_ext but then heads into > delete_search_result_set. I am assuming the same issue just appearing in a > different route in the code? Yes. I've seen this exact same stack trace too. It just depends on how far the code gets into freeing the search result set. It's caused by the same problem.
RHDS 8.2 does not have the same problem. In RHDS 8.2, connection.c:disconnect_server_nomutex() does not call pagedresults_cleanup() or do any sort of paged results related cleanup. All of the cleanup is done in connection_cleanup()
Created attachment 532659 [details] 0001-Bug-751645-crash-when-simple-paged-fails-to-send-ent.patch
To ssh://git.fedorahosted.org/git/389/ds.git d32deaa..55106fe master -> master commit 55106fe77d2c834b0ba866d440bb8ce08c1d01ff Author: Rich Megginson <rmeggins> Date: Wed Nov 9 13:05:39 2011 -0700 Reviewed by: nkinder,nhosoi (Thanks!) Branch: master Fix Description: The crash happens when the server is sending back the paged result entry responses to the client and there is a problem with the connection e.g. the client closes the socket while the server is doing the PR_Send/PR_Write on the client socket. If the reader thread in connection_read_operation() sees the close first, it will call disconnect_server() to disconnect the socket and cleanup the pagedresult structure back_search_result_set in the Connection*. The problem with this is that it leaves a dangling reference to the pagedresult structures in the writer thread in ldbm_back_next_search_entry_ext. When that code sees the error from the write, it will also attempt to free the search result, and will get an invalid or double free error. The solution is to not do the pagedresults_cleanup in disconnect_server(), but instead allow the writer thread to do the cleanup safely. The connection_cleanup() function will call pagedresults_cleanup() to avoid any memory leaks. The only thing the disconnect_server() function needs to do is to reset the c_timelimit to avoid the "slapd stops responding" and "simple paged results timeout" problems. Platforms tested: RHEL6 x86_64 Flag Day: no Doc impact: no
marking as screened because already flagged for rhel 6.3.0
Upstream ticket: https://fedorahosted.org/389/ticket/238
[root@snmaptest scripts]# ldapsearch -xLLL -h localhost -p 389 -D "cn=directory manager" -w Secret123 -s sub -b dc=example,dc=com -E pr=1000/noprompt "objectclass=*" > /dev/null 2>&1 & [1] 16058 [root@snmaptest scripts]# kill %1 [root@snmaptest scripts]# ldapsearch -xLLL -h localhost -p 389 -D "cn=directory manager" -w Secret123 -s sub -b dc=example,dc=com -E pr=1000/noprompt "objectclass=*" > /dev/null 2>&1 & No Crashes Found. Hence marking as VERIFIED.