| Summary: | crash when simple paged fails to send entry to client | ||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | [Retired] 389 | Reporter: | Daniel <daniel.appleby> | ||||||||||||||||
| Component: | Directory Server | Assignee: | Rich Megginson <rmeggins> | ||||||||||||||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Viktor Ashirov <vashirov> | ||||||||||||||||
| Severity: | high | Docs Contact: | |||||||||||||||||
| Priority: | unspecified | ||||||||||||||||||
| Version: | 1.2.9 | CC: | amsharma, andrey.ivanov, jmontleo, jnansi, nhosoi, nkinder | ||||||||||||||||
| Target Milestone: | --- | ||||||||||||||||||
| Target Release: | --- | ||||||||||||||||||
| Hardware: | x86_64 | ||||||||||||||||||
| OS: | Linux | ||||||||||||||||||
| Whiteboard: | |||||||||||||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||||||||||||
| Doc Text: | Story Points: | --- | |||||||||||||||||
| Clone Of: | |||||||||||||||||||
| : | 752577 797946 (view as bug list) | Environment: | |||||||||||||||||
| Last Closed: | 2015-12-07 16:39:10 UTC | Type: | --- | ||||||||||||||||
| Regression: | --- | Mount Type: | --- | ||||||||||||||||
| Documentation: | --- | CRM: | |||||||||||||||||
| Verified Versions: | Category: | --- | |||||||||||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||||||
| Bug Depends On: | |||||||||||||||||||
| Bug Blocks: | 743970, 752577, 752607, 797946 | ||||||||||||||||||
| Attachments: |
|
||||||||||||||||||
|
Description
Daniel
2011-11-07 00:45:43 UTC
Created attachment 531950 [details]
bt full output
What is the client? Looks like the clients are doing multiple simultaneous paged results searches? The server is 64-bit? Can you provide the bt full output for threads 29, 22, 20, and 6? The client is a rhel 5.7 host running nss_ldap and nscd. Both client and server are 64bit OSes. Attached at the bt full output for the threads. Created attachment 532200 [details]
thread9
Created attachment 532201 [details]
Thread20
Created attachment 532202 [details]
Thread22
Created attachment 532203 [details]
Thread29
Thanks. Don't see any smoking gun, although it does appear that the stack frames are corrupted, which can happen in segfaults. I'd like to find out if the operation has been abandoned. In gdb, can you do this: gdb /usr/sbin/ns-slapd /path/to/core.pid (gdb) up ... (gdb) up # keep going up until you are in the function ldbm_back_next_search_entry_ext (gdb) p pb->pb_op->o_status You should be in the thread that had the abort, which should be the default thread that gdb puts you in when you start up. I've managed to reproduce this. It's quite difficult. You have to run both the client and the server in the debugger. The client does a simple paged result search that should return at least two pages with at least two entries per page. On the server side, break at the ber_flush() in flush_ber(). Start the client, and request the first page. On the server, just hit continue until all of the first page of results have been sent to the client. On the client, request the next page, and break in the function that reads the PDUs for the search entries. On the server, send at least one entry, and on the client, receive that entry, then quit/exit the client. This causes a TCP disconnection. Then continue the server. You should then see the server abort and raise and core.
What is happening is that the server has two different threads accessing the connection, the thread reading for new operations on the connection, and the thread sending back the search entries/results. If the reader thread sees the client disconnect first (in connection_read_operation()), it will call disconnect_server_nomutex() which will call pagedresults_cleanup which will free the search results stored in the connection. The problem is that the writer thread then has a dangling reference to the already freed search results - this is
ldbm_back_next_search_entry_ext()
back_search_result_set *sr;
The writer thread returns a -1 from ber_flush in flush_ber and sets the SLAPI_OP_STATUS_ABANDONED flag in the operation. ldbm_back_next_search_entry_ext() sees this and also frees the search request:
/* check for abandon */
if ( slapi_op_abandoned( pb ) || (NULL == sr) )
...
delete_search_result_set( &sr );
Need to figure out some way for this code to know if sr has already been freed.
Easier steps to reproduce: load slapd with a 100k entry ldif database. using ldapsearch, repeatedly issue a search like this: ldapsearch -xLLL -h localhost -p 1200 -D "cn=directory manager" -w password -s sub -b dc=example,dc=com -E pr=1000/noprompt "objectclass=*" > /dev/null 2>&1 & sleep 1 && kill %1 You have to have the sleep 1 after starting the background process to allow it time to become established, otherwise the kill will happen too soon. If you do this repeatedly, you will eventually core dump the server. You will know because ldapsearch will exit like this: [1]+ Exit 255 ldapsearch .... which indicates the server is not running. The trick is to kill the client while it is reading a search entry from the server - it's not easy to get the timing right. This makes me wonder what the client is doing in the original bug report . . . (gdb) up
#6 0x00002b657ba84ec1 in ldbm_back_next_search_entry_ext (pb=0xf4a9b70, use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1238
1238 delete_search_result_set( &sr );
(gdb) p pb->pb_op->o_status
$2 = 1
(gdb)
We are having problems with nscd repeatedly crashing on this client. I am not sure if it's related. I was using the PRNetAddr type in the pb to identify the client. The server kept crashing over and over (we have a restarter script) and the same client ip was the cause in each of the core files.
One thing I have noticed is most of the cores are the same but a couple have a slightly different stack trace. See below:
#0 0x0000003935830265 in raise () from /lib64/libc.so.6
#1 0x0000003935831d10 in abort () from /lib64/libc.so.6
#2 0x000000393586a99b in __libc_message () from /lib64/libc.so.6
#3 0x000000393587245f in _int_free () from /lib64/libc.so.6
#4 0x00000039358728bb in free () from /lib64/libc.so.6
#5 0x00002adb5a1de7b6 in slapi_ch_free (ptr=0x4731d470) at ldap/servers/slapd/ch_malloc.c:363
#6 0x00002adb5e4c32e5 in idl_free (idl=0x2aaaad267560) at ldap/servers/slapd/back-ldbm/idl_common.c:94
#7 0x00002adb5e4ed4ee in delete_search_result_set (sr=0x4731d500) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1552
#8 0x00002adb5e4edec1 in ldbm_back_next_search_entry_ext (pb=0x2aaab0147480, use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1238
#9 0x00002adb5a217b98 in iterate (pb=0x2aaab0147480, be=0x951a130, send_result=1, pnentries=0x473236a4, pagesize=1000, pr_statp=0x47323698)
at ldap/servers/slapd/opshared.c:1153
#10 0x00002adb5a2180e2 in send_results_ext (pb=0x2aaab0147480, send_result=1, nentries=0x473236a4, pagesize=1000, pr_stat=0x47323698)
at ldap/servers/slapd/opshared.c:1537
#11 0x00002adb5a218ab0 in op_shared_search (pb=0x2aaab0147480, send_result=1) at ldap/servers/slapd/opshared.c:599
#12 0x000000000042450f in do_search (pb=0x2aaab0147480) at ldap/servers/slapd/search.c:393
#13 0x000000000041352c in connection_threadmain () at ldap/servers/slapd/connection.c:611
#14 0x00000039398284ad in PR_JoinThread () from /usr/lib64/libnspr4.so
#15 0x000000393600673d in start_thread () from /lib64/libpthread.so.0
#16 0x00000039358d40cd in ?? () from /lib64/libc.so.6
It still hits ldbm_back_next_search_entry_ext but then heads into delete_search_result_set. I am assuming the same issue just appearing in a different route in the code?
(In reply to comment #11) > (gdb) up > #6 0x00002b657ba84ec1 in ldbm_back_next_search_entry_ext (pb=0xf4a9b70, > use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1238 > 1238 delete_search_result_set( &sr ); > (gdb) p pb->pb_op->o_status > $2 = 1 > (gdb) > > We are having problems with nscd repeatedly crashing on this client. I am not > sure if it's related. I was using the PRNetAddr type in the pb to identify the > client. The server kept crashing over and over (we have a restarter script) and > the same client ip was the cause in each of the core files. > > One thing I have noticed is most of the cores are the same but a couple have a > slightly different stack trace. See below: > > > #0 0x0000003935830265 in raise () from /lib64/libc.so.6 > #1 0x0000003935831d10 in abort () from /lib64/libc.so.6 > #2 0x000000393586a99b in __libc_message () from /lib64/libc.so.6 > #3 0x000000393587245f in _int_free () from /lib64/libc.so.6 > #4 0x00000039358728bb in free () from /lib64/libc.so.6 > #5 0x00002adb5a1de7b6 in slapi_ch_free (ptr=0x4731d470) at > ldap/servers/slapd/ch_malloc.c:363 > #6 0x00002adb5e4c32e5 in idl_free (idl=0x2aaaad267560) at > ldap/servers/slapd/back-ldbm/idl_common.c:94 > #7 0x00002adb5e4ed4ee in delete_search_result_set (sr=0x4731d500) at > ldap/servers/slapd/back-ldbm/ldbm_search.c:1552 > #8 0x00002adb5e4edec1 in ldbm_back_next_search_entry_ext (pb=0x2aaab0147480, > use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1238 > #9 0x00002adb5a217b98 in iterate (pb=0x2aaab0147480, be=0x951a130, > send_result=1, pnentries=0x473236a4, pagesize=1000, pr_statp=0x47323698) > at ldap/servers/slapd/opshared.c:1153 > #10 0x00002adb5a2180e2 in send_results_ext (pb=0x2aaab0147480, send_result=1, > nentries=0x473236a4, pagesize=1000, pr_stat=0x47323698) > at ldap/servers/slapd/opshared.c:1537 > #11 0x00002adb5a218ab0 in op_shared_search (pb=0x2aaab0147480, send_result=1) > at ldap/servers/slapd/opshared.c:599 > #12 0x000000000042450f in do_search (pb=0x2aaab0147480) at > ldap/servers/slapd/search.c:393 > #13 0x000000000041352c in connection_threadmain () at > ldap/servers/slapd/connection.c:611 > #14 0x00000039398284ad in PR_JoinThread () from /usr/lib64/libnspr4.so > #15 0x000000393600673d in start_thread () from /lib64/libpthread.so.0 > #16 0x00000039358d40cd in ?? () from /lib64/libc.so.6 > > > It still hits ldbm_back_next_search_entry_ext but then heads into > delete_search_result_set. I am assuming the same issue just appearing in a > different route in the code? Yes. I've seen this exact same stack trace too. It just depends on how far the code gets into freeing the search result set. It's caused by the same problem. RHDS 8.2 does not have the same problem. In RHDS 8.2, connection.c:disconnect_server_nomutex() does not call pagedresults_cleanup() or do any sort of paged results related cleanup. All of the cleanup is done in connection_cleanup() Created attachment 532659 [details]
0001-Bug-751645-crash-when-simple-paged-fails-to-send-ent.patch
To ssh://git.fedorahosted.org/git/389/ds.git
d32deaa..55106fe master -> master
commit 55106fe77d2c834b0ba866d440bb8ce08c1d01ff
Author: Rich Megginson <rmeggins>
Date: Wed Nov 9 13:05:39 2011 -0700
Reviewed by: nkinder,nhosoi (Thanks!)
Branch: master
Fix Description: The crash happens when the server is sending back the paged
result entry responses to the client and there is a problem with the
connection e.g. the client closes the socket while the server is doing the
PR_Send/PR_Write on the client socket. If the reader thread in
connection_read_operation() sees the close first, it will call
disconnect_server() to disconnect the socket and cleanup the pagedresult
structure back_search_result_set in the Connection*. The problem with this
is that it leaves a dangling reference to the pagedresult structures in the
writer thread in ldbm_back_next_search_entry_ext. When that code sees the
error from the write, it will also attempt to free the search result, and
will get an invalid or double free error. The solution is to not do the
pagedresults_cleanup in disconnect_server(), but instead allow the writer
thread to do the cleanup safely. The connection_cleanup() function will
call pagedresults_cleanup() to avoid any memory leaks. The only thing the
disconnect_server() function needs to do is to reset the c_timelimit to
avoid the "slapd stops responding" and "simple paged results timeout"
problems.
Platforms tested: RHEL6 x86_64
Flag Day: no
Doc impact: no
marking as screened because already flagged for rhel 6.3.0 Upstream ticket: https://fedorahosted.org/389/ticket/238 [root@snmaptest scripts]# ldapsearch -xLLL -h localhost -p 389 -D "cn=directory manager" -w Secret123 -s sub -b dc=example,dc=com -E pr=1000/noprompt "objectclass=*" > /dev/null 2>&1 & [1] 16058 [root@snmaptest scripts]# kill %1 [root@snmaptest scripts]# ldapsearch -xLLL -h localhost -p 389 -D "cn=directory manager" -w Secret123 -s sub -b dc=example,dc=com -E pr=1000/noprompt "objectclass=*" > /dev/null 2>&1 & No Crashes Found. Hence marking as VERIFIED. |