Description of problem: ns-slapd randomly segfaults in libcrypto.so: ns-slapd[2725]: segfault at 10a3000010af ip 0000003d58c95785 sp 00007ff2abf04040 error 4 in libcrypto.so.0.9.8n[3d58c00000+15b000] ns-slapd[2727]: segfault at 10a3000010af ip 0000003d58c95785 sp 00007ff2aab02040 error 4 in libcrypto.so.0.9.8n[3d58c00000+15b000] Version-Release number of selected component (if applicable): Seen in 1.2.5 and 1.2.6 How reproducible: Occurs randomly, but appears to only happen when SASL/GSSAPI multimaster replication is enabled. Actual results: Segfaults or libc detects memory corruption. Here are several backtraces: (gdb) thread 10 [Switching to thread 10 (Thread 31829)]#0 getrn (lh=0x7fe34c004d20, data=0x7fe36ac45350, rhash=<value optimized out>) at lhash.c:425 425 if (n1->hash != hash) (gdb) bt #0 getrn (lh=0x7fe34c004d20, data=0x7fe36ac45350, rhash=<value optimized out>) at lhash.c:425 #1 0x00007fe36a96d9d6 in lh_insert (lh=0x7fe34c004d20, data=0x7fe36ac45350) at lhash.c:189 #2 0x00007fe36a9700c9 in int_err_set_item (d=0x7fe36ac45350) at err.c:407 #3 0x00007fe36a96fa4c in err_load_strings (lib=0, str=0x7fe36ac45350) at err.c:676 #4 0x00007fe36a970fb1 in ERR_load_ERR_strings () at err.c:662 #5 0x00007fe36a971199 in ERR_load_crypto_strings () at err_all.c:113 #6 0x00007fe36022d857 in openssl_init (cryptoctx=0x7fe358006218) at pkinit_crypto_openssl.c:2350 #7 pkinit_init_plg_crypto (cryptoctx=0x7fe358006218) at pkinit_crypto_openssl.c:277 #8 0x00007fe360222b68 in pkinit_client_plugin_init (context=0x7fe358003830, blob=0x7fe3612b0e40) at pkinit_clnt.c:1362 #9 0x0000003ab92557eb in krb5_init_preauth_context (kcontext=0x7fe358003830) at preauth2.c:146 #10 0x0000003ab925628a in krb5_preauth_request_context_init (context=0x7fe358003830) at preauth2.c:305 #11 0x0000003ab9249ee6 in krb5_get_init_creds (context=0x7fe358003830, creds=0x7fe3612b13c0, client=0x7fe3580021c0, prompter=0, prompter_data=0x0, start_time=0, in_tkt_service=0x0, options=0x7fe358002a50, gak_fct=0x3ab924bac0 <krb5_get_as_key_keytab>, gak_data=0x7fe3580021a0, use_master=0x7fe3612b123c, as_reply=0x0) at get_in_tkt.c:1167 #12 0x0000003ab924bcce in krb5_get_init_creds_keytab (context=0x7fe358003830, creds=0x7fe3612b13c0, client=0x7fe3580021c0, arg_keytab=0x7fe3580021a0, start_time=0, in_tkt_service=0x0, options=0x0) at gic_keytab.c:109 #13 0x0000003c618602f7 in set_krb5_creds (authid=<value optimized out>, username=<value optimized out>, passwd=<value optimized out>, realm=<value optimized out>, vals=0x7fe358002bd0) at ldap/servers/slapd/ldaputil.c:1483 #14 0x0000003c61860a95 in ldap_sasl_set_interact_vals (ld=0x7fe358002270, bindid=0x7fe358003700 "uid=ldap/old-faithful.mit.edu,ou=People,dc=scripts,dc=mit,dc=edu", creds=0x7fe358000d40 "", mech=<value optimized out>, serverctrls=0x0, returnedctrls=0x7fe3612b1ab8, msgidp=0x0) at ldap/servers/slapd/ldaputil.c:928 #15 slapd_ldap_sasl_interactive_bind (ld=0x7fe358002270, bindid=0x7fe358003700 "uid=ldap/old-faithful.mit.edu,ou=People,dc=scripts,dc=mit,dc=edu", creds=0x7fe358000d40 "", mech=<value optimized out>, serverctrls=0x0, returnedctrls=0x7fe3612b1ab8, msgidp=0x0) at ldap/servers/slapd/ldaputil.c:1044 #16 0x0000003c61860d17 in slapi_ldap_bind (ld=0x7fe358002270, bindid=0x7fe358003700 "uid=ldap/old-faithful.mit.edu,ou=People,dc=scripts,dc=mit,dc=edu", creds=<value optimized out>, mech=<value optimized out>, serverctrls=0x0, returnedctrls=0x7fe3612b1ab8, timeout=0x0, msgidp=0x0) at ldap/servers/slapd/ldaputil.c:751 #17 0x00007fe3685927db in bind_and_check_pwp (conn=0x1480220) at ldap/servers/plugins/replication/repl5_connection.c:1750 #18 conn_connect (conn=0x1480220) at ldap/servers/plugins/replication/repl5_connection.c:1136 #19 0x00007fe36859aee7 in acquire_replica (prp=0x1480380, prot_oid=<value optimized out>, ruv=0x7fe3612b1d20) at ldap/servers/plugins/replication/repl5_protocol_util.c:168 #20 0x00007fe368595897 in repl5_inc_run (prp=0x1480380) at ldap/servers/plugins/replication/repl5_inc_protocol.c:828 #21 0x00007fe36859a3d5 in prot_thread_main (arg=0x14aedd0) at ldap/servers/plugins/replication/repl5_protocol.c:311 #22 0x0000003abee29593 in _pt_root (arg=0x14ad530) at ../../../mozilla/nsprpub/pr/src/pthreads/ptthread.c:228 #23 0x0000003ab6207761 in start_thread (arg=0x7fe3612b2710) at pthread_create.c:301 #24 0x0000003ab5ae151d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Another backtrace where libc detects corruption: #0 0x00000033e2433275 in *__GI_raise ( sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x00000033e2434a55 in *__GI_abort () at abort.c:88 #2 0x00000033e246fb5b in __libc_message (do_abort=2, fmt=0x7f6dafffd820 "-7f6dc992a000 rw-p 00000000 00:00 0 \n7fffc2e61000-7fffc2e81000 rw-p 00000000 00:00 0", ' ' <repeats 26 times>, "[stack]\n7fffc2e81000-7fffc2e82000 r-xp 00000000 00:00 0", ' ' <repeats 26 times>, "[vdso]\nff"...) at ../sysdeps/unix/sysv/linux/libc_fatal.c:170 #3 0x00000033e2475506 in malloc_printerr (action=3, str=0x33e2535fd8 "double free or corruption (!prev)", ptr=<value optimized out>) at malloc.c:6239 #4 0x00000033e247c280 in _int_realloc (av=0x7f6d9c000020, oldp=0x7f6d9c529430, oldsize=6, nb=2064) at malloc.c:5295 #5 0x00000033e247c8a5 in *__GI___libc_realloc ( oldmem=0x7f6d9c529440, bytes=2048) at malloc.c:3821 #6 0x0000003d58d01e1f in CRYPTO_realloc (str=0x7f6d9c529440, num=1408, file=0x3d58d29dac "stack.c", line=150) at mem.c:355 #7 0x0000003d58c954a0 in sk_insert (st=0x7f6d9c01c110, data=0x7f6da801ee40 "\220", loc=127) at stack.c:149 #8 0x0000003d58c9f49b in EVP_PBE_alg_add (nid=144, cipher=0x3d58f5bba0, md=0x3d58f60220, keygen=0x3d58cdbca0 <PKCS12_PBE_keyivgen>) at evp_pbe.c:129 #9 0x0000003d58cdbf1d in PKCS12_PBE_add () at p12_crpt.c:68 #10 0x0000003d58c9d552 in OpenSSL_add_all_ciphers () at c_allc.c:289 #11 0x0000003d58c9cc7e in OPENSSL_add_all_algorithms_noconf () at c_all.c:83 #12 0x00007f6dbc6e864c in BN_cmp () at bn_lib.c:665 #13 0x00007f6dbc6dd9d8 in BN_cmp () at bn_lib.c:665 #14 0x0000003945065293 in ?? () from /usr/lib64/libkrb5.so.3 #15 0x0000003945065cea in ?? () from /usr/lib64/libkrb5.so.3 #16 0x000000394505c4d1 in krb5_get_init_creds () from /usr/lib64/libkrb5.so.3 #17 0x000000394505d76e in krb5_get_init_creds_keytab () from /usr/lib64/libkrb5.so.3 #18 0x00007f6dc969b317 in set_krb5_creds ( authid=<value optimized out>, username=<value optimized out>, passwd=<value optimized out>, realm=<value optimized out>, vals=0x7f6da8090730) at ldap/servers/slapd/ldaputil.c:1408 #19 0x00007f6dc969bac5 in ldap_sasl_set_interact_vals ( passwd=<value optimized out>, username=<value optimized out>, authid=<value optimized out>, mech=<value optimized out>, ld=<value optimized out>, realm=<value optimized out>) at ldap/servers/slapd/ldaputil.c:853 #20 slapd_ldap_sasl_interactive_bind ( passwd=<value optimized out>, username=<value optimized out>, authid=<value optimized out>, mech=<value optimized out>, ld=<value optimized out>, realm=<value optimized out>) at ldap/servers/slapd/ldaputil.c:969 #21 0x00007f6dc969bd4a in slapi_ldap_bind (ld=0x7f6da8021e90, bindid=0x7f6da808fd30 "uid=ldap/cats-whiskers.mit.edu,ou=People,dc=scripts,dc=mit,dc=edu", creds=<value optimized out>, mech=0x7f6dc40b803e "GSSAPI", serverctrls=0x0, returnedctrls=<value optimized out>, timeout=0x0, msgidp=0x0) at ldap/servers/slapd/ldaputil.c:751 #22 0x00007f6dc40875d5 in bind_and_check_pwp ( password=<value optimized out>, binddn=<value optimized out>, conn=<value optimized out>) at ldap/servers/plugins/replication/repl5_connection.c:1676 #23 conn_connect (password=<value optimized out>, binddn=<value optimized out>, conn=<value optimized out>) at ldap/servers/plugins/replication/repl5_connection.c:1134 #24 0x00007f6dc408fa67 in acquire_replica (prp=0xf2ec80, prot_oid=<value optimized out>, ruv=0x7f6dafffef90) at ldap/servers/plugins/replication/repl5_protocol_util.c:168 #25 0x00007f6dc408a5c7 in repl5_inc_run (prp=0xf2ec80) at ldap/servers/plugins/replication/repl5_inc_protocol.c:814 #26 0x00007f6dc408f119 in prot_thread_main ( arg=<value optimized out>) at ldap/servers/plugins/replication/repl5_protocol.c:313 #27 0x00000038a6e29773 in ?? () from /lib64/libnspr4.so #28 0x00000033e300685a in start_thread ( arg=<value optimized out>) at pthread_create.c:297 #29 0x00000033e24de22d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 #30 0x0000000000000000 in ?? () And another backtrace: #0 _int_malloc (av=0x7fff0cb91f80, bytes=32671600) at malloc.c:4294 #1 0x000000300ac7ab52 in *__GI___libc_malloc (bytes=15) at malloc.c:3660 #2 0x0000003ce3457481 in krb5_copy_principal () from /usr/lib64/libkrb5.so.3 #3 0x0000003ce3457040 in krb5_copy_creds () from /usr/lib64/libkrb5.so.3 #4 0x0000003ce3447077 in ?? () from /usr/lib64/libkrb5.so.3 #5 0x0000003ce344bfa2 in krb5_cc_store_cred () from /usr/lib64/libkrb5.so.3 #6 0x00007f471fdbc4c1 in set_krb5_creds ( authid=<value optimized out>, username=<value optimized out>, passwd=<value optimized out>, realm=<value optimized out>, vals=0x7f46f4093020) at ldap/servers/slapd/ldaputil.c:1483 #7 0x00007f471fdbcac5 in ldap_sasl_set_interact_vals ( passwd=<value optimized out>, username=<value optimized out>, authid=<value optimized out>, mech=<value optimized out>, ld=<value optimized out>, realm=<value optimized out>) at ldap/servers/slapd/ldaputil.c:853 #8 slapd_ldap_sasl_interactive_bind ( passwd=<value optimized out>, username=<value optimized out>, authid=<value optimized out>, mech=<value optimized out>, ld=<value optimized out>, realm=<value optimized out>) at ldap/servers/slapd/ldaputil.c:969 #9 0x00007f471fdbcd4a in slapi_ldap_bind (ld=0x7f46f4092c20, bindid=0x7f46f40c8f40 "uid=ldap/bees-knees.mit.edu,ou=People,dc=scripts,dc=mit,dc=edu", creds=<value optimized out>, mech=0x7f471a7f003e "GSSAPI", serverctrls=0x0, returnedctrls=<value optimized out>, timeout=0x0, msgidp=0x0) at ldap/servers/slapd/ldaputil.c:751 #10 0x00007f471a7bf5d5 in bind_and_check_pwp ( password=<value optimized out>, binddn=<value optimized out>, conn=<value optimized out>) at ldap/servers/plugins/replication/repl5_connection.c:1676 #11 conn_connect (password=<value optimized out>, binddn=<value optimized out>, conn=<value optimized out>) at ldap/servers/plugins/replication/repl5_connection.c:1134 #12 0x00007f471a7c7a67 in acquire_replica (prp=0x20c3880, prot_oid=<value optimized out>, ruv=0x7f470d579f90) at ldap/servers/plugins/replication/repl5_protocol_util.c:168 #13 0x00007f471a7c25c7 in repl5_inc_run (prp=0x20c3880) at ldap/servers/plugins/replication/repl5_inc_protocol.c:814 #14 0x00007f471a7c7119 in prot_thread_main ( arg=<value optimized out>) at ldap/servers/plugins/replication/repl5_protocol.c:313 #15 0x0000003facc29773 in ?? () from /lib64/libnspr4.so #16 0x000000300b80685a in start_thread ( arg=<value optimized out>) at pthread_create.c:297 #17 0x000000300acde22d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 #18 0x0000000000000000 in ?? () Here are the SHELL logs generated on normal operation: [09/Oct/2010:04:29:48 -0400] - Listening on /var/run/dirsrv/slapd-scripts.socket for LDAPI requests [09/Oct/2010:04:29:48 -0400] slapi_ldap_init_ext - Success: set up conn to [better-mousetrap.mit.edu:389] [09/Oct/2010:04:29:48 -0400] set_krb5_creds - The default credentials cache [FILE:/tmp/krb5cc_485] not found: will create a new one. [09/Oct/2010:04:29:48 -0400] slapi_ldap_init_ext - configpluginpath == NULL [09/Oct/2010:04:29:48 -0400] slapi_ldap_init_ext - Success: set up conn to [whole-enchilada.mit.edu:389] [09/Oct/2010:04:29:48 -0400] set_krb5_creds - Using principal named [ldap/old-faithful.mit.edu.EDU] [09/Oct/2010:04:29:48 -0400] slapi_ldap_init_ext - Success: set up conn to [cats-whiskers.mit.edu:389] [09/Oct/2010:04:29:48 -0400] set_krb5_creds - The default credentials cache [FILE:/tmp/krb5cc_485] not found: will create a new one. [09/Oct/2010:04:29:48 -0400] set_krb5_creds - Using principal named [ldap/old-faithful.mit.edu.EDU] [09/Oct/2010:04:29:48 -0400] set_krb5_creds - The default credentials cache [FILE:/tmp/krb5cc_485] not found: will create a new one. [09/Oct/2010:04:29:48 -0400] set_krb5_creds - Using principal named [ldap/old-faithful.mit.edu.EDU] [09/Oct/2010:04:29:48 -0400] set_krb5_creds - Using keytab named [WRFILE:/etc/dirsrv/keytab] [09/Oct/2010:04:29:48 -0400] set_krb5_creds - Using keytab named [WRFILE:/etc/dirsrv/keytab] [09/Oct/2010:04:29:48 -0400] set_krb5_creds - Using keytab named [WRFILE:/etc/dirsrv/keytab] [09/Oct/2010:04:29:48 -0400] set_krb5_creds - Generated new memory ccache [MEMORY:N0KZtwJ] [09/Oct/2010:04:29:48 -0400] show_cached_credentials - Ticket cache: MEMORY:N0KZtwJ Default principal: ldap/old-faithful.mit.edu.EDU [09/Oct/2010:04:29:48 -0400] show_one_credential - Kerberos credential: client [ldap/old-faithful.mit.edu.EDU] server [krbtgt/ATHENA.MIT.EDU.EDU] start time [Sat Oct 9 04:30:00 2010] end time [Sun Oct 10 01:45:00 2010] renew time [Sun Oct 10 04:29:49 2010] flags [0x50c00000] [09/Oct/2010:04:29:48 -0400] set_krb5_creds - Set new env for ccache: [KRB5CCNAME=MEMORY:N0KZtwJ] [09/Oct/2010:04:29:48 -0400] set_krb5_creds - Generated new memory ccache [MEMORY:fyHs1On] [09/Oct/2010:04:29:48 -0400] show_cached_credentials - Ticket cache: MEMORY:fyHs1On Default principal: ldap/old-faithful.mit.edu.EDU [09/Oct/2010:04:29:48 -0400] show_one_credential - Kerberos credential: client [ldap/old-faithful.mit.edu.EDU] server [krbtgt/ATHENA.MIT.EDU.EDU] start time [Sat Oct 9 04:30:00 2010] end time [Sun Oct 10 01:45:00 2010] renew time [Sun Oct 10 04:29:49 2010] flags [0x50c00000] [09/Oct/2010:04:29:48 -0400] set_krb5_creds - Set new env for ccache: [KRB5CCNAME=MEMORY:fyHs1On] [09/Oct/2010:04:29:48 -0400] ldap_sasl_get_val - Using value [(null)] for SASL_CB_USER [09/Oct/2010:04:29:48 -0400] set_krb5_creds - Generated new memory ccache [MEMORY:aIeSCnz] [09/Oct/2010:04:29:48 -0400] show_cached_credentials - Ticket cache: MEMORY:aIeSCnz Default principal: ldap/old-faithful.mit.edu.EDU [09/Oct/2010:04:29:48 -0400] show_one_credential - Kerberos credential: client [ldap/old-faithful.mit.edu.EDU] server [krbtgt/ATHENA.MIT.EDU.EDU] start time [Sat Oct 9 04:30:00 2010] end time [Sun Oct 10 01:45:00 2010] renew time [Sun Oct 10 04:29:49 2010] flags [0x50c00000] [09/Oct/2010:04:29:48 -0400] set_krb5_creds - Set new env for ccache: [KRB5CCNAME=MEMORY:aIeSCnz] [09/Oct/2010:04:29:48 -0400] ldap_sasl_get_val - Using value [(null)] for SASL_CB_USER [09/Oct/2010:04:29:48 -0400] ldap_sasl_get_val - Using value [(null)] for SASL_CB_USER [09/Oct/2010:04:29:48 -0400] ldap_sasl_get_val - Using value [(null)] for SASL_CB_USER [09/Oct/2010:04:29:48 -0400] ldap_sasl_get_val - Using value [(null)] for SASL_CB_USER [09/Oct/2010:04:29:48 -0400] ldap_sasl_get_val - Using value [(null)] for SASL_CB_USER We have several core dumps that we can poke at on request (unfortunately they do contain sensitive data.)
What platform is this? Have you tried to reproduce this with 389-ds-base-1.2.7.1? What version of kerberos? Are you using pkinit?
ping - needinfo?
Sorry Rich, it looks like I missed your original ping. The last time we had a dirsrv segfault was Nov 12, so we have not tried reproducing with 1.2.7.1, our version of Kerberos is 1.7.1. We do not use pkinit. Here is some interesting hack that we've used to mostly suppress the expression of this bug. We use a disk ccache instead of a memory one (which is renewed by the # slapdagent cronjob). In our dirsrv sysconfig file we've added: KRB5CCNAME=/var/run/dirsrv/krb5cc; export KRB5CCNAME /usr/kerberos/bin/kinit -k -t "$KRB5_KTNAME" ldap/"$(hostname)" chown --reference="$KRB5_KTNAME" "$KRB5CCNAME" We can suspend the hack and wait until a dirsrv crashes (and since a recent different patch was taken, this will probably not require us to do a full cluster reinitialization), but unless any Kerberos code was touched in the intervening releases it seems unlikely that it would have been resolved.
Oh, this is Fedora 13.
I'm assuming you're using the regular kerberos db and not using the ldap server as the kerberos db (because of the other bug). I'm going to try to reproduce on RHEL 6 using kerberos 1.8.2 Your "hack" looks correct in that it uses the same credentials used by the server but bypasses the kerberos calls. As long as you can reliably renew the creds, that should work fine. But you still have the problem occasionally, even with your hack? If I can't reproduce on RHEL6, I'll fallback to F-13.
Correct. I don’t think we’ve had a segfault since the hack was put in place. We’ve had dirsrvs crash, but for things like out of disk space. :-) Thanks for looking into this.
So far, I have not been able to reproduce on RHEL6 - this is what I've done: 1) setup a kerberos server (standard db, not ldap db) 2) issue a ldap service principal 3) setup multi-master replication using SASL/GSSAPI 4) add 1000 entries on each server - verify that the entries are replicated to both servers Does it take a long time to reproduce the crash? Is there some particular sequence of operations that cause a crash?
You may have more luck reproducing the error if you have a 3+, fully replicated MMR topology, as the bug may be related to racing between multiple interactions. It tends to take a long time to reproduce the crash, although when a crash does happen, it frequently causes cascading failures in other clients. We have not been able to identify a sequence of operations that causes a crash.
I think that the fact that the errors log shows certain lines around getting the ticket listed multiple times (3 to be exact) indicates that we may have multiple threads trying to access the ticket at the same time. Perhaps there is something that is being shared between these threads that needs to be protected.
The logs and the code in set_krb5_creds() seem to indicate that we keep resetting the KRB5CCNAME environment variable if we have multiple threads race into set_krb5_creds(). I suspect that this might be the problem.
Yes, I suspect it has to do with multiple threads. I suspected this might be a problem - see the comment near the top of set_krb5_creds() - but the locking used by krb5 looks sound, and I was never able to get it into a bad state with testing. I don't have a problem with putting a mutex around set_krb5_creds() but since we can't reproduce the problem, we can't verify the fix works. Edward, would you be able to test a fix for us?
Created attachment 469432 [details] 0001-Bug-642046-Segfault-when-using-SASL-GSSAPI-multimast.patch
Created attachment 469434 [details] 0001-Bug-642046-Segfault-when-using-SASL-GSSAPI-multimast.patch
To ssh://git.fedorahosted.org/git/389/ds.git 4da627a..53c948c master -> master commit 53c948cbcd7d9e94ae1bc77eb625a337b470e368 Author: Rich Megginson <rmeggins> Date: Thu Dec 16 08:28:26 2010 -0700 Reviewed by: nhosoi (Thanks!) Branch: master Fix Description: Added a mutex around all of the krb5 code. We are using static variables to cache the credentials from the keytab. Even though krb5 uses locks internally to protect the memory cache, it is possible the crash is caused by a race condition. The mutex should prevent the race condition. Also added a hack for testing to allow setting the principal - nsds5replicabinddn now must be in DN format so cannot use it for krb principal name - we really should add configuration paramters for the principal name and the keytab name. On machines with broken DNS/reverse DNS, testing Kerberos is quite hard without this. Instead of passing NULL to krb5_sname_to_principal() for the hostname, use the hostname from config_get_localhost() - this is consistent with what SASL does to initialize the SASL context. Platforms tested: RHEL6 x86_64 Flag Day: no Doc impact: no
We should have a 389-ds-base-1.2.8.a1 rpm in Fedora Testing by early January. I don't know if you are able to build/test from source - if so, please grab the patch and try it out.
Hello Rich, I completely forgot to undo the hotfix and see if 1.2.8 fixed things! We've undid the fix on one server; it is running 1.2.8.2.1.fc13. Cheers, Edward
Waiting to see if we can mark this customer verified....
Based on Comment #17 - marking customer verified.