Description of problem:
When running the IPA test suite test_integration/test_server_del.py, the test fails in TestServerDel::test_ignore_topology_disconnect_replica1. The journal shows that ns-slapd has crashed:

ns-slapd[23272]: k5_mutex_lock: Received error 22 (Invalid argument)
ns-slapd[23272]: ns-slapd: ../../../include/k5-thread.h:376: k5_mutex_lock: Assertion `r == 0' failed.
ns-slapd[23272]: GSSAPI client step 1

Version-Release number of selected component (if applicable):
Fedora 28: 389-ds-base.x86_64 1.4.0.16-1.fc28

How reproducible:
The nightly test fails ~60% of the time.

Steps to Reproduce:
The test installs a 3-node topology where replica1 is connected to master and to replica2 for domain data, and replica2 is connected to master and to replica1 for CA data.
1. Force removal of replica1 with --ignore-topology-disconnect
2. Uninstall replica1
3. Reinstall replica1

The reinstallation phase fails while restarting 389-ds, and the journal shows the above error.
Created attachment 1487058 [details] 389-ds error log
The mutex this time doesn't look like a destroyed mutex:

(gdb) p *m
$3 = {__data = {__lock = 1701339999, __count = 1885301603, __owner = 757100663,
      __nusers = 1835491616, __kind = 1663188340, __spins = 15726, __elision = 28022,
      __list = {__prev = 0x6d6f642e3534302d, __next = 0x6362612e3436302d}},
  __size = "_check_pwp - agmt=\"cn=vm-045.dom-064.abc",
  __align = 8097308729682715487}
(gdb)

(Note that __kind is not -1.) Something has corrupted this; the __size field is especially interesting.
- The problem is a use-after-free.

A replication agreement starts to authenticate (krb). It allocates a ccache:

==16436== at 0x4C2EBAB: malloc (vg_replace_malloc.c:299)
==16436== by 0x5A67E6B: new_mcc_data (cc_memory.c:414)
==16436== by 0x5A6800F: krb5_mcc_generate_new (cc_memory.c:503)
==16436== by 0x5A5E380: krb5_cc_new_unique (ccbase.c:300)
==16436== by 0x554EAFB: ldap_sasl_set_interact_vals + set_krb5_creds (in /usr/lib64/dirsrv/libslapd.so.0.1.0)
==16436== by 0x554F98A: slapd_ldap_sasl_interactive_bind (in /usr/lib64/dirsrv/libslapd.so.0.1.0)
==16436== by 0x5550B5A: slapi_ldap_bind (in /usr/lib64/dirsrv/libslapd.so.0.1.0)
==16436== by 0x17CCACAE: bind_and_check_pwp (repl5_connection.c:1921)
==16436== by 0x17CCACAE: conn_connect (repl5_connection.c:1259)
==16436== by 0x17CD54ED: acquire_replica (repl5_protocol_util.c:133)
==16436== by 0x17CCEEC9: repl5_inc_run (repl5_inc_protocol.c:885)

But the authentication fails and the ccache is removed:

==16436== at 0x4C2FDAC: free (vg_replace_malloc.c:530)
==16436== by 0x5A682FE: krb5_mcc_destroy (cc_memory.c:241)
==16436== by 0x5550C30: slapi_ldap_bind (in /usr/lib64/dirsrv/libslapd.so.0.1.0)
==16436== by 0x17CCACAE: bind_and_check_pwp (repl5_connection.c:1921)
==16436== by 0x17CCACAE: conn_connect (repl5_connection.c:1259)
==16436== by 0x17CD54ED: acquire_replica (repl5_protocol_util.c:133)
==16436== by 0x17CCEEC9: repl5_inc_run (repl5_inc_protocol.c:885)

Later the replication agreement retries to authenticate and uses the previous ccache, which is now freed:

==16436== by 0x7C0C690: sasl_client_step (in /usr/lib64/libsasl2.so.3.0.0)
==16436== by 0x7C0C819: sasl_client_start (in /usr/lib64/libsasl2.so.3.0.0)
==16436== by 0x69E628A: ldap_int_sasl_bind (in /usr/lib64/libldap_r-2.4.so.2.10.9)
==16436== by 0x69E9B2B: ldap_sasl_interactive_bind (in /usr/lib64/libldap_r-2.4.so.2.10.9)
==16436== by 0x69E9D49: ldap_sasl_interactive_bind_s (in /usr/lib64/libldap_r-2.4.so.2.10.9)
==16436== by 0x554F873: slapd_ldap_sasl_interactive_bind (in /usr/lib64/dirsrv/libslapd.so.0.1.0)
==16436== by 0x5550B5A: slapi_ldap_bind (in /usr/lib64/dirsrv/libslapd.so.0.1.0)
==16436== by 0x17CCACAE: bind_and_check_pwp (repl5_connection.c:1921)
==16436== by 0x17CCACAE: conn_connect (repl5_connection.c:1259)
==16436== by 0x17CD54ED: acquire_replica (repl5_protocol_util.c:133)
==16436== by 0x17CCEEC9: repl5_inc_run (repl5_inc_protocol.c:885)

- The reason why the authentication fails is unknown (it requires the connection log level); no relevant information was found in the krb log.
- The problem could occur more frequently since the replica backoff time was recently reduced to improve parallel replica installation: nsds5ReplicaBackoffMax: 3

In conclusion:
- I think there is a bug (DS/KRB?) in the way krb reuses a ccache that has been cleared before.
- The problem can be more frequent when the backoff time is low.
- Reproducing with nsds5ReplicaBackoffMax (reproduces faster) and connection logging (nsslapd-errorlog-level: 8) could give us some details on why the first authentication failed.

Next steps:
- In a first phase, I think the problem could be in DS (set_krb5_creds), in the way it reused an already freed ccache.
- The RC of the crash is identified.

The problem occurs when several threads (RA) authenticate to the same host and share the same ccache. Upon authentication failure one thread will free the ccache, and there is a good chance that other threads may still be using it and crash.

Here is the same RA authenticating at different times with different ccaches:

[05/Oct/2018:11:01:13.970110542 +0200] - DEBUG - slapi_ldap_init_ext - Success: set up conn to [<remote_host>:389]
[05/Oct/2018:11:01:13.971703186 +0200] - DEBUG - set_krb5_creds - The default credentials cache [MEMORY:8bJWkGi] not found: will create a new one.
[05/Oct/2018:11:01:13.988816394 +0200] - DEBUG - set_krb5_creds - Generated new memory ccache [MEMORY:IH3GAIi]
[05/Oct/2018:11:01:13.993116272 +0200] - DEBUG - set_krb5_creds - Set new env for ccache: [KRB5CCNAME=MEMORY:IH3GAIi]
[05/Oct/2018:11:01:14.036796317 +0200] - DEBUG - slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error 49 (Invalid credentials) ((null)) errno 0 (Success)
[05/Oct/2018:11:01:14.038569711 +0200] - DEBUG - slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error 49 (Invalid credentials)
[05/Oct/2018:11:01:16.043821965 +0200] - DEBUG - slapi_ldap_init_ext - Success: set up conn to [<remote_host>:389]
[05/Oct/2018:11:01:16.047874771 +0200] - DEBUG - set_krb5_creds - The default credentials cache [MEMORY:IH3GAIi] not found: will create a new one.
[05/Oct/2018:11:01:16.075386265 +0200] - DEBUG - set_krb5_creds - Generated new memory ccache [MEMORY:PNxHnvN]
[05/Oct/2018:11:01:16.105726300 +0200] - DEBUG - set_krb5_creds - Set new env for ccache: [KRB5CCNAME=MEMORY:PNxHnvN]
[05/Oct/2018:11:01:16.186267995 +0200] - DEBUG - slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error 49 (Invalid credentials) ((null)) errno 0 (Success)
[05/Oct/2018:11:01:16.188587202 +0200] - DEBUG - slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error 49 (Invalid credentials)

Here there are two threads authenticating at the same time and sharing the same ccache MEMORY:G3MvKTF. But when the first thread frees the ccache, the second thread has already completed auth (failure), so the second thread is no longer using the ccache:

[05/Oct/2018:11:01:26.810196115 +0200] - DEBUG - slapi_ldap_init_ext - Success: set up conn to [<remote_host>:389]
[05/Oct/2018:11:01:26.811771354 +0200] - DEBUG - slapi_ldap_init_ext - Success: set up conn to [<remote_host>:389]
[05/Oct/2018:11:01:26.813446999 +0200] - DEBUG - set_krb5_creds - The default credentials cache [MEMORY:Jn09Bqm] not found: will create a new one.
[05/Oct/2018:11:01:26.831643446 +0200] - DEBUG - set_krb5_creds - Generated new memory ccache [MEMORY:G3MvKTF]
[05/Oct/2018:11:01:26.836465812 +0200] - DEBUG - set_krb5_creds - Set new env for ccache: [KRB5CCNAME=MEMORY:G3MvKTF]
[05/Oct/2018:11:01:26.869662934 +0200] - DEBUG - slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error 49 (Invalid credentials) ((null)) errno 0 (Success)
[05/Oct/2018:11:01:26.871480272 +0200] - DEBUG - slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error 49 (Invalid credentials)
[05/Oct/2018:11:01:26.873272904 +0200] - DEBUG - slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error 49 (Invalid credentials) ((null)) errno 0 (Success)
[05/Oct/2018:11:01:26.874904479 +0200] - DEBUG - slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error 49 (Invalid credentials)

Here there are two threads authenticating at the same time and sharing the same ccache MEMORY:JGjv0l7. But the first thread frees the ccache while the second thread is still running authentication, so the second thread will crash:

[05/Oct/2018:11:01:44.262824714 +0200] - DEBUG - slapi_ldap_init_ext - Success: set up conn to [<remote_host>:389]
[05/Oct/2018:11:01:44.264432463 +0200] - DEBUG - slapi_ldap_init_ext - Success: set up conn to [<remote_host>:389]
[05/Oct/2018:11:01:44.267714979 +0200] - DEBUG - set_krb5_creds - The default credentials cache [MEMORY:ReQxFlc] not found: will create a new one.
[05/Oct/2018:11:01:45.963404361 +0200] - DEBUG - set_krb5_creds - Generated new memory ccache [MEMORY:JGjv0l7]
[05/Oct/2018:11:01:45.969799130 +0200] - DEBUG - set_krb5_creds - Set new env for ccache: [KRB5CCNAME=MEMORY:JGjv0l7]
[05/Oct/2018:11:01:46.090698037 +0200] - DEBUG - slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error 49 (Invalid credentials) ((null)) errno 0 (Success)
[05/Oct/2018:11:01:46.092159023 +0200] - DEBUG - slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error 49 (Invalid credentials)
[05/Oct/2018:11:01:47.090799006 +0200] - DEBUG - connection_reset - new connection on 72
[05/Oct/2018:11:01:47.092290346 +0200] - DEBUG - connection_reset - new connection on 82
[05/Oct/2018:11:01:47.096954723 +0200] - DEBUG - slapi_ldap_init_ext - config_get_saslpath returns NULL

- The trigger of the crash is not an auth retry but two authentications happening at the same time (because there are 2 replication agreements connecting to the same host) AND sharing the same ccache.

In conclusion:
- The crash is dynamic: it requires two replication agreements starting authentication at the same time to the same host. It also requires that one authentication completes before the other.
- The crash is not a regression but a bug that has always existed.

Next steps:
- Is it expected that several threads share the same ccache, and could it be prevented?
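The lifetime problem described above can be modeled with a toy refcounted handle. This is only an illustrative sketch (the toy_cc type and its functions are hypothetical, not krb5 internals): if destroy freed the backing storage unconditionally, the second thread would read freed memory; with a reference count, the storage survives until the last holder releases it.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a MEMORY ccache (hypothetical names).
 * 'refs' counts open handles; the data survives until the last release. */
typedef struct {
    int refs;
    int data;   /* stands in for the cached credentials */
    int alive;  /* 1 while the backing storage is valid */
} toy_cc;

toy_cc *toy_cc_create(void) {
    toy_cc *c = malloc(sizeof *c);
    c->refs = 1;
    c->data = 42;
    c->alive = 1;
    return c;
}

/* A second thread opening the same cache takes another reference. */
toy_cc *toy_cc_get(toy_cc *c) { c->refs++; return c; }

/* Destroy drops one reference; storage is freed only at zero.
 * Without the refcount, the free would happen here while the other
 * holder is still mid-authentication, which is the crash above. */
int toy_cc_destroy(toy_cc *c) {
    if (--c->refs == 0) {
        c->alive = 0;
        free(c);
        return 1; /* actually freed */
    }
    return 0;     /* other holders remain */
}
```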
When the same MEMORY-type ccache is used by several threads, it looks like it is the responsibility of the application to be careful not to destroy the cache under a thread (http://kerberos.996246.n3.nabble.com/krb5-ccache-of-MEMORY-type-td48334.html#a48336). At first look I do not see an easy fix to synchronize those threads. Another option would be to force recreating a new ccache, even if an existing one could be used (like when krb5_cc_get_principal returns KRB5_FCC_NOFILE). But I am not sure it is valid, or how it would impact performance.
thierry, can you elaborate on how the ccache handles are being used concurrently? It's never going to be okay to, for instance, call krb5_cc_destroy() on the same handle twice, but if they're separate handles pointing to the same ccache that's expected to be fine.
Robbie, both threads got the security context and ccache handle from krb5_init_context/krb5_cc_default. Their security contexts may be different (but I have no log that confirms this). The ccache handles have the same type/name (krb5_cc_get_type/krb5_cc_get_name) but, as for the security context, I do not know for sure whether it is the same handle or a different one pointing to the same object. krb5_cc_destroy is called with the security context and the ccache handle. What I know is that the ccache handles in the two threads have the same type/name.
Having two handles open to the same ccache, even from the same context, should work (and does, in my testing). (Two calls of krb5_cc_default or krb5_cc_resolve, one of each, etc. - it doesn't matter.) I'm unable to produce a problematic interleaving, nor does code inspection suggest there would be one. Closing the same ccache pointer twice does produce this result, however.

Looking at the trace, we call krb5_cc_new_unique from somewhere in slapd - unfortunately there's no symbol for that, but it's somewhere below slapi_ldap_bind. This is also the function which frees it, and the function which uses it again. I think this is all taking place in the same invocation, but I am not an expert in this codebase. A question: what's the purpose of calling clear_krb5_ccache() on failure in slapi_ldap_bind? Based on notes elsewhere in the codebase, 49 is "authentication failed", which does get returned based on the logs, but I don't see the message that would go with that ("could not perform interactive bind for").

It looks like the version of 389ds was updated today to 1.4.0.17-2, which includes a reference to 49887, which looks SASL/threads related - can someone try with that? If that hasn't resolved the issue, I think we should have an openldap or SASL maintainer take a look. 389ds doesn't, from my inspection, manipulate krb5 (or GSSAPI) objects directly in this area, and instead calls into openldap, which calls into SASL.
krb5_cc_new_unique() is called from set_krb5_creds() in ldap/servers/slapd/ldaputil.c. It, in turn, is called in ldap_sasl_set_interact_vals() in case the SASL mech is GSSAPI. There is plenty of manipulation in set_krb5_creds(); on the contrary, the GSSAPI code in cyrus-sasl doesn't do anything like that.
According to the logs, here is the sequence of krb5 calls/events:

Thread 1:
    slapd_ldap_sasl_interactive_bind {
        krb5_init_context
        krb5_cc_default
        krb5_cc_get_principal -> KRB5_FCC_NOFILE  /* create a new ccache */
        krb5_cc_close
        krb5_unparse_name
        krb5_kt_default
        krb5_kt_get_name
        krb5_get_init_creds_keytab
        krb5_kt_close
        krb5_cc_new_unique   /* generate a new ccache: MEMORY:JGjv0l7 */
        krb5_cc_resolve
        krb5_cc_initialize
        krb5_cc_store_cred
        setenv newcache KRB5CCNAME

Thread 2:
    slapd_ldap_sasl_interactive_bind {
        krb5_init_context
        krb5_cc_default              /* get ccache = MEMORY:JGjv0l7 */
        krb5_cc_get_principal == 0   /* retrieves the principal from the ccache */
        krb5_unparse_name            /* retrieve principal from it */
        krb5_cc_retrieve_cred        /* creds retrieved and they have not expired */

Thread 1:
        ldap_sasl_interactive_bind_s -> RC=49
    } --> rc=49
    On rc=49 -> clear_krb5_ccache {
        krb5_init_context
        krb5_cc_default
        krb5_cc_destroy    /* frees the ccache */
        krb5_free_context
    }

Thread 2:
        ldap_sasl_interactive_bind_s -> crash
    }

Some of the symbols were not resolved with valgrind; you may find resolved stacks at https://bugzilla.redhat.com/show_bug.cgi?id=1633089#c18. The purpose of clear_krb5_ccache is to call krb5_cc_destroy (https://pagure.io/389-ds-base/blob/master/f/ldap/servers/slapd/ldaputil.c#_2230). It is called when GSSAPI authentication fails with rc=49. I think krb5_init_context+krb5_cc_default retrieve the same ccache; it looks like that happens when the two threads authenticate in the same second. Note the threads are not doing a krb5_cc_close but a krb5_cc_destroy. I am going to look at how the cyrus-sasl implementation differs from set_krb5_creds.
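One reason the two threads resolve the same cache is visible in the sequence above: set_krb5_creds publishes the new cache name via setenv("KRB5CCNAME", ...), and the environment is process-global, so a later krb5_cc_default in any thread resolves that same name. A minimal sketch of that process-global behavior, using only POSIX environment calls (the helper names are hypothetical; krb5_cc_default consults KRB5CCNAME first, but no krb5 library is needed to show the sharing):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Thread 1's role: generate a cache name and publish it, the way
 * set_krb5_creds does with setenv("KRB5CCNAME", ...). */
static void publish_ccache_name(const char *name) {
    setenv("KRB5CCNAME", name, 1);  /* overwrite: process-global side effect */
}

/* Thread 2's role: like krb5_cc_default, consult KRB5CCNAME first,
 * so a second bind in the same process resolves the same cache name. */
static const char *default_ccache_name(void) {
    const char *v = getenv("KRB5CCNAME");
    return v ? v : "MEMORY:fallback";
}
```

This is why a per-thread "new unique" cache does not stay private: the env variable re-publishes it to every other thread in the process.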
Can we simply remove the clear_krb5_ccache() code as a quick fix? Is there anything actually depending on destroying the ccache?
I guess the other option is to wrap the whole slapi_ldap_bind() in the krb5_lock. Long term I think we should get rid of all of the krb5 code and enhance SASL GSSAPI to make use of the credential store, so we can pass in the keytab (and optionally principal) we want to use to initiate GSSAPI connections, or at least set a CLIENT_KTNAME variable and let SASL's GSSAPI mechanism initiate on its own.
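The "wrap it in the krb5_lock" idea can be sketched with pthreads. This is a hedged sketch under assumptions, not the actual 389-ds patch: bind_once and run_two_agreements are hypothetical stand-ins, with malloc/free standing in for ccache creation and clear_krb5_ccache. Holding one process-wide mutex across create/use/destroy guarantees no thread can destroy a cache another thread is still using.

```c
#include <pthread.h>
#include <stdlib.h>

/* One process-wide lock serializing the whole krb auth sequence,
 * analogous to wrapping slapi_ldap_bind() in the krb5_lock. */
static pthread_mutex_t krb5_lock = PTHREAD_MUTEX_INITIALIZER;
static int *shared_ccache;     /* stands in for the shared MEMORY ccache */
static int use_after_free;     /* counts accesses to a destroyed cache */

/* One replication agreement's auth attempt (hypothetical name). */
static void *bind_once(void *arg) {
    (void)arg;
    pthread_mutex_lock(&krb5_lock);
    shared_ccache = malloc(sizeof *shared_ccache);  /* set_krb5_creds */
    *shared_ccache = 1;
    if (shared_ccache == NULL || *shared_ccache != 1)
        use_after_free++;         /* "authenticate" with the cache */
    free(shared_ccache);          /* clear_krb5_ccache on auth failure */
    shared_ccache = NULL;
    pthread_mutex_unlock(&krb5_lock);
    return NULL;
}

/* Two agreements binding to the same host "at the same time". */
int run_two_agreements(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bind_once, NULL);
    pthread_create(&t2, NULL, bind_once, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return use_after_free;   /* 0 means destroy never raced with use */
}
```

The trade-off is the one discussed later in this bug: the second agreement's bind is delayed until the first finishes, which is acceptable here because the replication protocol tolerates such timing.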
Do all the 'private' comments on this bug really need to be private? I don't, at a glance, see any RH confidential information in any of them. Private comments should be avoided on Fedora bugs unless they are absolutely necessary, as this looks bad to those who can't see them (who *can* tell that there are private comments they can't see, as they see "missing" comment numbers; they know we're talking about something behind their backs, they just don't know what it is).
Upstream ticket is 49972
Per IRC discussion, thierry will work on a rapid fix - trying removal of clear_krb5_ccache(), and possibly working on the locking. Longer term we will be working with thierry to remove the krb5 calls from 389, in favor of either pure SASL or gssapi. Finally, if it's not too much trouble, Flo or Adam, could I have valgrind traces for a crash (any of them) under krb5-1.16.1-20?
A build with the fix from https://pagure.io/389-ds-base/pull-request/49973 on top of 389-ds-base-1.4.0.18-1 is here: https://copr.fedorainfracloud.org/coprs/vashirov/389ds/build/808414/

# dnf copr enable vashirov/389ds
# dnf update 389-ds-base
The test test_server_del.py-TestServerDel--test_ignore_topology_disconnect_replica1 was run 50 times and produced 3 errors, none of which was related to the k5_mutex assertion error.

Packages:
krb5-server.x86_64 1.16.1-21.fc28
389-ds-base.x86_64 1.4.0.18-99.bz1633089.fc28 (build provided by viktor)

Robbie, I will try to reproduce the crash with krb5-server 1.16.1-21.fc28 and 389-ds-base 1.4.0.18-1 (i.e. the one available in fedora updates-testing) with valgrind.
The named crash could only be reproduced with krb5-server 1.16.1-20.fc28 (does not happen with 1.16.1-21.fc28). Robbie, do you still need a valgrind output?
Thanks Flo. Upon further inspection, it appears that the patch doesn't take the suggested route (i.e., don't call krb5_cc_destroy), so that the issue is still present isn't very surprising; I don't think a valgrind trace will produce anything of interest.
According to Florence's update (https://bugzilla.redhat.com/show_bug.cgi?id=1633089#c26), the named crash was systematic with 1.16.1-20.fc28 and not reproducible with 1.16.1-21.fc28. In both tests, DS was calling krb5_cc_destroy. It looks to me like the named crash was independent of the krb5_cc_destroy calls.

There were 3 failures while testing with these packages:
krb5-server.x86_64 1.16.1-21.fc28
389-ds-base.x86_64 1.4.0.18-99.bz1633089.fc28 (build provided by viktor)

Robbie, do you think those failures are a consequence of the krb5_cc_destroy calls?
Yes, I think it likely. I expect that valgrind would reveal similar corruption even though everything looks okay.
(In reply to Viktor Ashirov from comment #37) > Build with the fix from https://pagure.io/389-ds-base/pull-request/49973 on > top of 389-ds-base-1.4.0.18-1 is here > https://copr.fedorainfracloud.org/coprs/vashirov/389ds/build/808414/ > > # dnf copr enable vashirov/389ds > # dnf update 389-ds-base Please find a new build on top of 389-ds-base-1.4.0.19-1 in the same place: https://copr.fedorainfracloud.org/coprs/vashirov/389ds/build/833050/
Thanks for the build, Viktor. I am relaunching the tests, will keep you posted.
Offline, Flo reported a hard time running the previous build [1] successfully. The tests run fine on 389-ds-base-1.4.0.19-1 but not on 389-ds-base-1.4.0.19-1+krb_auth_serialization+no_call_to_krb5_cc_destroy [1]. I have no explanation for that. Viktor kindly rebuilt 389-ds-base-1.4.0.19-1+krb_auth_serialization [2]. It contains the original fix that was quite good on top of 1.4.0.18. Flo, would you mind relaunching the tests with [2]?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1633089#c43
[2] https://copr.fedorainfracloud.org/coprs/vashirov/389ds/build/834158/
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1633089#c38
Hi, thanks for the new rpms. I launched the test once with them, and the test succeeded. I am now running the tests in loop to see if the fix is really stable (with previous pkgs there were ~3 failures out of 50 runs). Will keep you posted.
Package information:

# dnf list installed krb5-server freeipa-server 389-ds-base
Installed Packages
389-ds-base.x86_64     1.4.0.19-1.2bz1633089.fc28   @vashirov-389ds
freeipa-server.x86_64  4.7.2-1.fc28                 @updates-testing
krb5-server.x86_64     1.16.1-21.fc28               @updates

The test test_integration/test_server_del.py::TestServerDel::test_ignore_topology_disconnect_replica1 was run 50 times without any issue (100% success).
Robbie,

The tentative fix [1] is successful. This fix serializes the krb authentication. As suggested [2], a new fix [3] was tested, and that fix systematically fails. The fix [3] is based on [1] and in addition prevents clearing the ccache in case of a krb auth failure with invalid credentials. The fix [3] (compared to [1]) looks like:

     rc = slapd_ldap_sasl_interactive_bind(ld, bindid, creds, mech,
                                           serverctrls, returnedctrls, msgidp);
     if (LDAP_SUCCESS != rc) {
         slapi_log_err(..)
-        if (mech && !strcmp(mech, "GSSAPI") && rc == 49) {
-            /* only on err 49 should we clear out the credential cache */
-            clear_krb5_ccache();
-        }
     }

Is it the fix you were thinking of?

[1] 1.4.0.19-1.2bz1633089.fc28 https://copr.fedorainfracloud.org/coprs/vashirov/389ds/build/834158/
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1633089#c42
[3] https://copr.fedorainfracloud.org/coprs/vashirov/389ds/build/834158/
Yes, it was. Interesting that it still fails, guess I was wrong about the underlying cause. I assume the failure is the same (random corruption); otherwise logs would be interesting.
Hi, it seems I'm having the same issue with FreeIPA with named-pkcs11.service from bind-9.11.4-10.P2.fc29.x86_64. After a recent upgrade to krb5-*-1.16.1-22.fc29.x86_64, named-pkcs11.service fails to start; with strace:

[...]
27-Dec-2018 01:11:02.890 bind-dyndb-ldap version 11.1 compiled at 22:53:39 Jul 15 2018, compiler 8.1.1 20180712 (Red Hat 8.1.1-5)
k5_mutex_lock: Received error 22 (Invalid argument)
named-pkcs11: ../../../include/k5-thread.h:376: k5_mutex_lock: Assertion `r == 0' failed.
) = ? +++ killed by SIGABRT (core dumped) +++
Aborted (core dumped)

After downgrading back to krb5-*-1.16.1-21.fc29.x86_64, but with the same bind version, named-pkcs11.service starts and works as expected. All underlying FSes are XFS. Can it be the same issue, or should I open a new bug?
Looking into this, it seems like that update (and its partner for F28) are the problem. In that build, -22, as well as adding a patch for the security bug that's mentioned in the changelog, rharwood re-enabled the patch that was intended to address this bug, but which I had *disabled* in -21 because it actually caused FreeIPA to break completely. It seems that patch *still* causes FreeIPA to break completely. So I'm sending out a -23 with the patch disabled again. Please check it out and see if it helps. https://bodhi.fedoraproject.org/updates/FEDORA-2018-dc944aaa79
(In reply to Adam Williamson from comment #51) > Looking into this, it seems like that update (and its partner for F28) are > the problem. In that build, -22, as well as adding a patch for the security > bug that's mentioned in the changelog, rharwood re-enabled the patch that > was intended to address this bug, but which I had *disabled* in -21 because > it actually caused FreeIPA to break completely. It seems that patch *still* > causes FreeIPA to break completely. So I'm sending out a -23 with the patch > disabled again. Please check it out and see if it helps. > > https://bodhi.fedoraproject.org/updates/FEDORA-2018-dc944aaa79 Thank you, -23 seems to be working just fine.
Upstream ticket is pushed (https://pagure.io/389-ds-base/issue/49972) -> POST
Is that a different fix which shouldn't break our FreeIPA tests this time?
Adam, this is a DS fix that was successfully tested with FreeIPA (https://bugzilla.redhat.com/show_bug.cgi?id=1633089#c47). Basically, the fix serializes krb authentication for outgoing connections (i.e. replication agreements), which prevents agreements from sharing the same ccache.
Hi Thierry et al., Upstream in krb5, we uncovered a bug with refcounting of MEMORY ccaches. We encountered difficulty diagnosing it because 088ba228acce4fd55bbb7c30122fe2703b8beeb8 accidentally fixed the issue as part of a backend rewrite. These changes landed in krb5-1.17 (so rawhide only for Fedora, but I can backport further if it helps) and were backported to RHEL (krb5-1.15.1-35.el7; RHEL-8 backport still pending). If testing determines that to be the issue, 389ds may not have to take the performance hit with Kerberos at all.
Thanks Robbie for the heads-up. Indeed, an invalid ccache refcount could be the RC of the crash of DS. Regarding the DS code changes, IMHO they have no significant performance impact. A potential performance hit should be very rare and limited (maybe a few more seconds to authenticate), and it has no consequence, as the connections run the replication protocol (Repl Agt), which is robust to timing issues. The part of the code in DS needs to be rewritten anyway (https://pagure.io/389-ds-base/issue/50017 and https://pagure.io/389-ds-base/issue/50018), so reverting the DS fix would not bring the code back to a better state. IMHO, we should keep DS as it is with that fix (current BZ). So if the RC of the crash was the bug in krb5, then the DS fix is useless but has no noticeable impact. If the RC of the crash was not the bug in krb5, we have a fix in place that proved it prevented the crash.
Thanks, that's helpful! Please reach out if I can help with any rewrites.
This message is a reminder that Fedora 28 is nearing its end of life. On 2019-May-28 Fedora will stop maintaining and issuing updates for Fedora 28. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '28'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 28 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged change the 'version' to a later Fedora version prior this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 28 changed to end-of-life (EOL) status on 2019-05-28. Fedora 28 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.