Bug 1633089 - ns-slapd crashes with k5_mutex_lock: Assertion `r == 0' failed
Summary: ns-slapd crashes with k5_mutex_lock: Assertion `r == 0' failed
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: 389-ds
Version: 28
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: thierry bordaz
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 1636633 1671505
 
Reported: 2018-09-26 07:46 UTC by Florence Blanc-Renaud
Modified: 2022-04-05 08:31 UTC
CC List: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1671505
Environment:
Last Closed: 2019-05-28 23:57:57 UTC
Type: Bug
Embargoed:


Attachments


Links
System ID: Github 389ds 389-ds-base issues 3031
Status: closed
Summary: use-after-free in case of several parallel krb authentication
Last Updated: 2020-10-07 17:01:24 UTC

Description Florence Blanc-Renaud 2018-09-26 07:46:26 UTC
Description of problem:

When running IPA test suite test_integration/test_server_del.py, the test fails in TestServerDel::test_ignore_topology_disconnect_replica1. The journal shows that ns-slapd has crashed:

ns-slapd[23272]: k5_mutex_lock: Received error 22 (Invalid argument)
ns-slapd[23272]: ns-slapd: ../../../include/k5-thread.h:376: k5_mutex_lock: Assertion `r == 0' failed.
ns-slapd[23272]: GSSAPI client step 1


Version-Release number of selected component (if applicable):
Fedora 28:
389-ds-base.x86_64                   1.4.0.16-1.fc28


How reproducible:
The nightly test fails ~60% of the time.

Steps to Reproduce:
The test installs a 3-node topology where replica1 is connected to the master and to replica2 for domain data, and replica2 is connected to the master and to replica1 for CA data.

1. force removal of replica1 with --ignore-topology-disconnect
2. uninstall replica1
3. reinstall replica1

The reinstallation phase fails while re-starting 389-ds, and the journal shows the above error.

Comment 3 Florence Blanc-Renaud 2018-09-26 07:58:31 UTC
Created attachment 1487058 [details]
389-ds error log

Comment 13 Robbie Harwood 2018-09-28 16:23:36 UTC
The mutex this time doesn't look like a destroyed mutex:

(gdb) p	*m
$3 = {__data = {__lock = 1701339999, __count = 1885301603, __owner = 757100663,	__nusers = 1835491616,
    __kind = 1663188340, __spins = 15726, __elision = 28022, __list = {__prev = 0x6d6f642e3534302d,   
      __next = 0x6362612e3436302d}}, __size = "_check_pwp - agmt=\"cn=vm-045.dom-064.abc",         
  __align = 8097308729682715487}
(gdb) 

(Note that __kind is not -1.)  Something has corrupted this; the __size field is especially interesting.
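
The corruption is easy to see by decoding the pointer values: on a little-endian x86_64 host (as in the dump) the raw bytes of __list.__prev and __list.__next are plain ASCII from the agreement name that also shows up in __size. A small standalone decoder, illustrative only, with the values copied from the gdb dump above:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t prev = 0x6d6f642e3534302dULL;   /* __list.__prev from gdb */
        uint64_t next = 0x6362612e3436302dULL;   /* __list.__next from gdb */
        char buf[17] = {0};

        /* On a little-endian host the raw bytes are the ASCII characters of
         * the log string that overwrote the lock. */
        memcpy(buf, &prev, sizeof(prev));        /* "-045.dom" */
        memcpy(buf + 8, &next, sizeof(next));    /* "-064.abc" */
        printf("%s\n", buf);                     /* prints "-045.dom-064.abc" */
        return 0;
    }

This matches the tail of the __size field ("_check_pwp - agmt=\"cn=vm-045.dom-064.abc"), i.e. a replication-agreement log string was written over the mutex.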

Comment 18 thierry bordaz 2018-10-02 15:14:29 UTC
    - The problem is a use-after-free.
      A replication agreement starts to authenticate (krb) and allocates a ccache:
==16436==    at 0x4C2EBAB: malloc (vg_replace_malloc.c:299)
==16436==    by 0x5A67E6B: new_mcc_data (cc_memory.c:414)
==16436==    by 0x5A6800F: krb5_mcc_generate_new (cc_memory.c:503)
==16436==    by 0x5A5E380: krb5_cc_new_unique (ccbase.c:300)
==16436==    by 0x554EAFB: ldap_sasl_set_interact_vals + set_krb5_creds (in /usr/lib64/dirsrv/libslapd.so.0.1.0)
==16436==    by 0x554F98A: slapd_ldap_sasl_interactive_bind (in /usr/lib64/dirsrv/libslapd.so.0.1.0)                                
==16436==    by 0x5550B5A: slapi_ldap_bind (in /usr/lib64/dirsrv/libslapd.so.0.1.0)
==16436==    by 0x17CCACAE: bind_and_check_pwp (repl5_connection.c:1921)
==16436==    by 0x17CCACAE: conn_connect (repl5_connection.c:1259)
==16436==    by 0x17CD54ED: acquire_replica (repl5_protocol_util.c:133)
==16436==    by 0x17CCEEC9: repl5_inc_run (repl5_inc_protocol.c:885)

      But the authentication fails and the ccache is removed
==16436==    at 0x4C2FDAC: free (vg_replace_malloc.c:530)
==16436==    by 0x5A682FE: krb5_mcc_destroy (cc_memory.c:241)
==16436==    by 0x5550C30: slapi_ldap_bind (in /usr/lib64/dirsrv/libslapd.so.0.1.0)
==16436==    by 0x17CCACAE: bind_and_check_pwp (repl5_connection.c:1921)
==16436==    by 0x17CCACAE: conn_connect (repl5_connection.c:1259)
==16436==    by 0x17CD54ED: acquire_replica (repl5_protocol_util.c:133)
==16436==    by 0x17CCEEC9: repl5_inc_run (repl5_inc_protocol.c:885)

      Later the replication agreement retries to authenticate and uses the previous ccache,
      which has now been freed:
==16436==    by 0x7C0C690: sasl_client_step (in /usr/lib64/libsasl2.so.3.0.0)
==16436==    by 0x7C0C819: sasl_client_start (in /usr/lib64/libsasl2.so.3.0.0)
==16436==    by 0x69E628A: ldap_int_sasl_bind (in /usr/lib64/libldap_r-2.4.so.2.10.9)
==16436==    by 0x69E9B2B: ldap_sasl_interactive_bind (in /usr/lib64/libldap_r-2.4.so.2.10.9)
==16436==    by 0x69E9D49: ldap_sasl_interactive_bind_s (in /usr/lib64/libldap_r-2.4.so.2.10.9)
==16436==    by 0x554F873: slapd_ldap_sasl_interactive_bind (in /usr/lib64/dirsrv/libslapd.so.0.1.0)
==16436==    by 0x5550B5A: slapi_ldap_bind (in /usr/lib64/dirsrv/libslapd.so.0.1.0)
==16436==    by 0x17CCACAE: bind_and_check_pwp (repl5_connection.c:1921)
==16436==    by 0x17CCACAE: conn_connect (repl5_connection.c:1259)
==16436==    by 0x17CD54ED: acquire_replica (repl5_protocol_util.c:133)
==16436==    by 0x17CCEEC9: repl5_inc_run (repl5_inc_protocol.c:885)

    

    - The reason why authentication fails is unknown (it requires the connection log level);
      I did not find info in the krb log

    - The problem could occur more frequently as the replica backoff time has been recently reduced
      to improve parallel replica installation
        nsds5ReplicaBackoffMax: 3

In conclusion:
    - I think there is a bug (DS/KRB?) in the way krb reuses a ccache that has been cleared before
      (a sketch of the pattern is included at the end of this comment)
    - The problem can be more frequent when the backoff time is low
    - Reproducing with a low nsds5ReplicaBackoffMax (to reproduce faster) and connection logging (nsslapd-errorlog-level: 8)
      could give us some details on why the first authentication failed

Next steps
    - in a first phase I think the problem could be in DS (set_krb5_creds), in the way it reuses an already
      freed ccache.
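
A minimal C sketch of the pattern these traces describe, assuming MIT krb5 and collapsing the two replication-agreement threads into one linear call sequence for readability; this is illustrative and is not the actual 389-ds code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <krb5.h>

    static void ccache_use_after_free_pattern(void)
    {
        krb5_context ctx;
        krb5_ccache cc, cc2;
        char name[256];

        if (krb5_init_context(&ctx) != 0)
            return;

        /* Thread 1: set_krb5_creds generates a new MEMORY ccache
         * (new_mcc_data/krb5_mcc_generate_new in the allocation trace) ... */
        krb5_cc_new_unique(ctx, "MEMORY", NULL, &cc);

        /* ... and advertises it process-wide through KRB5CCNAME. */
        snprintf(name, sizeof(name), "%s:%s",
                 krb5_cc_get_type(ctx, cc), krb5_cc_get_name(ctx, cc));
        setenv("KRB5CCNAME", name, 1);

        /* Thread 2: a second bind resolves the same cache by name. */
        krb5_cc_default(ctx, &cc2);

        /* Thread 1: its bind fails (LDAP error 49) and the ccache is
         * destroyed (krb5_mcc_destroy in the free trace). */
        krb5_cc_destroy(ctx, cc);

        /* Thread 2: still inside the GSSAPI/SASL bind, keeps using cc2,
         * whose backing memory ccache data is now freed -> use-after-free. */
        (void)cc2;

        krb5_free_context(ctx);
    }

The crash occurs when the second handle is still in use inside the GSSAPI bind at the moment the first thread destroys the shared cache.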

Comment 20 thierry bordaz 2018-10-05 15:18:23 UTC
    - The root cause of the crash is identified.
      The problem occurs when several threads (replication agreements) authenticate to the same host
      while sharing the same ccache.
      Upon authentication failure one thread frees the ccache, and there is
      a good chance that other threads are still using it and will crash.

      Here is the same RA authenticating at different times with different ccaches:
[05/Oct/2018:11:01:13.970110542 +0200] - DEBUG - slapi_ldap_init_ext - Success: set up conn to [<remote_host>:389]
[05/Oct/2018:11:01:13.971703186 +0200] - DEBUG - set_krb5_creds - The default credentials cache [MEMORY:8bJWkGi] not found: will create a new one.
[05/Oct/2018:11:01:13.988816394 +0200] - DEBUG - set_krb5_creds - Generated new memory ccache [MEMORY:IH3GAIi]
[05/Oct/2018:11:01:13.993116272 +0200] - DEBUG - set_krb5_creds - Set new env for ccache: [KRB5CCNAME=MEMORY:IH3GAIi]
[05/Oct/2018:11:01:14.036796317 +0200] - DEBUG - slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error 49 (Invalid credentials) ((null)) errno 0 (Success)
[05/Oct/2018:11:01:14.038569711 +0200] - DEBUG - slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error 49 (Invalid credentials)

[05/Oct/2018:11:01:16.043821965 +0200] - DEBUG - slapi_ldap_init_ext - Success: set up conn to [<remote_host>:389]
[05/Oct/2018:11:01:16.047874771 +0200] - DEBUG - set_krb5_creds - The default credentials cache [MEMORY:IH3GAIi] not found: will create a new one.
[05/Oct/2018:11:01:16.075386265 +0200] - DEBUG - set_krb5_creds - Generated new memory ccache [MEMORY:PNxHnvN]
[05/Oct/2018:11:01:16.105726300 +0200] - DEBUG - set_krb5_creds - Set new env for ccache: [KRB5CCNAME=MEMORY:PNxHnvN]
[05/Oct/2018:11:01:16.186267995 +0200] - DEBUG - slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error 49 (Invalid credentials) ((null)) errno 0 (Success)
[05/Oct/2018:11:01:16.188587202 +0200] - DEBUG - slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error 49 (Invalid credentials)

     Here two threads authenticate at the same time and share the same ccache MEMORY:G3MvKTF.
     But when the first thread frees the ccache, the second thread has already completed auth (failure)
     and so is no longer using the ccache:
[05/Oct/2018:11:01:26.810196115 +0200] - DEBUG - slapi_ldap_init_ext - Success: set up conn to [<remote_host>:389]
[05/Oct/2018:11:01:26.811771354 +0200] - DEBUG - slapi_ldap_init_ext - Success: set up conn to [<remote_host>:389]
[05/Oct/2018:11:01:26.813446999 +0200] - DEBUG - set_krb5_creds - The default credentials cache [MEMORY:Jn09Bqm] not found: will create a new one.
[05/Oct/2018:11:01:26.831643446 +0200] - DEBUG - set_krb5_creds - Generated new memory ccache [MEMORY:G3MvKTF]
[05/Oct/2018:11:01:26.836465812 +0200] - DEBUG - set_krb5_creds - Set new env for ccache: [KRB5CCNAME=MEMORY:G3MvKTF]
[05/Oct/2018:11:01:26.869662934 +0200] - DEBUG - slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error 49 (Invalid credentials) ((null)) errno 0 (Success)
[05/Oct/2018:11:01:26.871480272 +0200] - DEBUG - slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error 49 (Invalid credentials)
[05/Oct/2018:11:01:26.873272904 +0200] - DEBUG - slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error 49 (Invalid credentials) ((null)) errno 0 (Success)
[05/Oct/2018:11:01:26.874904479 +0200] - DEBUG - slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error 49 (Invalid credentials)

     Here two threads authenticate at the same time and share the same ccache MEMORY:JGjv0l7.
     But the first thread frees the ccache while the second thread is still running authentication,
     so the second thread will crash:
[05/Oct/2018:11:01:44.262824714 +0200] - DEBUG - slapi_ldap_init_ext - Success: set up conn to [<remote_host>:389]
[05/Oct/2018:11:01:44.264432463 +0200] - DEBUG - slapi_ldap_init_ext - Success: set up conn to [<remote_host>:389]
[05/Oct/2018:11:01:44.267714979 +0200] - DEBUG - set_krb5_creds - The default credentials cache [MEMORY:ReQxFlc] not found: will create a new one.
[05/Oct/2018:11:01:45.963404361 +0200] - DEBUG - set_krb5_creds - Generated new memory ccache [MEMORY:JGjv0l7]
[05/Oct/2018:11:01:45.969799130 +0200] - DEBUG - set_krb5_creds - Set new env for ccache: [KRB5CCNAME=MEMORY:JGjv0l7]
[05/Oct/2018:11:01:46.090698037 +0200] - DEBUG - slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error 49 (Invalid credentials) ((null)) errno 0 (Success)
[05/Oct/2018:11:01:46.092159023 +0200] - DEBUG - slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error 49 (Invalid credentials)
[05/Oct/2018:11:01:47.090799006 +0200] - DEBUG - connection_reset - new connection on 72
[05/Oct/2018:11:01:47.092290346 +0200] - DEBUG - connection_reset - new connection on 82
[05/Oct/2018:11:01:47.096954723 +0200] - DEBUG - slapi_ldap_init_ext - config_get_saslpath returns NULL

    - The trigger of the crash is not an auth retry
      but two authentications happening at the same time (because there are 2 replication
      agreements connecting to the same host) AND sharing the same ccache

In conclusion: 
    - The crash is timing-dependent: it requires two replication agreements starting
      authentication at the same time to the same host. It also requires that
      one authentication completes before the other
    - The crash is not a regression but a bug that has always existed

Next steps:
    - Is it expected that several threads share the same ccache, and could it
      be prevented?

Comment 22 thierry bordaz 2018-10-08 10:27:22 UTC
When the same MEMORY-type ccache is used by several threads, it looks like it is the responsibility of the application to be careful not to destroy the cache out from under a thread (http://kerberos.996246.n3.nabble.com/krb5-ccache-of-MEMORY-type-td48334.html#a48336).

At first look I do not see an easy fix to synchronize those threads.

Another option would be to force creating a new ccache even if an existing one could be used (as if krb5_cc_get_principal returned KRB5_FCC_NOFILE); see the sketch below. But I am not sure it is valid or how it would impact performance.
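
A sketch of that second option, assuming MIT krb5; the helper name is hypothetical and this is not the 389-ds code:

    #include <krb5.h>

    /* Hypothetical helper: never reuse the default ccache. Behave as if
     * krb5_cc_get_principal() had returned KRB5_FCC_NOFILE and always hand
     * the caller a private MEMORY ccache, so two threads cannot destroy
     * each other's cache. */
    static krb5_error_code get_private_ccache(krb5_context ctx, krb5_ccache *out)
    {
        return krb5_cc_new_unique(ctx, "MEMORY", NULL, out);
    }

The cost would be re-acquiring credentials from the keytab on every bind instead of reusing cached ones, which is the performance question raised above.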

Comment 27 Robbie Harwood 2018-10-09 15:30:28 UTC
thierry, can you elaborate on how the ccache handles are being used concurrently?  It's never going to be okay to, for instance, call krb5_cc_destroy() on the same handle twice, but if they're separate handles pointing to the same ccache that's expected to be fine.

Comment 28 thierry bordaz 2018-10-09 17:03:29 UTC
Robbie, both threads got the security context and ccache handle from krb5_init_context/krb5_cc_default.

Their security context may be different (but I have no log that confirms this).
The ccache handles have the same type/name (krb5_cc_get_type/krb5_cc_get_name) but, as with the security context, I do not know for sure whether it is the same handle or a different one pointing to the same object.

krb5_cc_destroy is called with the security context and ccache handle. What I know is that the ccache handles in the two threads have the same type/name.

Comment 29 Robbie Harwood 2018-10-09 20:52:02 UTC
Having two handles open to the same ccache, even from the same context, should work (and does, in my testing).  (Two calls of krb5_cc_default or krb5_cc_resolve, one of each, etc. - doesn't matter.)  I'm unable to produce a problematic interleaving, nor does code inspection suggest there would be one.
Closing the same ccache pointer twice does produce this result, however.

Looking at the trace, we call krb5_cc_new_unique from somewhere in slapd - unfortunately there's no symbol for that, but it's somewhere below slapi_ldap_bind.  This is also the function which frees it, and the function which uses it again.  I think this is all taking place in the same invocation, but I am not an expert in this codebase.

A question: what's the purpose of calling clear_krb5_ccache() on failure in slapi_ldap_bind?  Based on notes elsewhere in the codebase, 49 is authentication failed, which does get returned based on the logs, but I don't see the message that would go with that ("could not perform interactive bind for").

It looks like the version of 389ds was updated today to 1.4.0.17-2, which includes a reference to 49887, which looks SASL/thread-related - can someone try with that?

If that hasn't resolved the issue, I think we should have an openldap or SASL maintainer take a look.  389ds doesn't, from my inspection, manipulate krb5 (or GSSAPI) objects directly in this area and instead calls into openldap which calls into SASL.

Comment 30 Alexander Bokovoy 2018-10-10 03:31:56 UTC
krb5_cc_new_unique() is called from set_krb5_creds() in ldap/servers/slapd/ldaputil.c. It, in turn, is called from ldap_sasl_set_interact_vals() in case the SASL mech is GSSAPI.

There is plenty of krb5 manipulation in set_krb5_creds(); on the contrary, the GSSAPI code in cyrus-sasl doesn't do anything like that.

Comment 31 thierry bordaz 2018-10-10 13:06:51 UTC
According to the logs here is the sequence of krb5 calls/events

        Thread 1                                                Thread 2

        slapd_ldap_sasl_interactive_bind {
                krb5_init_context
                krb5_cc_default
                krb5_cc_get_principal -> KRB5_FCC_NOFILE
                    /* create a new ccache */
                krb5_cc_close
                krb5_unparse_name

                krb5_kt_default
                krb5_kt_get_name
                krb5_get_init_creds_keytab
                krb5_kt_close

                krb5_cc_new_unique
                    /* generate a new ccache: MEMORY:JGjv0l7 */
                krb5_cc_resolve
                krb5_cc_initialize
                krb5_cc_store_cred
                setenv newcache KRB5CCNAME
                                                        slapd_ldap_sasl_interactive_bind {
                                                                krb5_init_context
                                                                krb5_cc_default
                                                                    /* get ccache = MEMORY:JGjv0l7 */
                                                                krb5_cc_get_principal == 0
                                                                    /* retrieves the principal from the ccache */
                                                                krb5_unparse_name
                                                                    /* retrieve principal from it */
                                                                krb5_cc_retrieve_cred  
                                                                    /* Cred retrieved and they have not expired */

                ldap_sasl_interactive_bind_s -> RC=49
        } --> rc=49
        On rc=49 -> clear_krb5_ccache  {
            krb5_init_context
            krb5_cc_default
            krb5_cc_destroy /* It frees the ccache */
            krb5_free_context
        }
                                                                ldap_sasl_interactive_bind_s ->  crash
                                                        }


Some of the symbols were not resolved with valgrind, you may find resolved stacks at https://bugzilla.redhat.com/show_bug.cgi?id=1633089#c18

The purpose of clear_krb5_ccache is to call krb5_cc_destroy (https://pagure.io/389-ds-base/blob/master/f/ldap/servers/slapd/ldaputil.c#_2230). It is called when GSSAPI authentication fails with rc=49; a reconstructed sketch follows at the end of this comment.


I think krb5_init_context+krb5_cc_default retrieves the same ccache. It looks like this happens when the two threads authenticate within the same second. Note the threads are not doing a krb5_cc_close but a krb5_cc_destroy.

I am going to look at how the cyrus-sasl implementation differs from set_krb5_creds.
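
For reference, a sketch of clear_krb5_ccache reconstructed only from the call sequence listed above (krb5_init_context, krb5_cc_default, krb5_cc_destroy, krb5_free_context); the real implementation in ldap/servers/slapd/ldaputil.c may differ in details:

    #include <krb5.h>

    static void clear_krb5_ccache_sketch(void)
    {
        krb5_context ctx = NULL;
        krb5_ccache cc = NULL;

        if (krb5_init_context(&ctx) != 0)
            return;
        /* krb5_cc_default resolves KRB5CCNAME, i.e. the MEMORY ccache that
         * the failed bind (and possibly another in-flight bind) points at. */
        if (krb5_cc_default(ctx, &cc) == 0) {
            /* Destroying it frees the cache data even though another
             * thread's handle may still reference it. */
            krb5_cc_destroy(ctx, cc);
        }
        krb5_free_context(ctx);
    }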

Comment 32 Simo Sorce 2018-10-10 18:42:09 UTC
Can we simply remove the clear_krb5_ccache() code as a quick fix?
Is there anything actually depending on destroying the ccache?

Comment 33 Simo Sorce 2018-10-10 18:46:47 UTC
I guess the other option is to wrap the whole slapi_ldap_bind() into the krb5_lock (a sketch of this idea follows after this comment).

Long term I think we should get rid of all of the krb5 code and enhance SASL GSSAPI to make use of the credential store, so we can pass in the keytab (and optionally principal) we want to use to initiate GSSAPI connections, or at least set a CLIENT_KTNAME variable and let SASL's GSSAPI mechanism initiate on its own.
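
A minimal sketch of the first suggestion (wrapping the bind path in a lock), using a plain pthread mutex for illustration; the lock and wrapper names are hypothetical and this is not the fix that eventually landed:

    #include <pthread.h>

    /* Hypothetical global lock serializing the GSSAPI bind path. */
    static pthread_mutex_t krb5_bind_lock = PTHREAD_MUTEX_INITIALIZER;

    /* While one thread is between "create/resolve ccache" and
     * "clear_krb5_ccache on failure", no other thread can touch the shared
     * MEMORY ccache, so the destroy can no longer race with an in-flight bind. */
    static int serialized_gssapi_bind(void (*do_bind)(void *), void *arg)
    {
        pthread_mutex_lock(&krb5_bind_lock);
        do_bind(arg);   /* stands in for slapd_ldap_sasl_interactive_bind() */
        pthread_mutex_unlock(&krb5_bind_lock);
        return 0;
    }

Later comments (48 and 55) confirm that the tested fix does serialize the krb authentication, though the exact mechanism there may differ from this sketch.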

Comment 34 Adam Williamson 2018-10-10 19:07:35 UTC
Do all the 'private' comments on this bug really need to be private? I don't, at a glance, see any RH confidential information in any of them. Private comments should be avoided on Fedora bugs unless they are absolutely necessary, as this looks bad to those who can't see them (who *can* tell that there are private comments they can't see, as they see "missing" comment numbers; they know we're talking about something behind their backs, they just don't know what it is).

Comment 35 thierry bordaz 2018-10-11 14:15:09 UTC
Upstream ticket is 49972

Comment 36 Robbie Harwood 2018-10-11 14:19:04 UTC
Per IRC discussion, thierry will work on a rapid fix - trying removal of clear_krb5_ccache(), and possibly working on the locking.

Longer term we will be working with thierry to remove the krb5 calls from 389, in favor of either pure SASL or gssapi.

Finally, if it's not too much trouble, Flo or Adam, could I have valgrind traces for a crash (any of them) under krb5-1.16.1-20?

Comment 37 Viktor Ashirov 2018-10-11 19:30:49 UTC
Build with the fix from https://pagure.io/389-ds-base/pull-request/49973 on top of 389-ds-base-1.4.0.18-1 is here https://copr.fedorainfracloud.org/coprs/vashirov/389ds/build/808414/

# dnf copr enable vashirov/389ds
# dnf update 389-ds-base

Comment 38 Florence Blanc-Renaud 2018-10-16 08:40:08 UTC
The test test_server_del.py-TestServerDel--test_ignore_topology_disconnect_replica1 was run 50 times and produced 3 errors, none of which was related to the k5_mutex assertion error.

Packages:
krb5-server.x86_64     1.16.1-21.fc28
389-ds-base.x86_64     1.4.0.18-99.bz1633089.fc28 (build provided by viktor)


Robbie, I will try to reproduce the crash with krb5-server 1.16.1-21.fc28 and 389-ds-base 1.4.0.18-1 (ie the one available in fedora updates-testing) with valgrind.

Comment 39 Florence Blanc-Renaud 2018-10-16 11:38:59 UTC
The named crash could only be reproduced with krb5-server 1.16.1-20.fc28 (does not happen with 1.16.1-21.fc28). Robbie, do you still need a valgrind output?

Comment 40 Robbie Harwood 2018-10-16 18:56:36 UTC
Thanks Flo.  Upon further inspection, it appears that the patch doesn't take the suggested route (i.e., don't call krb5_cc_destroy), so that the issue is still present isn't very surprising; I don't think a valgrind trace will produce anything of interest.

Comment 41 thierry bordaz 2018-10-17 12:16:20 UTC
According to Florence's update (https://bugzilla.redhat.com/show_bug.cgi?id=1633089#c26), the named crash was systematic with 1.16.1-20.fc28 and not reproducible with 1.16.1-21.fc28. In both tests, DS was calling krb5_cc_destroy. It looks to me that the named crash was independent of the krb5_cc_destroy calls.

There were 3 failures while testing.
Packages:
krb5-server.x86_64     1.16.1-21.fc28
389-ds-base.x86_64     1.4.0.18-99.bz1633089.fc28 (build provided by viktor)

Robbie, do you think those failures are a consequence of the krb5_cc_destroy calls?

Comment 42 Robbie Harwood 2018-10-19 17:00:48 UTC
Yes, I think it likely.  I expect that valgrind would reveal similar corruption even though everything looks okay.

Comment 43 Viktor Ashirov 2018-12-04 09:55:38 UTC
(In reply to Viktor Ashirov from comment #37)
> Build with the fix from https://pagure.io/389-ds-base/pull-request/49973 on
> top of 389-ds-base-1.4.0.18-1 is here
> https://copr.fedorainfracloud.org/coprs/vashirov/389ds/build/808414/
> 
> # dnf copr enable vashirov/389ds
> # dnf update 389-ds-base

Please find a new build on top of 389-ds-base-1.4.0.19-1 in the same place:

https://copr.fedorainfracloud.org/coprs/vashirov/389ds/build/833050/

Comment 44 Florence Blanc-Renaud 2018-12-04 09:57:37 UTC
Thanks for the build, Viktor. I am relaunching the tests, will keep you posted.

Comment 45 thierry bordaz 2018-12-06 07:41:05 UTC
Offline, Flo reported a hard time running the previous build [1] successfully. The tests run fine with 389-ds-base-1.4.0.19-1 but not with 389-ds-base-1.4.0.19-1+krb_auth_serialization+no_call_to_krb5_cc_destroy [1].

I have no explanation for that. Viktor kindly rebuilt 389-ds-base-1.4.0.19-1+krb_auth_serialization [2]. It contains the original fix that worked quite well on top of 1.4.0.18.

Flo, do you mind relaunching the tests with [2]?


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1633089#c43
[2] https://copr.fedorainfracloud.org/coprs/vashirov/389ds/build/834158/
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1633089#c38

Comment 46 Florence Blanc-Renaud 2018-12-06 10:50:37 UTC
Hi,
thanks for the new rpms. I launched the test once with them, and the test succeeded. I am now running the tests in a loop to see if the fix is really stable (with previous pkgs there were ~3 failures out of 50 runs). Will keep you posted.

Comment 47 Florence Blanc-Renaud 2018-12-10 08:22:25 UTC
Package information:
# dnf list installed krb5-server freeipa-server 389-ds-base
Installed Packages
389-ds-base.x86_64           1.4.0.19-1.2bz1633089.fc28         @vashirov-389ds 
freeipa-server.x86_64        4.7.2-1.fc28                       @updates-testing
krb5-server.x86_64           1.16.1-21.fc28                     @updates        

The test test_integration/test_server_del.py::TestServerDel::test_ignore_topology_disconnect_replica1 was run 50 times without any issue (100% success).

Comment 48 thierry bordaz 2018-12-10 13:28:06 UTC
Robbie,

The tentative fix [1] is successful. This fix serializes the krb authentication.
As suggested [2], a new fix [3] was tested, and that fix systematically fails. The fix [3] is based on [1] and in addition avoids clearing the ccache in case of a krb auth failure for invalid credentials.
The fix [3] (compared to [1]) looks like:

        rc = slapd_ldap_sasl_interactive_bind(ld, bindid, creds, mech,
                                              serverctrls, returnedctrls, msgidp);
        if (LDAP_SUCCESS != rc) {
            slapi_log_err(..)
-           if (mech && !strcmp(mech, "GSSAPI") && rc == 49) {
-               /* only on err 49 should we clear out the credential cache */
-               clear_krb5_ccache();
-           }
        }

Is it the fix you were thinking of?

[1] 1.4.0.19-1.2bz1633089.fc28 https://copr.fedorainfracloud.org/coprs/vashirov/389ds/build/834158/
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1633089#c42
[3] https://copr.fedorainfracloud.org/coprs/vashirov/389ds/build/834158/

Comment 49 Robbie Harwood 2018-12-11 17:46:53 UTC
Yes, it was.  Interesting that it still fails, guess I was wrong about the underlying cause.  I assume the failure is the same (random corruption); otherwise logs would be interesting.

Comment 50 Martin Stefany 2018-12-27 00:21:35 UTC
Hi,

it seems I'm having the same issue with FreeIPA with named-pkcs11.service from bind-9.11.4-10.P2.fc29.x86_64. After a recent upgrade to krb5-*-1.16.1-22.fc29.x86_64, named-pkcs11.service fails to start; strace shows:

[...]
27-Dec-2018 01:11:02.890 bind-dyndb-ldap version 11.1 compiled at 22:53:39 Jul 15 2018, compiler 8.1.1 20180712 (Red Hat 8.1.1-5)
k5_mutex_lock: Received error 22 (Invalid argument)
named-pkcs11: ../../../include/k5-thread.h:376: k5_mutex_lock: Assertion `r == 0' failed.
)                    = ?
+++ killed by SIGABRT (core dumped) +++
Aborted (core dumped)


After downgrading back to krb5-*-1.16.1-21.fc29.x86_64, with the same bind version, named-pkcs11.service starts and works as expected.
All underlying FSes are XFS.

Can it be the same issue, or should I open a new bug?

Comment 51 Adam Williamson 2018-12-28 18:10:47 UTC
Looking into this, it seems like that update (and its partner for F28) are the problem. In that build, -22, as well as adding a patch for the security bug that's mentioned in the changelog, rharwood re-enabled the patch that was intended to address this bug, but which I had *disabled* in -21 because it actually caused FreeIPA to break completely. It seems that patch *still* causes FreeIPA to break completely. So I'm sending out a -23 with the patch disabled again. Please check it out and see if it helps.

https://bodhi.fedoraproject.org/updates/FEDORA-2018-dc944aaa79

Comment 52 Martin Stefany 2018-12-29 21:08:29 UTC
(In reply to Adam Williamson from comment #51)
> Looking into this, it seems like that update (and its partner for F28) are
> the problem. In that build, -22, as well as adding a patch for the security
> bug that's mentioned in the changelog, rharwood re-enabled the patch that
> was intended to address this bug, but which I had *disabled* in -21 because
> it actually caused FreeIPA to break completely. It seems that patch *still*
> causes FreeIPA to break completely. So I'm sending out a -23 with the patch
> disabled again. Please check it out and see if it helps.
> 
> https://bodhi.fedoraproject.org/updates/FEDORA-2018-dc944aaa79

Thank you, -23 seems to be working just fine.

Comment 53 thierry bordaz 2019-01-15 17:10:43 UTC
Upstream ticket is pushed (https://pagure.io/389-ds-base/issue/49972) -> POST

Comment 54 Adam Williamson 2019-01-15 17:21:10 UTC
Is that a different fix which shouldn't break our FreeIPA tests this time?

Comment 55 thierry bordaz 2019-01-15 17:31:56 UTC
Adam, this is a DS fix that was successfully tested with FreeIPA (https://bugzilla.redhat.com/show_bug.cgi?id=1633089#c47).
Basically the fix serializes krb authentication for outgoing connections (i.e. replication agreements), which prevents agreements from sharing the same ccache.

Comment 56 Robbie Harwood 2019-02-11 20:39:43 UTC
Hi Thierry et al.,

Upstream in krb5, we uncovered a bug with refcounting of MEMORY ccaches.  We encountered difficulty diagnosing it because 088ba228acce4fd55bbb7c30122fe2703b8beeb8 accidentally fixed the issue as part of a backend rewrite.

These changes landed in krb5-1.17 (so rawhide only for Fedora, but I can backport further if it helps) and were backported to RHEL (krb5-1.15.1-35.el7; RHEL-8 backport still pending).

If testing determines that to be the issue, 389ds may not have to take the performance hit with Kerberos at all.

Comment 57 thierry bordaz 2019-02-14 09:35:21 UTC
Thanks Robbie for the heads-up. Indeed, an invalid ccache refcount could be the root cause of the DS crash.

Regarding the DS code changes, IMHO they have no significant performance impact. A potential performance hit should be very rare and limited (maybe a few more seconds to authenticate), and have no consequence, as the connections run the replication protocol (Repl Agt), which is robust to timing issues.

That part of the DS code needs to be rewritten (https://pagure.io/389-ds-base/issue/50017 and https://pagure.io/389-ds-base/issue/50018), so reverting the DS fix would not bring the code back to a better state.
IMHO, we should keep DS as it is with that fix (current BZ). If the root cause of the crash was the bug in krb5, then the DS fix is useless but has no noticeable impact. If the root cause of the crash was not the bug in krb5, we have a fix in place that has proven to prevent the crash.

Comment 58 Robbie Harwood 2019-02-14 18:23:22 UTC
Thanks, that's helpful!  Please reach out if I can help with any rewrites.

Comment 65 Ben Cotton 2019-05-02 19:37:39 UTC
This message is a reminder that Fedora 28 is nearing its end of life.
On 2019-May-28 Fedora will stop maintaining and issuing updates for
Fedora 28. It is Fedora's policy to close all bug reports from releases
that are no longer maintained. At that time this bug will be closed as
EOL if it remains open with a Fedora 'version' of '28'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 28 reached end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 66 Ben Cotton 2019-05-28 23:57:57 UTC
Fedora 28 changed to end-of-life (EOL) status on 2019-05-28. Fedora 28 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

