Bug 642046
Summary: Segfault when using SASL/GSSAPI multimaster replication, possible krb5_creds doublefree
Product: [Retired] 389
Component: Security - SASL
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Version: 1.2.6
Hardware: All
OS: Linux
Reporter: Edward Z. Yang <ezyang>
Assignee: Rich Megginson <rmeggins>
QA Contact: Viktor Ashirov <vashirov>
CC: andrey.ivanov, dpal, jgalipea, nkinder
Doc Type: Bug Fix
Last Closed: 2015-12-07 16:46:24 UTC
Bug Blocks: 639035, 656390
Description
Edward Z. Yang, 2010-10-11 20:53:38 UTC
Rich Megginson:
What platform is this? Have you tried to reproduce this with 389-ds-base-1.2.7.1? What version of Kerberos? Are you using pkinit? ping - needinfo?

Edward Z. Yang:
Sorry Rich, it looks like I missed your original ping. The last time we had a dirsrv segfault was Nov 12, so we have not tried reproducing with 1.2.7.1; our version of Kerberos is 1.7.1. We do not use pkinit.

Here is an interesting hack that we've used to mostly suppress the expression of this bug: we use a disk ccache instead of a memory one (which is renewed by the slapdagent cronjob). In our dirsrv sysconfig file we've added:

```shell
KRB5CCNAME=/var/run/dirsrv/krb5cc; export KRB5CCNAME
/usr/kerberos/bin/kinit -k -t "$KRB5_KTNAME" ldap/"$(hostname)"
chown --reference="$KRB5_KTNAME" "$KRB5CCNAME"
```

We can suspend the hack and wait until a dirsrv crashes (and since a recent, different patch was taken, this will probably not require us to do a full cluster reinitialization), but unless any Kerberos code was touched in the intervening releases, it seems unlikely that the problem would have been resolved. Oh, this is Fedora 13.

Rich Megginson:
I'm assuming you're using the regular Kerberos db and not using the LDAP server as the Kerberos db (because of the other bug). I'm going to try to reproduce on RHEL 6 using Kerberos 1.8.2. Your "hack" looks correct in that it uses the same credentials used by the server but bypasses the Kerberos calls. As long as you can reliably renew the creds, that should work fine. But do you still have the problem occasionally, even with your hack? If I can't reproduce on RHEL 6, I'll fall back to F-13.

Edward Z. Yang:
Correct. I don't think we've had a segfault since the hack was put in place. We've had dirsrvs crash, but for things like running out of disk space. :-) Thanks for looking into this.
Rich Megginson:
So far, I have not been able to reproduce on RHEL 6. This is what I've done:

1) set up a Kerberos server (standard db, not LDAP db)
2) issue an ldap service principal
3) set up multi-master replication using SASL/GSSAPI
4) add 1000 entries on each server and verify that the entries are replicated to both servers

Does it take a long time to reproduce the crash? Is there some particular sequence of operations that causes a crash?

Edward Z. Yang:
You may have more luck reproducing the error if you have a 3+, fully replicated MMR topology, as the bug may be related to racing between multiple interactions. It tends to take a long time to reproduce the crash, although when a crash does happen, it frequently causes cascading failures in other clients. We have not been able to identify a sequence of operations that causes a crash.

(another commenter):
I think the fact that the errors log shows certain lines around getting the ticket listed multiple times (3, to be exact) indicates that we may have multiple threads trying to access the ticket at the same time. Perhaps something that is shared between these threads needs to be protected. The logs and the code in set_krb5_creds() seem to indicate that we keep resetting the KRB5CCNAME environment variable if multiple threads race into set_krb5_creds(). I suspect that this might be the problem.

Rich Megginson:
Yes, I suspect it has to do with multiple threads. I suspected this might be a problem - see the comment near the top of set_krb5_creds() - but the locking used by krb5 looks sound, and I was never able to get it into a bad state in testing. I don't have a problem with putting a mutex around set_krb5_creds(), but since we can't reproduce the problem, we can't verify that the fix works. Edward, would you be able to test a fix for us?

Created attachment 469432 [details]
0001-Bug-642046-Segfault-when-using-SASL-GSSAPI-multimast.patch
Created attachment 469434 [details]
0001-Bug-642046-Segfault-when-using-SASL-GSSAPI-multimast.patch
To ssh://git.fedorahosted.org/git/389/ds.git
4da627a..53c948c master -> master

commit 53c948cbcd7d9e94ae1bc77eb625a337b470e368
Author: Rich Megginson <rmeggins>
Date: Thu Dec 16 08:28:26 2010 -0700
Reviewed by: nhosoi (Thanks!)
Branch: master

Fix Description: Added a mutex around all of the krb5 code. We are using static variables to cache the credentials from the keytab. Even though krb5 uses locks internally to protect the memory cache, it is possible the crash is caused by a race condition; the mutex should prevent it. Also added a hack for testing to allow setting the principal - nsds5replicabinddn must now be in DN format, so it cannot be used for the krb principal name; we really should add configuration parameters for the principal name and the keytab name. On machines with broken DNS/reverse DNS, testing Kerberos is quite hard without this. Instead of passing NULL to krb5_sname_to_principal() for the hostname, use the hostname from config_get_localhost() - this is consistent with what SASL does to initialize the SASL context.

Platforms tested: RHEL6 x86_64
Flag Day: no
Doc impact: no

Rich Megginson:
We should have a 389-ds-base-1.2.8.a1 rpm in Fedora Testing by early January. I don't know if you are able to build/test from source - if so, please grab the patch and try it out.

Edward Z. Yang:
Hello Rich, I completely forgot to undo the hotfix and see if 1.2.8 fixed things! We've undone the fix on one server; it is running 1.2.8.2.1.fc13. Cheers, Edward

Rich Megginson:
Waiting to see if we can mark this customer verified.... Based on Comment #17 - marking customer verified.