Description of problem:
sssd ID searches break with openldap-2.4.28-1.fc17.
Steps to Reproduce:
1. Set up sssd on a Rawhide box
2. Run getent passwd <userid>, eg. "getent passwd kdreyer"
(user is not found)
My LDAP server is CentOS 5, openldap-servers-2.3.43-12.el5_7.10
(regular openldap-clients commands in 2.4.28-1.fc17, for example ldapwhoami, or ldapsearch, work fine against this server. It is only sssd that breaks here.)
In the sssd_KTDREYER.log:
(Tue Jan 3 12:12:11 2012) [sssd[be[KTDREYER]]] [simple_bind_send] (0x0100): Executing simple bind as: (null)
(Tue Jan 3 12:12:11 2012) [sssd[be[KTDREYER]]] [simple_bind_send] (0x2000): ldap simple bind sent, msgid = 2
(Tue Jan 3 12:12:11 2012) [sssd[be[KTDREYER]]] [sdap_process_result] (0x2000): Trace: sh[0xb8eee4c0], connected, ops[0xb8f76e10], ldap[0xb8ee1ca0]
(Tue Jan 3 12:12:11 2012) [sssd[be[KTDREYER]]] [sdap_process_result] (0x0100): ldap_result gave -1, something bad happend!
(Tue Jan 3 12:12:11 2012) [sssd[be[KTDREYER]]] [sdap_handle_release] (0x2000): Trace: sh[0xb8eee4c0], connected, ops[0xb8f76e10], ldap[0xb8ee1ca0], destructor_lock, release_memory
(Tue Jan 3 12:12:11 2012) [sssd[be[KTDREYER]]] [remove_connection_callback] (0x4000): Successfully removed connection callback.
(Tue Jan 3 12:12:11 2012) [sssd[be[KTDREYER]]] [fo_set_port_status] (0x0100): Marking port 636 of server 'salt.ktdreyer.com' as 'not working'
My sssd config is working fine on EL5, EL6, and F15. Thanks to sgallagh in #sssd, I found out that when I dowgrade openldap to 2.4.26-5.fc17, sssd works again.
There is a regression in openldap-libs somewhere between 2.4.26-5.fc17 and 2.4.28-1.fc17. We don't get any useful information from openldap, we only receive -1 back from ldap_result().
I managed to reproduce this with openldap-2.4.28-2. However, after playing around with gdb trying to find out some more info about what's going on, it started to work and I can't reproduce this anymore. I really have no idea what happened. I didn't change any setting or anything in ldap database.
Here is roughly what I did:
1. run sssd -d 10 -i
2. find sssd_be pid and run gdb
3. attach gdb to sssd_be process
4. set breakpoint at sdap_async.c:1019
5. run getent passwd firstname.lastname@example.org
6. make few steps in gdb, then continue
Any ideas about what might have gone wrong? (or more precisely, gone 'right' in this particular case)
I know that attaching gdb to a process makes the process pause, so maybe the execution order changed somehow. What really escapes me, though, is why I can't reproduce this after I restart the sssd and slapd, or even reboot the machine.
Ah ok, I removed the cache file and it's reproducible again. But still not sure why it started to work.
(In reply to comment #3)
> Ah ok, I removed the cache file and it's reproducible again. But still not sure
> why it started to work.
SSSD caches the entries internally into an on-disk cache (as you figured out). I believe that the second request came straight out of the cache without even contacting SSSD.
If you set:
entry_cache_timeout = 1
into the [domain/redhat.com] section of your sssd.conf, then all cache entries will be only valid for 1 second, so even the second request should go all the way into the LDAP code.
Actually, one more gotcha..the cache expiration time is stored in the cache itself, so you'll probably want to either remove the cache after you changed entry_cache_timeout parameter or use the sss_cache tool (available in 1.7 an later) to mark the cached entries as expired.
Created attachment 559011 [details]
SSSD log after returning expected result.
I'm pretty much hopeless on this one. I attached a log where I got the expected result from openldap; that means after running 'getent passwd email@example.com', sssd contacted ldap and retrieved the info. What I don't understand is why there is that 'ldap_result gave -1' at the end of the log. I removed the cache files before running the command.
Anyone from sssd team willing to give a hint on what might have gone wrong?
It looks like some kind of a weird race condition to me.
This is how I test it:
1. start slapd (no nonstandard configuration + minimal tree to get the result)
2. (re)start slapd -d 10 -i
3. getent passwd firstname.lastname@example.org
4. rm -rf /var/lib/sss/db/*
5. goto 2.
At some point, step 3 prints the expected result.
(In reply to comment #7)
> I'm pretty much hopeless on this one. I attached a log where I got the expected
> result from openldap; that means after running 'getent passwd
> email@example.com', sssd contacted ldap and retrieved the info. What I don't
> understand is why there is that 'ldap_result gave -1' at the end of the log. I
> removed the cache files before running the command.
> Anyone from sssd team willing to give a hint on what might have gone wrong?
> It looks like some kind of a weird race condition to me.
> This is how I test it:
> 1. start slapd (no nonstandard configuration + minimal tree to get the result)
> 2. (re)start slapd -d 10 -i
> 3. getent passwd firstname.lastname@example.org
> 4. rm -rf /var/lib/sss/db/*
> 5. goto 2.
> At some point, step 3 prints the expected result.
I can prepare a scratch build that would print openldap debug info to stderr. Would that be helpful?
I think it might be. Thank you.
I'm not sure what exact version are you testing with so I'm going to attach a patch that applies on the upstream 1.7 branch - that should be pretty close to what you have.
In order to use the debugging, you need to export the SSSD_DEBUG_LDAP_SEARCH variable with the desired LDAP debug level.
would set LDAP_DEBUG_ANY
Then run SSSD on the foreground, the debug messages would simply appear on stderr.
Created attachment 559069 [details]
A non-upstream patch to enable libldap debugging
We're getting reports about this breakage all over the place. Please look into this as soon as possible.
Stephen, can you reproduce the bug? Please, can you retest with openldap from the following repository?
The latest package there (openldap-2.4.29-0.3.tc4) is possibly the future 2.4.29, which will be released soon. It is already tagged in upstream git. If it is a regression, then it might be already fixed.
I have tried rebuilding SSSD against:
$ rpm -qa \*openldap\*
and I am still seeing the bug.
Unsetting needinfo, the requested information was provided in comment #14. Please let us know if you need any additional information to produce a fix.
I have a little more information about this.
SSSD's event loop listens on the file descriptor we get from
ber_sockbuf_ctrl(sb, LBER_SB_OPT_GET_FD, &ber_fd);
We call a callback to process results from LDAP when that file descriptor becomes marked as readable by the kernel (assumption: there's data on it to be read).
However, it appears that this file descriptor is now sometimes being set "readable" without data that can be processed by ldap_result().
This is the source of the regression. Downgrading to F16's openldap works properly.
In the case above, it seems like the file descriptor is being marked readable before the response has actually arrived. Additionally, see BZ #790414 for another manifestation of this. In that particular case, we're getting it marked as readable AFTER the response has already been fully processed.
btw I bisected the version where the regression appeared and openldap-2.4.26-6.fc17 is NOT affected, while openldap-2.4.28-1.fc17 IS affected.
Thank you for the additional info. I'm already looking into this.
How do you test that the descriptor is readable?
SSSD relies on tevent (http://tevent.samba.org) for its mainloop. We register file descriptors with it and a callback to invoke on read or write readiness. Under the hood, tevent uses epoll().
Things started working again when I reverted upstream commit 06ec9f1db2bb9dc304a4adcd0d0506203cf9e6b6
I'm not actually familiar enough with openldap sources to say whether bailing out if "serviced" is not set in any of the branches is the right thing to do or whether there is another bug, but maybe this can help you triage the bug further..
Proposing as an F17 Alpha blocker so we can kick around the possible consequences of this in the Friday meeting: it's a judgement call. It's at least possible that any kind of usage of LDAP with F17 could be broken due to this bug; at minimum, we know it can break login via SSSD. Thus it's a conditional breakage of the criterion "Following on from the previous criterion, after firstboot is completed and on subsequent boots, a system installed according to any of the above criteria (or the appropriate Beta or Final criteria, when applying this criterion to those releases) must boot to a working graphical environment without unintended user intervention. This includes correctly accessing any encrypted partitions when the correct passphrase is supplied" - breaks it in the case of using SSSD.
Note that the bug can affect configurations where only the client is on F17 - you don't have to have the server running F17 to run into this bug. So that's a factor.
Fedora Bugzappers volunteer triage team
I have created an upstream report:
-1 to alpha +1 to nth +1 to beta
Discussed at the blocker review meeting of 2012-02-17. Agreed that this is not a blocker, we don't consider SSSD auth important/popular enough to block an Alpha release, but it is accepted as NTH so long as the fix is isolated to openldap as we'd expect. If the fix requires touching more general components, we may revisit the NTH status.
NTH means that if a fix for this is made available ahead of an RC spin, we will likely choose to pull that fix into the spin, through the freeze - but we will not delay the release if no fix shows up.
Fedora Bugzappers volunteer triage team
We have identified the cause and we are looking for a possible solution. If we don't find any before the RC deadline, I will revert the upstream patch identified by Jakub in (comment #20). So SSSD in F17 will work.
'RC deadline' is basically today: go/no-go is wednesday, so we need to compose rc3 today really in order to have time to test it.
Fedora Bugzappers volunteer triage team
Note this is not a release blocker
So instead of stressing everybody out ( Releng/QA/You guys) just introduce the proper fix via update before beta...
*** Bug 795502 has been marked as a duplicate of this bug. ***
contains a patch that partially resolves the issue, but I think we also have a bug in sssd where we erroneously assumed that calling ldap_result() after all results had already been returned was acceptable.
It was not, but ldap_result() used to return 0 anyways.
I think we should open a new bug for the opneldap libraries and return this one to SSSD, so we can handle the other half of the fix in SSSD itself.
Sorry disregard comment #29, I misinterpreted one reply from HYC.
The patch in http://www.openldap.org/its/index.cgi/Incoming?id=7167 should fully fix the SSSD issue w/o any code change on the SSSD side.
If this checks out we can update 7167 and consider the whole incident fixed. The patch in 7167#9 has also already been pushed to the openldap git master tree.
Scratch build with upstream patch applied:
It works in my test configuration. Can someone verify?
Works for me, too. Great!
openldap-2.4.29-3.fc17 has been submitted as an update for Fedora 17.
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing openldap-2.4.29-3.fc17'
as soon as you are able to.
Please go to the following url:
then log in and leave karma (feedback).
I confirmed that openldap-2.4.29-3.fc18 fixes the original bug. Thanks all for working on this!
*** Bug 790414 has been marked as a duplicate of this bug. ***
openldap-2.4.29-3.fc17 has been pushed to the Fedora 17 stable repository. If problems still persist, please make note of it in this bug report.