Bug 771484

Summary: ldap_result does not succeed for sssd
Product: [Fedora] Fedora Reporter: Ken Dreyer <ktdreyer>
Component: openldapAssignee: Jan Vcelak <jvcelak>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent Docs Contact:
Priority: high    
Version: rawhideCC: awilliam, jhrozek, johannbg, jsynacek, justin, jvcelak, orion, rmeggins, sgallagh, ssorce, tsmetana
Target Milestone: ---Keywords: Regression
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: AcceptedNTH RejectedBlocker
Fixed In Version: openldap-2.4.29-3.fc17 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-02-28 10:39:15 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 752651, 790414    
Attachments:
Description Flags
SSSD log after returning expected result.
none
A non-upstream patch to enable libldap debugging none

Description Ken Dreyer 2012-01-03 21:06:45 UTC
Description of problem:
sssd ID searches break with openldap-2.4.28-1.fc17.


How reproducible:
Always

Steps to Reproduce:
1. Set up sssd on a Rawhide box
2. Run getent passwd <userid>, eg. "getent passwd kdreyer"
  
Actual results:
(user is not found)

Expected results:
kdreyer:*:500:100:Ken Dreyer:/home/kdreyer:/bin/bash

Additional info:

My LDAP server is CentOS 5, openldap-servers-2.3.43-12.el5_7.10

(regular openldap-clients commands in 2.4.28-1.fc17, for example ldapwhoami, or ldapsearch, work fine against this server. It is only sssd that breaks here.)


In the sssd_KTDREYER.log:

(Tue Jan  3 12:12:11 2012) [sssd[be[KTDREYER]]] [simple_bind_send] (0x0100): Executing simple bind as: (null)
(Tue Jan  3 12:12:11 2012) [sssd[be[KTDREYER]]] [simple_bind_send] (0x2000): ldap simple bind sent, msgid = 2
(Tue Jan  3 12:12:11 2012) [sssd[be[KTDREYER]]] [sdap_process_result] (0x2000): Trace: sh[0xb8eee4c0], connected[1], ops[0xb8f76e10], ldap[0xb8ee1ca0]
(Tue Jan  3 12:12:11 2012) [sssd[be[KTDREYER]]] [sdap_process_result] (0x0100): ldap_result gave -1, something bad happend!
(Tue Jan  3 12:12:11 2012) [sssd[be[KTDREYER]]] [sdap_handle_release] (0x2000): Trace: sh[0xb8eee4c0], connected[1], ops[0xb8f76e10], ldap[0xb8ee1ca0], destructor_lock[0], release_memory[0]
(Tue Jan  3 12:12:11 2012) [sssd[be[KTDREYER]]] [remove_connection_callback] (0x4000): Successfully removed connection callback.
(Tue Jan  3 12:12:11 2012) [sssd[be[KTDREYER]]] [fo_set_port_status] (0x0100): Marking port 636 of server 'salt.ktdreyer.com' as 'not working'


My sssd config is working fine on EL5, EL6, and F15. Thanks to sgallagh in #sssd, I found out that when I dowgrade openldap to 2.4.26-5.fc17, sssd works again.
http://koji.fedoraproject.org/koji/buildinfo?buildID=267220

Comment 1 Stephen Gallagher 2012-01-03 21:13:49 UTC
There is a regression in openldap-libs somewhere between 2.4.26-5.fc17 and 2.4.28-1.fc17. We don't get any useful information from openldap, we only receive -1 back from ldap_result().

Comment 2 Jan Synacek 2012-01-30 11:47:25 UTC
I managed to reproduce this with openldap-2.4.28-2. However, after playing around with gdb trying to find out some more info about what's going on, it started to work and I can't reproduce this anymore. I really have no idea what happened. I didn't change any setting or anything in ldap database.

Here is roughly what I did:
1. run sssd -d 10 -i
2. find sssd_be pid and run gdb
3. attach gdb to sssd_be process
4. set breakpoint at sdap_async.c:1019
5. run getent passwd jsynacek
6. make few steps in gdb, then continue

Any ideas about what might have gone wrong? (or more precisely, gone 'right' in this particular case)

I know that attaching gdb to a process makes the process pause, so maybe the execution order changed somehow. What really escapes me, though, is why I can't reproduce this after I restart the sssd and slapd, or even reboot the machine.

Comment 3 Jan Synacek 2012-01-30 11:53:32 UTC
Ah ok, I removed the cache file and it's reproducible again. But still not sure why it started to work.

Comment 4 Jakub Hrozek 2012-01-30 11:58:49 UTC
(In reply to comment #3)
> Ah ok, I removed the cache file and it's reproducible again. But still not sure
> why it started to work.

SSSD caches the entries internally into an on-disk cache (as you figured out). I believe that the second request came straight out of the cache without even contacting SSSD.

If you set:
entry_cache_timeout = 1
into the [domain/redhat.com] section of your sssd.conf, then all cache entries will be only valid for 1 second, so even the second request should go all the way into the LDAP code.

Comment 5 Jakub Hrozek 2012-01-30 12:04:28 UTC
Actually, one more gotcha..the cache expiration time is stored in the cache itself, so you'll probably want to either remove the cache after you changed entry_cache_timeout parameter or use the sss_cache tool (available in 1.7 an later) to mark the cached entries as expired.

Comment 6 Jan Synacek 2012-02-02 08:45:04 UTC
Created attachment 559011 [details]
SSSD log after returning expected result.

Comment 7 Jan Synacek 2012-02-02 09:01:05 UTC
I'm pretty much hopeless on this one. I attached a log where I got the expected result from openldap; that means after running 'getent passwd jsynacek', sssd contacted ldap and retrieved the info. What I don't understand is why there is that 'ldap_result gave -1' at the end of the log. I removed the cache files before running the command.

Anyone from sssd team willing to give a hint on what might have gone wrong?

It looks like some kind of a weird race condition to me.
This is how I test it:
1. start slapd (no nonstandard configuration + minimal tree to get the result)
2. (re)start slapd -d 10 -i
3. getent passwd jsynacek
4. rm -rf /var/lib/sss/db/*
5. goto 2.

At some point, step 3 prints the expected result.

Comment 8 Jakub Hrozek 2012-02-02 09:09:56 UTC
(In reply to comment #7)
> I'm pretty much hopeless on this one. I attached a log where I got the expected
> result from openldap; that means after running 'getent passwd
> jsynacek', sssd contacted ldap and retrieved the info. What I don't
> understand is why there is that 'ldap_result gave -1' at the end of the log. I
> removed the cache files before running the command.
> 
> Anyone from sssd team willing to give a hint on what might have gone wrong?
> 
> It looks like some kind of a weird race condition to me.
> This is how I test it:
> 1. start slapd (no nonstandard configuration + minimal tree to get the result)
> 2. (re)start slapd -d 10 -i
> 3. getent passwd jsynacek
> 4. rm -rf /var/lib/sss/db/*
> 5. goto 2.
> 
> At some point, step 3 prints the expected result.

I can prepare a scratch build that would print openldap debug info to stderr. Would that be helpful?

Comment 9 Jan Synacek 2012-02-02 10:16:49 UTC
I think it might be. Thank you.

Comment 10 Jakub Hrozek 2012-02-02 15:07:22 UTC
I'm not sure what exact version are you testing with so I'm going to attach a patch that applies on the upstream 1.7 branch - that should be pretty close to what you have.

In order to use the debugging, you need to export the SSSD_DEBUG_LDAP_SEARCH variable with the desired LDAP debug level.

For example:
export SSSD_DEBUG_LDAP_SEARCH="0xffff"
would set LDAP_DEBUG_ANY

Then run SSSD on the foreground, the debug messages would simply appear on stderr.

Comment 11 Jakub Hrozek 2012-02-02 15:08:12 UTC
Created attachment 559069 [details]
A non-upstream patch to enable libldap debugging

Comment 12 Stephen Gallagher 2012-02-09 18:27:32 UTC
We're getting reports about this breakage all over the place. Please look into this as soon as possible.

Comment 13 Jan Vcelak 2012-02-09 19:54:16 UTC
Stephen, can you reproduce the bug? Please, can you retest with openldap from the following repository?

http://repos.fedorapeople.org/repos/jvcelak/openldap/

The latest package there (openldap-2.4.29-0.3.tc4) is possibly the future 2.4.29, which will be released soon. It is already tagged in upstream git. If it is a regression, then it might be already fixed.

Comment 14 Jakub Hrozek 2012-02-10 10:22:36 UTC
I have tried rebuilding SSSD against:

$ rpm -qa \*openldap\*
openldap-devel-2.4.29-0.3.tc4.fc17.x86_64
openldap-2.4.29-0.3.tc4.fc17.x86_64

and I am still seeing the bug.

Comment 15 Jakub Hrozek 2012-02-14 15:05:25 UTC
Unsetting needinfo, the requested information was provided in comment #14. Please let us know if you need any additional information to produce a fix.

Comment 16 Stephen Gallagher 2012-02-14 16:51:02 UTC
I have a little more information about this.

SSSD's event loop listens on the file descriptor we get from
ber_sockbuf_ctrl(sb, LBER_SB_OPT_GET_FD, &ber_fd);

We call a callback to process results from LDAP when that file descriptor becomes marked as readable by the kernel (assumption: there's data on it to be read).

However, it appears that this file descriptor is now sometimes being set "readable" without data that can be processed by ldap_result().

This is the source of the regression. Downgrading to F16's openldap works properly.

In the case above, it seems like the file descriptor is being marked readable before the response has actually arrived. Additionally, see BZ #790414 for another manifestation of this. In that particular case, we're getting it marked as readable AFTER the response has already been fully processed.

Comment 17 Jakub Hrozek 2012-02-14 16:59:46 UTC
btw I bisected the version where the regression appeared and openldap-2.4.26-6.fc17 is NOT affected, while openldap-2.4.28-1.fc17 IS affected.

Comment 18 Jan Vcelak 2012-02-14 17:15:00 UTC
Thank you for the additional info. I'm already looking into this.

How do you test that the descriptor is readable?

Comment 19 Stephen Gallagher 2012-02-14 18:07:59 UTC
SSSD relies on tevent (http://tevent.samba.org) for its mainloop. We register file descriptors with it and a callback to invoke on read or write readiness. Under the hood, tevent uses epoll().

Comment 20 Jakub Hrozek 2012-02-14 18:09:57 UTC
Things started working again when I reverted upstream commit 06ec9f1db2bb9dc304a4adcd0d0506203cf9e6b6

I'm not actually familiar enough with openldap sources to say whether bailing out if "serviced" is not set in any of the branches is the right thing to do or whether there is another bug, but maybe this can help you triage the bug further..

Comment 21 Adam Williamson 2012-02-14 20:31:57 UTC
Proposing as an F17 Alpha blocker so we can kick around the possible consequences of this in the Friday meeting: it's a judgement call. It's at least possible that any kind of usage of LDAP with F17 could be broken due to this bug; at minimum, we know it can break login via SSSD. Thus it's a conditional breakage of the criterion "Following on from the previous criterion, after firstboot is completed and on subsequent boots, a system installed according to any of the above criteria (or the appropriate Beta or Final criteria, when applying this criterion to those releases) must boot to a working graphical environment without unintended user intervention. This includes correctly accessing any encrypted partitions when the correct passphrase is supplied" - breaks it in the case of using SSSD.

Note that the bug can affect configurations where only the client is on F17 - you don't have to have the server running F17 to run into this bug. So that's a factor.



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Comment 22 Jan Vcelak 2012-02-15 15:41:12 UTC
I have created an upstream report:
http://www.openldap.org/its/index.cgi?findid=7167

Comment 23 Jóhann B. Guðmundsson 2012-02-17 09:02:35 UTC
-1 to alpha +1 to nth +1 to beta

Comment 24 Adam Williamson 2012-02-17 17:19:32 UTC
Discussed at the blocker review meeting of 2012-02-17. Agreed that this is not a blocker, we don't consider SSSD auth important/popular enough to block an Alpha release, but it is accepted as NTH so long as the fix is isolated to openldap as we'd expect. If the fix requires touching more general components, we may revisit the NTH status.

NTH means that if a fix for this is made available ahead of an RC spin, we will likely choose to pull that fix into the spin, through the freeze - but we will not delay the release if no fix shows up.



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Comment 25 Jan Vcelak 2012-02-20 17:32:37 UTC
We have identified the cause and we are looking for a possible solution. If we don't find any before the RC deadline, I will revert the upstream patch identified by Jakub in (comment #20). So SSSD in F17 will work.

Comment 26 Adam Williamson 2012-02-20 17:41:12 UTC
'RC deadline' is basically today: go/no-go is wednesday, so we need to compose rc3 today really in order to have time to test it.



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Comment 27 Jóhann B. Guðmundsson 2012-02-20 18:00:38 UTC
Note this is not a release blocker 

So instead of stressing everybody out ( Releng/QA/You guys) just introduce the proper fix via update before beta...

Comment 28 Simo Sorce 2012-02-20 18:14:44 UTC
*** Bug 795502 has been marked as a duplicate of this bug. ***

Comment 29 Simo Sorce 2012-02-20 21:13:20 UTC
http://www.openldap.org/its/index.cgi/Incoming?id=7167
contains a patch that partially resolves the issue, but I think we also have a bug in sssd where we erroneously assumed that calling ldap_result() after all results had already been returned was acceptable.
It was not, but ldap_result() used to return 0 anyways.

I think we should open a new bug for the opneldap libraries and return this one to SSSD, so we can handle the other half of the fix in SSSD itself.

Comment 30 Simo Sorce 2012-02-20 23:24:34 UTC
Sorry disregard comment #29, I misinterpreted one reply from HYC.

The patch in http://www.openldap.org/its/index.cgi/Incoming?id=7167 should fully fix the SSSD issue w/o any code change on the SSSD side.

If this checks out we can update 7167 and consider the whole incident fixed. The patch in 7167#9 has also already been pushed to the openldap git master tree.

Comment 31 Jan Vcelak 2012-02-21 11:02:36 UTC
Scratch build with upstream patch applied:
http://koji.fedoraproject.org/koji/taskinfo?taskID=3806953

Please, test!

Comment 32 Jan Vcelak 2012-02-21 11:10:56 UTC
It works in my test configuration. Can someone verify?

Comment 33 Jakub Hrozek 2012-02-21 11:51:44 UTC
Works for me, too. Great!

Comment 34 Jan Vcelak 2012-02-21 15:18:18 UTC
Resolved in:
openldap-2.4.29-3.fc17
openldap-2.4.29-3.fc18

Comment 35 Fedora Update System 2012-02-21 15:20:02 UTC
openldap-2.4.29-3.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/openldap-2.4.29-3.fc17

Comment 36 Fedora Update System 2012-02-21 17:44:18 UTC
Package openldap-2.4.29-3.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing openldap-2.4.29-3.fc17'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-2113/openldap-2.4.29-3.fc17
then log in and leave karma (feedback).

Comment 37 Ken Dreyer 2012-02-21 20:59:02 UTC
I confirmed that openldap-2.4.29-3.fc18 fixes the original bug. Thanks all for working on this!

Comment 38 Stephen Gallagher 2012-02-23 01:29:16 UTC
*** Bug 790414 has been marked as a duplicate of this bug. ***

Comment 39 Fedora Update System 2012-02-28 10:39:15 UTC
openldap-2.4.29-3.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.