488597 – nscd discards entries even with unlimited reload count

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	nss_ldap
Sub Component:
Version:	5.3
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Nalin Dahyabhai
QA Contact:	BaseOS QE Security Team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-03-04 21:23 UTC by Daniel Qarras
Modified:	2018-11-14 20:25 UTC (History)
CC List:	12 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-02-01 22:05:09 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
/etc/nscd.conf (1.72 KB, text/plain) 2009-03-10 21:28 UTC, Daniel Qarras
/var/log/nscd.log (7.82 KB, text/plain) 2009-03-10 21:29 UTC, Daniel Qarras
nscd test steps (1.58 KB, text/plain) 2009-03-10 21:29 UTC, Daniel Qarras
slapd.conf (2.19 KB, text/plain) 2009-09-09 21:03 UTC, Daniel Qarras
DB_CONFIG (168 bytes, text/plain) 2009-09-09 21:04 UTC, Daniel Qarras
ldap.conf (1.29 KB, text/plain) 2009-09-09 21:04 UTC, Daniel Qarras
nsswitch.conf (301 bytes, text/plain) 2009-09-09 21:39 UTC, Daniel Qarras

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Sourceware	10181	0	P2	NEW	nscd discards entries even with unlimited reload count	2020-09-20 16:50:38 UTC

Description Daniel Qarras 2009-03-04 21:23:15 UTC

Description of problem:
With the configuration below nscd seems to remove user info after a period of time (noticed both after 5 minutes and after 15 hours). I did something like this to test (using LDAP as info source):

service ldap start  [succeeds]
service nscd start  [succeeds]
id testuser         [succeeds]
ls -l ~testuser     [succeeds]
service ldap stop   [succeeds]
id testuser         [succeeds]
ls -l ~testuser     [succeeds]
<wait, retry, wait, ...>
id testuser         [fails]
ls -l ~testuser     [fails]

It seems that the time after which entries get removed is more or less random, making it hard to identify any clear occurrences causing this.

This is important to allow laptops to work when not connected to the home network.



	server-user		nscd
	debug-level		0
	reload-count		unlimited
	paranoia		no

	enable-cache		passwd		yes
	positive-time-to-live	passwd		600
	negative-time-to-live	passwd		20
	suggested-size		passwd		211
	check-files		passwd		yes
	persistent		passwd		yes
	shared			passwd		yes
	max-db-size		passwd		33554432
	auto-propagate		passwd		yes

	enable-cache		group		yes
	positive-time-to-live	group		600
	negative-time-to-live	group		60
	suggested-size		group		211
	check-files		group		yes
	persistent		group		yes
	shared			group		yes
	max-db-size		group		33554432
	auto-propagate		group		yes

Comment 1 Daniel Qarras 2009-03-10 21:27:58 UTC

I was able to reproduce this easily on Fedora 10 with all updates as of 2009-03-10 installed, including:

root@localhost:~# rpm -q nscd
nscd-2.9-3.i386

I will attach from Fedora 10 test /etc/nscd.conf, /var/log/nscd.log, and steps done in a terminal.

Comment 2 Daniel Qarras 2009-03-10 21:28:50 UTC

Created attachment 334716 [details]
/etc/nscd.conf

Comment 3 Daniel Qarras 2009-03-10 21:29:11 UTC

Created attachment 334717 [details]
/var/log/nscd.log

Comment 4 Daniel Qarras 2009-03-10 21:29:35 UTC

Created attachment 334718 [details]
nscd test steps

Comment 5 Daniel Qarras 2009-05-18 16:03:36 UTC

Any news on this nscd issue? Would it be preferable if I open this against upstream glibc since this seems to happen also with F11?

Thanks.

Comment 6 Daniel Qarras 2009-05-21 09:15:11 UTC

I have now reported this upstream at http://sources.redhat.com/bugzilla/show_bug.cgi?id=10181 .

Comment 7 Andreas Schwab 2009-09-09 11:49:54 UTC

Please provide a sample ldap configuration.

Comment 8 Andreas Schwab 2009-09-09 16:22:11 UTC

Alternatively, is there a way to reproduce that without ldap?

Comment 9 Daniel Qarras 2009-09-09 21:02:55 UTC

Thanks for looking into this.

I've now retested this on latest Fedora 11 + updates using the following packages:

nscd-2.10.1-5.i586
openldap-servers-2.4.15-5.fc11.i586

The issue still remains after the same procedure described in comment #4. I will attach /etc/openldap/slapd.conf, /var/lib/ldap/intra/DB_CONFIG, and /etc/ldap.conf as requested.

This should be trivial to reproduce with ldap. What would be your suggestion to try this without ldap?

Thanks!

Comment 10 Daniel Qarras 2009-09-09 21:03:54 UTC

Created attachment 360345 [details]
slapd.conf

Comment 11 Daniel Qarras 2009-09-09 21:04:11 UTC

Created attachment 360346 [details]
DB_CONFIG

Comment 12 Daniel Qarras 2009-09-09 21:04:29 UTC

Created attachment 360347 [details]
ldap.conf

Comment 13 Daniel Qarras 2009-09-09 21:06:39 UTC

Err, nss_ldap was version:

nss_ldap-264-2.fc11.i586

No nss-ldapd was installed while testing.

Comment 14 Daniel Qarras 2009-09-09 21:39:30 UTC

Created attachment 360356 [details]
nsswitch.conf

Comment 15 Andreas Schwab 2009-10-02 13:30:33 UTC

I don't know how to set up an ldap server.  Can this be reproduced without one?

Comment 16 Daniel Qarras 2009-10-02 15:27:39 UTC

> I don't know how to set up an ldap server.  Can this be reproduced without one?

I have now done additional testing and yes, in fact this can be reproduced with plain files, too. Below is the only related configuration snippet in /etc/nsswitch.conf:

passwd:     files
shadow:     files
group:      files

And /etc/nscd.conf was https://bugzilla.redhat.com/attachment.cgi?id=334716 .

Below are the steps to reproduce:

wget "https://bugzilla.redhat.com/attachment.cgi?id=334716" -O /etc/nscd.conf
/etc/init.d/nscd restart
echo testuser:x:10000:100::/tmp:/sbin/nologin >> /etc/passwd
grep testuser /etc/passwd
touch /tmp/testfile
chown testuser:users /tmp/testfile
ls -l /tmp/testfile ; id testuser ;
sed -i 's/^testuser/#testuser/' /etc/passwd
grep testuser /etc/passwd
ls -l /tmp/testfile ; id testuser ;
<retry, wait, retry, ...>

For some reason every now and then this fails instantly, other times it succeeds for some time and then starts to fail. In case of instant failure, I've done something like this which usually guarantees that failure will not happen instantly:

sed -i 's/^#testuser/testuser/' /etc/passwd
grep testuser /etc/passwd
ls -l /tmp/testfile ; id testuser ;
sleep 60
ls -l /tmp/testfile ; id testuser ;
sed -i 's/^testuser/#testuser/' /etc/passwd
grep testuser /etc/passwd
ls -l /tmp/testfile ; id testuser ;
<retry, wait, retry, ...>

Thanks for looking into this!

Comment 17 Jeff Bastian 2009-10-22 18:19:28 UTC

If I'm reading the code correctly, this is expected behavior.

Looking at nscd/cache.c in the prune_cache() function, there are three conditions, any of which can make an entry in the cache removable:
  1. reload_count is not unlimited and the entry has been reloaded
     too many times
  2. the look-up failed
  3. you ran 'nscd -i' to invalidate the cache

So, by running 
   sed -i 's/^testuser/#testuser/' /etc/passwd
you satisfied condition #2 -- the look-up failed -- and it removed testuser.

This could happen using LDAP if, for example, the LDAP server was temporarily too slow to respond to a look-up request.

Here's the code:
                 /* At this point there are two choices: we reload the
                     value or we discard it.  Do not change NRELOADS if
                     we never not reload the record.  */
                  if ((reload_count != UINT_MAX
                       && __builtin_expect (dh->nreloads >= reload_count, 0))
                      /* We always remove negative entries.  */
                      || dh->notfound
                      /* Discard everything if the user explicitly
                         requests it.  */
                      || now == LONG_MAX)
                    {
                      /* Remove the value.  */
                      dh->usable = false;

                      /* We definitely have some garbage entries now.  */
                      any = true;
                    }
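
For readers following along, the condition quoted above can be modelled in isolation. This is only a sketch: `would_discard` and its parameters are hypothetical names standing in for the fields the excerpt reads (`dh->nreloads`, `dh->notfound`, `reload_count`, `now`), and it is not the glibc code itself. `UINT_MAX` models "reload-count unlimited" and `LONG_MAX` models an explicit invalidation, as in the excerpt.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical model of the quoted prune condition.
   nreloads     - how often the entry has been reloaded so far
   notfound     - true for a negative ("no such record") entry
   reload_count - configured limit; UINT_MAX means "unlimited"
   now          - current time; LONG_MAX means "discard everything"  */
bool
would_discard (unsigned int nreloads, bool notfound,
               unsigned int reload_count, long now)
{
  return (reload_count != UINT_MAX && nreloads >= reload_count)
         || notfound
         || now == LONG_MAX;
}
```

Under this model, with an unlimited reload count and no explicit invalidation, the only way for an entry to be discarded by this condition is for it to be marked as a negative entry.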

Comment 18 Daniel Qarras 2009-10-23 16:43:16 UTC

Thanks for looking into this.

If your analysis in comment 17 is correct, it means in practise that it is not possible to cache any user information in nscd when offline or not connected to a data source (e.g., an LDAP server). This sounds inconvenient as, e.g., an LDAP server hang might make the system unusable for a user if user information is coming from LDAP.

I am also wondering what the purpose or use case for this reload-count actually is, if it is not intended to work when a data source like an LDAP server is not available?

Comment 19 Daniel Qarras 2009-10-26 20:05:31 UTC

Based on my interpretation of the discussion at

http://sourceware.org/bugzilla/show_bug.cgi?id=2132

it would seem that the purpose of reload-count is what I've explained above meaning that we have here a bug which should be fixed.

Thanks.

Comment 20 Jeff Bastian 2009-10-26 22:03:23 UTC

This is just an idea that would require a lot of testing, but you could try removing or modifying condition #2.  For example, removing it:

--- cache.c     2009-09-14 12:21:30.060998034 -0500
+++ cache.c.new 2009-10-26 16:51:09.590911376 -0500
@@ -369,8 +369,6 @@
                     we never not reload the record.  */
                  if ((reload_count != UINT_MAX
                       && __builtin_expect (dh->nreloads >= reload_count, 0))
-                     /* We always remove negative entries.  */
-                     || dh->notfound
                      /* Discard everything if the user explicitly
                         requests it.  */
                      || now == LONG_MAX)


Or, you could modify it slightly so that it only removes entries if reload-count is not set to unlimited:

--- cache.c     2009-09-14 12:21:30.060998034 -0500
+++ cache.c.new 2009-10-26 16:52:59.502899615 -0500
@@ -369,8 +369,10 @@
                     we never not reload the record.  */
                  if ((reload_count != UINT_MAX
                       && __builtin_expect (dh->nreloads >= reload_count, 0))
-                     /* We always remove negative entries.  */
-                     || dh->notfound
+                     /* We remove negative entries if reload-count
+                       * is not unlimited.  */
+                     || (reload_count != UINT_MAX
+                          && dh->notfound)
                      /* Discard everything if the user explicitly
                         requests it.  */
                      || now == LONG_MAX)


However, as I mentioned above, this would require a lot of careful testing since there may be unwanted side-effects.
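
To make the effect of the second patch easier to see, here is a small stand-alone model of the modified condition. It is a sketch under assumptions: `would_discard_patched` and its parameters are hypothetical names, and only the boolean expression from the diff is reproduced, not the surrounding nscd code.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical model of the condition after the second patch:
   negative entries are discarded only when reload_count is limited.
   UINT_MAX models "reload-count unlimited"; LONG_MAX models an
   explicit request to discard everything.  */
bool
would_discard_patched (unsigned int nreloads, bool notfound,
                       unsigned int reload_count, long now)
{
  bool limited = reload_count != UINT_MAX;

  return (limited && nreloads >= reload_count)
         || (limited && notfound)
         || now == LONG_MAX;
}
```

In this model a negative entry with an unlimited reload count is no longer discarded here and instead falls through to whatever the else branch does. That is exactly the kind of change in behaviour that would need the careful testing mentioned above.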

Instead of modifying nscd, there's a Fedora project that looks intriguing: System Security Services Daemon.
  http://fedoraproject.org/wiki/Features/SSSD
  https://fedorahosted.org/sssd/

One of the primary features is the ability to cache nss db info for disconnected users like laptops.  SSSD is available in Fedora 11 now for testing.

Comment 21 Brian J. Murrell 2009-10-27 02:14:05 UTC

(In reply to comment #17)
> If I'm reading the code correctly, this is expected behavior.

I don't think you are reading that bit of code wrongly, but I think there is likely more at play here than just that.

> Looking at nscd/cache.c in the prune_cache() function, there are three
> conditions, any of which can make an entry in the cache removable:
>   1. reload_count is not unlimited and the entry has been reloaded
>      too many times
>   2. the look-up failed

This does not necessarily seem to be so cut and dried.  As Daniel (and I and others) have observed, the amount of time that needs to pass for nscd to have pruned the entry (i.e. after the authoritative source has become unavailable) "seems" (more on that in a bit) quite unpredictable.  I say seems because it is probably quite predictable once we uncover the mystery that is at the root of it.

Given that the amount of time that needs to pass for the entry to be pruned can vary greatly, a lookup failing cannot be the entire cause.  If it were, then the amount of time needed to pass for the entry to be pruned would be quite predictable.

>    sed -i 's/^testuser/#testuser/' /etc/passwd
> you satisfied condition #2 -- the look-up failed -- and it removed testuser.

Yes, but the amount of time that has to pass after that varies wildly.

> Here's the code:
>                  /* At this point there are two choices: we reload the
>                      value or we discard it.  Do not change NRELOADS if
>                      we never not reload the record.  */
>                   if ((reload_count != UINT_MAX
>                        && __builtin_expect (dh->nreloads >= reload_count, 0))
>                       /* We always remove negative entries.  */
>                       || dh->notfound
>                       /* Discard everything if the user explicitly
>                          requests it.  */
>                       || now == LONG_MAX)
>                     {
>                       /* Remove the value.  */
>                       dh->usable = false;
> 
>                       /* We definitely have some garbage entries now.  */
>                       any = true;
>                     }  

This bit of code is interesting.  Indeed, if we know that nscd -i was not used and we know that the reload-count is unlimited, then, the dh->notfound condition must be true.  But what contributes to dh->notfound being true?  A failed lookup, only, always?  Something else must be influencing that.

I had a discussion about this with another person interested in this same functionality (disconnected caching) and he pointed out that he believed that the cache was acting in what I would describe as a flywheel -- in that an entry would remain in the cache, despite the authoritative source being unavailable, as long as the entry was being used frequently enough.

This seems like a viable explanation.  I wonder if the test case in comment #16 can be modified to (dis-)prove this.  Or a code analysis.

I'm off to look at the code a bit more.

Comment 22 Brian J. Murrell 2009-10-27 02:38:36 UTC

Ahhh.  I suspect dh->notfound is a positive "no such record" vs. the alternate situation, which is "lookup failed".

I also don't think you are presenting the entire context of the pruning which is:

		  /* At this point there are two choices: we reload the
		     value or we discard it.  Do not change NRELOADS if
		     we never not reload the record.  */
		  if ((reload_count != UINT_MAX
		       && __builtin_expect (dh->nreloads >= reload_count, 0))
		      /* We always remove negative entries.  */
		      || dh->notfound
		      /* Discard everything if the user explicitly
			 requests it.  */
		      || now == LONG_MAX)
		    {
		      /* Remove the value.  */
		      dh->usable = false;

		      /* We definitely have some garbage entries now.  */
		      any = true;
		    }
		  else
		    {
		      /* Reload the value.  We do this only for the
			 initially used key, not the additionally
			 added derived value.  */
		      assert (runp->type < LASTREQ
			      && readdfcts[runp->type] != NULL);

		      readdfcts[runp->type] (table, runp, dh);

		      /* If the entry has been replaced, we might need
			 cleanup.  */
		      any |= !dh->usable;
		    }

The first part of the if covers the simple cases where pruning should definitely happen.  If none of those are true, however, the else path is taken, which is the more likely path in the case of reload-count = unlimited.

In this situation, "readdfcts[runp->type] (table, runp, dh)" is used to attempt to reload the entry, where readdfcts can be any of:

static void (*const readdfcts[LASTREQ]) (struct database_dyn *,
					 struct hashentry *,
					 struct datahead *) =
{
  [GETPWBYNAME] = readdpwbyname,
  [GETPWBYUID] = readdpwbyuid,
  [GETGRBYNAME] = readdgrbyname,
  [GETGRBYGID] = readdgrbygid,
  [GETHOSTBYNAME] = readdhstbyname,
  [GETHOSTBYNAMEv6] = readdhstbynamev6,
  [GETHOSTBYADDR] = readdhstbyaddr,
  [GETHOSTBYADDRv6] = readdhstbyaddrv6,
  [GETAI] = readdhstai,
  [INITGROUPS] = readdinitgroups,
  [GETSERVBYNAME] = readdservbyname,
  [GETSERVBYPORT] = readdservbyport
};

So readdpwbyname(), for example, calls addpwbyX(), which uses __getpw{nam|uid}_r() to look up the entry and then calls cache_addpw() to add what was found.

Looking into cache_addpw() more closely, we can see that if the *pwd sent to it was NULL and he != NULL and errval == EAGAIN:

	  /* If we have an old record available but cannot find one
	     now because the service is not available we keep the old
	     record and make sure it does not get removed.  */
	  if (reload_count != UINT_MAX && dh->nreloads == reload_count)
	    /* Do not reset the value if we never not reload the record.  */
	    dh->nreloads = reload_count - 1;

	  written = total = 0;

This appears to be what is supposed to keep existing records around if a lookup fails and reload-count is unlimited (or at least very large -- i.e. one could set reload-count to a multiple of the timeout to set an upper timelimit on the cache).
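
The counter adjustment in that excerpt can be modelled on its own. The following is a sketch, not the nscd code: `nreloads_after_transient_failure` is a hypothetical helper that reproduces only the `if` from the excerpt, with `UINT_MAX` again standing for an unlimited reload count.

```c
#include <limits.h>

/* Hypothetical model of the quoted branch in cache_addpw(): when an
   old record exists but the service is unavailable, step the reload
   counter back by one if it has just reached the limit.  Returns the
   counter value to keep.  */
unsigned int
nreloads_after_transient_failure (unsigned int nreloads,
                                  unsigned int reload_count)
{
  if (reload_count != UINT_MAX && nreloads == reload_count)
    return reload_count - 1;

  /* Unlimited reload count, or limit not reached: leave unchanged.  */
  return nreloads;
}
```

In this model the counter never stays at the limit after a transient failure, so a pruning check of the form `nreloads >= reload_count` would not fire for that entry on the next pass.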

It's the else of the above which I have not quite figured out yet, that being !(he != NULL && errval == EAGAIN).  I've not quite figured out enough about the data structures (specifically he) being used to know what condition that really is.

I was going to start to sprinkle some debug around and see what seems to be happening when the records are being expired.

Comment 23 Brian J. Murrell 2009-10-27 13:22:03 UTC

Hrm.  I don't think the negative cache expiry is working properly either.  Using the technique on comment #16 to make a passwd entry valid and then invalid and then valid again, I am finding that after going through a series of valid, invalid, valid steps, many minutes after the entry is made valid again, nscd is still reporting it invalid:

$ grep foobar /etc/passwd
foobar:*:9999:9999::/dev/null:/bin/true
$ id foobar
id: foobar: No such user
$ id foobar
id: foobar: No such user
$ id foobar
id: foobar: No such user
$ id foobar
id: foobar: No such user
$ id foobar
id: foobar: No such user

The above "id" commands were done over a period of many minutes.  AFAIU, with a setting of:

	negative-time-to-live	passwd		20

nscd should be going back to the authoritative source every 20s to check for an update.
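
The expectation described here can be written down as a simple expiry check. This is a simplified model under an assumption that is not confirmed by the code quoted in this report: that each entry carries an absolute expiry time computed from its time-to-live. `expiry_time` and `is_expired` are hypothetical names.

```c
#include <stdbool.h>
#include <time.h>

/* Absolute expiry time for an entry inserted at INSERTED with a
   time-to-live of TTL_SECONDS (e.g. 20 for negative passwd entries
   in the configuration above).  */
time_t
expiry_time (time_t inserted, long ttl_seconds)
{
  return inserted + ttl_seconds;
}

/* True once NOW is past the expiry time, i.e. the entry should be
   re-checked against the authoritative source.  */
bool
is_expired (time_t expiry, time_t now)
{
  return expiry < now;
}
```

With a negative time-to-live of 20, an entry inserted at time 1000 would be due for a re-check at any time after 1020 under this model, which is why "No such user" persisting for many minutes looks wrong.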

Comment 24 Howard Wilkinson 2009-10-27 14:00:28 UTC

I think I have found a bug in nss_ldap that would explain this problem. Please see the comment I have added to http://sources.redhat.com/bugzilla/show_bug.cgi?id=2132

Comment 25 Brian J. Murrell 2009-10-27 14:42:59 UTC

Now that I think about it, adding and removing entries from /etc/passwd (i.e. in an effort to replicate this without LDAP) is not likely going to work, because a missing /etc/passwd entry is a "no such record", which is quite different than "the lookup could not be completed" which is what we are targeting here with an unreachable LDAP server.
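
The same distinction is visible at the libc API level. The sketch below classifies the outcome of `getpwnam_r()` following its documented contract: a return value of 0 with a NULL result means no matching record, and a non-zero return value is an error number. `lookup_outcome`, `classify_lookup` and `lookup_user` are hypothetical names, the buffer size is an arbitrary choice, and whether an unreachable LDAP server actually surfaces as an error rather than "not found" depends on the NSS module, which is the open question here.

```c
#include <pwd.h>
#include <stddef.h>
#include <sys/types.h>

enum lookup_outcome
{
  LOOKUP_FOUND,           /* A record was returned.  */
  LOOKUP_NO_SUCH_RECORD,  /* The source answered: no such user.  */
  LOOKUP_FAILED           /* The lookup could not be completed.  */
};

/* Classify getpwnam_r()'s return code RC and RESULT pointer.  */
enum lookup_outcome
classify_lookup (int rc, const struct passwd *result)
{
  if (rc != 0)
    return LOOKUP_FAILED;
  return result != NULL ? LOOKUP_FOUND : LOOKUP_NO_SUCH_RECORD;
}

/* Look NAME up through NSS and classify what happened.  */
enum lookup_outcome
lookup_user (const char *name)
{
  struct passwd pwd;
  struct passwd *result = NULL;
  char buf[16384];  /* Assumed large enough; too small yields an error.  */
  int rc = getpwnam_r (name, &pwd, buf, sizeof buf, &result);

  return classify_lookup (rc, result);
}
```

Commenting a user out of /etc/passwd would be expected to produce LOOKUP_NO_SUCH_RECORD, whereas the offline case is only interesting if it produces LOOKUP_FAILED.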

Comment 27 Jeff Bastian 2009-10-27 18:01:34 UTC

I've been thinking about this some more and discussing with some peers, and even if nss_ldap and nscd can be made to work as expected with unlimited reload count, something still "feels" wrong about this: nscd was designed to speed up name service lookups, not to act as an offline cache.

For example, what if data in the offline cache becomes stale?  Somebody will have to run 'nscd -i passwd' to force it to refresh the cache from the server.

SSSD, as mentioned in comment 20, was designed with this as one of its goals: an offline cache for disconnected laptops.  If nss_ldap has a bug, then it should be fixed, but efforts would probably be better spent on SSSD.
  http://fedoraproject.org/wiki/Features/SSSD
  https://fedorahosted.org/sssd/

Comment 28 Brian J. Murrell 2009-10-27 18:16:46 UTC

(In reply to comment #27)
> I've been thinking about this some more and discussing with some peers, and
> even if nss_ldap and nscd can be made to work as expected with unlimited reload
> count, something still "feels" wrong about this: nscd was designed to speed up
> name service lookups, not to act as an offline cache.

But it does cache, persistently already, which lends itself to being leveraged into an offline cache nicely, I think.

> For example, what if data in the offline cache becomes stale?

That's a problem with any offline cache, not just nscd.  Any offline cache, including SSSD (as I cannot see how it would know otherwise) has the potential to deliver stale (in reference to the authoritative source) data while it's offline.

> Somebody will
> have to run 'nscd -i passwd' to force it to refresh the cache from the server.

Not at all.  Once the authoritative source is available again, assuming the timeout has expired (which it should as it shouldn't be any larger than some reasonable-for-caching time period like, say, 10 minutes) for the record you want, the cache should refresh from the source.

IOW, once the cache is reconnected to the source after a period of being disconnected that is longer than the (i.e. 10 minutes) timeout, effectively, all records in the cache will get refreshed, or expired as the source returns "no such record (anymore)".
 
> SSSD, as mentioned in comment 20, was designed with this as one of its goals:
> an offline cache for disconnected laptops.

Yeah.  But it's new (i.e. likely still buggy and immature) and a whole honking new piece of software with a whole honking new (and not terribly straightforward as I skimmed it) configuration syntax that I have to learn.

>  If nss_ldap has a bug, then it
> should be fixed, but efforts would probably be better spent on SSSD.
>   http://fedoraproject.org/wiki/Features/SSSD
>   https://fedorahosted.org/sssd/  

In the context of RedHat perhaps.  But we are here discussing NSCD on RedHat.  Personally, my implementation target is wider than just RedHat.

Perhaps we should all take this elsewhere, and stop bothering RedHat with it seeing as they have a different solution in mind.

Comment 29 Dmitri Pal 2009-10-27 18:37:49 UTC

> 
> Yeah.  But it's new (i.e. likely still buggy and immature) and a whole honking
> new piece of software with a whole honking new (and not terribly straightforward
> as I skimmed it) configuration syntax that I have to learn.
> 

Configuration syntax has been significantly simplified for SSSD in the recent month. Yes it is still young but already imported into distributions other than Fedora. 

As for buggy... would you be interested in trying it and making it more stable by experimenting with it and providing your feedback?

We just released 0.7.1 today...

Comment 30 Daniel Qarras 2009-10-27 20:53:21 UTC

I am well aware of SSSD's existence (I've even suggested that others use it in bug 182464 and bug 186527), but this issue was about nscd's misbehaving configuration option, and to be honest I see nscd as preferable to SSSD for quite some time to come: it has *much* wider adoption, it's time-tested code, it's much more efficient at least currently, and, well, not fixing bugs is just bad practice. In addition, nscd is nicely compact, providing just some very basic functionality which is enough in many cases. SSSD code is constantly changing, as anyone can see from their git repo, so putting it into production in the coming months would be just irresponsible.

As stated by the author of the nscd himself in the upstream report nscd should be able to cache entries with this reload-count option. This is not just for laptop use, just think of large workstation and cluster environments where there's an LDAP server outage, consequences can be (and *have* been) rather unfortunate.

Having said that, I applaud Red Hat for taking the lead in this area by starting to develop the next-generation solution for these issues, but IMHO it is very clear that SSSD's time has not quite come yet and nscd deserves to be fixed in this regard.

OTOH, if no-one is going to fix this then please remove the option altogether to make sure users are not going to waste their (or yours) time in the future with known-to-be-broken configurations. Or at least document this limitation very clearly.

Thanks.

Comment 31 Jeff Bastian 2009-10-27 21:09:07 UTC

Please don't misunderstand me: I'm not saying nscd and/or nss_ldap should not be fixed, I was just wondering if nscd is really the best tool for an offline cache.  It may happen to work that way, but it wasn't originally designed with that purpose in mind, so there may be shortcomings using it that way.

SSSD, on the other hand, is being designed with that purpose in mind, although it's not quite mature yet.

So, as a short term solution, it's worth looking at getting the 'reload-count unlimited' feature working, but for long term goals the effort should really be on SSSD.

Comment 32 Daniel Qarras 2009-10-27 22:00:56 UTC

> So, as a short term solution, it's worth looking at getting the 'reload-count
> unlimited' feature working, but for long term goals the effort should really be
> on SSSD.

Thanks. And if not already said, I am able and willing to test any patches and/or configurations you might cook up. There are a few suggested patches above and also speculation about configuration options; which one would be best to test first?

Comment 33 Dmitri Pal 2009-10-27 22:47:41 UTC

Ok, for the ball to continue rolling we are reassigning it to the right component.

Comment 36 Andreas Schwab 2009-10-28 15:43:28 UTC

See comment #24, the bug needs to be fixed in nss_ldap.  There is nothing that nscd can do if the nss module returns misleading information.

Comment 37 Dmitri Pal 2009-10-28 16:22:00 UTC

Comment #24 points to the exact use case that SSSD has been created to solve.
It does both the credential and identity caching. Plus supports a lot of other valuable features. I am conceptually not against giving a green light and fixing this issue in nss_ldap. The only concern that I have is that SSSD will be available and much more mature by the time this fix becomes available so is it really worth spending time and fixing? Sounds like a duplication of effort. I realize all the concerns related to SSSD being immature but fixing nss_ldap will also require testing and thus time.
I want to help solve the problem but not do it twice. We have several organizations working closely with us on SSSD in real-world deployments solving the exact use case described. Maybe it is worth giving it a try?
If it does not fly I would commit resources to fix it in 5.6.

Comment 38 Daniel Qarras 2009-11-01 12:13:21 UTC

FWIW, I've tested this using nss-pam-ldapd 0.7.1 (and nss-ldapd 0.6.11 from RPM) instead of nss_ldap and nscd still discards the entries. Perhaps nss-pam-ldapd is behaving the same way as nss_ldap is suspected to be.

Comment 39 Howard Chu 2009-11-02 01:59:04 UTC

I think since nss-pam-ldapd is derived from nss_ldap the behaviors will likely be the same. But a quick look through this code shows that it is returning NSS_STATUS_UNAVAIL or NSS_STATUS_TRYAGAIN for this type of transient failure, so nscd should not be interpreting that the same as NSS_STATUS_NOTFOUND.

The OpenLDAP nssov solution is also a viable alternative, but it depends on the same NSS stub that nss-pam-ldapd uses. Of course in that case you can just use pcache, which *is* designed for offline/disconnected support, and forget about nscd.

Comment 40 Daniel Qarras 2009-11-03 13:01:03 UTC

Hmm, I think running an LDAP server on a laptop would be rather strange, in that case I'm sure SSSD would be much more preferable.

Anyway, nscd otherwise provides such robustness and performance that it is desirable to make it also cope with random LDAP server hiccups. It now seems that either one would need to fix both nss_ldap and nss-ldapd or just nscd, and in fact nscd currently doesn't seem to operate in the most logical fashion. From

http://sources.redhat.com/bugzilla/show_bug.cgi?id=2132

"Is it really correct for an NSS module to return NSS_STATUS_TRYAGAIN+EAGAIN when the LDAP server is unavailable or would NSS_STATUS_UNAVAIL+EAGAIN be better (or perhaps something else)?

Wouldn't it be better if nscd only enters data in the cache on NSS_STATUS_SUCCESS and NSS_STATUS_NOTFOUND? (that should also solve this problem and probably be more consistent)"
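
The rule proposed in that quote can be sketched as a predicate over the NSS status. This is only an illustration of the suggestion, not how nscd behaves: `status_model` is a hypothetical enum that mirrors the values of glibc's `enum nss_status` from `<nss.h>` under different names so the sketch compiles without that header, and `is_definitive_answer` is a hypothetical helper.

```c
#include <stdbool.h>

/* Mirrors the values of glibc's enum nss_status (NSS_STATUS_TRYAGAIN,
   NSS_STATUS_UNAVAIL, NSS_STATUS_NOTFOUND, NSS_STATUS_SUCCESS);
   renamed here so this sketch stands alone.  */
enum status_model
{
  ST_TRYAGAIN = -2,
  ST_UNAVAIL = -1,
  ST_NOTFOUND = 0,
  ST_SUCCESS = 1
};

/* True when the source gave a definitive answer, positive or
   negative.  Under the proposal, only such answers would be entered
   in the cache; transient failures would leave it untouched.  */
bool
is_definitive_answer (enum status_model status)
{
  return status == ST_SUCCESS || status == ST_NOTFOUND;
}
```

With such a rule, both TRYAGAIN and UNAVAIL would leave an existing cache entry alone, so the choice between them in the NSS module would matter less.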

Comment 41 Howard Chu 2009-11-03 21:51:53 UTC

(In reply to comment #40)
> Hmm, I think running an LDAP server on a laptop would be rather strange, in
> that case I'm sure SSSD would be much more preferable.

I believe your perceptions are out of date. I have OpenLDAP slapd running on my G1 phone with a local address book replicated from my main server. The process footprint is under 4MB, with TLS and appropriate security features enabled. One of the many advantages of this approach is that the service is remotely administrable via LDAP. And of course, the code is mature, proven to be highly efficient, well tested and stable.

> Anyway, nscd provides otherwise such robustness and performance that it is
> desired to make it also cope with random LDAP server hiccups. It now seems that
> either one would need to fix both nss_ldap and nss-ldapd or just nscd - and in
> fact nscd currently doesn't seem to operate in most logical fashion: From

"nscd" and "robustness" are two words that have never been closely associated...

Comment 42 Daniel Qarras 2009-11-18 18:12:38 UTC

I've now tested patches by Howard W available at

http://bugzilla.padl.com/show_bug.cgi?id=412

and with all of them applied nss_ldap/nscd on F12 finally seem to work as expected when reload-count is unlimited or, e.g., 5. I'm afraid that the patches are way too intrusive to be applied as-is to nss_ldap.rpm, but hopefully they give good hints as to how this issue could finally be fixed.

Thanks.

Comment 43 Daniel Qarras 2010-01-14 10:05:53 UTC

It's been a few months since the last comments; would anyone dare to guess what the plan is with this? I'm still willing to test anything that might be suggested.

Thanks.

Comment 44 Howard Wilkinson 2010-01-14 10:16:16 UTC

Daniel,

I have been bashing on the nss_ldap code for other problems including making the code a lot more resilient in a disconnected environment. The latest set of patches at the bugzilla for padl are in my view production ready. I am due to speak to Luke today to arrange upstreaming these, so I am hopeful that 266 will include all of my changes. If this does not then support the behaviour here I will work on any additional fixes required.

If we get the code mainstreamed are you in a position to influence the deployment into Fedora i.e. would it be possible to get this into Fedora 13 or 14 do you think?

Howard.

Comment 45 Daniel Qarras 2010-06-02 20:07:08 UTC

Howard et al,

based on https://bugzilla.redhat.com/show_bug.cgi?id=553032#c7 it would seem that there is no point in continuing to work on nss_ldap anymore. Therefore I think this bug could be closed and we just need to live with this if using EL5.

But because this is also an issue with nss-pam-ldapd I've opened bug 599192 against Fedora Rawhide for that.

Thanks.

Comment 49 RHEL Program Management 2011-02-01 22:05:09 UTC

Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.
