Description of problem:

With the configuration below, nscd seems to remove user info after a period of time (noticed both after 5 minutes and after 15 hours). I did something like this to test (using LDAP as info source):

  service ldap start          [succeeds]
  service nscd start          [succeeds]
  id testuser                 [succeeds]
  ls -l ~testuser             [succeeds]
  service ldap stop           [succeeds]
  id testuser                 [succeeds]
  ls -l ~testuser             [succeeds]
  <wait, retry, wait, ...>
  id testuser                 [fails]
  ls -l ~testuser             [fails]

It seems that the time after which entries get removed is more or less random, making it hard to identify any clear occurrences causing this. This is important to allow laptops to work when not connected to the home network.

  server-user             nscd
  debug-level             0
  reload-count            unlimited
  paranoia                no

  enable-cache            passwd  yes
  positive-time-to-live   passwd  600
  negative-time-to-live   passwd  20
  suggested-size          passwd  211
  check-files             passwd  yes
  persistent              passwd  yes
  shared                  passwd  yes
  max-db-size             passwd  33554432
  auto-propagate          passwd  yes

  enable-cache            group   yes
  positive-time-to-live   group   600
  negative-time-to-live   group   60
  suggested-size          group   211
  check-files             group   yes
  persistent              group   yes
  shared                  group   yes
  max-db-size             group   33554432
  auto-propagate          group   yes
I was able to reproduce this easily on Fedora 10 with all updates as of 2009-03-10 installed, including:

  root@localhost:~# rpm -q nscd
  nscd-2.9-3.i386

I will attach /etc/nscd.conf, /var/log/nscd.log, and the steps done in a terminal from the Fedora 10 test.
Created attachment 334716 [details] /etc/nscd.conf
Created attachment 334717 [details] /var/log/nscd.log
Created attachment 334718 [details] nscd test steps
Any news on this nscd issue? Would it be preferable if I open this against upstream glibc since this seems to happen also with F11? Thanks.
I now upstreamed this at http://sources.redhat.com/bugzilla/show_bug.cgi?id=10181 .
Please provide a sample ldap configuration.
Alternatively, is there a way to reproduce that without ldap?
Thanks for looking into this. I've now retested this on latest Fedora 11 + updates using the following packages:

  nscd-2.10.1-5.i586
  openldap-servers-2.4.15-5.fc11.i586

The issue still remains after the same procedure described in comment #4. I will attach /etc/openldap/slapd.conf, /var/lib/ldap/intra/DB_CONFIG, and /etc/ldap.conf as requested. This should be trivial to reproduce with ldap. What would be your suggestion to try this without ldap? Thanks!
Created attachment 360345 [details] slapd.conf
Created attachment 360346 [details] DB_CONFIG
Created attachment 360347 [details] ldap.conf
Err, nss_ldap was version: nss_ldap-264-2.fc11.i586 No nss-ldapd was installed while testing.
Created attachment 360356 [details] nsswitch.conf
I don't know how to set up an LDAP server. Can this be reproduced without one?
> I don't know how to set up an LDAP server. Can this be reproduced without one?

I have now done additional testing and yes, this can in fact be reproduced with plain files, too. Below is the only related configuration snippet in /etc/nsswitch.conf:

  passwd: files
  shadow: files
  group: files

The /etc/nscd.conf used was https://bugzilla.redhat.com/attachment.cgi?id=334716 . Below are the steps to reproduce:

  wget "https://bugzilla.redhat.com/attachment.cgi?id=334716" -O /etc/nscd.conf
  /etc/init.d/nscd restart
  echo testuser:x:10000:100::/tmp:/sbin/nologin >> /etc/passwd
  grep testuser /etc/passwd
  touch /tmp/testfile
  chown testuser:users /tmp/testfile
  ls -l /tmp/testfile ; id testuser ;
  sed -i 's/^testuser/#testuser/' /etc/passwd
  grep testuser /etc/passwd
  ls -l /tmp/testfile ; id testuser ;
  <retry, wait, retry, ...>

For some reason, every now and then this fails instantly; other times it succeeds for some time and then starts to fail. In case of instant failure, I've done something like this, which usually guarantees that failure will not happen instantly:

  sed -i 's/^#testuser/testuser/' /etc/passwd
  grep testuser /etc/passwd
  ls -l /tmp/testfile ; id testuser ;
  sleep 60
  ls -l /tmp/testfile ; id testuser ;
  sed -i 's/^testuser/#testuser/' /etc/passwd
  grep testuser /etc/passwd
  ls -l /tmp/testfile ; id testuser ;
  <retry, wait, retry, ...>

Thanks for looking into this!
If I'm reading the code correctly, this is expected behavior. Looking at nscd/cache.c in the prune_cache() function, there are three conditions, any of which can make an entry in the cache removable:

  1. reload_count is not unlimited and the entry has been reloaded too many times
  2. the look-up failed
  3. you ran 'nscd -i' to invalidate the cache

So, by running

  sed -i 's/^testuser/#testuser/' /etc/passwd

you satisfied condition #2 -- the look-up failed -- and it removed testuser. This could happen using LDAP if, for example, the LDAP server was temporarily too slow to respond to a look-up request.

Here's the code:

  /* At this point there are two choices: we reload the
     value or we discard it.  Do not change NRELOADS if
     we never not reload the record.  */
  if ((reload_count != UINT_MAX
       && __builtin_expect (dh->nreloads >= reload_count, 0))
      /* We always remove negative entries.  */
      || dh->notfound
      /* Discard everything if the user explicitly
         requests it.  */
      || now == LONG_MAX)
    {
      /* Remove the value.  */
      dh->usable = false;

      /* We definitely have some garbage entries now.  */
      any = true;
    }
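To make the three conditions easier to reason about, here is a minimal sketch that restates the quoted test as a standalone predicate. The function name should_discard and its flat parameter list are hypothetical; only the three sub-conditions are taken from the quoted snippet, and nothing else about prune_cache() is modelled.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical restatement of the quoted if-condition.
   nreloads     : how many times the entry has been reloaded
   notfound     : true for a negative entry
   reload_count : configured limit, UINT_MAX standing for "unlimited"
   now          : current time, LONG_MAX standing for an explicit invalidation */
bool should_discard(unsigned int nreloads, bool notfound,
                    unsigned int reload_count, long now)
{
    return (reload_count != UINT_MAX && nreloads >= reload_count)
           || notfound
           || now == LONG_MAX;
}
```

With reload-count unlimited and no explicit invalidation, the only way this predicate becomes true is the notfound flag, which is why the later comments focus on what sets it.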
Thanks for looking into this. If your analysis in comment 17 is correct, it means in practice that it is not possible to keep any user information cached in nscd when offline or not connected to the data source (e.g., an LDAP server). This sounds inconvenient as, e.g., an LDAP server hang might make the system unusable for a user if user information is coming from LDAP. I am also wondering what the purpose or use case of reload-count actually is, if it is not intended to work when a data source like an LDAP server is not available?
Based on my interpretation of the discussion at http://sourceware.org/bugzilla/show_bug.cgi?id=2132 it would seem that the purpose of reload-count is what I've explained above, meaning that we have a bug here which should be fixed. Thanks.
This is just an idea that would require a lot of testing, but you could try removing or modifying condition #2. For example, removing it:

--- cache.c 2009-09-14 12:21:30.060998034 -0500
+++ cache.c.new 2009-10-26 16:51:09.590911376 -0500
@@ -369,8 +369,6 @@
 we never not reload the record. */
 if ((reload_count != UINT_MAX
 && __builtin_expect (dh->nreloads >= reload_count, 0))
- /* We always remove negative entries. */
- || dh->notfound
 /* Discard everything if the user explicitly
 requests it. */
 || now == LONG_MAX)

Or, you could modify it slightly so that it only removes entries if reload-count is not set to unlimited:

--- cache.c 2009-09-14 12:21:30.060998034 -0500
+++ cache.c.new 2009-10-26 16:52:59.502899615 -0500
@@ -369,8 +369,10 @@
 we never not reload the record. */
 if ((reload_count != UINT_MAX
 && __builtin_expect (dh->nreloads >= reload_count, 0))
- /* We always remove negative entries. */
- || dh->notfound
+ /* We remove negative entries if reload-count
+ * is not unlimited. */
+ || (reload_count != UINT_MAX
+ && dh->notfound)
 /* Discard everything if the user explicitly
 requests it. */
 || now == LONG_MAX)

However, as I mentioned above, this would require a lot of careful testing since there may be unwanted side-effects.

Instead of modifying nscd, there's a Fedora project that looks intriguing: System Security Services Daemon.

  http://fedoraproject.org/wiki/Features/SSSD
  https://fedorahosted.org/sssd/

One of the primary features is the ability to cache nss db info for disconnected users like laptops. SSSD is available in Fedora 11 now for testing.
(In reply to comment #17)
> If I'm reading the code correctly, this is expected behavior.

I don't think you are reading that bit of code wrongly, but I think there is likely more at play here than just that.

> Looking at nscd/cache.c in the prune_cache() function, there are three
> conditions, any of which can make an entry in the cache removable:
> 1. reload_count is not unlimited and the entry has been reloaded
> too many times
> 2. the look-up failed

This does not necessarily seem to be so cut and dried. As Daniel (and myself and others) have observed, the amount of time that needs to pass for nscd to have pruned the entry (i.e. after the authoritative source has become unavailable) "seems" (more on that in a bit) quite unpredictable. I say seems because it is probably quite predictable once we uncover the mystery that is at the root of it.

Given that the amount of time that needs to pass for the entry to be pruned can vary greatly, a lookup failing cannot be the entire cause. If it were, then the amount of time needed to pass for the entry to be pruned would be quite predictable.

> sed -i 's/^testuser/#testuser/' /etc/passwd
> you satisfied condition #2 -- the look-up failed -- and it removed testuser.

Yes, but the amount of time that has to pass after that varies wildly.

> Here's the code:
> /* At this point there are two choices: we reload the
> value or we discard it. Do not change NRELOADS if
> we never not reload the record. */
> if ((reload_count != UINT_MAX
> && __builtin_expect (dh->nreloads >= reload_count, 0))
> /* We always remove negative entries. */
> || dh->notfound
> /* Discard everything if the user explicitly
> requests it. */
> || now == LONG_MAX)
> {
> /* Remove the value. */
> dh->usable = false;
>
> /* We definitely have some garbage entries now. */
> any = true;
> }

This bit of code is interesting.
Indeed, if we know that nscd -i was not used and we know that the reload-count is unlimited, then, the dh->notfound condition must be true. But what contributes to dh->notfound being true? A failed lookup, only, always? Something else must be influencing that. I had a discussion about this with another person interested in this same functionality (disconnected caching) and he pointed out that he believed that the cache was acting in what I would describe as a flywheel -- in that an entry would remain in the cache, despite the authoritative source being unavailable, as long as the entry was being used frequently enough. This seems like a viable explanation. I wonder if the test case in comment #16 can be modified to (dis-)prove this. Or a code analysis. I'm off to look at the code a bit more.
Ahhh. I suspect dh->notfound is a positive "no such record" vs. the alternate situation, which is "lookup failed". I also don't think you are presenting the entire context of the pruning, which is:

  /* At this point there are two choices: we reload the
     value or we discard it.  Do not change NRELOADS if
     we never not reload the record.  */
  if ((reload_count != UINT_MAX
       && __builtin_expect (dh->nreloads >= reload_count, 0))
      /* We always remove negative entries.  */
      || dh->notfound
      /* Discard everything if the user explicitly
         requests it.  */
      || now == LONG_MAX)
    {
      /* Remove the value.  */
      dh->usable = false;

      /* We definitely have some garbage entries now.  */
      any = true;
    }
  else
    {
      /* Reload the value.  We do this only for the
         initially used key, not the additionally
         added derived value.  */
      assert (runp->type < LASTREQ
              && readdfcts[runp->type] != NULL);

      readdfcts[runp->type] (table, runp, dh);

      /* If the entry has been replaced, we might need
         cleanup.  */
      any |= !dh->usable;
    }

The first part of the if covers the simple cases in which pruning should definitely happen. If none of those are true, however, the else path is taken, which is the more likely path in the case of reload-count = unlimited. In this situation, "readdfcts[runp->type] (table, runp, dh)" is used to attempt to reload the entry, where readdfcts can be any of:

  static void (*const readdfcts[LASTREQ]) (struct database_dyn *,
                                           struct hashentry *,
                                           struct datahead *) =
  {
    [GETPWBYNAME] = readdpwbyname,
    [GETPWBYUID] = readdpwbyuid,
    [GETGRBYNAME] = readdgrbyname,
    [GETGRBYGID] = readdgrbygid,
    [GETHOSTBYNAME] = readdhstbyname,
    [GETHOSTBYNAMEv6] = readdhstbynamev6,
    [GETHOSTBYADDR] = readdhstbyaddr,
    [GETHOSTBYADDRv6] = readdhstbyaddrv6,
    [GETAI] = readdhstai,
    [INITGROUPS] = readdinitgroups,
    [GETSERVBYNAME] = readdservbyname,
    [GETSERVBYPORT] = readdservbyport
  };

So readdpwbyname(), for example, calls addpwbyX(), which uses __getpw{nam|uid}_r() to look up the entry and then calls cache_addpw() to add what was found.
Looking into cache_addpw() more closely, we can see that if the *pwd sent to it was NULL and he != NULL and errval == EAGAIN:

  /* If we have an old record available but cannot find one
     now because the service is not available we keep the old
     record and make sure it does not get removed.  */
  if (reload_count != UINT_MAX
      && dh->nreloads == reload_count)
    /* Do not reset the value if we never not reload the record.  */
    dh->nreloads = reload_count - 1;

  written = total = 0;

This appears to be what is supposed to keep existing records around if a lookup fails and reload-count is unlimited (or at least very large -- i.e. one could set reload-count to a multiple of the timeout to set an upper time limit on the cache).

It's the else of the above which I have not quite figured out yet, that being !(he != NULL && errval == EAGAIN). I've not quite figured out enough about the data structures (specifically he) being used to know what condition that really is. I was going to start to sprinkle some debug output around and see what seems to be happening when the records are being expired.
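The branch described above can be restated as two small helpers. This is only a sketch under stated assumptions: keeps_old_record and adjust_nreloads are hypothetical names, the booleans stand in for "pwd == NULL" and "he != NULL", and the else branch that the comment leaves unexplained is deliberately not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>

/* True when the quoted branch applies: the fresh lookup returned no record,
   an older cache entry exists, and the reported error was EAGAIN. */
bool keeps_old_record(bool lookup_found, bool have_old_entry, int errval)
{
    return !lookup_found && have_old_entry && errval == EAGAIN;
}

/* Mirrors the quoted nreloads adjustment: with a finite reload_count, an
   entry that has just reached the limit is stepped back by one.  With
   UINT_MAX ("unlimited") the counter is left untouched. */
unsigned int adjust_nreloads(unsigned int nreloads, unsigned int reload_count)
{
    if (reload_count != UINT_MAX && nreloads == reload_count)
        return reload_count - 1;
    return nreloads;
}
```

Read this way, an old record survives only if the NSS module really reports EAGAIN for an unreachable service; any other error value falls into the branch that has not been analysed here.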
Hrm. I don't think the negative cache expiry is working properly either. Using the technique in comment #16 to make a passwd entry valid and then invalid and then valid again, I am finding that after going through a series of valid, invalid, valid steps, many minutes after the entry is made valid again, nscd is still reporting it invalid:

  $ grep foobar /etc/passwd
  foobar:*:9999:9999::/dev/null:/bin/true
  $ id foobar
  id: foobar: No such user
  $ id foobar
  id: foobar: No such user
  $ id foobar
  id: foobar: No such user
  $ id foobar
  id: foobar: No such user
  $ id foobar
  id: foobar: No such user

The above "id" commands were done over a period of many minutes. AFAIU, with a setting of:

  negative-time-to-live passwd 20

nscd should be going back to the authoritative source every 20s to check for an update.
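The expectation stated above amounts to a plain time-to-live check. The sketch below is a simplified model of that expectation, not nscd's actual expiry code: entry_expired is a hypothetical helper, and whether the real comparison is strict or not is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Simplified model (an assumption, not nscd's code): an entry cached at
   `stored` with a TTL of `ttl` seconds is considered stale once `now`
   reaches stored + ttl. */
bool entry_expired(time_t stored, time_t ttl, time_t now)
{
    return now >= stored + ttl;
}
```

Under this model, a negative passwd entry cached at time 100 with a TTL of 20 would be looked up again from time 120 onwards, so "No such user" persisting for many minutes would not be expected.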
I think I have found a bug in nss_ldap that would explain this problem. Please see the comment I have added to http://sources.redhat.com/bugzilla/show_bug.cgi?id=2132
Now that I think about it, adding and removing entries from /etc/passwd (i.e. in an effort to replicate this without LDAP) is not likely going to work, because a missing /etc/passwd entry is a "no such record", which is quite different than "the lookup could not be completed" which is what we are targeting here with an unreachable LDAP server.
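The distinction drawn above is also visible at the libc API level. The following sketch classifies the result of getpwnam_r() according to its POSIX contract (return value 0 with a NULL result means no matching entry, a non-zero return value means the lookup itself failed). The names lookup_outcome, classify_lookup and lookup_user are hypothetical, and the Linux man page notes that some error numbers such as ENOENT may also be returned for a missing entry, so this is only an approximation of what a given NSS backend reports.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pwd.h>
#include <stddef.h>
#include <sys/types.h>

enum lookup_outcome { LOOKUP_FOUND, LOOKUP_NO_SUCH_RECORD, LOOKUP_FAILED };

/* Classify the (return value, result pointer) pair produced by getpwnam_r(). */
enum lookup_outcome classify_lookup(int ret, const struct passwd *result)
{
    if (result != NULL)
        return LOOKUP_FOUND;
    return ret == 0 ? LOOKUP_NO_SUCH_RECORD : LOOKUP_FAILED;
}

/* Look up `name` through the configured name service and classify the outcome. */
enum lookup_outcome lookup_user(const char *name)
{
    struct passwd pwd;
    struct passwd *result = NULL;
    char buf[16384];
    int ret = getpwnam_r(name, &pwd, buf, sizeof buf, &result);

    return classify_lookup(ret, result);
}
```

Commenting out a line in /etc/passwd would be expected to land in LOOKUP_NO_SUCH_RECORD, whereas an unreachable LDAP server should ideally land in LOOKUP_FAILED, which is the case this bug is about.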
I've been thinking about this some more and discussing with some peers, and even if nss_ldap and nscd can be made to work as expected with unlimited reload count, something still "feels" wrong about this: nscd was designed to speed up name service lookups, not to act as an offline cache. For example, what if data in the offline cache becomes stale? Somebody will have to run 'nscd -i passwd' to force it to refresh the cache from the server.

SSSD, as mentioned in comment 20, was designed with this as one of its goals: an offline cache for disconnected laptops. If nss_ldap has a bug, then it should be fixed, but efforts would probably be better spent on SSSD.

  http://fedoraproject.org/wiki/Features/SSSD
  https://fedorahosted.org/sssd/
(In reply to comment #27)
> I've been thinking about this some more and discussing with some peers, and
> even if nss_ldap and nscd can be made to work as expected with unlimited reload
> count, something still "feels" wrong about this: nscd was designed to speed up
> name service lookups, not to act as an offline cache.

But it does cache, persistently already, which lends itself nicely to being leveraged into an offline cache, I think.

> For example, what if data in the offline cache becomes stale?

That's a problem with any offline cache, not just nscd. Any offline cache, including SSSD (as I cannot see how it would know otherwise), has the potential to deliver stale (in reference to the authoritative source) data while it's offline.

> Somebody will
> have to run 'nscd -i passwd' to force it refresh the cache from the server.

Not at all. Once the authoritative source is available again, assuming the timeout has expired for the record you want (which it should have, as it shouldn't be any larger than some reasonable-for-caching time period like, say, 10 minutes), the cache should refresh from the source. IOW, once the cache is reconnected to the source after a period of being disconnected that is longer than the (e.g. 10 minute) timeout, effectively all records in the cache will get refreshed, or expired as the source returns "no such record (anymore)".

> SSSD, as mentioned in comment 20, was designed with this as one of its goals:
> an offline cache for disconnected laptops.

Yeah. But it's new (i.e. likely still buggy and immature) and a whole honking new piece of software with a whole honking new (and not terribly straightforward as I skimmed it) configuration syntax that I have to learn.

> If nss_ldap has a bug, then it
> should be fixed, but efforts would probably be better spent on SSSD.
> http://fedoraproject.org/wiki/Features/SSSD
> https://fedorahosted.org/sssd/

In the context of RedHat perhaps. But we are here discussing NSCD on RedHat.
Personally, my implementation target is wider than just RedHat. Perhaps we should all take this elsewhere, and stop bothering RedHat with it seeing as they have a different solution in mind.
> Yeah. But it's new (i.e. likely still buggy and immature) and a whole honking
> new piece of software with a whole honking new (and not terribly straightforward
> as I skimmed it) configuration syntax that I have to learn.

Configuration syntax has been significantly simplified for SSSD in the past month. Yes, it is still young, but it has already been imported into distributions other than Fedora. As for buggy... would you be interested in trying it and making it more stable by experimenting with it and providing your feedback? We just released 0.7.1 today...
I am well aware of SSSD's existence (I've even suggested that others use it in bug 182464 and bug 186527), but this issue was about nscd's misbehaving configuration option, and to be honest I see nscd as preferable to SSSD for quite some time to come: it has *much* wider adoption, it's time-tested code, it's much more efficient at least currently, and - well - not fixing bugs is just a bad practice. In addition, nscd is nicely compact, providing just some very basic functionality which is enough in many cases. SSSD code is constantly changing, as anyone can see from their git repo, so putting it into production in the coming months would be just irresponsible.

As stated by the author of nscd himself in the upstream report, nscd should be able to cache entries with this reload-count option. This is not just for laptop use; just think of large workstation and cluster environments where there's an LDAP server outage: the consequences can be (and *have* been) rather unfortunate.

Having said that, I applaud Red Hat for taking the lead in this area and starting to develop the next-generation solution for these issues, but IMHO it is very clear that SSSD's time has not quite come yet and nscd deserves to be fixed in this regard. OTOH, if no-one is going to fix this then please remove the option altogether to make sure users are not going to waste their (or your) time in the future with known-to-be-broken configurations. Or at least document this limitation very clearly. Thanks.
Please don't misunderstand me: I'm not saying nscd and/or nss_ldap should not be fixed, I was just wondering if nscd is really the best tool for an offline cache. It may happen to work that way, but it wasn't originally designed with that purpose in mind, so there may be shortcomings using it that way. SSSD, on the other hand, is being designed with that purpose in mind, although it's not quite mature yet. So, as a short term solution, it's worth looking at getting the 'reload-count unlimited' feature working, but for long term goals the effort should really be on SSSD.
> So, as a short term solution, it's worth looking at getting the 'reload-count
> unlimited' feature working, but for long term goals the effort should really be
> on SSSD.

Thanks. And if not already said, I am able and willing to test any patches and/or configurations you might cook up. There are a few suggested patches above and also speculation about configuration options; which one would be best to test first?
Ok, for the ball to continue rolling we are reassigning it to the right component.
See comment #24, the bug needs to be fixed in nss_ldap. There is nothing that nscd can do if the nss module returns misleading information.
Comment #24 points to the exact use case that SSSD has been created to solve. It does both credential and identity caching, and it supports a lot of other valuable features. I am conceptually not against giving a green light to fixing this issue in nss_ldap. The only concern that I have is that SSSD will be available and much more mature by the time this fix becomes available, so is it really worth spending time on fixing it? It sounds like a duplication of effort. I realize all the concerns related to SSSD being immature, but fixing nss_ldap will also require testing and thus time. I want to help to solve the problem but not to do it twice. We have several organizations working closely with us on SSSD in real work deployments solving the exact use case described. Maybe it is worth giving it a try? If it does not fly I would commit resources to fix it in 5.6.
FWIW, I've tested this using nss-pam-ldapd 0.7.1 (and nss-ldapd 0.6.11 from RPM) instead of nss_ldap and nscd still discards the entries. Perhaps nss-pam-ldapd is behaving the same way as nss_ldap is suspected to be.
I think since nss-pam-ldapd is derived from nss_ldap the behaviors will likely be the same. But a quick look through this code shows that it is returning NSS_STATUS_UNAVAIL or NSS_STATUS_TRYAGAIN for this type of transient failure, so nscd should not be interpreting that the same as NSS_STATUS_NOTFOUND. The OpenLDAP nssov solution is also a viable alternative, but it depends on the same NSS stub that nss-pam-ldapd uses. Of course in that case you can just use pcache, which *is* designed for offline/disconnected support, and forget about nscd.
Hmm, I think running an LDAP server on a laptop would be rather strange; in that case I'm sure SSSD would be much more preferable. Anyway, nscd otherwise provides such robustness and performance that it would be desirable to make it also cope with random LDAP server hiccups. It now seems that one would need to fix either both nss_ldap and nss-ldapd, or just nscd - and in fact nscd currently doesn't seem to operate in the most logical fashion. From http://sources.redhat.com/bugzilla/show_bug.cgi?id=2132 :

"Is it really correct for an NSS module to return NSS_STATUS_TRYAGAIN+EAGAIN when the LDAP server is unavailable or would NSS_STATUS_UNAVAIL+EAGAIN be better (or perhaps something else)? Wouldn't it be better if nscd only enters data in the cache on NSS_STATUS_SUCCESS and NSS_STATUS_NOTFOUND? (that should also solve this problem and probably be more consistent)"
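The policy proposed in the quoted upstream comment can be written down as a one-line predicate. This is a sketch of that proposal only, not of what nscd does today: the enum status_model is a local stand-in for the four NSS status names mentioned in the thread (its numeric values are arbitrary and not taken from <nss.h>), and should_update_cache is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the NSS status names discussed above; the values are
   arbitrary and chosen only for this illustration. */
enum status_model {
    STATUS_TRYAGAIN,
    STATUS_UNAVAIL,
    STATUS_NOTFOUND,
    STATUS_SUCCESS
};

/* Proposed policy: only definitive answers (a record, or a definite
   "no such record") are written to the cache; transient failures leave
   whatever is already cached untouched. */
bool should_update_cache(enum status_model st)
{
    return st == STATUS_SUCCESS || st == STATUS_NOTFOUND;
}
```

With such a rule the choice between TRYAGAIN and UNAVAIL for an unreachable server would no longer matter to the cache, since neither would overwrite an existing entry.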
(In reply to comment #40)
> Hmm, I think running an LDAP server on a laptop would be rather strange, in
> that case I'm sure SSSD would be much more preferable.

I believe your perceptions are out of date. I have OpenLDAP slapd running on my G1 phone with a local address book replicated from my main server. The process footprint is under 4MB, with TLS and appropriate security features enabled. One of the many advantages of this approach is that the service is remotely administrable via LDAP. And of course, the code is mature, proven to be highly efficient, well tested and stable.

> Anyway, nscd provides otherwise such robustness and performance that it is
> desired to make it also cope with random LDAP server hiccups. It now seems that
> either one would need to fix both nss_ldap and nss-ldapd or just nscd - and in
> fact nscd currently doesn't seem to operate in most logical fashion: From

"nscd" and "robustness" are two words that have never been closely associated...
I've now tested the patches by Howard W available at http://bugzilla.padl.com/show_bug.cgi?id=412 and with all of them applied nss_ldap/nscd on F12 finally seem to work as expected when reload-count is unlimited or, e.g., 5. I'm afraid that the patches are way too intrusive to be applied as-is to nss_ldap.rpm, but hopefully they give good hints as to how this issue could finally be fixed. Thanks.
It's been a few months since the last comments; would anyone dare to guess what the plan is with this? I'm still willing to test anything that might be suggested. Thanks.
Daniel, I have been bashing on the nss_ldap code for other problems including making the code a lot more resilient in a disconnected environment. The latest set of patches at the bugzilla for padl are in my view production ready. I am due to speak to Luke today to arrange upstreaming these, so I am hopeful that 266 will include all of my changes. If this does not then support the behaviour here I will work on any additional fixes required. If we get the code mainstreamed are you in a position to influence the deployment into Fedora i.e. would it be possible to get this into Fedora 13 or 14 do you think? Howard.
Howard et al, based on https://bugzilla.redhat.com/show_bug.cgi?id=553032#c7 it would seem that there is no point in continuing to work on nss_ldap anymore. Therefore I think this bug could be closed and we just need to live with this if using EL5. But because this is also an issue with nss-pam-ldapd, I've opened bug 599192 against Fedora Rawhide for that. Thanks.
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.