Red Hat Bugzilla – Bug 154314
nss_ldap randomly replying with wrong user's data
Last modified: 2007-11-30 17:11:03 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.6) Gecko/20050225 Firefox/1.0.1
Description of problem:
Second time already when I hear nss_ldap is replying with wrong results and causing peoples' mails to be shown to wrong people:
Something should really be done about this. At the very least I'm adding a check to make sure getpwnam() returns the same user name that is being requested, and if not put out some huge warnings about something being broken..
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Install Dovecot with nss_pam enabled. Try logging in enough times with different users and see how at some point it shows you wrong user's mailbox.
Thanks for the notification Timo.
In order to reproduce this, do you actually have to be using nss_ldap for user
Any idea how to valgrind this code path?
On the system where this happens, what does the PAM configuration look like?
Hi Josh Burley, we need information about your PAM configuration in order to
help diagnose and fix this problem. Could you please help us?
Sure... there's nothing fancy in there. It's simply the default /etc/ldap.conf
that was installed, and I updated it with my LDAP tree's structure, and the
Manager to bind with.
What information are you looking for?
I'm pretty sure it's nss_ldap specific problem. Never heard of it happening with anything else. And the
code in Dovecot that uses it is really simple as well. Just calling getpwnam() multiple times in a process
with no threads. At some point it then returns some previously authenticated user's data.
Perhaps LDAP query fails and it's not checking the error value properly, and just returns the struct
uninitialized (and filled with previous lookup's data)?
No idea how to debug this, I'm not using Fedora or nss_ldap myself..
Oh, and does PAM affect this somehow? Dovecot is doing PAM lookups in separate short-living
processes, so PAM has never been called before nss_ldap lookups within same process.
I was trying to investigate it but I has been unable to reproduce it. The PAM
calls shouldn't really matter as they are called in forked child process. I've
made a testing binary which called getpwnam randomly on various user accounts
stored in LDAP and it always returned consistent results.
Perhaps more information would be needed - the LDAP server vendor and version,
what kind of network link was between the LDAP server and the Dovecot server.
Exact versions of nss_ldap, glibc, openldap and dovecot packages...
Maybe it can be only reproduced with Dovecot. I'm not good at setting up a
working LDAP server (or nss_ldap), but I can give an easy way to test it:
- Create 100 or so test accounts to LDAP server named user-0 user-1 etc.
- Give each one a home directory which ends with the user's name
- Patch Dovecot's POP3 server with attached patch
- Run attached pop3test binary. It should print an error whenever home
directory's ending doesn't match the user which tried to log in.
Created attachment 114416 [details]
Created attachment 114417 [details]
patch to Dovecot's pop3 code
That indeed allows reproducing the bug (on a rawhide-ish system),
but only sometimes;
most of the time the tester very soon stops, waiting for dovecot's
response to PASS; I don't know what dovecot is waiting for
( (getent passwd) replies instantly, so it probably isn't the LDAP
server), or why it sometimes works fine.
I have also attempted to reproduce the "same" situation by creating 100
processes, each calling getpwnam(). This found no inconsistencies at all.
Can you check if the patch in bug #154315 makes Dovecot notice the problem? I'd
just like to verify that this really is nss_ldap problem and not a Dovecot
problem which for some reason has only been noticed with nss_ldap.
I don't know why the tester would stop.. It creates multiple connections anyway,
they all stop? It should slow down if the given password is invalid, but it
shouldn't completely stop even then.
Created attachment 114944 [details]
bzip2'ed syslog (*.*) during the test
It is still not as deterministic as I'd like, but adding the debugging
patch makes it much more likely to happen.
This seems to point outside of dovecot; maybe someone can decipher the logs
faster than me.
This issue has started to get stale.
Nalin can you take a look at some point in the near future.
I heard a couple of days ago that this happens after nscd crashes. Or maybe
loses LDAP connection for whatever reason. Looking at the log above there are
also some LDAP connection errors before the Dovecot error.
nscd sounds like a not completely unlikely reason... Is the problem
reproducible with nscd turned off? Anyhow, i will apply the
check from bug 154315 in a bit, so at least dovecot should cope with it
Fedora Core 3 is now maintained by the Fedora Legacy project for security
updates only. If this problem is a security issue, please reopen and
reassign to the Fedora Legacy product. If it is not a security issue and
hasn't been resolved in the current FC5 updates or in the FC6 test
release, reopen and change the version to match.
Moving forward to devel, until/if we figure out why this is/was happening, I'm
not going to rule out that it's still a problem.
I've build packages that could probably resolve the stopping.
Could anyone try it with them?
(these have #222110 solved, so it would probably fail much earlier than the stop
could occur, so it may be necessary to tweak the spec file a little bit)
Ok, I tried it now, the fix I posted has nothing to do with that, but:
It gets stuck quite soon and then, after about a minute or two, the problem
happens (according to the log message) and then it seems to work ok for some time.
These are the versions used:
(I had to modify the pop3 tester to compile with RHEL5 dovecot)
And, after trying the second time, it seemed to work for quite a long time but
then, after issuing "service nscd stop", it happened.
Forgot to mention that dovecot&nss_ldap&ldap server were running on RHEL4, the
tester on RHEL5. I could test this on rawhide as well, if needed.
Oh yeah, got it.
The problem relies in the fact that if an application is linked against
pthread and uses a nss_ldap call and then forks, both processes share the ldap
connection, having no locking mechanism, and bad things happen. This is the
case with dovecot -- dovecot-auth is linked with pthread and uses pam in
forked processes. A race condition causes dovecot-auth to receive a reply that
should have gone to pam.
The direct cause of this is that an assumption that __pthread_once is nonnull
(ldap-nss.c:1048) implies __pthread_atfork being nonnull (ldap-nss.c:504,
bits/libc-lock.h:290) is plain wrong. These two variables have no connection
to each other and each of them becomes non-null at the time linker resolves
them, which happens upon them being called. And it happens upon them being
called in the object that checks for them -- that means calling pthread_atfork
in dovecot has no effect on __pthread_atfork value in nss_ldap and vice versa.
For some not so obvious reason __pthread_once is nonnull at the enter of main
function, __pthread_atfork is null. This means that nss_ldap assumes we have
pthreads working, calls the __libc_atfork (ldap-nss.c:504), which is a noop in
this case, and then has no idea about the forks and such.
The easiest solution would be to help nss_ldap's configure find pthreads
(telling it to -lpthread), which would make nss_ldap use pthreads directly and
avoid such crazy things -- and using those libc internal functions is bad
anyway, but I'm not sure whether we should do it.
Also, we could fix it to chech for both __pthread_once and __pthread_atfork
but it would not find them and use the pid-comparing method, which is probably
I hope this information helps :)
Oh, so the bug was nothing like I imagined :)
Is it possible to close the LDAP connection somehow in the forked process? Does
endpwent() do anything?
If I didn't overlook anything, endpwent does not close the connection. I found
no way to do it. Seems like this will have to be fixed in nss_ldap, no
workaround this time :)
If anyone wants to play with this, this is a simple reproducer:
I dug into that a little further and wrote two patches.
The first one is a plain fix of the wrong assumption, leaving the atfork code
unused on current glibc:
Then I found out the cause of __pthread_atfork being NULL. It's not a dynamic
symbol any more, it's being built into the binaries, because it's just a small
function calling __register_atfork. So the second patch tries to follow this:
It's an ugly abuse of glibc internals, but the original code was as well.
There's probably not much benefit though. openldap uses shutdown(2) before
closing the connection so the parent's ldap connection becomes screwed and must
be reopened. This is out of my sight though.
pthread_atfork is defined in libpthread_nonshared.a, so that the handlers are
automatically deregistered when a shared library that contains them is dlclosed,
otherwise if you call pthread_atfork in some shared library, then dlclose it and
then call fork, it would be calling handlers in an already unmapped library,
which can either mean crashes or running some unrelated code (if some other
library has been mmapped in that place instead).
__register_atfork is provided by libc.so (i.e. always available), I think for
nss_ldap the easiest fix would be at configure time check for
libpthread_nonshared.a library and if it exists:
1) don't define pthread_atfork as weak, call it directly whenever pthread_once
2) link libnss_ldap.so.2 with -lpthread_nonshared
Jakub, thanks for your hints. I tried it and it works. I don't check for the
pthread_once at runtime though, it doesn't seem to be neccessary -- I just call
the pthread_atfork using pthread_once/__libc_once.
I also built some packages:
The upstream has accepted my two patches:
Nalin, please update to nss_ldap-256 once it's released.
I see nss_ldap-257-3.fc7.i386.rpm in koji. Is this going to make it into F7 testing?
Our F7 dovecot is getting bitten by bug #208528 which, according to a post by
Tomas Janousek on that bug, could be caused by this bug.
257-3.fc7 is supposed to be an update for F7, though I may have to hassle Luke
to help sort out my difficulties using the update tool.