154314 – nss_ldap randomly replying with wrong user's data

Bug 154314 - nss_ldap randomly replying with wrong user's data

Summary: nss_ldap randomly replying with wrong user's data

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	nss_ldap
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Nalin Dahyabhai
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	222667 243715
TreeView+	depends on / blocked

Reported:	2005-04-09 18:05 UTC by Timo Sirainen
Modified:	2007-11-30 22:11 UTC (History)
CC List:	14 users (show)
Fixed In Version:	nss_ldap-256-1
Clone Of:
Environment:
Last Closed:	2007-08-21 10:29:52 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
pop3 tester (4.96 KB, text/plain) 2005-05-16 10:48 UTC, Timo Sirainen	no flags	Details
patch to Dovecot's pop3 code (587 bytes, patch) 2005-05-16 10:49 UTC, Timo Sirainen	no flags	Details \| Diff
*bzip2'ed syslog (.) during the test* (954.02 KB, application/octet-stream) 2005-05-28 21:14 UTC, Miloslav Trmač	no flags	Details
View All

Description Timo Sirainen 2005-04-09 18:05:16 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.6) Gecko/20050225 Firefox/1.0.1

Description of problem:
Second time already when I hear nss_ldap is replying with wrong results and causing peoples' mails to be shown to wrong people:

http://www.dovecot.org/list/dovecot/2005-March/006345.html
http://www.dovecot.org/list/dovecot/2005-April/006859.html

Something should really be done about this. At the very least I'm adding a check to make sure getpwnam() returns the same user name that is being requested, and if not put out some huge warnings about something being broken..

Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
Install Dovecot with nss_pam enabled. Try logging in enough times with different users and see how at some point it shows you wrong user's mailbox.


Additional info:

Comment 3 Warren Togami 2005-04-17 10:56:38 UTC

Thanks for the notification Timo.

In order to reproduce this, do you actually have to be using nss_ldap for user
authentication?

Any idea how to valgrind this code path?

Comment 4 Nalin Dahyabhai 2005-04-18 18:11:03 UTC

On the system where this happens, what does the PAM configuration look like?

Comment 5 Warren Togami 2005-04-19 06:17:37 UTC

Hi Josh Burley, we need information about your PAM configuration in order to
help diagnose and fix this problem.  Could you please help us?

Comment 6 Josh Burley 2005-04-19 17:15:16 UTC

Sure... there's nothing fancy in there. It's simply the default /etc/ldap.conf
that was installed, and I updated it with my LDAP tree's structure, and the
Manager to bind with.

What information are you looking for?

Comment 7 Timo Sirainen 2005-04-21 19:38:13 UTC

I'm pretty sure it's nss_ldap specific problem. Never heard of it happening with anything else. And the 
code in Dovecot that uses it is really simple as well. Just calling getpwnam() multiple times in a process 
with no threads. At some point it then returns some previously authenticated user's data.

Perhaps LDAP query fails and it's not checking the error value properly, and just returns the struct 
uninitialized (and filled with previous lookup's data)?

No idea how to debug this, I'm not using Fedora or nss_ldap myself..

Comment 8 Timo Sirainen 2005-04-21 19:44:17 UTC

Oh, and does PAM affect this somehow? Dovecot is doing PAM lookups in separate short-living 
processes, so PAM has never been called before nss_ldap lookups within same process.

Comment 9 Tomas Mraz 2005-04-25 18:04:35 UTC

I was trying to investigate it but I has been unable to reproduce it. The PAM
calls shouldn't really matter as they are called in forked child process. I've
made a testing binary which called getpwnam randomly on various user accounts
stored in LDAP and it always returned consistent results.

Perhaps more information would be needed - the LDAP server vendor and version,
what kind of network link was between the LDAP server and the Dovecot server.
Exact versions of nss_ldap, glibc, openldap and dovecot packages...

Comment 10 Warren Togami 2005-05-16 08:35:42 UTC

ping Josh

Comment 11 Timo Sirainen 2005-05-16 10:47:42 UTC

Maybe it can be only reproduced with Dovecot. I'm not good at setting up a
working LDAP server (or nss_ldap), but I can give an easy way to test it:

 - Create 100 or so test accounts to LDAP server named user-0 user-1 etc.
 - Give each one a home directory which ends with the user's name
 - Patch Dovecot's POP3 server with attached patch
 - Run attached pop3test binary. It should print an error whenever home
directory's ending doesn't match the user which tried to log in.

Comment 12 Timo Sirainen 2005-05-16 10:48:55 UTC

Created attachment 114416 [details]
pop3 tester

Comment 13 Timo Sirainen 2005-05-16 10:49:55 UTC

Created attachment 114417 [details]
patch to Dovecot's pop3 code

Comment 14 Miloslav Trmač 2005-05-21 14:34:38 UTC

That indeed allows reproducing the bug (on a rawhide-ish system),
but only sometimes;
most of the time the tester very soon stops, waiting for dovecot's
response to PASS; I don't know what dovecot is waiting for
( (getent passwd) replies instantly, so it probably isn't the LDAP
server), or why it sometimes works fine.

I have also attempted to reproduce the "same" situation by creating 100
processes, each calling getpwnam(). This found no inconsistencies at all.

Comment 15 Timo Sirainen 2005-05-21 15:02:41 UTC

Can you check if the patch in bug #154315 makes Dovecot notice the problem? I'd
just like to verify that this really is nss_ldap problem and not a Dovecot
problem which for some reason has only been noticed with nss_ldap.

I don't know why the tester would stop.. It creates multiple connections anyway,
they all stop? It should slow down if the given password is invalid, but it
shouldn't completely stop even then.

Comment 16 Miloslav Trmač 2005-05-28 21:14:37 UTC

Created attachment 114944 [details]
bzip2'ed syslog (*.*) during the test

It is still not as deterministic as I'd like, but adding the debugging
patch makes it much more likely to happen.

This seems to point outside of dovecot; maybe someone can decipher the logs
faster than me.

Comment 17 Josh Bressers 2005-11-02 21:33:30 UTC

This issue has started to get stale.

Nalin can you take a look at some point in the near future.

Comment 18 Timo Sirainen 2005-11-03 10:02:39 UTC

I heard a couple of days ago that this happens after nscd crashes. Or maybe
loses LDAP connection for whatever reason. Looking at the log above there are
also some LDAP connection errors before the Dovecot error.

Comment 19 Petr Rockai 2006-01-18 08:39:23 UTC

nscd sounds like a not completely unlikely reason... Is the problem 
reproducible with nscd turned off? Anyhow, i will apply the   
check from bug 154315 in a bit, so at least dovecot should cope with it 
somewhat.

Comment 20 Matthew Miller 2006-07-10 21:42:13 UTC

Fedora Core 3 is now maintained by the Fedora Legacy project for security
updates only. If this problem is a security issue, please reopen and
reassign to the Fedora Legacy product. If it is not a security issue and
hasn't been resolved in the current FC5 updates or in the FC6 test
release, reopen and change the version to match.

Thank you!

Comment 21 Nalin Dahyabhai 2007-01-10 18:36:48 UTC

Moving forward to devel, until/if we figure out why this is/was happening, I'm
not going to rule out that it's still a problem.

Comment 22 Tomas Janousek 2007-01-11 17:36:00 UTC

I've build packages that could probably resolve the stopping.
http://people.redhat.com/tjanouse/dovecot/154314-test1/

Could anyone try it with them?

(these have #222110 solved, so it would probably fail much earlier than the stop
could occur, so it may be necessary to tweak the spec file a little bit)

Comment 23 Tomas Janousek 2007-01-16 10:56:10 UTC

Ok, I tried it now, the fix I posted has nothing to do with that, but:

It gets stuck quite soon and then, after about a minute or two, the problem
happens (according to the log message) and then it seems to work ok for some time.

These are the versions used:
dovecot-0.99.11-8.1.EL4
nss_ldap-226-17
openldap-2.2.13-6.4E

(I had to modify the pop3 tester to compile with RHEL5 dovecot)

Comment 24 Tomas Janousek 2007-01-16 11:00:40 UTC

And, after trying the second time, it seemed to work for quite a long time but
then, after issuing "service nscd stop", it happened.

Comment 25 Tomas Janousek 2007-01-16 17:24:06 UTC

Forgot to mention that dovecot&nss_ldap&ldap server were running on RHEL4, the
tester on RHEL5. I could test this on rawhide as well, if needed.

Comment 26 Tomas Janousek 2007-02-06 16:08:29 UTC

Oh yeah, got it.

The problem relies in the fact that if an application is linked against
pthread and uses a nss_ldap call and then forks, both processes share the ldap
connection, having no locking mechanism, and bad things happen. This is the
case with dovecot -- dovecot-auth is linked with pthread and uses pam in
forked processes. A race condition causes dovecot-auth to receive a reply that
should have gone to pam.

The direct cause of this is that an assumption that __pthread_once is nonnull
(ldap-nss.c:1048) implies __pthread_atfork being nonnull (ldap-nss.c:504,
bits/libc-lock.h:290) is plain wrong. These two variables have no connection
to each other and each of them becomes non-null at the time linker resolves
them, which happens upon them being called. And it happens upon them being
called in the object that checks for them -- that means calling pthread_atfork
in dovecot has no effect on __pthread_atfork value in nss_ldap and vice versa.

For some not so obvious reason __pthread_once is nonnull at the enter of main
function, __pthread_atfork is null. This means that nss_ldap assumes we have
pthreads working, calls the __libc_atfork (ldap-nss.c:504), which is a noop in
this case, and then has no idea about the forks and such.

The easiest solution would be to help nss_ldap's configure find pthreads
(telling it to -lpthread), which would make nss_ldap use pthreads directly and
avoid such crazy things -- and using those libc internal functions is bad
anyway, but I'm not sure whether we should do it.

Also, we could fix it to chech for both __pthread_once and __pthread_atfork 
but it would not find them and use the pid-comparing method, which is probably 
slower.

I hope this information helps :)

Comment 27 Timo Sirainen 2007-02-06 16:31:27 UTC

Oh, so the bug was nothing like I imagined :)

Is it possible to close the LDAP connection somehow in the forked process? Does
endpwent() do anything?

Comment 28 Tomas Janousek 2007-02-06 18:11:03 UTC

If I didn't overlook anything, endpwent does not close the connection. I found
no way to do it. Seems like this will have to be fixed in nss_ldap, no
workaround this time :)

If anyone wants to play with this, this is a simple reproducer:
http://people.redhat.com/tjanouse/dovecot/154314/ldaptest.c

Comment 29 Tomas Janousek 2007-02-07 15:01:50 UTC

I dug into that a little further and wrote two patches.

The first one is a plain fix of the wrong assumption, leaving the atfork code
unused on current glibc:
http://people.redhat.com/tjanouse/dovecot/154314/nss_ldap-226-pthread.patch

Then I found out the cause of __pthread_atfork being NULL. It's not a dynamic
symbol any more, it's being built into the binaries, because it's just a small
function calling __register_atfork. So the second patch tries to follow this:
http://people.redhat.com/tjanouse/dovecot/154314/nss_ldap-226-pthread2.patch
It's an ugly abuse of glibc internals, but the original code was as well.

There's probably not much benefit though. openldap uses shutdown(2) before
closing the connection so the parent's ldap connection becomes screwed and must
be reopened. This is out of my sight though.

Comment 30 Jakub Jelinek 2007-03-26 08:30:57 UTC

pthread_atfork is defined in libpthread_nonshared.a, so that the handlers are
automatically deregistered when a shared library that contains them is dlclosed,
otherwise if you call pthread_atfork in some shared library, then dlclose it and
then call fork, it would be calling handlers in an already unmapped library,
which can either mean crashes or running some unrelated code (if some other
library has been mmapped in that place instead).

__register_atfork is provided by libc.so (i.e. always available), I think for
nss_ldap the easiest fix would be at configure time check for
libpthread_nonshared.a library and if it exists:
1) don't define pthread_atfork as weak, call it directly whenever pthread_once
!= NULL
2) link libnss_ldap.so.2 with -lpthread_nonshared

Comment 31 Tomas Janousek 2007-04-11 14:15:24 UTC

Jakub, thanks for your hints. I tried it and it works. I don't check for the
pthread_once at runtime though, it doesn't seem to be neccessary -- I just call
the pthread_atfork using pthread_once/__libc_once.

Patches here:
http://people.redhat.com/tjanouse/dovecot/154314/fc7.spec.diff
http://people.redhat.com/tjanouse/dovecot/154314/el4.spec.diff

I also built some packages:
http://people.redhat.com/tjanouse/dovecot/154314/packages/

Comment 32 Tomas Janousek 2007-05-10 13:16:22 UTC

The upstream has accepted my two patches:
 http://people.redhat.com/tjanouse/dovecot/154314/sent_upstream/

Nalin, please update to nss_ldap-256 once it's released.

Comment 33 Norman Gaywood 2007-09-24 12:49:26 UTC

I see nss_ldap-257-3.fc7.i386.rpm in koji. Is this going to make it into F7 testing?

Our F7 dovecot is getting bitten by bug #208528 which, according to a post by
Tomas Janousek on that bug, could be caused by this bug.

Comment 34 Nalin Dahyabhai 2007-09-24 16:11:27 UTC

257-3.fc7 is supposed to be an update for F7, though I may have to hassle Luke
to help sort out my difficulties using the update tool.

Note You need to log in before you can comment on or make changes to this bug.