Hide Forgot
From Bugzilla Helper: User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.6) Gecko/20050225 Firefox/1.0.1 Description of problem: Second time already when I hear nss_ldap is replying with wrong results and causing peoples' mails to be shown to wrong people: http://www.dovecot.org/list/dovecot/2005-March/006345.html http://www.dovecot.org/list/dovecot/2005-April/006859.html Something should really be done about this. At the very least I'm adding a check to make sure getpwnam() returns the same user name that is being requested, and if not put out some huge warnings about something being broken.. Version-Release number of selected component (if applicable): How reproducible: Sometimes Steps to Reproduce: Install Dovecot with nss_pam enabled. Try logging in enough times with different users and see how at some point it shows you wrong user's mailbox. Additional info:
Thanks for the notification Timo. In order to reproduce this, do you actually have to be using nss_ldap for user authentication? Any idea how to valgrind this code path?
On the system where this happens, what does the PAM configuration look like?
Hi Josh Burley, we need information about your PAM configuration in order to help diagnose and fix this problem. Could you please help us?
Sure... there's nothing fancy in there. It's simply the default /etc/ldap.conf that was installed, and I updated it with my LDAP tree's structure, and the Manager to bind with. What information are you looking for?
I'm pretty sure it's nss_ldap specific problem. Never heard of it happening with anything else. And the code in Dovecot that uses it is really simple as well. Just calling getpwnam() multiple times in a process with no threads. At some point it then returns some previously authenticated user's data. Perhaps LDAP query fails and it's not checking the error value properly, and just returns the struct uninitialized (and filled with previous lookup's data)? No idea how to debug this, I'm not using Fedora or nss_ldap myself..
Oh, and does PAM affect this somehow? Dovecot is doing PAM lookups in separate short-living processes, so PAM has never been called before nss_ldap lookups within same process.
I was trying to investigate it but I has been unable to reproduce it. The PAM calls shouldn't really matter as they are called in forked child process. I've made a testing binary which called getpwnam randomly on various user accounts stored in LDAP and it always returned consistent results. Perhaps more information would be needed - the LDAP server vendor and version, what kind of network link was between the LDAP server and the Dovecot server. Exact versions of nss_ldap, glibc, openldap and dovecot packages...
ping Josh
Maybe it can be only reproduced with Dovecot. I'm not good at setting up a working LDAP server (or nss_ldap), but I can give an easy way to test it: - Create 100 or so test accounts to LDAP server named user-0 user-1 etc. - Give each one a home directory which ends with the user's name - Patch Dovecot's POP3 server with attached patch - Run attached pop3test binary. It should print an error whenever home directory's ending doesn't match the user which tried to log in.
Created attachment 114416 [details] pop3 tester
Created attachment 114417 [details] patch to Dovecot's pop3 code
That indeed allows reproducing the bug (on a rawhide-ish system), but only sometimes; most of the time the tester very soon stops, waiting for dovecot's response to PASS; I don't know what dovecot is waiting for ( (getent passwd) replies instantly, so it probably isn't the LDAP server), or why it sometimes works fine. I have also attempted to reproduce the "same" situation by creating 100 processes, each calling getpwnam(). This found no inconsistencies at all.
Can you check if the patch in bug #154315 makes Dovecot notice the problem? I'd just like to verify that this really is nss_ldap problem and not a Dovecot problem which for some reason has only been noticed with nss_ldap. I don't know why the tester would stop.. It creates multiple connections anyway, they all stop? It should slow down if the given password is invalid, but it shouldn't completely stop even then.
Created attachment 114944 [details] bzip2'ed syslog (*.*) during the test It is still not as deterministic as I'd like, but adding the debugging patch makes it much more likely to happen. This seems to point outside of dovecot; maybe someone can decipher the logs faster than me.
This issue has started to get stale. Nalin can you take a look at some point in the near future.
I heard a couple of days ago that this happens after nscd crashes. Or maybe loses LDAP connection for whatever reason. Looking at the log above there are also some LDAP connection errors before the Dovecot error.
nscd sounds like a not completely unlikely reason... Is the problem reproducible with nscd turned off? Anyhow, i will apply the check from bug 154315 in a bit, so at least dovecot should cope with it somewhat.
Fedora Core 3 is now maintained by the Fedora Legacy project for security updates only. If this problem is a security issue, please reopen and reassign to the Fedora Legacy product. If it is not a security issue and hasn't been resolved in the current FC5 updates or in the FC6 test release, reopen and change the version to match. Thank you!
Moving forward to devel, until/if we figure out why this is/was happening, I'm not going to rule out that it's still a problem.
I've build packages that could probably resolve the stopping. http://people.redhat.com/tjanouse/dovecot/154314-test1/ Could anyone try it with them? (these have #222110 solved, so it would probably fail much earlier than the stop could occur, so it may be necessary to tweak the spec file a little bit)
Ok, I tried it now, the fix I posted has nothing to do with that, but: It gets stuck quite soon and then, after about a minute or two, the problem happens (according to the log message) and then it seems to work ok for some time. These are the versions used: dovecot-0.99.11-8.1.EL4 nss_ldap-226-17 openldap-2.2.13-6.4E (I had to modify the pop3 tester to compile with RHEL5 dovecot)
And, after trying the second time, it seemed to work for quite a long time but then, after issuing "service nscd stop", it happened.
Forgot to mention that dovecot&nss_ldap&ldap server were running on RHEL4, the tester on RHEL5. I could test this on rawhide as well, if needed.
Oh yeah, got it. The problem relies in the fact that if an application is linked against pthread and uses a nss_ldap call and then forks, both processes share the ldap connection, having no locking mechanism, and bad things happen. This is the case with dovecot -- dovecot-auth is linked with pthread and uses pam in forked processes. A race condition causes dovecot-auth to receive a reply that should have gone to pam. The direct cause of this is that an assumption that __pthread_once is nonnull (ldap-nss.c:1048) implies __pthread_atfork being nonnull (ldap-nss.c:504, bits/libc-lock.h:290) is plain wrong. These two variables have no connection to each other and each of them becomes non-null at the time linker resolves them, which happens upon them being called. And it happens upon them being called in the object that checks for them -- that means calling pthread_atfork in dovecot has no effect on __pthread_atfork value in nss_ldap and vice versa. For some not so obvious reason __pthread_once is nonnull at the enter of main function, __pthread_atfork is null. This means that nss_ldap assumes we have pthreads working, calls the __libc_atfork (ldap-nss.c:504), which is a noop in this case, and then has no idea about the forks and such. The easiest solution would be to help nss_ldap's configure find pthreads (telling it to -lpthread), which would make nss_ldap use pthreads directly and avoid such crazy things -- and using those libc internal functions is bad anyway, but I'm not sure whether we should do it. Also, we could fix it to chech for both __pthread_once and __pthread_atfork but it would not find them and use the pid-comparing method, which is probably slower. I hope this information helps :)
Oh, so the bug was nothing like I imagined :) Is it possible to close the LDAP connection somehow in the forked process? Does endpwent() do anything?
If I didn't overlook anything, endpwent does not close the connection. I found no way to do it. Seems like this will have to be fixed in nss_ldap, no workaround this time :) If anyone wants to play with this, this is a simple reproducer: http://people.redhat.com/tjanouse/dovecot/154314/ldaptest.c
I dug into that a little further and wrote two patches. The first one is a plain fix of the wrong assumption, leaving the atfork code unused on current glibc: http://people.redhat.com/tjanouse/dovecot/154314/nss_ldap-226-pthread.patch Then I found out the cause of __pthread_atfork being NULL. It's not a dynamic symbol any more, it's being built into the binaries, because it's just a small function calling __register_atfork. So the second patch tries to follow this: http://people.redhat.com/tjanouse/dovecot/154314/nss_ldap-226-pthread2.patch It's an ugly abuse of glibc internals, but the original code was as well. There's probably not much benefit though. openldap uses shutdown(2) before closing the connection so the parent's ldap connection becomes screwed and must be reopened. This is out of my sight though.
pthread_atfork is defined in libpthread_nonshared.a, so that the handlers are automatically deregistered when a shared library that contains them is dlclosed, otherwise if you call pthread_atfork in some shared library, then dlclose it and then call fork, it would be calling handlers in an already unmapped library, which can either mean crashes or running some unrelated code (if some other library has been mmapped in that place instead). __register_atfork is provided by libc.so (i.e. always available), I think for nss_ldap the easiest fix would be at configure time check for libpthread_nonshared.a library and if it exists: 1) don't define pthread_atfork as weak, call it directly whenever pthread_once != NULL 2) link libnss_ldap.so.2 with -lpthread_nonshared
Jakub, thanks for your hints. I tried it and it works. I don't check for the pthread_once at runtime though, it doesn't seem to be neccessary -- I just call the pthread_atfork using pthread_once/__libc_once. Patches here: http://people.redhat.com/tjanouse/dovecot/154314/fc7.spec.diff http://people.redhat.com/tjanouse/dovecot/154314/el4.spec.diff I also built some packages: http://people.redhat.com/tjanouse/dovecot/154314/packages/
The upstream has accepted my two patches: http://people.redhat.com/tjanouse/dovecot/154314/sent_upstream/ Nalin, please update to nss_ldap-256 once it's released.
I see nss_ldap-257-3.fc7.i386.rpm in koji. Is this going to make it into F7 testing? Our F7 dovecot is getting bitten by bug #208528 which, according to a post by Tomas Janousek on that bug, could be caused by this bug.
257-3.fc7 is supposed to be an update for F7, though I may have to hassle Luke to help sort out my difficulties using the update tool.