From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; cs-CZ; rv:1.8.0.9) Gecko/20061219 Fedora/1.5.0.9-1.fc6 Firefox/1.5.0.9 pango-text

Description of problem:
Dovecot stopped working with these messages:

dovecot-auth: PAM unable to dlopen(/lib/security/$ISA/pam_env.so)
dovecot-auth: PAM [error: /lib/security/../../lib/security/pam_env.so: cannot open shared object file: Too many open files]
dovecot-auth: PAM adding faulty module: /lib/security/$ISA/pam_env.so

A restart solved the problem, but it seems to be related to a recent upgrade of pam (2 days ago, not now). This should either be fixed in dovecot, or the pam update should restart the necessary services, unless I am totally wrong. Other services (sshd, crond, procmail, spamc) worked well the whole time.

Version-Release number of selected component (if applicable):
dovecot-1.0-1.rc15.fc6
pam-0.99.6.2-3.9.fc6

How reproducible:
Always

Steps to Reproduce:
Not sure whether upgrading pam causes the problem, but it happened afterwards.

Actual Results:

Expected Results:

Additional info:
I was unable to reproduce this. Could you check whether the number of open file descriptors increases (ls /proc/`pidof dovecot-auth`/fd | wc -l)? Does it happen again after you restart dovecot, or did it happen just once after the pam upgrade? Also, you wrote that it's always reproducible; could you be more specific about the steps to reproduce? It would also help to see your dovecot/pam config files, if you altered the default settings.
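A rough way to repeat this check over time, and to split the count by descriptor type, is sketched below. It assumes a Linux /proc filesystem, where each entry in /proc/<pid>/fd is a symlink whose target looks like "socket:[inode]" or "pipe:[inode]". The function names classify_fd_targets and fd_summary are made up for this illustration and are not part of dovecot.

```python
import os
from collections import Counter
from typing import Iterable


def classify_fd_targets(targets: Iterable[str]) -> Counter:
    """Count descriptors by kind, given the symlink targets from /proc/<pid>/fd."""
    counts: Counter = Counter()
    for target in targets:
        if target.startswith("socket:"):
            counts["socket"] += 1
        elif target.startswith("pipe:"):
            counts["pipe"] += 1
        else:
            # Regular files, devices, and anything else land here.
            counts["other"] += 1
    return counts


def fd_summary(pid: int) -> Counter:
    """Read /proc/<pid>/fd (Linux only) and summarise the open descriptors."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return classify_fd_targets(targets)
```

Calling fd_summary with the pid of dovecot-auth every few minutes (as root, or as the user owning the process) should show whether the socket count keeps growing while the pipe count stays flat.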
Updated information: the problem reappeared after a few hours, so it seems to be a problem of dovecot or the network, not pam. It had not been observed before, but 2 weeks ago we upgraded fc5->fc6, which changed the version from 1.0-0.beta8.2.fc5 to 1.0-1.rc15.fc6. Yes, the number of open fds increases continually (with some fluctuation), with an impact on fs.file-nr too. Almost all of the stalled descriptors are sockets; fuser says they are held by dovecot-auth. Reproducibility is not the right term here, it just _is_ in my situation, and dovecot runs on only one of my servers. The config is almost default except for SSL and has not changed since the previous version.
Each imap/pop3-login process keeps a connection open to dovecot-auth in rc15. These are the only sockets in dovecot-auth+PAM configuration that I can think of. Do you have a lot of imap/pop3-login processes running? One problem in rc15 was that after logging in with SSL, each process left the connection open to dovecot-auth although they didn't use it anymore. So if you have a lot of concurrent SSL sessions, I guess this could happen. It's already fixed in rc16 (http://dovecot.org/list/dovecot-cvs/2007-January/007326.html). I'm not aware of any actual leaks though.
Today's update from dovecot-1.0-1.rc15.fc6.i386 to dovecot-1.0-1.1.rc15.fc6.i386 did not help, and rc16 is still not available in updates or updates-testing. But it doesn't look like a problem with SSL: today there were 20000 logins and only 120 of them were secured by TLS. Dovecot keeps stopping once the number of open files reaches 1024, which happens several times a day.
Do you have login_process_per_connection=yes (default)? You could try if setting it to "no" helps (although then you probably want to increase login_processes_count as well). I guess the number of imap-login and pop3-login processes doesn't increase to 1000? Wonder if there's any way to check what process owns the other side of the socket? That'd be really useful in figuring out problems like this. :) I'm anyway guessing that dovecot-auth for some reason doesn't notice that the process owning the other side of the socket died. If it doesn't leak one fd per each connection, then it means it happens only sometimes..
Created attachment 145062: config file diff from dist to used. Hanging *-login processes were a problem in some 0.99 dovecot and no longer happened after upgrading to 1.0-0.beta2.7, which was in FC5. (That was much easier to detect.) I copied the config file from the latest package and made only the necessary changes plus those you suggested. But the situation seems to be the same: the number of sockets keeps rising.
You're sure the leaked fds are sockets, and not pipes? PAM is handled by forking a new process and talking to it via pipe (so PAM can't cause the leaks). The only sockets should be for login processes, but if it leaks with login_process_per_connection=no there shouldn't ever be any more created sockets than the number of login processes.. Maybe you could strace the dovecot-auth for a while, then grep connect/accept/pipe/socketpair/close calls and see what causes the leaks?
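The grep suggested above can be roughly automated. The sketch below reads strace output and reports descriptors that were created but never successfully closed within the captured window. It assumes the usual strace line shape "name(args) = retval", optionally prefixed by a pid, and it only tracks a handful of syscalls. The function name find_unclosed_fds and the sample lines in the usage note are made up for illustration.

```python
import re
from typing import Dict, Iterable

# Syscalls whose successful return value is a newly created descriptor.
_OPENERS = ("socket", "accept", "open", "dup")

# Matches e.g. 'close(102) = 0', optionally prefixed by '[pid 123] ' or '123 '.
_LINE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\((.*)\)\s+=\s+(-?\d+)")


def find_unclosed_fds(lines: Iterable[str]) -> Dict[int, str]:
    """Return {fd: creating syscall} for descriptors still open at the end."""
    open_fds: Dict[int, str] = {}
    for line in lines:
        match = _LINE.match(line.strip())
        if not match:
            continue
        name, args, ret = match.group(1), match.group(2), int(match.group(3))
        if ret < 0:
            continue  # failed calls neither create nor release a descriptor
        if name in _OPENERS:
            open_fds[ret] = name
        elif name in ("pipe", "socketpair"):
            # Both new descriptors are printed in brackets, e.g. pipe([13, 14]) = 0
            pair = re.search(r"\[(\d+),\s*(\d+)\]", args)
            if pair:
                for fd in pair.groups():
                    open_fds[int(fd)] = name
        elif name == "close" and args.strip().isdigit():
            open_fds.pop(int(args.strip()), None)
    return open_fds
```

For example, feeding it the lines 'accept(7, {sa_family=AF_FILE, path=@}, [2]) = 12' and 'close(12) = 0' leaves nothing behind, while dropping the close line reports fd 12 as created by accept. Descriptors that are long-lived by design will also show up, so the output is only a starting point for spotting which syscall the leaked fds come from.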
Except for 0-9, each fd is a socket. I'm going to give strace a chance.
I've just packaged 1.0-rc17 for rawhide (dovecot-1.0-2.rc17.fc7), so you could probably download the SRPM and give it a try, but it probably isn't fixed there anyway. I'd like to help, but I was unable to reproduce it even with fc5 config and SSL connections. Timo: I'm glad you're helping with this, thx.
Most of the non-zero return values look like this:

waitpid(-1, 0xbff0e26c, WNOHANG) = -1 ECHILD (No child processes)

But there are others, ending with ENOTCONN, EBADF, EINPROGRESS. My current guess points to LDAP:

connect(102, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("147.251.96.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
write(102, "\25\3\1\0 \10T[r\306\270Nx\200L\355 ]\357\243e\271\244"..., 37) = -1 EPIPE (Broken pipe)
getpeername(102, 0xbff0c4c4, [128]) = -1 ENOTCONN (Transport endpoint is not connected)
shutdown(102, 2 /* send and receive */) = -1 ENOTCONN (Transport endpoint is not connected)
close(102) = -1 EBADF (Bad file descriptor)

I'll try that rawhide package and inspect LDAP (which seems to be fully working and accepting all requests).
Oh. Since you were only talking about PAM, I completely forgot the userdb. So you're using nss_ldap? That could easily be the problem. It's caused security problems already, so I wouldn't suggest using it anyway (http://wiki.dovecot.org/AuthDatabase/Passwd).
During the night there were fewer connections and the number of open sockets stayed near 150 for a long time. Since working hours started, the number has risen to 628 and has stayed at that value for a few hours! No change up or down. I don't understand it at all; dovecot should have hung at least 3 hours ago. There were many pop3(s) and imap(s) connections. Strace shows just those ECHILDs and no other errors any more. I was talking about pam because I thought it was pam's business to do authentication/authorization, so why would any application do it itself? If it bypasses pam settings, things may go bad. Our users exist mostly in LDAP. I didn't know about https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=154314 until now, but if I had to choose between dovecot and LDAP, I'd choose a job change. I now have the latest dovecot-1.0-1.rc15 rebuilt; should I try it, or should I rather be happy with the current state?
You wouldn't have to drop LDAP and you wouldn't even need to drop PAM. You could just not use nss_ldap. Meaning that instead of using "userdb passwd", you could set up "userdb ldap" with correct settings to look up the user information from LDAP after PAM has done the authentication. The reason why a lot of software uses PAM is because it's easier to do that way than reimplement all the functionality of PAM modules. I don't like PAM all that much. I think anyway that this bug is invalid for Dovecot, and the real problem is nss_ldap that's causing the leaks. Perhaps I should some day make NSS lookups also be done in separate processes so they wouldn't break dovecot-auth..
OK. I'll accept (of course) this issue being closed as invalid/somebody else's problem. Thanks to everybody who tried to help me. My current setup is rawhide's dovecot-1.0-2.rc17 rebuilt for fc6, and it works fine, although I haven't yet succeeded in setting up "userdb ldap" in a way dovecot is happy with. That is certainly my problem, even though the documentation is quite poor on this particular point; I'll try the primary documentation (those files with the .c extension). I don't like bugs which appear out of the blue, as in this case, when nothing changed and the application suddenly started to misbehave. I like even less situations where nothing changes and the bug disappears, as it did yesterday and today. There is more work to be done.
Could you check if it's ok with the current fc6 package (dovecot-1.0-2.rc28.fc6)? Should I really close this as if it never existed?
The bug hasn't appeared since 2007-01-10. I have always used the updated fc6 package, with the exception of the last one (problems with people who are not in LDAP, which is another matter). I no longer consider it a bug.
Ok, closing.