Bug 221572 - dovecot stops working with "Too many open files"
Summary: dovecot stops working with "Too many open files"
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: dovecot
Version: 6
Hardware: i386
OS: Linux
medium
medium
Target Milestone: ---
Assignee: Tomas Janousek
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2007-01-05 11:04 UTC by Pytela, Zdenek
Modified: 2014-01-21 22:56 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-04-06 14:30:37 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)
config file diff from dist to used (1.61 KB, text/x-patch)
2007-01-08 15:40 UTC, Pytela, Zdenek
no flags Details

Description Pytela, Zdenek 2007-01-05 11:04:33 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; cs-CZ; rv:1.8.0.9) Gecko/20061219 Fedora/1.5.0.9-1.fc6 Firefox/1.5.0.9 pango-text

Description of problem:
Dovecot stopped working with these messages:
dovecot-auth: PAM unable to dlopen(/lib/security/$ISA/pam_env.so)
dovecot-auth: PAM [error: /lib/security/../../lib/security/pam_env.so: cannot open shared object file: Too many open files]
dovecot-auth: PAM adding faulty module: /lib/security/$ISA/pam_env.so
Restart solved the problem; but it seems to be related to recent (2 days ago, not now) upgrade of pam. Should be either fixed in dovecot or make pam restart necessary services, am I not totally wrong. Other services (sshd, crond, procmail, spamc) worked well all the time.

Version-Release number of selected component (if applicable):
dovecot-1.0-1.rc15.fc6 pam-0.99.6.2-3.9.fc6

How reproducible:
Always


Steps to Reproduce:
not sure if upgrading pam causes the problem, but it happend after

Actual Results:


Expected Results:


Additional info:

Comment 1 Tomas Janousek 2007-01-05 12:07:47 UTC
I was unable to reproduce this.

Could you check whether the number of open file descriptors increases (ls
/proc/`pidof dovecot-auth`/fd | wc -l)? Does it happen after you restarted it or
it happened just once after the pam upgrade?

Also, you wrote it's reproducible always, could you be more specific about the
steps to reproduce? Your config files of dovecot/pam, if you altered the default
settings, for example.

Comment 2 Pytela, Zdenek 2007-01-05 16:46:34 UTC
Updated information: problem reappeared after a few hours, seems to be problem
of dovecot or network, not pam. Until now it wasn't observed, but 2 weeks ago we
had upgraded fc5->fc6 with the version change 1.0-0.beta8.2.fc5 to 1.0-1.rc15.fc6.

Yes, the number of opened fd continually (with some fluctuation) increases, with
impact to fs.file-nr too. Almost all of the stalled descriptors are sockets,
fuser says they are held by dovecot-auth.

Reproducibility is not the right term here, it just _is_ in my situation and
dovecot runs only on one of my servers. Config is almost default but ssl and was
not changed since previous version.

Comment 3 Timo Sirainen 2007-01-05 19:01:44 UTC
Each imap/pop3-login process keeps a connection open to dovecot-auth in rc15. These are the only 
sockets in dovecot-auth+PAM configuration that I can think of. Do you have a lot of imap/pop3-login 
processes running?

One problem in rc15 was that after logging in with SSL, each process left the connection open to dovecot-
auth although they didn't use it anymore. So if you have a lot of concurrent SSL sessions, I guess this 
could happen. It's already fixed in rc16 (http://dovecot.org/list/dovecot-cvs/2007-January/007326.html). 
I'm not aware of any actual leaks though.


Comment 4 Pytela, Zdenek 2007-01-08 14:10:24 UTC
Today's update from dovecot-1.0-1.rc15.fc6.i386 to dovecot-1.0-1.1.rc15.fc6.i386
did not help, rc16 is still not available in updates or updates-testing.
But it doesn't look like a problem with SSL, today there was 20000 logins and
only 120 of them were secured by tls. Dovecot continues to stop after number of
open files reach 1024, what is several times a day.


Comment 5 Timo Sirainen 2007-01-08 15:07:18 UTC
Do you have login_process_per_connection=yes (default)? You could try if setting it to "no" helps (although 
then you probably want to increase login_processes_count as well). I guess the number of imap-login and 
pop3-login processes doesn't increase to 1000?

Wonder if there's any way to check what process owns the other side of the socket? That'd be really useful 
in figuring out problems like this. :)

I'm anyway guessing that dovecot-auth for some reason doesn't notice that the process owning the other 
side of the socket died. If it doesn't leak one fd per each connection, then it means it happens only 
sometimes..

Comment 6 Pytela, Zdenek 2007-01-08 15:40:32 UTC
Created attachment 145062 [details]
config file diff from dist to used

Hanging *-login processes was problem of some 0.99 dovecot and didn't happen
any more after upgrading to 1.0-0.beta2.7 which was in FC5. (This was
detectable much easily.)
I copied config file from the last package and made only necessary changes plus
those you suggested. But the situation seems to be the same, number of sockets
raises.

Comment 7 Timo Sirainen 2007-01-08 15:53:13 UTC
You're sure the leaked fds are sockets, and not pipes? PAM is handled by forking a new process and 
talking to it via pipe (so PAM can't cause the leaks). The only sockets should be for login processes, but if 
it leaks with login_process_per_connection=no there shouldn't ever be any more created sockets than the 
number of login processes..

Maybe you could strace the dovecot-auth for a while, then grep connect/accept/pipe/socketpair/close 
calls and see what causes the leaks?


Comment 8 Pytela, Zdenek 2007-01-08 15:56:45 UTC
Except 0-9 each fd is for a socket. I'm gouing to give strace a chance.

Comment 9 Tomas Janousek 2007-01-08 16:29:39 UTC
I've just packaged 1.0-rc17 for rawhide (dovecot-1.0-2.rc17.fc7), so you could
probably download the SRPM and give it a try, but it probably isn't fixed there
anyway. I'd like to help, but I was unable to reproduce it even with fc5 config
and SSL connections.

Timo: I'm glad you're helping with this, thx.

Comment 10 Pytela, Zdenek 2007-01-08 16:41:39 UTC
Most of non-zero values look like this:
waitpid(-1, 0xbff0e26c, WNOHANG)        = -1 ECHILD (No child processes)

But there are another ones, ending with ENOTCONN, EBADF, EINPROGRESS. My current
tip leads to ldap:

connect(102, {sa_family=AF_INET, sin_port=htons(389), 
sin_addr=inet_addr("147.251.96.1")}, 16) = -1 EINPROGRESS (Operation now in
progress)
write(102, "\25\3\1\0 \10T[r\306\270Nx\200L\355 ]\357\243e\271\244"..., 37) = -1
EPIPE (Broken pipe)
getpeername(102, 0xbff0c4c4, [128])     = -1 ENOTCONN (Transport endpoint is not
connected)
shutdown(102, 2 /* send and receive */) = -1 ENOTCONN (Transport endpoint is not
connected)
close(102)                              = -1 EBADF (Bad file descriptor)

I'll try that rawhide package and inspect ldap (which seems to be fully working
and accepting all requests).


Comment 11 Timo Sirainen 2007-01-08 16:54:36 UTC
Oh. Since you were only talking about PAM, I completely forgot the userdb. So you're using nss_ldap? That 
could easily be the problem. It's caused security problems already, so I wouldn't suggest using it anyway 
(http://wiki.dovecot.org/AuthDatabase/Passwd).


Comment 12 Pytela, Zdenek 2007-01-09 11:41:57 UTC
During the night there were smaller amount of connections and the number of open
sockets were near the value of 150 for a long time. Since working hours started,
the number raised up to 628 and this value lasts for a few hours! No change up
or down, I don't understand anything, dovecot should have been hung at least 3
hours ago. There were many pop3(s) and imap(s) connections. Strace shows just
those ECHILDs and no other error any more.

I was talking about pam because I thought it was pam's business to do
authentication/authorization, why any application does it itself? If it bypasses
pam settings, things may go bad. Our users exist mostly in ldap, I didn't know
about https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=154314
until now, but if I had to select between dovecot or ldap, I'd select job change.
Now I have the last dovecot-1.0-1.rc15 rebuilt, should I try it or better be
happy with current state?

Comment 13 Timo Sirainen 2007-01-09 11:56:58 UTC
You wouldn't have to drop LDAP and you wouldn't even need to drop PAM. You could just not use 
nss_ldap. Meaning that instead of using "userdb passwd", you could set up "userdb ldap" with correct 
settings to look up the user information from LDAP after PAM has done the authentication.

The reason why a lot of software uses PAM is because it's easier to do that way than reimplement all the 
functionality of PAM modules. I don't like PAM all that much.

I think anyway that this bug is invalid for Dovecot, and the real problem is nss_ldap that's causing the 
leaks. Perhaps I should some day make NSS lookups also be done in separate processes so they wouldn't 
break dovecot-auth..

Comment 14 Pytela, Zdenek 2007-01-10 09:30:16 UTC
OK. I'll accept (of course) this issue to be closed/invalid/somebody else's
problem. Thanks to anybody who tried to help me.

My current situation is with rawhide's dovecot-1.0-2.rc17 rebuilt for fc6, works
fine, although I didn't succeeded to set "userdp ldap" yet so that dovecot was
gratified - this is certainly my problem, even that documentation is quite poor
in this special point; I'll try primary documentation (those files with .c
extension).

I don't like bugs which appear out-of the blue, like this case was, when nothing
changed and suddenly application started to behave bad. Even less I like
situations when nothing changes and the bug disappears like yesterday and today.
There is more work to be done.

Comment 15 Tomas Janousek 2007-04-06 13:40:47 UTC
Could you check if it's ok with the current fc6 package
(dovecot-1.0-2.rc28.fc6)? Should I really close this as if it never existed?

Comment 16 Pytela, Zdenek 2007-04-06 14:17:44 UTC
The bug hasn't appeared since 2007-01-10. I had always used updated fc6 package
with the exception of the last one (problems with people not in ldap, what is
another game). I don't find it being a bug that time.

Comment 17 Tomas Janousek 2007-04-06 14:30:37 UTC
Ok, closing.


Note You need to log in before you can comment on or make changes to this bug.