Bug 432706 - [RHEL4] nscd leaks unix sockets to /var/run/nscd/socket
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: glibc
Version: 4.5
Hardware: All Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Andreas Schwab
QA Contact: Brian Brock
Depends On:
Blocks:
 
Reported: 2008-02-13 17:22 EST by Rafael Ferreira
Modified: 2016-11-24 10:38 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-06-07 01:45:53 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Rafael Ferreira 2008-02-13 17:22:17 EST
Description of problem:


Version-Release number of selected component (if applicable):

[root@onlp205afm03 /]# cat /etc/redhat-release
Red Hat Enterprise Linux AS release 4 (Nahant Update 5)

[root@onlp205afm03 /]# uname -a
Linux onlp205afm03.ols.phoenix.edu 2.6.9-34.0.2.ELsmp #1 SMP Fri Jun 30 10:33:58 EDT 2006 i686 athlon i386 GNU/Linux


[root@onlp205afm03 /]# rpm -q glibc nscd
glibc-2.3.4-2.36
nscd-2.3.4-2.36

[root@onlp205afm03 /]# netstat -a | grep /var/run/nscd/socket | wc -l
1013
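
For tracking this count over time, here is a minimal Python sketch that does roughly what the netstat/grep pipeline above does. It assumes a Linux system where /proc/net/unix lists one socket per line with the bound path as the eighth column; the function name is made up for illustration.

```python
def count_unix_socket_entries(proc_net_unix_text, path="/var/run/nscd/socket"):
    """Count entries in /proc/net/unix-style text that are bound to `path`."""
    count = 0
    # The first line is the column header: Num RefCount Protocol Flags Type St Inode Path
    for line in proc_net_unix_text.splitlines()[1:]:
        fields = line.split(None, 7)  # the path, if present, is the 8th field
        if len(fields) == 8 and fields[7] == path:
            count += 1
    return count


if __name__ == "__main__":
    with open("/proc/net/unix") as f:
        print(count_unix_socket_entries(f.read()))
```

Logging this number every few minutes would show whether the count grows steadily or jumps at particular times.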

We're also using LDAP with nscd for authentication against MS AD.

How reproducible:
Easy. It happened on 10 nodes before we knew what was going on.

Steps to Reproduce:
1. Install nscd-2.3.4-2.36
2. Let it run for a while; it will accumulate a large number of /var/run/nscd/socket unix sockets
3. Eventually, apps that use nscd will start to sporadically get a SIGPIPE
  
Actual results:

Here's an example, an strace of running top:
gettimeofday({1202939020, 392832}, {420, 0}) = 0 
stat64("/proc/self/task", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0 
open("/proc", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3 
fstat64(3, {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0 
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0 
getdents64(3, /* 36 entries */, 1024) = 1016 
getdents64(3, /* 39 entries */, 1024) = 1024 
stat64("/proc/1", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0 
open("/proc/1/stat", O_RDONLY) = 4 
read(4, "1 (init) S 0 0 0 0 -1 4194560 12"..., 1023) = 198 
close(4) = 0 
open("/proc/1/statm", O_RDONLY) = 4 
read(4, "455 128 109 6 0 97 0\n", 1023) = 21 
close(4) = 0 
socket(PF_FILE, SOCK_STREAM, 0) = 4 
fcntl64(4, F_GETFL) = 0x2 (flags O_RDWR) 
fcntl64(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 
connect(4, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = 0 
poll([{fd=4, events=POLLOUT|POLLERR|POLLHUP, revents=POLLOUT|POLLHUP}], 1, 5000) = 1
send(4, "\2\0\0\0\v\0\0\0\7\0\0\0passwd\0\0", 20, MSG_NOSIGNAL) = -1 EPIPE (Broken pipe)
close(4) = 0 
socket(PF_FILE, SOCK_STREAM, 0) = 4 
fcntl64(4, F_GETFL) = 0x2 (flags O_RDWR) 
fcntl64(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 
connect(4, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = 0 
poll([{fd=4, events=POLLOUT|POLLERR|POLLHUP, revents=POLLOUT|POLLHUP}], 1, 5000) = 1
writev(4, [{"\2\0\0\0\1\0\0\0\2\0\0\0", 12}, {"0\0", 2}], 2) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) --- 
ioctl(0, SNDCTL_TMR_CONTINUE or TCSETSF, {B38400 opost isig icanon echo ...}) = 0 
write(1, "\33[52;1H\33[?12l\33[?25h\n", 20) = 20
exit_group(0) = ? 
Process 28266 detached 
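
In the trace above, the first request is sent with MSG_NOSIGNAL and fails with a harmless EPIPE, while the later writev has no such flag and the process receives SIGPIPE. The following Python sketch reproduces the first half on a local socket pair: after the peer closes, poll reports a hang-up and a send with MSG_NOSIGNAL fails with EPIPE. It assumes Linux; the function name and payload are illustrative only, and no real nscd is involved.

```python
import errno
import select
import socket


def probe_closed_peer():
    """Return (poll revents, errno of send) for a socket whose peer has closed."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        b.close()  # stands in for the server side going away
        poller = select.poll()
        poller.register(a.fileno(), select.POLLOUT)
        events = poller.poll(1000)  # timeout in milliseconds
        revents = events[0][1] if events else 0
        try:
            # MSG_NOSIGNAL asks the kernel for an EPIPE error instead of SIGPIPE.
            a.send(b"\x02\x00\x00\x00", socket.MSG_NOSIGNAL)
            err = 0
        except OSError as exc:
            err = exc.errno
        return revents, err
    finally:
        a.close()


if __name__ == "__main__":
    revents, err = probe_closed_peer()
    print("POLLHUP set:", bool(revents & select.POLLHUP))
    print("send errno:", errno.errorcode.get(err, err))
```

A client that checks for POLLHUP after poll, or that sends every request with MSG_NOSIGNAL, can fall back gracefully instead of being killed.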

Expected results:


Additional info:
Comment 1 Arenas Belon, Carlo Marcelo 2008-03-11 15:14:07 EDT
A similar problem was observed on Red Hat Enterprise Linux 5, using LDAP against an
OpenLDAP server.

The difference is that in this case nscd was consuming 100% of one CPU, running in a
busy loop trying to accept connections on the UNIX socket /var/run/nscd/socket, as shown by:

time(NULL)                              = 1205259067
accept(10, 0, NULL)                     = -1 EMFILE (Too many open files)
epoll_wait(11, {{EPOLLRDNORM, {u32=10, u64=10}}}, 100, 29988) = 1
time(NULL)                              = 1205259067
accept(10, 0, NULL)                     = -1 EMFILE (Too many open files)
epoll_wait(11, {{EPOLLRDNORM, {u32=10, u64=10}}}, 100, 29988) = 1
time(NULL)                              = 1205259067
accept(10, 0, NULL)                     = -1 EMFILE (Too many open files)
epoll_wait(11, {{EPOLLRDNORM, {u32=10, u64=10}}}, 100, 29988) = 1
time(NULL)                              = 1205259067
accept(10, 0, NULL)                     = -1 EMFILE (Too many open files)

This happened after it had leaked all 1024 of its file descriptors to socket connections, as shown by:

nscd    10501 nscd    5r   REG        3,2   217016   1038341 /var/db/nscd/passwd
nscd    10501 nscd    6u   REG        3,2   217016   1038342 /var/db/nscd/group
nscd    10501 nscd    7r   REG        3,2   217016   1038342 /var/db/nscd/group
nscd    10501 nscd    8u   REG        3,2   217016   1038340 /var/db/nscd/hosts
nscd    10501 nscd    9r   REG        3,2   217016   1038340 /var/db/nscd/hosts
nscd    10501 nscd   10u  unix 0xe7921280           12076246 /var/run/nscd/socket
nscd    10501 nscd   11r  0000       0,10        0  12076248 eventpoll
nscd    10501 nscd   12u  sock        0,5           12089279 can't identify protocol
nscd    10501 nscd   13u  unix 0xebade480           12077692 socket
nscd    10501 nscd   14u  sock        0,5           12098818 can't identify protocol
nscd    10501 nscd   15u  sock        0,5           12108033 can't identify protocol
nscd    10501 nscd   16u  sock        0,5           12136264 can't identify protocol
nscd    10501 nscd   17u  sock        0,5           12156091 can't identify protocol
nscd    10501 nscd   18u  sock        0,5           12189201 can't identify protocol
..
nscd    10501 nscd 1022u  sock        0,5           35734554 can't identify protocol
nscd    10501 nscd 1023u  unix 0xd6582300          118834655 /var/run/nscd/socket
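
The busy loop in the strace output above can be reproduced in isolation. The Python sketch below is not nscd's code; it assumes Linux, uses a throwaway socket path, and simulates descriptor exhaustion by temporarily lowering the soft RLIMIT_NOFILE to 0. With a connection waiting in the backlog, every epoll wait reports the listening socket as ready and every accept fails with EMFILE, so a loop with no back-off spins.

```python
import errno
import os
import resource
import select
import socket
import tempfile


def demo_emfile_spin(rounds=3):
    """Return (errnos raised by accept, ready-event count per epoll wait)."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "demo.sock")  # hypothetical path, not nscd's
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    listener.bind(path)
    listener.listen(8)
    listener.setblocking(False)

    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.connect(path)  # sits in the backlog until someone accepts it

    ep = select.epoll()
    ep.register(listener.fileno(), select.EPOLLIN)

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    errnos, ready_counts = [], []
    try:
        # Simulate a process that has used up all of its descriptors.
        resource.setrlimit(resource.RLIMIT_NOFILE, (0, hard))
        for _ in range(rounds):
            ready_counts.append(len(ep.poll(1.0)))  # returns at once: still readable
            try:
                conn, _addr = listener.accept()
                conn.close()
            except OSError as exc:
                errnos.append(exc.errno)
    finally:
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
        ep.close()
        client.close()
        listener.close()
        os.unlink(path)
        os.rmdir(tmpdir)
    return errnos, ready_counts


if __name__ == "__main__":
    errs, ready = demo_emfile_spin()
    print([errno.errorcode[e] for e in errs], ready)
```

Common mitigations are to sleep briefly after EMFILE, or to stop polling the listening socket until a descriptor has been closed.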
Comment 2 Ulrich Drepper 2008-08-02 23:58:17 EDT
This doesn't look like a libc problem.  In the original report it seems like nscd is in trouble.  Programs don't get a response.  Comment #1 shows one possible way this can happen.

There are no known reports of nscd not closing descriptors.  And the fact that LDAP is mentioned makes this all the less likely.

We have no other report like this and would need more information.  And this time preferably without the nss_ldap module.

The next RHEL5 update will likely contain some nscd updates based on the current upstream code.  This code has no known issues.
Comment 3 Atro Tossavainen 2008-10-10 03:27:59 EDT
I can confirm this problem. When the problem situation is present, killing nscd makes it go away. Symptoms include not being able to start any new programs because they are SIGPIPE'd and not even being able to log in on the console. There is nothing in the syslog and nothing in dmesg either.  I am using nss_ldap, of course - it's rather hard to get user authorization information from LDAP without doing so.
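
Since nothing shows up in syslog or dmesg, one way to catch the problem early is to watch the daemon's open descriptors directly. This is a rough Python sketch that assumes Linux /proc and enough privileges to read the target's fd directory; the function name and the category labels are made up for illustration.

```python
import collections
import os


def summarize_fds(pid):
    """Count a process's open descriptors by kind, using /proc/<pid>/fd."""
    counts = collections.Counter()
    fd_dir = "/proc/%d/fd" % pid
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        if target.startswith("socket:"):
            counts["socket"] += 1
        elif target.startswith("pipe:"):
            counts["pipe"] += 1
        elif target.startswith("anon_inode:"):
            counts["anon_inode"] += 1  # e.g. an epoll descriptor
        else:
            counts["file"] += 1
    return counts


if __name__ == "__main__":
    # Inspects this script's own process; pass nscd's pid (as root) instead.
    print(dict(summarize_fds(os.getpid())))
```

A "socket" count that keeps climbing toward the process's descriptor limit would match the leak described in comment 1.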
Comment 4 Atro Tossavainen 2008-10-10 03:28:24 EDT
I should also say that I've had this occur on both x86 and x86_64.
Comment 6 Chris Ward 2009-04-01 04:17:54 EDT
Support, Customers, 

I have uploaded test packages that should fix this issue below. These packages
- if the issue reported can be confirmed as resolved - will be included in the
upcoming 4.8 release.

http://people.redhat.com/cward/4.8/nss_ldap/

The latest 4.8 Beta can be downloaded from RHN @ 
https://rhn.redhat.com/network/software/download_isos_full.pxt

Please test and provide us with feedback ASAP.
