Bug 432706

Summary: [RHEL4] nscd leaks unix sockets to /var/run/nscd/socket
Product: Red Hat Enterprise Linux 4
Reporter: Rafael Ferreira <rafael.ferreira>
Component: glibc
Assignee: Andreas Schwab <schwab>
Status: CLOSED NOTABUG
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: low
Version: 4.5
CC: cward, drepper, fweimer, jakub, jbastian, jwest, linux_support, redhat-bugzilla, tao
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-06-07 05:45:53 UTC

Description Rafael Ferreira 2008-02-13 22:22:17 UTC
Description of problem:


Version-Release number of selected component (if applicable):

[root@onlp205afm03 /]# cat /etc/redhat-release
Red Hat Enterprise Linux AS release 4 (Nahant Update 5)

[root@onlp205afm03 /]# uname -a
Linux onlp205afm03.ols.phoenix.edu 2.6.9-34.0.2.ELsmp #1 SMP Fri Jun 30 10:33:58 EDT 2006 i686 athlon i386 GNU/Linux


[root@onlp205afm03 /]# rpm -q glibc nscd
glibc-2.3.4-2.36
nscd-2.3.4-2.36

[root@onlp205afm03 /]# netstat -a | grep /var/run/nscd/socket | wc -l
1013

We're also using ldap with nscd for authentication against MS AD. 

How reproducible:
Easy... it happened on 10 nodes before we knew what was going on

Steps to Reproduce:
1. Install nscd-2.3.4-2.36.
2. Let it run for a while; it accumulates a large number of /var/run/nscd/socket unix sockets.
3. Eventually, applications that use nscd start to receive SIGPIPE sporadically.
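
The build-up described in step 2 can be tracked without netstat by reading /proc/net/unix, which lists every unix socket along with its bound path. The following is a minimal sketch, assuming Linux and Python 3; the function name count_unix_sockets is hypothetical and is not part of nscd or glibc.

```python
def count_unix_sockets(proc_net_unix_text, path="/var/run/nscd/socket"):
    """Count unix sockets bound to `path` in the text of /proc/net/unix.

    Each data line has the columns
    Num RefCount Protocol Flags Type St Inode [Path];
    sockets without a bound path have only seven columns.
    """
    count = 0
    for line in proc_net_unix_text.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 7)
        if len(fields) == 8 and fields[7].strip() == path:
            count += 1
    return count
```

To use it on a live system, pass in the contents of /proc/net/unix, for example open("/proc/net/unix").read(). Sampling the count at intervals shows whether it keeps growing, as the 1013 entries reported above suggest.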
  
Actual results:

Here is an example, an strace of top:

gettimeofday({1202939020, 392832}, {420, 0}) = 0 
stat64("/proc/self/task", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0 
open("/proc", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3 
fstat64(3, {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0 
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0 
getdents64(3, /* 36 entries */, 1024) = 1016 
getdents64(3, /* 39 entries */, 1024) = 1024 
stat64("/proc/1", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0 
open("/proc/1/stat", O_RDONLY) = 4 
read(4, "1 (init) S 0 0 0 0 -1 4194560 12"..., 1023) = 198 
close(4) = 0 
open("/proc/1/statm", O_RDONLY) = 4 
read(4, "455 128 109 6 0 97 0\n", 1023) = 21 
close(4) = 0 
socket(PF_FILE, SOCK_STREAM, 0) = 4 
fcntl64(4, F_GETFL) = 0x2 (flags O_RDWR) 
fcntl64(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 
connect(4, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = 0 
poll([{fd=4, events=POLLOUT|POLLERR|POLLHUP, revents=POLLOUT|POLLHUP}], 1, 5000) = 1
send(4, "\2\0\0\0\v\0\0\0\7\0\0\0passwd\0\0", 20, MSG_NOSIGNAL) = -1 EPIPE (Broken pipe)
close(4) = 0 
socket(PF_FILE, SOCK_STREAM, 0) = 4 
fcntl64(4, F_GETFL) = 0x2 (flags O_RDWR) 
fcntl64(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 
connect(4, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = 0 
poll([{fd=4, events=POLLOUT|POLLERR|POLLHUP, revents=POLLOUT|POLLHUP}], 1, 5000) = 1
writev(4, [{"\2\0\0\0\1\0\0\0\2\0\0\0", 12}, {"0\0", 2}], 2) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) --- 
ioctl(0, SNDCTL_TMR_CONTINUE or TCSETSF, {B38400 opost isig icanon echo ...}) = 0 
write(1, "\33[52;1H\33[?12l\33[?25h\n", 20) = 20
exit_group(0) = ? 
Process 28266 detached 
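
In the trace above, the first request is sent with MSG_NOSIGNAL and only fails with EPIPE, while the later writev() carries no such flag, so the kernel also raises SIGPIPE and top exits. The difference can be reproduced with a socket pair whose peer is already closed. This is a minimal sketch, assuming Linux and Python 3; the helper names are hypothetical and the code does not reproduce the glibc nscd client.

```python
import errno
import signal
import socket
import subprocess
import sys


def send_to_closed_peer():
    """send() with MSG_NOSIGNAL to a closed peer; return the resulting errno."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    b.close()  # the peer goes away, like the nscd side in the trace
    try:
        a.send(b"x", socket.MSG_NOSIGNAL)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        a.close()


# Child program: restore the default SIGPIPE action (Python ignores SIGPIPE
# at startup) and write to the closed peer without MSG_NOSIGNAL.
CHILD = """
import os, signal, socket
signal.signal(signal.SIGPIPE, signal.SIG_DFL)
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
b.close()
os.write(a.fileno(), b"x")
"""


def run_child_without_nosignal():
    """Run the child and return its exit status (negative means killed by a signal)."""
    return subprocess.run([sys.executable, "-c", CHILD]).returncode
```

On Linux, send_to_closed_peer() is expected to return errno.EPIPE, and run_child_without_nosignal() is expected to return -signal.SIGPIPE, which corresponds to the SIGPIPE line in the trace.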

Expected results:


Additional info:

Comment 1 Arenas Belon, Carlo Marcelo 2008-03-11 19:14:07 UTC
A similar problem was observed on Red Hat Enterprise Linux 5, using LDAP against an
OpenLDAP server.

The caveat is that in this case nscd was consuming 100% of one CPU in a busy loop,
trying to accept connections on the UNIX socket /var/run/nscd/socket, as shown by:

time(NULL)                              = 1205259067
accept(10, 0, NULL)                     = -1 EMFILE (Too many open files)
epoll_wait(11, {{EPOLLRDNORM, {u32=10, u64=10}}}, 100, 29988) = 1
time(NULL)                              = 1205259067
accept(10, 0, NULL)                     = -1 EMFILE (Too many open files)
epoll_wait(11, {{EPOLLRDNORM, {u32=10, u64=10}}}, 100, 29988) = 1
time(NULL)                              = 1205259067
accept(10, 0, NULL)                     = -1 EMFILE (Too many open files)
epoll_wait(11, {{EPOLLRDNORM, {u32=10, u64=10}}}, 100, 29988) = 1
time(NULL)                              = 1205259067
accept(10, 0, NULL)                     = -1 EMFILE (Too many open files)

This happened after it had leaked all of its 1024 file handles to socket connections, as shown by:

nscd    10501 nscd    5r   REG        3,2   217016   1038341 /var/db/nscd/passwd
nscd    10501 nscd    6u   REG        3,2   217016   1038342 /var/db/nscd/group
nscd    10501 nscd    7r   REG        3,2   217016   1038342 /var/db/nscd/group
nscd    10501 nscd    8u   REG        3,2   217016   1038340 /var/db/nscd/hosts
nscd    10501 nscd    9r   REG        3,2   217016   1038340 /var/db/nscd/hosts
nscd    10501 nscd   10u  unix 0xe7921280           12076246 /var/run/nscd/socket
nscd    10501 nscd   11r  0000       0,10        0  12076248 eventpoll
nscd    10501 nscd   12u  sock        0,5           12089279 can't identify protocol
nscd    10501 nscd   13u  unix 0xebade480           12077692 socket
nscd    10501 nscd   14u  sock        0,5           12098818 can't identify protocol
nscd    10501 nscd   15u  sock        0,5           12108033 can't identify protocol
nscd    10501 nscd   16u  sock        0,5           12136264 can't identify protocol
nscd    10501 nscd   17u  sock        0,5           12156091 can't identify protocol
nscd    10501 nscd   18u  sock        0,5           12189201 can't identify protocol
..
nscd    10501 nscd 1022u  sock        0,5           35734554 can't identify protocol
nscd    10501 nscd 1023u  unix 0xd6582300          118834655 /var/run/nscd/socket
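
The busy loop in the trace is consistent with a level-triggered readiness check: while a connection is pending, the listening socket keeps being reported as readable, but accept() cannot complete because the process has no free descriptors, so the loop never blocks. The following is a minimal sketch of that condition, assuming Linux and Python 3. It uses poll() instead of epoll, a temporary socket path, and a lowered RLIMIT_NOFILE; all of these are choices made for the illustration, and none of it is taken from the nscd source.

```python
import errno
import os
import resource
import select
import socket
import tempfile


def accept_with_no_free_descriptors():
    """Return (errno raised by accept(), listener still readable afterwards)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "demo.sock")  # hypothetical socket path
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    hoard = []
    try:
        listener.bind(path)
        listener.listen(5)
        client.connect(path)  # queued in the backlog, not yet accepted

        # Lower the soft limit so the descriptor table fills quickly.
        low = soft
        if soft == resource.RLIM_INFINITY or soft > 256:
            low = 256
        resource.setrlimit(resource.RLIMIT_NOFILE, (low, hard))

        # Use up every remaining descriptor, standing in for the leak.
        while True:
            try:
                hoard.append(os.dup(listener.fileno()))
            except OSError as exc:
                if exc.errno != errno.EMFILE:
                    raise
                break

        poller = select.poll()
        poller.register(listener, select.POLLIN)
        failure = None
        try:
            conn, _ = listener.accept()
            conn.close()
        except OSError as exc:
            failure = exc.errno
        # The connection is still pending, so readiness is reported again.
        still_readable = bool(poller.poll(0))
        return failure, still_readable
    finally:
        for fd in hoard:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
        client.close()
        listener.close()
        if os.path.exists(path):
            os.unlink(path)
        os.rmdir(tmpdir)
```

With these assumptions the function is expected to return errno.EMFILE together with True, which is the same pair of observations as the repeating accept() and epoll_wait() lines in the trace.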


Comment 2 Ulrich Drepper 2008-08-03 03:58:17 UTC
This doesn't look like a libc problem.  In the original report it seems like nscd is in trouble.  Programs don't get a response.  Comment #1 shows one possible way this can happen.

There are no known reports of nscd not closing descriptors.  And the fact that LDAP is mentioned makes this all the less likely.

We have no other report like this and would need more information.  And this time preferably without the nss_ldap module.

The next RHEL5 update will likely contain some nscd updates based on the current upstream code.  This code has no known issues.

Comment 3 Atro Tossavainen 2008-10-10 07:27:59 UTC
I can confirm this problem. When the problem situation is present, killing nscd makes it go away. Symptoms include not being able to start any new programs because they are SIGPIPE'd and not even being able to log in on the console. There is nothing in the syslog and nothing in dmesg either.  I am using nss_ldap, of course - it's rather hard to get user authorization information from LDAP without doing so.

Comment 4 Atro Tossavainen 2008-10-10 07:28:24 UTC
I should also say that I've had this occur on both x86 and x86_64.

Comment 6 Chris Ward 2009-04-01 08:17:54 UTC
Support, Customers, 

I have uploaded test packages that should fix this issue below. These packages
- if the issue reported can be confirmed as resolved - will be included in the
upcoming 4.8 release.

http://people.redhat.com/cward/4.8/nss_ldap/

The latest 4.8 Beta can be downloaded from RHN @ 
https://rhn.redhat.com/network/software/download_isos_full.pxt

Please test and provide us with feedback ASAP.