After an unspecified amount of time nscd will deadlock, crippling the
system. Each thread seems to hang shortly after receiving a request
for something that can't be found in the cache (according to the
nscd logs). Once all threads are hung anything needing to access nscd
blocks indefinitely. Logins are impossible, but if a root terminal is
already open nscd can be killed, restoring system functionality. nscd
threads are able to process other cache misses, but for some reason
they will eventually receive one that causes the thread to hang.
I'm submitting this as a high-priority, high-severity bug because it
relates to a core system component (glibc) and can cause a system to be
unusable. While nscd is an optional service, disabling it isn't really
a viable solution; the performance degradation is quite noticeable. Our
backend is an LDAP server - with around a hundred clients banging on the
server, removing nscd creates some serious performance problems.
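
Since a hung nscd makes lookups block indefinitely, one way to notice the condition from a still-open root shell is to run a lookup under a timeout. The following is only a minimal sketch: the helper name `lookup_responds`, the five-second timeout, and the choice of lookup command are assumptions, not anything nscd or glibc provides.

```python
import subprocess


def lookup_responds(command, timeout_seconds=5.0):
    """Return True if `command` finishes within the timeout, False if it hangs.

    The exit status is ignored on purpose: a lookup that fails quickly still
    shows that the name service path is answering.
    """
    try:
        subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
            check=False,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left behind.
        return False
    return True
```

A probe could then be something like `lookup_responds(["getent", "passwd", "someuser"])`, where `someuser` is a placeholder for an account served from LDAP. A `False` result only says the lookup did not return in time; it does not by itself prove that nscd is the component that is stuck.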
assigned to jakub
I am seeing the same problems on our deployed Red Hat 7.2 servers, again with
LDAP as a backend. The related packages (nss_ldap, pam_ldap, glibc) are
either from the default Red Hat 7.2 install or, on most of the machines, the
latest released Red Hat 7.2 updates.
Is any progress being made here?
We are running Novell eDirectory on Red Hat 7.3 server. Without using nscd the
server will jam totally. The problem is that nscd is extremely unstable and it has
to be restarted from crontab about every minute.
Here is a snippet of what I see with the "ps fax" command. Not a pretty sight:
3475 ? S 0:11 /usr/sbin/nscd
3484 ? Z 0:00 \_ [nscd <defunct>]
3684 ? S 0:09 /usr/sbin/nscd
3687 ? Z 0:00 \_ [nscd <defunct>]
3816 ? S 0:08 /usr/sbin/nscd
3819 ? Z 0:00 \_ [nscd <defunct>]
3954 ? S 0:07 /usr/sbin/nscd
3961 ? Z 0:00 \_ [nscd <defunct>]
4147 ? S 0:07 /usr/sbin/nscd
4151 ? Z 0:00 \_ [nscd <defunct>]
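
For anyone who wants to track how many of these zombies pile up, the listing above can be counted mechanically. This is a rough sketch that assumes the five-column layout shown here (PID, TTY, STAT, TIME, COMMAND); the function name `count_defunct_nscd` is made up for illustration.

```python
def count_defunct_nscd(ps_output):
    """Count zombie nscd entries in `ps fax`-style output."""
    count = 0
    for line in ps_output.splitlines():
        fields = line.split()
        # Expected columns: PID TTY STAT TIME COMMAND...
        if len(fields) < 5:
            continue
        stat = fields[2]
        command = " ".join(fields[4:])
        if stat.startswith("Z") and "nscd" in command and "<defunct>" in command:
            count += 1
    return count


# Lines copied from the listing above.
SAMPLE = "\n".join([
    r"3475 ? S 0:11 /usr/sbin/nscd",
    r"3484 ? Z 0:00 \_ [nscd <defunct>]",
    r"3684 ? S 0:09 /usr/sbin/nscd",
    r"3687 ? Z 0:00 \_ [nscd <defunct>]",
])
```

With the four sample lines, `count_defunct_nscd(SAMPLE)` gives 2.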
We also run nscd with an LDAP backend. We are fortunate in that the nscd daemon
dies abnormally frequently but does not deadlock. The nscd daemon dies leaving
behind /var/run/nscd.pid and /var/run/.nscd_socket - these need to be removed
before nscd can be restarted again.
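
The cleanup described above could be scripted roughly as follows. This is only a sketch: the two paths are the ones named in this comment, while the helper names and the liveness check via signal 0 are assumptions. It has to run as root, and it does not start nscd itself.

```python
import os

# Files reported above as being left behind when nscd dies.
STALE_PATHS = ("/var/run/nscd.pid", "/var/run/.nscd_socket")


def pid_is_alive(pid_file):
    """Return True if the PID recorded in pid_file belongs to a live process."""
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (FileNotFoundError, ValueError):
        return False
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def remove_stale_files(paths=STALE_PATHS):
    """Delete leftover files, skipping any that are already gone."""
    removed = []
    for path in paths:
        try:
            os.unlink(path)
            removed.append(path)
        except FileNotFoundError:
            pass
    return removed
```

The intended use is to call `remove_stale_files()` only when `pid_is_alive("/var/run/nscd.pid")` returns False, then start nscd the usual way. The PID check can be fooled if the number has been reused by an unrelated process.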
I've tried increasing the number of nscd threads and enabling debug logging
but I am still not sure if these resolve the problem.
This problem happens on both a RH7.3 box and RH7.1 box with the current
nscd/glibc errata RPMs.
We've had this problem happen on
Red Hat 7.3
Red Hat 8.0
Red Hat ES 2.1
Red Hat ES 3
We kept our RH 7.3 and 8 systems up to date with patches, and Red Hat
Network is keeping our ES 2.1 and ES 3 systems completely up to date,
and we're still seeing the problem, on multiple different systems.
This problem is listed as "ASSIGNED", but that was more than a year
ago. What's the holdup? Would nscd debug logs help?
The holdup is that the component is wrong. Somebody set this up for
some reason but none of the people responsible for the package even
knew it existed. The bug should have been filed against glibc since
this is the package nscd is part of.
There is a problem in nscd which is fixed in the current glibc at
least. Use FC3t2 or later when it becomes available. Part of the
blame is to be laid on the nss_ldap module which far too often
misbehaves. I won't analyze it since I at some point want to eat again.
If you have problems with lockups in FC3 let me know by reopening.
But we certainly won't touch any code in RHL9 or earlier, FC1, or FC2.