Bug 17519

Summary:	nscd deadlocks, halting system activity
Product:	[Retired] Red Hat Linux	Reporter:	shuey
Component:	glibc	Assignee:	Jakub Jelinek <jakub>
Status:	CLOSED RAWHIDE	QA Contact:
Severity:	high	Docs Contact:
Priority:	high
Version:	6.2	CC:	drepper, fweimer, shuey
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2004-10-04 06:50:41 UTC	Type:	---

Description shuey 2000-09-14 22:17:31 UTC

After an unspecified amount of time nscd will deadlock, crippling the
system.  Each thread seems to hang shortly after receiving a request
for something that can't be found in the cache (according to the
nscd logs).  Once all threads are hung, anything needing to access nscd
blocks indefinitely.  Logins are impossible, but if a root terminal is
already open nscd can be killed, restoring system functionality.  nscd
threads are able to process other cache misses, but for some reason
they will eventually receive one that causes the thread to hang.
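
Because a hung nscd makes lookups block indefinitely, any check for this state has to run the lookup out of process and enforce its own timeout. The following is a minimal sketch of that idea, not a tested monitoring tool; the function name `lookup_responds` and the 5-second default are assumptions made for illustration.

```python
import subprocess


def lookup_responds(command, timeout_seconds=5.0):
    """Return True if `command` finishes within the timeout, False if it hangs.

    The exit status is ignored on purpose: a lookup that fails quickly
    still shows that the name service answered.
    """
    try:
        subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left behind.
        return False
    return True
```

A caller could pass something like `["getent", "passwd", "root"]`, assuming that lookup is actually routed through nscd on the system in question, and restart nscd when the function returns False.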

I'm submitting this as a high-priority, high-severity bug because it
relates to a core system component (glibc) and can cause a system to be
unusable.  While nscd is an optional service, disabling it isn't really
a viable solution; the performance degradation is quite noticeable.  Our
backend is an LDAP server - with around a hundred clients banging on the
server, removing nscd creates some serious performance problems.

Comment 1 Cristian Gafton 2000-10-17 23:34:53 UTC

assigned to jakub

Comment 2 Ben Klang 2002-05-16 14:05:26 UTC

I am seeing the same problems on our deployed RedHat 7.2 servers, again with
LDAP as a backend. The related packages (nss_ldap, pam_ldap, glibc) are either
from the default RedHat 7.2 install or, on most of the machines, the latest
released RedHat 7.2 update packages.

Is any progress being made here?

Thanks

Comment 3 Petri T. Koistinen 2002-08-30 20:06:47 UTC

We are running Novell eDirectory on a Red Hat 7.3 server. Without nscd the
server will jam totally. The problem is that nscd is extremely unstable and it
has to be restarted from crontab about every minute.

Here is a snippet of what I see with the "ps fax" command. Not a pretty sight:

 3475 ?        S      0:11 /usr/sbin/nscd
 3484 ?        Z      0:00  \_ [nscd <defunct>]
 3684 ?        S      0:09 /usr/sbin/nscd
 3687 ?        Z      0:00  \_ [nscd <defunct>]
 3816 ?        S      0:08 /usr/sbin/nscd
 3819 ?        Z      0:00  \_ [nscd <defunct>]
 3954 ?        S      0:07 /usr/sbin/nscd
 3961 ?        Z      0:00  \_ [nscd <defunct>]
 4147 ?        S      0:07 /usr/sbin/nscd
 4151 ?        Z      0:00  \_ [nscd <defunct>]
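
A listing like the one above can be tallied mechanically. The sketch below assumes the five-column layout shown (PID, TTY, STAT, TIME, COMMAND) and treats any STAT value beginning with "Z" as defunct; the function name `count_nscd_states` is hypothetical.

```python
def count_nscd_states(ps_output):
    """Count live and defunct nscd entries in `ps fax`-style output.

    Returns a (live, defunct) tuple. Lines that do not start with a
    numeric PID (for example the column header) are skipped.
    """
    live = 0
    defunct = 0
    for line in ps_output.splitlines():
        # Split into at most five fields: PID, TTY, STAT, TIME, COMMAND.
        fields = line.split(None, 4)
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        stat, command = fields[2], fields[4]
        if "nscd" not in command:
            continue
        if stat.startswith("Z"):
            defunct += 1
        else:
            live += 1
    return live, defunct
```

Feeding it the ten lines quoted above would count five live nscd processes and five defunct children.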

Comment 4 David Vu 2002-12-02 06:42:46 UTC

We also run nscd with an LDAP backend. We are fortunate in that the nscd daemon
dies abnormally often but does not deadlock.  The nscd daemon dies leaving
behind /var/run/nscd.pid and /var/run/.nscd_socket - these need to be removed
before nscd can be restarted again.
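
The cleanup described above could be scripted roughly as follows. This is only a sketch: it assumes the pid file holds a single decimal PID, the helper names are hypothetical, and it does not guard against the PID having been reused by an unrelated process.

```python
import os


def pid_is_alive(pid):
    """Return True if a process with this PID currently exists."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def remove_stale_nscd_files(pid_file, socket_file):
    """Remove both files if the PID recorded in pid_file is not running.

    Returns the list of paths that were removed. If pid_file is missing,
    nothing is touched, because staleness cannot be judged.
    """
    try:
        with open(pid_file) as handle:
            text = handle.read().strip()
    except FileNotFoundError:
        return []
    try:
        pid = int(text)
    except ValueError:
        pid = -1  # unparsable content is treated as stale
    if pid_is_alive(pid):
        return []
    removed = []
    for path in (pid_file, socket_file):
        try:
            os.remove(path)
            removed.append(path)
        except FileNotFoundError:
            pass
    return removed
```

With the paths from this comment, a restart wrapper would call `remove_stale_nscd_files("/var/run/nscd.pid", "/var/run/.nscd_socket")` before starting nscd again.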

I've tried increasing the number of nscd threads and enabling debug logging,
but I am still not sure whether either resolves the problem.

This problem happens on both a RH7.3 box and RH7.1 box with the current 
nscd/glibc errata RPMs.

Comment 5 Tim Mooney 2004-01-09 23:51:07 UTC

We've had this problem happen on

  Red Hat 7.3
  Red Hat 8.0
  Red Hat ES 2.1
  Red Hat ES 3

We kept our RH 7.3 and 8 systems up to date with patches, and Red Hat
Network is keeping our ES 2.1 and ES 3 systems completely up to date,
and we're still seeing the problem, on multiple different systems.

This problem is listed as "ASSIGNED", but that was more than a year
ago.  What's the holdup?  Would nscd debug logs help?

Comment 6 Ulrich Drepper 2004-10-04 06:50:41 UTC

The holdup is that the component is wrong.  Somebody set this up for
some reason, but none of the people responsible for the package even
knew it existed.  The bug should have been filed against glibc, since
this is the package nscd is part of.

There is a problem in nscd which is fixed in the current glibc at
least.  Use FC3t2 or later when it becomes available.  Part of the
blame is to be laid on the nss_ldap module, which far too often
misbehaves.  I won't analyze it since I at some point want to eat again.

If you have problems with lockups in FC3 let me know by reopening. 
But we certainly won't touch any code in RHL9 or earlier, FC1, or FC2.