151486 – nscd takes 100% cpu, locks up top and ps -ef

Bug 151486 - nscd takes 100% cpu, locks up top and ps -ef

Summary: nscd takes 100% cpu, locks up top and ps -ef

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	glibc
Sub Component:
Version:	3
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Jakub Jelinek
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2005-03-18 14:49 UTC by Thomas J. Baker
Modified:	2007-11-30 22:11 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-07-01 20:34:16 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Thomas J. Baker 2005-03-18 14:49:10 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050302 Firefox/1.0.1 Fedora/1.0.1-1.3.2

Description of problem:
I don't know what's triggered this but I am unable to fix it. Upon boot of my laptop this morning, nscd started taking 100% cpu. It was hard to track down at first as top just hung without displaying anything and ps -ef would hang right after listing klogd. I finally went through /proc to find the culprit. I've tried removing and reinstalling nscd, prelinking, ldconfiging, etc. but nothing has helped. I use NetworkManager  which I believe relies on nscd.

Version-Release number of selected component (if applicable):
nscd-2.3.4-2.fc3.i386.rpm

How reproducible:
Always

Steps to Reproduce:
1. boot machine
2. 
3.
  

Actual Results:  nscd is near 100% CPU utilization

Expected Results:  nscd is near 0% CPU

Additional info:

Comment 1 Peter Bieringer 2005-03-23 11:00:05 UTC

I found the same on RHEL4 using nscd-2.3.4-2

After "service nscd start" after around 15 seconds, CPU utilization goes up to 100% 

But funny, nscd do not show CPU load.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9018 nscd      17   0 24676 1552 1176 S  0.0  0.2   0:00.00 nscd

"vmstat 1" told me (first line around start of nscd)

 0  0      0 560956  18848  66244    0    0     0   120 1036    80  0  0 87 13
 0  0      0 560956  18848  66244    0    0     0     0 1005    14  0  0 100  0
 0  0      0 560956  18848  66244    0    0     0     0 1005    19  0  0 100  0
 0  0      0 560956  18848  66244    0    0     0     0 1005    13  0  0 100  0
 0  0      0 560956  18848  66244    0    0     0     0 1006    19  0  0 100  0
 0  0      0 560956  18848  66244    0    0     0     0 1005   241  0  0 100  0
 0  0      0 560972  18848  66244    0    0     0     0 1005    13  0  0 100  0
 0  0      0 560972  18848  66244    0    0     0     0 1005    11  0  0 100  0
 0  0      0 560972  18848  66244    0    0     0     0 1006    17  0  0 100  0
 0  0      0 560972  18848  66244    0    0     0     0 1005    17  0  0 100  0
 0  0      0 560980  18848  66244    0    0     0     0 1005    15  0  0 100  0
 0  0      0 560980  18848  66244    0    0     0     0 1005    11  0  0 100  0
 0  0      0 560980  18848  66244    0    0     0     0 1006    13  0  0 100  0
 0  0      0 560980  18848  66244    0    0     0     0 1007    13  0  0 100  0
 1  0      0 560980  18848  66244    0    0     0     0 1005    20 38  0 62  0
 1  0      0 560980  18848  66244    0    0     0     0 1006    15 100  0  0  0
 1  0      0 560980  18848  66244    0    0     0     0 1006    13 100  0  0  0
 1  0      0 560980  18848  66244    0    0     0     0 1005    11 100  0  0  0
 1  0      0 560980  18848  66244    0    0     0     0 1006    13 100  0  0  0
 1  0      0 560980  18848  66244    0    0     0     0 1005    13 100  0  0  0

# ps axu |grep nscd
nscd      5938  0.0  0.1 19272 1556 ?        Ssl  11:45   0:00 /usr/sbin/nscd

Strange also that strace has nothing to do:

# strace -f -p 6747
Process 6755 attached with 7 threads - interrupt to quit
[pid  6747] time( <unfinished ...>
[pid  6751] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid  6752] futex(0x9832e0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid  6753] futex(0x9832e0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid  6754] futex(0x9832e0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid  6755] futex(0x9832e0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid  6747] <... time resumed> NULL)    = 1111574946
[pid  6751] <... clock_gettime resumed> {274, 272530304}) = 0
[pid  6747] epoll_wait(10,  <unfinished ...>
[pid  6751] futex(0x9832e0, FUTEX_WAKE, 1 <unfinished ...>
[pid  6752] <... futex resumed> )       = 0
[pid  6751] <... futex resumed> )       = 1
[pid  6752] futex(0x9832e0, FUTEX_WAKE, 1 <unfinished ...>
[pid  6753] <... futex resumed> )       = 0
[pid  6751] futex(0x9832e4, FUTEX_WAIT, 9, {3, 150981000} <unfinished ...>
[pid  6752] <... futex resumed> )       = 1
[pid  6753] futex(0x9832e0, FUTEX_WAKE, 1 <unfinished ...>
[pid  6754] <... futex resumed> )       = 0
[pid  6752] futex(0x9832e4, FUTEX_WAIT, 9, NULL <unfinished ...>
[pid  6753] <... futex resumed> )       = 1
[pid  6754] futex(0x9832e0, FUTEX_WAKE, 1 <unfinished ...>
[pid  6755] <... futex resumed> )       = 0
[pid  6753] futex(0x9832e4, FUTEX_WAIT, 9, NULL <unfinished ...>
[pid  6754] <... futex resumed> )       = 1
[pid  6755] futex(0x9832e0, FUTEX_WAKE, 1 <unfinished ...>
[pid  6754] futex(0x9832e4, FUTEX_WAIT, 9, NULL <unfinished ...>
[pid  6755] <... futex resumed> )       = 0
[pid  6755] futex(0x9832e4, FUTEX_WAIT, 9, NULL <unfinished ...>
[pid  6751] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  6751] futex(0x983264, FUTEX_WAKE, 1) = 0
[pid  6751] stat64("/etc/group", {st_mode=S_IFREG|0644, st_size=663, ...}) = 0
[pid  6751] clock_gettime(CLOCK_MONOTONIC, {277, 428656304}) = 0
[pid  6751] clock_gettime(CLOCK_MONOTONIC, {277, 428881304}) = 0
[pid  6751] futex(0x9832e4, FUTEX_WAIT, 11, {14, 999775000}) = -1 ETIMEDOUT
(Connection timed out)
[pid  6751] futex(0x983264, FUTEX_WAKE, 1) = 0
[pid  6751] stat64("/etc/group", {st_mode=S_IFREG|0644, st_size=663, ...}) = 0
[pid  6751] clock_gettime(CLOCK_MONOTONIC, {292, 431396304}) = 0
[pid  6751] clock_gettime(CLOCK_MONOTONIC, {292, 431619304}) = 0
[pid  6751] futex(0x9832e4, FUTEX_WAIT, 13, {14, 999777000}

Logging also show nothing strange

I tested following possible reasons without any improvements:
- disable local running named and not using by editing /etc/resolv.conf
- shutting down local running ldap server
- removing "ldap" from nsswitch.conf
- disable caching of "hosts"

Note that I'm running a very similar box with RHEL3 (does most of the same
work), config is mostly equal, here "nscd" has no problems.

Comment 2 Peter Bieringer 2005-03-23 11:01:35 UTC

BTW: why does "nscd" not exists in the components list of RHEL4?

Comment 3 Matt Kennel 2005-03-24 19:57:43 UTC

I have seen this phenomenon myself.   On my setup, it locked up name resolution
upon trying to start 'xfs' during bootup.  It was quite frustrating before
finding the seemingly unrelated 'nscd' was to blame.

Once I saw that I disabled nscd during bootup and the rest ran normally.

I then managed to more reliably induce the bug.
In one window, I started a 'top' process running.

In another window I did 'service nscd start', and nscd started normally and
nothing bad happened immediately.  I then used the nameserver by doing a ping to
a hostname, the host resolved properly and ping worked fine.  While the 'ping'
program was active, there was no anomalous nscd behavior either.

At the moment that I hit control-c on the ping process, then 'nscd' shot up to
100% of CPU, and further name resolution timed out or was very slow, and that
I'm sure is causing all the other weird faults.

Only by killing the nscd process could I get things to return back to normal.

I did find a resolution, however:  with nscd off I erased the files in
'/var/db/nscd/*', and subsequently, with nscd startup, everything seems to work
normally.

I am wondering whether there is some weird bug in a version skew in the db
routines used by nscd for its cache, version skew between file format and
library(?), and this gives problems upon upgrades (where presumably
/var/db/nscd/ is kept), but not on fresh installations.

A suggested fix might be to delete /var/db/nscd/* on nscd upgrade---but possibly
the problem is in the db library upgrade.

Comment 4 Matt Kennel 2005-03-24 20:05:10 UTC

One more comment:  if any of you are having this problem, try to save
the /var/db/nscd/* files which were apparently causing the problem.

That way you will be able to send them to the proper nscd developer, along
with a set which works, to help find the bug.

I didn't do this, unfortunately.

Comment 5 Peter Bieringer 2005-03-25 10:13:56 UTC

Oops, I deleted the files also before reading the last posting. Now nscd is
working fine again. Note that I didn't upgrade nscd on this system, it was still
the version installed during first installation.
I've tried to reproduce this, but don't get similar result.

Comment 6 Matt Kennel 2005-03-25 16:50:50 UTC

Maybe then the problem is an upgrade in some library which nscd depends on, but
not the nscd package itself? 

Or maybe not---perhaps there is some strange circumstance, e.g. series of DNS
queries and replies---different for everybody---which gets the database into
this broken state.

Comment 7 Ulrich Drepper 2005-06-30 20:07:45 UTC

I never seen the problem anywhere and there is no usable information in this bug
so far.  Either provide information or I'll close the bug soon.

Comment 8 Thomas J. Baker 2005-06-30 20:17:54 UTC

As the original reporter, I upgraded to FC4T1 soon after the problem started
happening for me and it went away.

Comment 9 Ulrich Drepper 2005-07-01 20:34:16 UTC

If it works in FC4 there is no reason to keep this bug open.

Note You need to log in before you can comment on or make changes to this bug.