From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050302 Firefox/1.0.1 Fedora/1.0.1-1.3.2 Description of problem: I don't know what's triggered this but I am unable to fix it. Upon boot of my laptop this morning, nscd started taking 100% cpu. It was hard to track down at first as top just hung without displaying anything and ps -ef would hang right after listing klogd. I finally went through /proc to find the culprit. I've tried removing and reinstalling nscd, prelinking, ldconfiging, etc. but nothing has helped. I use NetworkManager which I believe relies on nscd. Version-Release number of selected component (if applicable): nscd-2.3.4-2.fc3.i386.rpm How reproducible: Always Steps to Reproduce: 1. boot machine 2. 3. Actual Results: nscd is near 100% CPU utilization Expected Results: nscd is near 0% CPU Additional info:
I found the same on RHEL4 using nscd-2.3.4-2 After "service nscd start" after around 15 seconds, CPU utilization goes up to 100% But funny, nscd do not show CPU load. PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 9018 nscd 17 0 24676 1552 1176 S 0.0 0.2 0:00.00 nscd "vmstat 1" told me (first line around start of nscd) 0 0 0 560956 18848 66244 0 0 0 120 1036 80 0 0 87 13 0 0 0 560956 18848 66244 0 0 0 0 1005 14 0 0 100 0 0 0 0 560956 18848 66244 0 0 0 0 1005 19 0 0 100 0 0 0 0 560956 18848 66244 0 0 0 0 1005 13 0 0 100 0 0 0 0 560956 18848 66244 0 0 0 0 1006 19 0 0 100 0 0 0 0 560956 18848 66244 0 0 0 0 1005 241 0 0 100 0 0 0 0 560972 18848 66244 0 0 0 0 1005 13 0 0 100 0 0 0 0 560972 18848 66244 0 0 0 0 1005 11 0 0 100 0 0 0 0 560972 18848 66244 0 0 0 0 1006 17 0 0 100 0 0 0 0 560972 18848 66244 0 0 0 0 1005 17 0 0 100 0 0 0 0 560980 18848 66244 0 0 0 0 1005 15 0 0 100 0 0 0 0 560980 18848 66244 0 0 0 0 1005 11 0 0 100 0 0 0 0 560980 18848 66244 0 0 0 0 1006 13 0 0 100 0 0 0 0 560980 18848 66244 0 0 0 0 1007 13 0 0 100 0 1 0 0 560980 18848 66244 0 0 0 0 1005 20 38 0 62 0 1 0 0 560980 18848 66244 0 0 0 0 1006 15 100 0 0 0 1 0 0 560980 18848 66244 0 0 0 0 1006 13 100 0 0 0 1 0 0 560980 18848 66244 0 0 0 0 1005 11 100 0 0 0 1 0 0 560980 18848 66244 0 0 0 0 1006 13 100 0 0 0 1 0 0 560980 18848 66244 0 0 0 0 1005 13 100 0 0 0 # ps axu |grep nscd nscd 5938 0.0 0.1 19272 1556 ? Ssl 11:45 0:00 /usr/sbin/nscd Strange also that strace has nothing to do: # strace -f -p 6747 Process 6755 attached with 7 threads - interrupt to quit [pid 6747] time( <unfinished ...> [pid 6751] clock_gettime(CLOCK_MONOTONIC, <unfinished ...> [pid 6752] futex(0x9832e0, FUTEX_WAIT, 2, NULL <unfinished ...> [pid 6753] futex(0x9832e0, FUTEX_WAIT, 2, NULL <unfinished ...> [pid 6754] futex(0x9832e0, FUTEX_WAIT, 2, NULL <unfinished ...> [pid 6755] futex(0x9832e0, FUTEX_WAIT, 2, NULL <unfinished ...> [pid 6747] <... time resumed> NULL) = 1111574946 [pid 6751] <... clock_gettime resumed> {274, 272530304}) = 0 [pid 6747] epoll_wait(10, <unfinished ...> [pid 6751] futex(0x9832e0, FUTEX_WAKE, 1 <unfinished ...> [pid 6752] <... futex resumed> ) = 0 [pid 6751] <... futex resumed> ) = 1 [pid 6752] futex(0x9832e0, FUTEX_WAKE, 1 <unfinished ...> [pid 6753] <... futex resumed> ) = 0 [pid 6751] futex(0x9832e4, FUTEX_WAIT, 9, {3, 150981000} <unfinished ...> [pid 6752] <... futex resumed> ) = 1 [pid 6753] futex(0x9832e0, FUTEX_WAKE, 1 <unfinished ...> [pid 6754] <... futex resumed> ) = 0 [pid 6752] futex(0x9832e4, FUTEX_WAIT, 9, NULL <unfinished ...> [pid 6753] <... futex resumed> ) = 1 [pid 6754] futex(0x9832e0, FUTEX_WAKE, 1 <unfinished ...> [pid 6755] <... futex resumed> ) = 0 [pid 6753] futex(0x9832e4, FUTEX_WAIT, 9, NULL <unfinished ...> [pid 6754] <... futex resumed> ) = 1 [pid 6755] futex(0x9832e0, FUTEX_WAKE, 1 <unfinished ...> [pid 6754] futex(0x9832e4, FUTEX_WAIT, 9, NULL <unfinished ...> [pid 6755] <... futex resumed> ) = 0 [pid 6755] futex(0x9832e4, FUTEX_WAIT, 9, NULL <unfinished ...> [pid 6751] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 6751] futex(0x983264, FUTEX_WAKE, 1) = 0 [pid 6751] stat64("/etc/group", {st_mode=S_IFREG|0644, st_size=663, ...}) = 0 [pid 6751] clock_gettime(CLOCK_MONOTONIC, {277, 428656304}) = 0 [pid 6751] clock_gettime(CLOCK_MONOTONIC, {277, 428881304}) = 0 [pid 6751] futex(0x9832e4, FUTEX_WAIT, 11, {14, 999775000}) = -1 ETIMEDOUT (Connection timed out) [pid 6751] futex(0x983264, FUTEX_WAKE, 1) = 0 [pid 6751] stat64("/etc/group", {st_mode=S_IFREG|0644, st_size=663, ...}) = 0 [pid 6751] clock_gettime(CLOCK_MONOTONIC, {292, 431396304}) = 0 [pid 6751] clock_gettime(CLOCK_MONOTONIC, {292, 431619304}) = 0 [pid 6751] futex(0x9832e4, FUTEX_WAIT, 13, {14, 999777000} Logging also show nothing strange I tested following possible reasons without any improvements: - disable local running named and not using by editing /etc/resolv.conf - shutting down local running ldap server - removing "ldap" from nsswitch.conf - disable caching of "hosts" Note that I'm running a very similar box with RHEL3 (does most of the same work), config is mostly equal, here "nscd" has no problems.
BTW: why does "nscd" not exists in the components list of RHEL4?
I have seen this phenomenon myself. On my setup, it locked up name resolution upon trying to start 'xfs' during bootup. It was quite frustrating before finding the seemingly unrelated 'nscd' was to blame. Once I saw that I disabled nscd during bootup and the rest ran normally. I then managed to more reliably induce the bug. In one window, I started a 'top' process running. In another window I did 'service nscd start', and nscd started normally and nothing bad happened immediately. I then used the nameserver by doing a ping to a hostname, the host resolved properly and ping worked fine. While the 'ping' program was active, there was no anomalous nscd behavior either. At the moment that I hit control-c on the ping process, then 'nscd' shot up to 100% of CPU, and further name resolution timed out or was very slow, and that I'm sure is causing all the other weird faults. Only by killing the nscd process could I get things to return back to normal. I did find a resolution, however: with nscd off I erased the files in '/var/db/nscd/*', and subsequently, with nscd startup, everything seems to work normally. I am wondering whether there is some weird bug in a version skew in the db routines used by nscd for its cache, version skew between file format and library(?), and this gives problems upon upgrades (where presumably /var/db/nscd/ is kept), but not on fresh installations. A suggested fix might be to delete /var/db/nscd/* on nscd upgrade---but possibly the problem is in the db library upgrade.
One more comment: if any of you are having this problem, try to save the /var/db/nscd/* files which were apparently causing the problem. That way you will be able to send them to the proper nscd developer, along with a set which works, to help find the bug. I didn't do this, unfortunately.
Oops, I deleted the files also before reading the last posting. Now nscd is working fine again. Note that I didn't upgrade nscd on this system, it was still the version installed during first installation. I've tried to reproduce this, but don't get similar result.
Maybe then the problem is an upgrade in some library which nscd depends on, but not the nscd package itself? Or maybe not---perhaps there is some strange circumstance, e.g. series of DNS queries and replies---different for everybody---which gets the database into this broken state.
I never seen the problem anywhere and there is no usable information in this bug so far. Either provide information or I'll close the bug soon.
As the original reporter, I upgraded to FC4T1 soon after the problem started happening for me and it went away.
If it works in FC4 there is no reason to keep this bug open.