Description of problem:
We have nscd running on a few systems with Fedora 9. These systems use openldap to access our passwd/group database on a Windows Active Directory (LDAP) server.

On the first boot after installing Fedora 9, I noticed that one of my init scripts was hanging (this worked fine on Fedora 7, so I was puzzled). The line it was hanging on was:

useradd -u 44 -d /var/flexnet -s /bin/sh flexnet

(it was stuck there for at least a few hours). After running strace, I figured out that it had forked and executed:

/usr/sbin/nscd nscd -i group

(strace shows that it is stuck reading bytes from file descriptor 3)

Also, I noticed that the running nscd daemon was taking up 100% of the idle CPU time:

[root@glinda ~]# strace -p 2018
Process 2018 attached - interrupt to quit
time(NULL) = 1211933052
epoll_wait(12, {}, 100, 29988) = 0
time(NULL) = 1211933142
epoll_wait(12,

After killing the client nscd, the daemon keeps using 100% of the CPU. Subsequent attempts (after nscd is spinning) get stuck reading file descriptor 3 as well; from a full trace, it appears that this descriptor is connected to the socket /var/run/nscd/socket (presumably waiting for the daemon process to write a response).

This script runs late in the init process (S95), and by this time username resolution is already working: after gaining a shell via ssh, I can use "getent passwd" to get records out of LDAP.

How reproducible:
sometimes

Additional info:
nscd-2.8-3.i386
nss_ldap-259-3.fc9.i386
openldap-2.4.8-3.fc9.i386
shadow-utils-4.1.1-2.fc9.i386
2.6.25.3-18.fc9.i686

Our LDAP repository has a few large unix groups, with ~900 entries each.
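The hang described above is a client blocking forever on a read from a Unix socket whose server never answers. A minimal sketch of how such a condition could be detected with a timeout instead of an unbounded read follows. The function name probe_unix_socket and its return strings are hypothetical, and the payload is an arbitrary placeholder: this does not speak the real nscd wire protocol, it only illustrates the bounded wait.

```python
import os
import socket


def probe_unix_socket(path, payload, timeout=5.0):
    """Send `payload` to a Unix stream socket and wait at most `timeout`
    seconds for any reply.

    Returns one of:
      "reply"       - the server sent at least one byte back
      "closed"      - the server closed the connection without replying
      "timeout"     - nothing came back in time (the situation in this report)
      "unreachable" - the socket is missing or refused the connection
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect(path)
            sock.sendall(payload)
            data = sock.recv(4096)
        except socket.timeout:
            # Must be caught before OSError, of which it is a subclass.
            return "timeout"
        except OSError:
            return "unreachable"
    return "reply" if data else "closed"


if __name__ == "__main__":
    # Socket path taken from the report; the payload is a placeholder only.
    print(probe_unix_socket("/var/run/nscd/socket", b"\0", timeout=2.0))
```

A caller that gets "timeout" could log the condition and continue rather than stalling the init script for hours.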
Created attachment 306865 [details] subsequent nscd trace
This could be related to bug #444618.

- do you have "nss_page_results yes" in your /etc/ldap.conf?
- when downgrading to F8 (openldap-2.3.x + appropriate nss_ldap version), does the bug disappear?
- when you disable nscd, does "id <username>" hang, especially when the user is in AD?

And please attach a stack trace of the nscd running in the endless loop (attach gdb to it and run "bt full").

It would be very helpful if I had access to the AD you use, with my debugger, my openldap testing builds etc., but I assume it's not possible (I'm just trying :)

Thanks in advance.
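For reference, the requested backtrace can also be captured without an interactive gdb session. The following is only a sketch: it assumes gdb is installed and that the caller is allowed to attach to the process, and the helper names gdb_backtrace_cmd and capture_backtrace are made up for illustration.

```python
import subprocess


def gdb_backtrace_cmd(pid):
    """Build a gdb command line that attaches to `pid`, runs "bt full"
    and exits (-batch) instead of dropping into a prompt."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "bt full"]


def capture_backtrace(pid, timeout=60):
    """Run gdb against `pid` and return whatever it printed on stdout.

    Assumption: the caller has permission to attach (typically root).
    """
    result = subprocess.run(
        gdb_backtrace_cmd(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

Without the matching debuginfo packages the output will only contain addresses and "No symbol table info available." lines.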
> do you have "nss_page_results yes" in your /etc/ldap.conf?

no

> when downgrading to F8 (openldap-2.3.x + appropriate nss_ldap version), does
> the bug disappear?

I'll try this if you want me to. Our Fedora 7 boxes run just fine.

> when you disable nscd, does "id <username>" hang, especially when the user is
> in AD?

Nope, it's a bit slow (~.3 sec) but that's normal.

> bt full

The backtrace is a bit useless; I need to mirror the debuginfos. I will do that today, and will have it available tomorrow:

(gdb) bt full
#0  0x0012e416 in __kernel_vsyscall ()
No symbol table info available.
#1  0x00283a46 in epoll_wait () from /lib/libc.so.6
No symbol table info available.
#2  0xb7f9196b in main_loop_epoll () from /usr/sbin/nscd
No symbol table info available.
#3  0xb7f91f98 in start_threads () from /usr/sbin/nscd
No symbol table info available.
#4  0xb7f8f0bd in main () from /usr/sbin/nscd
No symbol table info available.

> It would be very helpful if I would have access to the AD you use, with my
> debugger, my openldap testing builds etc, but I assume it's not possible (I'm
> just trying :)

Sure, actually. I will send you the login details in a private email.
The stack trace does not look like a bug in openldap; it looks like a problem in nscd itself. Reassigning to the nscd owner and adding the nss_ldap owner to cc:, just to have a look in case something rings a bell.

I also tried to reproduce the bug on your system, but "useradd jsafrane_test" always succeeded (tried ~10 times, not in an initscript). Is something else necessary to reproduce the bug?

Jakub, Nalin, I can provide login details to the reporter's buggy machine, if you are interested.
It happens 100% of the time during the first boot on all the systems I have tried. I can't reproduce it after that point.

I am going to try copying the nscd cache state right before the useradd command, to see if I can reproduce this error state. I will also set up a host to do a full strace on nscd and useradd while the command gets executed.
Created attachment 307126 [details] nscd and useradd strace
On the system that I copied the nscd cache from:
I can't reproduce this by copying /var/db/nscd/* and setting the clock back on a host system. It also seems that restarting nscd before running useradd prevents it from hanging (I took a copy of the cache state before and after stopping nscd).

On the system that I did the strace on:
Another hang, like all the rest. I have attached the strace output from useradd and the server nscd (strace -f).
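For anyone who wants to repeat the before/after cache copy, here is a minimal sketch of how the snapshots could be taken. The helper name snapshot_dir, the labels and the destination directory are hypothetical; only the source path /var/db/nscd comes from this report. Copying the files while nscd is running may capture an inconsistent state, so treat the "before" snapshot with some caution.

```python
import shutil
import time
from pathlib import Path


def snapshot_dir(src, dest_root, label):
    """Copy the directory `src` to `dest_root/<label>-<timestamp>` and
    return the path of the new copy.

    The destination must not exist yet; shutil.copytree creates it
    together with any missing parent directories.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_root) / f"{label}-{stamp}"
    shutil.copytree(src, dest)
    return dest


if __name__ == "__main__":
    # /root/nscd-snapshots is an assumed location, not from the report.
    print(snapshot_dir("/var/db/nscd", "/root/nscd-snapshots", "before-useradd"))
```

Calling it once before useradd and once after stopping nscd, with different labels, gives two directories that can be compared or copied to another host.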
Should work nicely in current version.