448643 – NSCD hangs when invalidating group table cache

Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	glibc
Version:	9
Hardware:	All
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Assignee:	Jakub Jelinek
QA Contact:	Fedora Extras Quality Assurance

Reported:	2008-05-28 00:32 UTC by Josh Lange
Modified:	2008-08-03 03:22 UTC
CC List:	3 users
Fixed In Version:	2.8-8
Last Closed:	2008-08-03 03:22:55 UTC

Attachments
subsequent nscd trace (7.23 KB, application/octet-stream) 2008-05-28 00:32 UTC, Josh Lange
nscd and useradd strace (277.19 KB, application/x-bzip) 2008-05-29 19:25 UTC, Josh Lange

Description Josh Lange 2008-05-28 00:32:12 UTC

Description of problem:
We have nscd running on a few systems with Fedora 9. These systems use
openldap to access our passwd/group database on a Windows Active Directory
(LDAP) server.

Upon the first boot after installing Fedora 9, I noticed that one of the init
scripts I made was hanging (this worked fine on Fedora 7, so I was puzzled). The
line it was hanging on was: "useradd -u 44 -d /var/flexnet -s /bin/sh flexnet"
(it was stuck there for at least a few hours).

After running strace, I figured out that it had forked and executed:
/usr/sbin/nscd nscd -i group
(strace shows that it is stuck reading bytes from file descriptor 3)

Also, I noticed that the running nscd daemon in memory was taking up 100% of the
idle cpu time:
[root@glinda ~]# strace -p 2018
Process 2018 attached - interrupt to quit
time(NULL)                              = 1211933052
epoll_wait(12, {}, 100, 29988)          = 0
time(NULL)                              = 1211933142
epoll_wait(12, 


After killing the client nscd, the daemon continues to use 100% of the CPU.

Subsequent attempts (after nscd is spinning) get stuck reading file descriptor 3
as well, but from a full trace, it appears that this is connected to the socket:
/var/run/nscd/socket (presumably waiting for the daemon process to write a
response).
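
Until the daemon-side hang is fixed, an init script can at least bound how long it waits on the invalidation client. The following is a minimal Python sketch, not something taken from this report: the helper name run_with_timeout and the 30-second limit are assumptions chosen for illustration. It only keeps the caller from blocking forever; as noted above, killing the client does not stop the daemon from spinning.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit status, or None if it was killed on timeout.

    run_with_timeout is a hypothetical helper, not part of nscd or shadow-utils.
    """
    try:
        # subprocess.run kills the child and reaps it before raising
        # TimeoutExpired, so no stuck client process is left behind.
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


# Example (assumed 30 s limit; "nscd -i group" is the command seen in the strace):
# status = run_with_timeout(["/usr/sbin/nscd", "-i", "group"], 30)
# if status is None:
#     print("nscd -i group timed out; the daemon may be hung")
```

A return value of None means the client had to be killed, which is a hint that the daemon itself should be inspected or restarted.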

This script runs late in the init process (S95), and by that time username
resolution is already working: after gaining a shell via ssh, I can use
"getent passwd" to get records out of LDAP.


How reproducible: sometimes


Additional info:
nscd-2.8-3.i386
nss_ldap-259-3.fc9.i386
openldap-2.4.8-3.fc9.i386
shadow-utils-4.1.1-2.fc9.i386
2.6.25.3-18.fc9.i686
Our LDAP repository has a few large Unix groups, with ~900 entries each.

Comment 1 Josh Lange 2008-05-28 00:32:13 UTC

Created attachment 306865
subsequent nscd trace

Comment 2 Jan Safranek 2008-05-28 13:26:00 UTC

This could be related to bug #444618.

- do you have "nss_page_results yes" in your /etc/ldap.conf?
- when downgrading to F8 (openldap-2.3.x + appropriate nss_ldap version), does
the bug disappear?
- when you disable nscd, does "id <username>" hang, especially when the user is
in AD?

And please attach a stack trace of the nscd process running in the endless loop
(attach gdb to it and run "bt full").

It would be very helpful if I had access to the AD you use, with my
debugger, my openldap testing builds etc., but I assume it's not possible (I'm
just trying :)

Thanks in advance.

Comment 3 Josh Lange 2008-05-29 00:15:08 UTC

>do you have "nss_page_results yes" in your /etc/ldap.conf?
no

>when downgrading to F8 (openldap-2.3.x + appropriate nss_ldap version), does
>the bug disappear?
I'll try this if you want me to. Our Fedora 7 boxes run just fine.

>when you disable nscd, does "id <username>" hang, especially when the user is
>in AD?
Nope, it's a bit slow (~0.3 sec), but that's normal.

>bt full
The backtrace is not very useful yet; I need to mirror the debuginfo packages
first. I will do that today and have it available tomorrow:

(gdb) bt full
#0  0x0012e416 in __kernel_vsyscall ()
No symbol table info available.
#1  0x00283a46 in epoll_wait () from /lib/libc.so.6
No symbol table info available.
#2  0xb7f9196b in main_loop_epoll () from /usr/sbin/nscd
No symbol table info available.
#3  0xb7f91f98 in start_threads () from /usr/sbin/nscd
No symbol table info available.
#4  0xb7f8f0bd in main () from /usr/sbin/nscd
No symbol table info available.
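
The backtrace above shows only one thread, and it is blocked in epoll_wait, which is an idle wait rather than a busy loop. Assuming the CPU is being consumed by a different nscd thread (this is a guess, not something confirmed in this report), the spinning thread can be located by sampling per-thread CPU time from /proc. The sketch below is Linux-only; the function names are hypothetical, and the field positions follow the documented /proc/<pid>/stat layout (utime and stime are fields 14 and 15).

```python
import os


def parse_stat_line(line):
    """Return (tid, comm, cpu_ticks) from one /proc/<pid>/task/<tid>/stat line."""
    # comm is wrapped in parentheses and may contain spaces or ')',
    # so split on the first '(' and the last ')'.
    head, _, tail = line.partition("(")
    comm, _, rest = tail.rpartition(")")
    fields = rest.split()
    # After comm, index 0 is field 3 (state), so utime/stime are 11 and 12.
    return int(head), comm, int(fields[11]) + int(fields[12])


def thread_cpu_ticks(pid):
    """Map each thread ID of pid to its accumulated user+system CPU ticks."""
    ticks = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                t, _, cpu = parse_stat_line(f.read())
        except OSError:
            continue  # the thread exited between listdir and open
        ticks[t] = cpu
    return ticks


def busiest_thread(before, after):
    """Return (tid, delta) for the thread whose tick count grew the most."""
    deltas = {tid: after[tid] - before.get(tid, 0) for tid in after}
    return max(deltas.items(), key=lambda kv: kv[1])
```

Taking two thread_cpu_ticks samples a few seconds apart and passing them to busiest_thread gives the thread ID to look at; in gdb, "thread apply all bt full" prints the stacks of every thread rather than just the current one.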



>It would be very helpful if I would have access to the AD you use, with my
>debugger, my openldap testing builds etc, but I assume it's not possible (I'm
>just trying :)

Sure, actually. I will send you the login details in a private email.

Comment 4 Jan Safranek 2008-05-29 11:13:25 UTC

The stack trace does not look like a bug in openldap; it looks like a problem
in nscd itself. Reassigning to nscd owner and adding nss_ldap owner to cc:, just
to have a look if something rings a bell.

I also tried to reproduce the bug on your system, but "useradd jsafrane_test"
always succeeded (tried ~10 times, not in an initscript). Is something else
necessary to reproduce the bug?

Jakub, Nalin, I can provide login details to the reporter's buggy machine, if
you are interested.

Comment 5 Josh Lange 2008-05-29 17:07:04 UTC

It happens 100% of the time, on all the systems I have tried, during the first boot.

I can't reproduce it after that point.

I am going to make an attempt to copy the nscd cache state right before the
useradd command, to see if I can reproduce this error state.

I will also set up a host to do a full strace on nscd and useradd while the
command gets executed.

Comment 6 Josh Lange 2008-05-29 19:25:15 UTC

Created attachment 307126
nscd and useradd strace

Comment 7 Josh Lange 2008-05-29 19:25:48 UTC

On the system that I copied the nscd cache from:
I can't reproduce this by copying /var/db/nscd/* and setting the clock back
on a host system. It also seems that restarting nscd before running useradd
prevents the hang (I took a copy of the cache state before and after
stopping nscd).

On the system that I did the strace on:
Another hang, like all the rest. I have attached the strace output from useradd
and the server nscd (strace -f).
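
A large "strace -f" log is easier to read once the calls that never returned are singled out. The following Python sketch assumes the log was written with "strace -f -o FILE", where each line starts with a PID and interrupted calls are marked "<unfinished ...>" and later "<... NAME resumed>". The function name and the sample lines in the usage note are made up for illustration and are not taken from the attachment.

```python
import re

# "PID  name(args <unfinished ...>" marks a call that has not returned yet.
UNFINISHED = re.compile(r"^(\d+)\s+(\w+)\(.*<unfinished \.\.\.>\s*$")
# "PID  <... name resumed> ..." marks the matching return.
RESUMED = re.compile(r"^(\d+)\s+<\.\.\. (\w+) resumed>")


def pending_syscalls(lines):
    """Return {pid: syscall} for calls still unfinished at the end of the log."""
    pending = {}
    for line in lines:
        m = UNFINISHED.match(line)
        if m:
            pending[int(m.group(1))] = m.group(2)
            continue
        m = RESUMED.match(line)
        if m:
            pending.pop(int(m.group(1)), None)
    return pending
```

For a log in which PID 2301 issues "read(3,  <unfinished ...>" and never resumes, the result would contain 2301 mapped to "read", matching the symptom of a client stuck reading file descriptor 3.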

Comment 8 Ulrich Drepper 2008-08-03 03:22:55 UTC

Should work nicely in current version.
