Olivier Fourdan
2009-02-02 17:54:22 UTC
Description of problem:
nscd is segfaulting when the cache is used intensively.
Version-Release number of selected component (if applicable):
nscd-2.5-34
How reproducible:
100% reproducible
Steps to Reproduce:
Run a few instances of "perl -e 'while (1) { getpwuid(int(rand(100000))); }'" in parallel.
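For systems without perl, the same load could probably be generated from C. The following is only a sketch of one worker: the function name hammer_getpwuid and its parameters are made up for illustration and are not part of nscd or glibc. It is bounded rather than endless so that it can be driven from a small main() or a test.

```c
#include <pwd.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper (not part of nscd or glibc): perform `iterations`
   getpwuid() lookups on pseudo-random UIDs in [0, uid_range) and return
   how many of them resolved to an existing account. */
unsigned long hammer_getpwuid(unsigned long iterations, unsigned int uid_range)
{
    unsigned long found = 0;

    if (uid_range == 0)
        return 0;                      /* avoid a modulo by zero */

    for (unsigned long i = 0; i < iterations; ++i) {
        uid_t uid = (uid_t)(rand() % uid_range);

        /* Each lookup goes through NSS, and through nscd when it is running. */
        if (getpwuid(uid) != NULL)
            ++found;
    }
    return found;
}
```

Calling hammer_getpwuid(1000000, 100000) in a loop from main(), with several copies of the program running at once, should approximate the perl one-liner above. I have not verified that it triggers the crash as reliably.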
Actual results:
nscd segfaults after a few minutes with the following backtrace:
(gdb) bt
#0 gc (db=0x2ad0a7e2f120) at mem.c:342
#1 0x00002ad0a7c22b3f in prune_cache (table=0x2ad0a7e2f120, now=1228862068, fd=-1) at cache.c:486
#2 0x00002ad0a7c1c2f3 in nscd_run (p=0x42060a60) at connections.c:1489
#3 0x00002ad0a825c367 in start_thread (arg=<value optimized out>) at pthread_create.c:297
#4 0x00002ad0a8d8ef7d in clone () from /lib64/libc.so.6
336 do
337 {
338 assert ((*next_data)->key >= (*next_data)->packet);
339 assert ((*next_data)->key + (*next_data)->len
340 <= (*next_data)->packet + dh->allocsize);
341
342 ==> (*next_data)->packet -= disp;
343 (*next_data)->key -= disp;
344 ++next_data;
345 }
346 while (next_data < &he_data[db->head->nentries]
347 && (*next_data)->packet == off_alloc);
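To make it clearer what the loop around mem.c:342 is doing, here is a simplified model of that relocation step. It is not nscd's code: struct entry, relocate_refs and the use of plain offsets instead of pointers are assumptions made for illustration. The two asserts from the excerpt become a validation pass, so a corrupt reference is reported instead of being modified.

```c
#include <stddef.h>

/* Hypothetical stand-in for a cache hash entry; offsets replace pointers. */
struct entry {
    size_t packet;   /* offset of the record in the data area */
    size_t key;      /* offset of the key, expected to lie inside the record */
    size_t len;      /* length of the key */
};

/* The record at `block_off` (size `block_size`) is moved down by `disp`.
   Adjust every entry in refs[0..nrefs) that points at that record.
   Returns the number of entries adjusted, or -1 (nothing modified) if any
   reference is NULL or fails the same checks as the asserts in the excerpt. */
long relocate_refs(struct entry **refs, size_t nrefs,
                   size_t block_off, size_t block_size, size_t disp)
{
    long moved = 0;

    if (disp > block_off)
        return -1;                       /* the move would underflow */

    /* Pass 1: validate, so that a bad entry leaves everything untouched. */
    for (size_t i = 0; i < nrefs; ++i) {
        const struct entry *e = refs[i];

        if (e == NULL)
            return -1;
        if (e->packet != block_off)
            continue;
        if (e->key < e->packet            /* key >= packet */
            || e->len > block_size        /* key + len <= packet + size */
            || e->key - e->packet > block_size - e->len)
            return -1;
    }

    /* Pass 2: apply the displacement, as lines 342 and 343 do. */
    for (size_t i = 0; i < nrefs; ++i) {
        struct entry *e = refs[i];

        if (e->packet == block_off) {
            e->packet -= disp;
            e->key -= disp;
            ++moved;
        }
    }
    return moved;
}
```

In this model, a crash at the equivalent of line 342 would correspond to refs[i] itself being an invalid pointer, which the sketch cannot detect beyond the NULL check.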
Expected results:
nscd does not crash
Additional info:
Initially, the problem was reported with nss_ldap but it can be reproduced without nss_ldap and with a default nscd configuration.
This could be a duplicate of other nscd crash reports, namely rhbz#464918 and rhbz#443713, or even rhbz#241073; however, the reproducer does not seem identical.
I thought this might have been upstream bugs #5381 and #5382:
http://sourceware.org/bugzilla/show_bug.cgi?id=5381
http://sourceware.org/bugzilla/show_bug.cgi?id=5382
But the same problem is still reproducible with nscd from glibc-2.9-3 in Fedora 10, which contains fixes for the two bugs above, though with a slightly different backtrace:
#0 memcpy () at ../sysdeps/i386/i686/memcpy.S:75
#1 0xaf218008 in ?? ()
#2 0x00d6af94 in gc (db=0xd7b040) at ../string/bits/string3.h:52
#3 0x00d69df7 in prune_cache (table=0xd7b040, now=1233244915, fd=-1) at cache.c:521
#4 0x00d5e72c in nscd_run_prune (p=0x0) at connections.c:1528
#5 0x0045651f in start_thread (arg=0xafdddb90) at pthread_create.c:297
#6 0x0057804e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:130
I tried backporting the nscd code from current CVS on sourceware.org and the crash still occurs, so I suspect that this bug might still be present in the upstream CVS code.
The backtrace is not always identical; for example, I have also seen this on EL5.3:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x40518940 (LWP 5051)]
0x00002ba0001ca57b in gc (db=0x2ba0003d6120) at mem.c:90
90 mark[elem++] |= 0xff << (start % BITS);
(gdb) bt
#0 0x00002ba0001ca57b in gc (db=0x2ba0003d6120) at mem.c:90
#1 0x00002ba0001c9b3f in prune_cache (table=0x2ba0003d6120, now=1233246853, fd=-1) at cache.c:486
#2 0x00002ba0001c32f3 in nscd_run (p=0x40517a60) at connections.c:1489
#3 0x00002ba000803367 in start_thread (arg=<value optimized out>) at pthread_create.c:297
#4 0x00002ba00133a0ad in clone () from /lib64/libc.so.6
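The faulting statement at mem.c:90 sets bits in a bitmap indexed by an offset. As an illustration of why a bad offset would segfault there, here is a minimal sketch of marking a bit range with an explicit bounds check. The name mark_range, its parameters and the byte-wide BITS definition are assumptions (the 0xff mask in the faulting line suggests byte-sized elements); this is not the code from mem.c.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define BITS CHAR_BIT   /* assumed: one bitmap element is one byte */

/* Hypothetical helper: set bits [start, start + len) in `mark`, which
   holds `nelems` bytes. Returns 0 on success, or -1 without touching
   the bitmap if the range would fall outside it. */
int mark_range(uint8_t *mark, size_t nelems, size_t start, size_t len)
{
    size_t total = nelems * BITS;

    /* Written to avoid overflow in start + len. */
    if (len > total || start > total - len)
        return -1;

    for (size_t bit = start; bit < start + len; ++bit)
        mark[bit / BITS] |= (uint8_t)(1u << (bit % BITS));

    return 0;
}
```

Without such a check, a start value derived from a corrupt record would index past the end of mark, which would be consistent with the SIGSEGV above.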
Reducing the size of the sample from 100000 to 100 seems to make the error less likely to occur.
I have been able to run "perl -e 'while (1) { getpwuid(int(rand(100))); }'" in parallel for hours without a crash (though this does not prove that the error would never occur).
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2009-1415.html