Bug 483636 - nscd segfaults under heavy load
Summary: nscd segfaults under heavy load
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: glibc
Version: 5.3
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Jakub Jelinek
QA Contact: BaseOS QE
URL:
Whiteboard:
Duplicates: 443713
Depends On:
Blocks:
 
Reported: 2009-02-02 17:54 UTC by Olivier Fourdan
Modified: 2018-10-20 02:11 UTC (History)
CC List: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-09-02 11:44:22 UTC
Target Upstream Version:
Embargoed:


Links
System: Red Hat Product Errata
ID: RHBA-2009:1415
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: glibc bug fix and enhancement update
Last Updated: 2009-09-01 14:25:37 UTC

Description Olivier Fourdan 2009-02-02 17:54:22 UTC
Description of problem:

nscd is segfaulting when the cache is used intensively.

Version-Release number of selected component (if applicable):

nscd-2.5-34

How reproducible:

100% reproducible

Steps to Reproduce:

Run several instances of "perl -e 'while (1) { getpwuid(int(rand(100000))); }'" in parallel.
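
The same load pattern can be sketched in C, which takes the Perl interpreter out of the picture. This is a minimal sketch and not part of the original report: the function name hammer_getpwuid and the HAMMER_MAIN macro are made up for illustration. It also assumes the lookups actually reach nscd, which is only the case when nscd is running with passwd caching enabled.

```c
#include <assert.h>
#include <pwd.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper (not from the report): issue `iterations` passwd
   lookups for random UIDs in [0, uid_range).  Returns the number of
   lookups issued; a lookup that finds no user still counts.  */
static unsigned long
hammer_getpwuid (unsigned long iterations, unsigned long uid_range)
{
  unsigned long done = 0;

  assert (uid_range > 0);
  while (done < iterations)
    {
      uid_t uid = (uid_t) ((unsigned long) rand () % uid_range);
      (void) getpwuid (uid);   /* result ignored, only the load matters */
      ++done;
    }
  return done;
}

#ifdef HAMMER_MAIN
/* Standalone mode: loop forever, like the Perl one-liner.
   Optional argument: size of the UID range (default 100000).  */
int
main (int argc, char **argv)
{
  unsigned long range = argc > 1 ? strtoul (argv[1], NULL, 10) : 100000UL;

  if (range == 0)
    return 1;
  for (;;)
    hammer_getpwuid (1000000UL, range);
}
#endif
```

Built with something like `cc -DHAMMER_MAIN -o hammer hammer.c` (file name arbitrary), a few instances can be started in parallel, passing 100000 or 100 as the argument to vary the UID range.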
  
Actual results:

nscd segfaults after a few minutes with the following backtrace:

	(gdb) bt
	#0  gc (db=0x2ad0a7e2f120) at mem.c:342
	#1  0x00002ad0a7c22b3f in prune_cache (table=0x2ad0a7e2f120, now=1228862068, fd=-1) at cache.c:486
	#2  0x00002ad0a7c1c2f3 in nscd_run (p=0x42060a60) at connections.c:1489
	#3  0x00002ad0a825c367 in start_thread (arg=<value optimized out>) at pthread_create.c:297
	#4  0x00002ad0a8d8ef7d in clone () from /lib64/libc.so.6 

	336               do
	337                 {
	338                   assert ((*next_data)->key >= (*next_data)->packet);
	339                   assert ((*next_data)->key + (*next_data)->len
	340                           <= (*next_data)->packet + dh->allocsize);
	341
	342 ==>               (*next_data)->packet -= disp;
	343                   (*next_data)->key -= disp;
	344                   ++next_data;
	345                 }
	346               while (next_data < &he_data[db->head->nentries]
	347                      && (*next_data)->packet == off_alloc);

Expected results:

nscd does not crash

Additional info:

Initially, the problem was reported with nss_ldap but it can be reproduced without nss_ldap and with a default nscd configuration.

This could be a duplicate of other nscd crash reports, namely rhbz#464918 and rhbz#443713, or even rhbz#241073; however, the reproducer does not seem identical.

I thought this might have been upstream bugs #5381 and #5382:

    http://sourceware.org/bugzilla/show_bug.cgi?id=5381
    http://sourceware.org/bugzilla/show_bug.cgi?id=5382

However, the same problem is still reproducible with nscd from glibc-2.9-3 in Fedora 10, which contains fixes for the two bugs above, although with a slightly different backtrace:

	#0  memcpy () at ../sysdeps/i386/i686/memcpy.S:75
	#1  0xaf218008 in ?? ()
	#2  0x00d6af94 in gc (db=0xd7b040) at ../string/bits/string3.h:52
	#3  0x00d69df7 in prune_cache (table=0xd7b040, now=1233244915, fd=-1) at cache.c:521
	#4  0x00d5e72c in nscd_run_prune (p=0x0) at connections.c:1528
	#5  0x0045651f in start_thread (arg=0xafdddb90) at pthread_create.c:297
	#6  0x0057804e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:130

I tried backporting the nscd code from current CVS at sourceware.org and the crash still occurs, so I suspect that this bug is still present in the upstream CVS code.

The backtrace is not always identical. For example, I have also seen this on EL5.3:

	Program received signal SIGSEGV, Segmentation fault.
	[Switching to Thread 0x40518940 (LWP 5051)]
	0x00002ba0001ca57b in gc (db=0x2ba0003d6120) at mem.c:90
	90            mark[elem++] |= 0xff << (start % BITS);
	(gdb) bt
	#0  0x00002ba0001ca57b in gc (db=0x2ba0003d6120) at mem.c:90
	#1  0x00002ba0001c9b3f in prune_cache (table=0x2ba0003d6120, now=1233246853, fd=-1) at cache.c:486
	#2  0x00002ba0001c32f3 in nscd_run (p=0x40517a60) at connections.c:1489
	#3  0x00002ba000803367 in start_thread (arg=<value optimized out>) at pthread_create.c:297
	#4  0x00002ba00133a0ad in clone () from /lib64/libc.so.6 
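
The faulting statement at mem.c:90 sets the upper bits of one bitmap element and advances the index. A write of that shape faults if the index lands outside the bitmap; whether that is what happened here is not established by this report. The following is only an illustrative sketch of marking a bit range in a byte bitmap with an explicit bounds check. It is not the glibc code, and the name mark_range_checked and its parameters are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS CHAR_BIT   /* bits per bitmap element; the sketch assumes 8 */

/* Hypothetical helper, not the glibc function: set bits
   [start, start + len) in `mark`, which holds `nbytes` bytes.
   Returns 0 on success, -1 if the range does not fit in the bitmap.  */
static int
mark_range_checked (unsigned char *mark, size_t nbytes,
                    size_t start, size_t len)
{
  if (start > nbytes * BITS || len > nbytes * BITS - start)
    return -1;

  size_t elem = start / BITS;

  if (start % BITS != 0 && len > 0)
    {
      size_t room = BITS - start % BITS;   /* bits left in the first byte */

      if (len <= room)
        {
          /* The whole range fits inside this one byte.  */
          mark[elem] |= (unsigned char) ((0xffu >> (BITS - len))
                                         << (start % BITS));
          return 0;
        }

      /* Same shape as the statement in the backtrace: fill the rest
         of the current byte, then move on to the next one.  */
      mark[elem++] |= (unsigned char) (0xffu << (start % BITS));
      len -= room;
    }

  /* Whole bytes in the middle of the range.  */
  while (len >= BITS)
    {
      mark[elem++] = 0xff;
      len -= BITS;
    }

  /* Remaining low bits of the last byte.  */
  if (len > 0)
    mark[elem] |= (unsigned char) (0xffu >> (BITS - len));
  return 0;
}
```

With the check in place, a range that runs past the end of the bitmap is rejected instead of being written out of bounds.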

Reducing the range of random UIDs from 100000 to 100 seems to make the crash less likely to occur.

I have been able to run "perl -e 'while (1) { getpwuid(int(rand(100))); }'" in parallel for hours without a crash (though this does not prove the error would never occur).

Comment 54 errata-xmlrpc 2009-09-02 11:44:22 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1415.html

Comment 56 Jeff Law 2012-01-20 08:04:21 UTC
*** Bug 443713 has been marked as a duplicate of this bug. ***

