Bug 483636

Summary:	nscd segfaults under heavy load
Product:	Red Hat Enterprise Linux 5	Reporter:	Olivier Fourdan <ofourdan>
Component:	glibc	Assignee:	Jakub Jelinek <jakub>
Status:	CLOSED ERRATA	QA Contact:	BaseOS QE <qe-baseos-auto>
Severity:	high	Docs Contact:
Priority:	high
Version:	5.3	CC:	codonell, ddomingo, drepper, dwalsh, fweimer, kem, pmuller, rdassen, richton, sdsmall, sgrubb, sputhenp, tao
Target Milestone:	rc
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2009-09-02 11:44:22 UTC	Type:	---

Description Olivier Fourdan 2009-02-02 17:54:22 UTC

Description of problem:

nscd is segfaulting when the cache is used intensively.

Version-Release number of selected component (if applicable):

nscd-2.5-34

How reproducible:

100% reproducible

Steps to Reproduce:

Run a few instances of the following command in parallel:

	perl -e 'while (1) { getpwuid(int(rand(100000))); }'
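
The Perl one-liner in the steps above can also be written as a small script that starts the parallel workers itself. This is a minimal sketch, not part of the original report: the names `hammer_getpwuid` and `run_parallel`, the worker count of 4, and the short default run time are assumptions chosen for illustration. It relies on Python's `pwd.getpwuid`, which should go through the same libc lookup path (and therefore nscd, when it is running) as Perl's `getpwuid`.

```python
import multiprocessing
import pwd
import random
import sys
import time


def hammer_getpwuid(uid_range, duration_s, seed=None):
    """Look up random UIDs in [0, uid_range) until duration_s has elapsed.

    Returns (hits, misses). At least one lookup is always performed.
    """
    rng = random.Random(seed)
    hits = misses = 0
    deadline = time.monotonic() + duration_s
    while True:
        try:
            pwd.getpwuid(rng.randrange(uid_range))
            hits += 1
        except KeyError:
            # UID has no passwd entry; the lookup still went through NSS.
            misses += 1
        if time.monotonic() >= deadline:
            break
    return hits, misses


def run_parallel(workers, uid_range, duration_s):
    """Run `workers` lookup loops in separate processes (hypothetical helper)."""
    jobs = [(uid_range, duration_s, seed) for seed in range(workers)]
    with multiprocessing.Pool(workers) as pool:
        return pool.starmap(hammer_getpwuid, jobs)


if __name__ == "__main__":
    # Defaults give a short smoke run; pass a longer duration to stress nscd.
    duration_s = float(sys.argv[1]) if len(sys.argv) > 1 else 2.0
    uid_range = int(sys.argv[2]) if len(sys.argv) > 2 else 100000
    results = run_parallel(4, uid_range, duration_s)
    print("per-worker (hits, misses):", results)
```

Since the crash was reported to appear only after a few minutes, a run of several hundred seconds (first argument) with the default range of 100000 is the closest match to the original reproducer. The script only generates load; whether nscd crashed still has to be checked separately, for example in gdb as in the backtraces in this report.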
  
Actual results:

nscd segfaults after a few minutes with the following backtrace:

	(gdb) bt
	#0  gc (db=0x2ad0a7e2f120) at mem.c:342
	#1  0x00002ad0a7c22b3f in prune_cache (table=0x2ad0a7e2f120, now=1228862068, fd=-1) at cache.c:486
	#2  0x00002ad0a7c1c2f3 in nscd_run (p=0x42060a60) at connections.c:1489
	#3  0x00002ad0a825c367 in start_thread (arg=<value optimized out>) at pthread_create.c:297
	#4  0x00002ad0a8d8ef7d in clone () from /lib64/libc.so.6 

	336               do
	337                 {
	338                   assert ((*next_data)->key >= (*next_data)->packet);
	339                   assert ((*next_data)->key + (*next_data)->len
	340                           <= (*next_data)->packet + dh->allocsize);
	341
	342 ==>               (*next_data)->packet -= disp;
	343                   (*next_data)->key -= disp;
	344                   ++next_data;
	345                 }
	346               while (next_data < &he_data[db->head->nentries]
	347                      && (*next_data)->packet == off_alloc);

Expected results:

nscd does not crash

Additional info:

Initially, the problem was reported with nss_ldap, but it can be reproduced without nss_ldap and with a default nscd configuration.

This could be a duplicate of other nscd crash reports, namely rhbz#464918 and rhbz#443713, or even rhbz#241073; however, the reproducer does not seem identical.

I thought this might have been upstream bugs #5381 and #5382:

    http://sourceware.org/bugzilla/show_bug.cgi?id=5381
    http://sourceware.org/bugzilla/show_bug.cgi?id=5382

However, the same problem is still reproducible with nscd from glibc-2.9-3 in Fedora 10, which contains the fixes for the two bugs above, although with a slightly different backtrace:

	#0  memcpy () at ../sysdeps/i386/i686/memcpy.S:75
	#1  0xaf218008 in ?? ()
	#2  0x00d6af94 in gc (db=0xd7b040) at ../string/bits/string3.h:52
	#3  0x00d69df7 in prune_cache (table=0xd7b040, now=1233244915, fd=-1) at cache.c:521
	#4  0x00d5e72c in nscd_run_prune (p=0x0) at connections.c:1528
	#5  0x0045651f in start_thread (arg=0xafdddb90) at pthread_create.c:297
	#6  0x0057804e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:130

I tried backporting the nscd code from current CVS at sourceware.org and the crash still occurs, so I suspect that this bug is still present in the upstream CVS code.

The backtrace is not always identical; for example, I have also seen this on EL5.3:

	Program received signal SIGSEGV, Segmentation fault.
	[Switching to Thread 0x40518940 (LWP 5051)]
	0x00002ba0001ca57b in gc (db=0x2ba0003d6120) at mem.c:90
	90            mark[elem++] |= 0xff << (start % BITS);
	(gdb) bt
	#0  0x00002ba0001ca57b in gc (db=0x2ba0003d6120) at mem.c:90
	#1  0x00002ba0001c9b3f in prune_cache (table=0x2ba0003d6120, now=1233246853, fd=-1) at cache.c:486
	#2  0x00002ba0001c32f3 in nscd_run (p=0x40517a60) at connections.c:1489
	#3  0x00002ba000803367 in start_thread (arg=<value optimized out>) at pthread_create.c:297
	#4  0x00002ba00133a0ad in clone () from /lib64/libc.so.6 
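
For readers unfamiliar with the statement at mem.c:90, it sets bits in an array named `mark`. The following is a simplified Python model of that kind of range marking, not the glibc code: the function name `mark_range`, the one-byte element size, and the explicit bounds check are assumptions for illustration. It only shows that an out-of-range start or length would index past the end of the array; it is not a diagnosis of this crash.

```python
BITS = 8  # assumed: one byte per element of the mark array


def mark_range(mark, start, length):
    """Set bits [start, start + length) in the bytearray `mark`.

    Raises ValueError if the range does not fit in the array. A C
    implementation without such a check would write out of bounds instead.
    """
    end = start + length
    if start < 0 or length < 0 or end > len(mark) * BITS:
        raise ValueError("bit range %d..%d outside mark array" % (start, end))
    for bit in range(start, end):
        mark[bit // BITS] |= 1 << (bit % BITS)


if __name__ == "__main__":
    mark = bytearray(2)
    mark_range(mark, 3, 7)
    print(mark.hex())
```

This model sets one bit at a time for clarity, whereas the line shown in the backtrace handles a partially covered first element with a single shift.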

Reducing the UID range from 100000 to 100 seems to make the crash less likely to occur.

I have been able to run "perl -e 'while (1) { getpwuid(int(rand(100))); }'" in parallel for hours without a crash (though this does not prove the crash would never occur at some point).

Comment 54 errata-xmlrpc 2009-09-02 11:44:22 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1415.html

Comment 56 Jeff Law 2012-01-20 08:04:21 UTC

*** Bug 443713 has been marked as a duplicate of this bug. ***