Bug 443713

Summary:	[RHEL5] nscd SEGV's periodically
Product:	Red Hat Enterprise Linux 5	Reporter:	Aaron Richton <richton>
Component:	glibc	Assignee:	Jeff Law <law>
Status:	CLOSED DUPLICATE	QA Contact:	Brian Brock <bbrock>
Severity:	medium	Docs Contact:
Priority:	low
Version:	5.1	CC:	aoliva, drepper, fweimer, jakub, law
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2012-01-20 08:04:21 UTC	Type:	---

Description Aaron Richton 2008-04-22 23:09:43 UTC

Description of problem:
nscd segfaults periodically. I was hoping to catch it under valgrind, but
valgrind apparently does not support some of the syscalls nscd uses, so that
is not an option. I may try MALLOC_CHECK_ instead.

Version-Release number of selected component (if applicable):
glibc-2.5-18.el5_1.1

How reproducible:
The crashes are pretty consistent...maybe a few days apart across six servers.

Steps to Reproduce:
1. "/sbin/service nscd start"
2. wait a few days...
  
Actual results:
core dump

Expected results:
no core dump

Additional info:
Core was generated by `/usr/sbin/nscd'.
Program terminated with signal 11, Segmentation fault.
#0  gc (db=0x55555576e330) at mem.c:96
96            mark[elem++] = ALLBITS;

Comment 1 Aaron Richton 2008-04-23 12:59:06 UTC

OK, it crashed with MALLOC_CHECK_=3. No change in the backtrace:
Core was generated by `/usr/sbin/nscd'.
Program terminated with signal 11, Segmentation fault.
#0  gc (db=0x55555576e330) at mem.c:96
96            mark[elem++] = ALLBITS;

#0  gc (db=0x55555576e330) at mem.c:96
#1  0xffffffffffffffff in ?? ()
#2  0xffffffffffffffff in ?? ()
#3  0xffffffffffffffff in ?? ()
[...]
#806 0xffffffffffffffff in ?? ()
#807 0xffffffffffffffff in ?? ()
#808 0x0000000000000000 in ?? ()

Comment 2 Aaron Richton 2008-04-23 20:41:54 UTC

I turned up the debug level and see that nscd crashed removing entries:

# tail -4 /var/log/nscd.log 
28398: remove GETHOSTBYADDR entry "218.57.182.136"
28398: remove GETHOSTBYADDR entry "77.210.97.189"
28398: remove GETHOSTBYNAME entry "adsl-79-81.ttk.if.ua"
28398: remove GETHOSTBYADDR entry "24.244.158.246"

The backtrace is different now and seems to match the log:
Core was generated by `/usr/sbin/nscd'.
Program terminated with signal 11, Segmentation fault.
#0  0x00005555555637d1 in gc (db=0x55555576e330) at mem.c:303
303           new_move->from = db->data + off_alloc;
(gdb) where
#0  0x00005555555637d1 in gc (db=0x55555576e330) at mem.c:303
#1  0x0000555555562a0f in prune_cache (table=0x55555576e330, now=1208956471, fd=-1) at cache.c:486
#2  0x000055555555c303 in nscd_run (p=0x1) at connections.c:1484
#3  0x00002aaaaaed52f7 in start_thread (arg=<value optimized out>) at pthread_create.c:296
#4  0x00002aaaaba0185d in clone () from /lib64/libc.so.6

Note that this is still with MALLOC_CHECK_=3.

Comment 3 Alexandre Oliva 2012-01-20 07:34:53 UTC

I'm pretty sure both of these stack traces are caused by excessive use of alloca(): for mark in one case, and for new_move in the other.  The second crash is certainly at the first use of new_move's storage, right after allocation, and the first is possibly at the first use of mark.  When the cache grows large enough, we end up allocating too much stack space for cache garbage collection, and since we start accessing that space from the bottom, we may end up touching unmapped pages below the bottom of the allocated stack.

glibc 2.5-36 and newer fix this problem by using alloca only for small-enough allocations, and using malloc() otherwise.
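
For illustration only, the "alloca for small sizes, malloc otherwise" pattern described above can be sketched roughly as follows. This is not the actual glibc patch: the function name mark_all_words, the MAX_STACK_USE threshold and its value, and the ALLBITS definition used here are all assumptions made for the example.

```c
#include <alloca.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for the bitmap fill value; the real definition in nscd differs. */
#define ALLBITS UINT32_MAX

/* Hypothetical cap on how much stack a single call may take via alloca(). */
#define MAX_STACK_USE (64 * 1024)

/* Fill a scratch bitmap of nwords words and return how many words were set,
   or (size_t) -1 on overflow or allocation failure.  Small bitmaps live on
   the stack; large ones go to the heap so they cannot run off the stack. */
size_t mark_all_words(size_t nwords)
{
    if (nwords > SIZE_MAX / sizeof(uint32_t))
        return (size_t) -1;

    size_t bytes = nwords * sizeof(uint32_t);
    bool on_heap = bytes > MAX_STACK_USE;
    uint32_t *mark;

    if (on_heap) {
        mark = malloc(bytes);
        if (mark == NULL)
            return (size_t) -1;
    } else {
        mark = alloca(bytes);
    }

    /* Same access pattern as the crashing line: walk up from the lowest address. */
    size_t elem = 0;
    while (elem < nwords)
        mark[elem++] = ALLBITS;

    size_t marked = 0;
    for (size_t i = 0; i < nwords; i++)
        if (mark[i] == ALLBITS)
            marked++;

    /* Only heap memory is freed; alloca() storage goes away on return. */
    if (on_heap)
        free(mark);
    return marked;
}
```

With a guard like this, a request far larger than the thread's stack either succeeds on the heap or fails cleanly with an error return, instead of faulting on the first store.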

Comment 4 Jeff Law 2012-01-20 08:04:21 UTC

As noted, this bug is a duplicate of 483636.  Bug 483636 was fixed by this erratum:  http://rhn.redhat.com/errata/RHBA-2009-1415.html

*** This bug has been marked as a duplicate of bug 483636 ***