Bug 170320 - [RHEL4] NSCD segfaults after update to RHEL4 U2
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: glibc
Version: 4.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Andreas Schwab
QA Contact: Brian Brock
Reported: 2005-10-10 18:05 UTC by James Cooley
Modified: 2018-10-19 20:35 UTC

Doc Type: Bug Fix
Last Closed: 2010-06-07 05:05:46 UTC


Attachments
Backtrace of the segfault (1.03 KB, text/plain)
2005-10-10 18:07 UTC, James Cooley
strace of the segfault (2.30 KB, text/plain)
2005-10-10 18:10 UTC, James Cooley
Backtrace with debuginfo (1.75 KB, text/plain)
2005-10-12 21:46 UTC, James Cooley
Extra info on frames 4 to 8 (2.98 KB, text/plain)
2005-10-13 02:08 UTC, James Cooley
x/ls keystr (1.20 KB, text/plain)
2005-10-13 02:17 UTC, James Cooley
gdb output without using -d option (2.06 KB, text/plain)
2005-10-13 18:46 UTC, James Cooley

Description James Cooley 2005-10-10 18:05:18 UTC

Description of problem:
After updating to RHEL4 U2, nscd randomly segfaults.  Looking at a backtrace, this appears to happen 
when getting certain entries from the nscd cache.

If nscd is set to keep a persistent cache, the segfault occurs within a few seconds of restarting the
process.  Deleting the database files in /var/db/nscd lets nscd run for a while, usually somewhere
between one and several hours, until it encounters an entry that causes it to segfault.

This behavior has only been noticed on our x86_64 machine.  Our i386 versions of Red Hat Enterprise
appear not to have this issue.

This happens with both the 2.6.9-11 and 2.6.9-22 kernels.

I don't know what particular entry in the cache is causing the problem.

I'm attaching a backtrace and some strace output of the segfault.

Version-Release number of selected component (if applicable):
glibc-2.3.4-2.13   

How reproducible:
Always

Steps to Reproduce:
1.    Start NSCD with the hosts cache enabled
2.    Wait a while
3.    NSCD segfaults when accessing certain entries in the cache
  

Actual Results:  NSCD Segfaults leaving subsys locked

Expected Results:  NSCD retrieves entry from cache and continues working

Additional info:

2.6.9-22.ELsmp #1 SMP Mon Sep 19 18:00:54 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux

Comment 1 James Cooley 2005-10-10 18:07:33 UTC
Created attachment 119781
Backtrace of the segfault

Comment 2 James Cooley 2005-10-10 18:10:52 UTC
Created attachment 119782
strace of the segfault

This is an strace of nscd using the same cache database files as the backtrace.

Comment 3 Jakub Jelinek 2005-10-12 06:51:29 UTC
The backtrace is certainly weird.  Can you please install glibc-debuginfo
from
ftp://people.redhat.com/jakub/glibc/2.3.4-2.13/
and get a more accurate backtrace?
E.g. there are no vsnprintf calls in nss/, resolv/.

Comment 4 James Cooley 2005-10-12 21:46:44 UTC
Created attachment 119856
Backtrace with debuginfo

This is a backtrace of the segfault with debuginfo included

Comment 5 James Cooley 2005-10-12 21:50:25 UTC
I was wrong initially.  This doesn't appear to be limited to the hosts cache, since the problem still exists 
when hosts cache is disabled.   I posted a new backtrace that had debuginfo included.

Comment 6 Jakub Jelinek 2005-10-12 22:03:24 UTC
Can you reproduce the problem even without -d?
Can you go up 4 frames in gdb and, in the addpwbyX frame, run x/1s keystr?
From a quick look it sounds like some keys in the cache aren't zero terminated,
but the only place where they are used as C strings, rather than as chunks
of memory ->len bytes long, is in debugging printouts (in which case I guess
%.*s rather than %s in the format strings that print them would be sufficient).
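
To illustrate the %.*s suggestion, here is a minimal C sketch. The helper name format_key and the sample buffer are made up for illustration and are not taken from nscd; the only point is that a precision passed via %.*s limits how many bytes are read from a buffer that has no terminating zero.

```c
#include <stdio.h>

/* Hypothetical helper, not nscd code: format a key that is `len` bytes
   long and not necessarily zero terminated.  With "%.*s" the precision
   argument caps how many bytes are read from `key`, so no terminator is
   needed as long as `len` does not exceed the buffer size.  A plain
   "%s" would keep reading until it happened to find a zero byte.  */
static int
format_key (char *out, size_t outlen, const char *key, int len)
{
  return snprintf (out, outlen, "key=\"%.*s\"", len, key);
}
```

Called with a six-byte buffer holding `alice?` and a length of 5, this writes `key="alice"` and never looks at the sixth byte.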


Comment 7 James Cooley 2005-10-13 01:51:59 UTC
Yeah... the problem still exists, even without -d.  It was just easier to get info about what nscd was doing
by using the debug option.   I can get an strace for it without -d (do you want me to follow threads, or
stay with the parent process?).  I'm not too sure I understand the rest of the request though.

The problem occurs at random, but with a persistent cache, I can reproduce the problem every time I try 
until I reset the cache.  Installing nscd-2.3.4-2.9 makes the problem go away, even if I use the cache files 
that were causing the new version of nscd to segfault.

Comment 8 James Cooley 2005-10-13 02:08:26 UTC
Created attachment 119869
Extra info on frames 4 to 8

Here is additional info on frames 4 to 8

Comment 9 James Cooley 2005-10-13 02:17:19 UTC
Created attachment 119870
x/ls keystr

Comment 10 James Cooley 2005-10-13 02:20:33 UTC
I've posted the info that I think was requested.  As mentioned, the problem does exist even when not 
using -d, but I don't know how to get gdb to follow threads, since the threads die almost immediately 
before I can attach to them.



Comment 13 Jakub Jelinek 2005-10-13 17:47:46 UTC
As you said that you can reproduce the problem every time once you get the
persistent cache into some state, can you run nscd (without -d) directly under
gdb at that point, so you don't have to attach?
Also, could we get a copy of one of the cache files that's causing this
(whether as private attachment here, or mailing it to me directly)?
There weren't many nscd changes between U1 and U2; the only one relevant here
is that previously nscd was using a bad time, and therefore some cache entries
were never pruned.

Comment 14 James Cooley 2005-10-13 18:46:49 UTC
Created attachment 119941
gdb output without using -d option

This file has gdb output without using the -d option

Comment 15 Jakub Jelinek 2005-10-17 16:16:57 UTC
There is a corrupted entry in the passwd database file you posted:
$7 = {type = GETPWBYNAME, first = true, len = 6, key = 2865298694, owner = -1,
next = 4294967295, packet = 25648, {
    dellist = 0x2a9556c328, prevp = 0x2a9556c328}}
(gdb) p/x *here
$8 = {type = 0x0, first = 0x1, len = 0x6, key = 0xaac8fd06, owner = 0xffffffff,
next = 0xffffffff, packet = 0x6430, {
    dellist = 0x2a9556c328, prevp = 0x2a9556c328}}

here->key is clearly far beyond the end of the database (entry at 0x6868 in the
passwd db file).  `here' is the first entry in the chain (so directly referenced
from the hash table).  Having packet == -1 sounds weird as well.

The nscd db verifier (currently in rawhide, scheduled for RHEL4 U3) detects
this situation and reinitializes the database file.

But so far I have no idea why such corruption would appear (apart from hw
problems, which doesn't mean it can't be an nscd bug).
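
As a rough illustration of the kind of check that would flag the entry shown above, the C sketch below tests whether an entry's key and packet offsets fall inside the data area. The struct is a simplified stand-in whose field names are copied from the gdb output; the function names, the data_size parameter and the exact rules are assumptions and do not reproduce the actual nscd db verifier.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a cache hash entry.  Field names follow the
   gdb output above; the layout is NOT the real nscd one.  */
struct entry_sketch
{
  uint32_t len;     /* key length in bytes */
  uint32_t key;     /* offset of the key within the data area */
  uint32_t packet;  /* offset of the record the key belongs to */
};

/* True if [off, off + len) lies within a data area of data_size bytes.
   Written so that the comparison cannot overflow.  */
static bool
range_in_data (uint32_t off, uint32_t len, uint32_t data_size)
{
  return off <= data_size && len <= data_size - off;
}

/* Hypothetical sanity check: both offsets must point inside the data
   area, otherwise the entry is treated as corrupted.  */
static bool
entry_offsets_valid (const struct entry_sketch *e, uint32_t data_size)
{
  return range_in_data (e->key, e->len, data_size)
         && range_in_data (e->packet, 1, data_size);
}
```

With the values from the dump (len = 6, key = 0xaac8fd06, packet = 0x6430) and any data area smaller than about 2.6 GiB, entry_offsets_valid returns false because the key offset is out of range.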

Comment 16 Mikko Suomi 2005-11-09 21:35:12 UTC
Hi,

I seem to have this same problem with nscd. It segfaults, and if I start it again
it will segfault unless I delete the nscd database files. After deleting those files
it will run for a few hours to a few days and then segfault.

nscd[25661]: segfault at 0000002b401e2c42 rip 0000002a98c5f420 rsp 00000000401ff250 error 4

Red Hat AS 4 with all updates except the newest kernel.
2.6.9-22.ELsmp #1 SMP Mon Sep 19 18:00:54 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux
Hardware is a Dell PowerEdge 2850.

Comment 18 Gunther Schlegel 2006-01-20 15:34:21 UTC
I have the problem as well.

Comment 19 Eli Stair 2006-02-21 20:29:50 UTC
RHEL4.2, x86_64, 2.6.9-22

Seeing the same issue.  This started occurring on a system unchanged/uptime 20
days.  Removing the DBs resolves the issue immediately; will update if it
recurs after rebuilding the cache from zero.

Worth noting:  nscd doesn't re-build/re-size the DB if you increase the
suggested-size; it gives warnings in debug mode but continues to run with the
"original" smaller size tables on disk.

/eli

nscd[6684]: segfault at 0000002b9555c3b8 rip 000000552aab6c64 rsp 0000000040c04a60 error 4
nscd[10752]: segfault at 0000002b9555c3b8 rip 000000552aab6c64 rsp 0000000040c04a60 error 4
nscd[11497]: segfault at 0000002b9555c3b8 rip 000000552aab6c64 rsp 0000000040a03a60 error 4
nscd[12694]: segfault at 0000002b9555c3b8 rip 000000552aab6c64 rsp 0000000040601a60 error 4
nscd[14552]: segfault at 0000002b9555c3b8 rip 000000552aab6c64 rsp 0000000040400a60 error 4
nscd[15073]: segfault at 0000002b9555c3b8 rip 000000552aab6c64 rsp 0000000040c04a60 error 4
nscd[16465]: segfault at 0000002b9555c3cc rip 000000552aab6f65 rsp 00000000401ff800 error 4
nscd[17350]: segfault at 0000002b9555c3cc rip 000000552aab6f65 rsp 00000000401ff800 error 4
nscd[9286]: segfault at 0000002b9961e3b8 rip 000000552aab6c64 rsp 0000000040802a60 error 4
nscd[10249]: segfault at 0000002b9961e3cc rip 000000552aab6f65 rsp 00000000401ff800 error 4
nscd[11812]: segfault at 0000002b9961e3cc rip 000000552aab6f65 rsp 00000000401ff800 error 4
nscd[12885]: segfault at 0000002b9555c3cc rip 000000552aab6f65 rsp 00000000401ff800 error 4
nscd[13473]: segfault at 0000002b9961e3cc rip 000000552aab6f65 rsp 00000000401ff800 error 4
nscd[20113]: segfault at 0000002b9961e3cc rip 000000552aab6f65 rsp 00000000401ff800 error 4
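
For readers unfamiliar with the "error 4" field in these kernel messages: on x86 it is the page-fault error code, a bit mask. The sketch below decodes its three low bits; the function name and the wording of the output are illustrative only, and the higher bits are ignored.

```c
#include <stdio.h>

/* Illustrative decoder for the low bits of an x86 page-fault error code:
   bit 0: 0 = page not present, 1 = protection violation
   bit 1: 0 = read access,      1 = write access
   bit 2: 0 = kernel mode,      1 = user mode  */
static int
describe_fault (unsigned long code, char *out, size_t outlen)
{
  return snprintf (out, outlen, "%s-mode %s, %s",
                   (code & 4) ? "user" : "kernel",
                   (code & 2) ? "write" : "read",
                   (code & 1) ? "protection violation" : "page not present");
}
```

For code 4 this yields "user-mode read, page not present", which is what one would expect if nscd followed an offset pointing outside its mapped cache.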


Comment 20 dijuremo 2006-02-22 12:40:57 UTC
I have found the same problem, but in my case it seems that sendmail was
segfaulting while trying to do a host lookup (or at least the error cleared up
after nscd was fixed):

kernel: sendmail[28765]: segfault at 0000002b04bd421d rip 0000002a9657637f rsp 0000007fbfffc620 error 4

nscd did not crash, but was using 99.9% of one cpu.  At the same time I noticed
several ntpdate processes using 99.9% of the cpu. If nscd was off, ntpdate would
work correctly, but if nscd was on (it would start, then take 99.9% of cpu and
stay like that) ntpdate would hang.

Invalidating the caches did not fix the problem. Stopping nscd, erasing the
databases in /var/db/nscd as suggested, and restarting nscd fixed all the
problems (sendmail and ntpdate) on my x86_64 machines. I can also confirm that
my 4 machines running i386 do not have any nscd problems.

My machines are fully patched and running the latest kernel:
2.6.9-22.0.2.ELsmp #1 SMP Thu Jan 5 17:11:56 EST 2006 x86_64 x86_64 x86_64 GNU/Linux

Diego

Comment 22 Kostas Georgiou 2006-03-01 23:04:29 UTC
I just saw the same under x86 as well (corrupted database killing nscd). I've
disabled the persistent cache until U3 is out, but I kept the corrupted database
so I'll be able to test whether the db verifier can handle the problem.

Comment 25 Seth Vidal 2007-02-26 22:31:54 UTC
ping. Is there anything new on this? We're still seeing this problem in 4.4 on
x86_64 boxes.

Comment 26 Kostas Georgiou 2007-02-27 00:48:37 UTC
I haven't seen the problem after U3 on any of our machines, x86_64 included.

Comment 27 James Cooley 2007-02-27 01:07:54 UTC
We still see this, and it is now more pronounced.  After the update to U3 (updated
glibc), we get nscd segfaults every couple of minutes.  We've 'worked
around' the issue for the moment by setting up a script that checks nscd
status every minute and restarts it if it is dead.  We are considering keeping
RHEL for our 32-bit systems, and switching to another vendor for 64-bit systems
to work around this issue.  It only happens on our x86_64 machines.

Comment 28 Seth Vidal 2007-02-27 01:15:31 UTC
Comment #27 matches the behavior we've seen. Very frequent failures on x86_64.

You aren't by chance using ldap for your nss, are you?

I'd be curious to know what nss modules folks are using who are and are not
seeing the issue.


Comment 29 James Cooley 2007-02-27 01:38:22 UTC
We are currently using ldap for user and group information in nss, but we are
not using it for shadow.  For authentication, we are using LDAP authentication
by means of PAM.

Comment 30 Jakub Jelinek 2007-02-27 07:44:11 UTC
There have been several nscd related fixes (both on the nscd daemon and nscd
client code sides) post U3, some in U4 and some are queued for RHEL4.5
(you can try e.g. http://people.redhat.com/jakub/glibc/2.3.4-2.36.1/
packages (for testing only, they haven't been through QA)).
If you experience crashes even with that glibc and ideally without LDAP
(because nss_ldap or its libraries are a possible culprit too), please file a
new bug rather than adding a me too to a closed bug.


Comment 31 Kostas Georgiou 2007-02-27 09:40:50 UTC
If you do file a new bug please post it here so we can follow it :)

For the record, we mostly use nss_nis here, but the handful of machines that use
nss_ldap don't have a problem either.

Comment 32 James Cooley 2007-06-22 01:18:04 UTC
We've tried the newer patches without any luck.  We also see this issue on RHEL
5.   However, as stated, we only see this on our x86_64 machines.


For those of you still having issues, we're using a workaround that reduces the
pain a bit.  We have a cron job that checks every minute for the nscd process,
and if it isn't running, it restarts nscd and logs the event.  We have a lot of
failures daily depending on how heavily the system is used.


Here's our crontab entry (a single line in the crontab):

* * * * * if /bin/ps -e | /bin/grep nscd > /dev/null; then echo -n; else echo DOWN:`date`; /etc/init.d/nscd restart; /usr/sbin/apachectl restart; fi >> /var/log/nscd-restart.log

Comment 34 Jun'ichi NOMURA 2009-02-09 13:26:59 UTC
Just FYI,
I've seen database corruption quite similar to that in comment #15
(i.e. hashentry->packet points outside of the allocated area).
My analysis and upstream patch are posted here:
http://sources.redhat.com/bugzilla/show_bug.cgi?id=9746

