Bug 241073 - [RHEL5] nscd segfaults semi weekly
Summary: [RHEL5] nscd segfaults semi weekly
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: glibc
Version: 5.0
Hardware: x86_64
OS: Linux
medium
medium
Target Milestone: ---
: ---
Assignee: Andreas Schwab
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2007-05-23 21:22 UTC by Terje Røsten
Modified: 2018-10-19 22:14 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-09-02 20:41:17 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Terje Røsten 2007-05-23 21:22:11 UTC
Description of problem:

nscd segfaults roughly once a week

Version-Release number of selected component (if applicable):

$ rpm -q nscd glibc
nscd-2.5-12
glibc-2.5-12
glibc-2.5-12

How reproducible:

Varies, weeks or days:

$ grep segfault /var/log/messages* | cut -d: -f2

May 23 15:22:10 ann kernel: nscd[24366]: segfault at 00002aab401feb31 rip
00002aaaaf5c4d25 rsp 00000000401ff0e0 error 4
May 15 16:08:11 ann kernel: nscd[13769]: segfault at 00002aab401fec09 rip
00002aaaaf5c4d25 rsp 00000000401ff0e0 error 4
May  2 09:12:48 ann kernel: nscd[28776]: segfault at 00002aab401fec19 rip
00002aaaaf5c4d25 rsp 00000000401ff0e0 error 4
Apr 23 12:59:32 ann kernel: nscd[3251]: segfault at 00002aab401fe7a9 rip
00002aaaaf5c4d25 rsp 00000000401ff0e0 error 4
Apr 24 11:08:59 ann kernel: nscd[9133]: segfault at 00002aab401feba1 rip
00002aaaaf5c4d25 rsp 00000000401ff0e0 error 4

Steps to Reproduce:

Set up openldap-server to store user accounts
and let the server itself use pam_ldap and nss_ldap
for lookup e.g.

$ grep ldap /etc/nsswitch.conf

passwd:     files ldap
shadow:     files ldap
group:      files ldap
protocols:  files ldap
services:   files ldap
netgroup:   files ldap
automount:  files ldap

Additional info:

I have only seen this on the openldap-server, lots of clients (RHEL5, FC5, FC6)
don't have this problem.

I don't see any core files or anything, how can I get more info out from nscd
and a core file?

BTW: after this happens dovecot gets in trouble by not closing fds, and
finally stops as over 1024 fds are in use. Right after nscd segfaults and
stops, dovecot-auth complains:

May 23 15:25:34 ann dovecot-auth: nss_ldap: reconnected to LDAP server
ldap://anne after 1 attempt

Comment 1 Jakub Jelinek 2007-05-24 13:52:06 UTC
If server-user is set in /etc/nscd.conf (it is by default), nscd can't dump core
even when it has been run from ulimit -c unlimited environment.  That's because
the setuid it does makes the process undumpable.
You should install glibc-debuginfo and nss_ldap-debuginfo before looking at
backtraces once you manage to get a core.
Another alternative is to run nscd under gdb, with -d option.

Comment 2 Terje Røsten 2007-05-25 16:53:11 UTC
> Another alternative is to run nscd under gdb, with -d option.

Ok, I got this:

2892:   GETHOSTBYADDR (111.221.43.151)
2892: handle_request: request received (Version = 2) from PID 2797
2892:   GETHOSTBYNAME (foo.bar.com)

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 1082145088 (LWP 2898)]
0x00002aaaab9f5eaa in sendfile64 () from /lib64/libc.so.6
(gdb) bt
#0  0x00002aaaab9f5eaa in sendfile64 () from /lib64/libc.so.6
#1  0x000055555555c4a2 in nscd_run (p=<value optimized out>)
    at connections.c:220
#2  0x00002aaaaaed51b5 in start_thread (arg=<value optimized out>)
    at pthread_create.c:296
#3  0x00002aaaab9fe39d in clone () from /lib64/libc.so.6
#4  0x0000000000000000 in ?? ()

Any value at all in this information?

gdb still running, any more commands I could try?



Comment 3 Jakub Jelinek 2007-05-28 10:35:11 UTC
SIGPIPE is ignored in nscd, see
signal (SIGPIPE, SIG_IGN);
in nscd's main.
So when running it under gdb, you should
handle SIGPIPE nostop noprint pass
to just let nscd handle (ignore) SIGPIPE signals and not stop on them.
The above isn't any sign of a bug.

Comment 4 Terje Røsten 2007-05-28 12:58:56 UTC
> So when running it under gdb, you should
> handle SIGPIPE nostop noprint pass
> to just let nscd handle (ignore) SIGPIPE signals and not stop on them.

Ok, thanks for help.

Comment 5 Terje Røsten 2007-05-31 21:27:52 UTC
Is this better?

14350: remove INITGROUPS entry "postfix"
14350: Reloading "kietanh" in password cache!

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1075841344 (LWP 14353)]
0x00002aaaab9a6330 in strlen () from /lib64/libc.so.6
(gdb) bt
#0  0x00002aaaab9a6330 in strlen () from /lib64/libc.so.6
#1  0x00002aaaab976b88 in _IO_vfprintf_internal (s=0x401feb20, 
    format=<value optimized out>, ap=0x401feca0) at vfprintf.c:1559
#2  0x00002aaaaba11358 in ___vsnprintf_chk (s=0x401fecc0 "Reloading \"@", 
    maxlen=<value optimized out>, flags=1, slen=<value optimized out>, 
    format=0x55555556a940 "Reloading \"%s\" in password cache!", 
    args=0x401feca0) at vsnprintf_chk.c:65
#3  0x0000555555560073 in dbg_log (fmt=0x0) at dbg_log.c:58
#4  0x000055555555d7e6 in addpwbyX (db=0x55555576e120, fd=-1, req=0x401ff6b0, 
    key={v = 0x2aab401febe1, u = 1075833825}, 
    keystr=0x2aab401febe1 <Address 0x2aab401febe1 out of bounds>, 
    c_uid=4294967295, he=0x2aaab0105c28, dh=0x2aaab0105b90) at pwdcache.c:440
#5  0x000055555555d9c0 in readdpwbyname (db=0x2aab401febe1, 
    he=<value optimized out>, dh=<value optimized out>) at pwdcache.c:518
#6  0x0000555555562746 in prune_cache (table=0x55555576e120, now=1180597834, 
    fd=-1) at cache.c:334
#7  0x000055555555c1ce in nscd_run (p=0x0) at connections.c:1451
#8  0x00002aaaaaed51b5 in start_thread (arg=<value optimized out>)
    at pthread_create.c:296
#9  0x00002aaaab9fe39d in clone () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()
(gdb) 


Comment 6 Terje Røsten 2007-06-10 07:17:33 UTC
On more, note that the last line it logs before crashing is identical.
(The entry is a username).


27391: handle_request: request received (Version = 2) from PID 2797
27391:  GETHOSTBYNAME (eivind.phys.ntnu.no)
27391: Reloading "kietanh" in password cache!

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1075841344 (LWP 27394)]
0x00002aaaab9a6330 in strlen () from /lib64/libc.so.6
(gdb): bt
#0  0x00002aaaab9a6330 in strlen () from /lib64/libc.so.6
#1  0x00002aaaab976b88 in _IO_vfprintf_internal (s=0x401feb20,
    format=<value optimized out>, ap=0x401feca0) at vfprintf.c:1559
#2  0x00002aaaaba11358 in ___vsnprintf_chk (s=0x401fecc0 "Reloading \"@",
    maxlen=<value optimized out>, flags=1, slen=<value optimized out>,
    format=0x55555556a940 "Reloading \"%s\" in password cache!", args=0x401feca0)
    at vsnprintf_chk.c:65
#3  0x0000555555560073 in dbg_log (fmt=0x0) at dbg_log.c:58
#4  0x000055555555d7e6 in addpwbyX (db=0x55555576e120, fd=-1, req=0x401ff6b0,
    key={v = 0x2aab401fea99, u = 1075833497},
    keystr=0x2aab401fea99 <Address 0x2aab401fea99 out of bounds>,
    c_uid=4294967295, he=0x2aaab0105ad0, dh=0x2aaab0105a38) at pwdcache.c:440
#5  0x000055555555d9c0 in readdpwbyname (db=0x2aab401fea99,
    he=<value optimized out>, dh=<value optimized out>) at pwdcache.c:518
#6  0x0000555555562746 in prune_cache (table=0x55555576e120, now=1181114009,
    fd=-1) at cache.c:334
#7  0x000055555555c1ce in nscd_run (p=0x0) at connections.c:1451
#8  0x00002aaaaaed51b5 in start_thread (arg=<value optimized out>)
    at pthread_create.c:296
#9  0x00002aaaab9fe39d in clone () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()
(gdb)


Comment 7 Terje Røsten 2007-07-12 21:28:31 UTC
More:

793: handle_request: request received (Version = 2) from PID 7832
793:    GETHOSTBYADDR (129.241.48.174)
793: Reloading "hemmer" in password cache!

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1075841344 (LWP 796)]
0x00002aaaab9a6330 in strlen () from /lib64/libc.so.6
(gdb) bt
#0  0x00002aaaab9a6330 in strlen () from /lib64/libc.so.6
#1  0x00002aaaab976b88 in _IO_vfprintf_internal (s=0x401feb20,
    format=<value optimized out>, ap=0x401feca0) at vfprintf.c:1559
#2  0x00002aaaaba11358 in ___vsnprintf_chk (s=0x401fecc0 "Reloading \"@",
    maxlen=<value optimized out>, flags=1, slen=<value optimized out>,
    format=0x55555556a940 "Reloading \"%s\" in password cache!",
    args=0x401feca0) at vsnprintf_chk.c:65
#3  0x0000555555560073 in dbg_log (fmt=0x0) at dbg_log.c:58
#4  0x000055555555d7e6 in addpwbyX (db=0x55555576e120, fd=-1, req=0x401ff6b0,
    key={v = 0x2aab401fe789, u = 1075832713},
    keystr=0x2aab401fe789 <Address 0x2aab401fe789 out of bounds>,
    c_uid=4294967295, he=0x2aaab2105c18, dh=0x2aaab2105b80) at pwdcache.c:440
#5  0x000055555555d9c0 in readdpwbyname (db=0x2aab401fe789,
    he=<value optimized out>, dh=<value optimized out>) at pwdcache.c:518
#6  0x0000555555562746 in prune_cache (table=0x55555576e120, now=1183458437,
    fd=-1) at cache.c:334
#7  0x000055555555c1ce in nscd_run (p=0x0) at connections.c:1451
#8  0x00002aaaaaed51b5 in start_thread (arg=<value optimized out>)
    at pthread_create.c:296
#9  0x00002aaaab9fe39d in clone () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()
(gdb)

Any ideas at all?


Comment 8 Terje Røsten 2007-08-04 18:20:17 UTC
Ping?

Anything I can do to help to find and fix the problem?

More back traces is coming in:

14691: Reloading "kietanh" in password cache!

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1075841344 (LWP 14694)]
0x00002aaaab9a6330 in strlen () from /lib64/libc.so.6
(gdb) bt
#0  0x00002aaaab9a6330 in strlen () from /lib64/libc.so.6
#1  0x00002aaaab976b88 in _IO_vfprintf_internal (s=0x401feb20,
    format=<value optimized out>, ap=0x401feca0) at vfprintf.c:1559
#2  0x00002aaaaba11358 in ___vsnprintf_chk (s=0x401fecc0 "Reloading \"@",
    maxlen=<value optimized out>, flags=1, slen=<value optimized out>,
    format=0x55555556a940 "Reloading \"%s\" in password cache!", args=0x401feca0)
    at vsnprintf_chk.c:65
#3  0x0000555555560073 in dbg_log (fmt=0x0) at dbg_log.c:58
#4  0x000055555555d7e6 in addpwbyX (db=0x55555576e120, fd=-1, req=0x401ff6b0, key=
      {v = 0x2aab401febf9, u = 1075833849},
    keystr=0x2aab401febf9 <Address 0x2aab401febf9 out of bounds>, c_uid=4294967295,
    he=0x2aaab2105c10, dh=0x2aaab2105b78) at pwdcache.c:440
#5  0x000055555555d9c0 in readdpwbyname (db=0x2aab401febf9, he=<value optimized
out>,
    dh=<value optimized out>) at pwdcache.c:518
#6  0x0000555555562746 in prune_cache (table=0x55555576e120, now=1184593836, fd=-1)
    at cache.c:334
#7  0x000055555555c1ce in nscd_run (p=0x0) at connections.c:1451
#8  0x00002aaaaaed51b5 in start_thread (arg=<value optimized out>) at
pthread_create.c:296
#9  0x00002aaaab9fe39d in clone () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()
(gdb) 


Comment 9 Ulrich Drepper 2007-08-23 02:27:41 UTC
All the traces in comments 5 to 8 point to a corrupted database.  The strlen can
only fail if the string stored in the database is corrupted.  The place where
the string is determined in in readdpwbyname:

  addpwbyX (db, -1, &req, u, db->data + he->key, he->owner, he, dh);

For the string to be invalid the he->key value must be bogus.

I don't say here isn't a program bug but it's quite unlikely at this time.  If
tis would be a bigger problem more people would be affected.

I'd concentrate on the database file itself:

- do some 3rd party programs (maybe backup tools) touch it?

- what filesystem do you use for that directory /var/cache/nscd

- try disabling in /et/nscd.conf the 'persistent' and 'shared' entries for the
  passwd database

Comment 10 Terje Røsten 2007-08-23 05:39:11 UTC
> - do some 3rd party programs (maybe backup tools) touch it?

Not as I know.
 
> - what filesystem do you use for that directory /var/cache/nscd

ext3 

> - try disabling in /et/nscd.conf the 'persistent' and 'shared' entries for the
>   passwd database

Done.

Comment 11 Ulrich Drepper 2007-08-24 17:51:46 UTC
Changing the state back to NEEDINFO.  We still need some info.  There is nothing
in this report which could help to even narrow down the problem.

When you restart nscd, does it complain about corrupted databases?

Comment 12 Terje Røsten 2007-08-24 21:35:36 UTC
(In reply to comment #11)
> Changing the state back to NEEDINFO.  We still need some info.  There is nothing
> in this report which could help to even narrow down the problem.

Of course, remember it's can run fine for weeks and then suddenly fail.
Will report back new results with disables persistent and shared in pw db.

> When you restart nscd, does it complain about corrupted databases?

It did not do that now, however believe I have seen that before.


Wait, let's look at the screen session running nscd, a-ha, a new trace,
now with persistent and shared disabled:

31035:  GETHOSTBYADDR (127.0.0.1)
31035: handle_request: request received (Version = 2) from PID 32433
31035:  GETFDPW
31035: handle_request: request received (Version = 2) from PID 32433
31035:  GETPWBYUID (99)
31035: handle_request: request received (Version = 2) from PID 32433
31035:  GETPWBYNAME (kietanh)
31035: handle_request: request received (Version = 2) from PID 32433
31035:  GETPWBYUID (28570)
31035: handle_request: request received (Version = 2) from PID 32433
31035:  GETPWBYNAME (KIETANH)

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1080043840 (LWP 31040)]
0x0000555555562de5 in cache_search (type=GETPWBYNAME, key=0x40601a80, len=8,
    table=0x55555576e120, owner=4294967295) at cache.c:64
64            if (type == here->type && len == here->len
(gdb) bt
#0  0x0000555555562de5 in cache_search (type=GETPWBYNAME, key=0x40601a80,
    len=8, table=0x55555576e120, owner=4294967295) at cache.c:64
#1  0x000055555555c3ee in nscd_run (p=<value optimized out>)
    at connections.c:970
#2  0x00002aaaaaed51b5 in start_thread (arg=<value optimized out>)
    at pthread_create.c:296
#3  0x00002aaaab9fe39d in clone () from /lib64/libc.so.6
#4  0x0000000000000000 in ?? ()


This is 2-way Opteron box, could there be a hard to trigger race condition
in the code?








Comment 13 Issue Tracker 2007-11-12 15:14:26 UTC
File uploaded: nscd1.core.gz

This event sent from IssueTracker by jnansi 
 issue 136113
it_file 108061

Comment 14 Terje Røsten 2007-12-04 19:11:26 UTC
Maybe a bit soon, however: not a single crash since install of
nscd.x86_64 2.5-18.el5_1.1 on Nov 07 2007.


Comment 15 Jan-Frode Myklebust 2007-12-14 11:49:34 UTC
FYI: I do get regular crashes with nscd-2.5-18.el5_1.1 on RHEL5/x86-64 on
mailservers doing lots of rbl/uribl-dns lookups. So I don't think this problem
is fixed.

Comment 16 Ulrich Drepper 2007-12-14 15:02:57 UTC
(In reply to comment #15)
> FYI: I do get regular crashes with nscd-2.5-18.el5_1.1 on RHEL5/x86-64 on
> mailservers doing lots of rbl/uribl-dns lookups.

I cannot really imagine a report which is less useful than this one.

Comment 17 Jan-Frode Myklebust 2007-12-14 15:40:52 UTC
Sorry... Can you maybe check service request 1790700 ?

I've been having regular segfaults with nscd on rhel50, didn't report anything
because I knew they'd just push me to upgrade to 5.1.. And once I got v5.1 into
production the same thing happens there. Regular segfaults of nscd daily, on a
couple of mailservers doing lots of DNS-lookups.

dmesg is reporting:

$ dmesg
nscd[10074]: segfault at 00002aaae5bf45e5 rip 0000555555562fc3 rsp
000000004200a9a0 error 4
nscd[12216] general protection rip:55555556307b rsp:40fa18a8 error:0
nscd[26221]: segfault at 00000000000001b7 rip 000055555556341f rsp
0000000040ffcec0 error 6

and I also see these in the syslog:

Dec 10 10:04:18 mailserver1 nscd: 3857 Access Vector Cache (AVC) started
Dec 10 10:59:15 mailserver1 nscd: 2704 Access Vector Cache (AVC) started
Dec 12 15:54:28 mailserver1 kernel: nscd[2739]: segfault at 0000000041004000 rip
0000555555563474 rsp 0000000040fab480 error 6
Dec 12 16:41:14 mailserver1 nscd: 7541 Access Vector Cache (AVC) started
Dec 12 16:41:14 mailserver1 nscd: 7541 invalid persistent database file
"/var/db/nscd/hosts": verification failed
Dec 12 21:15:03 mailserver1 kernel: nscd[10074]: segfault at 00002aaae5bf45e5
rip 0000555555562fc3 rsp 000000004200a9a0 error 4
Dec 13 13:53:57 mailserver1 nscd: 12211 Access Vector Cache (AVC) started
Dec 13 13:53:57 mailserver1 nscd: 12211 invalid persistent database file
"/var/db/nscd/hosts": verification failed
Dec 13 23:32:47 mailserver1 kernel: nscd[12216] general protection
rip:55555556307b rsp:40fa18a8 error:0
Dec 14 11:57:56 mailserver1 nscd: 26216 Access Vector Cache (AVC) started
Dec 14 11:57:56 mailserver1 nscd: 26216 invalid persistent database file
"/var/db/nscd/hosts": verification failed
Dec 14 12:09:57 mailserver1 kernel: nscd[26221]: segfault at 00000000000001b7
rip 000055555556341f rsp 0000000040ffcec0 error 6
Dec 14 12:28:57 mailserver1 nscd: 28966 Access Vector Cache (AVC) started
Dec 14 12:28:57 mailserver1 nscd: 28966 invalid persistent database file
"/var/db/nscd/hosts": verification failed
Dec 14 13:42:44 mailserver1 nscd: 4322 Access Vector Cache (AVC) started



Comment 19 Terje Røsten 2008-01-21 09:38:40 UTC
I have close to 100 days without crash now. 

Seems like this possible issue is tracked elsewhere too.

IMHO this ticket can be closed.


Comment 20 Jun'ichi NOMURA 2009-01-16 04:57:42 UTC
FYI, upstream bug report:
http://sources.redhat.com/bugzilla/show_bug.cgi?id=9746


Note You need to log in before you can comment on or make changes to this bug.