Description of problem: nscd segfaults roughly once a week Version-Release number of selected component (if applicable): $ rpm -q nscd glibc nscd-2.5-12 glibc-2.5-12 glibc-2.5-12 How reproducible: Varies, weeks or days: $ grep segfault /var/log/messages* | cut -d: -f2 May 23 15:22:10 ann kernel: nscd[24366]: segfault at 00002aab401feb31 rip 00002aaaaf5c4d25 rsp 00000000401ff0e0 error 4 May 15 16:08:11 ann kernel: nscd[13769]: segfault at 00002aab401fec09 rip 00002aaaaf5c4d25 rsp 00000000401ff0e0 error 4 May 2 09:12:48 ann kernel: nscd[28776]: segfault at 00002aab401fec19 rip 00002aaaaf5c4d25 rsp 00000000401ff0e0 error 4 Apr 23 12:59:32 ann kernel: nscd[3251]: segfault at 00002aab401fe7a9 rip 00002aaaaf5c4d25 rsp 00000000401ff0e0 error 4 Apr 24 11:08:59 ann kernel: nscd[9133]: segfault at 00002aab401feba1 rip 00002aaaaf5c4d25 rsp 00000000401ff0e0 error 4 Steps to Reproduce: Set up openldap-server to store user accounts and let the server itself use pam_ldap and nss_ldap for lookup e.g. $ grep ldap /etc/nsswitch.conf passwd: files ldap shadow: files ldap group: files ldap protocols: files ldap services: files ldap netgroup: files ldap automount: files ldap Additional info: I have only seen this on the openldap-server, lots of clients (RHEL5, FC5, FC6) don't have this problem. I don't see any core files or anything, how can I get more info out from nscd and a core file? BTW: after this happens dovecot gets in trouble by not closing fds, and finally stops as over 1024 fds are in use. Right after nscd segfaults and stops, dovecot-auth complains: May 23 15:25:34 ann dovecot-auth: nss_ldap: reconnected to LDAP server ldap://anne after 1 attempt
If server-user is set in /etc/nscd.conf (it is by default), nscd can't dump core even when it has been run from ulimit -c unlimited environment. That's because the setuid it does makes the process undumpable. You should install glibc-debuginfo and nss_ldap-debuginfo before looking at backtraces once you manage to get a core. Another alternative is to run nscd under gdb, with -d option.
> Another alternative is to run nscd under gdb, with -d option. Ok, I got this: 2892: GETHOSTBYADDR (111.221.43.151) 2892: handle_request: request received (Version = 2) from PID 2797 2892: GETHOSTBYNAME (foo.bar.com) Program received signal SIGPIPE, Broken pipe. [Switching to Thread 1082145088 (LWP 2898)] 0x00002aaaab9f5eaa in sendfile64 () from /lib64/libc.so.6 (gdb) bt #0 0x00002aaaab9f5eaa in sendfile64 () from /lib64/libc.so.6 #1 0x000055555555c4a2 in nscd_run (p=<value optimized out>) at connections.c:220 #2 0x00002aaaaaed51b5 in start_thread (arg=<value optimized out>) at pthread_create.c:296 #3 0x00002aaaab9fe39d in clone () from /lib64/libc.so.6 #4 0x0000000000000000 in ?? () Any value at all in this information? gdb still running, any more commands I could try?
SIGPIPE is ignored in nscd, see signal (SIGPIPE, SIG_IGN); in nscd's main. So when running it under gdb, you should handle SIGPIPE nostop noprint pass to just let nscd handle (ignore) SIGPIPE signals and not stop on them. The above isn't any sign of a bug.
> So when running it under gdb, you should > handle SIGPIPE nostop noprint pass > to just let nscd handle (ignore) SIGPIPE signals and not stop on them. Ok, thanks for help.
Is this better? 14350: remove INITGROUPS entry "postfix" 14350: Reloading "kietanh" in password cache! Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 1075841344 (LWP 14353)] 0x00002aaaab9a6330 in strlen () from /lib64/libc.so.6 (gdb) bt #0 0x00002aaaab9a6330 in strlen () from /lib64/libc.so.6 #1 0x00002aaaab976b88 in _IO_vfprintf_internal (s=0x401feb20, format=<value optimized out>, ap=0x401feca0) at vfprintf.c:1559 #2 0x00002aaaaba11358 in ___vsnprintf_chk (s=0x401fecc0 "Reloading \"@", maxlen=<value optimized out>, flags=1, slen=<value optimized out>, format=0x55555556a940 "Reloading \"%s\" in password cache!", args=0x401feca0) at vsnprintf_chk.c:65 #3 0x0000555555560073 in dbg_log (fmt=0x0) at dbg_log.c:58 #4 0x000055555555d7e6 in addpwbyX (db=0x55555576e120, fd=-1, req=0x401ff6b0, key={v = 0x2aab401febe1, u = 1075833825}, keystr=0x2aab401febe1 <Address 0x2aab401febe1 out of bounds>, c_uid=4294967295, he=0x2aaab0105c28, dh=0x2aaab0105b90) at pwdcache.c:440 #5 0x000055555555d9c0 in readdpwbyname (db=0x2aab401febe1, he=<value optimized out>, dh=<value optimized out>) at pwdcache.c:518 #6 0x0000555555562746 in prune_cache (table=0x55555576e120, now=1180597834, fd=-1) at cache.c:334 #7 0x000055555555c1ce in nscd_run (p=0x0) at connections.c:1451 #8 0x00002aaaaaed51b5 in start_thread (arg=<value optimized out>) at pthread_create.c:296 #9 0x00002aaaab9fe39d in clone () from /lib64/libc.so.6 #10 0x0000000000000000 in ?? () (gdb)
On more, note that the last line it logs before crashing is identical. (The entry is a username). 27391: handle_request: request received (Version = 2) from PID 2797 27391: GETHOSTBYNAME (eivind.phys.ntnu.no) 27391: Reloading "kietanh" in password cache! Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 1075841344 (LWP 27394)] 0x00002aaaab9a6330 in strlen () from /lib64/libc.so.6 (gdb): bt #0 0x00002aaaab9a6330 in strlen () from /lib64/libc.so.6 #1 0x00002aaaab976b88 in _IO_vfprintf_internal (s=0x401feb20, format=<value optimized out>, ap=0x401feca0) at vfprintf.c:1559 #2 0x00002aaaaba11358 in ___vsnprintf_chk (s=0x401fecc0 "Reloading \"@", maxlen=<value optimized out>, flags=1, slen=<value optimized out>, format=0x55555556a940 "Reloading \"%s\" in password cache!", args=0x401feca0) at vsnprintf_chk.c:65 #3 0x0000555555560073 in dbg_log (fmt=0x0) at dbg_log.c:58 #4 0x000055555555d7e6 in addpwbyX (db=0x55555576e120, fd=-1, req=0x401ff6b0, key={v = 0x2aab401fea99, u = 1075833497}, keystr=0x2aab401fea99 <Address 0x2aab401fea99 out of bounds>, c_uid=4294967295, he=0x2aaab0105ad0, dh=0x2aaab0105a38) at pwdcache.c:440 #5 0x000055555555d9c0 in readdpwbyname (db=0x2aab401fea99, he=<value optimized out>, dh=<value optimized out>) at pwdcache.c:518 #6 0x0000555555562746 in prune_cache (table=0x55555576e120, now=1181114009, fd=-1) at cache.c:334 #7 0x000055555555c1ce in nscd_run (p=0x0) at connections.c:1451 #8 0x00002aaaaaed51b5 in start_thread (arg=<value optimized out>) at pthread_create.c:296 #9 0x00002aaaab9fe39d in clone () from /lib64/libc.so.6 #10 0x0000000000000000 in ?? () (gdb)
More: 793: handle_request: request received (Version = 2) from PID 7832 793: GETHOSTBYADDR (129.241.48.174) 793: Reloading "hemmer" in password cache! Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 1075841344 (LWP 796)] 0x00002aaaab9a6330 in strlen () from /lib64/libc.so.6 (gdb) bt #0 0x00002aaaab9a6330 in strlen () from /lib64/libc.so.6 #1 0x00002aaaab976b88 in _IO_vfprintf_internal (s=0x401feb20, format=<value optimized out>, ap=0x401feca0) at vfprintf.c:1559 #2 0x00002aaaaba11358 in ___vsnprintf_chk (s=0x401fecc0 "Reloading \"@", maxlen=<value optimized out>, flags=1, slen=<value optimized out>, format=0x55555556a940 "Reloading \"%s\" in password cache!", args=0x401feca0) at vsnprintf_chk.c:65 #3 0x0000555555560073 in dbg_log (fmt=0x0) at dbg_log.c:58 #4 0x000055555555d7e6 in addpwbyX (db=0x55555576e120, fd=-1, req=0x401ff6b0, key={v = 0x2aab401fe789, u = 1075832713}, keystr=0x2aab401fe789 <Address 0x2aab401fe789 out of bounds>, c_uid=4294967295, he=0x2aaab2105c18, dh=0x2aaab2105b80) at pwdcache.c:440 #5 0x000055555555d9c0 in readdpwbyname (db=0x2aab401fe789, he=<value optimized out>, dh=<value optimized out>) at pwdcache.c:518 #6 0x0000555555562746 in prune_cache (table=0x55555576e120, now=1183458437, fd=-1) at cache.c:334 #7 0x000055555555c1ce in nscd_run (p=0x0) at connections.c:1451 #8 0x00002aaaaaed51b5 in start_thread (arg=<value optimized out>) at pthread_create.c:296 #9 0x00002aaaab9fe39d in clone () from /lib64/libc.so.6 #10 0x0000000000000000 in ?? () (gdb) Any ideas at all?
Ping? Anything I can do to help to find and fix the problem? More back traces is coming in: 14691: Reloading "kietanh" in password cache! Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 1075841344 (LWP 14694)] 0x00002aaaab9a6330 in strlen () from /lib64/libc.so.6 (gdb) bt #0 0x00002aaaab9a6330 in strlen () from /lib64/libc.so.6 #1 0x00002aaaab976b88 in _IO_vfprintf_internal (s=0x401feb20, format=<value optimized out>, ap=0x401feca0) at vfprintf.c:1559 #2 0x00002aaaaba11358 in ___vsnprintf_chk (s=0x401fecc0 "Reloading \"@", maxlen=<value optimized out>, flags=1, slen=<value optimized out>, format=0x55555556a940 "Reloading \"%s\" in password cache!", args=0x401feca0) at vsnprintf_chk.c:65 #3 0x0000555555560073 in dbg_log (fmt=0x0) at dbg_log.c:58 #4 0x000055555555d7e6 in addpwbyX (db=0x55555576e120, fd=-1, req=0x401ff6b0, key= {v = 0x2aab401febf9, u = 1075833849}, keystr=0x2aab401febf9 <Address 0x2aab401febf9 out of bounds>, c_uid=4294967295, he=0x2aaab2105c10, dh=0x2aaab2105b78) at pwdcache.c:440 #5 0x000055555555d9c0 in readdpwbyname (db=0x2aab401febf9, he=<value optimized out>, dh=<value optimized out>) at pwdcache.c:518 #6 0x0000555555562746 in prune_cache (table=0x55555576e120, now=1184593836, fd=-1) at cache.c:334 #7 0x000055555555c1ce in nscd_run (p=0x0) at connections.c:1451 #8 0x00002aaaaaed51b5 in start_thread (arg=<value optimized out>) at pthread_create.c:296 #9 0x00002aaaab9fe39d in clone () from /lib64/libc.so.6 #10 0x0000000000000000 in ?? () (gdb)
All the traces in comments 5 to 8 point to a corrupted database. The strlen can only fail if the string stored in the database is corrupted. The place where the string is determined in in readdpwbyname: addpwbyX (db, -1, &req, u, db->data + he->key, he->owner, he, dh); For the string to be invalid the he->key value must be bogus. I don't say here isn't a program bug but it's quite unlikely at this time. If tis would be a bigger problem more people would be affected. I'd concentrate on the database file itself: - do some 3rd party programs (maybe backup tools) touch it? - what filesystem do you use for that directory /var/cache/nscd - try disabling in /et/nscd.conf the 'persistent' and 'shared' entries for the passwd database
> - do some 3rd party programs (maybe backup tools) touch it? Not as I know. > - what filesystem do you use for that directory /var/cache/nscd ext3 > - try disabling in /et/nscd.conf the 'persistent' and 'shared' entries for the > passwd database Done.
Changing the state back to NEEDINFO. We still need some info. There is nothing in this report which could help to even narrow down the problem. When you restart nscd, does it complain about corrupted databases?
(In reply to comment #11) > Changing the state back to NEEDINFO. We still need some info. There is nothing > in this report which could help to even narrow down the problem. Of course, remember it's can run fine for weeks and then suddenly fail. Will report back new results with disables persistent and shared in pw db. > When you restart nscd, does it complain about corrupted databases? It did not do that now, however believe I have seen that before. Wait, let's look at the screen session running nscd, a-ha, a new trace, now with persistent and shared disabled: 31035: GETHOSTBYADDR (127.0.0.1) 31035: handle_request: request received (Version = 2) from PID 32433 31035: GETFDPW 31035: handle_request: request received (Version = 2) from PID 32433 31035: GETPWBYUID (99) 31035: handle_request: request received (Version = 2) from PID 32433 31035: GETPWBYNAME (kietanh) 31035: handle_request: request received (Version = 2) from PID 32433 31035: GETPWBYUID (28570) 31035: handle_request: request received (Version = 2) from PID 32433 31035: GETPWBYNAME (KIETANH) Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 1080043840 (LWP 31040)] 0x0000555555562de5 in cache_search (type=GETPWBYNAME, key=0x40601a80, len=8, table=0x55555576e120, owner=4294967295) at cache.c:64 64 if (type == here->type && len == here->len (gdb) bt #0 0x0000555555562de5 in cache_search (type=GETPWBYNAME, key=0x40601a80, len=8, table=0x55555576e120, owner=4294967295) at cache.c:64 #1 0x000055555555c3ee in nscd_run (p=<value optimized out>) at connections.c:970 #2 0x00002aaaaaed51b5 in start_thread (arg=<value optimized out>) at pthread_create.c:296 #3 0x00002aaaab9fe39d in clone () from /lib64/libc.so.6 #4 0x0000000000000000 in ?? () This is 2-way Opteron box, could there be a hard to trigger race condition in the code?
File uploaded: nscd1.core.gz This event sent from IssueTracker by jnansi issue 136113 it_file 108061
Maybe a bit soon, however: not a single crash since install of nscd.x86_64 2.5-18.el5_1.1 on Nov 07 2007.
FYI: I do get regular crashes with nscd-2.5-18.el5_1.1 on RHEL5/x86-64 on mailservers doing lots of rbl/uribl-dns lookups. So I don't think this problem is fixed.
(In reply to comment #15) > FYI: I do get regular crashes with nscd-2.5-18.el5_1.1 on RHEL5/x86-64 on > mailservers doing lots of rbl/uribl-dns lookups. I cannot really imagine a report which is less useful than this one.
Sorry... Can you maybe check service request 1790700 ? I've been having regular segfaults with nscd on rhel50, didn't report anything because I knew they'd just push me to upgrade to 5.1.. And once I got v5.1 into production the same thing happens there. Regular segfaults of nscd daily, on a couple of mailservers doing lots of DNS-lookups. dmesg is reporting: $ dmesg nscd[10074]: segfault at 00002aaae5bf45e5 rip 0000555555562fc3 rsp 000000004200a9a0 error 4 nscd[12216] general protection rip:55555556307b rsp:40fa18a8 error:0 nscd[26221]: segfault at 00000000000001b7 rip 000055555556341f rsp 0000000040ffcec0 error 6 and I also see these in the syslog: Dec 10 10:04:18 mailserver1 nscd: 3857 Access Vector Cache (AVC) started Dec 10 10:59:15 mailserver1 nscd: 2704 Access Vector Cache (AVC) started Dec 12 15:54:28 mailserver1 kernel: nscd[2739]: segfault at 0000000041004000 rip 0000555555563474 rsp 0000000040fab480 error 6 Dec 12 16:41:14 mailserver1 nscd: 7541 Access Vector Cache (AVC) started Dec 12 16:41:14 mailserver1 nscd: 7541 invalid persistent database file "/var/db/nscd/hosts": verification failed Dec 12 21:15:03 mailserver1 kernel: nscd[10074]: segfault at 00002aaae5bf45e5 rip 0000555555562fc3 rsp 000000004200a9a0 error 4 Dec 13 13:53:57 mailserver1 nscd: 12211 Access Vector Cache (AVC) started Dec 13 13:53:57 mailserver1 nscd: 12211 invalid persistent database file "/var/db/nscd/hosts": verification failed Dec 13 23:32:47 mailserver1 kernel: nscd[12216] general protection rip:55555556307b rsp:40fa18a8 error:0 Dec 14 11:57:56 mailserver1 nscd: 26216 Access Vector Cache (AVC) started Dec 14 11:57:56 mailserver1 nscd: 26216 invalid persistent database file "/var/db/nscd/hosts": verification failed Dec 14 12:09:57 mailserver1 kernel: nscd[26221]: segfault at 00000000000001b7 rip 000055555556341f rsp 0000000040ffcec0 error 6 Dec 14 12:28:57 mailserver1 nscd: 28966 Access Vector Cache (AVC) started Dec 14 12:28:57 mailserver1 nscd: 28966 invalid persistent database file "/var/db/nscd/hosts": verification failed Dec 14 13:42:44 mailserver1 nscd: 4322 Access Vector Cache (AVC) started
I have close to 100 days without crash now. Seems like this possible issue is tracked elsewhere too. IMHO this ticket can be closed.
FYI, upstream bug report: http://sources.redhat.com/bugzilla/show_bug.cgi?id=9746