Description of problem:
After nscd is started, it dies within a few seconds.

Version-Release number of selected component (if applicable):
nscd-2.3.5-10

How reproducible:
always

Steps to Reproduce:
1. su -
2. nscd
3. wait a few seconds

Actual results:
nscd dies

Expected results:
nscd should stay up

Additional info:
clean install of FC4
I'm having the same problem. I'm using nss_ldap for LDAP authentication with SELinux disabled. Also tried with persistent = no and shared = no settings with the same results. This setup was stable under FC3.

I can't seem to get it to dump a core file either, which seems strange:

# ulimit -c unlimited

Here's the tail end of an strace -f nscd:

[pid 440] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 440] futex(0x3682a4, FUTEX_WAKE, 1) = 0
[pid 440] time(NULL) = 1120330271
[pid 440] stat64("/etc/passwd", {st_mode=S_IFREG|0444, st_size=2516, ...}) = 0
[pid 440] clock_gettime(CLOCK_MONOTONIC, {49842, 645350000}) = 0
[pid 440] clock_gettime(CLOCK_MONOTONIC, {49842, 646743000}) = 0
[pid 440] futex(0x3682e4, FUTEX_WAIT, 111, {14, 998607000} <unfinished ...>
[pid 443] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 443] futex(0x3682a4, FUTEX_WAKE, 1) = 0
[pid 443] time(NULL) = 1120330271
[pid 443] stat64("/etc/hosts", {st_mode=S_IFREG|0444, st_size=382, ...}) = 0
[pid 443] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
Process 437 detached
Process 443 detached
[pid 441] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 441] +++ killed by SIGSEGV +++
PANIC: handle_group_exit: 441 leader 437
[pid 440] <... futex resumed> ) = -1 EINTR (Interrupted system call)
[pid 440] +++ killed by SIGSEGV +++
PANIC: handle_group_exit: 440 leader 437
Process 437 detached
You can use LD_PRELOAD=libSegFault.so nscd -d to see a backtrace. If you are using LDAP, it could be either an nss_ldap or a glibc bug. If the latter, I'd be interested to know whether using the glibc-2.3.5-11 nscd (rawhide) cures it (in which case it could be an nscd miscompilation by GCC, #154782).
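The LD_PRELOAD invocation suggested above can be wrapped so that the debug output and any backtrace end up in a file. This is only a sketch: the function name run_with_segv_backtrace and the log path are made up for illustration, and it assumes libSegFault.so is installed where the dynamic loader can find it.

```shell
#!/bin/sh
# Run a command with libSegFault.so preloaded and save everything it prints.
# Usage: run_with_segv_backtrace LOGFILE COMMAND [ARGS...]
# (hypothetical helper name; not part of glibc or nscd)
run_with_segv_backtrace() {
    log_file="$1"
    shift
    # LD_PRELOAD is set for this one command only, not for the calling shell.
    # stdout and stderr both go to the log, since the register dump and
    # backtrace are written to stderr.
    LD_PRELOAD=libSegFault.so "$@" >"$log_file" 2>&1
}
```

For example, run_with_segv_backtrace /tmp/nscd-segv.log nscd -d would keep nscd in the foreground in debug mode and leave the dump in /tmp/nscd-segv.log. The function returns the exit status of the command it ran.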
[root@localhost ~]# LD_PRELOAD=libSegFault.so nscd -d
3084: Access Vector Cache (AVC) started
3084: Reloading "0" in password cache!
*** Segmentation fault
Register dump:

 EAX: 00000005   EBX: 00b40cc0   ECX: 000000cb   EDX: 00000005
 ESI: b73993b8   EDI: f9e5beb7   EBP: b732ddb4   ESP: b732dbac

 EIP: 00b38732   EFLAGS: 00210296

 CS: 0073   DS: 007b   ES: 007b   FS: 0000   GS: 0033   SS: 007b

 Trap: 0000000e   Error: 00000005   OldMask: 00000000
 ESP/signal: b732dbac   CR2: f9e5becb

Backtrace:
3084: Reloading "0" in group cache!
3084: Reloading "ftp.freshrpms.net" in hosts cache!
3084: Reloading "lp" in group cache!
3084: Reloading "1" in group cache!
3084: Reloading "2" in group cache!
3084: Reloading "slocate" in group cache!
3084: Reloading "3" in group cache!
3084: Reloading "4" in group cache!
3084: Reloading "root" in group cache!
3084: Reloading "10" in group cache!
3084: Reloading "5" in group cache!
3084: Reloading "users" in group cache!
3084: Reloading "500" in group cache!
3084: Reloading "6" in group cache!
3084: Reloading "12" in group cache!
/lib/libSegFault.so[0x8ac115]
[0x11f420]
nscd[0xb33616]
/lib/libpthread.so.0[0xac8b80]
/lib/libc.so.6(__clone+0x5e)[0x1eadee]
Segmentation fault
[root@localhost ~]#
I don't even know what LDAP is. I'm using nscd to try to speed up DNS as recommended by http://www.fedoraforum.org/forum/showthread.php?t=42943 Essentially the same thing happens with nscd-2.3.5-11 (just the one package upgraded).
Ok, can you, as root, run:

  mkdir -p ~/db-nscd/
  cp -a /var/db/nscd/* ~/db-nscd/
  rm -f /var/db/nscd/*

and retry? The LDAP notice was in response to comment #1.
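The three commands above can also be collected into one small function, which makes it easy to try the procedure on a scratch directory first. This is a sketch only: the function name backup_and_clear_nscd_db is made up for illustration, and the defaults are simply the paths used in this comment.

```shell
#!/bin/sh
# Save nscd's persistent cache files and remove the originals so that the
# daemon starts with empty caches. (hypothetical helper name)
# Usage: backup_and_clear_nscd_db [DB_DIR] [BACKUP_DIR]
backup_and_clear_nscd_db() {
    db_dir="${1:-/var/db/nscd}"
    backup_dir="${2:-$HOME/db-nscd}"
    mkdir -p "$backup_dir" || return 1
    for f in "$db_dir"/*; do
        # Skip the unexpanded pattern when the directory is empty.
        [ -e "$f" ] || continue
        # cp -a preserves mode, ownership and timestamps of each file.
        cp -a "$f" "$backup_dir"/ || return 1
        # Delete the original only after the copy has succeeded.
        rm -f "$f" || return 1
    done
}
```

Called without arguments as root, it does the same as the three commands; nscd should be stopped before running it and started again afterwards.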
After following the instructions, nscd stays up (I restored the original nscd package first).
Ok, can you now stop nscd, copy the ~/db-nscd/* files back and retry? If that crashes again, I'd be very much interested in those 3 db files, to make nscd more robust when it sees broken cache files. You can mail the files to me or attach here.
It crashes again after restoring the 3 files, which I've emailed to you. I noticed that after doing a cp -p for the 3 files back to /var/db/nscd, even though the modification times are the same (June 12), the contents differ. Presumably the modification times are being saved and restored.
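Since the timestamps do not show whether a cache file has changed, a byte-for-byte comparison is more reliable. A minimal sketch, assuming the saved copies are in ~/db-nscd and the live files in /var/db/nscd as in the earlier comments; the function name compare_nscd_db is made up for illustration.

```shell
#!/bin/sh
# Report which cache files differ between the saved copies and the live
# directory, comparing contents rather than modification times.
# Usage: compare_nscd_db [BACKUP_DIR] [DB_DIR]  (hypothetical helper name)
compare_nscd_db() {
    backup_dir="${1:-$HOME/db-nscd}"
    db_dir="${2:-/var/db/nscd}"
    status=0
    for f in "$backup_dir"/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        # cmp -s is silent and exits non-zero if the files differ or the
        # live file is missing.
        if cmp -s "$f" "$db_dir/$name"; then
            echo "same: $name"
        else
            echo "differs: $name"
            status=1
        fi
    done
    return "$status"
}
```

The function returns non-zero if at least one file differs, so it can be used in a condition such as: compare_nscd_db || echo "cache files changed".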
Still couldn't get a core dump with LD_PRELOAD=libSegFault.so. I'm testing this under Xen guest and host, could that be preventing this? I also tried upgrading glibc and nscd to no avail:

# rpm -q glibc nscd
glibc-2.3.5-11
nscd-2.3.5-11

Here's some valgrind output instead. Note that I'm also using LDAP for NIS netgroup lookups. I'm not sure why I get the fatal error on /proc/self/maps.

# valgrind --db-attach=no --tool=memcheck --error-limit=no nscd -d
7463: handle_request: request received (Version = 2) from PID 7474
7463: GETFDPW
7463: provide access to FD 5, for passwd
7463: handle_request: request received (Version = 2) from PID 7474
7463: GETPWBYNAME (dcox)
7463: Haven't found "dcox" in password cache!
==7463==
==7463== Thread 3:
==7463== Syscall param write(buf) points to uninitialised byte(s)
==7463==    at 0x1B9330BB: (within /lib/libpthread-2.3.5.so)
==7463==    by 0x1BBDC597: sb_debug_write (in /lib/libnss_ldap-2.3.5.so)
==7463==    by 0x1BBD04E1: sb_tls_bio_write (in /lib/libnss_ldap-2.3.5.so)
==7463==    by 0x1BCCB119: BIO_write (in /lib/libnss_ldap-2.3.5.so)
==7463==    by 0x1BC9C100: ssl3_write_pending (in /lib/libnss_ldap-2.3.5.so)
==7463==    by 0x1BC9C6AA: ssl3_write_bytes (in /lib/libnss_ldap-2.3.5.so)
==7463==    by 0x1BCBAFA3: ssl3_write (in /lib/libnss_ldap-2.3.5.so)
==7463==    by 0x1BCA240B: SSL_write (in /lib/libnss_ldap-2.3.5.so)
==7463==    by 0x1BBD0323: sb_tls_write (in /lib/libnss_ldap-2.3.5.so)
==7463==    by 0x1BBDC597: sb_debug_write (in /lib/libnss_ldap-2.3.5.so)
==7463==    by 0x1BBDBA6B: ber_int_sb_write (in /lib/libnss_ldap-2.3.5.so)
==7463==    by 0x1BBD87FD: ber_flush (in /lib/libnss_ldap-2.3.5.so)
==7463== Address 0x1BF2BFCD is 5 bytes inside a block of size 18698 alloc'd
==7463==    at 0x1B909222: malloc (vg_replace_malloc.c:130)
==7463==    by 0x1BCBCC39: default_malloc_ex (in /lib/libnss_ldap-2.3.5.so)
==7463==    by 0x1BCBD1E6: CRYPTO_malloc (in /lib/libnss_ldap-2.3.5.so)
==7463==    by 0x1BC9E621: ssl3_setup_buffers (in /lib/libnss_ldap-2.3.5.so)
==7463==    by 0x1BC9F4EE: ssl23_connect (in /lib/libnss_ldap-2.3.5.so)
==7463==    by 0x1BCA384B: SSL_connect (in /lib/libnss_ldap-2.3.5.so)
==7463==    by 0x1BBD1D9F: ldap_int_tls_start (in /lib/libnss_ldap-2.3.5.so)
==7463==    by 0x1BBD2282: ldap_start_tls_s (in /lib/libnss_ldap-2.3.5.so)
==7463==    by 0x1BBAC0AA: do_open (ldap-nss.c:1274)
==7463==    by 0x1BBAC28E: do_init2 (ldap-nss.c:960)
==7463==    by 0x1BBAEDF7: _nss_ldap_initgroups_dyn (ldap-grp.c:1050)
==7463==    by 0x1B9DED43: internal_getgrouplist (in /lib/libc-2.3.5.so)
==7463== FATAL: can't open /proc/self/maps
I found that even after deleting the 3 files and letting them be recreated, nscd eventually dies, though not quickly. Once it dies, the 3 files in /var/db/nscd at that time cause it to crash quickly again.
Stop adding LDAP-related comments here. As the backtrace in comment #9 shows, this seems to be a problem in the LDAP code. It might not be the only problem. Eliminate the use of LDAP if you want to add anything to this bug. Besides, your Xen domain seems to be severely crippled: /proc is not mounted, which is fatal these days.
I have the same problem without LDAP. Here is the output of valgrind:

valgrind --db-attach=no --tool=memcheck --error-limit=no nscd -d
==24122== Memcheck, a memory error detector for x86-linux.
==24122== Copyright (C) 2002-2005, and GNU GPL'd, by Julian Seward et al.
==24122== Using valgrind-2.4.0, a program supervision framework for x86-linux.
==24122== Copyright (C) 2000-2005, and GNU GPL'd, by Julian Seward et al.
==24122== For more details, rerun with: -v
==24122==
==24122== Syscall param write(buf) points to uninitialised byte(s)
==24122==    at 0x525093: __write_nocancel (in /lib/libpthread-2.3.5.so)
==24122==    by 0x40AC: main (in /usr/sbin/nscd)
==24122== Address 0x52BFE7F0 is on thread 1's stack
24122: handle_request: request received (Version = 2) from PID 24132
24122: GETFDPW
24122: provide access to FD 4, for passwd
24122: handle_request: request received (Version = 2) from PID 24132
24122: GETPWBYNAME (sshd)
24122: Haven't found "sshd" in password cache!
24122: handle_request: request received (Version = 2) from PID 24133
24122: GETPWBYNAME (sshd)
........ [some lines removed] ..........
24122: provide access to FD 4, for passwd
24122: handle_request: request received (Version = 2) from PID 24179
24122: GETFDPW
24122: provide access to FD 4, for passwd
24122: handle_request: request received (Version = 2) from PID 24181
24122: GETFDPW
24122: provide access to FD 4, for passwd
24122: remove GETPWBYNAME entry "test"
==24122==
==24122== Thread 2:
==24122== Invalid write of size 4
==24122==    at 0xBECE: (within /usr/sbin/nscd)
==24122== Address 0x1AE26AC0 is on thread 2's stack
==24122== Stack overflow in thread 2: can't grow stack to 0x1AE26AC0
==24122==
==24122== Process terminating with default action of signal 11 (SIGSEGV)
==24122== Access not within mapped region at address 0x1AE26AC0
==24122==    at 0xBECE: (within /usr/sbin/nscd)
==24122==
==24122== ERROR SUMMARY: 4 errors from 2 contexts (suppressed: 26 from 1)
==24122== malloc/free: in use at exit: 13665 bytes in 28 blocks.
==24122== malloc/free: 256 allocs, 228 frees, 90239 bytes allocated.
==24122== For counts of detected errors, rerun with: -v
==24122== searching for pointers to 28 not-freed blocks.
==24122== checked 6436176 bytes.
==24122==
==24122== LEAK SUMMARY:
==24122==    definitely lost: 0 bytes in 0 blocks.
==24122==    possibly lost: 816 bytes in 6 blocks.
==24122==    still reachable: 12849 bytes in 22 blocks.
==24122==    suppressed: 0 bytes in 0 blocks.
==24122== Reachable blocks (those to which a pointer was found) are not shown.
==24122== To see them, rerun with: --show-reachable=yes
==24122== FATAL: can't open /proc/self/maps
*** This bug has been marked as a duplicate of 162712 ***