From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/412.7 (KHTML, like Gecko) Safari/412.5

Description of problem:
After updating to RHEL4 U2, nscd randomly segfaults. Looking at a backtrace, this appears to happen when getting certain entries from the nscd cache. If nscd is set to keep a persistent cache, the segfault will occur within a few seconds of restarting the process. Deleting the database files in /var/db/nscd will enable nscd to run for a while until it encounters an entry that causes it to segfault, usually after anywhere from one to several hours. This behavior has only been noticed on our x86_64 machines. Our i386 versions of Red Hat Enterprise appear not to have this issue. This happens whether using the 2.6.9-11 or 2.6.9-22 kernel. I don't know what particular entry in the cache is causing the problem. I'm attaching a backtrace and some strace output of the segfault.

Version-Release number of selected component (if applicable):
glibc-2.3.4-2.13

How reproducible:
Always

Steps to Reproduce:
1. Start NSCD with hosts cache enabled
2. Wait a while
3. NSCD segfaults when accessing certain entries in the cache

Actual Results:
NSCD segfaults, leaving subsys locked

Expected Results:
NSCD retrieves entry from cache and continues working

Additional info:
2.6.9-22.ELsmp #1 SMP Mon Sep 19 18:00:54 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux
Created attachment 119781 [details] Backtrace of the segfault
Created attachment 119782 [details] strace of the segfault This is an strace of nscd using the same cache database files as the backtrace.
The backtrace is certainly weird. Can you please install glibc-debuginfo from ftp://people.redhat.com/jakub/glibc/2.3.4-2.13/ and get a more accurate backtrace? E.g. there are no vsnprintf calls in nss/, resolv/.
Created attachment 119856 [details] Backtrace with debuginfo This is a backtrace of the segfault with debuginfo included
I was wrong initially. This doesn't appear to be limited to the hosts cache, since the problem still exists when hosts cache is disabled. I posted a new backtrace that had debuginfo included.
Can you reproduce the problem even without -d? Can you go up 4 frames and, in the addpwbyX frame, run x/1s keystr? From a quick look it sounds like some keys in the cache aren't zero terminated, but the only place where they are used as C strings rather than as a chunk of memory ->len bytes long is in debugging printouts (in which case I guess %.*s rather than %s in the format strings that print them would be sufficient).
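To illustrate the %.*s suggestion above, here is a minimal sketch in C. It is not taken from the nscd sources: the helper name format_key and the sample key are made up for illustration. The point is only that a precision argument limits how many bytes printf reads, so a key that is len bytes long and has no terminating NUL can still be printed safely.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not from nscd): format a key that is `len` bytes
   long and is NOT guaranteed to be NUL-terminated.  The precision passed
   through %.*s caps how many bytes snprintf reads from `key`; a plain %s
   would keep reading until it happened to find a zero byte. */
static int format_key(char *out, size_t outsz, const char *key, size_t len)
{
    assert(len <= INT_MAX);   /* the precision argument must be an int */
    return snprintf(out, outsz, "key=\"%.*s\"", (int) len, key);
}
```

A caller would pass the key pointer and the stored length, e.g. format_key(buf, sizeof buf, keystr, len), instead of handing keystr to a bare %s.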
Yeah....the problem still exists, even without -d. It was just easier to get info about what nscd was doing by using the debug option. I can get an strace for it without -d (do you want me to follow threads, or stay with the parent process?). I'm not too sure I understand the rest of the request though. The problem occurs at random, but with a persistent cache, I can reproduce the problem every time I try until I reset the cache. Installing nscd-2.3.4-2.9 makes the problem go away, even if I use the cache files that were causing the new version of nscd to segfault.
Created attachment 119869 [details] Extra info on frames 4 to 8 Here is additional info on frames 4 to 8
Created attachment 119870 [details] x/ls keystr
I've posted the info that I think was requested. As mentioned, the problem does exist even when not using -d, but I don't know how to get gdb to follow threads, since the threads die almost immediately before I can attach to them.
As you said that you can reproduce the problem every time once you get the persistent cache into some state, can you run nscd (without -d) directly under gdb at that point, so you don't have to attach? Also, could we get a copy of one of the cache files that's causing this (whether as a private attachment here, or by mailing it to me directly)? There weren't many nscd changes between U1 and U2; the only one relevant to this is that previously nscd was using a bad time and therefore some cache entries were never pruned.
Created attachment 119941 [details] gdb output without using -d option This file has gdb output without using the -d option
There is a corrupted entry in the passwd database file you posted:
$7 = {type = GETPWBYNAME, first = true, len = 6, key = 2865298694, owner = -1, next = 4294967295, packet = 25648, {dellist = 0x2a9556c328, prevp = 0x2a9556c328}}
(gdb) p/x *here
$8 = {type = 0x0, first = 0x1, len = 0x6, key = 0xaac8fd06, owner = 0xffffffff, next = 0xffffffff, packet = 0x6430, {dellist = 0x2a9556c328, prevp = 0x2a9556c328}}
here->key is clearly far beyond the end of the database (entry at 0x6868 in the passwd db file). `here' is the first entry in the chain (so directly referenced from the hash table). Having packet == -1 sounds weird as well. The nscd db verifier (currently in rawhide, scheduled for RHEL4 U3) detects this situation and reinitializes the database file. But so far I have no idea why such corruption would appear (apart from hw problems, which doesn't mean it can't be a nscd bug).
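The kind of check described above (an offset stored in a hash entry must fall inside the database) can be sketched as follows. This is only an illustration under stated assumptions, not the actual nscd verifier: struct entry_sketch is a simplified, hypothetical stand-in holding a few of the fields visible in the gdb dump, REF_NONE is an assumed name for the all-ones "no reference" value, and the data size is supplied by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed name for the all-ones "no reference" value seen in the dump. */
#define REF_NONE UINT32_MAX

/* Hypothetical, simplified stand-in for a cache hash entry.  Offsets are
   relative to the start of the data area. */
struct entry_sketch {
    uint32_t len;     /* key length in bytes */
    uint32_t key;     /* offset of the key */
    uint32_t next;    /* offset of the next entry in the chain, or REF_NONE */
    uint32_t packet;  /* offset of the cached reply */
};

/* True if [off, off + len) lies entirely inside data_size bytes.
   Written so that the arithmetic cannot overflow. */
static bool range_ok(uint32_t off, uint32_t len, size_t data_size)
{
    return off <= data_size && len <= data_size - off;
}

/* Reject an entry whose references point outside the data area. */
static bool entry_ok(const struct entry_sketch *e, size_t data_size)
{
    if (!range_ok(e->key, e->len, data_size))
        return false;
    if (e->packet >= data_size)
        return false;
    if (e->next != REF_NONE
        && !range_ok(e->next, (uint32_t) sizeof *e, data_size))
        return false;
    return true;
}
```

With the values from the dump (len = 6, key = 2865298694, packet = 25648), such a check fails on the key for any database smaller than about 2.8 GB, which is the situation in which a verifier would discard and reinitialize the file.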
Hi, I seem to have this same problem with nscd. It segfaults, and if I start it again it will segfault unless I delete the nscd database files. After deleting those files it will run for a few hours to a few days and then segfault.
nscd[25661]: segfault at 0000002b401e2c42 rip 0000002a98c5f420 rsp 00000000401ff250 error 4
Red Hat AS4 with all updates except the newest kernel.
2.6.9-22.ELsmp #1 SMP Mon Sep 19 18:00:54 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux
Hardware is a Dell PowerEdge 2850.
I have the problem as well.
RHEL4.2, x86_64, 2.6.9-22. Seeing the same issue. This started occurring on a system that had been unchanged, with an uptime of 20 days. Removing the DBs resolves the issue immediately; I will update if it recurs after rebuilding the cache from zero. Worth noting: nscd doesn't rebuild/resize the DB if you increase the suggested-size. It gives warnings in debug mode but continues to run with the "original" smaller size tables on disk. /eli
nscd[6684]: segfault at 0000002b9555c3b8 rip 000000552aab6c64 rsp 0000000040c04a60 error 4
nscd[10752]: segfault at 0000002b9555c3b8 rip 000000552aab6c64 rsp 0000000040c04a60 error 4
nscd[11497]: segfault at 0000002b9555c3b8 rip 000000552aab6c64 rsp 0000000040a03a60 error 4
nscd[12694]: segfault at 0000002b9555c3b8 rip 000000552aab6c64 rsp 0000000040601a60 error 4
nscd[14552]: segfault at 0000002b9555c3b8 rip 000000552aab6c64 rsp 0000000040400a60 error 4
nscd[15073]: segfault at 0000002b9555c3b8 rip 000000552aab6c64 rsp 0000000040c04a60 error 4
nscd[16465]: segfault at 0000002b9555c3cc rip 000000552aab6f65 rsp 00000000401ff800 error 4
nscd[17350]: segfault at 0000002b9555c3cc rip 000000552aab6f65 rsp 00000000401ff800 error 4
nscd[9286]: segfault at 0000002b9961e3b8 rip 000000552aab6c64 rsp 0000000040802a60 error 4
nscd[10249]: segfault at 0000002b9961e3cc rip 000000552aab6f65 rsp 00000000401ff800 error 4
nscd[11812]: segfault at 0000002b9961e3cc rip 000000552aab6f65 rsp 00000000401ff800 error 4
nscd[12885]: segfault at 0000002b9555c3cc rip 000000552aab6f65 rsp 00000000401ff800 error 4
nscd[13473]: segfault at 0000002b9961e3cc rip 000000552aab6f65 rsp 00000000401ff800 error 4
nscd[20113]: segfault at 0000002b9961e3cc rip 000000552aab6f65 rsp 00000000401ff800 error 4
I have found the same problem, but in my case it seems that sendmail was segfaulting while trying to do a host lookup (or at least the error cleared up after nscd was fixed):
kernel: sendmail[28765]: segfault at 0000002b04bd421d rip 0000002a9657637f rsp 0000007fbfffc620 error 4
nscd did not crash, but was using 99.9% of one cpu. At the same time I noticed several ntpdate processes using 99.9% of the cpu. If nscd was off, ntpdate would work correctly, but if nscd was on (it would start, then take 99.9% of cpu and stay like that) ntpdate would hang. Invalidating the caches did not fix the problem. Stopping nscd, erasing the databases from /var/db/nscd as suggested, and restarting nscd fixed all the problems (sendmail and ntpdate) on my x86_64 machines. I can also confirm that my 4 machines running i386 do not have any nscd problems. My machines are fully patched and running the latest kernel:
2.6.9-22.0.2.ELsmp #1 SMP Thu Jan 5 17:11:56 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
Diego
I just saw the same under x86 as well (corrupted database killing nscd). I've disabled the persistent cache until U3 is out, but I kept the corrupted database so I'll be able to test whether the db verifier can handle the problem.
ping. Is there anything new on this? We're still seeing this problem in 4.4 on x86_64 boxes.
I haven't seen the problem after U3 on any of our machines, x86_64 included.
We still see this, but it is more pronounced. After the update to U3 (updated glibc), we get segfaults every couple of minutes for nscd. We've 'worked around' the issue for the moment by setting up a script that checks nscd status every minute and restarts it if it is dead. We are considering keeping RHEL for our 32-bit systems, and switching to another vendor for 64-bit systems to work around this issue. It only happens on our x86_64 machines.
Comment #27 matches the behavior we've seen. Very frequent failures on x86_64. You aren't by chance using ldap for your nss are you? I'd be curious to know what nss modules folks are using who are and are not seeing the issue.
We are currently using ldap for user and group information in nss, but we are not using it for shadow. For authentication, we are using LDAP authentication by means of PAM.
There have been several nscd related fixes (both on the nscd daemon and nscd client code sides) post U3, some in U4 and some are queued for RHEL4.5 (you can try e.g. http://people.redhat.com/jakub/glibc/2.3.4-2.36.1/ packages (for testing only, they haven't been through QA)). If you experience crashes even with that glibc and ideally without LDAP (because nss_ldap or its libraries are a possible culprit too), please file a new bug rather than adding a me too to a closed bug.
If you do file a new bug please post it here so we can follow it :) For the record we use nss_nis mostly here, but the handful of machines that use nss_ldap don't have a problem either.
We've tried the newer patches without any luck. We also see this issue on RHEL 5. However, as stated, we only see this on our x86_64 machines. For those of you still having issues, we're using a workaround that reduces the pain a bit. We have a cron job that runs every minute looking for the nscd process, and if it isn't running, it restarts nscd and logs the event. We have a lot of failures daily depending on how heavily the system is used. Here's our crontab entry:
* * * * * if /bin/ps -e | /bin/grep nscd > /dev/null; then echo -n; else echo DOWN:`date`; /etc/init.d/nscd restart; /usr/sbin/apachectl restart; fi >> /var/log/nscd-restart.log
Just FYI, I've seen database corruption quite similar to that in comment #15 (i.e. hashentry->packet points outside of the allocated area). My analysis and upstream patch are posted here: http://sources.redhat.com/bugzilla/show_bug.cgi?id=9746