From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux ppc; en-US; rv:1.7.12) Gecko/20051018 Epiphany/1.8.2

Description of problem:
I have a test laptop that usually uses LDAP for NSS and kerberos for authentication. The laptop uses nscd and pam_ccreds to operate when disconnected from the network. This setup has worked fine for quite some time. Recently, I updated to the newest Rawhide and nscd seems to be broken. I use a PowerPC kernel/glibc.

Version-Release number of selected component (if applicable):
nscd-2.3.90-15

How reproducible:
Always

Steps to Reproduce:
1. Gdb nscd.
2. gdb> run -d
3. id -ng
4. Disconnect laptop from LDAP server.
5. id -ng

Actual Results:
After step 2, the user's group name is printed. After step 4, nscd crashes. The id command hangs in step 5 because nscd is gone and the LDAP server is not available.

7226: handle_request: request received (Version = 2) from PID 7237
7226: GETFDGR
7226: provide access to FD 12, for group
7226: handle_request: request received (Version = 2) from PID 7239
7226: GETFDPW
7226: provide access to FD 10, for passwd

Program received signal SIGTERM, Terminated.
[Switching to Thread 805432992 (LWP 7226)]
0x07e64148 in epoll_wait () from /lib/libc.so.6
(gdb) ba
#0 0x07e64148 in epoll_wait () from /lib/libc.so.6
#1 0x08006d30 in sighup_handler () from /usr/sbin/nscd
#2 0x08006d30 in sighup_handler () from /usr/sbin/nscd
#3 0x08006d30 in sighup_handler () from /usr/sbin/nscd
#4 0x08006d30 in sighup_handler () from /usr/sbin/nscd
#5 0x08006d30 in sighup_handler () from /usr/sbin/nscd
Previous frame inner to this frame (corrupt stack?)

Expected Results:
Nscd should allow the id command to print even when the LDAP server is not available.

Additional info:
When disconnected:
Logins fail, hanging on the initgroups function.
The system message bus will not start, hangs on the getgrouplist function.
Created attachment 121528 [details]
Program to test initgroups() and nscd

As of nscd-2.3.90-18, the daemon no longer crashes (that I have seen). However, the original symptoms remain. The attached program may be used to test nscd. Here are some scenarios:

1. Execute program while attached to network/LDAP server, the nscd daemon says:

31166: handle_request: request received (Version = 2) from PID 3904
31166: GETFDGR
31166: provide access to FD 9, for group
31166: handle_request: request received (Version = 2) from PID 3904
31166: INITGROUPS (mike)
31166: Haven't found "mike" in group cache!

2. Wait 10 seconds, the nscd daemon says (why removed so soon?):

31166: remove INITGROUPS entry "mike"

3. Disconnect from network, execute program, nscd daemon says:

31166: handle_request: request received (Version = 2) from PID 5090
31166: GETFDGR
31166: provide access to FD 9, for group
31166: handle_request: request received (Version = 2) from PID 5090
31166: INITGROUPS (mike)
31166: Haven't found "mike" in group cache!

Program hangs, trying to make LDAP request.

NOTE: if you disconnect and execute program before "remove INITGROUPS" message, then program will NOT hang.

I also see this message printed by the daemon: "31166: short write in addinitgroupsX: Broken pipe."
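For readers without access to the attachment, the following is a rough stand-in, not the attached program itself. It is a minimal Python sketch that assumes os.getgrouplist() reaches the same glibc group-list lookup that initgroups() relies on, so running it while nscd is in debug mode should produce an INITGROUPS request similar to the ones logged above. The function name timed_lookup and the choice to report elapsed time are illustrative assumptions.

```python
import grp
import os
import pwd
import time


def timed_lookup(user):
    """Resolve the supplementary groups of `user` and time the lookup.

    Returns (group_names, elapsed_seconds). A long elapsed time (or a
    hang) while disconnected suggests the request went past the cache
    to the unreachable LDAP server.
    """
    start = time.monotonic()
    pw = pwd.getpwnam(user)                     # passwd lookup
    gids = os.getgrouplist(user, pw.pw_gid)     # group-list lookup, as used by initgroups()
    names = []
    for gid in gids:
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            names.append(str(gid))              # gid with no group entry
    return names, time.monotonic() - start
```

Calling timed_lookup("mike") once while connected, and again after disconnecting (both before and after the "remove INITGROUPS entry" message appears), mirrors scenarios 1 to 3 above.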
Can you please:
1) install glibc-debuginfo* corresponding to the glibc/nscd you have installed
2) when you reproduce the hang in some application, as root run
gdb /usr/sbin/nscd `/sbin/pidof nscd`
and get backtraces of all threads to see where exactly it hangs?
It might very well be an nss_ldap bug, which is a separate package.
Created attachment 121618 [details] Backtrace of su during hang
Created attachment 121619 [details] Backtrace of nscd threads during hang of su
It seems that nscd is prematurely invalidating its cache of initgroups data. See in comment #1, "31166: remove INITGROUPS entry 'mike'." Why is nscd invalidating this cache entry so soon after it has been entered (within seconds, according to comment #1)?
See also http://sources.redhat.com/bugzilla/show_bug.cgi?id=2098.
You didn't explain what kind of entries are evacuated too early. I think it's an entry without auxiliary groups. For this I checked in a patch. The entries are now added with the usual timeout value. Should be in the next rawhide build.
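To illustrate why the timeout an entry is added with matters for the disconnected case, here is a toy model of a per-entry expiry cache. It is not nscd's code: the class TTLCache, the injected clock, and the 3600-second "usual" timeout are assumptions for illustration only. The 10-second value comes from the scenario in comment #1.

```python
import time


class TTLCache:
    """Toy cache in which every entry carries its own expiry time."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}

    def add(self, key, value, ttl):
        # Store the value together with the moment it stops being valid.
        self._entries[key] = (value, self._clock() + ttl)

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        value, expires = item
        if self._clock() >= expires:
            # Loosely comparable to the "remove INITGROUPS entry" log line.
            del self._entries[key]
            return None
        return value


# Simulated clock so the example does not need to sleep.
now = [0.0]
cache = TTLCache(clock=lambda: now[0])
cache.add("short", [100], ttl=10)      # entry added with a too-short timeout
cache.add("usual", [100], ttl=3600)    # hypothetical "usual" timeout
now[0] = 11.0                          # network goes away 11 seconds later
print(cache.get("short"), cache.get("usual"))
```

In this model the short-lived entry is already gone when the server becomes unreachable, so the lookup would have to go to LDAP, while the entry added with the longer timeout can still be answered from the cache.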
Why bz closed the bug I don't know. Until a new rawhide release is out it should remain open.
The changes are in nscd-2.4.90-17 in rawhide.
I tested nscd-2.4.90-21 and this seems fixed. Thank you.