Description of problem:
When I boot into runlevel 5, gdm does not work. The X cursor is on the screen momentarily and then everything goes blank.

Version-Release number of selected component (if applicable):
nscd-2.6.90-15

How reproducible:
Every time

Steps to Reproduce:
Boot into runlevel 5.

Actual results:
Gdm does not work. If I switch to another virtual console and run top, then I see that nscd is using a lot of computer cycles. If I enter runlevel 3, stop the nscd service and return to runlevel 5, then gdm works fine.

Expected results:
Gdm should work when nscd is running.

Additional info:
Created attachment 210691 [details] My nscd configuration
Also of note is that I am using LDAP for NS.
Try without LDAP. In 99% of the cases when nss_ldap is involved it's the module's fault.
I am currently checking whether I can reproduce this while LDAP is not being used. Until then, here is some output from an nscd process that is consuming a lot of CPU cycles (LDAP used):

# nscd -d
4496: Access Vector Cache (AVC) started
4496: invalid persistent database file "/var/db/nscd/passwd": file size does not match
4496: invalid persistent database file "/var/db/nscd/group": file size does not match
4496: invalid persistent database file "/var/db/nscd/hosts": file size does not match
4496: handle_request: request received (Version = 2) from PID 4523
4496: GETFDPW
4496: provide access to FD 6, for passwd
4496: handle_request: request received (Version = 2) from PID 4523
4496: GETPWBYUID (0)
4496: Haven't found "0" in password cache!
4496: short write in cache_addpw: Permission denied
4496: handle_request: request received (Version = 2) from PID 4526
4496: GETFDPW
4496: provide access to FD 6, for passwd
4496: handle_request: request received (Version = 2) from PID 4526
4496: GETPWBYUID (32)
4496: Haven't found "32" in password cache!
4496: short write in cache_addpw: Permission denied
4496: handle_request: request received (Version = 2) from PID 4536
4496: handle_request: request received (Version = 2) from PID 4536
4496: handle_request: request received (Version = 2) from PID 4536
4496: handle_request: request received (Version = 2) from PID 4536
4496: handle_request: request received (Version = 2) from PID 4545
4496: handle_request: request received (Version = 2) from PID 4545
4496: handle_request: request received (Version = 2) from PID 4612
4496: GETFDPW
4496: provide access to FD 6, for passwd
Are you running it as root? Do you see any AVC denial messages in /var/log/audit/audit.log? Can you strace it? The short writes can cause problems, sure, but they shouldn't normally happen, unless the perms are wrong or unless you run out of disk space.
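For anyone repeating the audit-log check asked for above, here is a minimal Python sketch that filters a log for lines that look like AVC denials attributed to nscd. The function name `nscd_avc_denials` is hypothetical, and the matching is a plain substring test on "avc:", "denied" and comm="nscd", which assumes the usual audit log line layout; it is a convenience filter, not a replacement for the audit tools.

```python
def nscd_avc_denials(lines, comm="nscd"):
    """Return lines that look like AVC denials for the given command name.

    Assumption: denial records contain the substrings 'avc:' and 'denied'
    plus a comm="<name>" field. This is a simple text filter, not a parser.
    """
    needle = f'comm="{comm}"'
    return [
        line.rstrip("\n")
        for line in lines
        if "avc:" in line and "denied" in line and needle in line
    ]


if __name__ == "__main__":
    # Path taken from the comment above; reading it normally requires root.
    try:
        with open("/var/log/audit/audit.log", errors="replace") as log:
            for hit in nscd_avc_denials(log):
                print(hit)
    except OSError as exc:
        print(f"could not read audit log: {exc}")
```

An empty result only means no matching lines were logged; it does not prove SELinux is uninvolved, since some denials may not be audited.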
Yes, I am running nscd as root. I don't see any AVC denial messages yet. This is what "strace nscd -d" says while nscd burns CPU cycles:

epoll_ctl(14, EPOLL_CTL_DEL, 15, NULL) = 0
futex(0x2003046c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x20030468, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_EQ, 0}3004: handle_request: request received (Version = 2) from PID 3114
3004: GETFDPW
3004: provide access to FD 8, for passwd
) = 1
epoll_wait(14, {}, 100, 29988) = 0
time(NULL) = 1191370579
epoll_wait(14, {}, 100, 29988) = 0
time(NULL) = 1191370609
epoll_wait(14, [...]

Something is not right. The lack of AVC messages surprises me. I'll continue to try to figure this out.
That's not the interesting part of the strace. I was interested in seeing the write that returned -EPERM and caused the "short write in cache_addpw: Permission denied" message you cited above.
Created attachment 218241 [details] Another strace of nscd
I just attached to a procmail process that seemed to be eating CPU cycles endlessly. This is the backtrace of the process:

(gdb) ba
#0  0x0feabde8 in __nscd_cache_search () from /lib/libc.so.6
#1  0x0fea9144 in nscd_getpw_r () from /lib/libc.so.6
#2  0x0fea9518 in __nscd_getpwuid_r () from /lib/libc.so.6
#3  0x0fe28c48 in getpwuid_r@@GLIBC_2.1.2 () from /lib/libc.so.6
#4  0x0fe283ac in getpwuid () from /lib/libc.so.6
#5  0x1000f3ac in ?? ()
#6  0x10001568 in ?? ()
#7  0x10002bb8 in ?? ()
#8  0x0fd9946c in generic_start_main () from /lib/libc.so.6
#9  0x0fd9963c in __libc_start_main () from /lib/libc.so.6
#10 0x00000000 in ?? ()
You are still using LDAP. What about situations when this is not the case? The LDAP module is of poor quality and might very well be the source of the problem. For the stack trace in comment #9: is this with persistent databases? If yes, does nscd report an error when you restart it? This can only happen if the database is corrupted, in which case there can be a circular list. I've added some protection against this case now, but this wouldn't fix the underlying problem. And again: we need proof that this happens without the LDAP module.
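To illustrate the circular-list failure mode described above: a lookup that follows "next" links through a hash chain never terminates if corruption makes the chain point back at itself, which would match a client spinning in __nscd_cache_search. The following is a minimal Python sketch of a bounded chain walk. It is not the glibc code and not the protection that was actually added; the record layout (a dict mapping an offset to a key and the next offset) and the name `chain_search` are hypothetical.

```python
def chain_search(records, head, key, max_steps=None):
    """Walk a hash chain stored as {offset: (key, next_offset)}.

    Returns the offset of the matching record, or None if the key is absent
    or the walk exceeds max_steps (treated as a corrupted, looping chain).
    """
    if max_steps is None:
        # A well-formed chain cannot visit more records than exist.
        max_steps = len(records)
    offset = head
    steps = 0
    while offset is not None:
        if steps >= max_steps:
            return None  # loop or corruption: give up instead of spinning
        record = records.get(offset)
        if record is None:
            return None  # dangling link into nowhere
        rec_key, next_offset = record
        if rec_key == key:
            return offset
        offset = next_offset
        steps += 1
    return None
```

With a step bound, a miss on a looping chain returns "not found" and the caller can fall back to a normal lookup; without it, the same miss loops forever.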
I have removed all references to nss_ldap from /etc/nsswitch.conf. I have disabled SELinux. I executed "su -" and nscd and su both began to burn CPU cycles endlessly.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2723 nscd      20   0  151m 1244 1000 S 77.6  0.5   0:38.15 nscd
 2774 root      20   0  5056  632  560 R 13.2  0.2   4:35.42 su

su:

(gdb) ba
#0  0x0ff50df4 in __nscd_cache_search () from /lib/libc.so.6
#1  0x0ff4e144 in nscd_getpw_r () from /lib/libc.so.6
#2  0x0fecd9a8 in getpwnam_r@@GLIBC_2.1.2 () from /lib/libc.so.6
#3  0x0fecd1fc in getpwnam () from /lib/libc.so.6
#4  0x100031c4 in ?? ()
#5  0x0fe3e46c in generic_start_main () from /lib/libc.so.6
#6  0x0fe3e63c in __libc_start_main () from /lib/libc.so.6
#7  0x00000000 in ?? ()

nscd:

(gdb) ba
#0  0x1fdf10b8 in epoll_wait () from /lib/libc.so.6
#1  0x20006edc in start_threads () from /usr/sbin/nscd
#2  0x20005cac in main () from /usr/sbin/nscd

I am now using the default nscd.conf that is distributed with Fedora Rawhide:

server-user nscd
debug-level 0
paranoia no

enable-cache passwd yes
positive-time-to-live passwd 600
negative-time-to-live passwd 20
suggested-size passwd 211
check-files passwd yes
persistent passwd yes
shared passwd yes
max-db-size passwd 33554432
auto-propagate passwd yes

enable-cache group yes
positive-time-to-live group 3600
negative-time-to-live group 60
suggested-size group 211
check-files group yes
persistent group yes
shared group yes
max-db-size group 33554432
auto-propagate group yes

enable-cache hosts yes
positive-time-to-live hosts 3600
negative-time-to-live hosts 20
suggested-size hosts 211
check-files hosts yes
persistent hosts yes
shared hosts yes
max-db-size hosts 33554432

enable-cache services yes
positive-time-to-live services 28800
negative-time-to-live services 20
suggested-size services 211
check-files services yes
persistent services yes
shared services yes
max-db-size services 33554432
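Since the backtraces show clients stuck inside getpwuid()/getpwnam(), a small probe that performs one such lookup under a timeout may help confirm the hang without tying up a login shell. This is a minimal sketch, assuming Python is available on the affected machine; the helper name `lookup_hangs` and the 10-second default are hypothetical choices. Python's pwd module goes through the C library's passwd lookup, so it should exercise the same code path as su or procmail when nscd is running.

```python
import subprocess
import sys


def lookup_hangs(uid=0, timeout=10.0):
    """Run pwd.getpwuid(uid) in a child process.

    Returns True if the child had to be killed after `timeout` seconds,
    False if the lookup finished (whether or not the uid exists).
    """
    child_code = f"import pwd; pwd.getpwuid({int(uid)})"
    try:
        subprocess.run(
            [sys.executable, "-c", child_code],
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return True  # the child was killed because it exceeded the timeout
    return False


if __name__ == "__main__":
    print("lookup hangs" if lookup_hangs(0) else "lookup completed")
```

Running the probe once with nscd started and once with it stopped should show whether the hang follows the daemon, as the su backtrace suggests.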
Did you start from a fresh set of databases? I.e., remove everything in /var/db/nscd/ and then start again.
Yes, I delete the databases in /var/db/nscd before I start nscd.
Based on the date this bug was created, it appears to have been reported during the development of Fedora 8. In order to refocus our efforts as a project we are changing the version of this bug to '8'. If this bug still exists in rawhide, please change the version back to rawhide. (If you're unable to change the bug's version, add a comment to the bug and someone will change it for you.) Thanks for your help and we apologize for the interruption. The process we're following is outlined here: http://fedoraproject.org/wiki/BugZappers/F9CleanUp We will be following the process here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this doesn't happen again.