311291 – Nscd seems to cause gdm to hang

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	glibc
Sub Component:
Version:	8
Hardware:	powerpc
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	---
Assignee:	Jakub Jelinek
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:	bzcl34nup
Depends On:
Blocks:

Reported:	2007-09-28 16:13 UTC by W. Michael Petullo
Modified:	2008-09-06 18:57 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2008-09-06 18:57:05 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
My nscd configuration (1.99 KB, text/plain) 2007-09-28 16:19 UTC, W. Michael Petullo
Another strace of nscd (57.67 KB, text/plain) 2007-10-06 01:21 UTC, W. Michael Petullo

Description W. Michael Petullo 2007-09-28 16:13:56 UTC

Description of problem:
When I boot into runlevel 5, gdm does not work.  The X cursor is on the screen
momentarily and then everything goes blank.

Version-Release number of selected component (if applicable):
nscd-2.6.90-15

How reproducible:
Every time

Steps to Reproduce:
Boot into runlevel 5.
  
Actual results:
Gdm does not work. If I switch to another virtual console and run top, then I
see that nscd is using a lot of computer cycles. If I enter runlevel 3, stop the
nscd service and return to runlevel 5, then gdm works fine.

Expected results:
Gdm should work when nscd is running.

Additional info:

Comment 1 W. Michael Petullo 2007-09-28 16:19:13 UTC

Created attachment 210691
My nscd configuration

Comment 2 W. Michael Petullo 2007-09-28 16:24:53 UTC

Also of note is that I am using LDAP for NS.

Comment 3 Ulrich Drepper 2007-09-30 22:07:09 UTC

Try without LDAP.  In 99% of the cases when nss_ldap is involved it's the
module's fault.

Comment 4 W. Michael Petullo 2007-10-02 00:15:29 UTC

I am currently checking whether I can reproduce this without LDAP.

Until then, here is some output from a nscd process that is consuming a lot of
CPU cycles (LDAP used):

# nscd -d
4496: Access Vector Cache (AVC) started
4496: invalid persistent database file "/var/db/nscd/passwd": file size does not match
4496: invalid persistent database file "/var/db/nscd/group": file size does not match
4496: invalid persistent database file "/var/db/nscd/hosts": file size does not match
4496: handle_request: request received (Version = 2) from PID 4523
4496:   GETFDPW
4496: provide access to FD 6, for passwd
4496: handle_request: request received (Version = 2) from PID 4523
4496:   GETPWBYUID (0)
4496: Haven't found "0" in password cache!
4496: short write in cache_addpw: Permission denied
4496: handle_request: request received (Version = 2) from PID 4526
4496:   GETFDPW
4496: provide access to FD 6, for passwd
4496: handle_request: request received (Version = 2) from PID 4526
4496:   GETPWBYUID (32)
4496: Haven't found "32" in password cache!
4496: short write in cache_addpw: Permission denied
4496: handle_request: request received (Version = 2) from PID 4536
4496: handle_request: request received (Version = 2) from PID 4536
4496: handle_request: request received (Version = 2) from PID 4536
4496: handle_request: request received (Version = 2) from PID 4536
4496: handle_request: request received (Version = 2) from PID 4545
4496: handle_request: request received (Version = 2) from PID 4545
4496: handle_request: request received (Version = 2) from PID 4612
4496:   GETFDPW
4496: provide access to FD 6, for passwd
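
A debug log like the one above can be tallied per client PID to see which processes keep hitting nscd. What follows is a minimal Python sketch, assuming the "handle_request: request received (Version = N) from PID M" line format shown in this log; the function name requests_per_pid is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   4496: handle_request: request received (Version = 2) from PID 4523
# \s+ before the PID tolerates a line wrap between "PID" and the number.
REQUEST_RE = re.compile(
    r"handle_request: request received \(Version = \d+\) from PID\s+(\d+)"
)

def requests_per_pid(log_text):
    """Count handle_request lines per client PID in `nscd -d` output."""
    return Counter(int(m.group(1)) for m in REQUEST_RE.finditer(log_text))
```

Calling requests_per_pid(open("nscd-debug.log").read()).most_common(5) would list the five busiest clients; the log file name here is only an example.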

Comment 5 Jakub Jelinek 2007-10-02 11:58:23 UTC

Are you running it as root?  Do you see any AVC denial messages in
/var/log/audit/audit.log?  Can you strace it?
The short writes can cause problems, sure, but they shouldn't normally happen,
unless the perms are wrong or unless you run out of disk space.
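
The two causes named above (wrong permissions, or no free disk space) can be checked in one pass. This is a minimal Python sketch, assuming the persistent databases live in /var/db/nscd as the log in comment 4 indicates; describe_cache_dir is a hypothetical helper name, not part of nscd or glibc.

```python
import os
import shutil
import stat

def describe_cache_dir(path="/var/db/nscd"):
    """Return (free_bytes, entries) for the directory at `path`.

    Each entry is (name, mode_string, owner_uid, size_in_bytes), where
    mode_string looks like the first column of `ls -l`.
    """
    free = shutil.disk_usage(path).free
    entries = []
    for name in sorted(os.listdir(path)):
        st = os.stat(os.path.join(path, name))
        entries.append((name, stat.filemode(st.st_mode), st.st_uid, st.st_size))
    return free, entries
```

Run as root, a near-zero free_bytes value or a database file whose mode or owner does not allow the nscd process to write would point at one of the two explanations.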

Comment 6 W. Michael Petullo 2007-10-03 00:26:59 UTC

Yes, I am running nscd as root. I don't see any AVC denial messages yet. This is
what "strace nscd -d" says while nscd burns CPU cycles:

epoll_ctl(14, EPOLL_CTL_DEL, 15, NULL)  = 0
futex(0x2003046c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x20030468, {FUTEX_OP_SET, 0,
FUTEX_OP_CMP_EQ, 0}3004: handle_request: request received (Version = 2) from PID
3114
3004:   GETFDPW
3004: provide access to FD 8, for passwd
) = 1
epoll_wait(14, {}, 100, 29988)          = 0
time(NULL)                              = 1191370579
epoll_wait(14, {}, 100, 29988)          = 0
time(NULL)                              = 1191370609
epoll_wait(14, 
[...]

Something is not right. The lack of AVC messages surprises me. I'll continue to
try to figure this out.

Comment 7 Jakub Jelinek 2007-10-03 13:52:00 UTC

That's not the interesting part of the strace.  I was interested to see
the write which returned -EPERM that caused the
short write in cache_addpw: Permission denied
message you cited above.

Comment 8 W. Michael Petullo 2007-10-06 01:21:30 UTC

Created attachment 218241
Another strace of nscd

Comment 9 W. Michael Petullo 2007-10-06 18:10:38 UTC

I just attached to a procmail process that seemed to be eating CPU cycles to no
end.  This is the backtrace of the process:

(gdb) ba
#0  0x0feabde8 in __nscd_cache_search () from /lib/libc.so.6
#1  0x0fea9144 in nscd_getpw_r () from /lib/libc.so.6
#2  0x0fea9518 in __nscd_getpwuid_r () from /lib/libc.so.6
#3  0x0fe28c48 in getpwuid_r@@GLIBC_2.1.2 () from /lib/libc.so.6
#4  0x0fe283ac in getpwuid () from /lib/libc.so.6
#5  0x1000f3ac in ?? ()
#6  0x10001568 in ?? ()
#7  0x10002bb8 in ?? ()
#8  0x0fd9946c in generic_start_main () from /lib/libc.so.6
#9  0x0fd9963c in __libc_start_main () from /lib/libc.so.6
#10 0x00000000 in ?? ()

Comment 10 Ulrich Drepper 2007-10-06 18:52:59 UTC

You are still using LDAP.  What about situations when this is not the case?  The
LDAP module is of poor quality and might very well be the source of the problem.

For the stack trace in comment #9: is this with persistent databases?  If yes,
does nscd report an error when you restart it?  This can only happen if the
database is corrupted in which case there can be a circular list.

I've added some protection against this case now but this wouldn't fix any problem.

And again: we need proof that this happens without the LDAP module.
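
The failure mode described above, a client spinning in a cache lookup because a corrupted hash chain loops back on itself, can be illustrated with a bounded chain walk. This is a Python sketch of the general technique only, not the glibc code or the protection that was actually added; the table layout (parallel lists keys and next_index, the NONE sentinel) and the function name are hypothetical.

```python
NONE = -1  # hypothetical end-of-chain sentinel

def chain_search(keys, next_index, head, wanted):
    """Walk a hash chain stored as parallel lists; give up on a cycle.

    A valid chain visits each slot at most once, so taking more than
    len(keys) steps means the links are corrupted (circular).
    Returns the slot index, or None if the key is absent or the chain
    is corrupt.
    """
    steps = 0
    slot = head
    while slot != NONE:
        # Too many steps or an out-of-range link: treat as a cache miss.
        if steps >= len(keys) or not 0 <= slot < len(keys):
            return None
        if keys[slot] == wanted:
            return slot
        slot = next_index[slot]
        steps += 1
    return None
```

Without the step limit, a lookup for a key that is not on a circular chain would never terminate, which matches a client stuck at 100% CPU inside the search function. Treating a corrupt chain as a miss lets the caller fall back to a normal lookup.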

Comment 11 W. Michael Petullo 2007-10-06 23:05:04 UTC

I have removed all references to nss_ldap from /etc/nsswitch.conf.
I have disabled SELinux.

I executed "su -" and nscd and su both began to burn CPU cycles endlessly.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 2723 nscd      20   0  151m 1244 1000 S 77.6  0.5   0:38.15 nscd               
 2774 root      20   0  5056  632  560 R 13.2  0.2   4:35.42 su

su:
(gdb) ba
#0  0x0ff50df4 in __nscd_cache_search () from /lib/libc.so.6
#1  0x0ff4e144 in nscd_getpw_r () from /lib/libc.so.6
#2  0x0fecd9a8 in getpwnam_r@@GLIBC_2.1.2 () from /lib/libc.so.6
#3  0x0fecd1fc in getpwnam () from /lib/libc.so.6
#4  0x100031c4 in ?? ()
#5  0x0fe3e46c in generic_start_main () from /lib/libc.so.6
#6  0x0fe3e63c in __libc_start_main () from /lib/libc.so.6
#7  0x00000000 in ?? ()

nscd:
(gdb) ba
#0  0x1fdf10b8 in epoll_wait () from /lib/libc.so.6
#1  0x20006edc in start_threads () from /usr/sbin/nscd
#2  0x20005cac in main () from /usr/sbin/nscd

I am now using the default nscd.conf that is distributed with Fedora Rawhide:

        server-user             nscd
        debug-level             0
        paranoia                no

        enable-cache            passwd          yes
        positive-time-to-live   passwd          600
        negative-time-to-live   passwd          20
        suggested-size          passwd          211
        check-files             passwd          yes
        persistent              passwd          yes
        shared                  passwd          yes
        max-db-size             passwd          33554432
        auto-propagate          passwd          yes

        enable-cache            group           yes
        positive-time-to-live   group           3600
        negative-time-to-live   group           60
        suggested-size          group           211
        check-files             group           yes
        persistent              group           yes
        shared                  group           yes
        max-db-size             group           33554432
        auto-propagate          group           yes

        enable-cache            hosts           yes
        positive-time-to-live   hosts           3600
        negative-time-to-live   hosts           20
        suggested-size          hosts           211
        check-files             hosts           yes
        persistent              hosts           yes
        shared                  hosts           yes
        max-db-size             hosts           33554432

        enable-cache            services        yes
        positive-time-to-live   services        28800
        negative-time-to-live   services        20
        suggested-size          services        211
        check-files             services        yes
        persistent              services        yes
        shared                  services        yes
        max-db-size             services        33554432
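
To see at a glance which caches the configuration above makes persistent (and therefore which files under /var/db/nscd are in play), the per-database options can be collected into a dictionary. This is a minimal Python sketch, assuming the "option database value" layout shown above and "#" comments; parse_nscd_conf and persistent_databases are hypothetical helper names, and two-column global options are stored under the key None.

```python
def parse_nscd_conf(text):
    """Map database name -> {option: value}; global options go under None."""
    conf = {}
    for raw in text.splitlines():
        fields = raw.split("#", 1)[0].split()  # drop comments, split columns
        if len(fields) == 2:
            option, value = fields
            conf.setdefault(None, {})[option] = value
        elif len(fields) == 3:
            option, db, value = fields
            conf.setdefault(db, {})[option] = value
    return conf

def persistent_databases(conf):
    """Return the sorted names of databases with `persistent ... yes`."""
    return sorted(
        db for db, opts in conf.items()
        if db is not None and opts.get("persistent") == "yes"
    )
```

Applied to the configuration quoted in this comment, every listed database (passwd, group, hosts, services) has persistent set to yes.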

Comment 12 Ulrich Drepper 2007-10-06 23:12:43 UTC

Did you start from a fresh set of databases?  I.e., remove everything in
/var/db/nscd/ and then start again.

Comment 13 W. Michael Petullo 2007-10-07 02:26:52 UTC

Yes, I deleted the databases in /var/db/nscd before starting nscd.

Comment 14 Bug Zapper 2008-04-04 13:56:47 UTC

Based on the date this bug was created, it appears to have been reported
during the development of Fedora 8. In order to refocus our efforts as
a project we are changing the version of this bug to '8'.

If this bug still exists in rawhide, please change the version back to
rawhide.
(If you're unable to change the bug's version, add a comment to the bug
and someone will change it for you.)

Thanks for your help and we apologize for the interruption.

The process we're following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.
