450704 – nscd segfaults

Bug 450704 - nscd segfaults

Summary: nscd segfaults

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	glibc
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Jakub Jelinek
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2008-06-10 15:15 UTC by Tom London
Modified:	2008-07-23 07:08 UTC (History)
CC List:	3 users (show)
Fixed In Version:	2.8.90-8
Clone Of:
Environment:
Last Closed:	2008-07-04 20:11:41 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Complete output of "valgrind nscd -d" (15.30 KB, text/plain) 2008-06-10 19:21 UTC, Tom London	no flags	Details
Another "valgrind nscd -d" run (with segfault) (41.98 KB, text/plain) 2008-06-11 13:41 UTC, Tom London	no flags	Details
nscd.log from crashed/exited run of nscd (18.37 KB, text/plain) 2008-06-11 14:26 UTC, Tom London	no flags	Details
/var/log/nscd.log.... nscd stops on boot up (55.52 KB, text/plain) 2008-06-11 14:53 UTC, Tom London	no flags	Details
*Clean nscd.log after cleaning out /var/db/nscd/ and rebooting** (42.97 KB, text/plain) 2008-06-11 16:59 UTC, Tom London	no flags	Details
Valgrind log of nscd getting an Illegal Instruction fault (SIGILL) (20.28 KB, text/plain) 2008-06-11 17:25 UTC, Tom London	no flags	Details
View All

Description Tom London 2008-06-10 15:15:42 UTC

Description of problem:
I noticed this in /var/log/messages:
Jun 10 08:00:17 localhost nscd: 6638 Access Vector Cache (AVC) started
Jun 10 08:11:57 localhost kernel:<6>nscd[6643]: segfault at e42ed04 ip 0022275f
sp af14cdf8 error 4 in libc-2.8.90.so[1ac000+16d000]

I restarted nscd and connected gdb to it ("gdb program PROCNUM").

Got this segfault/backtrace:

(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) n
Program not restarted.
(gdb) cont
Continuing.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xaf14db90 (LWP 6643)]
obstack_free (h=<value optimized out>, obj=<value optimized out>)
    at obstack.c:366
366	      plp = lp->prev;
(gdb) bt
#0  obstack_free (h=<value optimized out>, obj=<value optimized out>)
    at obstack.c:366
#1  0xb7f80434 in gc (db=0xb7f90200) at mem.c:542
#2  0xb7f7f198 in prune_cache (table=0xb7f90200, now=1213110354, fd=-1)
    at cache.c:521
#3  0xb7f73f49 in nscd_run_prune (p=0x2) at connections.c:1451
#4  0x0014351f in start_thread (arg=<value optimized out>)
    at pthread_create.c:297
#5  0x0028fbde in clone () from /lib/libc.so.6
(gdb) 


Version-Release number of selected component (if applicable):
nscd-2.8.90-5.i386

How reproducible:
Seems to happen after a while each time I start it.

Steps to Reproduce:
1. Start nscd ("service nscd start")
2. connect gdb to it
3. wait while doing normal browsing, etc.
  
Actual results:


Expected results:


Additional info:

Comment 1 Ulrich Drepper 2008-06-10 18:40:44 UTC

This can only happen if memory is corrupted.  I cannot see anything wrong in the
use of the obstack.  So it has to come from somewhere else.

What NSS modules do you use?  If you have any which don't come with glibc (like
LDAP, samba stuff, mdns, ...) remove them and try again.

Also, run nscd in debug mode (-d) under valgrind.

Comment 2 Tom London 2008-06-10 19:08:22 UTC

[root@localhost ~]# rpm -qa \*nss\* | grep -v openss
nss-mdns-0.10-4.fc9.i386
nss-tools-3.12.0.3-2.fc10.i386
nss-debuginfo-3.12.0.3-2.fc10.i386
nss-mdns-debuginfo-0.10-4.fc9.i386
nss-devel-3.12.0.3-2.fc10.i386
nss_compat_ossl-0.9.2-4.fc9.i386
nss_db-2.2-40.fc9.i386
nss-3.12.0.3-2.fc10.i386
nss_ldap-259-3.fc9.i386
[root@localhost ~]# 

I removed nss-mdns*, and nss_db.

ipa-client requires nss_ldap, and slrn requires nss_compat_ossl (I could also
remove these, if you recommend).

So, my system now looks like:

[root@localhost ~]# rpm -qa \*nss\* | grep -v openss
nss-tools-3.12.0.3-2.fc10.i386
nss-debuginfo-3.12.0.3-2.fc10.i386
nss-devel-3.12.0.3-2.fc10.i386
nss_compat_ossl-0.9.2-4.fc9.i386
nss-3.12.0.3-2.fc10.i386
nss_ldap-259-3.fc9.i386
[root@localhost ~]# 

I'll retest with this setup.  Let me know if I should remove nss_ldap and
nss_compat_ossl as well.

Comment 3 Tom London 2008-06-10 19:11:13 UTC

Trying valgrind, I get:

[root@localhost ~]# valgrind nscd -d
valgrind: mmap(0x0, 135168) failed in UME with error 13 (Permission denied).
[root@localhost ~]# 

when running in SELinux enforcing mode.

I'm running in permissive mode......

Comment 4 Tom London 2008-06-10 19:19:36 UTC

OK, running as described in #3, I get the follow from valgrind:

12876: provide access to FD 6, for passwd
12876: remove GETAI entry "www.google.com"
12876: remove GETAI entry "news.google.com"
12876: remove GETAI entry "cm.my.yahoo.com"
12876: remove GETAI entry "www.mail.com"
==12876== Thread 5:
==12876== Conditional jump or move depends on uninitialised value(s)
==12876==    at 0xACC730: obstack_free (obstack.c:364)
==12876==    by 0x12433: gc (mem.c:542)
==12876==    by 0x11197: prune_cache (cache.c:521)
==12876==    by 0x5F48: nscd_run_prune (connections.c:1451)
==12876==    by 0xC0351E: start_thread (pthread_create.c:297)
==12876==    by 0xB39BDD: clone (in /lib/libc-2.8.90.so)
==12876== 
==12876== Conditional jump or move depends on uninitialised value(s)
==12876==    at 0xACC752: obstack_free (obstack.c:364)
==12876==    by 0x12433: gc (mem.c:542)
==12876==    by 0x11197: prune_cache (cache.c:521)
==12876==    by 0x5F48: nscd_run_prune (connections.c:1451)
==12876==    by 0xC0351E: start_thread (pthread_create.c:297)
==12876==    by 0xB39BDD: clone (in /lib/libc-2.8.90.so)
==12876== 
==12876== Use of uninitialised value of size 4
==12876==    at 0xACC75F: obstack_free (obstack.c:366)
==12876==    by 0x12433: gc (mem.c:542)
==12876==    by 0x11197: prune_cache (cache.c:521)
==12876==    by 0x5F48: nscd_run_prune (connections.c:1451)
==12876==    by 0xC0351E: start_thread (pthread_create.c:297)
==12876==    by 0xB39BDD: clone (in /lib/libc-2.8.90.so)
==12876== 
==12876== Invalid read of size 4
==12876==    at 0xACC75F: obstack_free (obstack.c:366)
==12876==    by 0x12433: gc (mem.c:542)
==12876==    by 0x11197: prune_cache (cache.c:521)
==12876==    by 0x5F48: nscd_run_prune (connections.c:1451)
==12876==    by 0xC0351E: start_thread (pthread_create.c:297)
==12876==    by 0xB39BDD: clone (in /lib/libc-2.8.90.so)
==12876==  Address 0x3e7ff004 is not stack'd, malloc'd or (recently) free'd
==12876== 
==12876== Process terminating with default action of signal 11 (SIGSEGV)
==12876==  Access not within mapped region at address 0x3E7FF004
==12876==    at 0xACC75F: obstack_free (obstack.c:366)
==12876==    by 0x12433: gc (mem.c:542)
==12876==    by 0x11197: prune_cache (cache.c:521)
==12876==    by 0x5F48: nscd_run_prune (connections.c:1451)
==12876==    by 0xC0351E: start_thread (pthread_create.c:297)
==12876==    by 0xB39BDD: clone (in /lib/libc-2.8.90.so)
==12876== 
==12876== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 26 from 1)
==12876== malloc/free: in use at exit: 33,794 bytes in 487 blocks.
==12876== malloc/free: 2,450 allocs, 1,963 frees, 550,531 bytes allocated.
==12876== For counts of detected errors, rerun with: -v
==12876== searching for pointers to 487 not-freed blocks.
==12876== checked 19,848,208 bytes.
==12876== 
==12876== LEAK SUMMARY:
==12876==    definitely lost: 0 bytes in 0 blocks.
==12876==      possibly lost: 1,368 bytes in 9 blocks.
==12876==    still reachable: 32,426 bytes in 478 blocks.
==12876==         suppressed: 0 bytes in 0 blocks.
==12876== Rerun with --leak-check=full to see details of leaked memory.
Killed
[root@localhost ~]# 

I attach the complete output of "valgrind nscd -d" below.

Is this useful?

Comment 5 Tom London 2008-06-10 19:21:31 UTC

Created attachment 308849 [details]
Complete output of "valgrind nscd -d"

I did the following:

1.) start "valgrind nscd -d"
2.) restart firefox
3.) got above output with firefox bringing up my home pages
4.) firefox reported "couldn't find pages" messages for each of my home pages.

Comment 6 Ulrich Drepper 2008-06-10 20:16:54 UTC

(In reply to comment #2)
> So, my system now looks like:
> 
> [root@localhost ~]# rpm -qa \*nss\* | grep -v openss
> nss-tools-3.12.0.3-2.fc10.i386
> nss-debuginfo-3.12.0.3-2.fc10.i386
> nss-devel-3.12.0.3-2.fc10.i386
> nss_compat_ossl-0.9.2-4.fc9.i386
> nss-3.12.0.3-2.fc10.i386
> nss_ldap-259-3.fc9.i386

You don't have to remove the NSS packages, just remove the appropriate entries
in /etc/nsswitch.conf.  In the passwd, group, hosts, and services lines there
should only be glibc-provided entries.

Comment 7 Tom London 2008-06-10 20:31:23 UTC

I believe /etc/nsswitch.conf is "stock" on my system:

#
# /etc/nsswitch.conf
#
# An example Name Service Switch config file. This file should be
# sorted with the most-used services at the beginning.
#
# The entry '[NOTFOUND=return]' means that the search for an
# entry should stop if the search in the previous entry turned
# up nothing. Note that if the search failed due to some other reason
# (like no NIS server responding) then the search continues with the
# next entry.
#
# Legal entries are:
#
#	nisplus or nis+		Use NIS+ (NIS version 3)
#	nis or yp		Use NIS (NIS version 2), also called YP
#	dns			Use DNS (Domain Name Service)
#	files			Use the local files
#	db			Use the local database (.db) files
#	compat			Use NIS on compat mode
#	hesiod			Use Hesiod for user lookups
#	[NOTFOUND=return]	Stop searching if not found so far
#

# To use db, put the "db" in front of "files" for entries you want to be
# looked up first in the databases
#
# Example:
#passwd:    db files nisplus nis
#shadow:    db files nisplus nis
#group:     db files nisplus nis

passwd:     files
shadow:     files
group:      files

#hosts:     db files nisplus nis dns
hosts:      files dns

# Example - obey only what nisplus tells us...
#services:   nisplus [NOTFOUND=return] files
#networks:   nisplus [NOTFOUND=return] files
#protocols:  nisplus [NOTFOUND=return] files
#rpc:        nisplus [NOTFOUND=return] files
#ethers:     nisplus [NOTFOUND=return] files
#netmasks:   nisplus [NOTFOUND=return] files     

bootparams: nisplus [NOTFOUND=return] files

ethers:     files
netmasks:   files
networks:   files
protocols:  files
rpc:        files
services:   files

netgroup:   files

publickey:  nisplus

automount:  files
aliases:    files

Comment 8 Tom London 2008-06-11 13:41:17 UTC

Created attachment 308930 [details]
Another "valgrind nscd -d" run (with segfault)

Another segfault running "nscd -d"

Comment 9 Tom London 2008-06-11 14:26:14 UTC

Created attachment 308931 [details]
nscd.log from crashed/exited run of nscd

Not sure this is interesting, but I uncommented the 'logfile' option in
/etc/nscd.conf, set nscd to start in runlevel 5 and rebooted.

The last lines in the log file are:

2566: handle_request: request received (Version = 2) from PID 2801
2566:	INITGROUPS (gdm)
2566: remove GETAI entry ""

I do see these type of messages in /var/log/messages:
Jun 11 07:11:25 localhost setroubleshoot: SELinux is preventing nscd (nscd_t)
"search" to ./2799 (xdm_t). For complete SELinux messages. run sealert -l
8e30d5b3-d350-47ae-9797-abc57816e649
Jun 11 07:11:25 localhost setroubleshoot: SELinux is preventing nscd (nscd_t)
"search" to ./2801 (xdm_dbusd_t). For complete SELinux messages. run sealert -l
204af7c5-e663-4b2e-927a-e623abe3b543
Jun 11 07:11:25 localhost setroubleshoot: SELinux is preventing nscd (nscd_t)
"search" to ./2801 (xdm_dbusd_t). For complete SELinux messages. run sealert -l
204af7c5-e663-4b2e-927a-e623abe3b543

#============= nscd_t ==============
allow nscd_t avahi_t:dir search;
allow nscd_t cupsd_t:dir search;
allow nscd_t hald_t:dir search;
allow nscd_t ntpd_t:dir search;
allow nscd_t pam_console_t:dir search;
allow nscd_t setroubleshootd_t:dir search;
allow nscd_t sshd_t:dir search;
allow nscd_t xdm_dbusd_t:dir search;
allow nscd_t xdm_t:dir search;

Don't remember seeing these before.  I'll reboot in permissive mode.

Comment 10 Tom London 2008-06-11 14:53:36 UTC

Created attachment 308937 [details]
/var/log/nscd.log.... nscd stops on boot up

OK. Changed loglevel to 9 and rebooted in permissive mode (no more AVCs).

nscd seems to stop just as ntpd starts.  Believe this from /var/log/messages
applies:

Jun 11 07:36:43 localhost gdm-simple-slave[2724]: DEBUG: GdmSlave: Stopping
slave
Jun 11 07:36:43 localhost gdm-simple-slave[2724]: DEBUG: GdmSignalHandler:
Finalizing signal handler
Jun 11 07:36:43 localhost gdm-simple-slave[2724]: DEBUG: Slave finished
Jun 11 07:36:44 localhost avahi-daemon[2607]: Got SIGTERM, quitting.
Jun 11 07:36:44 localhost avahi-daemon[2607]: Leaving mDNS multicast group on
interface eth0.IPv4 with address 192.168.1.101.
Jun 11 07:36:45 localhost ntpd[2582]: ntpd exiting on signal 15
Jun 11 07:36:45 localhost NetworkManager: <WARN>  nm_signal_handler(): Caught
signal 15, shutting down normally.
Jun 11 07:36:45 localhost NetworkManager: <info>  (eth0): now unmanaged
Jun 11 07:36:45 localhost NetworkManager: <info>  (eth0): device state change:
8 -> 1
Jun 11 07:36:45 localhost NetworkManager: <info>  (eth0): deactivating device.
Jun 11 07:36:45 localhost NetworkManager: <info>  eth0: canceled DHCP
transaction, dhclient pid 2518
Jun 11 07:36:45 localhost kernel:<6>nscd[3888]: segfault at 10 ip 00000010 sp
af04f800 error 4 in ld-2.8.90.so[110000+20000]
Jun 11 07:36:45 localhost NetworkManager: <info>  (eth0): cleaning up...
Jun 11 07:36:45 localhost NetworkManager: <info>  (eth0): taking down device.
Jun 11 07:36:45 localhost NetworkManager: <info>  (wlan0): now unmanaged


Here are the last few lines of nscd.log:

2566: considering GETAI entry "localhost", timeout 1213198701
2566: considering GETAI entry "0.fedora.pool.ntp.org", timeout 1213195245
2566: considering GETAI entry "localhost.localdomain", timeout 1213198702
2566: considering GETAI entry "", timeout 1213195124
2566: handle_request: request received (Version = 2) from PID 2665
2566:	GETFDPW
2566: provide access to FD 7, for passwd
2566: handle_request: request received (Version = 2) from PID 2665
2566:	GETFDPW
2566: provide access to FD 7, for passwd
2566: handle_request: request received (Version = 2) from PID 2704
2566:	GETFDPW
2566: provide access to FD 7, for passwd
2566: pruning services cache; time 1213195112
2566: considering GETSERVBYNAME entry "ntp/udp", timeout 1213223282
2566: considering GETSERVBYNAME entry "ipp/", timeout 1213219037
2566: handle_request: request received (Version = 2) from PID 2728
2566:	GETFDPW
2566: provide access to FD 7, for passwd
2566: handle_request: request received (Version = 2) from PID 2665
2566:	GETFDPW
2566: provide access to FD 7, for passwd
2566: handle_request: request received (Version = 2) from PID 2665
2566:	GETFDGR
2566: provide access to FD 9, for group
2566: handle_request: request received (Version = 2) from PID 2736
2566:	GETFDPW
2566: provide access to FD 7, for passwd
2566: handle_request: request received (Version = 2) from PID 2736
2566:	GETFDPW
2566: provide access to FD 7, for passwd
2566: handle_request: request received (Version = 2) from PID 2764
2566:	GETFDPW
2566: provide access to FD 7, for passwd
2566: handle_request: request received (Version = 2) from PID 2772
2566:	GETFDPW
2566: provide access to FD 7, for passwd
2566: handle_request: request received (Version = 2) from PID 2772
2566:	GETFDGR
2566: provide access to FD 9, for group
2566: handle_request: request received (Version = 2) from PID 2773
2566:	GETFDPW
2566: provide access to FD 7, for passwd
2566: handle_request: request received (Version = 2) from PID 2773
2566:	GETFDPW
2566: provide access to FD 7, for passwd
2566: handle_request: request received (Version = 2) from PID 2777
2566:	GETFDPW
2566: provide access to FD 7, for passwd
2566: pruning hosts cache; time 1213195126
2566: considering GETAI entry "localhost", timeout 1213198701
2566: considering GETAI entry "0.fedora.pool.ntp.org", timeout 1213195245
2566: considering GETAI entry "localhost.localdomain", timeout 1213198702
2566: considering GETAI entry "", timeout 1213195124
2566: remove GETAI entry ""

Comment 11 Ulrich Drepper 2008-06-11 16:21:40 UTC

This is wrong:

> 2566: considering GETAI entry "", timeout 1213195124

Not that I think it should cause a crash but it's wrong.  The entire log as far
nscd is concerned would be good.  There must be a line for the time when the
entry was added.

And just to be sure, can you show the content of /proc/PID/maps from a time when
nscd is up and running (PID is the PID of the nscd process).

Comment 12 Tom London 2008-06-11 16:34:24 UTC

Believe I attached complete nscd log files in #10 above.

The nscd process that is started via boot is always dead by the time I log in.

Here is the context of /proc/XXXX/maps for nscd when I manually start it after
the system has been running for a while.

[root@localhost ~]# cat /proc/5038/maps
00110000-00130000 r-xp 00000000 fd:00 7209254    /lib/ld-2.8.90.so
00130000-00131000 r--p 00020000 fd:00 7209254    /lib/ld-2.8.90.so
00131000-00132000 rw-p 00021000 fd:00 7209254    /lib/ld-2.8.90.so
00132000-00133000 r-xp 00132000 00:00 0          [vdso]
00133000-0013b000 r-xp 00000000 fd:00 7209459    /lib/librt-2.8.90.so
0013b000-0013c000 r--p 00007000 fd:00 7209459    /lib/librt-2.8.90.so
0013c000-0013d000 rw-p 00008000 fd:00 7209459    /lib/librt-2.8.90.so
0013d000-00153000 r-xp 00000000 fd:00 7209457    /lib/libpthread-2.8.90.so
00153000-00154000 r--p 00015000 fd:00 7209457    /lib/libpthread-2.8.90.so
00154000-00155000 rw-p 00016000 fd:00 7209457    /lib/libpthread-2.8.90.so
00155000-00157000 rw-p 00155000 00:00 0 
00157000-0016d000 r-xp 00000000 fd:00 7209677    /lib/libnsl-2.8.90.so
0016d000-0016e000 r--p 00016000 fd:00 7209677    /lib/libnsl-2.8.90.so
0016e000-0016f000 rw-p 00017000 fd:00 7209677    /lib/libnsl-2.8.90.so
0016f000-00171000 rw-p 0016f000 00:00 0 
00171000-0018b000 r-xp 00000000 fd:00 7209428    /lib/libselinux.so.1
0018b000-0018c000 r--p 00019000 fd:00 7209428    /lib/libselinux.so.1
0018c000-0018d000 rw-p 0001a000 fd:00 7209428    /lib/libselinux.so.1
0018d000-001a5000 r-xp 00000000 fd:00 7209705    /lib/libaudit.so.0.0.0
001a5000-001a6000 r--p 00017000 fd:00 7209705    /lib/libaudit.so.0.0.0
001a6000-001a7000 rw-p 00018000 fd:00 7209705    /lib/libaudit.so.0.0.0
001a7000-001ab000 r-xp 00000000 fd:00 7209005    /lib/libcap.so.2.10
001ab000-001ac000 rw-p 00003000 fd:00 7209005    /lib/libcap.so.2.10
001ac000-00319000 r-xp 00000000 fd:00 7209423    /lib/libc-2.8.90.so
00319000-0031b000 r--p 0016d000 fd:00 7209423    /lib/libc-2.8.90.so
0031b000-0031c000 rw-p 0016f000 fd:00 7209423    /lib/libc-2.8.90.so
0031c000-0031f000 rw-p 0031c000 00:00 0 
0031f000-00322000 r-xp 00000000 fd:00 7209426    /lib/libdl-2.8.90.so
00322000-00323000 r--p 00002000 fd:00 7209426    /lib/libdl-2.8.90.so
00323000-00324000 rw-p 00003000 fd:00 7209426    /lib/libdl-2.8.90.so
00324000-0032f000 r-xp 00000000 fd:00 10420519   /lib/libnss_files-2.8.90.so
0032f000-00330000 r--p 0000a000 fd:00 10420519   /lib/libnss_files-2.8.90.so
00330000-00331000 rw-p 0000b000 fd:00 10420519   /lib/libnss_files-2.8.90.so
aeafd000-aeafe000 ---p aeafd000 00:00 0 
aeafe000-aebfe000 rw-p aeafe000 00:00 0 
aebfe000-aebff000 ---p aebfe000 00:00 0 
aebff000-aecff000 rw-p aebff000 00:00 0 
aecff000-aed00000 ---p aecff000 00:00 0 
aed00000-aee00000 rw-p aed00000 00:00 0 
aee00000-aee01000 ---p aee00000 00:00 0 
aee01000-aef01000 rw-p aee01000 00:00 0 
aef01000-aef02000 ---p aef01000 00:00 0 
aef02000-af002000 rw-p aef02000 00:00 0 
af002000-af003000 ---p af002000 00:00 0 
af003000-af103000 rw-p af003000 00:00 0 
af103000-af104000 ---p af103000 00:00 0 
af104000-af204000 rw-p af104000 00:00 0 
af204000-af205000 ---p af204000 00:00 0 
af205000-af305000 rw-p af205000 00:00 0 
af305000-b1305000 rw-s 00000000 fd:00 66827      /var/db/nscd/services
b1305000-b3305000 rw-s 00000000 fd:00 66825      /var/db/nscd/hosts
b3305000-b5305000 rw-s 00000000 fd:00 66823      /var/db/nscd/group
b5305000-b7305000 rw-s 00000000 fd:00 66814      /var/db/nscd/passwd
b7305000-b7306000 ---p b7305000 00:00 0 
b7306000-b7d06000 rw-p b7306000 00:00 0 
b7d06000-b7f06000 r--p 00000000 fd:00 5485971    /usr/lib/locale/locale-archive
b7f06000-b7f09000 rw-p b7f06000 00:00 0 
b7f22000-b7f23000 rw-p b7f22000 00:00 0 
b7f23000-b7f44000 r-xp 00000000 fd:00 5489047    /usr/sbin/nscd
b7f44000-b7f45000 r--p 00020000 fd:00 5489047    /usr/sbin/nscd
b7f45000-b7f46000 rw-p 00021000 fd:00 5489047    /usr/sbin/nscd
b8fd9000-b8ffa000 rw-p b8fd9000 00:00 0          [heap]
bfa2f000-bfa44000 rw-p bffeb000 00:00 0          [stack]
[root@localhost ~]# 

Is that helpful?

Also, (believe nscd.log is for several "runs"):

[tbl@localhost ~]$ grep '""' /var/log/nscd.log
2566: Haven't found "" in hosts cache!
2566: remove GETAI entry ""
2556: Haven't found "" in hosts cache!
2556: add new entry "" of type GETAI for hosts to cache (first)
2556: considering GETAI entry "", timeout 1213194512
2556: considering GETAI entry "", timeout 1213194512
2556: considering GETAI entry "", timeout 1213194512
2556: remove GETAI entry ""
2566: Haven't found "" in hosts cache!
2566: add new entry "" of type GETAI for hosts to cache (first)
2566: considering GETAI entry "", timeout 1213195124
2566: considering GETAI entry "", timeout 1213195124
2566: remove GETAI entry ""
2658: Haven't found "" in hosts cache!
2658: add new entry "" of type GETAI for hosts to cache (first)
2658: considering GETAI entry "", timeout 1213199888
2658: considering GETAI entry "", timeout 1213199888
2658: considering GETAI entry "", timeout 1213199888
2658: remove GETAI entry ""
[tbl@localhost ~]$

Comment 13 Ulrich Drepper 2008-06-11 16:37:22 UTC

Oh, and please make sure you remove all /var/db/nscd/* files before starting
nscd.  We check for validity of the files but who knows, maybe something is missing.

Comment 14 Tom London 2008-06-11 16:59:30 UTC

Created attachment 308961 [details]
Clean nscd.log after cleaning out /var/db/nscd/* and rebooting

OK.  Removed /var/db/nscd/*  (drat, should have renamed instead.....) and
rebooted.

This time I have what appears to be a different nscd.log file, and nscd was
running for a while after gnome login.

But, some DNS look ups failed, that is, my firefox home pages came up with the
"can't find this domain" page.

However, this time I booted in enforcing mode, and I got a bunch of AVCs, e.g.:


Jun 11 09:48:41 localhost setroubleshoot: SELinux is preventing nscd (nscd_t)
"search" to ./3338 (system_dbusd_t). For complete SELinux messages. run sealert
-l 3ccfc02e-bf12-4fe8-8ac2-8021b91626bc
Jun 11 09:48:41 localhost setroubleshoot: SELinux is preventing nscd (nscd_t)
"search" to ./3338 (system_dbusd_t). For complete SELinux messages. run sealert
-l 3ccfc02e-bf12-4fe8-8ac2-8021b91626bc
Jun 11 09:48:41 localhost setroubleshoot: SELinux is preventing nscd (nscd_t)
"search" to ./3353 (setroubleshootd_t). For complete SELinux messages. run
sealert -l 71303d2f-2b4c-495c-a033-6f64d4265fe7
Jun 11 09:48:44 localhost setroubleshoot: SELinux is preventing nscd (nscd_t)
"search" to ./3375 (hald_t). For complete SELinux messages. run sealert -l
c66afb2f-294c-4009-9fa1-133ff8da3e38
Jun 11 09:48:44 localhost setroubleshoot: SELinux is preventing nscd (nscd_t)
"search" to ./3375 (hald_t). For complete SELinux messages. run sealert -l
c66afb2f-294c-4009-9fa1-133ff8da3e38
Jun 11 09:48:49 localhost setroubleshoot: SELinux is preventing nscd (nscd_t)
"search" to ./3389 (hald_t). For complete SELinux messages. run sealert -l
1cdf56a2-2ce6-45c4-bf22-321596de42d8
Jun 11 09:48:49 localhost setroubleshoot: SELinux is preventing nscd (nscd_t)
"search" to ./3389 (hald_t). For complete SELinux messages. run sealert -l
1cdf56a2-2ce6-45c4-bf22-321596de42d8
Jun 11 09:48:56 localhost setroubleshoot: SELinux is preventing nscd (nscd_t)
"search" to ./3406 (hald_t). For complete SELinux messages. run sealert -l
eaea69c7-6404-4172-a4a0-9d6f30d21e9d
Jun 11 09:48:57 localhost setroubleshoot: SELinux is preventing nscd (nscd_t)
"search" to ./3406 (hald_t). For complete SELinux messages. run sealert -l
eaea69c7-6404-4172-a4a0-9d6f30d21e9d
Jun 11 09:49:02 localhost setroubleshoot: SELinux is preventing nscd (nscd_t)
"search" to ./3463 (hald_t). For complete SELinux messages. run sealert -l
378f6336-cd56-4828-bde1-00da6dcd2ad5
Jun 11 09:49:02 localhost setroubleshoot: SELinux is preventing nscd (nscd_t)
"search" to ./3463 (hald_t). For complete SELinux messages. run sealert -l
378f6336-cd56-4828-bde1-00da6dcd2ad5
Jun 11 09:49:07 localhost setroubleshoot: SELinux is preventing nscd (nscd_t)
"search" to ./3467 (hald_t). For complete SELinux messages. run sealert -l
39dfc119-7cfc-40d8-838b-dd52dea06a59
Jun 11 09:49:07 localhost setroubleshoot: SELinux is preventing nscd (nscd_t)
"search" to ./3467 (hald_t). For complete SELinux messages. run sealert -l
39dfc119-7cfc-40d8-838b-dd52dea06a59
Jun 11 09:49:13 localhost setroubleshoot: SELinux is preventing nscd (nscd_t)
"search" to ./3472 (hald_t). For complete SELinux messages. run sealert -l
bc1facd9-0d9f-41ae-86ae-bb5c8a4402ca
Jun 11 09:49:13 localhost setroubleshoot: SELinux is preventing nscd (nscd_t)
"search" to ./3472 (hald_t). For complete SELinux messages. run sealert -l
bc1facd9-0d9f-41ae-86ae-bb5c8a4402ca
Jun 11 09:49:19 localhost setroubleshoot: SELinux is preventing nscd (nscd_t)
"search" to ./3476 (hald_t). For complete SELinux messages. run sealert -l
288fd2eb-3149-4099-ac21-3266b26f2df4
<<<<<SNIP>>>>>

Here is the complete list:
[root@localhost ~]# check-avcs


#============= nscd_t ==============
allow nscd_t avahi_t:dir search;
allow nscd_t hald_t:dir search;
allow nscd_t ntpd_t:dir search;
allow nscd_t setroubleshootd_t:dir search;
allow nscd_t sshd_t:dir search;
allow nscd_t system_dbusd_t:dir search;
[root@localhost ~]# 

Is that an issue?  I'll reboot in permissive.

Comment 15 Tom London 2008-06-11 17:15:22 UTC

Similar behavior with no AVCs when I boot in permissive.  Something must be
scanning /proc looking for a particular process or something....

Comment 16 Tom London 2008-06-11 17:25:33 UTC

Created attachment 308967 [details]
Valgrind log of nscd getting an Illegal Instruction fault (SIGILL)

I rebooted in permissive mode, killed off the flaky nscd that gets started
during boot, and ran "valgrind nscd -d".

I then started firefox.  I have 4 tabbed home pages, one is news.google.com and
another is gmail.

I got "Address not found" pages for http://news.google.com/ and for
https://mail.google.com, followed quickly by nscd faulting with SIGILL.

Here is a snippet from the attached complete valgrind log:

vex x86->IR: unhandled instruction bytes: 0x6C 0x69 0x62 0x63
==4200== valgrind: Unrecognised instruction at address 0xB94663.
==4200== Your program just tried to execute an instruction that Valgrind
==4200== did not recognise.  There are two possible reasons for this.
==4200== 1. Your program has a bug and erroneously jumped to a non-code
==4200==    location.  If you are running Memcheck and you just saw a
==4200==    warning about a bad jump, it's probably your program's fault.
==4200== 2. The instruction is legitimate but Valgrind doesn't handle it,
==4200==    i.e. it's Valgrind's fault.  If you think this is the case or
==4200==    you are not sure, please let us know and we'll try to fix it.
==4200== Either way, Valgrind will now raise a SIGILL signal which will
==4200== probably kill your program.
==4200== 
==4200== Process terminating with default action of signal 4 (SIGILL)
==4200==  Illegal opcode at address 0xB94663
==4200==    at 0xB94663: (within /lib/libc-2.8.90.so)
==4200==    by 0x12433: gc (mem.c:542)
==4200==    by 0x11197: prune_cache (cache.c:521)
==4200==    by 0x5F48: nscd_run_prune (connections.c:1451)
==4200==    by 0xC0351E: start_thread (pthread_create.c:297)
==4200==    by 0xB39BDD: clone (in /lib/libc-2.8.90.so)
==4200== 

This follows a bunch of "uninitialized memory" messages from valgrind.

Comment 17 Ulrich Drepper 2008-06-12 04:55:33 UTC

The obstack-related problem should be fixed upstream now.  We'll have to see
whether there are more problems.  Jakub will build a new glibc sometime soon.

Comment 18 Tom London 2008-06-13 00:21:32 UTC

With nscd-2.8.90-6.i386 (and related glibc packages), I no longer get seg
faults, etc, but DNS searches still don't work:

[root@localhost ~]# setenforce 0
[root@localhost ~]# rm -rf /var/db/nscd
[root@localhost ~]# curl https://mail.google.com
<html><head><meta http-equiv="Refresh"
content="0;URL=https://mail.google.com/mail/" /></head><body><script
type="text/javascript" language="javascript"><!--
location.replace("https://mail.google.com/mail/")
--></script></body></html>[root@localhost ~]# 
[root@localhost ~]# service nscd start
Starting nscd:                                             [  OK  ]
[root@localhost ~]# curl https://mail.google.com
curl: (6) Couldn't resolve host 'mail.google.com'
[root@localhost ~]# curl https://mail.google.com
curl: (6) Couldn't resolve host 'mail.google.com'
[root@localhost ~]# service nscd stop
Stopping nscd:                                             [  OK  ]
[root@localhost ~]# curl https://mail.google.com
<html><head><meta http-equiv="Refresh"
content="0;URL=https://mail.google.com/mail/" /></head><body><script
type="text/javascript" language="javascript"><!--
location.replace("https://mail.google.com/mail/")
--></script></body></html>[root@localhost ~]# nslookup mail.google.com
Server:		10.10.3.10
Address:	10.10.3.10#53

Non-authoritative answer:
mail.google.com	canonical name = googlemail.l.google.com.
Name:	googlemail.l.google.com
Address: 209.85.201.19
Name:	googlemail.l.google.com
Address: 209.85.201.83
Name:	googlemail.l.google.com
Address: 209.85.201.18

[root@localhost ~]# 


Some other data/info useful?

Comment 19 Ulrich Drepper 2008-06-13 04:37:53 UTC

(In reply to comment #18)
> With nscd-2.8.90-6.i386 (and related glibc packages), I no longer get seg
> faults, etc, but DNS searches still don't work:

That's a transient problem which is why I didn't see it.  All the stars have to
align for it.  I think it's fixed in cvs.  Maybe Jakub has time tomorrow to
build another glibc.

Comment 20 Ulrich Drepper 2008-06-13 19:30:23 UTC

Try 2.8.90-7.

Comment 21 Tom London 2008-06-14 05:13:03 UTC

2.8.90-7 works in permissive mode!!!!!!

But not in enforcing mode.

Seems to want:


#============= hald_t ==============
allow hald_t var_t:file read;

#============= nscd_t ==============
allow nscd_t var_t:dir { write add_name };
allow nscd_t var_t:file { read write create getattr };


[The directory and create privs are for the case where there is no nscd
directory in /var/db, and it and the files under it must be created.]

Thanks for tracking this down and fixing it!

Comment 22 Tom London 2008-06-18 14:54:00 UTC

I've managed to get nscd "up and running" on boot to runlevel 5.

Two sets of issues (at least on my system):

1. nscd needs new SELinux policy to get up and running this way. Examples of
allow rules include:

allow nscd_t inotifyfs_t:dir read;
allow nscd_t hald_acl_t:lnk_file read;
and
allow nscd_t *:lnk_file read;  <---- This is just shorthand....

The last rule appears to come from nscd wanting to read /proc/PROCNUM/exe.  I'm
guessing it does this for every process that wants to connect to it. 

I can collect and post all the individual rules I've added, but is there a
better way to do this?

2. Often, on reboot, I notice (via "ps agx | grep nscd") several nscd processes
running, with the early ones eating up cpu cycles, etc. If I kill all the
running nscd processes and restart the service, all appears to work.

The best I can figure out from the log is that each of the spinning nscd
processes is spinning after the following log entry:

2578: considering GETAI entry "0.fedora.pool.ntp.org", timeout 1213798101
2578: considering GETHOSTBYNAME entry "localhost.localdomain", timeout 1213800081
2578: considering GETAI entry "localhost.localdomain", timeout 1213798631
2578: considering GETAI entry "www.mail.com", timeout 1213798107
2578: considering GETAI entry "scs.msg.yahoo.com", timeout 1213797432
2578: considering GETAI entry "", timeout 1213797342
2578: remove GETAI entry ""
2578: freed 88 bytes in hosts cache
3298: Access Vector Cache (AVC) started
3298: handle_request: request received (Version = 2) from PID 3308
3298:   INVALIDATE (hosts)
3298: pruning hosts cache; time 2147483647
3298: considering GETAI entry "weather.noaa.gov", timeout 1213797792
3298: considering GETAI entry "address.yahoo.com", timeout 1213797545
3298: considering GETHOSTBYNAMEv6 entry "localhost", timeout 1213800938
3298: considering GETAI entry "localhost", timeout 1213800491

Is this something particular to my system?

Comment 23 Tom London 2008-06-18 16:09:10 UTC

Example of multiple nscd procs on boot:

[tbl@localhost ~]$ sudo -i
[root@localhost ~]# ps agx | grep nscd
 2603 ?        Ssl    6:24 /usr/sbin/nscd
 2661 ?        Ssl    0:36 /usr/sbin/nscd
 3515 ?        Ssl    0:00 /usr/sbin/nscd
 3671 pts/0    S+     0:00 grep nscd
[root@localhost ~]# killall nscd
[root@localhost ~]# ps agx | grep nscd
 3686 pts/0    S+     0:00 grep nscd
[root@localhost ~]#

Comment 24 Tom London 2008-06-18 16:56:29 UTC

Here's another reboot:

[tbl@localhost ~]$ sudo -i
[root@localhost ~]# ps agx | grep nscd
 2605 ?        Ssl   10:12 /usr/sbin/nscd
 2646 ?        Rsl    8:21 /usr/sbin/nscd
 3334 ?        Rsl    1:20 /usr/sbin/nscd
 3744 ?        Ssl    1:12 /usr/sbin/nscd
 3764 ?        Ssl    0:00 /usr/sbin/nscd
 3979 pts/0    S+     0:00 grep nscd
[root@localhost ~]# killall nscd
[root@localhost ~]# service nscd start
Starting nscd:                                             [  OK  ]
[root@localhost ~]# ps agx | grep nscd
 4004 ?        Ssl    0:00 /usr/sbin/nscd
 4016 pts/0    S+     0:00 grep nscd
[root@localhost ~]#

Comment 25 Tom London 2008-06-18 22:06:58 UTC

Happened again, this time way after boot:

[root@localhost ~]# ps agx | grep nscd
 4004 ?        Rsl   77:55 /usr/sbin/nscd
27354 ?        Ssl    0:01 /usr/sbin/nscd
30793 pts/0    S+     0:00 grep nscd
[root@localhost ~]# 

Here are lines from nscd.log at the time 27354 was started:

4004: considering GETAI entry "mail.google.com", timeout 1213824121
4004: considering GETHOSTBYNAMEv6 entry "localhost.localdomain", timeout 1213822450
4004: considering GETHOSTBYNAME entry "localhost.localdomain", timeout 1213822402
4004: considering GETAI entry "localhost.localdomain", timeout 1213822328
4004: considering GETAI entry "news.yahoo.com", timeout 1213821466
4004: considering GETAI entry "suggestqueries.google.com", timeout 1213821584
4004: considering GETAI entry "www.halflifesource.com", timeout 1213828473
4004: considering GETAI entry "jakarta.apache.org", timeout 1213822994
4004: considering GETAI entry "www.bloglines.com", timeout 1213821934
4004: considering GETAI entry "share.skype.com", timeout 1213822172
4004: considering GETHOSTBYADDRv6 entry "::1", timeout 1213823224
4004: remove GETAI entry "cm.my.yahoo.com"
4004: freed 128 bytes in hosts cache
4004: handle_request: request received (Version = 2) from PID 27338
4004: 	GETFDPW
4004: provide access to FD 8, for passwd
4004: handle_request: request received (Version = 2) from PID 27338
4004: 	GETFDPW
4004: provide access to FD 8, for passwd
4004: pruning hosts cache; time 1213821433
4004: considering GETAI entry "hv3.webstat.com", timeout 1213823673
4004: considering GETAI entry "now.eloqua.com", timeout 1213900282
4004: considering GETAI entry "weather.noaa.gov", timeout 1213821862
4004: considering GETAI entry "weather.noaa.gov", timeout 1213821862
4004: considering GETAI entry "weather.noaa.gov", timeout 1213821862
4004: considering GETAI entry "weather.noaa.gov", timeout 1213821862
4004: considering GETAI entry "weather.noaa.gov", timeout 1213821862
4004: considering GETAI entry "geo.yahoo.com", timeout 1213823478
4004: considering GETAI entry "koji.fedoraproject.org", timeout 1213823074
4004: considering GETAI entry "www.seamonkey-project.org", timeout 1213824901
4004: considering GETAI entry "localhost", timeout 1213823224
4004: considering GETHOSTBYNAME entry "localhost", timeout 1213822486
4004: considering GETAI entry "ai.vmware.com", timeout 1213852275
4004: considering GETAI entry "www.google.com", timeout 1213821687
4004: considering GETAI entry "www.nccgroup.com", timeout 1213906464
4004: considering GETAI entry "l.yimg.com", timeout 1213821432
4004: considering GETAI entry "vmware-www.baynote.net", timeout 1213835482
4004: considering GETAI entry "ocsp.thawte.com", timeout 1213824266
4004: considering GETAI entry "www.newsgator.com", timeout 1213822190
4004: considering GETAI entry "media3.washingtonpost.com", timeout 1213824689
4004: considering GETAI entry "news.google.com", timeout 1213821673
4004: considering GETAI entry "us.lrd.yahoo.com", timeout 1213821448
4004: considering GETAI entry "mail.google.com", timeout 1213824121
4004: considering GETHOSTBYNAMEv6 entry "localhost.localdomain", timeout 1213822450
4004: considering GETHOSTBYNAME entry "localhost.localdomain", timeout 1213822402
4004: considering GETAI entry "localhost.localdomain", timeout 1213822328
4004: considering GETAI entry "news.yahoo.com", timeout 1213821466
4004: considering GETAI entry "suggestqueries.google.com", timeout 1213821584
4004: considering GETAI entry "www.halflifesource.com", timeout 1213828473
4004: considering GETAI entry "jakarta.apache.org", timeout 1213822994
4004: considering GETAI entry "www.bloglines.com", timeout 1213821934
4004: considering GETAI entry "share.skype.com", timeout 1213822172
4004: considering GETHOSTBYADDRv6 entry "::1", timeout 1213823224
4004: remove GETAI entry "l.yimg.com"
4004: freed 120 bytes in hosts cache
27354: Access Vector Cache (AVC) started
27354: handle_request: request received (Version = 2) from PID 27365
27354: 	INVALIDATE (hosts)
27354: pruning hosts cache; time 2147483647
27354: considering GETAI entry "hv3.webstat.com", timeout 1213823673
27354: considering GETAI entry "now.eloqua.com", timeout 1213900282
27354: considering GETAI entry "weather.noaa.gov", timeout 1213821862
27354: considering GETAI entry "weather.noaa.gov", timeout 1213821862
27354: considering GETAI entry "weather.noaa.gov", timeout 1213821862
27354: considering GETAI entry "weather.noaa.gov", timeout 1213821862
27354: considering GETAI entry "weather.noaa.gov", timeout 1213821862

If I 'strace -p 4004', I get an endless number of:

read(6, 0xbff05350, 100)                = -1 EAGAIN (Resource temporarily
unavailable)
read(6, 0xbff05350, 100)                = -1 EAGAIN (Resource temporarily
unavailable)
read(6, 0xbff05350, 100)                = -1 EAGAIN (Resource temporarily
unavailable)
read(6, 0xbff05350, 100)                = -1 EAGAIN (Resource temporarily
unavailable)
read(6, 0xbff05350, 100)                = -1 EAGAIN (Resource temporarily
unavailable)
read(6, 0xbff05350, 100)                = -1 EAGAIN (Resource temporarily
unavailable)
read(6, 0xbff05350, 100)                = -1 EAGAIN (Resource temporarily
unavailable)
read(6, 0xbff05350, 100)                = -1 EAGAIN (Resource temporarily
unavailable)

If I 'gdb -p 4004', I get this from 'where':

0x00132416 in __kernel_vsyscall ()
(gdb) where
#0  0x00132416 in __kernel_vsyscall ()
#1  0x0014a13b in read () from /lib/libpthread.so.0
#2  0xb7eea598 in main_loop_epoll (efd=16) at ../posix/bits/unistd.h:45
#3  0xb7eeaad0 in start_threads () at connections.c:2136
#4  0xb7ee92ad in main (argc=1, argv=0xbff055d4) at nscd.c:279
(gdb)

Comment 26 Ulrich Drepper 2008-06-18 22:29:41 UTC

The endless read loop problem should be fixed upstream.

Comment 27 Tom London 2008-07-04 16:23:45 UTC

glibc-2.8.90-8 seems to fix endless read loop problem.

I now get only one nscd process, and all seems well (no more segfaults either).....

Comment 28 Ulrich Drepper 2008-07-04 20:11:41 UTC

All reported problems should indeed be fixed in 2.8.90-8.

Comment 29 Fedora Update System 2008-07-08 20:27:26 UTC

glibc-2.8-7 has been submitted as an update for Fedora 9

Comment 30 Fedora Update System 2008-07-23 07:08:27 UTC

glibc-2.8-8 has been pushed to the Fedora 9 stable repository.  If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.