Bug 429702 - multiple nscd problems (with ldap)
Summary: multiple nscd problems (with ldap)
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: nss_ldap
Version: 4.8
Hardware: i386
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Nalin Dahyabhai
Reported: 2008-01-22 15:06 UTC by Joel Eidsath
Modified: 2012-06-20 16:57 UTC (History)

Doc Type: Bug Fix
Last Closed: 2012-06-20 16:57:37 UTC

Description Joel Eidsath 2008-01-22 15:06:19 UTC
Description of problem:

We use ldap for user authentication for a number of RHEL 3, 4, and 5 servers. A
number of months ago, nscd began to crash every few weeks on the RHEL 4 and 5
servers. We did not consider it a big deal because nscd is not a vital service.

Lately, the nscd problems have gotten serious. On two of our servers beginning
about 1-2 months ago (one RHEL 3 and one RHEL 4 server) nscd began claiming that
certain users did not exist, while authenticating fine for others. Once nscd was
turned off, the problem went away.

At that point we turned off nscd on all of our servers in chkconfig.
Unfortunately, two (all?) of our RHEL 4 boxes will not start without nscd
enabled. They hang on "starting system message bus." We are currently turning
nscd off by hand on these servers after start up.

Version-Release number of selected component (if applicable):
Varies.

How reproducible:
The messagebus dependency is easy to reproduce. The rest is not.

Steps to Reproduce: (messagebus dependency)
1. Enable messagebus (on RHEL4).
2. Disable nscd.
3. Attempt to reboot machine.
  
Actual results:
Hang on starting system message bus.

Expected results:
Normal start up.

Additional info:

Comment 1 Jack Neely 2008-01-29 19:57:31 UTC
I believe the message bus problem is related to having 'ldap' listed as a
source for protocols in your /etc/nsswitch.conf file.  Remove ldap from that
line and see if that helps.
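
To see at a glance which nsswitch.conf databases consult ldap, something like the following could be used. This is only a sketch: the function name and the sample text are made up for illustration, and the sample is not the reporter's actual file. On a real host the text would be read from /etc/nsswitch.conf.

```python
def databases_using(source, nsswitch_text):
    """Return the nsswitch databases whose lookup order lists `source`."""
    hits = []
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        database, _, services = line.partition(":")
        # Skip bracketed action items such as [NOTFOUND=return].
        names = [tok for tok in services.split() if not tok.startswith("[")]
        if source in names:
            hits.append(database.strip())
    return hits


# Hypothetical excerpt for illustration only.
SAMPLE = """\
passwd:     files ldap
group:      files ldap
protocols:  files ldap   # the line suggested for editing
services:   files
"""

if __name__ == "__main__":
    print(databases_using("ldap", SAMPLE))
```

If "protocols" no longer appears in the result after editing the file, the change suggested above has been applied.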

I'm also curious about the rest of the ldap/nscd issues.  Do they look like Bug
#428837?

Comment 2 Joel Eidsath 2008-01-29 22:34:37 UTC
We've removed ldap from protocols in /etc/nsswitch.conf. The next time either
our user server or our mail server is rebooted (hopefully not for a while) we'll
let you know if that was the fix.

This issue does not look at all like #428837. I've never seen nscd hit 100% of
the CPU. The two nscd issues that we are having are:

1) nscd crashes at random intervals (usually after running for a week) -- this
behavior has existed for a number of months
2) nscd returns bad or missing information for random users -- this behavior has
existed for 1 or 2 months.

I attempted to debug the first issue, but I was never able to capture a crash
with nscd in debug mode -- it generates a lot of debug data.

Comment 3 Joel Eidsath 2008-02-15 20:20:27 UTC
Removing ldap from the protocols line in /etc/nsswitch.conf did not fix the
messagebus dependency.

Comment 4 Jose Plans 2008-02-15 20:36:53 UTC
What does your /etc/ldap.conf look like? 
Do you use "nss_initgroups_ignoreusers" by any chance in it?

Comment 5 Joel Eidsath 2008-02-15 21:20:07 UTC
Yes, we've got the following line: 
nss_initgroups_ignoreusers root,ldap
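
For checking which accounts a given ldap.conf excludes from initgroups lookups, a rough sketch follows. It assumes the keyword is matched exactly as spelled above and that the value is a comma-separated list, as in the quoted line. The function name is hypothetical, and on a real system the text would come from /etc/ldap.conf.

```python
def ignored_initgroups_users(ldap_conf_text):
    """Collect the users named on nss_initgroups_ignoreusers lines."""
    users = []
    for raw in ldap_conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        # Exact keyword match is an assumption of this sketch.
        if len(parts) == 2 and parts[0] == "nss_initgroups_ignoreusers":
            users.extend(u.strip() for u in parts[1].split(",") if u.strip())
    return users


if __name__ == "__main__":
    conf = "nss_initgroups_ignoreusers root,ldap\n"
    ignored = ignored_initgroups_users(conf)
    print(ignored)
    print("nscd" in ignored)
```

With the line quoted above, the list contains root and ldap but not nscd.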

Also, I've finally been able to record a failure in the ldap logs. My username
is "thras" with uid 4954

With nscd on, I ran 'id thras' a couple times from the command line, and it
returned no such user. Then I turned nscd off and 'id thras' worked. I wasn't
able to reproduce this again with myself or any other users.
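
Because the failure is intermittent, a small probe that records when a lookup fails may make it easier to line up with other logs. The sketch below uses Python's standard pwd module, which goes through the system's normal name service lookup, and it assumes a failed lookup surfaces as KeyError (the documented behaviour of pwd.getpwnam). The function names and the choice of users to probe are illustrative only.

```python
import pwd
from datetime import datetime, timezone


def utc_now():
    """Current time as a readable UTC string."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")


def probe(names, lookup=pwd.getpwnam, clock=utc_now):
    """Look up each name once; return (timestamp, name) for every failure."""
    failures = []
    for name in names:
        try:
            lookup(name)
        except KeyError:
            # pwd.getpwnam raises KeyError when the user cannot be found.
            failures.append((clock(), name))
    return failures


if __name__ == "__main__":
    for stamp, name in probe(["root", "thras"]):
        print(f"{stamp} lookup failed for {name}")
```

Running this from cron with nscd on, and again with it off, would give timestamps to compare against the nscd debug output.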

But here is (I think -- there aren't any timestamps) the relevant portion of the
nscd log:
2906: handle_request: request received (Version = 2) from PID 9515
2906:   GETFDPW
2906: provide access to FD 6, for passwd
2906: handle_request: request received (Version = 2) from PID 9515
2906:   GETFDGR
2906: provide access to FD 8, for group
2906: handle_request: request received (Version = 2) from PID 9515
2906:   GETGRBYGID (4954)
2906: Haven't found "4954" in group cache!
2906: handle_request: request received (Version = 2) from PID 9529
2906:   GETFDPW
2906: provide access to FD 6, for passwd
2906: handle_request: request received (Version = 2) from PID 9529
2906:   GETFDGR
2906: provide access to FD 8, for group
2906: handle_request: request received (Version = 2) from PID 9529
2906:   GETGRBYGID (4954)
2906: Haven't found "4954" in group cache!
2906: pruning hosts cache; time 1203108144

About 2000 lines (~3 minutes) earlier in the log this shows up, but I don't
think that's when 'id' failed:
2906: considering INITGROUPS entry "thras", timeout 1200467811
2906: Reloading "thras" in group cache!
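
The log has no per-line timestamps, but the "time" and "timeout" values look like Unix epoch seconds. Assuming that reading is right, the sketch below converts them to UTC and counts the cache misses. The regular expressions are written against the lines quoted above only and are not a full parser for nscd debug output.

```python
import re
from datetime import datetime, timezone

# Patterns derived from the excerpt above; other nscd messages are ignored.
EPOCH_RE = re.compile(r"\b(?:time|timeout) (\d{9,10})\b")
MISS_RE = re.compile(r'Haven\'t found "([^"]+)" in (\w+) cache!')


def epoch_to_utc(seconds):
    """Format epoch seconds as a UTC timestamp string."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")


def epochs_in(log_text):
    """Return every time/timeout value in the log as a UTC string."""
    return [epoch_to_utc(int(m)) for m in EPOCH_RE.findall(log_text)]


def count_misses(log_text):
    """Count 'Haven't found' messages per (cache, key) pair."""
    counts = {}
    for key, cache in MISS_RE.findall(log_text):
        counts[(cache, key)] = counts.get((cache, key), 0) + 1
    return counts


LOG = """\
2906:   GETGRBYGID (4954)
2906: Haven't found "4954" in group cache!
2906:   GETGRBYGID (4954)
2906: Haven't found "4954" in group cache!
2906: pruning hosts cache; time 1203108144
2906: considering INITGROUPS entry "thras", timeout 1200467811
"""

if __name__ == "__main__":
    print(epochs_in(LOG))
    print(count_misses(LOG))
```

Under the epoch assumption, the prune time decodes to 2008-02-15 20:42:24 UTC and the INITGROUPS timeout to 2008-01-16 07:16:51 UTC, about a month earlier. The sketch cannot say whether that gap matters.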


Comment 6 Joel Eidsath 2008-02-15 21:21:15 UTC
There was a typo above. It should read "I've finally been able to record a
failure in the nscd logs."

Comment 7 Joel Eidsath 2008-02-15 21:36:14 UTC
I'll try adding nscd to nss_initgroups_ignoreusers and see if that corrects the
system message bus problem.

Comment 8 Eijiro Sumii 2008-03-10 05:58:42 UTC
For the dbus hang-up, see #431301
(https://bugzilla.redhat.com/show_bug.cgi?id=431301).  Downgrading nss_ldap from
nss_ldap-226-20 to nss_ldap-226-18 solved the problem for me.  It is also
reported to fix some other issues
(https://bugzilla.redhat.com/show_bug.cgi?id=426155 and
https://bugzilla.redhat.com/show_bug.cgi?id=427189).  (In my case,
nss_initgroups_ignoreusers didn't help.  In fact, it isn't even supported in
nss_ldap-226-18.)

Comment 9 Joel Eidsath 2008-10-10 21:58:36 UTC
We were never able to solve the dbus problem. Currently, downgrading nss_ldap seems to fix all sorts of problems. I don't know what sort of testing process is going on with this package before release, but it may need some modifications.

Comment 10 Buchan Milne 2009-06-26 09:16:38 UTC
We've seen similar problems on RHEL4 since about 17 February, when updates to nss_ldap and nscd were installed. We had not seen this before on RHEL4.

We also see it on some 5.3 boxes, but they weren't in production on anything before 5.3.

However, we have seen problems looking up local users (e.g. 'getent passwd root' fails), so I suspect this is an nscd bug, and not an nss_ldap bug.

E.g., we have about 10 servers which are very similar software-wise; one of these did not get the updates at the same time, and this host is not seeing the problem.

(I don't agree with the nss_initgroups_ignoreusers workaround; we use 'bind_policy soft' to restore the older nss_ldap behaviour.)

Is anyone seeing this problem without nscd?

This could just be bug #495515 ...

Comment 11 Jiri Pallich 2012-06-20 16:57:37 UTC
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release you asked us to review is now End of Life.
Please see https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.

