Bug 502072 - After enabling LDAP authentication/identification, booting system hangs starting dbus....
Status: CLOSED NOTABUG
Product: Fedora
Classification: Fedora
Component: nss_ldap
Version: 14
Hardware: All Linux
Priority: urgent, Severity: medium
Assigned To: Nalin Dahyabhai
QA Contact: Fedora Extras Quality Assurance
Keywords: Reopened
Depends On:
Blocks: 513460
Reported: 2009-05-21 14:14 EDT by Tom London
Modified: 2011-01-30 12:51 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-12-02 09:35:11 EST


Description Tom London 2009-05-21 14:14:49 EDT
Description of problem:
Attempting to enable LDAP authentication.

I configured "User information" and "Authentication" in System->Authentication to use LDAP, and configured LDAP settings as directed by IT admin.

Restarting system afterwards, the boot hangs trying to start dbus:

May 21 10:12:18 tlondon kernel: RPC: Registered udp transport module.
May 21 10:12:18 tlondon kernel: RPC: Registered tcp transport module.
May 21 10:12:19 tlondon dbus-daemon: nss_ldap: failed to bind to LDAP server ldap://hqdc081.baystorm.local: Can't contact LDAP server
May 21 10:12:19 tlondon dbus-daemon: nss_ldap: failed to bind to LDAP server ldap://hqdc081.baystorm.local: Can't contact LDAP server
May 21 10:12:19 tlondon dbus-daemon: nss_ldap: reconnecting to LDAP server (sleeping 4 seconds)...
May 21 10:12:20 tlondon kernel: JBD: barrier-based sync failed on dm-0:8 - disabling barriers
May 21 10:12:23 tlondon dbus-daemon: nss_ldap: failed to bind to LDAP server ldap://hqdc081.baystorm.local: Can't contact LDAP server
May 21 10:12:23 tlondon dbus-daemon: nss_ldap: reconnecting to LDAP server (sleeping 8 seconds)...

etc.

I've waited several minutes, but no joy.

To get system booting again, I had to boot single user, and manually edit /etc/nsswitch.conf to delete "ldap":

[root@tlondon ~]# diff /etc/nsswitch.conf*
33,35c33,35
< passwd:     files
< shadow:     files
< group:      files
---
> passwd:     files ldap
> shadow:     files ldap
> group:      files ldap
57c57
< netgroup:   files
---
> netgroup:   files ldap
61c61
< automount:  files
---
> automount:  files ldap
[root@tlondon ~]# 


Version-Release number of selected component (if applicable):
dbus-libs-1.2.12-1.fc11.x86_64
dbus-debuginfo-1.2.12-1.fc11.x86_64
dbus-1.2.12-1.fc11.x86_64
dbus-x11-1.2.12-1.fc11.x86_64
dbus-glib-0.80-2.fc11.x86_64
dbus-glib-devel-0.80-2.fc11.x86_64
dbus-python-0.83.0-5.fc11.x86_64
dbus-devel-1.2.12-1.fc11.x86_64
dbus-glib-debuginfo-0.80-2.fc11.x86_64


How reproducible:
Looks like "every time"

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
Comment 1 Tom London 2009-05-21 14:31:31 EDT
Here is a log from a "longer wait":

May 21 10:05:48 tlondon kernel: RPC: Registered tcp transport module.
May 21 10:05:48 tlondon dbus-daemon: nss_ldap: failed to bind to LDAP server ldap://hqdc081.baystorm.local: Can't contact LDAP server
May 21 10:05:48 tlondon dbus-daemon: nss_ldap: failed to bind to LDAP server ldap://hqdc081.baystorm.local: Can't contact LDAP server
May 21 10:05:48 tlondon dbus-daemon: nss_ldap: reconnecting to LDAP server (sleeping 4 seconds)...
May 21 10:05:49 tlondon kernel: JBD: barrier-based sync failed on dm-0:8 - disabling barriers
May 21 10:05:52 tlondon dbus-daemon: nss_ldap: failed to bind to LDAP server ldap://hqdc081.baystorm.local: Can't contact LDAP server
May 21 10:05:52 tlondon dbus-daemon: nss_ldap: reconnecting to LDAP server (sleeping 8 seconds)...
May 21 10:06:00 tlondon dbus-daemon: nss_ldap: failed to bind to LDAP server ldap://hqdc081.baystorm.local: Can't contact LDAP server
May 21 10:06:00 tlondon dbus-daemon: nss_ldap: reconnecting to LDAP server (sleeping 16 seconds)...
May 21 10:06:16 tlondon dbus-daemon: nss_ldap: failed to bind to LDAP server ldap://hqdc081.baystorm.local: Can't contact LDAP server
May 21 10:06:16 tlondon dbus-daemon: nss_ldap: reconnecting to LDAP server (sleeping 32 seconds)...
May 21 10:06:48 tlondon dbus-daemon: nss_ldap: failed to bind to LDAP server ldap://hqdc081.baystorm.local: Can't contact LDAP server
May 21 10:06:48 tlondon dbus-daemon: nss_ldap: reconnecting to LDAP server (sleeping 64 seconds)...
May 21 10:07:52 tlondon dbus-daemon: nss_ldap: failed to bind to LDAP server ldap://hqdc081.baystorm.local: Can't contact LDAP server
May 21 10:07:52 tlondon dbus-daemon: nss_ldap: could not search LDAP server - Server is unavailable
May 21 10:07:52 tlondon dbus-daemon: nss_ldap: failed to bind to LDAP server ldap://hqdc081.baystorm.local: Can't contact LDAP server
May 21 10:07:52 tlondon dbus-daemon: nss_ldap: failed to bind to LDAP server ldap://hqdc081.baystorm.local: Can't contact LDAP server
May 21 10:07:52 tlondon dbus-daemon: nss_ldap: reconnecting to LDAP server (sleeping 4 seconds)...
May 21 10:07:56 tlondon dbus-daemon: nss_ldap: failed to bind to LDAP server ldap://hqdc081.baystorm.local: Can't contact LDAP server
May 21 10:07:56 tlondon dbus-daemon: nss_ldap: reconnecting to LDAP server (sleeping 8 seconds)...
May 21 10:08:04 tlondon dbus-daemon: nss_ldap: failed to bind to LDAP server ldap://hqdc081.baystorm.local: Can't contact LDAP server
May 21 10:08:04 tlondon dbus-daemon: nss_ldap: reconnecting to LDAP server (sleeping 16 seconds)...
May 21 10:08:20 tlondon dbus-daemon: nss_ldap: failed to bind to LDAP server ldap://hqdc081.baystorm.local: Can't contact LDAP server
May 21 10:08:20 tlondon dbus-daemon: nss_ldap: reconnecting to LDAP server (sleeping 32 seconds)...
May 21 10:08:52 tlondon dbus-daemon: nss_ldap: failed to bind to LDAP server ldap://hqdc081.baystorm.local: Can't contact LDAP server
May 21 10:08:52 tlondon dbus-daemon: nss_ldap: reconnecting to LDAP server (sleeping 64 seconds)...
May 21 10:09:43 tlondon init: rc5 main process (1337) killed by TERM signal
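
The retry pattern in the log above can be modelled with a short Python sketch. This is only an illustration of the schedule visible in the log (sleeps of 4, 8, 16, 32 and 64 seconds before "could not search LDAP server"), not nss_ldap's actual code; the function names and the first/last parameters are assumptions made for the example.

```python
# Sketch only: reproduces the sleep schedule seen in the syslog excerpt,
# it is NOT taken from the nss_ldap sources.

def reconnect_sleeps(first=4, last=64):
    """Yield the sleep intervals (seconds) between bind attempts, doubling each time."""
    delay = first
    while delay <= last:
        yield delay
        delay *= 2


def blocked_seconds(lookups, first=4, last=64):
    """Total time spent sleeping if `lookups` lookups each run one full retry cycle."""
    return lookups * sum(reconnect_sleeps(first, last))


if __name__ == "__main__":
    print(list(reconnect_sleeps()))  # [4, 8, 16, 32, 64]
    print(blocked_seconds(1))        # 124
```

One full cycle sleeps for 124 seconds, which matches the gap between 10:05:48 and 10:07:52 in the log; every further lookup that reaches nss_ldap while the server is unreachable would add another cycle.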
Comment 2 Colin Walters 2009-05-21 17:39:46 EDT
Hmm.  Does nss_ldap contact the ldap server to look up even "system" uids (say < 100)?
Comment 3 Tom London 2009-05-21 18:49:20 EDT
In case it's needed, I'm running:

nss_ldap-264-2.fc11.x86_64
Comment 4 Bug Zapper 2009-06-09 12:16:42 EDT
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 5 Jacques Isaac 2009-06-10 15:44:59 EDT
Is there a workaround for this issue? If not, the priority should be high or critical...

Please let me know.

Thanks.
Comment 6 Tom London 2009-06-10 16:21:15 EDT
Not sure....

The issue occurs when my system (laptop) moves from my @work corporate LAN (where booting works) to my @home ISP provided connection, and where the LDAP server's name is no longer resolvable (XXXX.baystorm.local).

I disabled LDAP auth shortly after reporting this: my workaround.

Suppose I could re-enable if testing is needed....

Regardless, I would think there could be some other response to "failed to bind" besides hanging....
Comment 7 Jacques Isaac 2009-06-10 16:45:09 EDT
I don't understand... I mean is there a workaround for those who have to authenticate against LDAP?
Comment 8 Jon Doran 2009-07-01 14:40:39 EDT
This is a _critical_ issue for us as well.  Given the choice of not booting, or not being able to authenticate logins, I can't say I like either.

This seems like a repeat of an older issue with threading.  I was able to get the system to boot by removing "ldap" from group in nsswitch.conf (which was the old suggested workaround, from 2006). (Note this means the system boots with ldap in the other fields). But we use LDAP for everything, and therefore have no group membership information for users.

This is a new FC11 install (replacing FC8 server over the summer).  All updates applied.  And LDAP was working fine before I rebooted.
Comment 9 Colin Walters 2009-07-02 02:19:25 EDT
Does it help to enable nscd?  

It'd be useful to know what uid it's trying to look up.  Basically if we're trying to look up a uid < 500 over the network it's going to be broken.  I don't see a good fix other than nss_ldap explicitly skipping those uids.  We could potentially custom patch dbus to access /etc/passwd directly too of course, but it seems possible other daemons would have similar problems, such as sshd.
Comment 10 Jon Doran 2009-07-02 12:58:10 EDT
Enabling nscd did not help.

This is the same as bug 232699 (imho), which was closed WONTFIX.

Following the advice in that old bug, I changed /etc/init.d/messagebus to "chkconfig 28 85" so that it would load after ldap.  These are similar to, but not the same as, the levels mentioned in bug 232699.  The system did not boot, but I am still hopeful that some variant will work.  I need to go to another lab right now, so I can't poke at this machine until this evening.

I do not know how to determine which uid is being looked up.  If I could get detailed instructions I will be happy to try it.  But I am hopeful that tweaking the startup order of services will be the correct fix.
Comment 11 Jon Doran 2009-07-02 18:53:17 EDT
Oops, I forgot to run resetpriorities after making that change.  Anyhow, I lowered the priority further on ldap:   2345 11 73

Make the change in /etc/init.d/ldap, then run
   /sbin/chkconfig --levels 2345 ldap resetpriorities

The system boots with ldap in nsswitch.conf.

LDAP seems only to depend on the network and the local filesystem, so starting it early doesn't hurt.  And anything which might use it, which could be just about anything, needs to start afterwards.
Comment 12 Richard Colley 2009-07-08 04:17:58 EDT
This is a very old bug that has been around since 2006. Why it still isn't fixed is anybody's guess. Or perhaps it is a regression.

In any case, the simplest workaround is documented here: https://bugzilla.redhat.com/show_bug.cgi?id=182464#c10

Related to this may be that NetworkManager doesn't run until S27 (after messagebus).

Comments about changing the priority of the ldap startup are only relevant if the box hosts the ldap server.

Another possible solution (untested) is to change lines in nsswitch.conf to the following pattern:

  passwd:     files [SUCCESS=return] ldap
Comment 13 Richard Colley 2009-07-08 04:20:45 EDT
p.s. I'm running rawhide (fc12), so perhaps there are some differences in startup priorities, but the general comment is still relevant.
Comment 14 Josh Fisher 2009-08-06 14:51:41 EDT
(In reply to comment #12)
> 
> Another possible solution (untested) is to change lines in nsswitch.conf to the
> following pattern:
> 
>   passwd:     files [SUCCESS=return] ldap  

Doesn't work. It looks like nss-ldap still tries to make a connection to the ldap server.

The fix is to set "bind_policy soft" in /etc/ldap.conf. By default, nss-ldap uses bind_policy hard, which means it will keep trying to open the connection until finally, at some point, timing out. The 'soft' policy tells nss-ldap to try once and fail if it cannot bind to the ldap server.

Several services will call nsswitch related functions at startup before ldap is brought up. The new bind initscript tries to chown named.named /var/named, for example, causing named to appear to hang at startup because named is configured to start before ldap.

Changing the startup order may work for some particular system config, but is not an answer. If any daemon is ever added that starts up before ldap and also needs to do a getpwnam() call, it will hang for some time if bind_policy hard is specified for nss-ldap. For ldap auth systems running slapd the lag will always occur. But it will also occur on any system using ldap auth whenever the ldap server is unreachable. So for example, any service started before 'network' at startup could hang if it needs to do a getpwnam() function call.

Solution: When authconfig adds ldap lookups to /etc/nsswitch.conf it should also be adding 'bind_policy soft' to /etc/ldap.conf.
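
A rough Python sketch of what such a change might do is shown here. It is not authconfig's actual code: the function name set_bind_policy is hypothetical, and it only rewrites text handed to it rather than editing /etc/ldap.conf in place.

```python
# Sketch only: a hypothetical helper, not part of authconfig or nss_ldap.

def set_bind_policy(conf_text, policy="soft"):
    """Return ldap.conf text that has exactly one active 'bind_policy <policy>' line."""
    out = []
    replaced = False
    for line in conf_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "bind_policy":
            if not replaced:
                out.append("bind_policy " + policy)
                replaced = True
            # Any further active bind_policy lines are dropped as duplicates.
            continue
        out.append(line)  # comments and unrelated options pass through unchanged
    if not replaced:
        out.append("bind_policy " + policy)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "uri ldap://ldap.example.com/\n#bind_policy hard\n"
    print(set_bind_policy(sample), end="")
```

Commented-out lines such as "#bind_policy hard" are left alone, so the default documented in the file stays visible next to the override.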
Comment 15 Simo Sorce 2009-09-13 18:26:22 EDT
The problem here is that messagebus needs to activate some services and transition them to use their own users. To change user, an initgroups call is made.

Initgroups calls always span all NSS modules as the call basically is a search for all groups the user belongs to.

A quick workaround to avoid undue delays when using nss_ldap is to blacklist the users messagebus has to switch to, although this list will need to be updated every time a new system user is used by messagebus.

On my test system (rawhide from f12alpha) adding 'nss_initgroups_ignoreusers' to /etc/ldap.conf eliminates all delays with the following list of users to ignore:
root,dirsrv,gdm,rtkit,pulse,haldaemon,polkituser,avahi,dbus

(I have 389ds installed that's why I also have dirsrv, but that one is not needed unless this is a server with 389ds installed)
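
The difference between a plain passwd lookup and an initgroups call, as described above, can be illustrated with a toy model in Python. Everything here is hypothetical (the Source class, the user and group names, the query counter); it only mimics the behaviour described in this comment and is not how glibc or nss_ldap are implemented.

```python
# Toy model only: stand-ins for NSS modules, not real glibc/nss_ldap behaviour.

class Source:
    """One pretend NSS module, e.g. 'files' or 'ldap'."""

    def __init__(self, name, users, groups, ignore_users=()):
        self.name = name
        self.users = set(users)
        self.groups = groups                   # group name -> set of member names
        self.ignore_users = set(ignore_users)  # models nss_initgroups_ignoreusers
        self.queries = 0                       # how often this source was consulted

    def has_user(self, user):
        self.queries += 1
        return user in self.users

    def groups_for(self, user):
        if user in self.ignore_users:
            return set()                       # answered without consulting the backend
        self.queries += 1
        return {g for g, members in self.groups.items() if user in members}


def getpwnam(sources, user):
    """passwd lookup: stop at the first source that knows the user."""
    for src in sources:
        if src.has_user(user):
            return src.name
    return None


def initgroups(sources, user):
    """Supplementary groups: the union over every source, not just the first."""
    found = set()
    for src in sources:
        found |= src.groups_for(user)
    return found


if __name__ == "__main__":
    files = Source("files", {"root", "dbus"}, {"wheel": {"root"}})
    ldap = Source("ldap", {"alice"}, {"staff": {"alice"}})
    getpwnam([files, ldap], "dbus")
    print(ldap.queries)   # 0: the passwd lookup never reached ldap
    initgroups([files, ldap], "dbus")
    print(ldap.queries)   # 1: initgroups consulted ldap anyway
```

Constructing the ldap source with ignore_users={"dbus"} keeps its query count at 0 for that user, which is the effect the ignore list is meant to have.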
Comment 16 Peter Glassenbury 2009-10-15 21:50:46 EDT
Bug also in Fedora 12 rawhide
Comment 17 Vincent Danen 2009-10-20 18:25:17 EDT
I'm seeing this exact same issue on Fedora 12 beta today.  I think what needs to happen is for the bind_policy to be changed to soft (this worked here).  Having to blacklist users may mean this works for a time but if new users/packages are installed, then it may require more additions to the blacklist which means at some point this will bite someone again.

The other option would be to start the network before messagebus starts (not sure how realistic this is, I haven't tried).  Before finding this bug I did try manually editing /etc/sysconfig/network-scripts/ifcfg-eth0 to make the interface start on boot, but messagebus is started first anyways so it didn't help.

I did not run into this issue at all with Fedora 11 (I'm using LDAP for auth on my local network).  With Fedora 12, this is a fairly significant bug.

Ummm...  I'm not sure if this is due to the use of bind_policy soft, but at gdm I have no user entry field.  I see Suspend, Restart, Shut Down, and Log In, and when clicking on Log In it just recycles the gdm screen.  I can, however, login at the console just fine.  An entirely different bug, I'm sure, but probably related (I can't tell as I did not make a local user, and root is unable to login at gdm).  I'm suspecting that with the soft bind_policy, gdm isn't getting a list of allowable groups with which to build either a pick list of users or an entry field to present.  nscd is installed and running, and even after a reboot (hoping perhaps the nscd cache would persist to present this information), still no way to login via gdm.

(Just a gripe, but ctrl-alt-backspace doesn't work either to (attempt) to force gdm to re-read the list of available users, so looking at https://fedoraproject.org/wiki/Fedora_11_Alpha_release_notes#X_Server indicates to add DontZap back into xorg.conf, but there's no xorg.conf to be found!)

Doing a kill -9 on gdm doesn't work either; still nothing usable to log into at gdm.
Comment 18 Peter Glassenbury 2009-10-20 20:03:06 EDT
A response to the second part of #17 I have the same problem...All symptoms
the same..can't login to gdm. It has been happening a while on F12alpha but I don't know where the issue is. 
If you type at the screen a box shows up to take the input but doesn't
change the result(it might be a userlist search box)

I changed the bind_policy back to hard (but what needs a HUP for it to notice?)
The change on its own made no difference to the gdm login so possibly a new bug
Comment 19 Peter Glassenbury 2009-10-20 20:15:10 EDT
Gdm issues probably different bug... changed nsswitch to not use ldap at all...
just files and dns. I still get no user entry field.
Comment 20 Vincent Danen 2009-10-29 09:48:54 EDT
Ok, I think I have some clues on the gdm thing and possibly dbus as well.  If I setup myself with a local account, I see my name at the gdm login screen (LDAP auth is there, but the local account takes precedence).  Fine, I can login that way and it's worked fine the last few days.  No dbus issues.

I have noticed, now that there is a local user, that the "Other" button is there.  To test, I clicked on it and logged in as my wife (no home directory, but she is in LDAP).  Worked fine.

So I backed up passwd, shadow, and group and removed myself and rebooted thinking that perhaps the existence of the home directory is all I needed.  When I booted again, however, gdm only gives me a "Log in" button... with no entry fields to actually login.

So I'm suspecting, through all of this, that there needs to be at least one local account for gdm to, at the very least, enable the "other" button to allow LDAP users to login.  Without that local account, I can't get gdm to let me in for love or money.  With it, I can login with other LDAP accounts.

FWIW, gdm must do some kind of polling to get the user list as I can switch to vc/2, restore my backup files, then switch back to vc/1 and voila, my local user is there, and so is my "other" button.

To keep this sort of on-topic (sorry, I probably should be creating a new bug for the gdm issues), during the removal of my local user and reboot, there were no dbus issues whatsoever and boot was fast and normal.  I'm not sure what messagebus is doing, or by having had a login there is some sort of caching or something going on, but it's no longer giving me grief since I duplicated my LDAP account as a local account (unless something was fixed in the last week as I've been keeping it up to date with new packages).
Comment 21 Peter Glassenbury 2009-10-29 20:50:14 EDT
To try and keep the gdm and ldap(system dbus) issues separate a followup comment 
I have updated to the latest gdm-2.28.1-12.fc12.x86_64 which has fixed 
my gui login issues but has NOT had any effect on the system-dbus hang
It still looks up ldap, can't find servers because the network isn't ready yet
The "bind_policy soft" will workaround it.. Booting after changing to this
gave an error on Nfs.client
and a 10 second stop on sendmail but recovered, completed boot and allows 
ldap logins. My local password file only has the defaults although I note that 
there are 55 users in that file so might some, other than the 13 below
 be triggering something from this line in ldap?
# Just assume that there are no supplemental groups for these named users
nss_initgroups_ignoreusers root,ldap,named,avahi,haldaemon,dbus,radvd,tomcat,radiusd,news,mailman,nscd,gdm,polkituser
Comment 22 Peter Glassenbury 2009-10-29 21:04:10 EDT
Just had a thought of using brute force and ignorance... 
Someone in the know may explain this effect 

IF I change back to bind_policy hard  (broken) 
AND change nss_initgroups_ignoreusers to be all the  accounts
    in my local password file(they are all default fedora accounts since
    my normal users are in ldap)

THEN - everything seems to work fine... 

No hangs, no failed services, users can login to gui fine

Have I opened up a hole or does nss_initgroups_ignoreusers need
to default to a more expanded set of system users??
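
For anyone wanting to try the same experiment, the list can be generated rather than typed by hand. The Python sketch below builds an nss_initgroups_ignoreusers line from passwd-format text; the function name is hypothetical, and whether ignoring every local account is a safe default is exactly the open question above (per the comment in ldap.conf, users on this list are assumed to have no supplemental groups in LDAP).

```python
# Sketch only: a hypothetical helper that formats the option line from passwd(5) text.

def ignoreusers_line(passwd_text):
    """Build an nss_initgroups_ignoreusers line listing every account in passwd_text."""
    names = []
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        name = line.split(":", 1)[0]      # first field is the login name
        if name and name not in names:
            names.append(name)            # keep file order, drop duplicates
    return "nss_initgroups_ignoreusers " + ",".join(names)


if __name__ == "__main__":
    sample = (
        "root:x:0:0:root:/root:/bin/bash\n"
        "dbus:x:81:81:System message bus:/:/sbin/nologin\n"
    )
    print(ignoreusers_line(sample))  # nss_initgroups_ignoreusers root,dbus
```

To use it on a real system one would pass in the contents of /etc/passwd and paste the resulting line into /etc/ldap.conf.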
Comment 23 Josh Fisher 2009-10-30 08:36:14 EDT
Are we sure this is not an nsswitch problem? I thought that if nsswitch.conf had for example 'passwd: files ldap', then account info should be looked for first in files and then in ldap. I would think with this lookup order that a user in the passwd file would be found and no LDAP lookup would even be made. If the user is in the passwd file, but a LDAP lookup is being made anyway, even though the user has already been found in the passwd file, then isn't that the problem?
Comment 24 Simo Sorce 2009-10-30 08:40:21 EDT
(In reply to comment #23)
> Are we sure this is not an nsswitch problem? I thought that if nsswitch.conf
> had for example 'passwd: files ldap', then account info should be looked for
> first in files and then in ldap. I would think with this lookup order that a
> user in the passwd file would be found and no LDAP lookup would even be made.
> If the user is in the passwd file, but a LDAP lookup is being made anyway, even
> though the user has already been found in the passwd file, then isn't that the
> problem?  

Read comment #15 ...
Comment 25 Braden McDaniel 2009-11-15 23:49:05 EST
The behavior I'm seeing is probably related. In my case, the session bus is taking a *long* time to come up during boot; but it does eventually succeed.

In /var/log/messages I see several lines that look like this:

Nov 15 03:52:20 bolt dbus-daemon: nss_ldap: failed to bind to LDAP server ldap://ldap.endoframe.net: Can't contact LDAP server

Eventually it gives up and proceeds:

Nov 15 04:00:53 bolt ntpd[1910]: nss_ldap: could not search LDAP server - Server is unavailable

... but the service does report successful startup.

For me, this is new behavior in F12. F11 didn't do this.
Comment 26 Vincent Danen 2009-11-18 17:58:08 EST
On a default F12 install (just installed both 32bit and 64bit in vmware), without setting a local user but using LDAP and kerberos authentication defined during the auth setup, I am completely unable to login.  I can't even login as root.  If I wait long enough, gdm displays "Username" and a field to enter (with "Other..." being the only selectable choice), but as soon as I type my username in, there is no prompt for a password and gdm sits there for a good few minutes before I'm prompted for the password.  Then it sits there some more.

It's been sitting there for over 15 minutes now, after providing the password for the LDAP account.

Rebooting into rescue mode shows nss_ldap unable to bind to the LDAP server, over multiple attempts, called by nscd, ntpd(?), abrtd, gdm-simple-slave and gdm-password.  In rescue mode, chrooted to the system, and with the network started, I can "id vdanen" and get a response from the LDAP server.  I can also su to myself and my wife's accounts, provide passwords, and have it authenticate properly.

Without being able to login as root, it makes it a bit difficult to debug.  Unfortunately, rebooting and watching the boot up, I see no stalls and no errors before gdm starts, so unless there is something after that starting up that is causing this, I'm not sure where the problem is precisely.

Changing to "bind_policy soft" in ldap.conf does not allow me to login, but does give me a much more responsive gdm.  When I enter my username/password, I get "User unknown to underlying authentication mechanism" (or something along those lines, can't see the whole message).

But I can login as root this way.  Aaargh!  It's like NetworkManager isn't starting eth0 at all, which might be what the problem is.  /etc/sysconfig/network-scripts/ifcfg-eth0 says ONBOOT=no, and of course I can't seem to configure it at the CLI.

At this point all I can safely say is that dbus has started with no delay.  I have no idea what is going on here other than perhaps NetworkManager doesn't want to connect eth0 until a user has logged in, which makes this whole situation a little impossible since I can't login without eth0 being up.  I don't believe I was given the chance to configure the network in any way during installation.

If I manually edit /etc/sysconfig/network-scripts/ifcfg-eth0 to make it start at boot, using DHCP, etc. then everything works fine (and this is with setting bind_policy back to hard).  I do, however, have to wait a few seconds before I can login or I get the "unknown user" error.

Reproduced by _only_ fixing ifcfg-eth0 on the other virtual machine.
Comment 27 Braden McDaniel 2009-11-30 04:55:21 EST
I've observed the behavior described in comment #26 on two F12 systems, now. When I fixed it the first time, I wasn't sure what I did. The second time I was more surgical: I think this behavior results from the default /etc/ldap.conf *not* setting "host"--it only sets "uri". If I set "host", things start working.

And Vincent, you should have an opportunity to log into the system as root sometime between when attempts to contact the LDAP server time out (~8 minutes) and when pam starts trying again. During this interval pam appears to fall back to local authentication.
Comment 28 Peter Glassenbury 2009-11-30 15:10:27 EST
Seeing the comments coming in... I am still having this problem in F12 release
I note that the version in the bug says for F11.
I have "fixed" it with this workaround and would be interested if Braden and Vincent could be fixed the same way... Rather than "bind_policy soft" which
appears to be a "give up on ldap and try starting again later when the network is going" type policy, I use the following sed command. I would also like comment on whether this is a good way to fix things or not. This was the only change made.

# ================================================================= #
# Change the ldap.conf file to ignore some of the local users...    #
# otherwise F12 hangs at dbus trying to contact the ldap server     #
# ================================================================= #
sed -i.DIST -e 's/^nss_initgroups_ignoreusers.*$/nss_initgroups_ignoreusers root,bin,daemon,adm,lp,sync,shutdown,halt,mail,uucp,operator,games,gopher,ftp,nobody,dbus,oprofile,vcsa,avahi-autoipd,ntp,qemu,polkituser,rpc,rpcuser,nfsnobody,rtkit,distcache,nscd,tcpdump,avahi,apache,mailnull,smmsp,openvpn,named,smolt,webalizer,nm-openconnect,postgres,sshd,postfix,dovecot,torrent,pulse,haldaemon,mysql,hsqldb,jetty,exim,squid,backuppc,news,gdm,tomcat,saslauth/' $LOCATION/etc/ldap.conf
Comment 29 Vincent Danen 2009-11-30 16:23:44 EST
(In reply to comment #28)
> Seeing the comments coming in... I am still having this problem in F12 release
> I note that the version in the bug says for F11.
> I have "fixed" it with this workaround and would be interested if Braden and
> Vincent could be fixed the same way... Rather than "bind_policy soft" which

That's not a scalable fix.  You'd need to manually change things every time you add a new package that adds another system group.

An appropriate fix at install-time I think would be to ask whether or not the NetworkManager interface(s) should start at boot, or automagically configure them to start at boot for any network-based authentication.  The problem is obviously in the LDAP server being unreachable.. and it doesn't make sense to have the network start after a user logs in -- especially if they're a network-based user.

Being able to tell Anaconda (automatically or not) that "yes, I need the network to start at boot" would probably fix this whole thing.
Comment 30 Colin Walters 2009-12-01 11:05:27 EST
(In reply to comment #29)
> (In reply to comment #28)
> > Seeing the comments coming in... I am still having this problem in F12 release
> > I note that the version in the bug says for F11.
> > I have "fixed" it with this workaround and would be interested if Braden and
> > Vincent could be fixed the same way... Rather than "bind_policy soft" which
> 
> That's not a scalable fix.  You'd need to manually change things every time you
> add a new package that adds another system group.
> 
> An appropriate fix at install-time I think would be to ask whether or not the
> NetworkManager interface(s) should start at boot, or automagically configure
> them to start at boot for any network-based authentication.  The problem is
> obviously in the LDAP server being unreachable.. and it doesn't make sense to
> have the network start after a user logs in -- especially if they're a
> network-based user.

It also doesn't make any sense to go over the network to get the groups for system uids like haldaemon or gdm; you shouldn't even have those users in a central LDAP server since they can vary per OS release.

A much better solution as I mentioned before is to look up uids < 500 locally; for example, having "nss_initgroups_uidminimum 500" or the like.  (time passes) In fact...nalin implemented this back in 2007.

http://cvs.fedoraproject.org/viewvc/devel/nss_ldap/nss_ldap-257-initgroups-minimum_uid.patch?view=log

I also see:

* Wed Nov  4 2009 Nalin Dahyabhai <nalin@redhat.com> 264-8
- add "rtkit" and "pulse" to the list of users whom we default to ignoring
  for looking up supplemental groups (Gordon Messmer, part of #186527)

So there's definitely work in nss_ldap for this; why are people still hitting issues here?  Can we change the default configuration to use the minimum_uid setting?
Comment 31 Simo Sorce 2009-12-01 11:23:19 EST
The problem is generally the initgroups call.
That call gives you a user name, not a uid.
I guess nss_ldap could be patched to read /etc/passwd and see if the user is there, and if it is, not go over the network and just return nothing.
Something like nss_initgroups_blacklist_passwd, but this would have to be implemented in each nss driver that speaks to a server over the network and does not support offline modes and detection natively, not just nss_ldap. Although nss_ldap is the most common so far.
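
The idea can be sketched in a few lines of Python. This is purely illustrative: nss_initgroups_blacklist_passwd is a proposed option, not an existing one, and the function names and the query_server callback are hypothetical stand-ins for whatever the module would really do.

```python
# Sketch only: models the proposed check, not any existing nss_ldap behaviour.

def local_user_names(passwd_text):
    """Collect login names from passwd(5)-format text."""
    names = set()
    for line in passwd_text.splitlines():
        if line and not line.startswith("#"):
            names.add(line.split(":", 1)[0])
    return names


def ldap_initgroups(user, passwd_text, query_server):
    """Return the user's groups from LDAP, skipping the network for local users."""
    if user in local_user_names(passwd_text):
        return []                 # local account: return nothing, never contact the server
    return query_server(user)     # only non-local users cost a network round trip


if __name__ == "__main__":
    passwd = "root:x:0:0:root:/root:/bin/bash\ndbus:x:81:81::/:/sbin/nologin\n"
    print(ldap_initgroups("dbus", passwd, lambda u: ["staff"]))   # []
    print(ldap_initgroups("alice", passwd, lambda u: ["staff"]))  # ['staff']
```

Because initgroups is handed a name rather than a uid, the check has to go through the local passwd data first, which is why a pure minimum-uid option is awkward here.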
Comment 32 Colin Walters 2009-12-01 13:36:58 EST
(In reply to comment #31)
> The problem is generally the initgroups call.
> that calls gives you a user name not a uid.

Ah, right.  I keep forgetting that, thanks.

> I guess nss_ldap could be patched to read /etc/passwd and see if the user is
> there, and if it it is not go over the network and just return nothing.

Well, it should return the data from the password file, not nothing; right?

> Something like nss_initgroups_blacklist_passwd but this would have to be
> implemented in each nss driver that speak to a server over the network and does
> not support offline modes and detection natively, not just nss_ldap. Although
> nss_ldap is the more common so far.

Ok.  Do we agree then this should be reassigned to nss_ldap?
Comment 33 Simo Sorce 2009-12-01 13:52:40 EST
(In reply to comment #32)
> (In reply to comment #31)
> > The problem is generally the initgroups call.
> > that calls gives you a user name not a uid.
> 
> Ah, right.  I keep forgetting that, thanks.
> 
> > I guess nss_ldap could be patched to read /etc/passwd and see if the user is
> > there, and if it it is not go over the network and just return nothing.
> 
> Well, it should return the data from the password file, not nothing; right?

Wrong, that data has already been returned by nss_files.

> > Something like nss_initgroups_blacklist_passwd but this would have to be
> > implemented in each nss driver that speak to a server over the network and does
> > not support offline modes and detection natively, not just nss_ldap. Although
> > nss_ldap is the more common so far.
> 
> Ok.  Do we agree then this should be reassigned to nss_ldap?

I guess it might make sense to reassign if the maintainer agrees that that is the solution.
Comment 34 Colin Walters 2009-12-01 14:04:13 EST
(In reply to comment #33)

> Wrong, that data has already been returned by nss_files.

Ah, because it spans modules.  Ok.

> I guess it might make sense to reassign if the maintainer agrees that that is
> the solution.  

Thanks for the assistance!
Comment 35 Peter Glassenbury 2009-12-01 16:18:00 EST
In response to comment #30
>you shouldn't even have those users in a
>central LDAP server since they can vary per OS release.
We don't so that is why it seems so strange it is doing the ldap lookup

In response to comment #29
> That's not a scalable fix.  You'd need to manually change things every time you
> add a new package that adds another system group.

At first I thought "of course" but on longer deliberation... there are probably a couple of issues...
The network should be up as you say..if the system is looking for ldap..so that is one issue

The current release of F12 already has a reasonably long list of users in the nss_initgroups_ignoreusers setting in ldap.conf. It is just not a complete list.
If that feature remains as part of ldap, it should include the standard system
users.

3rd issue: my nsswitch.conf has "passwd: files ldap". Why is it looking at ldap anyway, at least at a point in boot where all its questions can be answered
by looking in files? (This may just be that I don't know what nss_initgroups_ignoreusers does, but it looks to me as if there is no benefit in looking at ldap at the dbus point, as all the answers are in files.)
Do I need a feature [FOUND=return] - don't go on looking :-)  :-)
Comment 36 Simo Sorce 2009-12-01 16:31:19 EST
Because nsswitch unfortunately assumes that a user can be a member of groups in other databases.
I personally think that each database should be fully self-contained and memberships should never cross database boundaries; it is just unhealthy and risky.
But that's how nsswitch and the initgroups calls are designed and implemented in glibc, so the only thing you can do is cope with it.
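As a rough illustration of the behaviour described in this comment (a toy model, not glibc's code), the sketch below contrasts a passwd-style lookup, which stops at the first source that knows the user, with an initgroups-style lookup, which asks every source because memberships may live in any of them. The names Source, lookup_user and collect_groups are hypothetical, and an unreachable backend is modelled as an immediate TimeoutError rather than a long retry loop.

```python
class Source:
    """Toy stand-in for one NSS source such as 'files' or 'ldap' (hypothetical)."""

    def __init__(self, name, users, memberships, reachable=True):
        self.name = name
        self.users = set(users)          # user names this source knows
        self.memberships = memberships   # user name -> list of group names
        self.reachable = reachable
        self.queries = 0                 # how often this source was asked

    def _touch(self):
        self.queries += 1
        if not self.reachable:
            raise TimeoutError(f"{self.name}: backend unreachable")

    def has_user(self, user):
        self._touch()
        return user in self.users

    def groups_for(self, user):
        self._touch()
        return list(self.memberships.get(user, []))


def lookup_user(user, chain):
    """passwd-style lookup: the first source that knows the user wins."""
    for src in chain:
        if src.has_user(user):
            return src.name
    return None


def collect_groups(user, chain):
    """initgroups-style lookup: every source in the chain is consulted."""
    groups = []
    for src in chain:
        groups.extend(src.groups_for(user))
    return groups


files = Source("files", {"dbus"}, {"dbus": ["dbus"]})
ldap = Source("ldap", set(), {}, reachable=False)

# The user lookup is answered by 'files' and never reaches 'ldap'.
found_in = lookup_user("dbus", [files, ldap])
ldap_queries_after_user_lookup = ldap.queries

# The group collection still has to ask 'ldap', which is down.
try:
    collect_groups("dbus", [files, ldap])
    group_lookup_failed = False
except TimeoutError:
    group_lookup_failed = True
```

In this model the system account is fully described by the local source, yet the group collection still touches the network source, which matches the symptom reported in this bug.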
Comment 37 Richard Colley 2009-12-01 17:04:10 EST
(In reply to comment #36)
> I personally think that each database should be fully self-contained and
> memberships should never cross database boundaries, it is just unhealthy and
> risky.

That's not workable ... it would mean ldap users could never be members of local groups.  Membership of local groups is required for many system functions to work properly (e.g. groups video, audio, pulse, etc)
Comment 38 Simo Sorce 2009-12-01 17:11:24 EST
I agree local groups are sort of special, it's the other way around that makes little sense (local user members of ldap groups).
Comment 39 Braden McDaniel 2009-12-01 17:15:24 EST
I appreciate that with respect to the explanations of this problem offered here, it makes little sense that the solution I described in comment #27 would work. And yet it did. This makes me suspicious that there may be an alternative reason for this problem that is being overlooked. Would anyone care to try to reproduce my results?
Comment 40 Nalin Dahyabhai 2009-12-01 18:49:10 EST
(In reply to comment #30)
> A much better solution as I mentioned before is to look up uids < 500 locally;
> for example, having "nss_initgroups_uidminimum 500" or the like.  (time passes)
> In fact...nalin implemented this back in 2007.
> 
> http://cvs.fedoraproject.org/viewvc/devel/nss_ldap/nss_ldap-257-initgroups-minimum_uid.patch?view=log

That approach didn't work (I'm pretty sure it ended up hitting a deadlock when the module recursed into itself, but it's been a while), so the patch isn't applied.  I just didn't throw it away.
Comment 41 Zaphod Beeblebrox 2009-12-02 10:16:48 EST
(In reply to comment #20)
Regarding not seeing a list of recent users who logged on through gdm: this worked correctly in F11 but does not in F12. Doing a comparison, there was a difference in permissions on /var/log/ConsoleKit/history. In F11, this was world readable but in F12 it was not. When I changed the permissions to 644, the recent user functionality in gdm started working again.

Also, as pointed out previously several times now, the messagebus startup hang waiting on ldap before networkmanager has started the network interfaces was not an issue in F11. Is there any easy way to troubleshoot and figure out which userids need to be added to nss_initgroups_ignoreusers rather than just adding all local users < uid 500 ?
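For reference, the broad approach mentioned here (listing every local account below uid 500) can be automated. The following is a minimal sketch, assuming passwd-format input and assuming the option takes a comma-separated list of user names; the function name ignoreusers_line and the sample accounts are made up for illustration. It does not answer the narrower question of which accounts are strictly required.

```python
def ignoreusers_line(passwd_text, uid_limit=500):
    """Build an nss_initgroups_ignoreusers line from passwd-format text.

    Every account whose uid is below uid_limit is included, in file order.
    """
    names = []
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        # Skip malformed entries and NIS-style "+" lines without a numeric uid.
        if len(fields) < 3 or not fields[2].isdigit():
            continue
        if int(fields[2]) < uid_limit:
            names.append(fields[0])
    return "nss_initgroups_ignoreusers " + ",".join(names)


# Hypothetical sample; on a real system the text would come from /etc/passwd.
sample = """root:x:0:0:root:/root:/bin/bash
dbus:x:81:81:System message bus:/:/sbin/nologin
named:x:25:25:Named:/var/named:/sbin/nologin
alice:x:500:500::/home/alice:/bin/bash
"""

print(ignoreusers_line(sample))
```

The resulting line would still need to be reviewed and merged into ldap.conf by hand, and regenerated whenever a package adds a new system account, which is the scalability concern raised in comment 29.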
Comment 42 Chris Paulson-Ellis 2009-12-22 09:34:00 EST
Another data point...

I have a system that uses "files ldap" for passwd, shadow & group in /etc/nsswitch.conf. It suffers from LDAP lookup timeouts during startup and shutdown of named caused by the "chown root.named" line in the init script.

This is despite /etc/ldap.conf having all local users in nss_initgroups_ignoreusers (including root & named) and host set rather than uri.

This makes no sense to me. What is chown looking up that cannot be satisfied by /etc/passwd and /etc/group and isn't suppressed by nss_initgroups_ignoreusers?
Comment 43 Andrew McNabb 2010-01-31 22:16:02 EST
Hey.  I've been on the cc list for the other two duplicates of this bug report (#182464 and #186527), but I just found out about this one.  Now I'm on all three. :)  Bug #553032 is a related, but slightly different issue.  Fun stuff.
Comment 44 Andrew McNabb 2010-03-17 15:07:50 EDT
Is there any reason this bug can't be merged with #182464 and #186527?  On bug #186527, a few of us have observed that simply moving the ldap server earlier in the boot order is a nice, clean fix to the problem.  I tried moving it to S21ldap (in front of S22messagebus), and Andrew Meredith tried moving it to S12ldap.  We both found that this solves the problem cleanly.  Is there any chance that this fix could go into Fedora 13?
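The reordering described here relies on SysV start links running in lexical order of their names. A small sketch of how one might check a proposed order follows; the helper names are hypothetical, and S27ldap is only an assumed example of a "too late" position, not a value taken from this report.

```python
def start_order(link_names):
    """Return service names in the order their S* links would run.

    Assumes the conventional 'S' + two digits + service name layout.
    """
    starts = sorted(name for name in link_names if name.startswith("S"))
    return [name[3:] for name in starts]


def starts_before(link_names, first, second):
    """True if service 'first' is started before service 'second'."""
    order = start_order(link_names)
    return order.index(first) < order.index(second)


# Assumed "before" layout: ldap after messagebus (S27 is a hypothetical number).
before = ["K10example", "S22messagebus", "S27ldap"]
# Layouts tried in this comment: S21ldap and S12ldap ahead of S22messagebus.
after_s21 = ["K10example", "S21ldap", "S22messagebus"]
after_s12 = ["K10example", "S12ldap", "S22messagebus"]

print(starts_before(before, "ldap", "messagebus"))
print(starts_before(after_s21, "ldap", "messagebus"))
print(starts_before(after_s12, "ldap", "messagebus"))
```

This only applies when the LDAP server runs on the same machine; a pure client has no local ldap service to move.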
Comment 45 Dmitri Pal 2010-05-21 07:45:24 EDT
In general the recommended alternative to nss_ldap is SSSD.
Comment 46 Aurelien Gouny 2010-07-13 02:24:51 EDT
I opened a related IT#501563 a while ago, which referred me to BZ#174064.
The 'workaround' works but as mentioned in comment 29 this is not scalable at all.

Is there any plan in the pipeline ?

Cheers,
- Aurelien.
Comment 47 Dmitri Pal 2010-07-13 10:26:52 EDT
(In reply to comment #46)
> I opened a related IT#501563 a while ago, which referred me to BZ#174064.
> The 'workaround' works but as mentioned in comment 29 this is not scalable at
> all.
> 
> Is there any plan in the pipeline ?
> 
> Cheers,
> - Aurelien.    

The plan is to use SSSD. It has internal cache so even if the network is not available the system will be able to start smoothly.

To read more about SSSD look here:
https://fedorahosted.org/sssd/
Comment 48 Braden McDaniel 2010-07-14 00:36:11 EDT
Just some notes from a recent sssd experience...

I had installed sssd on my (F13) system, but I hadn't gotten around to learning, configuring, and enabling it. Well, I happened to run authconfig-tui just recently; and along the way, I noticed that values were repeated in the fields for some of the servers. I cleaned these up as I proceeded through the wizard. As I finished, it automatically started up sssd.

With some trepidation, I started poking around a bit. I found I couldn't log in as anyone but root. I inspected sssd.conf and found that the ldap_uri had a repeated value (not too unlike the goofy ones I saw in authconfig--either I missed this one or it wasn't presented in authconfig-tui). I cleaned it up and restarted sssd. After that, things seem to be working fine.

And no more long pause on bootup!
Comment 49 Dmitri Pal 2010-07-14 08:52:36 EDT
Thank you for trying it.

Can you please describe in more detail what the issues with authconfig-gui were?
If you have a copy of your old configuration, that would be great. Please open a separate bug then.

A smooth migration from a pre-SSSD configuration to an SSSD configuration was definitely a goal, but it seems that we were not able to cover all use cases.

Thanks
Dmitri
Comment 50 Braden McDaniel 2010-07-14 12:14:45 EDT
I'm using kerberos for authentication and LDAP for user information.

Specifically, my issues were with authconfig-tui. I tried authconfig-gui on a different box and did not experience the same problems. Whether that was due to differences between authconfig-gui and -tui or some subtle difference in the configurations on the boxes, I can't say.

Specifically, in the Kerberos Settings, both the KDC and the Admin Server had the server:portname repeated and separated by a comma; e.g.:

  kerberos.example.com:88,kerberos.example.com:88

Then, when I looked in sssd.conf (after authconfig-tui finished), ldap_uri looked like this:

  ldap_uri = ldap.example.comldap://ldap.example.com

No comma (or other separator) this time; and, yes, the URI scheme did only occur on the second copy.

I'm afraid I don't have a copy of my configuration from before I ran authconfig-tui.
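A quick sanity check would have flagged the fused value above. The sketch below assumes ldap_uri holds a comma-separated list of URIs and that only the ldap, ldaps and ldapi schemes are expected; both the function name check_ldap_uri and that scheme set are assumptions for illustration, not part of SSSD or authconfig.

```python
from urllib.parse import urlsplit

# Assumed set of acceptable schemes for this check.
ALLOWED_SCHEMES = {"ldap", "ldaps", "ldapi"}


def check_ldap_uri(value):
    """Return a list of human-readable problems found in an ldap_uri value."""
    problems = []
    seen = set()
    for item in (part.strip() for part in value.split(",")):
        if not item:
            problems.append("empty entry")
            continue
        parts = urlsplit(item)
        if parts.scheme not in ALLOWED_SCHEMES:
            problems.append(f"unexpected scheme in {item!r}")
        elif not parts.netloc:
            problems.append(f"no host in {item!r}")
        if item in seen:
            problems.append(f"duplicate entry {item!r}")
        seen.add(item)
    return problems


print(check_ldap_uri("ldap://ldap.example.com"))
print(check_ldap_uri("ldap.example.comldap://ldap.example.com"))
```

The fused value is caught because everything before the first "://" is treated as the scheme, and "ldap.example.comldap" is not in the allowed set.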
Comment 51 Jakub Hrozek 2010-07-16 07:24:16 EDT
(In reply to comment #50)
> I'm using kerberos for authentication and LDAP for user information.
> 
> Specifically, my issues were with authconfig-tui. I tried authconfig-gui on a
> different box and did not experience the same problems. Whether that was due to
> differences between authconfig-gui and -tui or some subtle difference in the
> configurations on the boxes, I can't say.
> 
> Specifically, in the Kerberos Settings, both the KDC and the Admin Server had
> the server:portname repeated and separated by a comma; e.g.:
> 
>   kerberos.example.com:88,kerberos.example.com:88
> 
> Then, when I looked in sssd.conf (after authconfig-tui finished), ldap_uri
> looked like this:
> 
>   ldap_uri = ldap.example.comldap://ldap.example.com
> 
> No comma (or other separator) this time; and, yes, the URI scheme did only
> occur on the second copy.
> 
> I'm afraid I don't have a copy of my configuration from before I ran
> authconfig-tui.    

Braden,
this really sounds like an authconfig-tui issue. Unfortunately, I could not reproduce it with either a vanilla F13 install or my production machine. If you could reproduce it somewhere else, it would be nice to have a bug report.

I couldn't find any canonical source on this, but I think that the interactive authconfig-tui is considered obsolete and the noninteractive authconfig should be used for non-GUI configuration.

For your use-case, the invocation would look something like:
# authconfig --update --enableldap --ldapserver=ldap.example.com --ldapbasedn=<DN> --enablekrb5 --krb5kdc=kerberos.example.com --krb5realm=EXAMPLE.COM

On Fedora 13 and newer, SSSD is configured by default including NSS switch and PAM configuration.
Comment 52 Stephen Gallagher 2010-07-19 07:36:54 EDT
Requesting confirmation from Tomas Mraz (owner of authconfig) regarding the obsolescence of authconfig-tui.
Comment 53 Tomas Mraz 2010-08-02 06:21:08 EDT
authconfig-tui is obsolete in the sense that we do not add new options or other features to it. However, this seems more like a bug that could and should be fixed. I would need a concrete reproducer showing how you use authconfig-tui to get the broken config.
Comment 54 Bug Zapper 2010-11-04 07:13:24 EDT
This message is a reminder that Fedora 12 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 12.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '12'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 12's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 12 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 55 Tom London 2010-11-04 09:20:00 EDT
Running rawhide.

My setup has changed and I can no longer test this.
Comment 56 Vincent Danen 2010-11-05 13:56:06 EDT
FWIW, I have not seen this issue in Fedora 13 and 14.  I suspect using SSSD has cleared it up, or something else has.  From my point of view, this is resolved (and Fedora 12 is nearing EOL so doubtful anyone will invest effort to fix it there if it's still an issue).  I am unable to test F12.
Comment 57 Stephen Gallagher 2010-11-05 14:14:22 EDT
Closing this bug as fixed in the current release.

If this problem reappears, please reopen this bug and set the version appropriately.
Comment 58 Andrew Zabolotny 2010-12-02 09:26:13 EST
Upgraded to Fedora 14 at work today and got this problem again. My configuration is a purely client machine, i.e. the LDAP server is a separate machine which is up and configured fine.

After booting I got no GDM login screen, just console logins. When trying to log in at the console, it hangs (after entering id/password) then after some time returns to the login prompt. If waiting long enough (~10 minutes) the GDM screen shows up, but I was unable to log in there.

Booted in single mode, started the network. "getent passwd" worked fine, "getent group" as well, i.e. the LDAP configuration is set up correctly and the LDAP server returns correct entries.

Still, I continued to see the following in log files (when booting not in single mode):

dbus-daemon: nss_ldap: reconnecting to LDAP server (sleeping 8 seconds)...
dbus-daemon: nss_ldap: reconnecting to LDAP server (sleeping 16 seconds)...
dbus-daemon: nss_ldap: reconnecting to LDAP server (sleeping 32 seconds)...
dbus-daemon: nss_ldap: reconnecting to LDAP server (sleeping 64 seconds)...
dbus-daemon: nss_ldap: could not search LDAP server - Server is unavailable

Then this:

gnome-session[2554]: libupower-glib-WARNING: Couldn't connect to system bus: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

So, it looks like dbus-daemon couldn't connect for some reason to LDAP server. However, "getent" works fine (and shows data from LDAP) so I can't understand why dbus-daemon cannot reach it.
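To get a feel for how long one such retry cycle blocks, the sketch below sums the sleep intervals visible in the log excerpt above. The doubling schedule is inferred from those log lines only, not from the nss_ldap source, and the helper name backoff_delays is hypothetical.

```python
def backoff_delays(first, attempts):
    """Return a doubling sequence of sleep intervals, in seconds."""
    return [first * 2 ** i for i in range(attempts)]


# Intervals as they appear in the excerpt: 8, 16, 32, 64 seconds.
delays = backoff_delays(8, 4)
total_seconds = sum(delays)

print(delays, total_seconds)
```

If several lookups each go through a cycle like this, the waits add up, which would be consistent with the multi-minute delay before the GDM screen appeared.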

Now I added "bind_policy soft" as suggested above to /etc/nss_ldap.conf. The machine booted and logged in fine, but some programs (notably pulseaudio) refused to start because they could not connect to dbus-daemon:

dbus-daemon: nss_ldap: could not search LDAP server - Server is unavailable
gnome-session[2186]: libupower-glib-WARNING: Couldn't connect to system bus: [...]
dbus-daemon: nss_ldap: could not search LDAP server - Server is unavailable
gnome-session[2186]: WARNING: Could not connect to ConsoleKit:[...]
pulseaudio[2275]: core-util.c: Failed to connect to system bus: Did not receive a reply.[...]

and lots and lots of similar messages.

What's strange is that nss_ldap in the dbus-daemon process cannot reach the LDAP server even after the network is up and everything works fine (checked from an SSH session - btw, SSH logins worked just fine, unlike console logins). I.e. from an SSH console I can access the LDAP server, issue "getent passwd" and the like, and at the same time trying to log in via the console produces "could not search LDAP server" messages in the syslog.

Now something clicked inside my head and I moved the nscd service to start before messagebus service. And voila - everything started booting normally and all error messages disappeared from syslog.

Net result: lost a whole work day, and still don't understand the roots of the problem.
Comment 59 Stephen Gallagher 2010-12-02 09:35:11 EST
> Now something clicked inside my head and I moved the nscd service to start
> before messagebus service. And voila - everything started booting normally and
> all error messages disappeared from syslog.
> 
> Net result: lost a whole work day, and still don't understand the roots of the
> problem.

As stated above, the problem is that some D-BUS services require access to users before the network is available for nss_ldap to serve them. So in order to have these services start, you need to ensure that a local cache of the users is available before the messagebus starts (either nscd or SSSD would work).
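The idea of a local cache answering while the network source is down can be sketched as follows. This is a generic illustration only and does not model how nscd or SSSD are actually implemented; CachingResolver and the backend function are hypothetical.

```python
class CachingResolver:
    """Serve the last known answer when the backend cannot be reached."""

    def __init__(self, backend):
        self._backend = backend   # callable: user name -> entry, may raise OSError
        self._cache = {}

    def lookup(self, user):
        try:
            entry = self._backend(user)
        except OSError:
            # Backend unreachable: fall back to the cached entry if there is one.
            if user in self._cache:
                return self._cache[user]
            raise
        self._cache[user] = entry
        return entry


# Hypothetical backend whose availability can be toggled for the demo.
state = {"up": True}


def backend(user):
    if not state["up"]:
        raise ConnectionError("directory server unreachable")
    return {"name": user, "uid": 1000}


resolver = CachingResolver(backend)
first = resolver.lookup("alice")      # answered by the backend, then cached
state["up"] = False
second = resolver.lookup("alice")     # answered from the cache
```

A user that was never looked up while the backend was reachable still fails in this model, so the cache has to be populated before it is needed.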
Comment 60 Andrew Zabolotny 2010-12-03 13:51:12 EST
Is it normal that a user enables LDAP in system-config-authentication and the operating system stops booting after that? I think not.

The accounts that are requested through the D-BUS service are system accounts, present in passwd file. Maybe this is a bug in the resolver, which always queries nss_ldap, no matter if that's required or not?
