Bug 182464

Summary: dbus-daemon hangs while starting on a system with ldap authorisation enabled.
Product: Fedora
Component: nss_ldap
Version: 12
Hardware: i386
OS: Linux
Status: CLOSED WONTFIX
Severity: high
Priority: medium
Reporter: Alastair Neil <aneil2>
Assignee: Nalin Dahyabhai <nalin>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: afm, alexandre.magaz, amcnabb, andreas.bierfert, arechenberg, diccon.tesson, dqarras, gasi, gdr, johnp, libin.charles, mattdm, orion, raina, rodrigo, shawn.somers, squinney, triage, zing
Keywords: Reopened, Triaged
Whiteboard: bzcl34nup
Fixed In Version: 248-3
Doc Type: Bug Fix
Last Closed: 2010-12-05 07:18:25 UTC
Attachments: strace output of dbus-daemon

Description Alastair Neil 2006-02-22 19:02:09 UTC
Description of problem:
The messagebus service hangs on boot on a system with ldap auth configured.

Version-Release number of selected component (if applicable):
dbus-0.60-7.2
kernel-2.6.15-1.1969_FC5

How reproducible:
always

Steps to Reproduce:
1. Boot
  
Actual results:
hang until bored

Expected results:
fedora niceness

Additional info:
The workaround is to remove the entry for ldap from the group line in
/etc/nsswitch.conf, which is not an acceptable long-term solution.

Comment 1 John (J5) Palmieri 2006-02-23 15:46:59 UTC
A bug where d-bus is put in an infinite loop because of missing group
information might be what you are seeing.  I am doing a new upstream release
and will package it up for Fedora.

Comment 2 John (J5) Palmieri 2006-02-23 17:22:59 UTC
I was just informed by our ldap guru here that the ldap module does not like
threaded apps.  D-Bus uses threads for listening to SELinux avc denial messages
over netlink.  

You can work around the issue by rebuilding the package with SELinux disabled if
it is important.

Fixes to the ldap module are being looked at and I will be looking at moving the
SELinux code to use the mainloop instead of a thread but I am not sure how long
it will take to get these issues resolved.

Comment 3 Nalin Dahyabhai 2006-02-24 00:34:59 UTC
This is an nss_ldap bug, which should be fixed in 248-3.  Please reopen this bug
if you find that this is not the case.

Comment 4 John (J5) Palmieri 2006-02-27 16:00:27 UTC
*** Bug 181305 has been marked as a duplicate of this bug. ***

Comment 5 W. Michael Petullo 2006-03-20 02:33:28 UTC
This bug still remains for me when using:

udev-084-13
nss_ldap-249-1

The udev service hangs unless my LDAP server is running.

Alastair, do you still have this problem?

Comment 6 Nalin Dahyabhai 2006-03-20 15:24:38 UTC
W., does your /etc/ldap.conf include a "nss_initgroups_ignoreusers root,ldap"
setting?  Without it, you'd at least hit long delays as nss_ldap timed out
attempting to contact a directory server.  Until 248-3, that would have
deadlocked apps which linked against libpthread (which includes the D-BUS daemon).
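
For anyone checking their own configuration, the setting asked about above can be read out of the file with a few lines of Python. This is only a sketch: the helper name `ignored_users` and the sample text are made up for illustration, and it assumes the option is a single comma-separated list on one line, as in the setting quoted in this comment.

```python
def ignored_users(ldap_conf_text):
    """Return the users listed in nss_initgroups_ignoreusers (hypothetical helper)."""
    users = set()
    for raw in ldap_conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and whole-line comments.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if parts[0] == "nss_initgroups_ignoreusers" and len(parts) == 2:
            users.update(u.strip() for u in parts[1].split(",") if u.strip())
    return users


# Illustrative fragment only, not a complete /etc/ldap.conf.
sample = """
# users for which supplemental group lookups should skip LDAP
nss_initgroups_ignoreusers root,ldap
"""
print(sorted(ignored_users(sample)))  # ['ldap', 'root']
```

On a real system the text would come from `open("/etc/ldap.conf").read()`; an empty result means the setting is absent.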

Comment 7 Alastair Neil 2006-03-20 17:44:44 UTC
No I do not have the problem any more.  For me this has not been an issue for
the last couple of weeks.  I have the same udev and nss_ldap as Michael and I'm
running  2.6.15-1.2054_FC5.

Comment 8 Devin Reade 2006-03-23 04:24:22 UTC
This still occurs with FC5.

In particular, I have a machine that was originally FC4 with current patches.
It uses LDAP for users/groups/automount and is itself the (only) ldap server.

After performing the upgrade from FC4 to FC5 via the FC5 DVD, it hung on the
"starting system message bus" line.  It responded to ctrl-alt-delete, so I
rebooted in single user mode and examined /var/log/messages, which gave a
bunch of entries, thus:

nss_ldap: failed to bind to LDAP server ldap://127.0.0.1: Can't contact LDAP server

This caused me to look, of course, at LDAP, whereupon I merged in the
*.rpmnew changes to /etc/ldap.conf, including the nss_initgroups_ignoreusers
line, to no effect (on reboot).

After locating this bug item, I've commented out 'ldap' from the 'group' line
in nsswitch.conf, which has allowed me to bring the system up higher than
single user mode.  Obviously this is not an acceptable workaround for
production, though.

I'm currently merging in all other *.rpmnew files that were created during
the FC4->FC5 upgrade, but the list doesn't contain any other suspect candidates.

kernel-2.6.15-1.2054_FC5
udev-084-13
nss_ldap-249-1

Comment 9 Nalin Dahyabhai 2006-03-23 16:57:37 UTC
Does adding "dbus" to the nss_initgroups_ignoreusers list solve this?  I suspect
that it will, because it's reasonable for the message bus daemon to set up its
supplemental groups list before dropping privileges to run as that user.

If it does, this is going to need a better long-term solution (one where we
don't have to eventually add all system users to this line, which would suck).

Comment 10 Devin Reade 2006-03-24 05:18:09 UTC
No, adding dbus to nss_initgroups_ignoreusers had no effect.

I did some spelunking with strace, followed by code inspection of libnss_ldap.
It turns out that the information referenced by nss_initgroups_ignoreusers
is only used _after_ the library attempts to connect to the ldap server.

As a temporary work-around, I found that setting 'bind_policy soft' in
/etc/ldap.conf was sufficient to get the machine fully running when having 
'group: files ldap' in nsswitch.conf.  However I'd prefer to not be using
'soft' after the machine is running, so another solution is preferred.

I would contend that with a 'files ldap' ordering in nsswitch.conf and a
match on nss_initgroups_ignoreusers, the ldap connection should not
occur.
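
The ordering argued for here can be sketched in Python. Everything in this block is hypothetical: `initgroups_lookup`, `FakeDirectory`, and `groups_for` are illustrative names, not functions from libnss_ldap. The sketch only shows the ignore list being consulted before any connection attempt.

```python
class FakeDirectory:
    """Stand-in for an LDAP server that counts connection attempts."""

    def __init__(self):
        self.connect_calls = 0

    def connect(self):
        self.connect_calls += 1
        return self

    def groups_for(self, user):
        return ["staff"]


def initgroups_lookup(user, ignored, connect):
    """Hypothetical flow: check the ignore list first, connect only if needed."""
    if user in ignored:
        return []  # no LDAP groups and, importantly, no connection attempt
    conn = connect()  # in the real library this is the step that can block
    return conn.groups_for(user)


directory = FakeDirectory()
ignored = {"root", "dbus"}
print(initgroups_lookup("dbus", ignored, directory.connect), directory.connect_calls)
print(initgroups_lookup("alice", ignored, directory.connect), directory.connect_calls)
```

With this ordering an ignored user never triggers `connect()`, which is the behaviour the strace output suggests is missing.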

Comment 11 Devin Reade 2006-03-24 05:30:50 UTC
Created attachment 126602 [details]
strace output of dbus-daemon

strace output of dbus-daemon obtained by hacking /etc/init.d/messagebus.
Notice the connect(2) calls to 127.0.0.1.

Comment 12 VALOIS, Pascal 2006-03-24 10:22:57 UTC
Another way of resolving the problem, temporarily or permanently, without
affecting the way your box is configured is to ensure that the ldap server is
running when messagebus launches.

In /etc/rc3.d/ and /etc/rc5.d we can see:

S22messagebus
S27ldap

Renaming messagebus to S28messagebus solves the problem.

Comment 13 John (J5) Palmieri 2006-03-24 16:49:27 UTC
D-BUS needs to start early in the process as future components may rely on the
system bus.  This is not a fix we can use as default in the distro unless of
course ldap can start earlier.  BTW we do not key off the dbus user but off of
special user 81 which may change names in the future but will always be uid 81. 

Comment 14 VALOIS, Pascal 2006-03-24 21:17:40 UTC
OK, so leave messagebus at S22 and try S21 for ldap.
Between S22 and S27 there are only bluetooth, netfs, and hidd (which
depends on bluetooth).

So nothing seems to prevent ldap from being launched just before messagebus.
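
The effect of the two renaming proposals can be illustrated with a small Python sketch. It assumes the usual SysV behaviour that start scripts in an rc directory run in lexical order of their names; `start_order` is an illustrative helper, not part of any init tooling.

```python
def start_order(names):
    """Return the S* scripts in the order a lexical sort would run them."""
    return sorted(n for n in names if n.startswith("S"))


current = ["S27ldap", "S22messagebus"]
move_messagebus_later = ["S27ldap", "S28messagebus"]
move_ldap_earlier = ["S21ldap", "S22messagebus"]

print(start_order(current))                # ['S22messagebus', 'S27ldap']
print(start_order(move_messagebus_later))  # ['S27ldap', 'S28messagebus']
print(start_order(move_ldap_earlier))      # ['S21ldap', 'S22messagebus']
```

Both renames put ldap ahead of messagebus; the second keeps messagebus at its early S22 slot.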


Comment 15 Devin Reade 2006-03-24 21:46:05 UTC
I guess as long as nobody decides to add messagebus support into slapd ...

FWIW I also noticed that rpc.statd (which gets started before messagebus)
also reports problems in /var/log/messages, however _it_ is able to not
hang but rather retry later.  So how is rpc.statd using libnss_ldap differently
from dbus-daemon such that the former doesn't hang but the latter does?

(rpc.statd is started via S14nfslock)


Comment 16 Andreas Bierfert 2006-03-25 10:06:45 UTC
Starting ldap before messagebus is IMHO the right choice for now. Even after
fixing the other upgrade issues for ldap (like path changes), my server still
won't start on its own if I have ldap auth enabled during boot. Lots of things
just hang. It would be nice if we could find a solution quickly and push an
update so that other users who upgrade or install don't run into this wall.

Comment 17 Red Hat Bugzilla 2007-02-05 19:23:16 UTC
REOPENED status has been deprecated. ASSIGNED with keyword of Reopened is preferred.

Comment 18 Matthew Miller 2007-04-06 18:02:17 UTC
Fedora Core 5 and Fedora Core 6 are, as we're sure you've noticed, no longer
test releases. We're cleaning up the bug database and making sure important bug
reports filed against these test releases don't get lost. It would be helpful if
you could test this issue with a released version of Fedora or with the latest
development / test release. Thanks for your help and for your patience.

[This is a bulk message for all open FC5/FC6 test release bugs. I'm adding
myself to the CC list for each bug, so I'll see any comments you make after this
and do my best to make sure every issue gets proper attention.]


Comment 19 Devin Reade 2007-04-06 19:14:41 UTC
Why was this marked as closed without an explanation?  Does that mean it's
fixed in FC7?  What action was taken?  I don't have a virgin FC6 handy to
look at, but it seems to me that neither the nsswitch.conf workaround nor
the change in the messagebus vs slapd startup sequence has been applied.

Comment 20 Matthew Miller 2007-04-06 19:22:39 UTC
Not sure. The original bug reporter closed the issue. (Presumably because it
works for him, as per #7.) I'm going to reopen, put to "devel", and someone
familiar with this particular issue can then decide where to go from there.

Comment 21 Bug Zapper 2008-04-03 17:00:33 UTC
Based on the date this bug was created, it appears to have been reported
against rawhide during the development of a Fedora release that is no
longer maintained. In order to refocus our efforts as a project we are
flagging all of the open bugs for releases which are no longer
maintained. If this bug remains in NEEDINFO thirty (30) days from now,
we will automatically close it.

If you can reproduce this bug in a maintained Fedora version (7, 8, or
rawhide), please change this bug to the respective version and change
the status to ASSIGNED. (If you're unable to change the bug's version
or status, add a comment to the bug and someone will change it for you.)

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we're following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

Comment 22 Andrew McNabb 2008-04-10 20:39:16 UTC
I'm seeing this in rawhide.  I just ran into it today.  Running `service
messagebus start` hangs on boot.  By booting into single user mode, I found that
I can get messagebus to start if and only if I replace:

group:          files ldap

with

group:          files

When I ran `strace -f service messagebus start`, I noticed that it would do a
bunch of stuff including network activity, call nanosleep and wait a while, do
the same network stuff, call nanosleep and wait, and so on.

By the way, I did not see this problem in Fedora 8, even though I had the exact
same configuration.
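
To check which sources a given database consults, the relevant line of nsswitch.conf can be parsed with a short Python sketch. `nss_sources` is an illustrative helper, not a glibc interface, and it assumes the simple `database: source source` layout shown above, with `#` starting a comment.

```python
def nss_sources(nsswitch_text, database):
    """Return the source list for one database, e.g. ['files', 'ldap']."""
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if line.startswith(database + ":"):
            return line.split(":", 1)[1].split()
    return []


hanging = "passwd:         files ldap\ngroup:          files ldap\n"
working = "passwd:         files ldap\ngroup:          files\n"

print(nss_sources(hanging, "group"))  # ['files', 'ldap']
print(nss_sources(working, "group"))  # ['files']
```

Action items such as `[NOTFOUND=return]` would show up as extra tokens in the returned list; the sketch does not interpret them.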

Comment 23 Andrew McNabb 2008-04-10 20:48:19 UTC
By the way, it looks like this bug and 221199 are the same bug.  However, root
and nscd both already appear in nss_initgroups_ignoreusers in /etc/ldap.conf, so
that didn't really help.

Comment 24 Andrew McNabb 2008-04-10 20:49:55 UTC
It also looks like 186527 might be the same.

Comment 25 John Poelstra 2008-05-06 03:17:38 UTC
Nalin, is comment #24 correct?

Comment 26 Nalin Dahyabhai 2008-05-06 19:02:44 UTC
(In reply to comment #25)
> Nalin, is comment #24 correct?

It sure looks that way.  The distinction between LDAP and LDAPS turned out to
not make a difference in #186527, so I'm left concluding it's getting stuck
enumerating the supplemental groups for a user listed in one of the files it's
reading at startup, just as it does here.

Comment 27 John Poelstra 2008-05-07 02:24:56 UTC
(In reply to comment #26)
> (In reply to comment #25)
> > Nalin, is comment #24 correct?
> 
> It sure looks that way.  The distinction between LDAP and LDAPS turned out to
> not make a difference in #186527, so I'm left concluding it's getting stuck
> enumerating the supplemental groups for a user listed in one of the files it's
> reading at startup, just as it does here.

Nalin, close one as a duplicate or keep them both open?

Comment 28 Mike Chambers 2008-05-07 14:37:20 UTC
Changing to ASSIGNED for now until a conclusion/duplicate is figured out.  This
will keep it from being closed by the bug zapper program.

Comment 29 Daniel Qarras 2008-05-07 17:44:17 UTC
I must be doing something wrong because for me boot does not hang. I configured
nsswitch.conf and pam.d/system-auth to use LDAP and in /etc/ldap.conf I use
localhost as my server. When rebooting either with or without ldap service, boot
always proceeds normally.

How do you reproduce this on pristine Fedora 9?

Comment 30 Bug Zapper 2008-05-14 02:05:53 UTC
Changing version to '9' as part of upcoming Fedora 9 GA.
More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 31 Andrew Rechenberg 2008-12-18 13:53:05 UTC
This also occurs on F10 with all updates as of yesterday (20081217).

I have the following in /etc/ldap.conf:

nss_initgroups_ignoreusers root,ldap,named,avahi,haldaemon,dbus

I am using nss_ldap to retrieve user information from an Active Directory domain with an ldaps:// URI and ssl set to yes in /etc/ldap.conf

This problem (at least for me) is network-related: the network hasn't come up yet because in F10 it is controlled by NetworkManager, and NetworkManager depends on messagebus being started.

I ran chkconfig to turn the network service on, since it was off. Its start priority is 10, so it starts before messagebus, and this change resolves the messagebus hang at boot time for me.

Comment 32 Shawn Somers 2009-03-06 15:15:30 UTC
Have had this problem with multiple Fedora releases, including 10.

A reasonable fix that has worked for me so far is to modify /etc/ldap.conf and add `bind_policy soft' to the end of the file.

After adding that line, the system no longer goes into long delay loops while booting, and I am still able to use all functionality.


My question, to add to this bug, is: why does this relatively minor policy change seem to fix it for me?

Comment 33 Herbert Gasiorowski 2009-03-16 09:30:53 UTC
Seems to exist here on every Fedora 10 system with LDAP too!

Comment 34 Bug Zapper 2009-06-09 22:07:01 UTC
This message is a reminder that Fedora 9 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 9.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '9'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 9's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 9 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 35 Andrew McNabb 2009-06-19 22:25:02 UTC
The version for this bug needs to be updated to 11.

Comment 36 Orion Poplawski 2009-09-30 20:31:17 UTC
Present in rawhide.  

Found the source of the other user lookups by dbus-daemon - the "user=" lines in all of the /etc/dbus-1/system.d/*.conf files.  In my case I needed to add nm-openconnect, rtkit, and pulse to nss_initgroups_ignoreusers.  But this is clearly not a maintainable solution.  Perhaps ignoring (optionally but by default) uids under 500 would be at least a little better.
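
A rough Python sketch for collecting those user names is below. `policy_users` is an illustrative helper, and the sample XML (including the `org.example.*` names) is invented for the demonstration; it only assumes that the names of interest appear as `user="..."` attributes in the bus configuration files.

```python
import xml.etree.ElementTree as ET


def policy_users(xml_text):
    """Return the distinct user="..." values found in one bus config document."""
    root = ET.fromstring(xml_text)
    names = {el.get("user") for el in root.iter() if el.get("user")}
    names.discard("*")  # a wildcard is not a user to look up
    return sorted(names)


# Invented example in the shape of a system.d policy file.
sample = """<busconfig>
  <policy user="root"><allow own="org.example.Service"/></policy>
  <policy user="pulse"><allow own="org.example.Audio"/></policy>
  <policy context="default"><deny own="org.example.Service"/></policy>
</busconfig>"""

print(policy_users(sample))  # ['pulse', 'root']
```

Running this over each file matched by `glob.glob("/etc/dbus-1/system.d/*.conf")` would give the candidates for nss_initgroups_ignoreusers, which also shows why the list keeps growing as packages are installed.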

Comment 37 Orion Poplawski 2009-09-30 20:36:57 UTC
(In reply to comment #32)
> A reasonable fix that has worked for Me so far it to modify /etc/ldap.conf add 
> `bind_policy soft' to the end of the file.
> 
> after adding that line, the system no longer goes into long delay loops while
> booting, and I am still able to use all functionality.
> 
> My question, to add to this bug, is, Why does this relatively minor policy
> change seem to fix it for Me??  

It's not a minor change:
# Reconnect policy: hard (default) will retry connecting to
# the software with exponential backoff, soft will fail
# immediately.

To some extent it comes down to whether LDAP is critical or not.
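
To give a feel for why the two policies behave so differently at boot, here is a Python sketch of a doubling retry schedule. The starting delay, the cap, and the number of tries are made-up illustration values, not the timings nss_ldap actually uses.

```python
def backoff_delays(tries, first_delay=1.0, max_delay=64.0):
    """Return the wait before each retry when the delay doubles up to a cap."""
    delays = []
    delay = first_delay
    for _ in range(tries):
        delays.append(delay)
        delay = min(delay * 2, max_delay)
    return delays


hard = backoff_delays(6)  # keeps retrying with growing waits
soft = backoff_delays(0)  # fails immediately, no retries

print(hard, sum(hard))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0] 63.0
print(soft, sum(soft))  # [] 0
```

With the illustrative numbers, a single failed lookup under a hard-style policy costs about a minute, and a daemon that performs many such lookups at startup multiplies that.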

Comment 38 Devin Reade 2009-09-30 21:52:31 UTC
Since 2006 there's been a lot of talk regarding startup sequences and config changes.  In comment 10 I described how, based on code inspection at the time, there appeared to be an inverted program flow with respect to referencing ignoreusers and connecting to ldap.  There is no indication in this report that this aspect was ever considered (even to the point of someone saying, "no, you read it wrong"). I admit that I've not looked at the source since then, so I don't know if it still exists, but perhaps one of the current developers would be interested in examining that possibility?

Comment 39 Andrew McNabb 2009-09-30 22:29:09 UTC
Comments #12 through #16 suggest changing the startup order, changing from S27ldap to S21ldap to put it before S22messagebus.  I haven't seen any arguments against this.  It seems like this suggestion and Comment #36 would both be better than what we have now.

This bug has been open for 3 and a half years without any visible progress.  It's frustrating that nothing has happened yet.  By the way, bug #186527, which seems to be about the same issue, is still open, too.

Comment 40 Daniel Qarras 2009-10-03 10:20:13 UTC
As I mentioned in the other report referenced, bug #186527, I think SSSD might well fix this when enabled. Providing a mechanism that lets the OS work properly even when external databases are offline sounds like a better approach than making exceptions for a large set of system-level users and services.

https://fedorahosted.org/sssd/
https://fedoraproject.org/wiki/Features/SSSD

I'll try this after the F12 Beta is out; that is probably the best way to see whether these speculations are true and whether there has been any progress on other related fronts.

Thanks.

Comment 41 Andrew McNabb 2009-10-06 17:00:58 UTC
Unfortunately, the SSSD feature page seems to say that SSSD won't be a default feature (i.e., you have to manually install and configure it).  If that's true, then SSSD will be a nice workaround, but it won't really fix the bug, since anyone without SSSD installed will still have this problem.

Comment 42 Orion Poplawski 2009-10-15 17:21:55 UTC
I've installed sssd and it appears to have no effect on this startup issue, nor do I see any evidence as to why it would.

Comment 43 Daniel Qarras 2009-10-16 07:45:38 UTC
> I've installed sssd and it appears to have no effect on this startup issue, no
> do I see any evidence as to why it would.

If you use "files sss" in /etc/nsswitch.conf and "pam_sss.so" in /etc/pam.d/system-auth (and no ldap / pam_ldap.so at all) then all userinfo/authentication attempts go through SSSD which should be able to handle offline situations. You probably need a recent version like 0.6 or so.

Comment 44 D A T 2009-11-13 23:17:26 UTC
I seem to have stumbled on this one after successfully running F10 with LDAP for over a year.
I run KDE4 (this is relevant, bear with me).

I installed NetworkManager-pptp in an attempt to get a PPTP VPN connection working, as opposed to KNetworkManager, which failed to work for the VPN.  I'm fairly sure I installed NetworkManager itself too, though it could have silently been sitting there already. The VPN worked, but when I rebooted the following day I hit this issue: long delays starting the message bus, and /var/log/messages screaming about ldap issues.

It seems significant that NetworkManager takes over from the normal network service. I was surprised to find /etc/sysconfig/network-scripts/ifcfg-eth0 set to ONBOOT=no.

Anyway, I by no means pretend to be an expert on Fedora's network stack or management. I just thought the fact that I had managed to induce the problem after such a long, previously stable time might help you guys figure it out.

I can provide any further details you think would help.

Comment 45 Bug Zapper 2009-11-16 07:50:35 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle.
Changing version to '12'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 46 Andreas Mueller 2010-04-21 10:16:39 UTC
The root cause apparently has not been investigated yet. Reading the
source code of dbus-daemon has revealed the following:

dbus-daemon reads all the groups of the user root when it parses
the user="root" attributes in the configuration file. This triggers
many ldap lookups, that trigger the exponential back off of the
bind_policy hard setting in /etc/ldap.conf. So parsing the config
file takes long, and dbus-daemon forks only after parsing the config.
At that point, the boot continues.

The point is that dbus-daemon has a logical error in it. It is
not necessary to read the list of groups of a user ever. Such a
list is dynamic, it changes when naming services become available,
or when the ldap contents are changed. So dbus-daemon should rather
check group memberships when it needs to, i.e. when it has to
authorize a request. This could be done much more efficiently
using the getgrent family of calls instead of the getgrouplist
call dbus-daemon is currently using.

So I propose that the upstream providers of dbus-daemon are contacted
to get dbus-daemon fixed. Possible fixes:

1. quick and dirty: add an option to stop dbus-daemon from expanding
   group lists.

2. fix the logical error, don't use getgrouplist, check group membership
   late and rely on nscd's caching mechanism for performance.
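
A minimal Python sketch of the second option, checking one membership at the time it is needed rather than expanding a user's full group list up front, might look like the following. It is not dbus-daemon code: `user_in_group` is an illustrative name, Python's `grp.getgrnam` and `pwd.getpwnam` stand in for the corresponding libc lookups, and the demo uses in-memory fake databases so that it runs without any directory server.

```python
import grp
import pwd
from collections import namedtuple


def user_in_group(user, group, getgrnam=grp.getgrnam, getpwnam=pwd.getpwnam):
    """Check a single membership on demand (illustrative, not dbus-daemon code)."""
    g = getgrnam(group)
    if user in g.gr_mem:  # listed as a supplementary member
        return True
    return getpwnam(user).pw_gid == g.gr_gid  # or it is the primary group


# Fake databases for the demo; real callers would use the defaults.
Group = namedtuple("Group", "gr_name gr_gid gr_mem")
User = namedtuple("User", "pw_name pw_gid")
groups = {"wheel": Group("wheel", 10, ["alice"])}
users = {
    "alice": User("alice", 1000),
    "bob": User("bob", 10),
    "carol": User("carol", 1001),
}

for name in ("alice", "bob", "carol"):
    print(name, user_in_group(name, "wheel", groups.__getitem__, users.__getitem__))
```

The demo prints True for alice (supplementary member) and bob (primary group) and False for carol. Because the check happens per request, its answer reflects whatever the naming services return at that moment, which is the point made about group lists being dynamic.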

Comment 47 Orion Poplawski 2010-04-21 21:43:39 UTC
Andreas -  Sounds like a great analysis.  Upstream is here: http://www.freedesktop.org/wiki/Software/dbus  Would you be willing to file a bug there?

Comment 48 Bin Li 2010-06-02 10:13:15 UTC
I also hit this issue and have reported it upstream.

https://bugs.freedesktop.org/show_bug.cgi?id=28355

Comment 49 Andreas Mueller 2010-06-02 10:58:34 UTC
Sorry for not following up more quickly. I've filed the analysis above
as additional information to freedesktop.org bug 28355 created by Bin Li.

Comment 50 Bug Zapper 2010-11-04 12:17:15 UTC
This message is a reminder that Fedora 12 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 12.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '12'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 12's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 12 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 51 Bug Zapper 2010-12-05 07:18:25 UTC
Fedora 12 changed to end-of-life (EOL) status on 2010-12-02. Fedora 12 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.