Bug 621700

Summary: GDM doesn't handle nsswitch.conf getting updated on the fly
Product: Red Hat Enterprise Linux 6
Reporter: Ray Strode [halfline] <rstrode>
Component: gdm
Assignee: Ray Strode [halfline] <rstrode>
Status: CLOSED CURRENTRELEASE
QA Contact: desktop-bugs <desktop-bugs>
Severity: medium
Priority: low
Version: 6.0
CC: cmeadors, jgalipea, jkoten, jlaska, jmccann, notting, overholt, roland, rstrode, sbose, sgallagh, syeghiay
Target Milestone: rc
Keywords: RHELNAK
Target Release: ---
Hardware: All
OS: Linux
Fixed In Version: gdm-2.30.4-16.el6
Doc Type: Bug Fix
Clone Of: 607233
Last Closed: 2010-11-10 20:28:18 UTC
Bug Blocks: 599016
Attachments: The aforementioned work around (flags: none)

Description Ray Strode [halfline] 2010-08-05 20:49:40 UTC
+++ This bug was initially created as a clone of Bug #607233 +++

Description of problem:
Network-provided users can't log in through GDM. They stall after accepting the password and are never presented with a desktop.

Version-Release number of selected component (if applicable):
gdm-2.29.6-1.fc13

How reproducible:
Every time

Steps to Reproduce:

--- Additional comment from jlaska on 2010-08-04 16:39:32 EDT ---

Moving back to ASSIGNED based on comment#19.  

Sumit: Is the procedure Jiri notes in comment#19 a correct way to reset the system configuration to reproduce this failure?

--- Additional comment from sgallagh on 2010-08-05 07:50:54 EDT ---

That set of reproduction steps is incomplete. It doesn't describe where in those steps they are attempting logins.

Let me try to explain what's happening at each step here.

1) Change auth conf to local account only
This removes sss from nsswitch.conf and pam_sss from /etc/pam.d/[system|password]-auth as well as shutting the daemon down.

As a result, after this step, no user identity is looked up from SSSD.

2) Removing cached credentials
It is safe to purge the cache at this time, since the SSSD is not running. Be aware that this is also removing the cached user identities, so if the system is not online after this, it will not be able to return user information.

3) Rebooting
This step could be shortened to dropping into runlevel 3 and then returning to runlevel 5. I assume that the goal here is just to restart gdm.

I'm assuming that at this point the engineer is logging in using a local user account. At this time, no activity happens related to the SSSD. The sss client libraries aren't in use, thus this is NOT a valid test of this bug.

4) Setting auth conf to ldap+kerberos
This adds sss back into nsswitch.conf and starts up the SSSD daemon processes.

5) Switching user
This would actually be the first lookup to the SSSD. If this is failing at this time, then it's most likely that SSSD cannot reach the LDAP server or is experiencing a similar failure that is resulting in it not answering the request. For this, I'd need to see /var/log/sssd/sssd_default.log (and I'd prefer that debug_level be set to 9 in sssd.conf).


So this approach is NOT testing the specific fix.

Testing this specific fix is actually very easy:
1) Use authconfig to set LDAP+Kerberos
1) telinit 3
2) Log in as root on the local console
3) service sssd stop
4) rm -f /var/lib/sss/db/cache_default.ldb
5) service sssd start
6) telinit 5
7) Log in to GDM as an LDAP user with the appropriate Kerberos password
8) service sssd restart
9) Log out of the logged-in user and log in again

Before this fix, that would crash. After this fix it should go smoothly.

--- Additional comment from jkoten on 2010-08-05 09:35:45 EDT ---

Your steps may be good for testing the specific fix in sssd, but they don't reproduce the steps from comment 0. The problem is in step 7 - you start GDM with sssd already configured. But that worked even before the fix - see comment 11.

The use case is that you use a local user to configure sssd through authconfig-gtk and then switch to an sssd user (i.e. without restarting gdm).

Steps to reproduce:
1) Change auth. conf to local account only
2) telinit 3
3) telinit 5
4) Log in as a local user
5) Use authconfig to set LDAP+Kerberos
6) Switch user
7) Log in to GDM as an LDAP user with the appropriate Kerberos password

It seems to me that the problem is in gdm - feel free to clone this bug against gdm.

--- Additional comment from jkoten on 2010-08-05 10:04:14 EDT ---

Created an attachment (id=436858)
sssd_default.log

--- Additional comment from rstrode on 2010-08-05 16:15:58 EDT ---

Okay, so sgallagh and I spent a few hours looking into this today.

What's going on is that one of gdm's processes is very long running.  This process starts before sssd is configured in nsswitch.conf and continues to run after sssd is configured in nsswitch.conf.

The problem is, it seems that glibc will only read the list of modules from nsswitch.conf once for the lifetime of a process (the first time the process calls getpwnam()), so it doesn't notice that the system has been updated.  This gives the long running gdm process an inconsistent view of the world compared to the shorter running gdm processes, and gdm doesn't handle that inconsistency very robustly.

There are a few possibilities on what we could do next:

1) Fix glibc to automatically detect when nsswitch.conf is updated and clear its cache
2) Add a new function to glibc à la res_init(), but for nsswitch.conf instead of resolv.conf, and make gdm call that function before doing getpwnam().
3) Make gdm fork a helper process any time it wants to call getpwnam() to ensure that getpwnam() always returns current information
4) Make gdm fail instead of crash.  This would prevent users from being able to log in with sssd until they reboot, but at least it wouldn't show a crash message in their syslog.
5) Make authconfig tell the user they need to reboot for changes to take effect.
6) Release note this limitation

1 and 2 would be the nicest fixes for me, but they may not be feasible on the glibc side. From a code aesthetics point of view, 3 loses, but it has the advantage of making everything work out of the box.  4, 5, and 6 all lose from a "Just Works" point of view.

--- Additional comment from roland on 2010-08-05 16:34:51 EDT ---

The other workaround that comes to mind is using nscd.  If nscd is available, then libc won't look at nsswitch.conf at all.  The first time a call is tried when nscd is gone, it should fall back to looking at nsswitch.conf.  So you could do some dance where nscd runs in a default configuration to begin with, and then when sssd is enabled you write nsswitch.conf first and then "service nscd stop".  That should work, but it seems pretty fragile.  The #3 sort of approach is really the only thing that is surely going to be robust without relying on intricate details of libc internals.

For adding or changing libc behavior (#2 is plausible, #1 won't happen), you need to consult the upstream glibc maintainers, i.e. drepper.

--- Additional comment from rstrode on 2010-08-05 16:45:43 EDT ---

Given nscd has other side-effects (although some of those orthogonal side-effects would be a good thing!), it's probably not appropriate to do at this stage in the development cycle.

So it sounds like 2 would potentially be a better long term option, but 3 is better for rhel 6.

I'll add the workaround to GDM.

This bug is getting a little crowded though, given it's covering two different issues, one that's already fixed in sssd and this new one that we need to work around in gdm.

I'll clone this bug to cover the gdm work.

Comment 2 RHEL Program Management 2010-08-05 21:07:41 UTC
This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **

Comment 3 Ray Strode [halfline] 2010-08-05 22:23:59 UTC
I have a functioning patch for this.  Just need acks.

Comment 4 Ray Strode [halfline] 2010-08-06 17:00:06 UTC
Created attachment 437212 [details]
The aforementioned work around

This patch is the work around I mentioned previously.  It just changes gdm to call out to a helper program instead of using getpwnam() directly for the part of the code that hits this bug.

By execing a new process we get an updated module list, and everything works.
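
For illustration only, here is a rough Python sketch of the general idea (it is not the attached patch, which is not reproduced in this report). Instead of calling getpwnam() in the long-running process, it execs a fresh interpreter to do the lookup, so the C library in that new process reads nsswitch.conf from scratch. The names lookup_user_fresh and _HELPER are hypothetical, and passing the result back as JSON on stdout is an assumption made for the example.

```python
import json
import subprocess
import sys

# Source of the short-lived helper. It runs in a newly exec'd interpreter,
# which has no NSS state inherited from the long-running caller.
# (Hypothetical helper; the real gdm workaround is not shown in this report.)
_HELPER = r"""
import json, pwd, sys
try:
    entry = pwd.getpwnam(sys.argv[1])
except KeyError:
    sys.exit(1)  # unknown user
print(json.dumps({
    "name": entry.pw_name,
    "uid": entry.pw_uid,
    "gid": entry.pw_gid,
    "home": entry.pw_dir,
    "shell": entry.pw_shell,
}))
"""


def lookup_user_fresh(username):
    """Look up a passwd entry in a freshly exec'd process.

    Returns a dict with name/uid/gid/home/shell, or None if the helper
    reports that the user is unknown (or otherwise fails).
    """
    result = subprocess.run(
        [sys.executable, "-c", _HELPER, username],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return json.loads(result.stdout)


if __name__ == "__main__":
    print(lookup_user_fresh("root"))
```

Note that a plain fork() without an exec would not be enough here, because the child would inherit the parent's already-loaded module list. The cost of this approach is one extra process per lookup, which is the "code aesthetics" trade-off mentioned for option 3.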

Comment 5 Ray Strode [halfline] 2010-08-06 18:28:58 UTC
attachment 437212 [details] is building now.  I tested this with Nalin yesterday and it seems to work okay.

Comment 8 releng-rhel@redhat.com 2010-11-10 20:28:18 UTC
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.