Bug 756428

Summary: sssd.service does not depend on time-sync.target, breaks krb5 and such
Product: [Fedora] Fedora
Reporter: Sandro Mathys <sandro>
Component: sssd
Assignee: Stephen Gallagher <sgallagh>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 16
CC: jhrozek, sbose, sgallagh, ssorce
Hardware: Unspecified   
OS: Unspecified   
Doc Type: Bug Fix
Last Closed: 2012-03-05 21:23:48 UTC
Attachments:
secure log of what's described in comment #6 (flags: none)
syslog of what's described in comment #6 (flags: none)
systemctl status sssd.service output as described in comment #6 (flags: none)

Description Sandro Mathys 2011-11-23 14:34:43 UTC
Description of problem:
sssd.service should depend on time-sync.target to make sure the system has the correct time. Otherwise krb5 and probably other authentication mechanisms won't work.

Version-Release number of selected component (if applicable):
sssd-1.6.3-1.fc16.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. Configure sssd to use krb5 authentication
2. Make sure your hwclock is off by an hour or so
3. Configure ntpd/ntpdate or chronyd/chrony-wait
4. Reboot
5. Try to log in
  
Actual results:
Authentication fails

Expected results:
Authentication works

Additional info:
/lib/systemd/system/sssd.service needs to be changed to read:
After=syslog.target time-sync.target
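A minimal sketch of the proposed change, assuming the rest of the packaged unit file stays as it is (only the relevant [Unit] line is shown). A copy of the unit placed in /etc/systemd/system/ takes precedence over the one in /lib/systemd/system/, so the packaged file does not have to be edited. Note that After= only orders sssd.service relative to time-sync.target; it does not pull that target in by itself.

```ini
# /etc/systemd/system/sssd.service
# (a copy of /lib/systemd/system/sssd.service; [Service] and [Install]
#  sections and any other [Unit] lines are left unchanged and not shown)
[Unit]
# Start sssd only after syslog and after the clock has been synchronized.
# After= is ordering only; it does not itself start time-sync.target.
After=syslog.target time-sync.target
```

After editing the copied unit, run `systemctl daemon-reload` so systemd picks up the change.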

Comment 1 Stephen Gallagher 2011-11-23 14:45:43 UTC
This is a bit of a chicken and egg problem. time-sync.target (IIRC) is dependent on the network target. SSSD has to be available to handle requests before the network comes up. SSSD is capable of handling this because it has built-in caching (unlike nss_ldap, which would in many cases hang for minutes during startup).

What you should be dealing with here is a race condition. If you try to log in during the period where the network is up but ntp has not yet started, SSSD will contact the KDC and get an error about clock skew. But once ntp adjusts the time, logins should work correctly. This race window should be *very* small.

If you're seeing this always fail, then there's a bug in NTP/chrony, not SSSD.

Comment 2 Simo Sorce 2011-11-23 15:00:14 UTC
Stephen, what you say is right, but then we need to patch sssd to use offline authentication if we get a clock skew error. Failing authentication in this case is not the expected outcome.

I think we should open a ticket in SSSD's Trac if we fail authentication on clock skew.

Comment 3 Sandro Mathys 2011-11-23 15:05:51 UTC
chrony-wait.service does work, i.e. the time is corrected within a couple of seconds. The system's time is correct when I try to log in, but authentication still fails.

As soon as I restart sssd, authentication works.

Reproducible, always.

Now tell me where this is a chrony bug and not a SSSD bug.

Comment 4 Stephen Gallagher 2011-11-23 15:13:05 UTC
Upstream ticket:
https://fedorahosted.org/sssd/ticket/1096

Comment 5 Stephen Gallagher 2011-11-23 15:28:24 UTC
If the system time is correct, but SSSD fails then something else is going on here.

Are you using cached credentials? Also, are you using GSSAPI to connect to the LDAP provider (or using the IPA provider?)

What shows up in /var/log/secure when this happens?

Comment 6 Sandro Mathys 2011-11-24 08:12:31 UTC
We don't cache credentials, as the user homes are on a CIFS share, so logging in while offline is useless in this case anyway.

Yes, we are using GSSAPI to connect to the LDAP provider.

I'll attach the secure log, the syslog (grep'ed for sss and chrony) and the output of two 'systemctl status sssd.service'.

I did the following:
1) reboot
2) login as (remote user) sandroma -> fail (3 times, just to make sure)
3) login as (local user) root
4) systemctl status sssd.service
5) systemctl restart sssd.service
6) login as sandroma -> success
7) systemctl status sssd.service

If you put the information from those logs into one timeline, you'll clearly see that chrony synced the time well ahead of the user trying to log in. Oh, in case it matters: I logged in on a tty, even though gdm was started.

Comment 7 Sandro Mathys 2011-11-24 08:13:31 UTC
Created attachment 535807
secure log of what's described in comment #6

Comment 8 Sandro Mathys 2011-11-24 08:13:52 UTC
Created attachment 535808
syslog of what's described in comment #6

Comment 9 Sandro Mathys 2011-11-24 08:14:31 UTC
Created attachment 535809
systemctl status sssd.service output as described in comment #6

Comment 10 Stephen Gallagher 2011-11-28 12:30:01 UTC
Sorry for the long delay. It was a holiday in the US.

Sandro, could you set
debug_level = 9

in the [domain/<DOMAINNAME>] section of /etc/sssd/sssd.conf and then reboot to rerun the failing test?
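For reference, the requested setting would look roughly like this in /etc/sssd/sssd.conf. EXAMPLE is a hypothetical placeholder for the real domain name, and the domain's existing options are omitted.

```ini
# /etc/sssd/sssd.conf (fragment)
# "EXAMPLE" is a placeholder for the real domain name.
[domain/EXAMPLE]
# Maximum debug verbosity for this domain's backend;
# output goes to /var/log/sssd/sssd_EXAMPLE.log
debug_level = 9
```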

Then please attach /var/log/sssd/sssd_<DOMAINNAME>.log and /var/log/sssd/ldap_child.log to this ticket (sanitized if needed).

Also, could you please try one more thing with your test? Please try waiting five minutes after the first failure before attempting a second login. I have a hunch that what may be happening is this:

1) SSSD comes up before NTP starts.
2) Something asks SSSD for a lookup before NTP has started.
3) SSSD tries to connect to LDAP and is denied the GSSAPI bind, so SSSD goes into "offline mode" and answers requests from the cache for two minutes.
4) The system finishes booting and you try to log in before those two minutes are up. Since you don't have cached credentials, SSSD has to return a failure: from its perspective, there's no way to validate you until we return to online mode.

I should point out that in this configuration, the cached credentials would still be valuable, as you are technically "online" in a network sense, but "offline" from SSSD's perspective.
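A minimal sketch of enabling cached credentials, again assuming the hypothetical domain name EXAMPLE and omitting the domain's other options. cache_credentials is a standard sssd.conf domain option; with it enabled, a user who has already logged in successfully while SSSD was online can still be authenticated from the cache during a period when SSSD considers itself offline.

```ini
# /etc/sssd/sssd.conf (fragment)
# "EXAMPLE" is a placeholder for the real domain name.
[domain/EXAMPLE]
# Keep a hash of each user's password after a successful online login,
# so that user can still authenticate while SSSD is in offline mode.
cache_credentials = True
```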

Anyway, I want to determine whether this is just a timing issue, or whether we are somehow entering an offline state from which we never return (which would be a serious issue).

Thanks for your help sorting this out.