Bug 1644919 - First(?) login as FreeIPA domain user takes longer with sssd-2.0.0-4.fc29 (38 seconds vs. 3 seconds)
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: sssd
Version: 29
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Michal Zidek
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-10-31 22:14 UTC by Adam Williamson
Modified: 2019-03-08 09:33 UTC
CC: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:


Attachments
client /var/log tarball (from openQA staging job #389236) (1.40 MB, application/octet-stream)
2018-11-01 02:23 UTC, Adam Williamson
server /var/log tarball (from openQA staging job #389235) (2.32 MB, application/octet-stream)
2018-11-01 02:27 UTC, Adam Williamson
client /var/log tarball (from openQA staging job #390554) (1.26 MB, application/octet-stream)
2018-11-06 01:13 UTC, Adam Williamson
server /var/log tarball (from openQA staging job #390553) (3.30 MB, application/octet-stream)
2018-11-06 01:14 UTC, Adam Williamson

Description Adam Williamson 2018-10-31 22:14:09 UTC
With sssd-2.0.0-4.fc29 - https://bodhi.fedoraproject.org/updates/FEDORA-2018-6b9c9c4f85 - the openQA FreeIPA tests fail:

https://openqa.fedoraproject.org/tests/300815
https://openqa.fedoraproject.org/tests/300817

Since the update was pushed stable, the openQA FreeIPA tests for *every* F29 update are failing this way, as they all now use that version of sssd.

In the test, a FreeIPA server is deployed on one VM with the domain name as DOMAIN.LOCAL, and a user called 'test1' with the right to login to any host in the domain is created on the server end. Another VM is enrolled as a client in the domain, and once that has completed, the test tries to log in at a console as 'test1@DOMAIN.LOCAL' with the correct password.

Prior to this update, that worked fine. With this update, the login process seems to get stuck and never completes - the system just sits at this screen:

https://openqa.fedoraproject.org/tests/300815#step/freeipa_client/19

until the test times out (I've tried it with timeouts of 10 and 60 seconds and it behaved the same; I'm trying with a timeout of 180 seconds now... before this update, 10 seconds was enough).

Will attach logs from both server and client ends. I'm also going to see if this affects my own personal setup (I run a FreeIPA domain, and my desktop is F29...)

Comment 1 Adam Williamson 2018-10-31 22:53:23 UTC
Doesn't seem to affect my 'live' setup. So, first guess would be that the bug happens when it's the *first* time logging into an account on the system?

Comment 2 Adam Williamson 2018-10-31 23:28:35 UTC
Update: I didn't change the timeout setting correctly before. The test by default actually waits 30 seconds for the prompt to appear after entering the password. If I change that timeout to 180, it works; it looks like in the run I just did, it actually took about 38 seconds for the login to succeed. With the previous sssd, it looks like it took about 3 seconds, from the openQA logs.

Comment 3 Adam Williamson 2018-11-01 02:23:09 UTC
Created attachment 1499710 [details]
client /var/log tarball (from openQA staging job #389236)

Comment 4 Adam Williamson 2018-11-01 02:27:25 UTC
Created attachment 1499711 [details]
server /var/log tarball (from openQA staging job #389235)

Note that in these logs, the attempt to login as test1 happens at about 22:35:29 UTC, which seems to be 18:35:29 in system local time. To view the journal, you can use `journalctl --file var/log/journal/(longstring)/system.journal --utc`. Just press tab to fill in "(longstring)".

These logs are from one of the tests where the login timeout was set to 30 seconds, so the login failed. I don't have logs from the case where the timeout was extended and the login succeeds right now, but could adjust the test to upload them if desired.

Comment 5 Michal Zidek 2018-11-01 10:58:40 UTC
Would it be possible to adjust the tests to include debug_level=9 in all sections in sssd.conf? Currently the logs are empty.
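For reference, a minimal sssd.conf sketch with debug_level=9 in every section might look like the following. The domain name and the nss/pam services are taken from the report; the exact configuration file used by the test is not shown here, so treat this as an illustration only:

```ini
[sssd]
services = nss, pam
domains = domain.local
debug_level = 9

[nss]
debug_level = 9

[pam]
debug_level = 9

[domain/domain.local]
debug_level = 9
```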

Comment 6 Michal Zidek 2018-11-01 12:40:44 UTC
Sorry, I looked at the server logs. The client logs are OK. Removing the needinfo flag.

Comment 7 Jakub Hrozek 2018-11-05 14:29:17 UTC
The requests in the SSSD log are fairly fast; e.g. the account request, which I suspected would be slow, takes 4 seconds:

(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_dp_send_req] (0x0100): Sending request with the following data:                                                                                                                                  
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): command: SSS_PAM_ACCT_MGMT                                                                                                                                                 
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): domain: domain.local                                                                                                                                                       
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): user: test1@domain.local                                                                                                                                                   
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): service: login                                                                                                                                                             
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): tty: tty2                                                                                                                                                                  
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): ruser: not set                                                                                                                                                             
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): rhost: not set                                                                                                                                                             
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): authtok type: 0                                                                                                                                                            
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): newauthtok type: 0                                                                                                                                                         
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): priv: 1                                                                                                                                                                    
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): cli_pid: 2312                                                                                                                                                              
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): logon name: test1@DOMAIN.LOCAL                                                                                                                                             
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): flags: 0                                                                                                                                                                   
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_dom_forwarder] (0x0100): pam_dp_send_req returned 0                                                                                                                                              
(Wed Oct 31 18:35:35 2018) [sssd[pam]] [pam_dp_send_req_done] (0x0200): received: [0 (Success)][domain.local]                                                                                                                                
(Wed Oct 31 18:35:35 2018) [sssd[pam]] [pam_reply] (0x0200): pam_reply called with result [0]: Success.
(Wed Oct 31 18:35:35 2018) [sssd[pam]] [filter_responses] (0x0100): [pam_response_filter] not available, not fatal.                                                                                                                          
(Wed Oct 31 18:35:35 2018) [sssd[pam]] [pam_reply] (0x0200): blen: 29   

But then there is a delay between sending a reply to the account PAM phase and the session phase:                                                                                                                                      
                                                                                                                                                                     
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_cmd_setcred] (0x0100): entering pam_cmd_setcred                                                                                                                                                  
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [sss_parse_name_for_domains] (0x0200): name 'test1@DOMAIN.LOCAL' matched expression for domain 'domain.local', user is test1                                                                          
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): command: SSS_PAM_SETCRED                                                                                                                                                   
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): domain: domain.local                                                                                                                                                       
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): user: test1                                                                                                                                                                
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): service: login                                                                                                                                                             
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): tty: tty2                                                                                                                                                                  
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): ruser: not set                                                                                                                                                             
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): rhost: not set                                                                                                                                                             
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): authtok type: 0                                                                                                                                                            
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): newauthtok type: 0                                                                                                                                                         
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): priv: 1                                                                                                                                                                    
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): cli_pid: 2312                                                                                                                                                              
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): logon name: test1@DOMAIN.LOCAL                                                                                                                                             
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): flags: 2                                                                                                                                                                   
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [cache_req_send] (0x0400): CR #3: New request 'Initgroups by name'                                                                                                                                    
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [cache_req_process_input] (0x0400): CR #3: Parsing input name [test1@DOMAIN.LOCAL] 

Looking at the journal around that time, there are these messages:
Oct 31 23:35:35 client003.domain.local audit[2312]: USER_ACCT pid=2312 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:local_login_t:s0-s0:c0.c1023 msg='op=PAM:accounting grantors=pam_unix,pam_sss,pam_permit acct="test1@DOMA>

--> this is pam_sss granting the account phase

Oct 31 23:36:03 client003.domain.local realmd[919]: quitting realmd service after timeout

....hmm, why did the realmd service quit? I guess just an idle timeout?

Oct 31 23:36:06 client003.domain.local dbus-daemon[645]: Unknown username "systemd-timesync" in message bus configuration file

Hmm, strange, but doesn't sound related?

Oct 31 23:36:06 client003.domain.local dbus-daemon[645]: [system] Reloaded configuration
Oct 31 23:36:06 client003.domain.local realmd[919]: stopping service
Oct 31 23:36:06 client003.domain.local audit[1]: USER_AVC pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='avc:  received policyload notice (seqno=2)
                                                  exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?'
Oct 31 23:36:07 client003.domain.local audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=realmd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=s>
Oct 31 23:36:07 client003.domain.local audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=realmd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=su>
Oct 31 23:36:07 client003.domain.local systemd[1]: Started Getty on tty6.
Oct 31 23:36:07 client003.domain.local audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=getty@tty6 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? r>

looks like systemd spawned a tty??

Oct 31 23:36:07 client003.domain.local audit[2312]: CRED_ACQ pid=2312 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:local_login_t:s0-s0:c0.c1023 msg='op=PAM:setcred grantors=pam_localuser,pam_sss acct="test1@DOMAIN.LOCAL" >

Here is the setcred PAM phase which corresponds to:
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_cmd_setcred] (0x0100): entering pam_cmd_setcred                                                                                                                                                  

But so far I don't know how to explain the delay between calling into account and setcred/session. Looking at the PAM configuration, there are several modules called before reaching the system-auth file which includes pam_sss:

# pam_selinux.so close should be the first session rule
session    required     pam_selinux.so close
session    required     pam_loginuid.so
session    optional     pam_console.so
# pam_selinux.so open should only be followed by sessions to be executed in the user context
session    required     pam_selinux.so open
session    required     pam_namespace.so
session    optional     pam_keyinit.so force revoke
session    include      system-auth
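For context, the session stack that the `include system-auth` line above pulls in typically ends with pam_sss on a Fedora system. This is a generic sketch of a default system-auth session section, not the exact file from the test VM:

```
session     optional      pam_keyinit.so revoke
session     required      pam_limits.so
-session    optional      pam_systemd.so
session     optional      pam_sss.so
```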

I wonder if it is possible to re-run the test with e.g. ssh or another service?

Comment 8 Jakub Hrozek 2018-11-05 14:30:11 UTC
Alternatively, would it be possible to run the same test with a recent fedora tree, but an older sssd to try to bisect the issue and see if it was really caused by sssd or something else?

Comment 9 Adam Williamson 2018-11-05 16:19:07 UTC
We already have fairly strong proof of that: the bug first appeared only in the test of the sssd update when it was in updates-testing, then as soon as the sssd update was pushed to stable, started happening in every other update test as well.

The spawn of tty6 is just what the test does on failure - when it timed out and gave up waiting for the login to succeed, it switched to tty6 and logged in as the local root account, to upload logs and so on. It's not related to the bug.

As I mentioned, if you like, I can tweak the test to upload logs even on success, so we'll have slightly cleaner logs from the current case (where we wait long enough for the login to succeed, and the test completes).

Comment 10 Jakub Hrozek 2018-11-05 19:06:50 UTC
(In reply to Adam Williamson from comment #9)
> We already have fairly strong proof of that: the bug first appeared only in
> the test of the sssd update when it was in updates-testing, then as soon as
> the sssd update was pushed to stable, started happening in every other
> update test as well.
> 

I see

> The spawn of tty6 is just what the test does on failure - when it timed out
> and gave up waiting for the login to succeed, it switched to tty6 and logged
> in as the local root account, to upload logs and so on. It's not related to
> the bug.
> 
> As I mentioned, if you like, I can tweak the test to upload logs even on
> success, so we'll have slightly cleaner logs from the current case (where we
> wait long enough for the login to succeed, and the test completes).

Yes please.

Comment 11 Adam Williamson 2018-11-06 01:13:55 UTC
Created attachment 1502245 [details]
client /var/log tarball (from openQA staging job #390554)

Comment 12 Adam Williamson 2018-11-06 01:14:39 UTC
Created attachment 1502246 [details]
server /var/log tarball (from openQA staging job #390553)

Comment 13 Adam Williamson 2018-11-06 01:18:19 UTC
OK, there are a couple of new log tarballs from a case where we waited long enough for the login to succeed. It looks like the login attempt started (the test hit 'enter' after typing the password) at 22:24:51 UTC and the login completed at 22:25:12 UTC (21 seconds later). Weirdly, the time it takes seems to have become a *bit* shorter over the last several days of test runs: it's dropped from about 35 seconds per test to about 20 seconds.

As I mentioned, before the problematic sssd update, this took about 2 seconds.

Comment 14 Jakub Hrozek 2018-11-06 08:03:55 UTC
One more question, is it possible to run the tests with a scratch package? There is a handful of patches between the 'good' and the 'bad' update but not so many as to make bisect impractical.

Comment 15 Adam Williamson 2018-11-06 18:08:56 UTC
yeah, it's possible. Just throw me the scratch builds to try and I'll try 'em.

Comment 16 Jakub Hrozek 2018-11-06 20:17:18 UTC
(In reply to Adam Williamson from comment #15)
> yeah, it's possible. Just throw me the scratch builds to try and I'll try
> 'em.

I'll do that (and thank you very much for the offer), unless there is a way I can run the tests myself?

Comment 17 Adam Williamson 2018-11-06 20:38:41 UTC
Not unless you want to set up your own pet openQA deployment, which is possible but a bit of work. Probably easier just for me to do it :)

Comment 18 Jakub Hrozek 2018-11-23 14:06:19 UTC
(In reply to Adam Williamson from comment #17)
> Not unless you want to set up your own pet openQA deployment, which is
> possible but a bit of work. Probably easier just for me to do it :)

I totally forgot about this bug :-(

Here is a build that tries to "bisect" the issue by applying only half of the patches between -3 and -4 in rawhide:
https://koji.fedoraproject.org/koji/taskinfo?taskID=31064701

Comment 19 Adam Williamson 2018-11-27 01:35:49 UTC
Sorry, me too. I've got a bit of a backlog ATM but I'll try to get back and give that build a shot soon.

Comment 20 Adam Williamson 2018-12-18 01:33:32 UTC
Well, we actually figured out the cause of this a different way - see https://bugzilla.redhat.com/show_bug.cgi?id=1654537#c9 . Jakub realized there that the key change is https://pagure.io/SSSD/sssd/c/945865a , which basically makes a rather resource-intensive SELinux-related operation happen pretty much every time you log in to a system as a FreeIPA domain user for the first time. Before that change, it happened much more rarely.

That was a genuine fix for a genuine bug, though, so it's a bit tricky to "fix" this - we can't just revert that. One suggestion is to try to make the SELinux operation smoother. The other thing I'd note is that *the SELinux operation itself* is marked as being a 'workaround' for a much older bug, https://bugzilla.redhat.com/show_bug.cgi?id=1186422 . That bug seems to have been resolved now. Can we look again at whether it's still necessary for sssd to do this at all, with that fix?

Comment 21 Adam Williamson 2018-12-18 01:40:02 UTC
To be clear, I'm talking about this here:

https://pagure.io/SSSD/sssd/blob/945865a/f/src/providers/ipa/selinux_child.c#_144

note the comment:

    /* This is a workaround for
     * https://bugzilla.redhat.com/show_bug.cgi?id=1186422 to make sure
     * the directories are created with the expected permissions
     */

Comment 22 Jakub Hrozek 2018-12-18 15:27:05 UTC
The comment really just relates to the umask() call on line 154 in current master and the subsequent umask restore. It shouldn't even be required with modern libsemanage versions, but in any case the umask is not the slow part; setting the context is.

I filed https://bugzilla.redhat.com/show_bug.cgi?id=1660546 to ask the semanage developers what we can do about speeding the operation up. A bug report seemed the best way of tracking the issue, especially given that the holidays are coming up.

Comment 23 Adam Williamson 2018-12-20 16:38:25 UTC
Note that in the aarch64 tests, this seems to be taking about *90* seconds:

(Thu Dec 20 05:49:32 2018) [sssd[be[domain.local]]] [selinux_child_create_buffer] (0x4000): buffer size: 43
(Thu Dec 20 05:49:32 2018) [sssd[be[domain.local]]] [child_handler_setup] (0x2000): Setting up signal handler up for pid [23892]
(Thu Dec 20 05:49:32 2018) [sssd[be[domain.local]]] [child_handler_setup] (0x2000): Signal handler set up for pid [23892]
(Thu Dec 20 05:49:32 2018) [sssd[be[domain.local]]] [sdap_process_result] (0x2000): Trace: sh[0xaaaad0944d60], connected[1], ops[(nil)], ldap[0xaaaad095d4e0]
(Thu Dec 20 05:49:32 2018) [sssd[be[domain.local]]] [sdap_process_result] (0x2000): Trace: end of ldap_result list
(Thu Dec 20 05:49:32 2018) [sssd[be[domain.local]]] [write_pipe_handler] (0x0400): All data has been sent!
(Thu Dec 20 05:51:02 2018) [sssd[be[domain.local]]] [child_sig_handler] (0x1000): Waiting for child [23892].
(Thu Dec 20 05:51:02 2018) [sssd[be[domain.local]]] [child_sig_handler] (0x0100): child [23892] finished successfully.

and because it takes *so* long, we actually hit, I think, some other timeout during login. This doesn't seem to be the systemd timeout - it's more that getty just suddenly exits, and systemd sees that and restarts it:

Dec 20 02:50:30 client003.domain.local audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=getty@tty2 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Dec 20 02:50:30 client003.domain.local systemd[1]: getty@tty2.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Dec 20 02:50:30 client003.domain.local systemd[1]: getty@tty2.service: Scheduled restart job, restart counter is at 1.
Dec 20 02:50:30 client003.domain.local systemd[1]: Stopped Getty on tty2.

Looking into this, it seems that getty calls 'login' after reading the username - per https://unix.stackexchange.com/questions/144303/change-tty-login-timeout-archlinux - and login has a default 60-second timeout. From 'man login':

CONFIG FILE ITEMS
       login  reads  the /etc/login.defs(5) configuration file.  Note that the configuration file could be distributed with another package (e.g. shadow-utils).  The following configuration items are relevant for login(1):

...

       LOGIN_TIMEOUT (number)
           Max time in seconds for login.  The default value is 60.

so I think that's what we're hitting. I'll file a separate bug for this.
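If the test needed a stopgap on the client side while the underlying slowness is fixed, raising that limit in /etc/login.defs would look like this (180 is an arbitrary example value, not something settled on in this report):

```
# /etc/login.defs
# Max time in seconds for login(1) to complete; the default is 60.
LOGIN_TIMEOUT   180
```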

Comment 24 Adam Williamson 2018-12-20 16:45:19 UTC
Filed https://bugzilla.redhat.com/show_bug.cgi?id=1661273 .

