With sssd-2.0.0-4.fc29 - https://bodhi.fedoraproject.org/updates/FEDORA-2018-6b9c9c4f85 - the openQA FreeIPA tests fail:

https://openqa.fedoraproject.org/tests/300815
https://openqa.fedoraproject.org/tests/300817

Since the update was pushed stable, the openQA FreeIPA tests for *every* F29 update are failing this way, as they all now use that version of sssd.

In the test, a FreeIPA server is deployed on one VM with the domain name DOMAIN.LOCAL, and a user called 'test1' with the right to log in to any host in the domain is created on the server end. Another VM is enrolled as a client in the domain, and once that has completed, the test tries to log in at a console as 'test1' with the correct password.

Prior to this update, that worked fine. With this update, the login process seems to get stuck and never completes - the system just sits at this screen:

https://openqa.fedoraproject.org/tests/300815#step/freeipa_client/19

until the test times out (I've tried it with timeouts of 10 and 60 seconds, and it behaved the same; I'm trying with a timeout of 180 seconds now...before this update, 10 seconds was enough).

Will attach logs from both server and client ends. I'm also going to see if this affects my own personal setup (I run a FreeIPA domain, and my desktop is F29...)
Doesn't seem to affect my 'live' setup. So, first guess would be that the bug happens when it's the *first* time logging into an account on the system?
Update: I didn't change the timeout setting correctly before. The test by default actually waits 30 seconds for the prompt to appear after entering the password. If I change that timeout to 180, it works; it looks like in the run I just did, it actually took about 38 seconds for the login to succeed. With the previous sssd, it looks like it took about 3 seconds, from the openQA logs.
Created attachment 1499710 [details] client /var/log tarball (from openQA staging job #389236)
Created attachment 1499711 [details] server /var/log tarball (from openQA staging job #389235) Note that in these logs, the attempt to login as test1 happens at about 22:35:29 UTC, which seems to be 18:35:29 in system local time. To view the journal, you can use `journalctl --file var/log/journal/(longstring)/system.journal --utc`. Just press tab to fill in "(longstring)". These logs are from one of the tests where the login timeout was set to 30 seconds, so the login failed. I don't have logs from the case where the timeout was extended and the login succeeds right now, but could adjust the test to upload them if desired.
Would it be possible to adjust the tests to include debug_level=9 in all sections in sssd.conf? Currently the logs are empty.
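For reference, a minimal sketch of what that change would look like. This is a hypothetical sssd.conf fragment - the actual section names (services list, domain name) depend on how the test machine was enrolled, so treat everything except the `debug_level = 9` lines as illustrative:

```
# sssd.conf sketch: debug_level = 9 added to every section.
# Section names below are examples; match them to the real config.
[sssd]
services = nss, pam
domains = domain.local
debug_level = 9

[nss]
debug_level = 9

[pam]
debug_level = 9

[domain/domain.local]
debug_level = 9
```

After editing, restarting sssd (e.g. `systemctl restart sssd`) makes the new debug level take effect.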
Sorry, I looked at the server logs. The client logs are OK. Removing the needinfo flag.
The requests in the SSSD log are fairly fast, e.g. the account request which I was suspecting would be slow takes 4 seconds:

(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_dp_send_req] (0x0100): Sending request with the following data:
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): command: SSS_PAM_ACCT_MGMT
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): domain: domain.local
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): user: test1
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): service: login
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): tty: tty2
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): ruser: not set
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): rhost: not set
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): authtok type: 0
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): newauthtok type: 0
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): priv: 1
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): cli_pid: 2312
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): logon name: test1
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_print_data] (0x0100): flags: 0
(Wed Oct 31 18:35:31 2018) [sssd[pam]] [pam_dom_forwarder] (0x0100): pam_dp_send_req returned 0
(Wed Oct 31 18:35:35 2018) [sssd[pam]] [pam_dp_send_req_done] (0x0200): received: [0 (Success)][domain.local]
(Wed Oct 31 18:35:35 2018) [sssd[pam]] [pam_reply] (0x0200): pam_reply called with result [0]: Success.
(Wed Oct 31 18:35:35 2018) [sssd[pam]] [filter_responses] (0x0100): [pam_response_filter] not available, not fatal.
(Wed Oct 31 18:35:35 2018) [sssd[pam]] [pam_reply] (0x0200): blen: 29

But then there is a delay between sending a reply to the account PAM phase and the session phase:

(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_cmd_setcred] (0x0100): entering pam_cmd_setcred
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [sss_parse_name_for_domains] (0x0200): name 'test1' matched expression for domain 'domain.local', user is test1
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): command: SSS_PAM_SETCRED
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): domain: domain.local
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): user: test1
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): service: login
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): tty: tty2
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): ruser: not set
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): rhost: not set
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): authtok type: 0
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): newauthtok type: 0
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): priv: 1
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): cli_pid: 2312
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): logon name: test1
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_print_data] (0x0100): flags: 2
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [cache_req_send] (0x0400): CR #3: New request 'Initgroups by name'
(Wed Oct 31 18:36:07 2018) [sssd[pam]] [cache_req_process_input] (0x0400): CR #3: Parsing input name [test1]

Looking at the journal around that time, there are these messages:

Oct 31 23:35:35 client003.domain.local audit[2312]: USER_ACCT pid=2312 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:local_login_t:s0-s0:c0.c1023 msg='op=PAM:accounting grantors=pam_unix,pam_sss,pam_permit acct="test1@DOMA>

--> this is pam_sss granting the account phase

Oct 31 23:36:03 client003.domain.local realmd[919]: quitting realmd service after timeout

....hmm, why did the realmd service quit? I guess just an idle timeout?

Oct 31 23:36:06 client003.domain.local dbus-daemon[645]: Unknown username "systemd-timesync" in message bus configuration file

Hmm, strange, but doesn't sound related?

Oct 31 23:36:06 client003.domain.local dbus-daemon[645]: [system] Reloaded configuration
Oct 31 23:36:06 client003.domain.local realmd[919]: stopping service
Oct 31 23:36:06 client003.domain.local audit[1]: USER_AVC pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='avc: received policyload notice (seqno=2) exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?'
Oct 31 23:36:07 client003.domain.local audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=realmd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=s>
Oct 31 23:36:07 client003.domain.local audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=realmd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=su>
Oct 31 23:36:07 client003.domain.local systemd[1]: Started Getty on tty6.
Oct 31 23:36:07 client003.domain.local audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=getty@tty6 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? r>

looks like systemd spawned a tty??

Oct 31 23:36:07 client003.domain.local audit[2312]: CRED_ACQ pid=2312 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:local_login_t:s0-s0:c0.c1023 msg='op=PAM:setcred grantors=pam_localuser,pam_sss acct="test1" >

Here is the setcred PAM phase, which corresponds to:

(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_cmd_setcred] (0x0100): entering pam_cmd_setcred

But so far I don't know how to explain the delay between calling into account and setcred/session. Looking at the PAM configuration, there are several modules called before reaching the system-auth file which includes pam_sss:

# pam_selinux.so close should be the first session rule
session    required     pam_selinux.so close
session    required     pam_loginuid.so
session    optional     pam_console.so
# pam_selinux.so open should only be followed by sessions to be executed in the user context
session    required     pam_selinux.so open
session    required     pam_namespace.so
session    optional     pam_keyinit.so force revoke
session    include      system-auth

I wonder if it is possible to re-run the test with e.g. ssh or another service?
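As an aside, delays like the 32-second gap between pam_reply and pam_cmd_setcred are easy to miss when scrolling a large log. A small sketch that flags long silences between consecutive sssd debug-log timestamps (the timestamp format is the one in the logs above; the function names and threshold are my own, illustrative choices):

```python
from datetime import datetime

# Timestamp format used in sssd debug logs, e.g. "(Wed Oct 31 18:35:35 2018)"
TS_FORMAT = "%a %b %d %H:%M:%S %Y"

def parse_ts(line):
    """Extract the leading "(...)" timestamp from an sssd log line, or None."""
    if not line.startswith("("):
        return None
    try:
        return datetime.strptime(line[1:line.index(")")], TS_FORMAT)
    except ValueError:
        return None

def find_gaps(lines, threshold_secs=5):
    """Yield (gap_seconds, line) for lines that follow a long silence."""
    prev = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip lines without a parseable timestamp
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap >= threshold_secs:
                yield gap, line
        prev = ts

# Example using the two log lines bracketing the suspicious delay:
log = [
    "(Wed Oct 31 18:35:35 2018) [sssd[pam]] [pam_reply] (0x0200): blen: 29",
    "(Wed Oct 31 18:36:07 2018) [sssd[pam]] [pam_cmd_setcred] (0x0100): entering pam_cmd_setcred",
]
for gap, line in find_gaps(log):
    print(f"{gap:.0f}s gap before: {line}")
```

Run against the full sssd_pam.log, this would print one line per gap over the threshold, pointing straight at the stall.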
Alternatively, would it be possible to run the same test with a recent fedora tree, but an older sssd to try to bisect the issue and see if it was really caused by sssd or something else?
We already have fairly strong proof of that: the bug first appeared only in the test of the sssd update when it was in updates-testing, then, as soon as the sssd update was pushed to stable, it started happening in every other update test as well.

The spawn of tty6 is just what the test does on failure - when it timed out and gave up waiting for the login to succeed, it switched to tty6 and logged in as the local root account, to upload logs and so on. It's not related to the bug.

As I mentioned, if you like, I can tweak the test to upload logs even on success, so we'll have slightly cleaner logs from the current case (where we wait long enough for the login to succeed, and the test completes).
(In reply to Adam Williamson from comment #9)
> We already have fairly strong proof of that: the bug first appeared only in
> the test of the sssd update when it was in updates-testing, then as soon as
> the sssd update was pushed to stable, started happening in every other
> update test as well.

I see

> The spawn of tty6 is just what the test does on failure - when it timed out
> and gave up waiting for the login to succeed, it switched to tty6 and logged
> in as the local root account, to upload logs and so on. It's not related to
> the bug.
>
> As I mentioned, if you like, I can tweak the test to upload logs even on
> success, so we'll have slightly cleaner logs from the current case (where we
> wait long enough for the login to succeed, and the test completes).

Yes please.
Created attachment 1502245 [details] client /var/log tarball (from openQA staging job #390554)
Created attachment 1502246 [details] server /var/log tarball (from openQA staging job #390553)
OK, there's a couple of new log tarballs from a case where we waited long enough for the login to succeed. It looks like the login attempt started (test hit 'enter' after typing password) at 22:24:51 UTC and the login completed at 22:25:12 UTC (21 seconds later) - weirdly, the time it takes seems to have become a *bit* shorter in the last several days of test runs, it's dropped from about 35 seconds in each test to about 20 seconds. As I mentioned, before the problematic sssd update, this took about 2 seconds.
One more question, is it possible to run the tests with a scratch package? There is a handful of patches between the 'good' and the 'bad' update but not so many as to make bisect impractical.
yeah, it's possible. Just throw me the scratch builds to try and I'll try 'em.
(In reply to Adam Williamson from comment #15)
> yeah, it's possible. Just throw me the scratch builds to try and I'll try
> 'em.

I'll do that (and thank you very much for the offer), unless there is a way I can run the tests myself?
Not unless you want to set up your own pet openQA deployment, which is possible but a bit of work. Probably easier just for me to do it :)
(In reply to Adam Williamson from comment #17)
> Not unless you want to set up your own pet openQA deployment, which is
> possible but a bit of work. Probably easier just for me to do it :)

I totally forgot about this bug :-( Here is a build that tries to "bisect" the issue by applying only half of the patches between -3 and -4 in rawhide:

https://koji.fedoraproject.org/koji/taskinfo?taskID=31064701
Sorry, me too. I've got a bit of a backlog ATM but I'll try to get back and give that build a shot soon.
Well, we actually figured out the cause of this a different way - see https://bugzilla.redhat.com/show_bug.cgi?id=1654537#c9 . Jakub realized there that the key change is https://pagure.io/SSSD/sssd/c/945865a , which basically makes a rather resource-intensive SELinux-related operation happen pretty much every time you log in to a system as a FreeIPA domain user for the first time. Before that change it happened much more rarely.

That was a genuine fix for a genuine bug, though, so it's a bit tricky to "fix" this - we can't just revert that. One suggestion is to try and make the SELinux operation smoother.

The other thing I'd note is that *the SELinux operation itself* is marked as being a 'workaround' for a much older bug, https://bugzilla.redhat.com/show_bug.cgi?id=1186422 . That bug seems to have been resolved now. Can we look again at whether it's still necessary for sssd to do this at all, with that fix?
To be clear, I'm talking about this here: https://pagure.io/SSSD/sssd/blob/945865a/f/src/providers/ipa/selinux_child.c#_144 note the comment: /* This is a workaround for * https://bugzilla.redhat.com/show_bug.cgi?id=1186422 to make sure * the directories are created with the expected permissions */
The comment is really just related to the umask() call on line 154 in current master and the subsequent umask re-set. It shouldn't even be required with modern libsemanage versions, but at the same time, the umask is not the slow part, it's setting the context. I filed https://bugzilla.redhat.com/show_bug.cgi?id=1660546 to ask the semanage developers what we can do about speeding the operation up. A bug report was the best way of tracking the issue esp. given that holidays are coming up.
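For what it's worth, the save-and-restore pattern being discussed (set a restrictive umask around the call that creates directories, then put the caller's umask back) looks roughly like this. Sketched in Python rather than the C of selinux_child.c, with a hypothetical helper name:

```python
import os

def with_strict_umask(func):
    """Run func() with umask 0o077, restoring the previous umask afterwards.

    Mirrors the workaround pattern: anything func() creates gets
    restrictive permissions regardless of the caller's umask.
    """
    old = os.umask(0o077)  # os.umask() sets the new mask and returns the old one
    try:
        return func()
    finally:
        os.umask(old)  # always restore, even if func() raises
```

The point in the comment stands, though: this part is cheap; the expensive part is the semanage context operation that happens inside the equivalent of func().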
Note that in the aarch64 tests, this seems to be taking about *90* seconds:

(Thu Dec 20 05:49:32 2018) [sssd[be[domain.local]]] [selinux_child_create_buffer] (0x4000): buffer size: 43
(Thu Dec 20 05:49:32 2018) [sssd[be[domain.local]]] [child_handler_setup] (0x2000): Setting up signal handler up for pid [23892]
(Thu Dec 20 05:49:32 2018) [sssd[be[domain.local]]] [child_handler_setup] (0x2000): Signal handler set up for pid [23892]
(Thu Dec 20 05:49:32 2018) [sssd[be[domain.local]]] [sdap_process_result] (0x2000): Trace: sh[0xaaaad0944d60], connected[1], ops[(nil)], ldap[0xaaaad095d4e0]
(Thu Dec 20 05:49:32 2018) [sssd[be[domain.local]]] [sdap_process_result] (0x2000): Trace: end of ldap_result list
(Thu Dec 20 05:49:32 2018) [sssd[be[domain.local]]] [write_pipe_handler] (0x0400): All data has been sent!
(Thu Dec 20 05:51:02 2018) [sssd[be[domain.local]]] [child_sig_handler] (0x1000): Waiting for child [23892].
(Thu Dec 20 05:51:02 2018) [sssd[be[domain.local]]] [child_sig_handler] (0x0100): child [23892] finished successfully.

and because it takes *so* long, we actually hit, I think, some other timeout during login. This doesn't seem to be the systemd timeout - it's more that getty just suddenly exits, and systemd sees that and restarts it:

Dec 20 02:50:30 client003.domain.local audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=getty@tty2 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Dec 20 02:50:30 client003.domain.local systemd[1]: getty: Service has no hold-off time (RestartSec=0), scheduling restart.
Dec 20 02:50:30 client003.domain.local systemd[1]: getty: Scheduled restart job, restart counter is at 1.
Dec 20 02:50:30 client003.domain.local systemd[1]: Stopped Getty on tty2.
Looking into this, it seems that getty calls 'login' after reading the username - per https://unix.stackexchange.com/questions/144303/change-tty-login-timeout-archlinux - and login has a default 60 second timeout. From 'man login':

CONFIG FILE ITEMS
    login reads the /etc/login.defs(5) configuration file. Note that the configuration file could be distributed with another package (e.g. shadow-utils). The following configuration items are relevant for login(1):
    ...
    LOGIN_TIMEOUT (number)
        Max time in seconds for login. The default value is 60.

so I think that's what we're hitting. I'll file a separate bug for this.
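If one wanted to work around this at the test level rather than fix the underlying slowness, the timeout can presumably be raised in /etc/login.defs on the client. The value here is just an example, not a recommendation:

```
# /etc/login.defs fragment - raise login(1)'s idle timeout
# from the 60-second default (example value)
LOGIN_TIMEOUT   180
```

This only papers over the symptom, of course; the 90-second SELinux operation would still delay every first login.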
Filed https://bugzilla.redhat.com/show_bug.cgi?id=1661273 .
This message is a reminder that Fedora 29 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 29 on 2019-11-26. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '29'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 29 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
I'm pretty sure the basic issue here (the login taking a long time) is still a thing.
Yep, this is still commonly breaking our openQA tests on aarch64. It's quite a bad problem.
This package has changed maintainer in the Fedora. Reassigning to the new maintainer of this component.
Hi, SELinux developer here... 90s delay during login definitely sounds bad :/ Can you point us to the specific function that is slow? Is it sss_seuser_exists()? If I'm reading the comments right, it's also slow on other arches than aarch64, just less so (~20 seconds), right?
Jakub actually filed a separate bug with details (so many bugs, sorry, I lost track of that one): https://bugzilla.redhat.com/show_bug.cgi?id=1660546
This bug appears to have been reported against 'rawhide' during the Fedora 33 development cycle. Changing version to 33.
I don't think there's any point having both bugs open any more, let's close this as a dupe. *** This bug has been marked as a duplicate of bug 1660546 ***