Bug 1661273

Summary: Expensive SELinux operation on first login can exceed login(1) timeout, cause failure
Product: [Fedora] Fedora Reporter: Adam Williamson <awilliam>
Component: sssdAssignee: sssd-maintainers <sssd-maintainers>
Status: CLOSED EOL QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 31CC: abokovoy, jhrozek, lslebodn, mzidek, pbrezina, rharwood, rob.verduijn, sbose, ssorce
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-11-24 18:45:25 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
/var/log tarball from a recent aarch64 failure none

Description Adam Williamson 2018-12-20 16:45:01 UTC
This is another consequence of https://bugzilla.redhat.com/show_bug.cgi?id=1644919 , but I'm filing it separately as we could potentially resolve it without actually making the operation faster.

As discussed in that bug, an expensive SELinux operation is commonly run on first login to a system as a domain user. The amount of time it takes seems to vary quite a lot across different configurations, but in the openQA aarch64 tests, it seems to be taking *90* seconds:

(Thu Dec 20 05:49:32 2018) [sssd[be[domain.local]]] [selinux_child_create_buffer] (0x4000): buffer size: 43
(Thu Dec 20 05:49:32 2018) [sssd[be[domain.local]]] [child_handler_setup] (0x2000): Setting up signal handler up for pid [23892]
(Thu Dec 20 05:49:32 2018) [sssd[be[domain.local]]] [child_handler_setup] (0x2000): Signal handler set up for pid [23892]
(Thu Dec 20 05:49:32 2018) [sssd[be[domain.local]]] [sdap_process_result] (0x2000): Trace: sh[0xaaaad0944d60], connected[1], ops[(nil)], ldap[0xaaaad095d4e0]
(Thu Dec 20 05:49:32 2018) [sssd[be[domain.local]]] [sdap_process_result] (0x2000): Trace: end of ldap_result list
(Thu Dec 20 05:49:32 2018) [sssd[be[domain.local]]] [write_pipe_handler] (0x0400): All data has been sent!
(Thu Dec 20 05:51:02 2018) [sssd[be[domain.local]]] [child_sig_handler] (0x1000): Waiting for child [23892].
(Thu Dec 20 05:51:02 2018) [sssd[be[domain.local]]] [child_sig_handler] (0x0100): child [23892] finished successfully.

and because it takes *so* long, we actually hit a timeout during login. This doesn't seem to be the systemd timeout - it's more that getty just suddenly exits, and systemd sees that and restarts it:

Dec 20 02:50:30 client003.domain.local audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=getty@tty2 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Dec 20 02:50:30 client003.domain.local systemd[1]: getty: Service has no hold-off time (RestartSec=0), scheduling restart.
Dec 20 02:50:30 client003.domain.local systemd[1]: getty: Scheduled restart job, restart counter is at 1.
Dec 20 02:50:30 client003.domain.local systemd[1]: Stopped Getty on tty2.

Looking into this, it seems that getty calls 'login' after reading the username - per https://unix.stackexchange.com/questions/144303/change-tty-login-timeout-archlinux - and login has a default 60 second timeout. From 'man login':

CONFIG FILE ITEMS
       login  reads  the /etc/login.defs(5) configuration file.  Note that the configuration file could be distributed with another package (e.g. shadow-utils).  The following configuration items are relevant for login(1):

...

       LOGIN_TIMEOUT (number)
           Max time in seconds for login.  The default value is 60.

I'm filing this as a bug against sssd, and not against util-linux, because 60 really does seem like a *reasonable* default, login certainly shouldn't take longer than that. So I don't think we should change the default distro-wide. However, we could potentially make it 120 or something for systems enrolled in a domain; or we could (possibly? I'm not sure if this is technically possible) tweak the implementation in sssd such that login knows login has 'succeeded' before the expensive SELinux operation happens.

For openQA, I guess I'll tweak the test to set a higher LOGIN_TIMEOUT value in /etc/login.defs after enrolling the system and before attempting a login...

Comment 1 Ben Cotton 2019-08-13 17:00:29 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle.
Changing version to '31'.

Comment 2 Ben Cotton 2019-08-13 19:32:15 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle.
Changing version to 31.

Comment 3 Adam Williamson 2019-09-13 18:02:30 UTC
For the record, this is still a problem. It's still taking about 90 seconds on aarch64.

Comment 4 Adam Williamson 2019-09-13 18:06:00 UTC
I do have the LOGIN_TIMEOUT workaround in the openQA tests, btw, but they seem to have started failing again with an error "Authentication service cannot retrieve authentication info", which I think is different from before. I'll have to figure out where that error is coming from and see if that thing has a configurable timeout.

Comment 5 Adam Williamson 2019-09-13 18:21:05 UTC
Welp, that's a PAM error - PAM_AUTHINFO_UNAVAIL - and it's probably coming from the SSSD PAM client, I guess...but we already bumped the sbus timeout to 120 seconds, and I don't know what other timeout we might be hitting here :/

I'll attach a current /var/log tarball from an affected aarch64 test.

Comment 6 Adam Williamson 2019-09-13 18:23:01 UTC
Created attachment 1614953 [details]
/var/log tarball from a recent aarch64 failure

Comment 7 Sumit Bose 2019-09-16 10:57:11 UTC
Hi Adam,

to which value did you set LOGIN_TIMEOUT? According to /var/log/secure

Sep 11 18:20:14 localhost login[24669]: pam_sss(login:auth): authentication success; logname=LOGIN uid=0 euid=0 tty=tty2 ruser= rhost= user=test1
Sep 11 18:21:42 localhost login[24669]: Authentication service cannot retrieve authentication info

The error happens 1min and 28s after the successful authentication. SSSD's selinux_child needed a bit longer:

Sep 11 18:20:14 localhost login[24669]: pam_sss(login:auth): authentication success; logname=LOGIN uid=0 euid=0 tty=tty2 ruser= rhost= user=test1
Sep 11 18:21:42 localhost login[24669]: Authentication service cannot retrieve authentication info

So my guess is that LOGIN_TIMEOUT is somewhat around 90s?

bye,
Sumit

Comment 8 Adam Williamson 2019-09-16 14:52:03 UTC
No, it's 120. And IIRC, that is not the error we get when we hit that timeout (which is what was happening the first time I encountered this bug), we instead get 'login failed' or something like it.

Comment 9 Fedora Admin user for bugzilla script actions 2020-06-18 14:59:22 UTC
This package has changed maintainer in the Fedora.
Reassigning to the new maintainer of this component.

Comment 10 Ben Cotton 2020-11-03 15:06:28 UTC
This message is a reminder that Fedora 31 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 31 on 2020-11-24.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '31'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 31 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged  change the 'version' to a later Fedora 
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 11 Ben Cotton 2020-11-24 18:45:25 UTC
Fedora 31 changed to end-of-life (EOL) status on 2020-11-24. Fedora 31 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.