Bug 2420946

Summary: sssd: Cannot log in during Kerberos infrastructure outages
Product: [Fedora] Fedora
Reporter: Florian Weimer <fweimer>
Component: sssd
Assignee: sssd-maintainers <sssd-maintainers>
Status: NEW
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: unspecified
Version: 43
CC: abokovoy, atikhono, lslebodn, pbrezina, sbose, ssorce, sssd-maintainers
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Linux   

Description Florian Weimer 2025-12-10 10:20:04 UTC
Earlier today, I couldn't log in until I pulled the Ethernet plug and caused the VPN connection to go down. The journal contains a Kerberos error.

This can't be a security feature because making the network go down fixed it.

Dec 10 10:10:58 fweimer-oldenburg.csb.redhat.com systemd[1]: redhat-csb-metrics.service: Deactivated successfully.
Dec 10 10:10:58 fweimer-oldenburg.csb.redhat.com systemd[1]: Finished redhat-csb-metrics.service - Run CSB Metrics gathering playbook.
Dec 10 10:10:58 fweimer-oldenburg.csb.redhat.com audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=redhat-csb-metrics comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Dec 10 10:10:58 fweimer-oldenburg.csb.redhat.com audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=redhat-csb-metrics comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Dec 10 10:27:40 fweimer-oldenburg.csb.redhat.com systemd[1]: Starting sssd-kcm.service - SSSD Kerberos Cache Manager...
Dec 10 10:27:40 fweimer-oldenburg.csb.redhat.com systemd[1]: Started sssd-kcm.service - SSSD Kerberos Cache Manager.
Dec 10 10:27:40 fweimer-oldenburg.csb.redhat.com audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sssd-kcm comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Dec 10 10:27:40 fweimer-oldenburg.csb.redhat.com sssd_kcm[41444]: Starting up
Dec 10 10:27:41 fweimer-oldenburg.csb.redhat.com krb5_child[41429]: Generic error (see e-text)
Dec 10 10:27:41 fweimer-oldenburg.csb.redhat.com audit[41415]: AUDIT1100 pid=41415 uid=0 auid=18797 ses=2 subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 msg='op=PAM:authentication grantors=? acct="fweimer" exe="/usr/libexec/gdm-session-worker" hostname=fweimer-oldenburg.csb.redhat.com addr=? terminal=/dev/tty1 res=failed'
Dec 10 10:27:41 fweimer-oldenburg.csb.redhat.com gdm-password][41415]: pam_sss(gdm-password:auth): authentication failure; logname=fweimer uid=0 euid=0 tty=/dev/tty1 ruser= rhost= user=fweimer
Dec 10 10:27:41 fweimer-oldenburg.csb.redhat.com gdm-password][41415]: pam_sss(gdm-password:auth): received for user fweimer: 4 (System error)
Dec 10 10:27:41 fweimer-oldenburg.csb.redhat.com gdm-password][41415]: gkr-pam: unlocked login keyring
Dec 10 10:27:43 fweimer-oldenburg.csb.redhat.com audit[41415]: AUDIT1112 pid=41415 uid=0 auid=18797 ses=2 subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 msg='uid=18797 exe="/usr/libexec/gdm-session-worker" hostname=? addr=? terminal=? res=failed'
Dec 10 10:27:50 fweimer-oldenburg.csb.redhat.com krb5_child[41456]: Generic error (see e-text)
Dec 10 10:27:50 fweimer-oldenburg.csb.redhat.com gdm-password][41449]: pam_sss(gdm-password:auth): authentication failure; logname=fweimer uid=0 euid=0 tty=/dev/tty1 ruser= rhost= user=fweimer
Dec 10 10:27:50 fweimer-oldenburg.csb.redhat.com gdm-password][41449]: pam_sss(gdm-password:auth): received for user fweimer: 4 (System error)
Dec 10 10:27:50 fweimer-oldenburg.csb.redhat.com gdm-password][41449]: gkr-pam: unlocked login keyring
Dec 10 10:27:50 fweimer-oldenburg.csb.redhat.com audit[41449]: AUDIT1100 pid=41449 uid=0 auid=18797 ses=2 subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 msg='op=PAM:authentication grantors=? acct="fweimer" exe="/usr/libexec/gdm-session-worker" hostname=fweimer-oldenburg.csb.redhat.com addr=? terminal=/dev/tty1 res=failed'
Dec 10 10:27:52 fweimer-oldenburg.csb.redhat.com audit[41449]: AUDIT1112 pid=41449 uid=0 auid=18797 ses=2 subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 msg='uid=18797 exe="/usr/libexec/gdm-session-worker" hostname=? addr=? terminal=? res=failed'
Dec 10 10:27:55 fweimer-oldenburg.csb.redhat.com gnome-shell[4737]: Gio.DBusError: GDBus.Error:org.freedesktop.DBus.Error.Failed: Set global engine failed: Operation was cancelled
                                                                    
                                                                    (No stack trace)
Dec 10 10:27:59 fweimer-oldenburg.csb.redhat.com krb5_child[41466]: Generic error (see e-text)
Dec 10 10:27:59 fweimer-oldenburg.csb.redhat.com gdm-password][41461]: pam_sss(gdm-password:auth): authentication failure; logname=fweimer uid=0 euid=0 tty=/dev/tty1 ruser= rhost= user=fweimer
Dec 10 10:27:59 fweimer-oldenburg.csb.redhat.com gdm-password][41461]: pam_sss(gdm-password:auth): received for user fweimer: 4 (System error)
Dec 10 10:27:59 fweimer-oldenburg.csb.redhat.com gdm-password][41461]: gkr-pam: unlocked login keyring
Dec 10 10:27:59 fweimer-oldenburg.csb.redhat.com audit[41461]: AUDIT1100 pid=41461 uid=0 auid=18797 ses=2 subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 msg='op=PAM:authentication grantors=? acct="fweimer" exe="/usr/libexec/gdm-session-worker" hostname=fweimer-oldenburg.csb.redhat.com addr=? terminal=/dev/tty1 res=failed'
Dec 10 10:28:00 fweimer-oldenburg.csb.redhat.com audit[41461]: AUDIT1112 pid=41461 uid=0 auid=18797 ses=2 subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 msg='uid=18797 exe="/usr/libexec/gdm-session-worker" hostname=? addr=? terminal=? res=failed'


Reproducible: Always

Comment 5 Alexander Bokovoy 2025-12-10 11:08:09 UTC
The backend considers itself online, so it attempts online authentication.
I guess it is a matter of waiting until the backend settles into the offline state, after which offline authentication will be tried.

The question is whether SSSD could improve detection of situations like this that can trigger offline state?

-1765328324 is KRB5KRB_ERR_GENERIC. In get_and_save_tgt(), SSSD calls krb5_get_init_creds_password(), which returns this error in very few cases: when the client cannot find a preauthentication module to process the response from the KDC, or when the KDC returns the error itself.

Another option is when a PKINIT preauthentication module fails to get the crypto primitives, but since we use krb5_get_init_creds_password(), we aren't dealing with PKINIT here.

So it might be a failure on the KDC side but you are saying making the network disconnect fixed it (presumably after connecting back to VPN?). Right?
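
For readers who want to check the error code quoted above, here is a minimal sketch of the arithmetic. It assumes the MIT krb5 convention that library error codes are the com_err table base (ERROR_TABLE_BASE_krb5, -1765328384) plus an offset, and that offsets below 128 correspond to RFC 4120 protocol error numbers. The helper name krb5_protocol_code is made up for illustration and is not part of SSSD or libkrb5.

```python
# Base of the MIT krb5 com_err error table (ERROR_TABLE_BASE_krb5 in krb5.h).
KRB5_ERROR_TABLE_BASE = -1765328384

# RFC 4120 protocol error number for KRB_ERR_GENERIC.
KRB_ERR_GENERIC = 60


def krb5_protocol_code(library_code: int) -> int:
    """Return the offset of a libkrb5 error code within the krb5 error table.

    Hypothetical helper for illustration. For offsets below 128 the result
    is the RFC 4120 protocol error number; larger offsets are library-level
    errors and have no protocol meaning.
    """
    return library_code - KRB5_ERROR_TABLE_BASE


if __name__ == "__main__":
    offset = krb5_protocol_code(-1765328324)
    print(offset, offset == KRB_ERR_GENERIC)
```

Run directly, this prints `60 True`, which matches the identification of the code as KRB5KRB_ERR_GENERIC.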

Comment 7 Florian Weimer 2025-12-10 11:28:20 UTC
(In reply to Alexander Bokovoy from comment #5)
> So it might be a failure on the KDC side but you are saying making the
> network disconnect fixed it (presumably after connecting back to VPN?).
> Right?

The infrastructure issue has been intermittent this morning. Right now, it's working again, so I can't tell what the e-text likely would have said.

Earlier, the network disconnect fixed my login issue, but authentication was still broken (sudo didn't work) after I brought up the VPN again (but it didn't matter because I had unlocked the screen at that point).

Comment 8 Simo Sorce 2025-12-10 16:57:28 UTC
Occasionally (once or twice a year) I get this issue too.

I have not yet been able to pinpoint why (each time this happens, I have no time to debug because I am due in a meeting or something similar).

One of my working hypotheses so far is that this happens when I hit an IdM server with a very busy LDAP service, and krb5kdc ends up timing out (or hitting some other error when contacting the LDAP server) and responding with an error.

If we could classify these kinds of errors as "not security relevant" and fall back to offline authentication, it would probably be a good thing.
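
To make the suggestion concrete, here is a rough sketch of what such a classification could look like. It is not SSSD code: the Action enum, the classify function, and the choice of which errors count as a credential verdict are all assumptions made for illustration. The error numbers themselves are the RFC 4120 protocol codes, offset by the MIT krb5 error table base. Whether KRB_ERR_GENERIC is actually safe to treat as non-security-relevant is exactly the open question in this bug.

```python
from enum import Enum

# Base of the MIT krb5 com_err error table (ERROR_TABLE_BASE_krb5).
KRB5_ERROR_TABLE_BASE = -1765328384

# RFC 4120 error numbers where the KDC has made a decision about the
# account or the credentials. Treating exactly this set as "security
# relevant" is an assumption of this sketch, not SSSD policy.
CREDENTIAL_VERDICTS = {
    6,   # KDC_ERR_C_PRINCIPAL_UNKNOWN
    23,  # KDC_ERR_KEY_EXPIRED
    24,  # KDC_ERR_PREAUTH_FAILED
    31,  # KRB_AP_ERR_BAD_INTEGRITY
}

# RFC 4120 error number for KRB_ERR_GENERIC, the error seen in the journal.
KRB_ERR_GENERIC = 60


class Action(Enum):
    """Hypothetical outcomes for a failed online authentication attempt."""
    DENY = "deny"                  # the KDC rejected the credentials
    TRY_OFFLINE = "try_offline"    # infrastructure problem, use cached credentials
    SYSTEM_ERROR = "system_error"  # unclassified, fail as happens today


def classify(library_code: int) -> Action:
    """Map a libkrb5 error code to a hypothetical follow-up action."""
    offset = library_code - KRB5_ERROR_TABLE_BASE
    if offset in CREDENTIAL_VERDICTS:
        return Action.DENY
    if offset == KRB_ERR_GENERIC:
        return Action.TRY_OFFLINE
    return Action.SYSTEM_ERROR


if __name__ == "__main__":
    # The code reported in comment 5.
    print(classify(-1765328324))
```

The point of the split is that a definite answer from the KDC about the credentials is never overridden by cached data, while an error that only says "something went wrong" is handled the same way as an unreachable KDC.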