795562 – Infinite loop checking Kerberos credentials

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 795562 - Infinite loop checking Kerberos credentials

Summary: Infinite loop checking Kerberos credentials

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	sssd
Sub Component:
Version:	6.2
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Stephen Gallagher
QA Contact:	IDM QE LIST
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	828190
TreeView+	depends on / blocked

Reported:	2012-02-20 22:13 UTC by Daniel Sands
Modified:	2020-05-02 16:46 UTC (History)
CC List:	6 users (show)
Fixed In Version:	sssd-1.8.0-12.el6
Doc Type:	Bug Fix
Doc Text:	Cause: A status of a server in the SSSD server list is reset after 30 seconds to allow retries. If a full cycle over the server list took more than 30 seconds, the cycle would start again Consequence: SSSD deployments using large server fail over lists might loop indefinitely Fix: The SSSD was fixed to only loop over the fail over list once Result: If the SSSD tries all the servers in the fail over list without succeeding, the operation always fails.
Clone Of:
Clones:	828190 (view as bug list)
Environment:
Last Closed:	2012-06-20 11:55:03 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
sssd_default logging (sanitized) (79.56 KB, application/x-gzip) 2012-02-25 00:48 UTC, Daniel Sands	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	SSSD sssd issues 2256	0	None	None	None	2020-05-02 16:46:24 UTC
Red Hat Product Errata	RHBA-2012:0747	0	normal	SHIPPED_LIVE	sssd bug fix and enhancement update	2012-06-19 19:31:43 UTC

Description Daniel Sands 2012-02-20 22:13:27 UTC

Description of problem:

I have seen this happen on several machines.

The sssd_be launches krb5_child to communicate with the Kerberos server. If krb5_child does not return a reply within the timeout period (default 15 seconds), sssd_be will kill the child and restart it.

The problem is that from the krb5 server it is apparent that the request is being made and answered within the first second most of the time. However, sssd_be never seems to get the message.

A more detailed analysis shows that handle_child_step does get called. It closes the output file descriptor for krb5_child, which immediately exits at that point. However, handle_child_done is never called.

The impact is greatest after the credentials expire (or if the password was entered wrong to begin with), when it begins sending invalid creds to the krb5 server, which responds by locking out the user after a certain number of attempts.

Version-Release number of selected component (if applicable):
sssd 1.5.1-66.el6_2.1.x86_64

How reproducible:
It happens at random intervals, possibly less than a day after the previous occurrence or possibly up to a week or two. So far no definite cause has been found.

Steps to Reproduce:
Unfortunately, I don't know yet. For starters though, leave the krb5_auth_timeout at its default.

Actual results:
krb5_child is constantly restarted, eventually locking out a user.

Expected results:
Credentials or error message received from krb5 server and passed through sssd to the originating process.

Additional info:
If krb5_child is sending its response but the sssd_be process never receives it, it could be indicative of a bug within libtevent, upon which sssd relies.

Setting krb5_auth_timeout to a large value, say 3600, will at least mitigate the password lockout problem (for us). However, it is yet to be seen whether or not this is a viable workaround for the whole problem.

Comment 2 Jakub Hrozek 2012-02-21 10:06:19 UTC

If handle_child_done() is never called would mean that the child never answered.

Daniel, do you have logs available? Especially the /var/log/sssd/sssd_$domain.log and /var/log/sssd/krb5_child.log are of interest.

You can generate logs by putting debug_level = 7 (or higher) into the domain section of sssd.conf

Comment 3 Daniel Sands 2012-02-25 00:48:23 UTC

Created attachment 565700 [details]
sssd_default logging (sanitized)

Here is the sssd_default log.  There are no krb5_child logs to speak of because when krb5_child is killed with -9, it apparently fails to add any logging.

Comment 4 Daniel Sands 2012-02-25 00:49:56 UTC

Just to clarify, I know that krb5_child finished its job because I was watching the network with WireShark while krb5_child was being restarted every 15 seconds.

Comment 5 Jakub Hrozek 2012-02-28 16:39:28 UTC

Upstream ticket:
https://fedorahosted.org/sssd/ticket/1214

Comment 6 Jakub Hrozek 2012-02-28 16:53:42 UTC

Thank you for providing the logs, Daniel. I identified the cause of the infinite loop from them, which is now being tracked in the upstream ticket #1214.

However, I'm concerned about the empty krb5_child.log. Even if the child is killed with a signal, it should always log at least a level-7 debug message saying "krb5_child started.". I'll look into reproducing that issue locally.

Comment 7 Jakub Hrozek 2012-03-02 13:19:23 UTC

(In reply to comment #6)
> However, I'm concerned about the empty krb5_child.log. Even if the child is
> killed with a signal, it should always log at least a level-7 debug message
> saying "krb5_child started.". I'll look into reproducing that issue locally.

There was actually a small bug where we would set up debugging to file *after* we attempted to print the "krb5_child started." message. 

There are also no tracing DEBUG messages in the child code, only error reporting, so unless there was an error, the child process might really not log anything..I suspect that was your case, the child called a blocking krb5 function, hang there, was killed by the responder without having a chance to log any error.

Comment 8 Daniel Sands 2012-03-05 18:13:42 UTC

That's plausible.  In my case Wireshark showed that the KRB5 request was sent and a reply received.  And once the sssd backend closed its send pipe, krb5_child seemed to exit normally.

Comment 9 Jenny Severance 2012-03-08 14:56:04 UTC

Please add steps to reproduce this issue

Comment 10 Jakub Hrozek 2012-03-08 15:29:27 UTC

(In reply to comment #9)
> Please add steps to reproduce this issue

The process of connecting to the server must take at least 30 seconds, which is the (currently hardcoded) interval after which we reset the status of a server.

This could be achieved by putting two krb5 servers into the fail over list, setting the krb5_auth_timeout to, say, 20 seconds and starting a firewall with DROP rules on the servers. The authentication should take 40 seconds, after which the status of the first server would be reset and authentication would cycle.

Comment 12 Jakub Hrozek 2012-04-03 18:27:13 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: A status of a server in the SSSD server list is reset after 30 seconds to allow retries. If a full cycle over the server list took more than 30 seconds, the cycle would start again

Consequence: SSSD deployments using large server fail over lists might loop indefinitely

Fix: The SSSD was fixed to only loop over the fail over list once

Result: If the SSSD tries all the servers in the fail over list without succeeding, the operation always fails.

Comment 13 Kaushik Banerjee 2012-04-25 10:27:02 UTC

Tested before fix:
The first server status is always reset.

With the patch (sssd-1.8.0-23.el6):
The cycle doesn't restart. The last server status is reset.


Verified in version:
# rpm -qi sssd | head
Name        : sssd                         Relocations: (not relocatable)
Version     : 1.8.0                             Vendor: Red Hat, Inc.
Release     : 23.el6                        Build Date: Fri 20 Apr 2012 11:30:39 PM IST
Install Date: Mon 23 Apr 2012 08:48:40 PM IST      Build Host: x86-003.build.bos.redhat.com
Group       : Applications/System           Source RPM: sssd-1.8.0-23.el6.src.rpm
Size        : 7874744                          License: GPLv3+
Signature   : (none)
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
URL         : http://fedorahosted.org/sssd/
Summary     : System Security Services Daemon

Comment 15 errata-xmlrpc 2012-06-20 11:55:03 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0747.html

Note You need to log in before you can comment on or make changes to this bug.