Bug 795562

Summary: Infinite loop checking Kerberos credentials
Product: Red Hat Enterprise Linux 6 Reporter: Daniel Sands <dnsands>
Component: sssdAssignee: Stephen Gallagher <sgallagh>
Status: CLOSED ERRATA QA Contact: IDM QE LIST <seceng-idm-qe-list>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.2CC: dpal, grajaiya, jgalipea, jhrozek, kbanerje, prc
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: sssd-1.8.0-12.el6 Doc Type: Bug Fix
Doc Text:
Cause: A status of a server in the SSSD server list is reset after 30 seconds to allow retries. If a full cycle over the server list took more than 30 seconds, the cycle would start again Consequence: SSSD deployments using large server fail over lists might loop indefinitely Fix: The SSSD was fixed to only loop over the fail over list once Result: If the SSSD tries all the servers in the fail over list without succeeding, the operation always fails.
Story Points: ---
Clone Of:
: 828190 (view as bug list) Environment:
Last Closed: 2012-06-20 11:55:03 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 828190    
Attachments:
Description Flags
sssd_default logging (sanitized) none

Description Daniel Sands 2012-02-20 22:13:27 UTC
Description of problem:

I have seen this happen on several machines.

The sssd_be launches krb5_child to communicate with the Kerberos server.  If krb5_child does not return a reply within the timeout period (default 15 seconds), sssd_be will kill the child and restart it.

The problem is that from the krb5 server it is apparent that the request is being made and answered within the first second most of the time.  However, sssd_be never seems to get the message.

A more detailed analysis shows that handle_child_step does get called.  It closes the output file descriptor for krb5_child, which immediately exits at that point.  However, handle_child_done is never called.

The impact is greatest after the credentials expire (or if the password was entered wrong to begin with), when it begins sending invalid creds to the krb5 server, which responds by locking out the user after a certain number of attempts.

Version-Release number of selected component (if applicable):
sssd 1.5.1-66.el6_2.1.x86_64

How reproducible:
It happens at random intervals, possibly less than a day after the previous occurrence or possibly up to a week or two.  So far no definite cause has been found.

Steps to Reproduce:
Unfortunately, I don't know yet.  For starters though, leave the krb5_auth_timeout at its default.
  
Actual results:
krb5_child is constantly restarted, eventually locking out a user.

Expected results:
Credentials or error message received from krb5 server and passed through sssd to the originating process.

Additional info:
If krb5_child is sending its response but the sssd_be process never receives it, it could be indicative of a bug within libtevent, upon which sssd relies.

Setting krb5_auth_timeout to a large value, say 3600, will at least mitigate the password lockout problem (for us).  However, it is yet to be seen whether or not this is a viable workaround for the whole problem.

Comment 2 Jakub Hrozek 2012-02-21 10:06:19 UTC
If handle_child_done() is never called would mean that the child never answered.

Daniel, do you have logs available? Especially the /var/log/sssd/sssd_$domain.log and /var/log/sssd/krb5_child.log are of interest.

You can generate logs by putting debug_level = 7 (or higher) into the domain section of sssd.conf

Comment 3 Daniel Sands 2012-02-25 00:48:23 UTC
Created attachment 565700 [details]
sssd_default logging (sanitized)

Here is the sssd_default log.  There are no krb5_child logs to speak of because when krb5_child is killed with -9, it apparently fails to add any logging.

Comment 4 Daniel Sands 2012-02-25 00:49:56 UTC
Just to clarify, I know that krb5_child finished its job because I was watching the network with WireShark while krb5_child was being restarted every 15 seconds.

Comment 5 Jakub Hrozek 2012-02-28 16:39:28 UTC
Upstream ticket:
https://fedorahosted.org/sssd/ticket/1214

Comment 6 Jakub Hrozek 2012-02-28 16:53:42 UTC
Thank you for providing the logs, Daniel. I identified the cause of the infinite loop from them, which is now being tracked in the upstream ticket #1214.

However, I'm concerned about the empty krb5_child.log. Even if the child is killed with a signal, it should always log at least a level-7 debug message saying "krb5_child started.". I'll look into reproducing that issue locally.

Comment 7 Jakub Hrozek 2012-03-02 13:19:23 UTC
(In reply to comment #6)
> However, I'm concerned about the empty krb5_child.log. Even if the child is
> killed with a signal, it should always log at least a level-7 debug message
> saying "krb5_child started.". I'll look into reproducing that issue locally.

There was actually a small bug where we would set up debugging to file *after* we attempted to print the "krb5_child started." message. 

There are also no tracing DEBUG messages in the child code, only error reporting, so unless there was an error, the child process might really not log anything..I suspect that was your case, the child called a blocking krb5 function, hang there, was killed by the responder without having a chance to log any error.

Comment 8 Daniel Sands 2012-03-05 18:13:42 UTC
That's plausible.  In my case Wireshark showed that the KRB5 request was sent and a reply received.  And once the sssd backend closed its send pipe, krb5_child seemed to exit normally.

Comment 9 Jenny Severance 2012-03-08 14:56:04 UTC
Please add steps to reproduce this issue

Comment 10 Jakub Hrozek 2012-03-08 15:29:27 UTC
(In reply to comment #9)
> Please add steps to reproduce this issue

The process of connecting to the server must take at least 30 seconds, which is the (currently hardcoded) interval after which we reset the status of a server.

This could be achieved by putting two krb5 servers into the fail over list, setting the krb5_auth_timeout to, say, 20 seconds and starting a firewall with DROP rules on the servers. The authentication should take 40 seconds, after which the status of the first server would be reset and authentication would cycle.

Comment 12 Jakub Hrozek 2012-04-03 18:27:13 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: A status of a server in the SSSD server list is reset after 30 seconds to allow retries. If a full cycle over the server list took more than 30 seconds, the cycle would start again

Consequence: SSSD deployments using large server fail over lists might loop indefinitely

Fix: The SSSD was fixed to only loop over the fail over list once

Result: If the SSSD tries all the servers in the fail over list without succeeding, the operation always fails.

Comment 13 Kaushik Banerjee 2012-04-25 10:27:02 UTC
Tested before fix:
The first server status is always reset.

With the patch (sssd-1.8.0-23.el6):
The cycle doesn't restart. The last server status is reset.


Verified in version:
# rpm -qi sssd | head
Name        : sssd                         Relocations: (not relocatable)
Version     : 1.8.0                             Vendor: Red Hat, Inc.
Release     : 23.el6                        Build Date: Fri 20 Apr 2012 11:30:39 PM IST
Install Date: Mon 23 Apr 2012 08:48:40 PM IST      Build Host: x86-003.build.bos.redhat.com
Group       : Applications/System           Source RPM: sssd-1.8.0-23.el6.src.rpm
Size        : 7874744                          License: GPLv3+
Signature   : (none)
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
URL         : http://fedorahosted.org/sssd/
Summary     : System Security Services Daemon

Comment 15 errata-xmlrpc 2012-06-20 11:55:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0747.html