Description of problem:
I have seen this happen on several machines.
The sssd_be launches krb5_child to communicate with the Kerberos server. If krb5_child does not return a reply within the timeout period (default 15 seconds), sssd_be will kill the child and restart it.
The problem is that from the krb5 server it is apparent that the request is being made and answered within the first second most of the time. However, sssd_be never seems to get the message.
A more detailed analysis shows that handle_child_step does get called. It closes the output file descriptor for krb5_child, which immediately exits at that point. However, handle_child_done is never called.
The impact is greatest after the credentials expire (or if the password was entered wrong to begin with), when it begins sending invalid creds to the krb5 server, which responds by locking out the user after a certain number of attempts.
Version-Release number of selected component (if applicable):
It happens at random intervals, possibly less than a day after the previous occurrence or possibly up to a week or two. So far no definite cause has been found.
Steps to Reproduce:
Unfortunately, I don't know yet. For starters though, leave the krb5_auth_timeout at its default.
krb5_child is constantly restarted, eventually locking out a user.
Credentials or error message received from krb5 server and passed through sssd to the originating process.
If krb5_child is sending its response but the sssd_be process never receives it, it could be indicative of a bug within libtevent, upon which sssd relies.
Setting krb5_auth_timeout to a large value, say 3600, will at least mitigate the password lockout problem (for us). However, it is yet to be seen whether or not this is a viable workaround for the whole problem.
If handle_child_done() is never called would mean that the child never answered.
Daniel, do you have logs available? Especially the /var/log/sssd/sssd_$domain.log and /var/log/sssd/krb5_child.log are of interest.
You can generate logs by putting debug_level = 7 (or higher) into the domain section of sssd.conf
Created attachment 565700 [details]
sssd_default logging (sanitized)
Here is the sssd_default log. There are no krb5_child logs to speak of because when krb5_child is killed with -9, it apparently fails to add any logging.
Just to clarify, I know that krb5_child finished its job because I was watching the network with WireShark while krb5_child was being restarted every 15 seconds.
Thank you for providing the logs, Daniel. I identified the cause of the infinite loop from them, which is now being tracked in the upstream ticket #1214.
However, I'm concerned about the empty krb5_child.log. Even if the child is killed with a signal, it should always log at least a level-7 debug message saying "krb5_child started.". I'll look into reproducing that issue locally.
(In reply to comment #6)
> However, I'm concerned about the empty krb5_child.log. Even if the child is
> killed with a signal, it should always log at least a level-7 debug message
> saying "krb5_child started.". I'll look into reproducing that issue locally.
There was actually a small bug where we would set up debugging to file *after* we attempted to print the "krb5_child started." message.
There are also no tracing DEBUG messages in the child code, only error reporting, so unless there was an error, the child process might really not log anything..I suspect that was your case, the child called a blocking krb5 function, hang there, was killed by the responder without having a chance to log any error.
That's plausible. In my case Wireshark showed that the KRB5 request was sent and a reply received. And once the sssd backend closed its send pipe, krb5_child seemed to exit normally.
Please add steps to reproduce this issue
(In reply to comment #9)
> Please add steps to reproduce this issue
The process of connecting to the server must take at least 30 seconds, which is the (currently hardcoded) interval after which we reset the status of a server.
This could be achieved by putting two krb5 servers into the fail over list, setting the krb5_auth_timeout to, say, 20 seconds and starting a firewall with DROP rules on the servers. The authentication should take 40 seconds, after which the status of the first server would be reset and authentication would cycle.
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
Cause: A status of a server in the SSSD server list is reset after 30 seconds to allow retries. If a full cycle over the server list took more than 30 seconds, the cycle would start again
Consequence: SSSD deployments using large server fail over lists might loop indefinitely
Fix: The SSSD was fixed to only loop over the fail over list once
Result: If the SSSD tries all the servers in the fail over list without succeeding, the operation always fails.
Tested before fix:
The first server status is always reset.
With the patch (sssd-1.8.0-23.el6):
The cycle doesn't restart. The last server status is reset.
Verified in version:
# rpm -qi sssd | head
Name : sssd Relocations: (not relocatable)
Version : 1.8.0 Vendor: Red Hat, Inc.
Release : 23.el6 Build Date: Fri 20 Apr 2012 11:30:39 PM IST
Install Date: Mon 23 Apr 2012 08:48:40 PM IST Build Host: x86-003.build.bos.redhat.com
Group : Applications/System Source RPM: sssd-1.8.0-23.el6.src.rpm
Size : 7874744 License: GPLv3+
Signature : (none)
Packager : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
URL : http://fedorahosted.org/sssd/
Summary : System Security Services Daemon
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.