Bug 1350012

Summary: kinit / sssd kerberos fail over
Product: Red Hat Enterprise Linux 7
Component: sssd
Version: 7.4
Hardware: All
OS: All
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Eugene Keck <ekeck>
Assignee: Sumit Bose <sbose>
QA Contact: ipa-qe <ipa-qe>
CC: apeetham, ekeck, gparente, grajaiya, jhrozek, joe.collette, keith.young, kelvin.yund, kewhite, kludhwan, lslebodn, mkosek, mrichter, msauton, mzidek, ndehadra, patdung100+redhat, pbrezina, rharwood, sbose, striker, tscherf
Target Milestone: rc
Target Release: ---
Fixed In Version: sssd-1.16.4-11.el7
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2019-08-06 13:02:00 UTC
Bug Blocks: 1420851, 1644708, 1647919

Description Eugene Keck 2016-06-24 19:54:11 UTC
Description of problem:
kinit will only use the configured ipa_backup_server for a short period of time

Version-Release number of selected component (if applicable):
sssd-1.13.0-40.el7_2.4.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Set up sssd.conf with an ipa_server and an ipa_backup_server (see the sketch after these steps)
2. Take down the primary IPA server
3. kinit <user>
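
A minimal sssd.conf sketch matching these steps (the domain section name, hostnames, and the id_provider line are placeholders for this reproducer, not my real config):

    [domain/example.com]
    id_provider = ipa
    ipa_domain = example.com
    ipa_server = ipa1.example.com         # primary, 192.168.1.51
    ipa_backup_server = ipa2.example.com  # backup,  192.168.1.50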

Actual results:
kinit only works for about 20 seconds after failing over to the backup server, then it fails again.

Expected results:
kinit <user> works

Additional info:
192.168.1.50 is my backup server (ipa_backup_server) and 192.168.1.51 is my primary server (ipa_server).

What I am seeing is that kinit does fail over after I take down my primary server:

#  KRB5_TRACE=/dev/stdout kinit
[7733] 1466797071.639806: Getting initial credentials for root
[7733] 1466797071.641727: Sending request (163 bytes) to EXAMPLE.COM
[7733] 1466797071.641897: Initiating TCP connection to stream 192.168.1.50:88
[7733] 1466797071.642201: Sending TCP request to stream 192.168.1.50:88
[7733] 1466797071.643177: Received answer (169 bytes) from stream 192.168.1.50:88
[7733] 1466797071.643187: Terminating TCP connection to stream 192.168.1.50:88
[7733] 1466797071.643252: Response was from master KDC
[7733] 1466797071.643272: Received error from KDC: -1765328378/Client not found in Kerberos database
kinit: Client 'root' not found in Kerberos database while getting initial credentials

Then, about 20 seconds later, it moves back to the primary server:

#  KRB5_TRACE=/dev/stdout kinit                                                                        
[7741] 1466797094.952302: Getting initial credentials for root
[7741] 1466797094.954328: Sending request (163 bytes) to EXAMPLE.COM
[7741] 1466797094.954495: Initiating TCP connection to stream 192.168.1.51:88
[7741] 1466797094.954734: Terminating TCP connection to stream 192.168.1.51:88
[7741] 1466797094.954777: Sending initial UDP request to dgram 192.168.1.51:88
kinit: Cannot contact any KDC for realm 'EXAMPLE.COM' while getting initial credentials

kinit keeps failing until SSSD itself needs to use the backup Kerberos server again; then kinit works for another ~20 seconds until SSSD moves back to the primary server. I was not able to work around this behavior using SRV or static records in my /etc/krb5.conf.

Comment 2 Sumit Bose 2016-06-27 07:56:27 UTC
I think this is expected behavior. Please see the 'FAILOVER' section in the sssd-krb5 man page. As you can see, SSSD will automatically try to fall back to a primary server while a backup server is in use. If you want to stick to the failover server, add both servers as 'ipa_server' and do not use the 'ipa_backup_server' option.
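
For example, something like this in the [domain/...] section (hostnames are placeholders, use your real server names):

    ipa_server = ipa1.example.com, ipa2.example.com
    # no ipa_backup_server option, so SSSD treats both servers as primary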

Comment 3 Joe Collette 2016-06-29 22:24:10 UTC
I would say this is not the expected behavior: the KDC selection should not flip back to the primary unless the service is actually available. Yes, SSSD checks every 30 seconds or so whether the primary is back up, but LDAP does not fail back unless the primary server really becomes available again; that is the expected behavior, and kinit should behave the same way.


The file /var/lib/sss/pubconf/kdcinfo.DOMAIN is updated with the backup server IP after the primary fails, but during the first check to see whether the primary IPA server is available again, the file is inadvertently updated with the primary server IP again, and therefore kinit fails.
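
For reference, the kdcinfo file is just a plain-text file holding the address SSSD currently considers active (the contents below are illustrative, using the addresses from this report):

    # cat /var/lib/sss/pubconf/kdcinfo.EXAMPLE.COM
    192.168.1.50

After the flip described above it contains 192.168.1.51 again, even though that server is still down.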

Having all servers on the ipa_server line is not an option if you want clients to properly fail back to their "local" IPA server in a particular region or datacenter. With all servers on the ipa_server line, clients fail over to the next server in the list and never revert to the first one. That is why the "ipa_backup_server" option exists: you can fail over to another IPA server, and when the primary comes back online, the client fails back.
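
As a concrete (hypothetical) illustration of that regional layout, the two datacenters would mirror each other's server options:

    # client in datacenter A
    ipa_server = ipa-a.example.com
    ipa_backup_server = ipa-b.example.com

    # client in datacenter B
    ipa_server = ipa-b.example.com
    ipa_backup_server = ipa-a.example.com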

Comment 4 Sumit Bose 2016-07-06 08:01:16 UTC
I think the most flexible solution is the one described in https://fedorahosted.org/sssd/ticket/941, where multiple addresses are made available by the plugin.

I'm currently thinking about the best strategy for ordering the additional addresses. As long as all servers (primary and backup) are configured explicitly, we can try the one SSSD currently considers active first, then add all primary servers, and finally all backup servers.

If the servers are discovered dynamically, we might need to add a configurable cut-off limit, because e.g. in larger AD domains there can be a huge number of servers even if we restrict the list to servers from the local site. If the plugin returns too many servers, applications will be busy for quite some time checking all of them before finally giving up if, for example, there are network issues.
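
To make the idea concrete, a rough Python sketch of the ordering and cut-off I have in mind (all names and the default limit are made up for illustration; nothing below exists in SSSD):

    def order_kdc_addresses(active, primaries, backups, cutoff=10):
        """Return the addresses to expose to the locator plugin:
        the currently active server first, then the remaining primaries,
        then the backups, truncated to 'cutoff' entries so huge (e.g. AD)
        server lists do not keep applications probing unreachable KDCs."""
        ordered = [active]
        for srv in primaries + backups:
            if srv != active and srv not in ordered:
                ordered.append(srv)
        return ordered[:cutoff]

    # Example with the addresses from this report, backup currently active:
    print(order_kdc_addresses("192.168.1.50",
                              ["192.168.1.51"], ["192.168.1.50"]))
    # -> ['192.168.1.50', '192.168.1.51']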

Comment 5 Jakub Hrozek 2016-07-07 11:53:03 UTC
Upstream ticket:
https://fedorahosted.org/sssd/ticket/941

Comment 7 Jakub Hrozek 2016-11-28 09:12:16 UTC
There is some work we might do in the scope of 7.4, but it is not scoped for sure. Some of it might be improvements to the kdcinfo plugin itself, some might fall into the scope of the KCM server.

Comment 8 Jakub Hrozek 2017-02-01 13:24:39 UTC
*** Bug 1380174 has been marked as a duplicate of this bug. ***

Comment 20 Fabiano FidĂȘncio 2018-06-14 18:18:49 UTC
Fixed as part of the following series ...
 master:
  efae950
  9f68324
  c1fbc6b
  2124275
  cc79227
  d91661e
  4759a48
  f28d995

Comment 21 Jakub Hrozek 2018-06-14 19:46:11 UTC
I'm going to move the bug back to ASSIGNED. I know the upstream tickets were less than clear and cross-referenced each other, and so did the bugzillas.

So to make it clear what was pushed to upstream:
 - the Kerberos locator plugin is now able to consume multiple addresses from the kdcinfo files. This will most probably be in 7.6 (modulo the usual disclaimer about future releases..)
 - BUT there is no mechanism yet in SSSD which would write multiple addresses into the kdcinfo files

 - the patches add the ability to generate the kdcinfo files for trusted domains of directly joined AD clients and IPA masters in a trust relationship with an AD domain
  - BUT there is no mechanism yet for writing the kdcinfo files on IPA clients in an IPA-AD trust setup. I'm working on a subsequent patch that would implement this and I hope that the patch makes 7.6 (modulo the usual disclaimer about future releases..)

We're now working on another, subsequent patchset that would write multiple addresses into kdcinfo files, but /only in the IPA-AD client case/.
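
To illustrate what the locator plugin is now able to consume, the multiple-address kdcinfo file would, as far as I can tell, simply list one address per line (illustrative contents only, using this report's addresses):

    # cat /var/lib/sss/pubconf/kdcinfo.EXAMPLE.COM
    192.168.1.51
    192.168.1.50

As noted above, SSSD itself does not write files like this yet.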

I hope it is clearer what was implemented and what is still in the works.

Comment 23 Jakub Hrozek 2019-03-19 22:44:56 UTC
First ticket fixed:
    master: 63ccbfe
    sssd-1-16: 96e4d71

Comment 24 Jakub Hrozek 2019-03-23 19:38:19 UTC
Another master PR:
* master: 208a79a
* sssd-1-16: c225ed7

Comment 25 Jakub Hrozek 2019-04-02 20:41:11 UTC
Final PR:
 * master: e8d806d9bbb1ba288ed6a83158113f4d8f8a8929
 * sssd-1-16: ec0c31a71ef8ec61e11b1c6d21edcf9b923ce4ca

Comment 33 errata-xmlrpc 2019-08-06 13:02:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2177