Bug 1769755

Summary: sssd failover leads to delayed and failed logins
Product: Red Hat Enterprise Linux 7 Reporter: Oliver Falk <ofalk>
Component: sssd Assignee: Sumit Bose <sbose>
Status: CLOSED ERRATA QA Contact: ipa-qe <ipa-qe>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 7.8 CC: apeetham, grajaiya, jhrozek, kbanerje, lmiksik, lslebodn, mzidek, ndehadra, pbrezina, peter.vreman, sbose, sgoveas, tscherf
Target Milestone: rc Keywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: sync-to-jira
Fixed In Version: sssd-1.16.4-35.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1807933 1807934 Environment:
Last Closed: 2020-03-31 19:44:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1122832, 1807933, 1807934    

Description Oliver Falk 2019-11-07 11:50:22 UTC
Description of problem:
While testing an IPA deployment in my customer's environment and running various failover test scenarios, we noticed that under some circumstances failover didn't work as expected and resulted in failed or delayed (~60-70 seconds) logins.

Version-Release number of selected component (if applicable): 1.16.4


How reproducible: Always.


Steps to Reproduce:
1. Client connected to two IPA servers (A and B)
2. Cut connection to server A
3. Login to client
4. Allow connection to server A
5. Cut connection to server B

If you keep doing this repeatedly, at some point the failback from B to A stops working: SSSD takes a very long time to recognize that the connection to server A has been restored and to use it again.
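One round of the steps above could be scripted roughly as follows. This is only a sketch under assumptions that are not in the report: it cuts the connection with iptables DROP rules on the client, it treats "ssh user@localhost true" as the login in step 3, and the helper names (run, cut_connection, allow_connection, failover_round) as well as the DRY_RUN switch are hypothetical. By default it only prints the commands.

```shell
#!/bin/sh
# Hypothetical reproduction helper, not part of the original report.
# With DRY_RUN=1 (the default) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Assumption: "cutting the connection" is done with an outbound DROP rule.
cut_connection()   { run iptables -I OUTPUT -d "$1" -j DROP; }
allow_connection() { run iptables -D OUTPUT -d "$1" -j DROP; }

# One round of steps 2-5; arguments: server A, server B, login user.
failover_round() {
    server_a="$1"
    server_b="$2"
    login_user="$3"
    cut_connection "$server_a"              # step 2: cut connection to A
    run ssh "$login_user@localhost" true    # step 3: log in (assumed form)
    allow_connection "$server_a"            # step 4: allow connection to A
    cut_connection "$server_b"              # step 5: cut connection to B
}
```

Calling "failover_round a.example b.example alice" prints the four commands in order. Before repeating a round, the connection to server B would have to be restored again (allow_connection), which the steps above leave implicit.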

Actual results: Logins delayed or not working at all


Expected results: Fail over + fail back work smoothly


Additional info:
* This was already analysed by Sumit Bose and he has a fix for it available.
* Customer case will be linked.
* Exception set to ?
* We'll need this fix in 7.7 z-stream (for EUS) later as well
* It also applies to RHEL 8 AFAIK

Comment 7 Sumit Bose 2019-11-07 12:30:27 UTC
Upstream ticket:
https://pagure.io/SSSD/sssd/issue/4114

Comment 12 Sumit Bose 2019-11-29 11:16:14 UTC
SSSD-1-16:
 - 4897063996b624b71823e61c73916f47832f103a
 - a4dd1eb5087c2f8a3a9133f42efa025221edc1c9

Comment 15 Nikhil Dehadrai 2019-12-13 11:46:53 UTC
[root@master ~]# rpm -q ipa-server ipa-client
ipa-server-4.6.6-11.el7.x86_64
ipa-client-4.6.6-11.el7.x86_64



Verified the bug based on the following steps/observations:
1. Set up IPA master on RHEL 7.8
2. Set up IPA replica on RHEL 7.8
3. Set up IPA client on RHEL 7.8 (ensuring that resolv.conf has entries for both master and replica)
4. Alternately start/stop master and replica and check that kinit works on the client machine


Script used:
# Alternate between "master down, replica up" and "master up, replica down".
# After each switch, reset the SSSD cache on the client and re-authenticate.
while true; do
    date
    echo --------------------
    echo MASTER OFF
    ssh -t root@master.ipa.test "ipactl status"
    ssh -t root@master.ipa.test "ipactl stop"
    ssh -t root@master.ipa.test "ipactl status"
    echo REPLICA ON
    ssh -t root@replica1.ipa.test "ipactl restart"
    ssh -t root@replica1.ipa.test "ipactl status"
    # Start SSSD with an empty cache so lookups must reach a live server
    systemctl stop sssd; rm -rf /var/lib/sss/db/*; systemctl start sssd
    kdestroy
    klist
    echo Secret123 | kinit admin
    klist
    getent passwd admin
    echo ===============================================
    date
    echo --------------------
    echo MASTER ON
    ssh -t root@master.ipa.test "ipactl restart"
    ssh -t root@master.ipa.test "ipactl status"
    echo REPLICA OFF
    ssh -t root@replica1.ipa.test "ipactl status"
    ssh -t root@replica1.ipa.test "ipactl stop"
    ssh -t root@replica1.ipa.test "ipactl status"
    # Same client-side reset and authentication check as above
    systemctl stop sssd; rm -rf /var/lib/sss/db/*; systemctl start sssd
    kdestroy
    klist
    echo Secret123 | kinit admin
    klist
    getent passwd admin
    echo ===============================================
done


Ran the above script continuously for 10 minutes; kinit succeeded with failover from master to replica and vice versa.
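Since the original symptom was logins delayed by roughly 60-70 seconds, it may also help to time the client-side checks instead of only looking at whether they succeed. A minimal sketch, assuming a hypothetical helper named timed_check and whole-second resolution from "date +%s":

```shell
#!/bin/sh
# Hypothetical helper: run a command, measure wall-clock seconds, and
# classify the result. Usage: timed_check LIMIT_SECONDS COMMAND [ARGS...]
# Returns 0 if the command succeeded within the limit, 1 if it failed,
# 2 if it succeeded but took longer than the limit.
timed_check() {
    tc_limit="$1"
    shift
    tc_start=$(date +%s)
    tc_status=0
    "$@" >/dev/null 2>&1 || tc_status=$?
    tc_end=$(date +%s)
    tc_elapsed=$((tc_end - tc_start))
    if [ "$tc_status" -ne 0 ]; then
        echo "FAILED after ${tc_elapsed}s: $*"
        return 1
    fi
    if [ "$tc_elapsed" -gt "$tc_limit" ]; then
        echo "SLOW (${tc_elapsed}s > ${tc_limit}s): $*"
        return 2
    fi
    echo "OK (${tc_elapsed}s): $*"
}
```

In the loop above this could wrap a lookup, for example "timed_check 10 getent passwd admin", where the 10-second limit is an arbitrary choice and not a value taken from this bug.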
Observations:

===============================================
Fri Dec 13 06:09:15 EST 2019
--------------------
MASTER OFF
Directory Service: RUNNING
krb5kdc Service: RUNNING
kadmin Service: RUNNING
named Service: RUNNING
httpd Service: RUNNING
ipa-custodia Service: RUNNING
pki-tomcatd Service: RUNNING
ipa-otpd Service: RUNNING
ipa-dnskeysyncd Service: RUNNING
ipa: INFO: The ipactl command was successful
Connection to master.ipa.test closed.
Stopping ipa-dnskeysyncd Service
Stopping ipa-otpd Service
Stopping pki-tomcatd Service
Stopping ipa-custodia Service
Stopping httpd Service
Stopping named Service
Stopping kadmin Service
Stopping krb5kdc Service
Stopping Directory Service
ipa: INFO: The ipactl command was successful
Connection to master.ipa.test closed.
Directory Service: STOPPED
Directory Service must be running in order to obtain status of other services
ipa: INFO: The ipactl command was successful
Connection to master.ipa.test closed.
REPLICA ON
Starting Directory Service
Starting krb5kdc Service
Starting kadmin Service
Starting named Service
Starting httpd Service
Starting ipa-custodia Service
Starting ntpd Service
Starting pki-tomcatd Service
Starting ipa-otpd Service
Starting ipa-dnskeysyncd Service
ipa: INFO: The ipactl command was successful
Connection to replica1.ipa.test closed.
Directory Service: RUNNING
krb5kdc Service: RUNNING
kadmin Service: RUNNING
named Service: RUNNING
httpd Service: RUNNING
ipa-custodia Service: RUNNING
ntpd Service: RUNNING
pki-tomcatd Service: RUNNING
ipa-otpd Service: RUNNING
ipa-dnskeysyncd Service: RUNNING
ipa: INFO: The ipactl command was successful
Connection to replica1.ipa.test closed.
klist: Credentials cache keyring 'persistent:0:0' not found
Password for admin: 
Ticket cache: KEYRING:persistent:0:0
Default principal: admin

Valid starting     Expires            Service principal
12/13/19 06:09:57  12/14/19 06:09:57  krbtgt/IPA.TEST
admin:*:773400000:773400000:Administrator:/home/admin:/bin/bash
===============================================
Fri Dec 13 06:09:56 EST 2019
--------------------
MASTER ON
Starting Directory Service
Starting krb5kdc Service
Starting kadmin Service
Starting named Service
Starting httpd Service
Starting ipa-custodia Service
Starting pki-tomcatd Service
Starting ipa-otpd Service
Starting ipa-dnskeysyncd Service
ipa: INFO: The ipactl command was successful
Connection to master.ipa.test closed.
Directory Service: RUNNING
krb5kdc Service: RUNNING
kadmin Service: RUNNING
named Service: RUNNING
httpd Service: RUNNING
ipa-custodia Service: RUNNING
pki-tomcatd Service: RUNNING
ipa-otpd Service: RUNNING
ipa-dnskeysyncd Service: RUNNING
ipa: INFO: The ipactl command was successful
Connection to master.ipa.test closed.
REPLICA OFF
Directory Service: RUNNING
krb5kdc Service: RUNNING
kadmin Service: RUNNING
named Service: RUNNING
httpd Service: RUNNING
ipa-custodia Service: RUNNING
ntpd Service: RUNNING
pki-tomcatd Service: RUNNING
ipa-otpd Service: RUNNING
ipa-dnskeysyncd Service: RUNNING
ipa: INFO: The ipactl command was successful
Connection to replica1.ipa.test closed.
Stopping ipa-dnskeysyncd Service
Stopping ipa-otpd Service
Stopping pki-tomcatd Service
Stopping ntpd Service
Stopping ipa-custodia Service
Stopping httpd Service
Stopping named Service
Stopping kadmin Service
Stopping krb5kdc Service
Stopping Directory Service
ipa: INFO: The ipactl command was successful
Connection to replica1.ipa.test closed.
Directory Service: STOPPED
Directory Service must be running in order to obtain status of other services
ipa: INFO: The ipactl command was successful
Connection to replica1.ipa.test closed.
klist: Credentials cache keyring 'persistent:0:0' not found
Password for admin: 
Ticket cache: KEYRING:persistent:0:0
Default principal: admin

Valid starting     Expires            Service principal
12/13/19 06:10:33  12/14/19 06:10:32  krbtgt/IPA.TEST
admin:*:773400000:773400000:Administrator:/home/admin:/bin/bash


Based on the above observations, marking the bug as "VERIFIED".

Comment 24 errata-xmlrpc 2020-03-31 19:44:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1053
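
To check whether a client already runs a build at or above the "Fixed In Version" (sssd-1.16.4-35.el7), something like the following could be used. This is a rough sketch: the function names version_at_least and check_sssd_fix are hypothetical, and the comparison relies on GNU "sort -V", which only approximates RPM's own version ordering.

```shell
#!/bin/sh
# Hypothetical helper: succeed if version-release string $1 sorts at or
# above $2 under GNU "sort -V" (an approximation of RPM's comparison).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare the installed sssd package against the Fixed In Version.
check_sssd_fix() {
    fixed="1.16.4-35.el7"
    installed="$(rpm -q --queryformat '%{VERSION}-%{RELEASE}' sssd 2>/dev/null)" || {
        echo "sssd is not installed"
        return 1
    }
    if version_at_least "$installed" "$fixed"; then
        echo "sssd $installed is at or above $fixed"
    else
        echo "sssd $installed is older than $fixed"
        return 1
    fi
}
```

Run check_sssd_fix on the client. Z-stream builds that carry a backport of the fix may have a lower release number than -35, so an "older" result on such a system is not conclusive.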