Bug 880956

Summary: Primary server status is not always reset after failover to backup server happened
Product: Red Hat Enterprise Linux 6 Reporter: Kaushik Banerjee <kbanerje>
Component: sssdAssignee: Jakub Hrozek <jhrozek>
Status: CLOSED ERRATA QA Contact: Kaushik Banerjee <kbanerje>
Severity: unspecified Docs Contact:
Priority: high    
Version: 6.4CC: grajaiya, jgalipea, okos, pbrezina
Target Milestone: rcKeywords: Regression
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: sssd-1.9.2-51.el6 Doc Type: Bug Fix
Doc Text:
No documentation needed.
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-02-21 09:41:25 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 895654    

Description Kaushik Banerjee 2012-11-28 09:39:19 UTC
Description of problem:
Primary server status is not always reset after failover to backup server happened

Version-Release number of selected component (if applicable):
sssd-1.9.2-25

How reproducible:
Always, if TTL of primary server is more than 30s.

Steps to Reproduce:
1.sssd.conf domain section has:
ldap_uri = ldap://SERVER1
ldap_backup_uri = ldap://SERVER2

2. getent passwd user1_from_server1
Works

3. Stop ldap on SERVER1

4. getent passwd user1_from_server2
Works

5. Start ldap on Server1

6. sleep 40

7. getent passwd user2_from_server1
Fails (Should have worked, since backup to primary failover is 30s)

8. sleep 40

9. getent passwd user2_from_server1
Works fine now

Actual results:
Primary Server is not contacted back after 30s after failover to backup server happened.
From the logs, it seems primary server was looked up in 30s. However, since the TTL of primary server is set to more than 30 in dns, the primary server status is not reset.
See additional info for logs.

Expected results:
Failover from backup to primary should reset the primary server status after 30s.

Additional info:

(Wed Nov 28 13:25:52 2012) [sssd[be[LDAP]]] [be_primary_server_timeout_activate] (0x2000): Primary server reactivation timeout set to 30 seconds

...
After 30 seconds
...

(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [be_primary_server_timeout] (0x0400): Looking for primary server!
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP'
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [get_server_status] (0x1000): Status of server 'SERVER2' is 'working'
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [get_port_status] (0x1000): Port status of port 389 for server 'SERVER2' is 'working'
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [get_server_status] (0x1000): Status of server 'SERVER1' is 'working'
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [get_port_status] (0x1000): Port status of port 389 for server 'SERVER1' is 'not working'
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [get_server_status] (0x1000): Status of server 'SERVER2' is 'working'
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [get_port_status] (0x1000): Port status of port 389 for server 'SERVER2' is 'working'
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 10 seconds
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [get_server_status] (0x1000): Status of server 'SERVER2' is 'working'
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [be_resolve_server_process] (0x1000): Saving the first resolved server
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [be_resolve_server_process] (0x0200): Found address for server SERVER2: [192.168.122.22] TTL 59
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [be_primary_server_timeout_activate] (0x2000): Primary server reactivation timeout set to 30 seconds

...
After 30 seconds
...

(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [be_primary_server_timeout] (0x0400): Looking for primary server!
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP'
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [get_server_status] (0x1000): Status of server 'SERVER1' is 'working'
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [get_server_status] (0x0100): Hostname resolution expired, resetting the server status of 'SERVER1'
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [set_server_common_status] (0x0100): Marking server 'SERVER1' as 'name not resolved'
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [get_port_status] (0x1000): Port status of port 389 for server 'SERVER1' is 'not working'
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [get_port_status] (0x0100): Reseting the status of port 389 for server 'SERVER1'
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 10 seconds
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [get_server_status] (0x1000): Status of server 'SERVER1' is 'name not resolved'
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [resolv_is_address] (0x4000): [cobra.lab.eng.pnq.redhat.com] does not look like an IP address
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [resolv_gethostbyname_step] (0x2000): Querying files
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [resolv_gethostbyname_files_send] (0x0100): Trying to resolve A record of 'SERVER1' in files
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [set_server_common_status] (0x0100): Marking server 'SERVER1' as 'resolving name'
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [resolv_gethostbyname_step] (0x2000): Querying files
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [resolv_gethostbyname_files_send] (0x0100): Trying to resolve AAAA record of 'SERVER1' in files
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [resolv_gethostbyname_next] (0x0200): No more address families to retry
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [resolv_gethostbyname_step] (0x2000): Querying DNS
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [resolv_gethostbyname_dns_query] (0x0100): Trying to resolve A record of 'SERVER1' in DNS
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [schedule_request_timeout] (0x2000): Scheduling a timeout of 5 seconds
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [schedule_timeout_watcher] (0x2000): Scheduling DNS timeout watcher
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [unschedule_timeout_watcher] (0x4000): Unscheduling DNS timeout watcher
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [resolv_gethostbyname_dns_parse] (0x1000): Parsing an A reply
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [request_watch_destructor] (0x0400): Deleting request watch
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [set_server_common_status] (0x0100): Marking server 'SERVER1' as 'name resolved'
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [be_resolve_server_process] (0x1000): Saving the first resolved server
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [be_resolve_server_process] (0x0200): Found address for server SERVER1: [192.168.122.11] TTL 60
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [sdap_uri_callback] (0x0400): Constructed uri 'ldap://SERVER1'
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [be_run_reconnect_cb] (0x0400): Reconnecting. Running callbacks.

Comment 2 Jakub Hrozek 2012-11-29 08:12:21 UTC
Upstream ticket:
https://fedorahosted.org/sssd/ticket/1679

Comment 5 Kaushik Banerjee 2012-12-31 08:02:19 UTC
Verified in version 1.9.2-59

Report of tests for ldap provider:
   [   PASS   ]      primary_server_ldap_001 One primary server down, online after 31s
   [   PASS   ]      primary_server_ldap_003 Service lookup in primary
   [   PASS   ]      primary_server_ldap_004 Service lookup in backup server list
   [   PASS   ]      primary_server_ldap_005 Primary and Backup list empty
   [   PASS   ]      primary_server_ldap_006 Primary list is empty, should default to Service lookup
   [   PASS   ]      primary_server_ldap_007 Multiple Primary servers
   [   PASS   ]      primary_server_ldap_008 Multiple Backup servers
   [   PASS   ]      primary_server_ldap_009 Primary list has multiple ldap uris and a service entry
   [   PASS   ]      primary_server_ldap_010 Primary list has multiple ldap uris and a service entry in between
   [   PASS   ]      primary_server_ldap_011 Backup list has multiple ldap uris and a service entry at the start
   [   PASS   ]      primary_server_ldap_012 Backup list has multiple ldap uris and a service entry in between
   [   PASS   ]      primary_server_ldap_013 Backup list has multiple ldap uris and a service entry at the end


Report of tests for krb5 provider:
   [   PASS   ]      primary_server_krb5_001 One primary server down, online after 30s
   [   PASS   ]      primary_server_krb5_002 IP Address in backup server
   [   PASS   ]      primary_server_krb5_003 Service lookup in primary
   [   PASS   ]      primary_server_krb5_004 Service lookup in backup server list
   [   PASS   ]      primary_server_krb5_005 Primary and Backup list empty
   [   PASS   ]      primary_server_krb5_006 Primary list is empty, should default to Service lookup
   [   PASS   ]      primary_server_krb5_007 Multiple Primary servers
   [   PASS   ]      primary_server_krb5_008 Multiple Backup servers
   [   PASS   ]      primary_server_krb5_010 Primary list has multiple servers and a service entry in between
   [   PASS   ]      primary_server_krb5_012 Backup list has multiple krb5 servers and a service entry in between
   [   PASS   ]      primary_server_krb5_014 Primary list has a typo and ends with service lookup
   [   PASS   ]      primary_server_krb5_015 Primary and Backup list has a typo and ends with appropriate uri Upstream

Comment 6 errata-xmlrpc 2013-02-21 09:41:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0508.html