880956 – Primary server status is not always reset after failover to backup server happened

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 880956 - Primary server status is not always reset after failover to backup server happened

Summary: Primary server status is not always reset after failover to backup server hap...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	sssd
Sub Component:
Version:	6.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Jakub Hrozek
QA Contact:	Kaushik Banerjee
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	895654
TreeView+	depends on / blocked

Reported:	2012-11-28 09:39 UTC by Kaushik Banerjee
Modified:	2020-05-02 17:08 UTC (History)
CC List:	4 users (show)
Fixed In Version:	sssd-1.9.2-51.el6
Doc Type:	Bug Fix
Doc Text:	No documentation needed.
Clone Of:
Environment:
Last Closed:	2013-02-21 09:41:25 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	SSSD sssd issues 2721	0	None	None	None	2020-05-02 17:08:08 UTC
Red Hat Product Errata	RHSA-2013:0508	0	normal	SHIPPED_LIVE	Low: sssd security, bug fix and enhancement update	2013-02-20 21:30:10 UTC

Description Kaushik Banerjee 2012-11-28 09:39:19 UTC

Description of problem:
Primary server status is not always reset after failover to backup server happened

Version-Release number of selected component (if applicable):
sssd-1.9.2-25

How reproducible:
Always, if TTL of primary server is more than 30s.

Steps to Reproduce:
1.sssd.conf domain section has:
ldap_uri = ldap://SERVER1
ldap_backup_uri = ldap://SERVER2

2. getent passwd user1_from_server1
Works

3. Stop ldap on SERVER1

4. getent passwd user1_from_server2
Works

5. Start ldap on Server1

6. sleep 40

7. getent passwd user2_from_server1
Fails (Should have worked, since backup to primary failover is 30s)

8. sleep 40

9. getent passwd user2_from_server1
Works fine now

Actual results:
Primary Server is not contacted back after 30s after failover to backup server happened.
From the logs, it seems primary server was looked up in 30s. However, since the TTL of primary server is set to more than 30 in dns, the primary server status is not reset.
See additional info for logs.

Expected results:
Failover from backup to primary should reset the primary server status after 30s.

Additional info:

(Wed Nov 28 13:25:52 2012) [sssd[be[LDAP]]] [be_primary_server_timeout_activate] (0x2000): Primary server reactivation timeout set to 30 seconds

...
After 30 seconds
...

(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [be_primary_server_timeout] (0x0400): Looking for primary server!
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP'
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [get_server_status] (0x1000): Status of server 'SERVER2' is 'working'
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [get_port_status] (0x1000): Port status of port 389 for server 'SERVER2' is 'working'
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [get_server_status] (0x1000): Status of server 'SERVER1' is 'working'
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [get_port_status] (0x1000): Port status of port 389 for server 'SERVER1' is 'not working'
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [get_server_status] (0x1000): Status of server 'SERVER2' is 'working'
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [get_port_status] (0x1000): Port status of port 389 for server 'SERVER2' is 'working'
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 10 seconds
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [get_server_status] (0x1000): Status of server 'SERVER2' is 'working'
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [be_resolve_server_process] (0x1000): Saving the first resolved server
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [be_resolve_server_process] (0x0200): Found address for server SERVER2: [192.168.122.22] TTL 59
(Wed Nov 28 13:26:22 2012) [sssd[be[LDAP]]] [be_primary_server_timeout_activate] (0x2000): Primary server reactivation timeout set to 30 seconds

...
After 30 seconds
...

(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [be_primary_server_timeout] (0x0400): Looking for primary server!
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP'
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [get_server_status] (0x1000): Status of server 'SERVER1' is 'working'
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [get_server_status] (0x0100): Hostname resolution expired, resetting the server status of 'SERVER1'
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [set_server_common_status] (0x0100): Marking server 'SERVER1' as 'name not resolved'
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [get_port_status] (0x1000): Port status of port 389 for server 'SERVER1' is 'not working'
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [get_port_status] (0x0100): Reseting the status of port 389 for server 'SERVER1'
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 10 seconds
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [get_server_status] (0x1000): Status of server 'SERVER1' is 'name not resolved'
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [resolv_is_address] (0x4000): [cobra.lab.eng.pnq.redhat.com] does not look like an IP address
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [resolv_gethostbyname_step] (0x2000): Querying files
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [resolv_gethostbyname_files_send] (0x0100): Trying to resolve A record of 'SERVER1' in files
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [set_server_common_status] (0x0100): Marking server 'SERVER1' as 'resolving name'
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [resolv_gethostbyname_step] (0x2000): Querying files
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [resolv_gethostbyname_files_send] (0x0100): Trying to resolve AAAA record of 'SERVER1' in files
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [resolv_gethostbyname_next] (0x0200): No more address families to retry
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [resolv_gethostbyname_step] (0x2000): Querying DNS
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [resolv_gethostbyname_dns_query] (0x0100): Trying to resolve A record of 'SERVER1' in DNS
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [schedule_request_timeout] (0x2000): Scheduling a timeout of 5 seconds
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [schedule_timeout_watcher] (0x2000): Scheduling DNS timeout watcher
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [unschedule_timeout_watcher] (0x4000): Unscheduling DNS timeout watcher
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [resolv_gethostbyname_dns_parse] (0x1000): Parsing an A reply
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [request_watch_destructor] (0x0400): Deleting request watch
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [set_server_common_status] (0x0100): Marking server 'SERVER1' as 'name resolved'
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [be_resolve_server_process] (0x1000): Saving the first resolved server
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [be_resolve_server_process] (0x0200): Found address for server SERVER1: [192.168.122.11] TTL 60
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [sdap_uri_callback] (0x0400): Constructed uri 'ldap://SERVER1'
(Wed Nov 28 13:26:52 2012) [sssd[be[LDAP]]] [be_run_reconnect_cb] (0x0400): Reconnecting. Running callbacks.

Comment 2 Jakub Hrozek 2012-11-29 08:12:21 UTC

Upstream ticket:
https://fedorahosted.org/sssd/ticket/1679

Comment 5 Kaushik Banerjee 2012-12-31 08:02:19 UTC

Verified in version 1.9.2-59

Report of tests for ldap provider:
   [   PASS   ]      primary_server_ldap_001 One primary server down, online after 31s
   [   PASS   ]      primary_server_ldap_003 Service lookup in primary
   [   PASS   ]      primary_server_ldap_004 Service lookup in backup server list
   [   PASS   ]      primary_server_ldap_005 Primary and Backup list empty
   [   PASS   ]      primary_server_ldap_006 Primary list is empty, should default to Service lookup
   [   PASS   ]      primary_server_ldap_007 Multiple Primary servers
   [   PASS   ]      primary_server_ldap_008 Multiple Backup servers
   [   PASS   ]      primary_server_ldap_009 Primary list has multiple ldap uris and a service entry
   [   PASS   ]      primary_server_ldap_010 Primary list has multiple ldap uris and a service entry in between
   [   PASS   ]      primary_server_ldap_011 Backup list has multiple ldap uris and a service entry at the start
   [   PASS   ]      primary_server_ldap_012 Backup list has multiple ldap uris and a service entry in between
   [   PASS   ]      primary_server_ldap_013 Backup list has multiple ldap uris and a service entry at the end


Report of tests for krb5 provider:
   [   PASS   ]      primary_server_krb5_001 One primary server down, online after 30s
   [   PASS   ]      primary_server_krb5_002 IP Address in backup server
   [   PASS   ]      primary_server_krb5_003 Service lookup in primary
   [   PASS   ]      primary_server_krb5_004 Service lookup in backup server list
   [   PASS   ]      primary_server_krb5_005 Primary and Backup list empty
   [   PASS   ]      primary_server_krb5_006 Primary list is empty, should default to Service lookup
   [   PASS   ]      primary_server_krb5_007 Multiple Primary servers
   [   PASS   ]      primary_server_krb5_008 Multiple Backup servers
   [   PASS   ]      primary_server_krb5_010 Primary list has multiple servers and a service entry in between
   [   PASS   ]      primary_server_krb5_012 Backup list has multiple krb5 servers and a service entry in between
   [   PASS   ]      primary_server_krb5_014 Primary list has a typo and ends with service lookup
   [   PASS   ]      primary_server_krb5_015 Primary and Backup list has a typo and ends with appropriate uri Upstream

Comment 6 errata-xmlrpc 2013-02-21 09:41:25 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0508.html

Note You need to log in before you can comment on or make changes to this bug.