Bug 703624

Summary: SSSD's async resolver only tries the first nameserver in /etc/resolv.conf
Product: Red Hat Enterprise Linux 6 Reporter: Jenny Severance <jgalipea>
Component: sssd Assignee: Stephen Gallagher <sgallagh>
Status: CLOSED ERRATA QA Contact: Chandrasekar Kannan <ckannan>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 6.1 CC: benl, dpal, grajaiya, jgalipea, jhrozek, jwest, kbanerje, prc
Target Milestone: rc Keywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: sssd-1.5.1-35.el6 Doc Type: Bug Fix
Doc Text:
Cause: The internal resolver of SSSD was set to never retry the other name servers it reads from /etc/resolv.conf if the first one failed to resolve a host name.
Consequence: If resolution failed, SSSD switched to offline mode without asking the other configured name servers.
Fix: The resolver was configured so that it queries all name servers.
Result: Host name resolution now retries until it either resolves the host name or has queried all the configured name servers.
Story Points: ---
Clone Of:
: 707574 748835 Environment:
Last Closed: 2011-12-06 16:38:21 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 707574, 708352, 748835    

Description Jenny Severance 2011-05-10 20:16:24 UTC
Description of problem:
Logging in through GDM either fails or succeeds only with cached credentials when the master IPA server is down and only the replica is available. Integrated DNS is installed on both the master and the replica.

<snip>

(Tue May 10 11:34:28 2011) [sssd[be[testrelm]]] [id_callback] (4): Got id ack and version (1) from Monitor
(Tue May 10 11:34:28 2011) [sssd[be[testrelm]]] [be_client_init] (4): Set-up Backend ID timeout [0x88adc38]
(Tue May 10 11:34:28 2011) [sssd[be[testrelm]]] [be_client_init] (4): Set-up Backend ID timeout [0x88b0b30]
(Tue May 10 11:34:28 2011) [sssd[be[testrelm]]] [client_registration] (4): Cancel DP ID timeout [0x88adc38]
(Tue May 10 11:34:28 2011) [sssd[be[testrelm]]] [client_registration] (4): Added Frontend client [NSS]
(Tue May 10 11:34:28 2011) [sssd[be[testrelm]]] [client_registration] (4): Cancel DP ID timeout [0x88b0b30]
(Tue May 10 11:34:28 2011) [sssd[be[testrelm]]] [client_registration] (4): Added Frontend client [PAM]
(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [be_get_account_info] (4): Got request for [4097][1][name=jennyg]
(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [fo_resolve_service_send] (4): Trying to resolve service 'IPA'
(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [resolv_gethostbyname_send] (4): Trying to resolve A record of 'dhcp-100-18-190.testrelm'
(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [resolve_srv_cont] (4): Searching for servers via SRV query '_ldap._tcp.testrelm'
(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [resolv_getsrv_send] (4): Trying to resolve SRV record of '_ldap._tcp.testrelm'
(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [resolve_srv_done] (1): SRV query failed: [Could not contact DNS servers]
(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [fo_set_port_status] (4): Marking port 0 of server '(no name)' as 'not working'
(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [set_srv_data_status] (4): Marking SRV lookup of service 'IPA' as 'not resolved'
(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [fo_resolve_service_send] (4): Trying to resolve service 'IPA'
(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [get_server_status] (4): Hostname resolution expired, reseting the server status of 'dhcp-100-18-10.testrelm'
(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [set_server_common_status] (4): Marking server 'dhcp-100-18-10.testrelm' as 'name not resolved'
(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [resolv_gethostbyname_send] (4): Trying to resolve A record of 'dhcp-100-18-10.testrelm'
(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [set_server_common_status] (4): Marking server 'dhcp-100-18-10.testrelm' as 'resolving name'
(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [fo_resolve_service_done] (1): Failed to resolve server 'dhcp-100-18-10.testrelm': Could not contact DNS servers
(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [set_server_common_status] (4): Marking server 'dhcp-100-18-10.testrelm' as 'not working'
(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [fo_resolve_service_send] (4): Trying to resolve service 'IPA'
(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [fo_resolve_service_send] (1): No available servers for service 'IPA'
(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [sdap_id_op_connect_done] (1): Failed to connect, going offline (5 [Input/output error])
(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [be_run_offline_cb] (3): Going offline. Running callbacks.
(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [acctinfo_callback] (4): Request processed. Returned 1,11,Offline
(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [be_get_account_info] (4): Got request for [4097][1][name=jennyg]
(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [be_get_account_info] (4): Request processed. Returned 1,11,Fast reply - offline
(Tue May 10 11:34:38 2011) [sssd[be[testrelm]]] [be_get_account_info] (4): Got request for [4097][1][name=jennyg]
(Tue May 10 11:34:38 2011) [sssd[be[testrelm]]] [be_get_account_info] (4): Request processed. Returned 1,11,Fast reply - offline
(Tue May 10 11:34:38 2011) [sssd[be[testrelm]]] [be_get_account_info] (4): Got request for [4097][1][name=jennyg]

</snip>

The client does not fail over to the replica; it goes offline instead.
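
To pull the failure messages out of a debug log like the excerpt above, a filter along the following lines may help. This is only a sketch: the regular expression is inferred from the lines shown in this report rather than from any SSSD documentation, and the names `LOG_LINE`, `parse_line` and `critical_messages` are hypothetical. In the excerpt the failure messages carry level 1, so that is the default threshold.

```python
import re

# Pattern inferred from the log lines quoted in this report (an assumption,
# not an official SSSD log format specification).
LOG_LINE = re.compile(
    r"^\((?P<timestamp>[^)]*)\) "
    r"\[(?P<process>sssd\[be\[[^\]]+\]\])\] "
    r"\[(?P<function>\w+)\] "
    r"\((?P<level>\d+)\): "
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Split one debug line into its fields, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def critical_messages(lines, max_level=1):
    """Return the parsed lines whose numeric level is at or below max_level."""
    parsed = (parse_line(line) for line in lines)
    return [fields for fields in parsed
            if fields is not None and int(fields["level"]) <= max_level]


sample_log = [
    "(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [fo_resolve_service_send] (4): "
    "Trying to resolve service 'IPA'",
    "(Tue May 10 11:34:32 2011) [sssd[be[testrelm]]] [resolve_srv_done] (1): "
    "SRV query failed: [Could not contact DNS servers]",
]

failures = critical_messages(sample_log)
for entry in failures:
    print(entry["function"], "->", entry["message"])
```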

/etc/resolv.conf contains both the master and replica nameservers: first the master, then the replica.

If I change the order to replica first, then master, it works.



Version-Release number of selected component (if applicable):

ipa-client-2.0.0-23.el6.i686
sssd-1.5.1-34.el6.i686

How reproducible:
always

Steps to Reproduce:
1. Install and configure an IPA master and replica, both with integrated DNS.
2. Install the IPA client and test authentication from GDM with an IPA user to cache credentials on the client. Make sure /etc/resolv.conf lists both DNS servers, first the master and then the replica.
3. Create a new IPA user and assign the user a password.
4. Bring the master IPA server down (ipactl stop).
5. Log in to the client GDM as the user with cached credentials. The credential cache is used even though the replica is available.
6. Log in to the client GDM as the new user. Authentication fails and the user is not prompted to create a new password.
  
Actual results:

Replica is not found and client goes offline

Expected results:

Replica would be used for authentication while master is down

Additional info:

Comment 2 RHEL Program Management 2011-05-11 06:00:29 UTC
Since RHEL 6.1 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 3 Stephen Gallagher 2011-05-11 20:22:17 UTC
Updating bug summary. The problem is not limited (or even related) to FreeIPA with integrated DNS.

I have opened upstream ticket https://fedorahosted.org/sssd/ticket/867 to track the real issue.


We're not properly failing over to secondary DNS servers if the first server in the list is broken.

Steps to reproduce:

    1. Set up a valid /etc/resolv.conf with a working primary DNS server
    2. Add "nameserver 127.0.0.2" above the working DNS entries (simulates having an unreachable DNS server first in the list)
    3. Enable debug logs and restart SSSD 

The debug log will contain

(Wed May 11 16:08:52 2011) [sssd[be[example.com]]] [fo_resolve_service_done] (1): Failed to resolve server 'ldap.example.com': Could not contact DNS servers

and SSSD will operate permanently in offline mode because it can never resolve the SRV records.
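
Step 2 of the reproduction above can be scripted roughly as follows. This is a sketch that works on the file contents as a string and does not touch /etc/resolv.conf itself; the helper name `prepend_nameserver` and the 192.0.2.x addresses in the sample are assumptions made for illustration.

```python
def prepend_nameserver(resolv_conf_text, address):
    """Return resolv.conf text with 'nameserver <address>' placed
    before the first existing nameserver line."""
    lines = resolv_conf_text.splitlines()
    new_entry = f"nameserver {address}"
    for index, line in enumerate(lines):
        # Match on the first whitespace-separated token so that
        # indented or oddly spaced entries are still recognised.
        if line.split()[:1] == ["nameserver"]:
            lines.insert(index, new_entry)
            break
    else:
        # No nameserver entries at all: append the new one.
        lines.append(new_entry)
    return "\n".join(lines) + "\n"


# Hypothetical starting configuration with two working DNS servers.
sample = "search example.com\nnameserver 192.0.2.10\nnameserver 192.0.2.11\n"
broken_first = prepend_nameserver(sample, "127.0.0.2")
print(broken_first, end="")
```

Writing the result back to /etc/resolv.conf and restarting SSSD with debug logging enabled is left as a manual step.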

It's unclear right now whether the bug is in SSSD's async resolver or internal to the c-ares library.

Comment 7 Kaushik Banerjee 2011-09-07 17:15:46 UTC
Verified in version:

# rpm -qi sssd | head
Name        : sssd                         Relocations: (not relocatable)
Version     : 1.5.1                             Vendor: Red Hat, Inc.
Release     : 49.el6                        Build Date: Mon 29 Aug 2011 08:26:38 PM IST
Install Date: Wed 31 Aug 2011 07:01:44 AM IST      Build Host: x86-010.build.bos.redhat.com
Group       : Applications/System           Source RPM: sssd-1.5.1-49.el6.src.rpm
Size        : 3549339                          License: GPLv3+
Signature   : (none)
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
URL         : http://fedorahosted.org/sssd/
Summary     : System Security Services Daemon

Comment 8 Jakub Hrozek 2011-10-26 16:17:53 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: The internal resolver of SSSD was set to never retry the other name servers it reads from /etc/resolv.conf if the first one failed to resolve a host name.
Consequence: If resolution failed, SSSD switched to offline mode without asking the other configured name servers.
Fix: The resolver was configured so that it queries all name servers.
Result: Host name resolution now retries until it either resolves the host name or has queried all the configured name servers.
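
The difference between the old and the fixed behaviour described in the note above can be illustrated with a small model. This is not SSSD's code (SSSD's resolver is written in C on top of c-ares); it is a sketch with a stubbed-out query function, and every name in it (`ServerUnreachable`, `resolve_first_only`, `resolve_with_failover`, `fake_query`) as well as the 192.0.2.x addresses are hypothetical.

```python
class ServerUnreachable(Exception):
    """Raised by the stub when a name server cannot be contacted."""


def resolve_first_only(name, nameservers, query):
    """Old behaviour: ask only the first name server, give up if it fails."""
    if not nameservers:
        return None
    try:
        return query(nameservers[0], name)
    except ServerUnreachable:
        return None


def resolve_with_failover(name, nameservers, query):
    """Fixed behaviour: ask each name server in order until one answers."""
    for server in nameservers:
        try:
            return query(server, name)
        except ServerUnreachable:
            continue  # try the next configured name server
    return None  # every configured name server failed


def fake_query(server, name):
    """Stub standing in for a real DNS query; 127.0.0.2 is treated as unreachable."""
    if server == "127.0.0.2":
        raise ServerUnreachable(server)
    return "192.0.2.50"


servers = ["127.0.0.2", "192.0.2.10"]
before = resolve_first_only("ldap.example.com", servers, fake_query)
after = resolve_with_failover("ldap.example.com", servers, fake_query)
print("first server only:", before)
print("with failover:", after)
```

With the unreachable server listed first, the first function returns nothing (the situation in which SSSD went offline), while the second reaches the working server.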

Comment 9 errata-xmlrpc 2011-12-06 16:38:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1529.html