Bug 1122873

Summary: Failover does not always happen from SRV to hostname resolution(via /etc/hosts)
Product: Red Hat Enterprise Linux 6 Reporter: Kaushik Banerjee <kbanerje>
Component: sssdAssignee: Jakub Hrozek <jhrozek>
Status: CLOSED ERRATA QA Contact: Kaushik Banerjee <kbanerje>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.6CC: fahmed, grajaiya, jgalipea, lslebodn, mkosek, pbrezina, preichl
Target Milestone: rcKeywords: Regression
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: sssd-1.11.6-13.el6 Doc Type: Bug Fix
Doc Text:
Cause: Port status was not marked as not working when previous SRV lookup status failed but SSSD still did not try to resolv SRV record again. Consequence: Failover did not cycle through all available servers and thus SSSD remained offline. Fix: Port status is marked as not working. Result: Failover mechanism will continue with next configured server.
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-10-14 04:49:11 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Domain log none

Description Kaushik Banerjee 2014-07-24 09:24:16 UTC
Description of problem:
This test was written as part of the fix to bug 954275
The failover is not proper from srv lookup to hostname resolution(via /etc/hosts)

Version-Release number of selected component (if applicable):
sssd-1.11.6-4.el6

How reproducible:
Always

Steps to Reproduce:
1. sssd.conf has "ldap_uri=_srv_,ldap://server_hostname"
2. Empty /etc/resolv.conf and /etc/hosts
3. Perform lookup (Fails and sssd goes offline)
4. Update /etc/hosts with <server_ip> <server_hostname> . Verify that you can lookup the server.
5. Sleep for 60 seconds
6. Lookup user (Still fails) 

Actual results:

Log shows:

(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_get_account_info] (0x0100):
Got request for [4097][1][name=testuser]
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_get_account_info] (0x0100):
Request processed. Returned 1,11,Fast reply - offline
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_req_set_domain] (0x0400):
Changing request domain from [LDAP] to [LDAP]
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [sdap_id_op_connect_step]
(0x4000): beginning to connect
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [fo_resolve_service_send]
(0x0100): Trying to resolve service 'LDAP'
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [get_port_status] (0x1000): Port
status of port 0 for server '(no name)' is 'not working'
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [get_port_status] (0x0100):
Reseting the status of port 0 for server '(no name)'
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]]
[fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6
seconds
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [resolve_srv_send] (0x0200): The
status of SRV lookup is not resolved
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_resolve_server_process]
(0x0080): Couldn't resolve server (SRV lookup meta-server), resolver
returned (5)
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_resolve_server_process]
(0x1000): Trying with the next one!
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [fo_resolve_service_send]
(0x0100): Trying to resolve service 'LDAP'
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [fo_resolve_service_send]
(0x0100): Trying to resolve service 'LDAP'
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]]
[fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6
seconds
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [resolve_srv_send] (0x0200): The
status of SRV lookup is not resolved
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_resolve_server_process]
(0x0080): Couldn't resolve server (SRV lookup meta-server), resolver
returned (5)
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_resolve_server_process]
(0x1000): Trying with the next one!
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [fo_resolve_service_send]
(0x0100): Trying to resolve service 'LDAP'
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [fo_resolve_service_send]
(0x0100): Trying to resolve service 'LDAP'
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]]
[fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6
seconds
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [resolve_srv_send] (0x0200): The
status of SRV lookup is not resolved
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_resolve_server_process]
(0x0080): Couldn't resolve server (SRV lookup meta-server), resolver
returned (5)
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_resolve_server_process]
(0x1000): Trying with the next one!
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_resolve_server_process]
(0x1000): Trying with the next one!
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [fo_resolve_service_send]
(0x0100): Trying to resolve service 'LDAP'
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]]
[fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6
seconds
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [resolve_srv_send] (0x0200): The
status of SRV lookup is not resolved
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [resolve_srv_send] (0x0200): The
status of SRV lookup is not resolved
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_resolve_server_process]
(0x0080): Couldn't resolve server (SRV lookup meta-server), resolver
returned (5)
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_resolve_server_process]
(0x1000): Trying with the next one!
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [fo_resolve_service_send]
(0x0100): Trying to resolve service 'LDAP'
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]]
[fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6
seconds
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [resolve_srv_send] (0x0200): The
status of SRV lookup is not resolved
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_resolve_server_process]
(0x0080): Couldn't resolve server (SRV lookup meta-server), resolver
returned (5)
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_resolve_server_process]
(0x1000): Trying with the next one!
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [fo_resolve_service_send]
(0x0100): Trying to resolve service 'LDAP'
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]]
[fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6
seconds
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [resolve_srv_send] (0x0200): The
status of SRV lookup is not resolved
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_resolve_server_process]
(0x0080): Couldn't resolve server (SRV lookup meta-server), resolver
returned (5)
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_resolve_server_process]
(0x1000): Trying with the next one!
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [fo_resolve_service_send]
(0x0100): Trying to resolve service 'LDAP'
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]]
[fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6
seconds
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [resolve_srv_send] (0x0200): The
status of SRV lookup is not resolved
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_resolve_server_process]
(0x0080): Couldn't resolve server (SRV lookup meta-server), resolver
returned (5)
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_resolve_server_process]
(0x1000): Trying with the next one!
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [fo_resolve_service_send]
(0x0100): Trying to resolve service 'LDAP'
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]]
[fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6
seconds
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [resolve_srv_send] (0x0200): The
status of SRV lookup is not resolved
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_resolve_server_process]
(0x0080): Couldn't resolve server (SRV lookup meta-server), resolver
returned (5)
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_resolve_server_process]
(0x1000): Trying with the next one!
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [fo_resolve_service_send]
(0x0100): Trying to resolve service 'LDAP'
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]]
[fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6
seconds
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [resolve_srv_send] (0x0200): The
status of SRV lookup is not resolved
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_resolve_server_process]
(0x0080): Couldn't resolve server (SRV lookup meta-server), resolver
returned (5)
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_resolve_server_process]
(0x1000): Trying with the next one!
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [fo_resolve_service_send]
(0x0100): Trying to resolve service 'LDAP'
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]]
[fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6
seconds
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [resolve_srv_send] (0x0200): The
status of SRV lookup is not resolved
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_resolve_server_process]
(0x0080): Couldn't resolve server (SRV lookup meta-server), resolver
returned (5)
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_resolve_server_process]
(0x0040): Failed to find a server after 10 attempts
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_resolve_server_done]
(0x1000): Server resolution failed: 5
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [sdap_id_op_connect_done]
(0x0020): Failed to connect, going offline (5 [Input/output error])
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_mark_offline] (0x2000):
Going offline!
(Tue Jul 22 04:52:22 2014) [sssd[be[LDAP]]] [be_run_offline_cb] (0x0080):
Going offline. Running callbacks. 

Expected results:
Failover works

Additional info:

Comment 3 Pavel Březina 2014-07-24 14:53:06 UTC
Hi,
I couldn't reproduce this bug. Do you see this also with 1.12 or current master? Can you please send me sssd.conf and the whole log file?

Thanks.

Comment 4 Kaushik Banerjee 2014-07-25 07:25:27 UTC
Created attachment 920861 [details]
Domain log

I can reproduce this with 1.11 version of sssd on RHEL6.6

The sssd.conf domain section has:
[domain/LDAP]
debug_level = 0xFFFF
id_provider = ldap
ldap_uri = _srv_ , ldap://<ldap server>
ldap_search_base = dc=example,dc=com

I will e-mail you my machine credentials privately in case you want to reproduce on my setup.

Comment 5 Pavel Březina 2014-07-25 12:11:37 UTC
Hi,
thanks for the info. It is reproducible by these steps only with sssd-1-11.

The problem is hidden in master by "failover: Shorter retry time for failed SRV" 
8ed6630fb25c7d5af6e2d9915786b94600198b01 commit, which lowers SRV retry timeout to 15 seconds instead of 14k if previous attempt to resolve SRV failed.

However, this commit only hides the problem. The log shows that for some reason we only try SRV record and we do not cycle through all configured servers. I will investigate more.

I have just a little tip for you test.

In SSSD we have a mechanism called "fast reply" which makes the user request return immediately if SSSD is in offline mode and than continue with the request out of band. I.e. the last getent command will still fail since SSSD is still offline but the user data will be downloaded in background (which brings SSSD back online).

So to make the test correct you should either grep SSSD logs to see if we are online or put another getent command at the end or signal SSSD with SIGUSR2 before the getent command (which will make it go online).

Comment 6 Jakub Hrozek 2014-07-28 07:32:36 UTC
Upstream ticket:
https://fedorahosted.org/sssd/ticket/2390

Comment 7 Jakub Hrozek 2014-07-31 12:09:58 UTC
For completeness, here are the commit hashes:
    master: 5490fa3ea4f6a4118a5188acb0e05daa302ed9d6
    sssd-1-11: 8b0aeee7fe668d8863699d942519a0ac51304049

Comment 9 Kaushik Banerjee 2014-08-05 07:23:53 UTC
Verified in build sssd-1.11.6-14.el6

Failover works properly now.
Log shows:

<snip>

(Tue Aug  5 02:39:16 2014) [sssd[be[LDAP]]] [set_srv_data_status] (0x0100): Marking SRV lookup of service 'LDAP' as 'not resolved'
(Tue Aug  5 02:39:16 2014) [sssd[be[LDAP]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (SRV lookup meta-server), resolver returned (1432158226)
(Tue Aug  5 02:39:16 2014) [sssd[be[LDAP]]] [be_resolve_server_process] (0x1000): Trying with the next one!
(Tue Aug  5 02:39:16 2014) [sssd[be[LDAP]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP'
(Tue Aug  5 02:39:16 2014) [sssd[be[LDAP]]] [get_server_status] (0x1000): Status of server 'ldapserver.example.com' is 'name not resolved'
(Tue Aug  5 02:39:16 2014) [sssd[be[LDAP]]] [get_port_status] (0x1000): Port status of port 389 for server 'ldapserver.example.com' is 'neutral'


</snip>

Comment 10 errata-xmlrpc 2014-10-14 04:49:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1375.html