Bug 966757 - SSSD failover doesn't work if the first DNS server in resolv.conf is unavailable
Summary: SSSD failover doesn't work if the first DNS server in resolv.conf is unavailable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: sssd
Version: 6.4
Hardware: All
OS: Linux
unspecified
high
Target Milestone: rc
: ---
Assignee: Jakub Hrozek
QA Contact: Kaushik Banerjee
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2013-05-23 21:44 UTC by Justin Parisi
Modified: 2020-05-02 17:22 UTC (History)
9 users (show)

Fixed In Version: sssd-1.9.2-117.el6
Doc Type: Bug Fix
Doc Text:
Cause: The default DNS timeout values were set too high, which could prevent the SSSD to falling over to all configured DNS servers. Consequence: If a faulty DNS server was configured in /etc/resolv.conf, the DNS request might have been terminated before it had the chance to cycle through all DNS server configured in /etc/resolv.conf Fix: The default DNS timeouts were lowered, allowing the SSSD to cycle through all configured DNS servers. Result: The SSSD is now able to perform failover between DNS servers correctly.
Clone Of:
Environment:
Last Closed: 2013-11-21 22:18:00 UTC
Target Upstream Version:


Attachments (Terms of Use)
Traces from client and DC of failure and success (115.00 KB, application/zip)
2013-05-23 21:44 UTC, Justin Parisi
no flags Details
SSSD logs (4.75 KB, application/zip)
2013-05-24 14:27 UTC, Justin Parisi
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github SSSD sssd issues 3008 0 None closed SSSD failover doesn't work if the first DNS server in resolv.conf is unavailable 2021-02-09 14:39:35 UTC
Red Hat Product Errata RHBA-2013:1680 0 normal SHIPPED_LIVE sssd bug fix and enhancement update 2013-11-20 21:52:37 UTC

Description Justin Parisi 2013-05-23 21:44:43 UTC
Created attachment 752397 [details]
Traces from client and DC of failure and success

Description of problem: When using SRV record failover for integration of SSSD with Active Directory, everything works fine if the first listed DNS server specified in resolv.conf is alive and well.

However, if the first listed DNS server in resolv.conf is down, SSSD black holes the request.

In a packet trace, we can see the behavior as a RST after the DNS server that is down is queried.

This can be problematic in Active Directory, as often, the DNS servers are the DCs, which in turn are also the LDAP servers. The point of failover is to survive an outage of an LDAP server, but if that LDAP server is also the first listed DNS server, SSSD just breaks.

The behavior is similar to that of attempting to use RR DNS for failover, which is not supported as per the documentation.

SSSD should not fail if there are other viable DNS servers in resolv.conf. Instead, it should move on to the next DNS server and retry the request.


Version-Release number of selected component (if applicable):
# rpm -qa sssd
sssd-1.9.2-82.7.el6_4.x86_64


How reproducible:
Consistently reproducible

Steps to Reproduce:
1. Power down the first DNS server listed in resolv.conf
2. stop SSSD, remove the cache and start SSSD
3. attempt getent or id to the LDAP server

Actual results:
getent/id fails to return valid info
kerberos ticket is issued properly, SASL bind works, but LDAP connection gets reset

Expected results:
SSSD should pick up the next DNS server and re-try the request.

Additional info:

Contents of /etc/resolv.conf file:

# cat /etc/resolv.conf
search domain.com
nameserver 10.61.179.155
nameserver 10.61.179.152
nameserver 10.61.179.174

The first server in the list is unreachable:

# ping 10.61.179.155
PING 10.61.179.155 (10.61.179.155) 56(84) bytes of data.
From 10.61.179.150 icmp_seq=2 Destination Host Unreachable
From 10.61.179.150 icmp_seq=3 Destination Host Unreachable
From 10.61.179.150 icmp_seq=4 Destination Host Unreachable
^C
--- 10.61.179.155 ping statistics ---
5 packets transmitted, 0 received, +3 errors, 100% packet loss, time 4206ms
pipe 3


Dig still works fine, meaning the other DNS servers are working properly:

# dig SRV _ldap._tcp.domain.com

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.17.rc1.el6 <<>> SRV _ldap._tcp.domain..com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 25072
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 3

;; QUESTION SECTION:
;_ldap._tcp.domain.com. IN        SRV

;; ANSWER SECTION:
_ldap._tcp.domain.com. 600 IN SRV 0 100 389 2k8-dc-3.domain.com.
_ldap._tcp.domain.com. 600 IN SRV 0 100 389 2k8-dc-1.domain.com.
_ldap._tcp.domain.com. 600 IN SRV 0 100 389 2k8-dc-2.domain.com.

;; ADDITIONAL SECTION:
2k8-dc-3.domain.com. 3600 IN A    10.61.179.174
2k8-dc-1.domain.com. 3600 IN A    10.61.179.152
2k8-dc-2.domain.com. 3600 IN A    10.61.179.155

;; Query time: 3 msec
;; SERVER: 10.61.179.152#53(10.61.179.152)
;; WHEN: Thu May 23 17:14:22 2013
;; MSG SIZE  rcvd: 260

The sssd.conf file is set up to leverage SRV records (by way of omitting the ldap_uri, krb5_kpasswd and krb5_server values:

# cat /etc/sssd/sssd.conf
[domain/default]
cache_credentials = True
case_sensitive = False
[sssd]
config_file_version = 2
services = nss, pam
domains = DOMAIN
debug_level = 0x4000
[nss]
filter_users = root,ldap,named,avahi,haldaemon,dbus,radiusd,news,nscd
filter_groups = root
[pam]
[domain/DOMAIN]
id_provider = ldap
auth_provider = krb5
case_sensitive = False
chpass_provider = krb5
ldap_search_base = dc=domain,dc=com
ldap_schema = rfc2307
ldap_sasl_mech = GSSAPI
ldap_user_object_class = user
ldap_group_object_class = group
ldap_user_home_directory = unixHomeDirectory
ldap_user_principal = userPrincipalName
ldap_group_member = memberUid
ldap_group_name = cn
ldap_account_expire_policy = ad
ldap_force_upper_case_realm = true
ldap_group_search_base = cn=Users,dc=domain,dc=com
ldap_sasl_authid = root/centos64.domain.com@DOMAIN.COM
entry_cache_timeout = 120
krb5_realm = DOMAIN.COM
cache_credentials = false
krb5_canonicalize = false

When the first server is down, SSSD lookups fail:

# service sssd stop
Stopping sssd:                                             [  OK  ]
# rm -f /var/lib/sss/db/*
# service sssd start
Starting sssd:                                             [  OK  ]
# getent passwd ldapuser
#


When a working DNS server is moved to first in the list, SSSD lookups succeed without even needing to restart SSSD:

# cat /etc/resolv.conf
search domain.com
nameserver 10.61.179.152
nameserver 10.61.179.155
nameserver 10.61.179.174

# ping 10.61.179.152
PING 10.61.179.152 (10.61.179.152) 56(84) bytes of data.
64 bytes from 10.61.179.152: icmp_seq=1 ttl=128 time=0.830 ms
64 bytes from 10.61.179.152: icmp_seq=2 ttl=128 time=0.266 ms
^C
--- 10.61.179.152 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1040ms
rtt min/avg/max/mdev = 0.266/0.548/0.830/0.282 ms


# getent passwd ldapuser
ldapuser:*:1603:513:ldapuser:/home/ldapuser:/bin/sh

Packet traces of the failed attempt and the successful attempt are attached.

In the trace:

- SSSD is started after clearing the cache
- getent passwd ldapuser is executed

Traces filtered for IPs on domain controllers and client. Actual domain names are in the traces. Domain names above just used as placeholders.

IPs:

client - 10.61.179.150
DC1/DNS1 - 10.61.179.152
DC2/DNS2 - 10.61.179.155
DC3/DNS3 - 10.61.179.174

Comment 2 Jakub Hrozek 2013-05-24 09:22:28 UTC
Hello Justin,

Thank you for the bug report. I reproduced the issue locally. Can you please just check with me that you're hitting the same codepath?

Can you put "debug_level=3" into the sssd.conf, restart the SSSD and re-run the faulty case? Then in the debug logs (/var/log/sssd/sssd_$domain.log) you should see an error message coming from the function called fo_resolve_service_done() reporting an "Input/Output Error".

In my testing the error looks like this:

[sssd[be[IDM.LAB.BOS.REDHAT.COM]]] [resolv_gethostbyname_done] (0x0040): querying hosts database failed [5]: Input/output error
[sssd[be[IDM.LAB.BOS.REDHAT.COM]]] [fo_resolve_service_done] (0x0020): Failed to resolve server 'vm-101.idm.lab.bos.redhat.com': Could not contact DNS servers
[sssd[be[IDM.LAB.BOS.REDHAT.COM]]] [set_server_common_status] (0x0100): Marking server 'vm-101.idm.lab.bos.redhat.com' as 'not working'

(and then a bunch of messages about going offline)

If you are seeing a similar error, then you're hitting upstream bug https://fedorahosted.org/sssd/ticket/1886

Comment 3 Justin Parisi 2013-05-24 13:20:38 UTC
I changed the debug level to 3, but don't see that error in the sssd_DOMAIN.log.

# cat /etc/sssd/sssd.conf | grep debug
debug_level = 3

# tail -3 /var/log/sssd/sssd_DOMAIN.log
(Thu May 23 17:36:22 2013) [sssd[be[DOMAIN]]] [id_callback] (0x0010): The Monitor returned an error [org.freedesktop.DBus.Error.NoReply]
(Fri May 24 09:05:00 2013) [sssd[be[DOMAIN]]] [id_callback] (0x0010): The Monitor returned an error [org.freedesktop.DBus.Error.NoReply]
(Fri May 24 09:07:03 2013) [sssd[be[DOMAIN]]] [id_callback] (0x0010): The Monitor returned an error [org.freedesktop.DBus.Error.NoReply]

This is what I see in messages during restart:

May 24 09:09:32 centos6-4 sssd: Starting up
May 24 09:09:32 centos6-4 sssd[be[DOMAIN]]: Starting up
May 24 09:09:37 centos6-4 sssd[pam]: Starting up
May 24 09:09:37 centos6-4 sssd[nss]: Starting up

And then this later in the process:
May 24 09:11:50 centos6-4 sssd_be: GSSAPI Error: An invalid name was supplied (Success)
May 24 09:11:50 centos6-4 sssd[be[DOMAIN]]: Shutting down
May 24 09:11:50 centos6-4 sssd[be[DOMAIN]]: Starting up

This is what I see in sssd.log during restart and getent:

(Fri May 24 09:09:32 2013) [sssd] [monitor_quit_signal] (0x0040): Monitor received Terminated: terminating children
(Fri May 24 09:09:32 2013) [sssd] [monitor_quit] (0x0040): Returned with: 0
(Fri May 24 09:09:32 2013) [sssd] [monitor_quit] (0x0020): Terminating [DOMAIN][21447]
(Fri May 24 09:09:32 2013) [sssd] [monitor_quit] (0x0020): Child [DOMAIN] exited gracefully
(Fri May 24 09:09:32 2013) [sssd] [monitor_quit] (0x0020): Terminating [pam][21437]
(Fri May 24 09:09:32 2013) [sssd] [monitor_quit] (0x0020): Child [pam] exited gracefully
(Fri May 24 09:09:32 2013) [sssd] [monitor_quit] (0x0020): Terminating [nss][21436]
(Fri May 24 09:09:32 2013) [sssd] [monitor_quit] (0x0020): Child [nss] exited gracefully
(Fri May 24 09:09:37 2013) [sssd] [services_startup_timeout] (0x0020): Providers did not start in time, forcing services 
(Fri May 24 09:10:52 2013) [sssd] [ping_check] (0x0020): A service PING timed out on [DOMAIN]. Attempt [0]
(Fri May 24 09:11:12 2013) [sssd] [ping_check] (0x0020): A service PING timed out on [DOMAIN]. Attempt [0]
(Fri May 24 09:11:22 2013) [sssd] [ping_check] (0x0020): A service PING timed out on [DOMAIN]. Attempt [1]
(Fri May 24 09:11:32 2013) [sssd] [ping_check] (0x0020): A service PING timed out on [DOMAIN]. Attempt [2]
(Fri May 24 09:11:42 2013) [sssd] [tasks_check_handler] (0x0020): Killing service [DOMAIN], not responding to pings!
(Fri May 24 09:11:42 2013) [sssd] [ping_check] (0x0020): A service PING timed out on [DOMAIN]. Attempt [3]
(Fri May 24 09:11:50 2013) [sssd] [sbus_dispatch] (0x0080): Connection is not open for dispatching.
(Fri May 24 09:11:50 2013) [sssd] [mt_svc_exit_handler] (0x0040): Child [DOMAIN] exited with code [0]
(Fri May 24 09:11:55 2013) [sssd] [mark_service_as_started] (0x0080): Invalid parent pid: 21482

# service sssd status
sssd (pid  21483) is running...

Then the service goes into ping_check and eventually errors out:

What's strange is that the parent pid is 21482, but the service was running as pid 21483.

I restarted sssd again and checked for its pid and then a pid one digit prior:

# service sssd status
sssd (pid  21530) is running...
# ps aux | grep 21530
root     21530  0.0  0.1 199284  2352 ?        Ss   09:14   0:00 /usr/sbin/sssd -f -D
root     21545  0.0  0.0 103244   828 pts/0    S+   09:14   0:00 grep 21530
# ps aux | grep 21529
root     21548  0.0  0.0 103244   828 pts/0    S+   09:14   0:00 grep 21529

Then I waited for the error to verify that it looks for the parent pid I suspect. I was not disappointed:

(Fri May 24 09:18:58 2013) [sssd] [mark_service_as_started] (0x0080): Invalid parent pid: 21529

It also seems that this service PING and eventual blackhole doesn't kick in until an actual LDAP request is attempted.

The second time I tried it, I waited longer to try the getent and the ping fails started during the getent command:

(Fri May 24 09:14:27 2013) [sssd] [services_startup_timeout] (0x0020): Providers did not start in time, forcing services startup!
(Fri May 24 09:17:52 2013) [sssd] [ping_check] (0x0020): A service PING timed out on [DOMAIN]. Attempt [0]
(Fri May 24 09:18:12 2013) [sssd] [ping_check] (0x0020): A service PING timed out on [DOMAIN]. Attempt [0]
(Fri May 24 09:18:22 2013) [sssd] [ping_check] (0x0020): A service PING timed out on [DOMAIN]. Attempt [1]
(Fri May 24 09:18:32 2013) [sssd] [ping_check] (0x0020): A service PING timed out on [DOMAIN]. Attempt [2]
(Fri May 24 09:18:42 2013) [sssd] [tasks_check_handler] (0x0020): Killing service [DOMAIN], not responding to pings!
(Fri May 24 09:18:42 2013) [sssd] [ping_check] (0x0020): A service PING timed out on [DOMAIN]. Attempt [3]
(Fri May 24 09:18:52 2013) [sssd] [ping_check] (0x0020): A service PING timed out on [DOMAIN]. Attempt [4]
(Fri May 24 09:18:53 2013) [sssd] [sbus_dispatch] (0x0080): Connection is not open for dispatching.
(Fri May 24 09:18:53 2013) [sssd] [mt_svc_exit_handler] (0x0040): Child [DOMAIN] exited with code [0]
(Fri May 24 09:18:58 2013) [sssd] [mark_service_as_started] (0x0080): Invalid parent pid: 21529

So I think that might be another issue... if the service is going to offline, should it wait until it's used? Or should it

Comment 4 Jakub Hrozek 2013-05-24 14:15:54 UTC
Yes, this sounds like a completely different issue, thank you for checking. I think the way I attempted to reproduce the issue might have been a little different as the machine I tested with was up but I stopped the DNS server.

It actually sounds like one of the components of the sssd is either crashing or one of the requests is not running to completion. But it's not easy to tell why from the snippets of the logs.

Could you please raise the debug_level to 10 (to get all the debug information) in both the [nss] and the [domain] section, restart the SSSD and re-run the case again, then attach the logs in /var/log/sssd/ ? Please obfuscate the logs in case they contain some kind of sensitive data like host names or user names. Or feel free to mail me the logs directly.

Also the invalid parent PID is concerning (and quite possibly yet another bug). Hopefully we'll be able to tell from the logs.

Comment 5 Justin Parisi 2013-05-24 14:26:53 UTC
Sure. My repro is powering down the DC entirely, as if there were a power outage at a site.

Debug level is now at 10:

# cat /etc/sssd/sssd.conf | grep debug
debug_level = 10


I blew away the old DB and logs to ensure we start from scratch:

# service sssd stop
# rm -f /var/lib/sss/db/*
# rm -f /var/log/sssd/*

Then started the service and ran getent.

# service sssd start
Starting sssd:                                             [  OK  ]

# service sssd status
sssd (pid  21791) is running...

One caveat is that sssd_nss doesn't have any output. My conf file is posted earlier in the bug report.

Attaching logs.

Comment 6 Justin Parisi 2013-05-24 14:27:38 UTC
Created attachment 752681 [details]
SSSD logs

Comment 7 Jakub Hrozek 2013-05-27 13:38:59 UTC
(In reply to Justin Parisi from comment #5)
> One caveat is that sssd_nss doesn't have any output. My conf file is posted
> earlier in the bug report.

Ah, I see, if you need to debug the subprocesses (like sssd_be or sssd_nss), then the debug_level directive belongs directly to the individual sssd.conf sections, not the [sssd] section itself.

(This used to work in earlier releases but was changed in favor of the current behaviour where debug_level only works for the process it is specified in).

Comment 8 Jakub Hrozek 2013-05-30 10:06:15 UTC
Hi Justin, any chance we could get the logs with debug_level=<high> directive in the [nss] and [domain] sections?

Comment 9 Justin Parisi 2013-05-30 13:57:37 UTC
Sure. Excerpts below. These were at level 10.

It looks like the following happens in the sssd_DOMAIN.log:

- SSSD attempts to get the SRV record from the DNS server and is able to resolve it, despite the first server being down

(Thu May 30 09:40:33 2013) [sssd[be[DOMAIN]]] [resolve_srv_cont] (0x0100): Searching for servers via SRV query '_ldap._tcp.domain.win2k8.netapp.com'
(Thu May 30 09:40:33 2013) [sssd[be[DOMAIN]]] [resolv_getsrv_send] (0x0100): Trying to resolve SRV record of '_ldap._tcp.domain.win2k8.netapp.com'
(Thu May 30 09:40:33 2013) [sssd[be[DOMAIN]]] [schedule_request_timeout] (0x2000): Scheduling a timeout of 5 seconds
(Thu May 30 09:40:33 2013) [sssd[be[DOMAIN]]] [schedule_timeout_watcher] (0x2000): Scheduling DNS timeout watcher
(Thu May 30 09:40:33 2013) [sssd[be[DOMAIN]]] [check_fd_timeouts] (0x4000): Checking for DNS timeouts
(Thu May 30 09:40:36 2013) [sssd[be[DOMAIN]]] [check_fd_timeouts] (0x4000): Checking for DNS timeouts
(Thu May 30 09:40:37 2013) [sssd[be[DOMAIN]]] [sbus_dispatch] (0x4000): dbus conn: 1642930
(Thu May 30 09:40:37 2013) [sssd[be[DOMAIN]]] [sbus_dispatch] (0x4000): Dispatching.
(Thu May 30 09:40:37 2013) [sssd[be[DOMAIN]]] [sbus_message_handler] (0x4000): Received SBUS method [ping]
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [check_fd_timeouts] (0x4000): Checking for DNS timeouts
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [unschedule_timeout_watcher] (0x4000): Unscheduling DNS timeout watcher
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [request_watch_destructor] (0x0400): Deleting request watch
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [resolve_srv_done] (0x0400): Inserted server '2k8-dc-3.domain.win2k8.netapp.com:389' for service LDAP
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [resolve_srv_done] (0x0400): Inserted server '2k8-dc-2.domain.win2k8.netapp.com:389' for service LDAP
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [resolve_srv_done] (0x0400): Inserted server '2k8-dc-1.domain.win2k8.netapp.com:389' for service LDAP
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [set_srv_data_status] (0x0100): Marking SRV lookup of service 'LDAP' as 'resolved'


- SSSD then attempts to resolve the server it knows to own the SRV record via DNS using /etc/resolv.conf. It is actually able to failover to the next DNS server in this case:

(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [get_server_status] (0x1000): Status of server '2k8-dc-3.domain.win2k8.netapp.com' is 'name not resolved'
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [resolv_is_address] (0x4000): [2k8-dc-3.domain.win2k8.netapp.com] does not look like an IP address
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_step] (0x2000): Querying files
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_files_send] (0x0100): Trying to resolve A record of '2k8-dc-3.domain.win2k8.netapp.com' in files
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [set_server_common_status] (0x0100): Marking server '2k8-dc-3.domain.win2k8.netapp.com' as 'resolving name'
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_step] (0x2000): Querying files
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_files_send] (0x0100): Trying to resolve AAAA record of '2k8-dc-3.domain.win2k8.netapp.com' in files
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_next] (0x0200): No more address families to retry
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_step] (0x2000): Querying DNS
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_dns_query] (0x0100): Trying to resolve A record of '2k8-dc-3.domain.win2k8.netapp.com' in DNS
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [schedule_request_timeout] (0x2000): Scheduling a timeout of 5 seconds
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [schedule_timeout_watcher] (0x2000): Scheduling DNS timeout watcher
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [fo_resolve_service_timeout] (0x0080): Service resolving timeout reached
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (2k8-dc-3.domain.win2k8.netapp.com), resolver returned (110)
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [be_resolve_server_process] (0x1000): Trying with the next one!
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP'
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [get_server_status] (0x1000): Status of server '2k8-dc-3.domain.win2k8.netapp.com' is 'resolving name'
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 10 seconds
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [resolve_srv_send] (0x0200): The status of SRV lookup is resolved
(Thu May 30 09:40:38 2013) [sssd[be[DOMAIN]]] [get_server_status] (0x1000): Status of server '2k8-dc-3.domain.win2k8.netapp.com' is 'resolving name'
(Thu May 30 09:40:39 2013) [sssd[be[DOMAIN]]] [check_fd_timeouts] (0x4000): Checking for DNS timeouts
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [check_fd_timeouts] (0x4000): Checking for DNS timeouts
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [unschedule_timeout_watcher] (0x4000): Unscheduling DNS timeout watcher
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_dns_parse] (0x1000): Parsing an A reply
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [request_watch_destructor] (0x0400): Deleting request watch
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [set_server_common_status] (0x0100): Marking server '2k8-dc-3.domain.win2k8.netapp.com' as 'name resolved'
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [be_resolve_server_process] (0x1000): Saving the first resolved server
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [be_resolve_server_process] (0x0200): Found address for server 2k8-dc-3.domain.win2k8.netapp.com: [10.61.179.174] TTL 3600
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [sdap_uri_callback] (0x0400): Constructed uri 'ldap://2k8-dc-3.domain.win2k8.netapp.com:389'
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [sss_ldap_init_send] (0x4000): Using file descriptor [23] for LDAP connection.
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [sss_ldap_init_send] (0x0400): Setting 6 seconds timeout for connecting
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [sdap_ldap_connect_callback_add] (0x1000): New LDAP connection to [ldap://2k8-dc-3.domain.win2k8.netapp.com:389/??base] with fd [23].


- SSSD then gets the Kerberos ticket via the SRV record and doesn't work with the first attempted server

(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [sdap_kinit_send] (0x0400): Attempting kinit (default, root/centos64.domain.win2k8.netapp.com@DOMAIN.WIN2K8.NETAPP.COM, DOMAIN.WIN2K8.NETAPP.COM, 86400)
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [sdap_kinit_next_kdc] (0x1000): Resolving next KDC for service KERBEROS
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'KERBEROS'
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [get_port_status] (0x1000): Port status of port 0 for server '(no name)' is 'neutral'
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 10 seconds
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [resolve_srv_send] (0x0200): The status of SRV lookup is neutral
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [resolve_srv_send] (0x0400): SRV resolution of service 'KERBEROS'. dns_discovery_domain not specified. Need to look it up.
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [resolve_get_domain_send] (0x1000): Host name is: centos64.domain.win2k8.netapp.com
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [resolv_is_address] (0x4000): [centos64.domain.win2k8.netapp.com] does not look like an IP address
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_step] (0x2000): Querying files
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_files_send] (0x0100): Trying to resolve A record of 'centos64.domain.win2k8.netapp.com' in files
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_step] (0x2000): Querying files
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_files_send] (0x0100): Trying to resolve AAAA record of 'centos64.domain.win2k8.netapp.com' in files
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_next] (0x0200): No more address families to retry
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_step] (0x2000): Querying DNS
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [sdap_process_result] (0x2000): Trace: sh[0x1658d80], connected[1], ops[(nil)], ldap[0x1658940]
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [sdap_process_result] (0x2000): Trace: ldap_result found nothing!
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_dns_query] (0x0100): Trying to resolve A record of 'centos64.domain.win2k8.netapp.com' in DNS
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [schedule_request_timeout] (0x2000): Scheduling a timeout of 5 seconds
(Thu May 30 09:40:41 2013) [sssd[be[DOMAIN]]] [schedule_timeout_watcher] (0x2000): Scheduling DNS timeout watcher
(Thu May 30 09:40:42 2013) [sssd[be[DOMAIN]]] [check_fd_timeouts] (0x4000): Checking for DNS timeouts
(Thu May 30 09:40:45 2013) [sssd[be[DOMAIN]]] [check_fd_timeouts] (0x4000): Checking for DNS timeouts
(Thu May 30 09:40:45 2013) [sssd[be[DOMAIN]]] [unschedule_timeout_watcher] (0x4000): Unscheduling DNS timeout watcher
(Thu May 30 09:40:45 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_dns_parse] (0x1000): Parsing an A reply
(Thu May 30 09:40:45 2013) [sssd[be[DOMAIN]]] [request_watch_destructor] (0x0400): Deleting request watch
(Thu May 30 09:40:45 2013) [sssd[be[DOMAIN]]] [resolve_get_domain_done] (0x1000): The full FQDN is: centos64.domain.win2k8.netapp.com
(Thu May 30 09:40:45 2013) [sssd[be[DOMAIN]]] [resolve_srv_cont] (0x0100): Searching for servers via SRV query '_KERBEROS._udp.domain.win2k8.netapp.com'
(Thu May 30 09:40:45 2013) [sssd[be[DOMAIN]]] [resolv_getsrv_send] (0x0100): Trying to resolve SRV record of '_KERBEROS._udp.domain.win2k8.netapp.com'
(Thu May 30 09:40:45 2013) [sssd[be[DOMAIN]]] [schedule_request_timeout] (0x2000): Scheduling a timeout of 5 seconds
(Thu May 30 09:40:45 2013) [sssd[be[DOMAIN]]] [schedule_timeout_watcher] (0x2000): Scheduling DNS timeout watcher
(Thu May 30 09:40:46 2013) [sssd[be[DOMAIN]]] [check_fd_timeouts] (0x4000): Checking for DNS timeouts
(Thu May 30 09:40:47 2013) [sssd[be[DOMAIN]]] [sbus_dispatch] (0x4000): dbus conn: 1642930
(Thu May 30 09:40:47 2013) [sssd[be[DOMAIN]]] [sbus_dispatch] (0x4000): Dispatching.
(Thu May 30 09:40:47 2013) [sssd[be[DOMAIN]]] [sbus_message_handler] (0x4000): Received SBUS method [ping]
(Thu May 30 09:40:49 2013) [sssd[be[DOMAIN]]] [check_fd_timeouts] (0x4000): Checking for DNS timeouts
(Thu May 30 09:40:49 2013) [sssd[be[DOMAIN]]] [unschedule_timeout_watcher] (0x4000): Unscheduling DNS timeout watcher
(Thu May 30 09:40:49 2013) [sssd[be[DOMAIN]]] [request_watch_destructor] (0x0400): Deleting request watch
(Thu May 30 09:40:49 2013) [sssd[be[DOMAIN]]] [resolve_srv_done] (0x0400): Inserted server '2k8-dc-1.domain.win2k8.netapp.com:88' for service KERBEROS
(Thu May 30 09:40:49 2013) [sssd[be[DOMAIN]]] [resolve_srv_done] (0x0400): Inserted server '2k8-dc-3.domain.win2k8.netapp.com:88' for service KERBEROS
(Thu May 30 09:40:49 2013) [sssd[be[DOMAIN]]] [resolve_srv_done] (0x0400): Inserted server '2k8-dc-2.domain.win2k8.netapp.com:88' for service KERBEROS
(Thu May 30 09:40:49 2013) [sssd[be[DOMAIN]]] [set_srv_data_status] (0x0100): Marking SRV lookup of service 'KERBEROS' as 'resolved'
(Thu May 30 09:40:49 2013) [sssd[be[DOMAIN]]] [get_server_status] (0x1000): Status of server '2k8-dc-1.domain.win2k8.netapp.com' is 'name not resolved'
(Thu May 30 09:40:49 2013) [sssd[be[DOMAIN]]] [resolv_is_address] (0x4000): [2k8-dc-1.domain.win2k8.netapp.com] does not look like an IP address
(Thu May 30 09:40:49 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_step] (0x2000): Querying files
(Thu May 30 09:40:49 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_files_send] (0x0100): Trying to resolve A record of '2k8-dc-1.domain.win2k8.netapp.com' in files
(Thu May 30 09:40:49 2013) [sssd[be[DOMAIN]]] [set_server_common_status] (0x0100): Marking server '2k8-dc-1.domain.win2k8.netapp.com' as 'resolving name'
(Thu May 30 09:40:49 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_step] (0x2000): Querying files
(Thu May 30 09:40:49 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_files_send] (0x0100): Trying to resolve AAAA record of '2k8-dc-1.domain.win2k8.netapp.com' in files
(Thu May 30 09:40:49 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_next] (0x0200): No more address families to retry
(Thu May 30 09:40:49 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_step] (0x2000): Querying DNS
(Thu May 30 09:40:49 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_dns_query] (0x0100): Trying to resolve A record of '2k8-dc-1.domain.win2k8.netapp.com' in DNS
(Thu May 30 09:40:49 2013) [sssd[be[DOMAIN]]] [schedule_request_timeout] (0x2000): Scheduling a timeout of 5 seconds
(Thu May 30 09:40:49 2013) [sssd[be[DOMAIN]]] [schedule_timeout_watcher] (0x2000): Scheduling DNS timeout watcher
(Thu May 30 09:40:50 2013) [sssd[be[DOMAIN]]] [check_fd_timeouts] (0x4000): Checking for DNS timeouts
(Thu May 30 09:40:51 2013) [sssd[be[DOMAIN]]] [fo_resolve_service_timeout] (0x0080): Service resolving timeout reached
(Thu May 30 09:40:51 2013) [sssd[be[DOMAIN]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (2k8-dc-1.domain.win2k8.netapp.com), resolver returned (110)
(Thu May 30 09:40:51 2013) [sssd[be[DOMAIN]]] [be_resolve_server_process] (0x1000): Trying with the next one!
(Thu May 30 09:40:51 2013) [sssd[be[DOMAIN]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'KERBEROS'
(Thu May 30 09:40:51 2013) [sssd[be[DOMAIN]]] [get_server_status] (0x1000): Status of server '2k8-dc-1.domain.win2k8.netapp.com' is 'resolving name'
(Thu May 30 09:40:51 2013) [sssd[be[DOMAIN]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 10 seconds
(Thu May 30 09:40:51 2013) [sssd[be[DOMAIN]]] [resolve_srv_send] (0x0200): The status of SRV lookup is resolved
(Thu May 30 09:40:51 2013) [sssd[be[DOMAIN]]] [get_server_status] (0x1000): Status of server '2k8-dc-1.domain.win2k8.netapp.com' is 'resolving name'
(Thu May 30 09:40:53 2013) [sssd[be[DOMAIN]]] [check_fd_timeouts] (0x4000): Checking for DNS timeouts
(Thu May 30 09:40:53 2013) [sssd[be[DOMAIN]]] [unschedule_timeout_watcher] (0x4000): Unscheduling DNS timeout watcher
(Thu May 30 09:40:53 2013) [sssd[be[DOMAIN]]] [resolv_gethostbyname_dns_parse] (0x1000): Parsing an A reply
(Thu May 30 09:40:53 2013) [sssd[be[DOMAIN]]] [request_watch_destructor] (0x0400): Deleting request watch
(Thu May 30 09:40:53 2013) [sssd[be[DOMAIN]]] [set_server_common_status] (0x0100): Marking server '2k8-dc-1.domain.win2k8.netapp.com' as 'name resolved'
(Thu May 30 09:40:53 2013) [sssd[be[DOMAIN]]] [be_resolve_server_process] (0x1000): Saving the first resolved server
(Thu May 30 09:40:53 2013) [sssd[be[DOMAIN]]] [be_resolve_server_process] (0x0200): Found address for server 2k8-dc-1.domain.win2k8.netapp.com: [10.61.179.152] TTL 3600
(Thu May 30 09:40:53 2013) [sssd[be[DOMAIN]]] [sdap_kinit_kdc_resolved] (0x1000): KDC resolved, attempting to get TGT...
(Thu May 30 09:40:53 2013) [sssd[be[DOMAIN]]] [create_tgt_req_send_buffer] (0x1000): buffer size: 103
(Thu May 30 09:40:53 2013) [sssd[be[DOMAIN]]] [child_handler_setup] (0x2000): Setting up signal handler up for pid [19051]
(Thu May 30 09:40:53 2013) [sssd[be[DOMAIN]]] [child_handler_setup] (0x2000): Signal handler set up for pid [19051]
(Thu May 30 09:40:53 2013) [sssd[be[DOMAIN]]] [set_tgt_child_timeout] (0x0400): Setting 6 seconds timeout for tgt child
(Thu May 30 09:40:53 2013) [sssd[be[DOMAIN]]] [write_pipe_handler] (0x0400): All data has been sent!
(Thu May 30 09:40:57 2013) [sssd[be[DOMAIN]]] [sbus_dispatch] (0x4000): dbus conn: 1642930
(Thu May 30 09:40:57 2013) [sssd[be[DOMAIN]]] [sbus_dispatch] (0x4000): Dispatching.
(Thu May 30 09:40:57 2013) [sssd[be[DOMAIN]]] [sbus_message_handler] (0x4000): Received SBUS method [ping]
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [get_tgt_timeout_handler] (0x4000): timeout for tgt child [19051] reached.
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [sdap_kinit_done] (0x0080): Communication with KDC timed out, trying the next one
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [fo_set_port_status] (0x0100): Marking port 88 of server '2k8-dc-1.domain.win2k8.netapp.com' as 'not working'

- SSSD then tries  the next server, which is up and working
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [sdap_kinit_next_kdc] (0x1000): Resolving next KDC for service KERBEROS
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'KERBEROS'
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [get_server_status] (0x1000): Status of server '2k8-dc-3.domain.win2k8.netapp.com' is 'name resolved'
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [get_port_status] (0x1000): Port status of port 88 for server '2k8-dc-3.domain.win2k8.netapp.com' is 'neutral'
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 10 seconds
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [resolve_srv_send] (0x0200): The status of SRV lookup is resolved
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [get_server_status] (0x1000): Status of server '2k8-dc-3.domain.win2k8.netapp.com' is 'name resolved'
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [be_resolve_server_process] (0x0200): Found address for server 2k8-dc-3.domain.win2k8.netapp.com: [10.61.179.174] TTL 3600
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [sdap_kinit_kdc_resolved] (0x1000): KDC resolved, attempting to get TGT...
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [create_tgt_req_send_buffer] (0x1000): buffer size: 103
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [child_handler_setup] (0x2000): Setting up signal handler up for pid [19052]
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [child_handler_setup] (0x2000): Signal handler set up for pid [19052]
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [set_tgt_child_timeout] (0x0400): Setting 6 seconds timeout for tgt child
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [child_sig_handler] (0x1000): Waiting for child [19052].
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [child_sig_handler] (0x0020): waitpid did not found a child with changed status.
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [child_sig_handler] (0x1000): Waiting for child [19051].
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [child_sig_handler] (0x0020): child [19051] was terminated by signal [9].
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [write_pipe_handler] (0x0400): All data has been sent!
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [read_pipe_handler] (0x0400): EOF received, client finished
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [sdap_get_tgt_recv] (0x0400): Child responded: 0 [FILE:/var/lib/sss/db/ccache_DOMAIN.WIN2K8.NETAPP.COM], expired on [1369957259]
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [sdap_cli_auth_step] (0x0100): expire timeout is 900
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [sdap_cli_auth_step] (0x1000): the connection will expire at 1369922159
(Thu May 30 09:40:59 2013) [sssd[be[DOMAIN]]] [sasl_bind_send] (0x0100): Executing sasl bind mech: GSSAPI, user: root/centos64.domain.win2k8.netapp.com@DOMAIN.WIN2K8.NETAPP.COM
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [child_sig_handler] (0x1000): Waiting for child [19052].
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [child_sig_handler] (0x0100): child [19052] finished successfully.
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sss_child_handler] (0x2000): waitpid failed [10]: No child processes
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [fo_set_port_status] (0x0100): Marking port 389 of server '2k8-dc-3.domain.win2k8.netapp.com' as 'working'
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [set_server_common_status] (0x0100): Marking server '2k8-dc-3.domain.win2k8.netapp.com' as 'working'


- SSSD attempts the LDAP search and fails

(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [be_get_account_info] (0x0100): Got request for [4097][1][idnumber=1603]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_id_op_connect_step] (0x4000): reusing cached connection
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_users_next_base] (0x0400): Searching for users with base [dc=domain,dc=win2k8,dc=netapp,dc=com]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x0400): calling ldap_search_ext with [(&(uidNumber=1603)(objectclass=user))][dc=domain,dc=win2k8,dc=netapp,dc=com].
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [objectClass]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [uid]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [userPassword]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [uidNumber]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [gidNumber]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [gecos]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [unixHomeDirectory]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [loginShell]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [userPrincipalName]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [cn]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [modifyTimestamp]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [uSNChanged]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [shadowLastChange]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [shadowMin]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [shadowMax]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [shadowWarning]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [shadowInactive]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [shadowExpire]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [shadowFlag]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [krbLastPwdChange]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [krbPasswordExpiration]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [pwdAttribute]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [authorizedService]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [accountExpires]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [userAccountControl]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [nsAccountLock]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [host]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [loginDisabled]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [loginExpirationTime]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [loginAllowedTimeMap]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_get_generic_ext_step] (0x2000): ldap_search_ext called, msgid = 6
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_process_result] (0x2000): Trace: sh[0x1658d80], connected[1], ops[0x1663910], ldap[0x1658940]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_process_message] (0x4000): Message type: [LDAP_RES_SEARCH_ENTRY]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_parse_entry] (0x4000): OriginalDN: [CN=ldapuser,CN=Users,DC=domain,DC=win2k8,DC=netapp,DC=com].
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_parse_range] (0x2000): No sub-attributes for [objectClass]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_parse_range] (0x2000): No sub-attributes for [cn]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_parse_range] (0x2000): No sub-attributes for [userAccountControl]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_parse_range] (0x2000): No sub-attributes for [accountExpires]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_parse_range] (0x2000): No sub-attributes for [userPrincipalName]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_parse_range] (0x2000): No sub-attributes for [uid]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_parse_range] (0x2000): No sub-attributes for [modifyTimeStamp]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_parse_range] (0x2000): No sub-attributes for [uidNumber]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_parse_range] (0x2000): No sub-attributes for [gidNumber]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_parse_range] (0x2000): No sub-attributes for [unixHomeDirectory]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_parse_range] (0x2000): No sub-attributes for [loginShell]
(Thu May 30 09:41:19 2013) [sssd[be[DOMAIN]]] [sdap_process_result] (0x2000): Trace: sh[0x1658d80], connected[1], ops[0x1663910], ldap[0x1658940]
(Thu May 30 09:41:27 2013) [sssd[be[DOMAIN]]] [sdap_ldap_connect_callback_add] (0x1000): New LDAP connection to [ldap://DomainDnsZones.domain.win2k8.netapp.com/DC=DomainDnsZones,DC=domain,DC=win2k8,DC=netapp,DC=com] with fd [21].

[root@centos64 ~]# getent passwd ldapuser
[root@centos64 ~]# 

This is the failure in sssd_nss.log:

(Thu May 30 09:40:28 2013) [sssd[nss]] [sss_parse_name_for_domains] (0x0200): name 'ldapuser' matched without domain, user is ldapuser
(Thu May 30 09:40:28 2013) [sssd[nss]] [sss_parse_name_for_domains] (0x0200): using default domain [(null)]
(Thu May 30 09:40:28 2013) [sssd[nss]] [nss_cmd_getpwnam] (0x0100): Requesting info for [ldapuser] from [<ALL>]
(Thu May 30 09:40:28 2013) [sssd[nss]] [sss_dp_issue_request] (0x0400): Issuing request for [0x4339e0:domains@DOMAIN]
(Thu May 30 09:40:28 2013) [sssd[nss]] [sss_dp_get_domains_msg] (0x0400): Sending get domains request for [DOMAIN][not forced][]
(Thu May 30 09:40:28 2013) [sssd[nss]] [sbus_add_timeout] (0x2000): 0x1d59590
(Thu May 30 09:40:28 2013) [sssd[nss]] [sss_dp_internal_get_send] (0x0400): Entering request [0x4339e0:domains@DOMAIN]
(Thu May 30 09:40:28 2013) [sssd[nss]] [sbus_remove_timeout] (0x2000): 0x1d59590
(Thu May 30 09:40:28 2013) [sssd[nss]] [sbus_dispatch] (0x4000): dbus conn: 1D59EC0
(Thu May 30 09:40:28 2013) [sssd[nss]] [sbus_dispatch] (0x4000): Dispatching.
(Thu May 30 09:40:28 2013) [sssd[nss]] [sss_dp_get_reply] (0x1000): Got reply from Data Provider - DP error code: 3 errno: 19 error message: Subdomains back end target is not configured
(Thu May 30 09:40:28 2013) [sssd[nss]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x1d5db00

(Thu May 30 09:40:28 2013) [sssd[nss]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x1d5dbb0

(Thu May 30 09:40:28 2013) [sssd[nss]] [ldb] (0x4000): Destroying timer event 0x1d5dbb0 "ltdb_timeout"

(Thu May 30 09:40:28 2013) [sssd[nss]] [ldb] (0x4000): Ending timer event 0x1d5db00 "ltdb_callback"

(Thu May 30 09:40:28 2013) [sssd[nss]] [sss_parse_name_for_domains] (0x0200): name 'ldapuser' matched without domain, user is ldapuser
(Thu May 30 09:40:28 2013) [sssd[nss]] [sss_parse_name_for_domains] (0x0200): using default domain [(null)]
(Thu May 30 09:40:28 2013) [sssd[nss]] [nss_cmd_getpwnam_cb] (0x0400): Requesting info for [ldapuser] from [<ALL>]
(Thu May 30 09:40:28 2013) [sssd[nss]] [sss_ncache_check_str] (0x2000): Checking negative cache for [NCE/USER/DOMAIN/ldapuser]
(Thu May 30 09:40:28 2013) [sssd[nss]] [nss_cmd_getpwnam_search] (0x0100): Requesting info for [ldapuser@DOMAIN]
(Thu May 30 09:40:28 2013) [sssd[nss]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x1d5f390

(Thu May 30 09:40:28 2013) [sssd[nss]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x1d5da10

(Thu May 30 09:40:28 2013) [sssd[nss]] [ldb] (0x4000): Destroying timer event 0x1d5da10 "ltdb_timeout"

(Thu May 30 09:40:28 2013) [sssd[nss]] [ldb] (0x4000): Ending timer event 0x1d5f390 "ltdb_callback"

(Thu May 30 09:40:28 2013) [sssd[nss]] [sss_dp_issue_request] (0x0400): Issuing request for [0x430590:1:ldapuser@DOMAIN]
(Thu May 30 09:40:28 2013) [sssd[nss]] [sss_dp_get_account_msg] (0x0400): Creating request for [DOMAIN][4097][1][name=ldapuser]
(Thu May 30 09:40:28 2013) [sssd[nss]] [sbus_add_timeout] (0x2000): 0x1d5b6d0
(Thu May 30 09:40:28 2013) [sssd[nss]] [sss_dp_internal_get_send] (0x0400): Entering request [0x430590:1:ldapuser@DOMAIN]
(Thu May 30 09:40:28 2013) [sssd[nss]] [sss_dp_req_destructor] (0x0400): Deleting request: [0x4339e0:domains@DOMAIN]
(Thu May 30 09:40:32 2013) [sssd[nss]] [sbus_dispatch] (0x4000): dbus conn: 1D5A770
(Thu May 30 09:40:32 2013) [sssd[nss]] [sbus_dispatch] (0x4000): Dispatching.
(Thu May 30 09:40:32 2013) [sssd[nss]] [sbus_message_handler] (0x4000): Received SBUS method [ping]
(Thu May 30 09:40:42 2013) [sssd[nss]] [sbus_dispatch] (0x4000): dbus conn: 1D5A770
(Thu May 30 09:40:42 2013) [sssd[nss]] [sbus_dispatch] (0x4000): Dispatching.
(Thu May 30 09:40:42 2013) [sssd[nss]] [sbus_message_handler] (0x4000): Received SBUS method [ping]
(Thu May 30 09:40:52 2013) [sssd[nss]] [sbus_dispatch] (0x4000): dbus conn: 1D5A770
(Thu May 30 09:40:52 2013) [sssd[nss]] [sbus_dispatch] (0x4000): Dispatching.
(Thu May 30 09:40:52 2013) [sssd[nss]] [sbus_message_handler] (0x4000): Received SBUS method [ping]
(Thu May 30 09:41:02 2013) [sssd[nss]] [sbus_dispatch] (0x4000): dbus conn: 1D5A770
(Thu May 30 09:41:02 2013) [sssd[nss]] [sbus_dispatch] (0x4000): Dispatching.
(Thu May 30 09:41:02 2013) [sssd[nss]] [sbus_message_handler] (0x4000): Received SBUS method [ping]
(Thu May 30 09:41:12 2013) [sssd[nss]] [sbus_dispatch] (0x4000): dbus conn: 1D5A770
(Thu May 30 09:41:12 2013) [sssd[nss]] [sbus_dispatch] (0x4000): Dispatching.
(Thu May 30 09:41:12 2013) [sssd[nss]] [sbus_message_handler] (0x4000): Received SBUS method [ping]
(Thu May 30 09:41:12 2013) [sssd[nss]] [get_client_cred] (0x4000): Client creds: euid[0] egid[0] pid[21813].
(Thu May 30 09:41:12 2013) [sssd[nss]] [reset_idle_timer] (0x4000): Idle timer re-set for client [0x1d5da10][21]
(Thu May 30 09:41:12 2013) [sssd[nss]] [accept_fd_handler] (0x0400): Client connected!
(Thu May 30 09:41:12 2013) [sssd[nss]] [reset_idle_timer] (0x4000): Idle timer re-set for client [0x1d5da10][21]
(Thu May 30 09:41:12 2013) [sssd[nss]] [reset_idle_timer] (0x4000): Idle timer re-set for client [0x1d5da10][21]
(Thu May 30 09:41:12 2013) [sssd[nss]] [sss_cmd_get_version] (0x0200): Received client version [1].
(Thu May 30 09:41:12 2013) [sssd[nss]] [sss_cmd_get_version] (0x0200): Offered version [1].
(Thu May 30 09:41:12 2013) [sssd[nss]] [reset_idle_timer] (0x4000): Idle timer re-set for client [0x1d5da10][21]
(Thu May 30 09:41:12 2013) [sssd[nss]] [reset_idle_timer] (0x4000): Idle timer re-set for client [0x1d5da10][21]
(Thu May 30 09:41:12 2013) [sssd[nss]] [reset_idle_timer] (0x4000): Idle timer re-set for client [0x1d5da10][21]
(Thu May 30 09:41:12 2013) [sssd[nss]] [sss_ncache_check_str] (0x2000): Checking negative cache for [NCE/UID/1603]
(Thu May 30 09:41:12 2013) [sssd[nss]] [nss_cmd_getpwuid_search] (0x0100): Requesting info for [1603@DOMAIN]
(Thu May 30 09:41:12 2013) [sssd[nss]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x1d65860

(Thu May 30 09:41:12 2013) [sssd[nss]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x1d65980

(Thu May 30 09:41:12 2013) [sssd[nss]] [ldb] (0x4000): Destroying timer event 0x1d65980 "ltdb_timeout"

(Thu May 30 09:41:12 2013) [sssd[nss]] [ldb] (0x4000): Ending timer event 0x1d65860 "ltdb_callback"

(Thu May 30 09:41:12 2013) [sssd[nss]] [sss_dp_issue_request] (0x0400): Issuing request for [0x430590:1:1603@DOMAIN]
(Thu May 30 09:41:12 2013) [sssd[nss]] [sss_dp_get_account_msg] (0x0400): Creating request for [DOMAIN][4097][1][idnumber=1603]
(Thu May 30 09:41:12 2013) [sssd[nss]] [sbus_add_timeout] (0x2000): 0x1d59590
(Thu May 30 09:41:12 2013) [sssd[nss]] [sss_dp_internal_get_send] (0x0400): Entering request [0x430590:1:1603@DOMAIN]
(Thu May 30 09:41:22 2013) [sssd[nss]] [sbus_dispatch] (0x4000): dbus conn: 1D5A770
(Thu May 30 09:41:22 2013) [sssd[nss]] [sbus_dispatch] (0x4000): Dispatching.
(Thu May 30 09:41:22 2013) [sssd[nss]] [sbus_message_handler] (0x4000): Received SBUS method [ping]
(Thu May 30 09:41:26 2013) [sssd[nss]] [sbus_remove_timeout] (0x2000): 0x1d5b6d0
(Thu May 30 09:41:26 2013) [sssd[nss]] [nss_cmd_getpwnam_dp_callback] (0x0040): Unable to get information from Data Provider
Error: 3, 5, (null)
Will try to return what we have in cache
(Thu May 30 09:41:26 2013) [sssd[nss]] [sss_dp_req_destructor] (0x0400): Deleting request: [0x430590:1:ldapuser@DOMAIN]
(Thu May 30 09:41:26 2013) [sssd[nss]] [reset_idle_timer] (0x4000): Idle timer re-set for client [0x1d59ae0][20]
(Thu May 30 09:41:26 2013) [sssd[nss]] [reset_idle_timer] (0x4000): Idle timer re-set for client [0x1d59ae0][20]
(Thu May 30 09:41:26 2013) [sssd[nss]] [client_recv] (0x0200): Client disconnected!
(Thu May 30 09:41:26 2013) [sssd[nss]] [client_destructor] (0x2000): Terminated client [0x1d59ae0][20]
(Thu May 30 09:41:32 2013) [sssd[nss]] [sbus_dispatch] (0x4000): dbus conn: 1D5A770
(Thu May 30 09:41:32 2013) [sssd[nss]] [sbus_dispatch] (0x4000): Dispatching.
(Thu May 30 09:41:32 2013) [sssd[nss]] [sbus_message_handler] (0x4000): Received SBUS method [ping]
(Thu May 30 09:41:42 2013) [sssd[nss]] [sbus_dispatch] (0x4000): dbus conn: 1D5A770
(Thu May 30 09:41:42 2013) [sssd[nss]] [sbus_dispatch] (0x4000): Dispatching.
(Thu May 30 09:41:42 2013) [sssd[nss]] [sbus_message_handler] (0x4000): Received SBUS method [ping]
(Thu May 30 09:41:52 2013) [sssd[nss]] [sbus_dispatch] (0x4000): dbus conn: 1D5A770
(Thu May 30 09:41:52 2013) [sssd[nss]] [sbus_dispatch] (0x4000): Dispatching.
(Thu May 30 09:41:52 2013) [sssd[nss]] [sbus_message_handler] (0x4000): Received SBUS method [ping]
(Thu May 30 09:42:02 2013) [sssd[nss]] [sbus_dispatch] (0x4000): dbus conn: 1D5A770
(Thu May 30 09:42:02 2013) [sssd[nss]] [sbus_dispatch] (0x4000): Dispatching.
(Thu May 30 09:42:02 2013) [sssd[nss]] [sbus_message_handler] (0x4000): Received SBUS method [ping]
(Thu May 30 09:42:10 2013) [sssd[nss]] [sbus_remove_timeout] (0x2000): 0x1d59590
(Thu May 30 09:42:10 2013) [sssd[nss]] [nss_cmd_getpwuid_dp_callback] (0x0040): Unable to get information from Data Provider
Error: 3, 5, (null)
Will try to return what we have in cache


One thing I notice when this issue occurs is the amount of time it takes for the request to bomb out. Perhaps there needs to be a DNS check before SSSD tries to look for SRV records for LDAP and Kerberos? That way, it can mark a known good DNS in its cache and then use that to bypass any failures down the stream.

I bumped debug back down to level 3 to see if I saw the error you mentioned earlier.

<snip>
Can you put "debug_level=3" into the sssd.conf, restart the SSSD and re-run the faulty case? Then in the debug logs (/var/log/sssd/sssd_$domain.log) you should see an error message coming from the function called fo_resolve_service_done() reporting an "Input/Output Error".

In my testing the error looks like this:

[sssd[be[IDM.LAB.BOS.REDHAT.COM]]] [resolv_gethostbyname_done] (0x0040): querying hosts database failed [5]: Input/output error
[sssd[be[IDM.LAB.BOS.REDHAT.COM]]] [fo_resolve_service_done] (0x0020): Failed to resolve server 'vm-101.idm.lab.bos.redhat.com': Could not contact DNS servers
[sssd[be[IDM.LAB.BOS.REDHAT.COM]]] [set_server_common_status] (0x0100): Marking server 'vm-101.idm.lab.bos.redhat.com' as 'not working'
</snip>


No error like that, so still not hitting that bug it seems:

[root@centos64 ~]# cat /var/log/sssd/sssd_DOMAIN.log | grep -i fo_resolve_service_done
[root@centos64 ~]#

Comment 10 Jakub Hrozek 2013-06-06 09:56:39 UTC
Upstream ticket:
https://fedorahosted.org/sssd/ticket/1966

Comment 15 Jakub Hrozek 2013-07-24 12:09:07 UTC
Fixed upstream.

Comment 17 Nirupama Karandikar 2013-10-22 12:00:26 UTC
Tested with sssd-1.9.2-128.el6.x86_64

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [   LOG    ] :: bug_automation_002: BZ 966757 SSSD failover doesn't work
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

:: [   PASS   ] :: Running 'getent -s sss passwd sssduser1' (Expected 0, got 0)
:: [   PASS   ] :: Running 'getent -s sss passwd sssduser1' (Expected 0, got 0)
:: [   LOG    ] :: Duration: 9s
:: [   LOG    ] :: Assertions: 2 good, 0 bad
:: [   PASS   ] :: RESULT: bug_automation_002: BZ 966757 SSSD failover doesn't work

Comment 18 errata-xmlrpc 2013-11-21 22:18:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1680.html


Note You need to log in before you can comment on or make changes to this bug.