Bug 735090 - Failover very slow on SSSD
Summary: Failover very slow on SSSD
Keywords:
Status: CLOSED DUPLICATE of bug 745513
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: sssd
Version: 7.0
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: pre-dev-freeze
: ---
Assignee: Jakub Hrozek
QA Contact: Chandrasekar Kannan
URL:
Whiteboard:
Depends On:
Blocks: 756082
TreeView+ depends on / blocked
 
Reported: 2011-09-01 12:41 UTC by Kemot1000
Modified: 2015-01-04 23:50 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-03-18 09:00:20 UTC


Attachments (Terms of Use)
sssd.conf (1.77 KB, text/plain)
2011-09-01 12:41 UTC, Kemot1000
no flags Details
sssd.log (1.38 MB, text/plain)
2011-09-01 16:51 UTC, Kemot1000
no flags Details
sssd.log debug level 4 (31.76 KB, application/octet-stream)
2011-09-02 09:02 UTC, Kemot1000
no flags Details


Links
System ID Priority Status Summary Last Updated
FedoraHosted SSSD 860 None None None Never

Description Kemot1000 2011-09-01 12:41:03 UTC
Created attachment 521006 [details]
sssd.conf

Description of problem:
I have SSSD configured with MSAD. We have 2 Domain Controllers. I read documentation and setup fail-over if one of those DC would be rebooted or just go down. Last week I noticed that it is not working as expected. When first DC is unavailable I'm able to login but it takes a lot of time. With both DCs up its fast. Aslo when I finally login with second DC for about 30 sec ( I imagine this is some timeout before SSSD tries again first DC ) I'm able to go through su or logoff fast. After some period of inactivity it goes back to very slow logon. 

Attached is my config.
 

Version-Release number of selected component (if applicable):
sssd-client-1.5.1-34.el6_1.3.x86_64
sssd-1.5.1-34.el6_1.3.x86_64


How reproducible:
Stop one domain controller and try to login. 

Steps to Reproduce:
1. Stop domaincontroller1
2. Try to "time su - domainuser"

  
Actual results:
[domainuser@localhost]$ time su - domainuser
Password:
[domainuser@localhost]$ exit
logout

real    3m19.465s
user    0m0.007s
sys     0m0.010s
[domainuser@localhost]$ 

Expected results:
[domainuser@localhost]$ time su - domainuser
Password:
[domainuser@localhost]$ exit
logout

real    0m4.289s
user    0m0.006s
sys     0m0.013s
[domainuser@localhost]$ 

Additional info:

I use kerberos as auth_provider. I use domain group for 'su'.

Comment 2 Stephen Gallagher 2011-09-01 14:18:20 UTC
I have a few suspicions here. The first and most obvious would be that when the primary AD server goes down, it was also configured to be the primary DNS server for your clients (check /etc/resolv.conf).

If this is so, then it's possible that at least some of this time is being spent trying to contact an unreachable DNS server. This may be related to https://fedorahosted.org/sssd/ticket/976 (which we are working on and will be backporting to RHEL)

The other thing I'd like you to check is whether you're actually authenticating online or offline. You can find this out easily by adding
pam_verbosity = 3
to the [pam] section of sssd.conf and restart sssd. This will add an informational message to the login screen when you log in if the authentication is happening offline with cached credentials (if no message is seen, authentication happened online).

If possible, please also set
debug_level = 9
in the [MYDOMAIN] section of /etc/sssd/sssd.conf, restart sssd and then reproduce the issue. Then attach /var/log/sssd/sssd_MYDOMAIN.log and /var/log/sssd/krb5_child.log (sanitized as needed) to this ticket so we can see what is taking all of the time.

Comment 3 Kemot1000 2011-09-01 15:55:28 UTC
(In reply to comment #2)
> I have a few suspicions here. The first and most obvious would be that when the
> primary AD server goes down, it was also configured to be the primary DNS
> server for your clients (check /etc/resolv.conf).

Yes this is primary dns in my configuration:
search mydomain.com
nameserver 192.168.1.22
nameserver 192.168.1.23


I tired before to setup 
ldap_uri = ldap://domaincontroller1.mydomain.com,ldap://domaincontroller2.mydomain.com
as
ldap_uri = ldap://192.168.1.22,ldap://192.168.1.23

and even now I still have this setup and problem exists. 

also krb5_server is configured with IP addresses not dns names


> If this is so, then it's possible that at least some of this time is being
> spent trying to contact an unreachable DNS server. This may be related to
> https://fedorahosted.org/sssd/ticket/976 (which we are working on and will be
> backporting to RHEL)

> The other thing I'd like you to check is whether you're actually authenticating
> online or offline. You can find this out easily by adding
> pam_verbosity = 3
> to the [pam] section of sssd.conf and restart sssd. This will add an
> informational message to the login screen when you log in if the authentication
> is happening offline with cached credentials (if no message is seen,
> authentication happened online).

It looks that every time I'm being authenticated offline because after I added parameter mentioned above to sssd.config I got 'Authenticated with cached credentials. ' for all the tries I did. The first one (very slow) and next (fast). So this is even bigger problem than I thought since even offline authentication with first DC down is a problem. 

> 
> If possible, please also set
> debug_level = 9
> in the [MYDOMAIN] section of /etc/sssd/sssd.conf, restart sssd and then
> reproduce the issue. Then attach /var/log/sssd/sssd_MYDOMAIN.log and
> /var/log/sssd/krb5_child.log (sanitized as needed) to this ticket so we can see
> what is taking all of the time.

Will add this log shortly.

Comment 4 Kemot1000 2011-09-01 16:51:46 UTC
Created attachment 521057 [details]
sssd.log

Comment 5 Kemot1000 2011-09-01 16:52:07 UTC
log sssd_mydomain.log attached 

krb5_child.log was empty

Comment 6 Kemot1000 2011-09-01 16:53:45 UTC
I used this to sanitize this log.
cat sssd_mydomain.log |grep -v "Adding original mem
berOf attributes to" |grep -v "Adding original DN" |grep -v "Adding user princip
al" |grep -v "Adding [adAccountExpires]" |grep -v "Storing info for user" |grep
-v "(8): Removing attribute" |grep -v "Original memberOf is not available for" |
grep -v "Search users with filter:" |grep -v "Storing info for group" |grep -v "
Searching cache for" |grep -v "Adding member users to group" |grep -v "[sdap_fil
l_memberships] (7):" > sssd_mydomain2.log

if I cleared some important data please let me know what part of log you need

Comment 7 Stephen Gallagher 2011-09-01 17:55:46 UTC
Ok, the most telling part of this log is:

(Thu Sep  1 17:30:05 2011) [sssd[be[MYDOMAIN]]] [sdap_rebind_proc] (1): ldap_sasl_bind_s failed (-5)[Timed out]

This is appearing in multiple places, and it's definitely a major player in the poor performance. Interestingly, this function is only called when we are automatically following LDAP referrals. If you're not using referrals in your environment, please try setting:
ldap_referrals = false
in the [domain/MYDOMAIN] section of sssd.conf.


I think I know why the referrals are failing too. It's because they're being referred to:
ldap://DomainDnsZones.MYDOMAIN.COM/DC=DomainDnsZones,DC=mydomain,DC=com
and
ldap://ForestDnsZones.MYDOMAIN.COM/DC=ForestDnsZones,DC=mydomain,DC=com

Both of those domains would require DNS lookup, and I'm betting that they're either A) not listed in DNS or B) not resolvable within six seconds because the first DNS server in the list is the machine you shut down.

My recommendation to you would be to turn off referrals for now as I described above. That seems like the best move at this point.

My secondary recommendation would be for you to set up dnsmasq (or a similar DNS caching system) on your clients so that you don't break if your Primary Domain Controller goes down and stops serving DNS.

Before you ask, we cannot currently cache these hosts in SSSD because of the way that the openldap libraries process referrals. They do the DNS resolution on their own, through glibc, and do not allow us to alter them. We're going to be finding ways to handle this differently in the future.

Comment 8 Kemot1000 2011-09-02 09:01:56 UTC
(In reply to comment #7)
> Ok, the most telling part of this log is:
> 
> (Thu Sep  1 17:30:05 2011) [sssd[be[MYDOMAIN]]] [sdap_rebind_proc] (1):
> ldap_sasl_bind_s failed (-5)[Timed out]
> 
> 
> I think I know why the referrals are failing too. It's because they're being
> referred to:
> ldap://DomainDnsZones.MYDOMAIN.COM/DC=DomainDnsZones,DC=mydomain,DC=com
> and
> ldap://ForestDnsZones.MYDOMAIN.COM/DC=ForestDnsZones,DC=mydomain,DC=com
> 
> Both of those domains would require DNS lookup, and I'm betting that they're
> either A) not listed in DNS or B) not resolvable within six seconds because the
> first DNS server in the list is the machine you shut down.

As far as A) is concerned
I checked before I posted this if those entries are in DNS. And they are. 

For B)  per this part from documentation:

>The failover mechanism does not handle DNS A records with multiple IP addresses 
>(and ldap://ForestDnsZones.MYDOMAIN.COM/DC=ForestDnsZones,DC=mydomain,DC=com is 
>exacly two DNS A records with different IP addresses) ; instead it only uses the 
>first address. DNS round-robin cannot be used for failover. Further, providing 
>multiple A records does not provide failover. Only the first A record is used, 
>and if a lookup attempt on the first record fails then the system attempts no further lookups.

I switched to SRV records. But I still wonder why does it use this at all if ldap_uri shows SSSD what is the next IP it should use in case of timeout on first DC ??


Second problem is when I try to setup it according to "10.2.3.2.4.1. Using SRV Records with Failover" I still get slow login and end up loging in with cached cridentials ( I attached log with diaglevel 4) so even following documentation failover is not working. I think this is not the problem with DNS as DNS A record is being resolved


(Fri Sep  2 10:20:54 2011) [sssd[be[MYDOMAIN]]] [sssm_ldap_id_init] (1): Missing ldap_uri, will use service discovery
...
(Fri Sep  2 10:21:05 2011) [sssd[be[MYDOMAIN]]] [resolv_gethostbyname_send] (4): Trying to resolve A record of 'MYSERVER'
(Fri Sep  2 10:21:11 2011) [sssd[be[MYDOMAIN]]] [resolve_srv_cont] (4): Searching for servers via SRV query '_ldap._tcp.mydomain.com'
(Fri Sep  2 10:21:11 2011) [sssd[be[MYDOMAIN]]] [resolv_getsrv_send] (4): Trying to resolve SRV record of '_ldap._tcp.mydomain.com'
(Fri Sep  2 10:21:17 2011) [sssd[be[MYDOMAIN]]] [set_srv_data_status] (4): Marking SRV lookup of service 'LDAP' as 'resolved'
(Fri Sep  2 10:21:17 2011) [sssd[be[MYDOMAIN]]] [get_server_status] (4): Hostname resolution expired, reseting the server status of 'domaincontroller2.mydomain.com'
(Fri Sep  2 10:21:17 2011) [sssd[be[MYDOMAIN]]] [set_server_common_status] (4): Marking server 'domaincontroller2.mydomain.com' as 'name not resolved'
(Fri Sep  2 10:21:17 2011) [sssd[be[MYDOMAIN]]] [resolv_gethostbyname_send] (4): Trying to resolve A record of 'domaincontroller2.mydomain.com'
(Fri Sep  2 10:21:17 2011) [sssd[be[MYDOMAIN]]] [set_server_common_status] (4): Marking server 'domaincontroller2.mydomain.com' as 'resolving name'
(Fri Sep  2 10:21:23 2011) [sssd[be[MYDOMAIN]]] [set_server_common_status] (4): Marking server 'domaincontroller2.mydomain.com' as 'name resolved'
(Fri Sep  2 10:21:23 2011) [sssd[be[MYDOMAIN]]] [be_resolve_server_done] (4): Found address for server domaincontroller2.mydomain.com: [192.168.1.23]

and when I try to logon it takes almost 2 minutes and ends up as offline authentication. 
(Fri Sep  2 10:22:08 2011) [sssd[be[MYDOMAIN]]] [be_get_account_info] (4): Got request for [3][1][name=domainuser]
... 
(here SSSD for several times tires to use DC that is down Why ?? and get timeout on:
(Fri Sep  2 10:24:17 2011) [sssd[be[MYDOMAIN]]] [sdap_rebind_proc] (1): ldap_sasl_bind_s failed (-5)[Timed out]
...
(Fri Sep  2 10:23:51 2011) [sssd[be[MYDOMAIN]]] [acctinfo_callback] (4): Request processed. Returned 3,110,Init Groups Failed
(Fri Sep  2 10:23:51 2011) [sssd[be[MYDOMAIN]]] [be_pam_handler] (4): Got request with the following data
(Fri Sep  2 10:23:51 2011) [sssd[be[MYDOMAIN]]] [pam_print_data] (4): command: PAM_OPEN_SESSION
(Fri Sep  2 10:23:51 2011) [sssd[be[MYDOMAIN]]] [pam_print_data] (4): domain: MYDOMAIN
(Fri Sep  2 10:23:51 2011) [sssd[be[MYDOMAIN]]] [pam_print_data] (4): user: domainuser
(Fri Sep  2 10:23:51 2011) [sssd[be[MYDOMAIN]]] [pam_print_data] (4): service: su-l
(Fri Sep  2 10:23:51 2011) [sssd[be[MYDOMAIN]]] [pam_print_data] (4): tty: tty1
(Fri Sep  2 10:23:51 2011) [sssd[be[MYDOMAIN]]] [pam_print_data] (4): ruser: root
(Fri Sep  2 10:23:51 2011) [sssd[be[MYDOMAIN]]] [pam_print_data] (4): rhost: 
(Fri Sep  2 10:23:51 2011) [sssd[be[MYDOMAIN]]] [pam_print_data] (4): authtok type: 0
(Fri Sep  2 10:23:51 2011) [sssd[be[MYDOMAIN]]] [pam_print_data] (4): authtok size: 0
(Fri Sep  2 10:23:51 2011) [sssd[be[MYDOMAIN]]] [pam_print_data] (4): newauthtok type: 0
(Fri Sep  2 10:23:51 2011) [sssd[be[MYDOMAIN]]] [pam_print_data] (4): newauthtok size: 0
(Fri Sep  2 10:23:51 2011) [sssd[be[MYDOMAIN]]] [pam_print_data] (4): priv: 0
(Fri Sep  2 10:23:51 2011) [sssd[be[MYDOMAIN]]] [pam_print_data] (4): cli_pid: 1457
(Fri Sep  2 10:23:51 2011) [sssd[be[MYDOMAIN]]] [be_pam_handler] (4): Sending result [0][MYDOMAIN]
(Fri Sep  2 10:23:51 2011) [sssd[be[MYDOMAIN]]] [be_get_account_info] (4): Got request for [4099][1][name=domainuser]

Comment 9 Kemot1000 2011-09-02 09:02:38 UTC
Created attachment 521183 [details]
sssd.log debug level 4

Comment 10 Stephen Gallagher 2011-09-02 12:11:37 UTC
(In reply to comment #8)

> >The failover mechanism does not handle DNS A records with multiple IP addresses 
> >(and ldap://ForestDnsZones.MYDOMAIN.COM/DC=ForestDnsZones,DC=mydomain,DC=com is 
> >exacly two DNS A records with different IP addresses) ; instead it only uses the 
> >first address. DNS round-robin cannot be used for failover. Further, providing 
> >multiple A records does not provide failover. Only the first A record is used, 
> >and if a lookup attempt on the first record fails then the system attempts no further lookups.
> 
> I switched to SRV records. But I still wonder why does it use this at all if
> ldap_uri shows SSSD what is the next IP it should use in case of timeout on
> first DC ??
> 

Sorry, I know this is confusing, but we don't actually control this part of the process. This is done inside libldap (provided by the openldap-libs package) and we don't have any real insight into how it behaves. My guess though is that it uses the first value returned from DNS and discards the rest, the same way we do in our lookup.

Switching to SRV records will have no effect on this because it's not under our control. That only affects direct communication with the first LDAP server, not the referral process (as I said, we're looking at ways in the future to handle referrals ourselves instead of relying on the buggy openldap-libs implementation).



> 
> Second problem is when I try to setup it according to "10.2.3.2.4.1. Using SRV
> Records with Failover" I still get slow login and end up loging in with cached
> cridentials ( I attached log with diaglevel 4) so even following documentation
> failover is not working. I think this is not the problem with DNS as DNS A
> record is being resolved
> 

See above, this isn't necessarily going to help your situation because your DNS server is the machine that's becoming unavailable.

> 
> (Fri Sep  2 10:20:54 2011) [sssd[be[MYDOMAIN]]] [sssm_ldap_id_init] (1):
> Missing ldap_uri, will use service discovery
> ...
> (Fri Sep  2 10:21:05 2011) [sssd[be[MYDOMAIN]]] [resolv_gethostbyname_send]
> (4): Trying to resolve A record of 'MYSERVER'
> (Fri Sep  2 10:21:11 2011) [sssd[be[MYDOMAIN]]] [resolve_srv_cont] (4):
> Searching for servers via SRV query '_ldap._tcp.mydomain.com'
> (Fri Sep  2 10:21:11 2011) [sssd[be[MYDOMAIN]]] [resolv_getsrv_send] (4):
> Trying to resolve SRV record of '_ldap._tcp.mydomain.com'
> (Fri Sep  2 10:21:17 2011) [sssd[be[MYDOMAIN]]] [set_srv_data_status] (4):
> Marking SRV lookup of service 'LDAP' as 'resolved'
> (Fri Sep  2 10:21:17 2011) [sssd[be[MYDOMAIN]]] [get_server_status] (4):
> Hostname resolution expired, reseting the server status of
> 'domaincontroller2.mydomain.com'
> (Fri Sep  2 10:21:17 2011) [sssd[be[MYDOMAIN]]] [set_server_common_status] (4):
> Marking server 'domaincontroller2.mydomain.com' as 'name not resolved'
> (Fri Sep  2 10:21:17 2011) [sssd[be[MYDOMAIN]]] [resolv_gethostbyname_send]
> (4): Trying to resolve A record of 'domaincontroller2.mydomain.com'
> (Fri Sep  2 10:21:17 2011) [sssd[be[MYDOMAIN]]] [set_server_common_status] (4):
> Marking server 'domaincontroller2.mydomain.com' as 'resolving name'
> (Fri Sep  2 10:21:23 2011) [sssd[be[MYDOMAIN]]] [set_server_common_status] (4):
> Marking server 'domaincontroller2.mydomain.com' as 'name resolved'
> (Fri Sep  2 10:21:23 2011) [sssd[be[MYDOMAIN]]] [be_resolve_server_done] (4):
> Found address for server domaincontroller2.mydomain.com: [192.168.1.23]
> 
> and when I try to logon it takes almost 2 minutes and ends up as offline
> authentication. 


This is happening for exactly the same reasons as before. We're getting a referral message that requires DNS resolution, and your DNS server is down. So it times out for each individual referral request. Since you have a lot of large groups, this is causing a lot of referral requests, and therefore a lot of timeouts.

I strongly encourage you to try running with 'ldap_referrals = false' as a workaround for now.

> (Fri Sep  2 10:22:08 2011) [sssd[be[MYDOMAIN]]] [be_get_account_info] (4): Got
> request for [3][1][name=domainuser]
> ... 
> (here SSSD for several times tires to use DC that is down Why ?? and get
> timeout on:

As I said above, this is because AD is trying to initiate referrals for a lot of the objects in your groups. The openldap libraries try to initiate a separate individual connection for each of these referrals, and this is a blocking operation which is timing out.

> (Fri Sep  2 10:24:17 2011) [sssd[be[MYDOMAIN]]] [sdap_rebind_proc] (1):
> ldap_sasl_bind_s failed (-5)[Timed out]
> ...



As I said, I think the only way that this is going to be solved fully is for us to implement https://fedorahosted.org/sssd/ticket/860 (which is scheduled for our next upstream release). When we're tracking referrals ourselves, we can be smarter about (for example) caching when a referred LDAP server is unavailable and not try repeatedly to contact it.

Please try the 'ldap_referrals = false' workaround for now, until this is fixed in RHEL 6.3.

Comment 11 Kemot1000 2011-09-02 12:44:07 UTC
I double checked and when I setup this ldap_referrals = false (which I did after your post) I still was getting this error ( and I still do). 

(Fri Sep  2 14:38:11 2011) [sssd[be[MYDOMAIN]]] [sdap_rebind_proc] (1): ldap_sasl_bind_s failed (-5)[Timed out] 


And on 192.168.1.23 there is DNS running as a clone of the DNS from 192.168.1.22.

Comment 12 Stephen Gallagher 2011-09-02 12:55:10 UTC
(In reply to comment #11)
> I double checked and when I setup this ldap_referrals = false (which I did
> after your post) I still was getting this error ( and I still do). 
> 
> (Fri Sep  2 14:38:11 2011) [sssd[be[MYDOMAIN]]] [sdap_rebind_proc] (1):
> ldap_sasl_bind_s failed (-5)[Timed out] 
> 
> 
> And on 192.168.1.23 there is DNS running as a clone of the DNS from
> 192.168.1.22.

Would you mind checking for typos? That code path (sdap_rebind_proc) isn't reachable when ldap_referrals = false. If you're still seeing those errors, then something is definitely wrong with disabling that feature.

Comment 13 Kemot1000 2011-09-02 13:37:09 UTC
Sorry you are right. I misspelled this. And now I'm fail-over works with SRV Records as well as with two ldap_uri defined. 


Sorry to waist your time. 

Thanks for your help

Comment 14 Kemot1000 2011-09-02 13:39:12 UTC
BTW perhaps this option should be specified in Documentation as a workaround for now for any other user that would like to use failover.

Comment 15 Stephen Gallagher 2011-09-02 13:54:44 UTC
(In reply to comment #14)
> BTW perhaps this option should be specified in Documentation as a workaround
> for now for any other user that would like to use failover.

Well, it's not really a failover issue. Failover between LDAP servers is actually working more or less correctly. The problem is actually that the DNS server is unreachable, and because the processing has massive numbers of referrals, it's taking a lot of time to time out on everything.

I don't really want to document this as a workaround because not all environments can use it. For example, environments that have partial replicas at satellite offices actually MUST use referrals for when someone tries to log in that isn't configured locally.

We're going to be working on the referral issue very shortly, and it will definitely see its way into RHEL 6.3.

Comment 16 Kemot1000 2011-09-02 13:59:09 UTC
Thanks again.


Note You need to log in before you can comment on or make changes to this bug.