Bug 884733

Summary:	sssd tries to reconnect to ldap provider too often, slow serving requests while retrying
Product:	[Fedora] Fedora	Reporter:	Orion Poplawski <orion>
Component:	sssd	Assignee:	Jakub Hrozek <jhrozek>
Status:	CLOSED NOTABUG	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	17	CC:	jhrozek, sbose, sgallagh, ssorce
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2012-12-09 14:31:09 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Orion Poplawski 2012-12-06 16:13:56 UTC

Description of problem:

Using the following configuration:

[domain/default]

auth_provider = krb5
cache_credentials = True
krb5_realm = CORA.NWRA.COM
ldap_search_base = dc=nwra,dc=com
ldap_autofs_search_base = dc=cora,dc=nwra,dc=com
krb5_server = kerberos.cora.nwra.com,kerberos2.cora.nwra.com
id_provider = ldap
ldap_id_use_start_tls = True
chpass_provider = krb5
krb5_store_password_if_offline = True
ldap_uri = ldap://ldap.cora.nwra.com/,ldap://ldap2.cora.nwra.com/
krb5_kpasswd = kerberos.cora.nwra.com
ldap_tls_cacertdir = /etc/openldap/cacerts
[sssd]

services = nss, pam, autofs
config_file_version = 2

domains = default
[nss]

Our offline laptop users are noticing frequent delays while doing operations like "ls -l".  It seems that there are two related issues:

- sssd is trying too often to reconnect to the ldap servers - every 3 - 10 minutes (it tries first for ldap.cora.nwra.com then for ldap2.cora.nwra.com each time):

(Thu Dec  6 07:00:01 2012) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP'
(Thu Dec  6 07:00:07 2012) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP'
(Thu Dec  6 07:10:01 2012) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP'
(Thu Dec  6 07:10:07 2012) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP'
(Thu Dec  6 07:15:03 2012) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP'
(Thu Dec  6 07:15:09 2012) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP'


- requests are not processed while sssd is trying to reconnect to the ldap server.

e.g.:

(Thu Dec  6 07:10:01 2012) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP'
(Thu Dec  6 07:10:01 2012) [sssd[be[default]]] [get_port_status] (0x0100): Reseting the status of port 389 for server 'ldap.cora.nwra.com'
(Thu Dec  6 07:10:01 2012) [sssd[be[default]]] [be_resolve_server_done] (0x0200): Found address for server ldap.cora.nwra.com: [4.28.99.180] TTL 21581
(Thu Dec  6 07:10:01 2012) [sssd[be[default]]] [sss_ldap_init_send] (0x0400): Setting 6 seconds timeout for connecting
(Thu Dec  6 07:10:01 2012) [sssd[be[default]]] [be_get_account_info] (0x0100): Got request for [4098][1][name=pulse-rt]
(Thu Dec  6 07:10:07 2012) [sssd[be[default]]] [sdap_async_sys_connect_timeout] (0x0100): The LDAP connection timed out
(Thu Dec  6 07:10:07 2012) [sssd[be[default]]] [sss_ldap_init_sys_connect_done] (0x0020): sdap_async_sys_connect request failed.
(Thu Dec  6 07:10:07 2012) [sssd[be[default]]] [sdap_sys_connect_done] (0x0020): sdap_async_connect_call request failed.
(Thu Dec  6 07:10:07 2012) [sssd[be[default]]] [fo_set_port_status] (0x0100): Marking port 389 of server 'ldap.cora.nwra.com' as 'not working'
(Thu Dec  6 07:10:07 2012) [sssd[be[default]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP'
(Thu Dec  6 07:10:07 2012) [sssd[be[default]]] [get_server_status] (0x0100): Reseting the server status of 'ldap2.cora.nwra.com'
(Thu Dec  6 07:10:07 2012) [sssd[be[default]]] [resolv_gethostbyname_files_send] (0x0100): Trying to resolve A record of 'ldap2.cora.nwra.com' in files
(Thu Dec  6 07:10:07 2012) [sssd[be[default]]] [set_server_common_status] (0x0100): Marking server 'ldap2.cora.nwra.com' as 'resolving name'
(Thu Dec  6 07:10:07 2012) [sssd[be[default]]] [resolv_gethostbyname_files_send] (0x0100): Trying to resolve AAAA record of 'ldap2.cora.nwra.com' in files
(Thu Dec  6 07:10:07 2012) [sssd[be[default]]] [resolv_gethostbyname_next] (0x0200): No more address families to retry
(Thu Dec  6 07:10:07 2012) [sssd[be[default]]] [resolv_gethostbyname_dns_query] (0x0100): Trying to resolve A record of 'ldap2.cora.nwra.com' in DNS
(Thu Dec  6 07:10:07 2012) [sssd[be[default]]] [request_watch_destructor] (0x0400): Deleting request watch
(Thu Dec  6 07:10:07 2012) [sssd[be[default]]] [resolv_gethostbyname_dns_query] (0x0100): Trying to resolve AAAA record of 'ldap2.cora.nwra.com' in DNS
(Thu Dec  6 07:10:07 2012) [sssd[be[default]]] [request_watch_destructor] (0x0400): Deleting request watch
(Thu Dec  6 07:10:07 2012) [sssd[be[default]]] [resolv_gethostbyname_next] (0x0200): No more address families to retry
(Thu Dec  6 07:10:07 2012) [sssd[be[default]]] [resolv_gethostbyname_next] (0x0100): No more hosts databases to retry
(Thu Dec  6 07:10:07 2012) [sssd[be[default]]] [fo_resolve_service_done] (0x0020): Failed to resolve server 'ldap2.cora.nwra.com': Domain name not found
(Thu Dec  6 07:10:07 2012) [sssd[be[default]]] [set_server_common_status] (0x0100): Marking server 'ldap2.cora.nwra.com' as 'not working'
(Thu Dec  6 07:10:07 2012) [sssd[be[default]]] [sdap_id_op_connect_done] (0x0020): Failed to connect, going offline (5 [Input/output error])
(Thu Dec  6 07:10:07 2012) [sssd[be[default]]] [be_run_offline_cb] (0x0080): Going offline. Running callbacks.
(Thu Dec  6 07:10:07 2012) [sssd[be[default]]] [acctinfo_callback] (0x0100): Request processed. Returned 1,11,Offline

It ended up taking 6 seconds to process the request for "pulse-rt" (Actually, I don't see a "Request processed" entry to match the pulse-rt request so I'm just assuming this, perhaps something else is going on).  It seems to me that the reconnect should be happening asynchronously and sssd should continue to serve fast offline replies while that is happening.

It seems like this behavior may have started when we added the backup ldap server to the configuration, so that may be causing problems (and perhaps the fact that ldap2 cannot be resolved makes for a different situation).  I'm moving back to a single server to see if that helps.

Version-Release number of selected component (if applicable):
sssd-1.8.5-3.fc17.x86_64

Comment 1 Jakub Hrozek 2012-12-06 19:22:29 UTC

Sorry, but I'm a little confused now -- does this timeout only happen for a user who is completely offline? Or when just one of the servers is not reachable?

I'm confused because you stated "our offline laptop users" but the logs show that the server could be resolved, just that the connect timeout fires.

Local or system users like pulse-rt might be good candidates to filter out completely using the "filter_users" configuration option, by the way.

Comment 2 Orion Poplawski 2012-12-06 20:12:16 UTC

Sorry, replace offline with offsite (and so no access to ldap server) but network connected.  I haven't really checked with completely offline use.

Why do items in /etc/passwd get searched in sss if:

passwd:     files sss
shadow:     files sss
group:      files sss

Shouldn't it get it from passwd?  If filter_users is still useful for this situation it would be nice to have it automatically configured.

I'm usually on irc (orionp) if it's helpful to chat about it there.

Comment 3 Jakub Hrozek 2012-12-07 10:07:38 UTC

OK, so there are several problems at play.

Your ldap.cora. LDAP server address is resolvable from the Internet (unlike ldap2), so whenever a request comes in, the address is resolved and we attempt to connect. I assume the firewall on the machine drops packets rather then rejects? Because with reject, the SSSD should immediately detect to move on. When the packets are dropped, the SSSD waits until the ldap_network_timeout fires, which is 6 seconds by default (so you can also experiment with lowering the value of ldap_network_timeout).

You're right that when a local group is requested AND it exists, the Name Service Switch would stop processing after first hit as the default action in nsswitch.conf is [sucess=return]. However, to my surprise, the pulse-rt group[1] doesn't exist at all on none of the Fedora. The fact that pulse-rt is missing is a problem not only for the SSSD, nss_ldap/nss-pam-ldapd/any other nss module would get queried as well. I consider that a bug in pulseaudio, the group should be present, but empty, because then the request would just shortcut from libnss_files. I filed:
https://bugzilla.redhat.com/show_bug.cgi?id=885020

A short-term workaround of putting the group into SSSD's negative cache might do the trick. I'm not sure about putting pulse-rt into the default negcache list, because some admin might actually want to create the group in LDAP rather than on the clients in a completely centrally managed fashion.

In conclusion, I don't this there's much we can be doing differently in the SSSD.

[1] pulse-rt is a pulseaudio group. According to pulseaudio(1), the membership in that group determines if sounds played by the particular user should receive realtime priority.

Comment 4 Orion Poplawski 2012-12-07 18:51:01 UTC

Thanks for the analysis - I've changed our firewall to reject the ldap requests and will try creating the local pulse-rt group as well.

It stills seems to me that sssd should be able to serve nss requests quickly from cache even while trying to reconnect to the ldap server.  But I'll be the first to admit that I don't know all of the issues surrounding that behavior.

Comment 5 Jakub Hrozek 2012-12-09 14:31:09 UTC

(In reply to comment #4)
> Thanks for the analysis - I've changed our firewall to reject the ldap
> requests and will try creating the local pulse-rt group as well.
> 

That sounds like the best approach to me..I haven't heard back from the pulseaudio devs if creating the local pulse-rt group has any disadvantages, but their man page doesn't make it sound like there are any.

> It stills seems to me that sssd should be able to serve nss requests quickly
> from cache even while trying to reconnect to the ldap server.  But I'll be
> the first to admit that I don't know all of the issues surrounding that
> behavior.

It does so for existing groups, during cache midpoint refresh -- see the entry_cache_percentage option in man sssd.conf

I'm not sure if it would be possible to do the same for non-existing entries like pulse-rt. We need to connect to the remote server to reliably answer if the entry is there or not.

The negative hits are cached by the negative cache, which is rather short-lived by default and non-persistent between reboots.