Bug 1400614 - [RFE] sssd should remember DNS sites from first search
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: sssd
Version: 7.3
Hardware: All Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: SSSD Maintainers
QA Contact: Dan Lavu
Docs Contact: Aneta Šteflová Petrová
Keywords: FutureFeature
Depends On:
Blocks: 1420851 1467835 1477926 1472344 1490412 1504554
Reported: 2016-12-01 10:46 EST by Thorsten Scherf
Modified: 2018-04-10 13:10 EDT

See Also:
Fixed In Version: sssd-1.16.0-2.el7
Doc Type: Enhancement
Doc Text:
SSSD enrolled to an AD domain remembers the discovered AD site after the first successful connection

Previously, the System Security Services Daemon (SSSD) sent an LDAP ping to any Active Directory (AD) domain controller (DC) in order to determine a client's AD site. If the contacted DC was unreachable, a timeout occurred, which delayed the connection for several seconds. With this update, SSSD remembers the client's site after the first successful discovery. All subsequent LDAP pings are performed on the DC from the client's site, which helps speed up the request.
Story Points: ---
Clone Of:
: 1504554 (view as bug list)
Environment:
Last Closed: 2018-04-10 13:09:10 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2018:0929 None None None 2018-04-10 13:10 EDT

Description Thorsten Scherf 2016-12-01 10:46:50 EST
Description of problem:
When a sssd client is enrolled into a larger AD domain with multiple sites, the client should always talk to the closest domain controllers in the local site.

When the option "ad_site" is not set in sssd.conf, sssd has to discover the local site automatically. This is done separately for the GC and AD service. While the site discovery for the GC service works well, the discovery for the AD service can take a long time. The delay can be caused by unreachable domain controllers, where sssd runs into a 6s timeout for each request.

The RfE is to remember the site from the GC service discovery also for the AD service, so that the site needs to be discovered just once.
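
For comparison, the explicit setting mentioned above lives in the domain section of sssd.conf. A minimal fragment, assuming a domain called example.com and a site called ExampleSite (both are placeholders, not values from this report):

```ini
# /etc/sssd/sssd.conf (fragment)
# "example.com" and "ExampleSite" are placeholders for illustration only.
[domain/example.com]
id_provider = ad
ad_domain = example.com
# Pin the AD site explicitly instead of relying on automatic discovery.
ad_site = ExampleSite
```

With ad_site set, no automatic site discovery is needed; this RfE concerns the case where the option is left unset.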

Version-Release number of selected component (if applicable):
sssd-1.14.x

Comment 1 Pavel Březina 2016-12-02 04:34:30 EST
I talked about this case with Thorsten yesterday and proposed this solution. Basically we should store discovered site name in the sysdb and prefer this site during ldap ping.

Because some domain controllers may be unreachable from the client, it will still hit the timeouts during the first SRV resolution, but we would avoid them next time, unless the site is destroyed, in which case we would fall back to the original search.
Comment 2 Jakub Hrozek 2016-12-02 04:42:24 EST
(In reply to Pavel Březina from comment #1)
> I talked about this case with Thorsten yesterday and proposed this solution.
> Basically we should store discovered site name in the sysdb and prefer this
> site during ldap ping.
> 

Why sysdb and not in memory?

> Because some domain controllers may be unreachable from the client, it
> will still hit the timeouts during the first SRV resolution, but we would
> avoid them next time, unless the site is destroyed, in which case we
> would fall back to the original search.

Can you estimate how much work this is (iow, is it doable in 7.4 w/o removing anything already acked) ?
Comment 3 Pavel Březina 2016-12-02 05:04:23 EST
Memory would work as well. However, storing it in sysdb will help us contact the right site also upon sssd restart, which may be helpful especially during system boot.

I think all required code is already there (thanks to [1]), at this moment I see only two changes:
- load the AD site from sysdb if it is not explicitly configured
- store it when we get it

Unless there is some problem I don't see now, it should be fairly simple.

[1] https://fedorahosted.org/sssd/ticket/2765
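
The two changes listed above (load the remembered site unless one is configured, store it once discovered) can be sketched roughly as follows. This is only an illustration in Python, not SSSD code: `SiteStore` is a hypothetical file-backed stand-in for sysdb, and `preferred_site` is an invented helper name.

```python
import json
from pathlib import Path
from typing import Optional


class SiteStore:
    """File-backed stand-in for the sysdb cache (hypothetical, not SSSD code)."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def load(self) -> Optional[str]:
        # A missing or unreadable file simply means "no site remembered yet".
        try:
            return json.loads(self.path.read_text()).get("site")
        except (OSError, ValueError, AttributeError):
            return None

    def save(self, site: str) -> None:
        # Called whenever a discovery returns a site, so a restart can reuse it.
        self.path.write_text(json.dumps({"site": site}))


def preferred_site(configured: Optional[str], store: SiteStore) -> Optional[str]:
    """Return the site to try first, or None for a domain-wide search."""
    # An explicitly configured ad_site always wins over the remembered value.
    return configured or store.load()
```

Because the value is written to disk rather than kept in memory, a new `SiteStore` pointed at the same path after a restart returns the same site, which is the argument made above for sysdb.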
Comment 4 Jakub Hrozek 2016-12-02 05:24:45 EST
(In reply to Pavel Březina from comment #3)
> Memory would work as well. However, storing it in sysdb will help us contact
> the right site also upon sssd restart, which may be helpful especially
> during system boot.
> 

But isn't the site something that should always be discovered at least once? How would this work when the admin changes the site and restarts sssd on the clients to pick up the new site? Or even assigns the client to a new site and expects the clients to pick up the site dynamically?

> I think all required code is already there (thanks to [1]), at this moment I
> see only two changes:
> - load the AD site from sysdb if it is not explicitly configured
> - store it when we get it
> 
> Unless there is some problem I don't see now, it should be fairly simple.
> 
> [1] https://fedorahosted.org/sssd/ticket/2765
Comment 5 Pavel Březina 2016-12-02 06:30:07 EST
The site will always be discovered again during the LDAP ping. We will use the stored one only as a hint for which servers should be preferred during the LDAP ping.

Scenarios:

1) New start, no ad site information available
a) SSSD looks up domain controllers in dns_discovery_domain
b) SSSD sends an LDAP ping to one of these DCs
c) if some of these DCs are not reachable, we may experience timeouts
d) we get forest and site as a reply, remember it, and use this information to get our servers

2) SSSD is running, SRV records need to be renewed
a) SSSD looks up domain controllers in the selected site
b) SSSD sends an LDAP ping to one of these DCs
c) all DCs in the site should be reachable
d) we get forest and site as a reply, remember it, and use this information to get our servers (if the site has changed, it does not matter, we use the new one)

3) Stored site does not exist anymore
a) SSSD looks up domain controllers in the selected site, which fails, so we fall back to dns_discovery_domain
b) SSSD sends an LDAP ping to one of these DCs
c) if some of these DCs are not reachable, we may experience timeouts
d) we get forest and site as a reply, remember it, and use this information to get our servers
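
The three scenarios above reduce to one rule: try the remembered site first, then fall back to the whole domain. A rough Python sketch of that rule follows. It is not the SSSD implementation; `lookup_dcs` and `ldap_ping` are hypothetical callables standing in for the SRV lookup and the LDAP ping, and pinging DCs in order until one answers is an assumption of this sketch.

```python
from typing import Callable, List, Optional

# Hypothetical stand-ins: site name (or None for the whole domain) -> DC names,
# and DC name -> site reported by the DC (None models a timeout).
LookupDCs = Callable[[Optional[str]], List[str]]
LdapPing = Callable[[str], Optional[str]]


def discover_site(remembered: Optional[str],
                  lookup_dcs: LookupDCs,
                  ldap_ping: LdapPing) -> Optional[str]:
    """Return the client's site, preferring DCs from the remembered site."""
    scopes: List[Optional[str]] = []
    if remembered is not None:
        scopes.append(remembered)   # scenario 2: try the known site first
    scopes.append(None)             # scenarios 1 and 3: domain-wide search

    for scope in scopes:
        # An empty list here means the remembered site no longer exists.
        for dc in lookup_dcs(scope):
            site = ldap_ping(dc)    # may cost a timeout if the DC is unreachable
            if site is not None:
                return site         # the caller stores this for the next run
    return None
```

In the first and third scenarios the loop may still hit unreachable DCs; in the second it only touches DCs from the remembered site.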
Comment 6 Jakub Hrozek 2016-12-08 11:15:24 EST
Upstream ticket:
https://fedorahosted.org/sssd/ticket/3265
Comment 7 Jakub Hrozek 2017-08-10 12:58:27 EDT
To reproduce, prepare a forest with AD DCs. Assign the client to a site; some DCs should be in the site, some outside it.

Start the test with an empty cache. The sssd would try to find the AD site it belongs to and with the empty cache, it can choose any of the AD DCs.

Restart SSSD. It would try to find the site again after the restart, but this time it should remember the site it belongs to and should only check its site with the DCs from the site it already remembers.

As for how the test should be implemented, you can either watch DNS traffic with tcpdump and verify that with the populated cache the queries only hit the site, or you can firewall off (with DROP rules?) the AD DCs outside the site. Then, with the unpatched version, sssd would switch to offline mode because finding the site would take too long, while the patched version should stay online.
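
For the DNS-traffic variant of the test, the captured query names could be checked with something like the Python sketch below. It assumes the query names have already been extracted from the capture (how is left open) and relies on the usual AD naming of site-specific SRV records, `_ldap._tcp.<site>._sites.<domain>`. The function names are made up for illustration.

```python
from typing import Iterable, List


def is_site_scoped(qname: str, site: str) -> bool:
    """True if qname is an SRV name of the form _svc._proto.<site>._sites.<...>."""
    labels = qname.rstrip(".").lower().split(".")
    return len(labels) > 3 and labels[2] == site.lower() and labels[3] == "_sites"


def unscoped_srv_queries(qnames: Iterable[str], site: str) -> List[str]:
    """Return SRV-style queries (leading underscore) that are not limited to the site."""
    return [q for q in qnames if q.startswith("_") and not is_site_scoped(q, site)]
```

With a populated cache, one would expect `unscoped_srv_queries` to return an empty list for the queries seen right after the restart; host (A/AAAA) lookups are ignored by this check.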
Comment 10 Jakub Hrozek 2017-11-02 07:56:32 EDT
* master:
 * fb0431b13a9fcd8ac31e622503acbd10d2b73ac9
 * e16539779668dacff868999bd59dbf33e3eab872
 * f54d202db528207d7794870aabef0656b20369f1
Comment 15 Dan Lavu 2017-12-14 05:43:58 EST
Verified against sssd-1.16.0-11.el7.x86_64

Logs from the first start, with nothing cached.

=========
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [ad_get_dc_servers_done] (0x0400): Found 2 domain controllers in domain sssdad2012r2.com
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [ad_srv_plugin_dcs_done] (0x0400): About to locate suitable site
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [sdap_connect_host_send] (0x0400): Resolving host bsod2-bdc.sssdad2012r2.com
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [resolv_is_address] (0x4000): [bsod2-bdc.sssdad2012r2.com] does not look like an IP address
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [resolv_gethostbyname_step] (0x2000): Querying files
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [resolv_gethostbyname_files_send] (0x0100): Trying to resolve A record of 'bsod2-bdc.sssdad2012r2.com' in files
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [resolv_gethostbyname_step] (0x2000): Querying files
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [resolv_gethostbyname_files_send] (0x0100): Trying to resolve AAAA record of 'bsod2-bdc.sssdad2012r2.com' in files
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [resolv_gethostbyname_next] (0x0200): No more address families to retry
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [resolv_gethostbyname_step] (0x2000): Querying DNS
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [resolv_gethostbyname_dns_query] (0x0100): Trying to resolve A record of 'bsod2-bdc.sssdad2012r2.com' in DNS
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [schedule_request_timeout] (0x2000): Scheduling a timeout of 6 seconds
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [schedule_timeout_watcher] (0x2000): Scheduling DNS timeout watcher
(Thu Dec 14 12:40:08 2017) [sssd[be[sssdad2012r2.com]]] [unschedule_timeout_watcher] (0x4000): Unscheduling DNS timeout watcher
(Thu Dec 14 12:40:08 2017) [sssd[be[sssdad2012r2.com]]] [resolv_gethostbyname_dns_parse] (0x1000): Parsing an A reply
(Thu Dec 14 12:40:08 2017) [sssd[be[sssdad2012r2.com]]] [request_watch_destructor] (0x0400): Deleting request watch
(Thu Dec 14 12:40:08 2017) [sssd[be[sssdad2012r2.com]]] [sdap_connect_host_resolv_done] (0x0400): Connecting to ldap://bsod2-bdc.sssdad2012r2.com:389
(Thu Dec 14 12:40:08 2017) [sssd[be[sssdad2012r2.com]]] [sssd_async_socket_init_send] (0x4000): Using file descriptor [23] for the connection.
(Thu Dec 14 12:40:08 2017) [sssd[be[sssdad2012r2.com]]] [sssd_async_socket_init_send] (0x0400): Setting 6 seconds timeout for connecting
(Thu Dec 14 12:40:08 2017) [sssd[be[sssdad2012r2.com]]] [sdap_ldap_connect_callback_add] (0x1000): New LDAP connection to [ldap://bsod2-bdc.sssdad2012r2.com:389/??base] with fd [23].
(Thu Dec 14 12:40:08 2017) [sssd[be[sssdad2012r2.com]]] [sdap_connect_host_done] (0x0400): Successful connection to ldap://bsod2-bdc.sssdad2012r2.com:389
=========

Logs from a restarted session

=========
(Thu Dec 14 12:41:04 2017) [sssd[be[sssdad2012r2.com]]] [child_sig_handler] (0x1000): Waiting for child [629].
(Thu Dec 14 12:41:04 2017) [sssd[be[sssdad2012r2.com]]] [child_sig_handler] (0x0100): child [629] finished successfully.
(Thu Dec 14 12:41:04 2017) [sssd[be[sssdad2012r2.com]]] [read_pipe_handler] (0x0400): EOF received, client finished
(Thu Dec 14 12:41:04 2017) [sssd[be[sssdad2012r2.com]]] [sdap_get_tgt_recv] (0x0400): Child responded: 0 [FILE:/var/lib/sss/db/ccache_SSSDAD2012R2.COM], expired on [1513271463]
(Thu Dec 14 12:41:04 2017) [sssd[be[sssdad2012r2.com]]] [sdap_cli_auth_step] (0x0100): expire timeout is 900
(Thu Dec 14 12:41:04 2017) [sssd[be[sssdad2012r2.com]]] [sdap_cli_auth_step] (0x1000): the connection will expire at 1513236364
(Thu Dec 14 12:41:04 2017) [sssd[be[sssdad2012r2.com]]] [sasl_bind_send] (0x0100): Executing sasl bind mech: gssapi, user: VM-IDM-013$
(Thu Dec 14 12:41:06 2017) [sssd[be[sssdad2012r2.com]]] [sdap_cli_connect_recv] (0x0400): Connection established.
(Thu Dec 14 12:41:06 2017) [sssd[be[sssdad2012r2.com]]] [_be_fo_set_port_status] (0x8000): Setting status: PORT_WORKING. Called from: src/providers/ldap/sdap_async_connection.c: sdap_cli_connect_recv: 2067
(Thu Dec 14 12:41:06 2017) [sssd[be[sssdad2012r2.com]]] [fo_set_port_status] (0x0100): Marking port 389 of server 'bsod2-bdc.sssdad2012r2.com' as 'working'
(Thu Dec 14 12:41:06 2017) [sssd[be[sssdad2012r2.com]]] [set_server_common_status] (0x0100): Marking server 'bsod2-bdc.sssdad2012r2.com' as 'working'
<---snip---->
(Thu Dec 14 12:41:06 2017) [sssd[be[sssdad2012r2.com]]] [sdap_id_op_connect_step] (0x4000): reusing cached connection
(Thu Dec 14 12:41:06 2017) [sssd[be[sssdad2012r2.com]]] [check_ipv6_addr] (0x0200): Link local IPv6 address fe80::5054:ff:fea0:7bb8
(Thu Dec 14 12:41:06 2017) [sssd[be[sssdad2012r2.com]]] [sdap_id_op_destroy] (0x4000): releasing operation connection
=========
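
Log excerpts like the ones above can also be compared mechanically. The following Python sketch parses lines in the format shown here and lists the hosts SSSD tried to connect to. The regular expressions are derived only from these excerpts and may not cover every SSSD log line; `parse_line` and `contacted_hosts` are illustrative names.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# (timestamp) [process] [function] (0xLEVEL): message
LINE_RE = re.compile(
    r"^\((?P<ts>[^)]*)\) \[(?P<proc>.+?)\] \[(?P<func>\w+)\] "
    r"\((?P<level>0x[0-9a-fA-F]+)\): (?P<msg>.*)$"
)
CONNECT_RE = re.compile(r"Connecting to ldaps?://(?P<host>[^:/\s]+)")


class LogEntry(NamedTuple):
    timestamp: str
    process: str
    function: str
    level: int
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one debug log line; return None for lines in another format."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return LogEntry(m["ts"], m["proc"], m["func"], int(m["level"], 16), m["msg"])


def contacted_hosts(lines: Iterable[str]) -> List[str]:
    """Hosts named in 'Connecting to ldap://...' messages, in first-seen order."""
    hosts: List[str] = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        m = CONNECT_RE.search(entry.message)
        if m and m["host"] not in hosts:
            hosts.append(m["host"])
    return hosts
```

Running `contacted_hosts` over the log of the first start and over the log after the restart gives two host lists that can be checked against the DCs known to be in the client's site.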
Comment 18 errata-xmlrpc 2018-04-10 13:09:10 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0929
