Bug 1400614

Summary:	[RFE] sssd should remember DNS sites from first search
Product:	Red Hat Enterprise Linux 7	Reporter:	Thorsten Scherf <tscherf>
Component:	sssd	Assignee:	SSSD Maintainers <sssd-maint>
Status:	CLOSED ERRATA	QA Contact:	Dan Lavu <dlavu>
Severity:	medium	Docs Contact:	Aneta Šteflová Petrová <apetrova>
Priority:	medium
Version:	7.3	CC:	enewland, fidencio, gparente, grajaiya, jhrozek, ldelouw, lslebodn, mkosek, mzidek, pasik, pbrezina, sbose, sgoveas
Target Milestone:	rc	Keywords:	FutureFeature
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	sssd-1.16.0-2.el7	Doc Type:	Enhancement
Doc Text:	SSSD enrolled in an AD domain remembers the discovered AD site after the first successful connection. Previously, the System Security Services Daemon (SSSD) sent an LDAP ping to any Active Directory (AD) domain controller (DC) in order to determine a client's AD site. If the contacted DC was unreachable, a timeout occurred, which delayed the connection for several seconds. With this update, SSSD remembers the client's site after the first successful discovery. All subsequent LDAP pings are performed on the DC from the client's site, which helps speed up the request.	Story Points:	---
Clone Of:
Clones:	1504554		Environment:
Last Closed:	2018-04-10 17:09:10 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1420851, 1467835, 1472344, 1477926, 1490412, 1504554

Description Thorsten Scherf 2016-12-01 15:46:50 UTC

Description of problem:
When an SSSD client is enrolled in a large AD domain with multiple sites, the client should always talk to the closest domain controllers, those in its local site.

When the option "ad_site" is not set in sssd.conf, sssd has to discover the local site automatically. This is done separately for the GC and AD services. While site discovery for the GC service works well, discovery for the AD service can take a long time. The delay can be caused by unreachable domain controllers, for each of which sssd runs into a 6-second timeout.
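A rough model of that delay, assuming DCs are contacted one after another and that every unreachable DC costs the full 6-second timeout before the next one is tried. This is a simplification that ignores SSSD's real failover ordering; the function name is hypothetical:

```python
# Hypothetical model, not SSSD code: DCs are tried in order until one
# answers; each unreachable DC costs the full timeout (6 s per this report).
DEFAULT_TIMEOUT_S = 6


def site_discovery_delay(reachable, timeout_s=DEFAULT_TIMEOUT_S):
    """Return seconds spent on timeouts before the first reachable DC.

    `reachable` lists, in the order SSSD would try them, whether each DC
    answers (True) or is unreachable (False).
    """
    delay = 0
    for ok in reachable:
        if ok:
            return delay
        delay += timeout_s
    raise RuntimeError("no reachable domain controller")


# Two unreachable DCs before a reachable one: 2 * 6 = 12 seconds.
print(site_discovery_delay([False, False, True]))  # 12
```

Because the cost grows linearly with the number of unreachable DCs tried first, preferring DCs from an already known site removes most of it.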

This RFE asks SSSD to reuse the site found by the GC service discovery for the AD service as well, so that the site needs to be discovered only once.

Version-Release number of selected component (if applicable):
sssd-1.14.x

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Pavel Březina 2016-12-02 09:34:30 UTC

I talked about this case with Thorsten yesterday and proposed this solution: we should store the discovered site name in sysdb and prefer this site during the LDAP ping.

Because some domain controllers may be unreachable from the client, it will still hit the timeouts during the first SRV resolution, but we would avoid them the next time, unless the stored site name is removed, in which case we would fall back to the original search.

Comment 2 Jakub Hrozek 2016-12-02 09:42:24 UTC

(In reply to Pavel Březina from comment #1)
> I talked about this case with Thorsten yesterday and proposed this solution:
> we should store the discovered site name in sysdb and prefer this
> site during the LDAP ping.
> 

Why sysdb and not in memory?

> Because some domain controllers may be unreachable from the client, it
> will still hit the timeouts during the first SRV resolution, but we would
> avoid them the next time, unless the stored site name is removed, in which
> case we would fall back to the original search.

Can you estimate how much work this is (in other words, is it doable in 7.4 without removing anything already acked)?

Comment 3 Pavel Březina 2016-12-02 10:04:23 UTC

Memory would work as well. However, storing it in sysdb will also let us contact the right site after an sssd restart, which may be especially helpful during system boot.

I think all the required code is already there (thanks to [1]); at this moment I see only two changes:
- load the AD site from sysdb if it is not explicitly configured
- store it when we get it

Unless there is some problem I don't see now, it should be fairly simple.

[1] https://fedorahosted.org/sssd/ticket/2765
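The two changes listed above can be sketched as follows. This is not SSSD code: a JSON file stands in for sysdb, and `load_site_hint` and `store_site` are hypothetical names. What it illustrates is the precedence: an explicitly configured ad_site always wins, otherwise the stored site (if any) is used, and each newly discovered site is written back so it survives an sssd restart.

```python
import json
import os
import tempfile
from typing import Optional

# Hypothetical stand-in for sysdb: a small JSON file holding the last site.


def load_site_hint(configured_site: Optional[str],
                   store_path: str) -> Optional[str]:
    """Return the configured ad_site, else the stored site, else None."""
    if configured_site:
        return configured_site  # explicit ad_site always wins
    try:
        with open(store_path, encoding="utf-8") as f:
            return json.load(f).get("site")
    except (OSError, ValueError):
        # Nothing stored yet or unreadable: caller does full discovery.
        return None


def store_site(site: str, store_path: str) -> None:
    """Persist the discovered site so it survives an sssd restart."""
    tmp_path = store_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump({"site": site}, f)
    os.replace(tmp_path, store_path)  # swap in the new value in one step


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "site.json")
    print(load_site_hint(None, path))      # None: first start
    store_site("Default-First-Site-Name", path)
    print(load_site_hint(None, path))      # Default-First-Site-Name
    print(load_site_hint("Branch", path))  # Branch: ad_site configured
```

Keeping the value on disk rather than only in memory is what allows the right site to be preferred immediately after a restart, which is the argument made in the comment above.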

Comment 4 Jakub Hrozek 2016-12-02 10:24:45 UTC

(In reply to Pavel Březina from comment #3)
> Memory would work as well. However, storing it in sysdb will help us contact
> the right site also upon sssd restart, which may be helpful especially
> during system boot.
> 

But isn't the site something that should always be discovered at least once? How would this work when the admin changes the site and restarts sssd on the clients so they pick up the new site? Or when the admin assigns the client to a new site and expects the clients to pick up the site dynamically?

> I think all the required code is already there (thanks to [1]); at this
> moment I see only two changes:
> - load the AD site from sysdb if it is not explicitly configured
> - store it when we get it
> 
> Unless there is some problem I don't see now, it should be fairly simple.
> 
> [1] https://fedorahosted.org/sssd/ticket/2765

Comment 5 Pavel Březina 2016-12-02 11:30:07 UTC

The site will always be discovered again during the LDAP ping. We will use the stored site only as a hint about which servers should be preferred for the LDAP ping.

Scenarios:

1) New start, no AD site information available
a) SSSD looks up domain controllers in dns_discovery_domain
b) SSSD sends an LDAP ping to one of these DCs
c) if some of these DCs are not reachable, we may experience timeouts
d) we get the forest and site as a reply, remember them, and use this information to get our servers

2) SSSD is running, SRV records need to be renewed
a) SSSD looks up domain controllers in the stored site
b) SSSD sends an LDAP ping to one of these DCs
c) all DCs in the site should be reachable
d) we get the forest and site as a reply, remember them, and use this information to get our servers (if the site has changed, it does not matter; we use the new one)

3) Stored site does not exist anymore
a) SSSD looks up domain controllers in the stored site; this fails, so we fall back to dns_discovery_domain
b) SSSD sends an LDAP ping to one of these DCs
c) if some of these DCs are not reachable, we may experience timeouts
d) we get the forest and site as a reply, remember them, and use this information to get our servers
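The scenarios above can be simulated with a small model. Everything here is an assumption for illustration, not SSSD's implementation: DNS SRV lookups are replaced by a dictionary (`srv`) mapping a site name, or None for the whole dns_discovery_domain, to its DCs; each DC either is unreachable or answers the LDAP ping with the client's site; DCs are tried in list order; and each unreachable DC costs one timeout. The function returns the discovered site and the number of timeouts hit.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical model of the flow above, not SSSD's implementation.
# srv maps a site name (or None = whole dns_discovery_domain) to its DCs.
# ping maps a DC name to the client site it reports, or None if unreachable.
SrvTable = Dict[Optional[str], List[str]]
PingTable = Dict[str, Optional[str]]


def discover_site(hint: Optional[str], srv: SrvTable,
                  ping: PingTable) -> Tuple[str, int]:
    """Return (discovered_site, timeouts_hit).

    Step a: prefer DCs of the stored site; if that lookup yields nothing
    (site no longer exists), fall back to the whole domain.
    Steps b-d: LDAP-ping DCs in order until one answers; the answer is
    authoritative, even if it differs from the hint.
    """
    candidates = srv.get(hint, []) if hint is not None else []
    if not candidates:
        candidates = srv[None]
    timeouts = 0
    for dc in candidates:
        site = ping.get(dc)
        if site is None:
            timeouts += 1  # unreachable DC: wait for the timeout
            continue
        return site, timeouts
    raise RuntimeError("no domain controller answered the LDAP ping")


srv: SrvTable = {
    None: ["far-dc1", "far-dc2", "local-dc"],  # whole domain
    "Local": ["local-dc"],                    # the client's site
}
ping: PingTable = {"far-dc1": None, "far-dc2": None, "local-dc": "Local"}

print(discover_site(None, srv, ping))       # 1) no hint: ('Local', 2)
print(discover_site("Local", srv, ping))    # 2) valid hint: ('Local', 0)
print(discover_site("Removed", srv, ping))  # 3) stale hint: ('Local', 2)
```

The caller would then store the returned site, so the next renewal takes the cheap path of the second scenario; only the first start and a stale site pay the timeouts.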

Comment 6 Jakub Hrozek 2016-12-08 16:15:24 UTC

Upstream ticket:
https://fedorahosted.org/sssd/ticket/3265

Comment 7 Jakub Hrozek 2017-08-10 16:58:27 UTC

To reproduce, prepare a forest with AD DCs. Assign the client to a site; some DCs should be in the site and some outside of it.

Start the test with an empty cache. SSSD tries to find the AD site it belongs to, and with the empty cache it can contact any of the AD DCs.

Restart SSSD. It tries to find the site again after the restart, but this time it should remember the site it belongs to and should only query the DCs from the remembered site.

To implement the test, you can either capture DNS traffic with tcpdump and verify that, with the populated cache, the queries only hit the site, or you can firewall off (with DROP rules?) the AD DCs outside the site. Then, with the unpatched version, sssd would switch to offline mode because finding the site would take too long, while the patched version should stay online.
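Besides tcpdump, the SSSD domain log can show which DCs were contacted. A minimal sketch, assuming the `Connecting to ldap://HOST:PORT` message format that appears in the verification logs in comment 15 (logged at level 0x0400, so the domain's debug_level must be high enough); the set of in-site hosts is supplied by the tester, and the host names below are hypothetical:

```python
import re
from typing import Iterable, List, Set

# Matches the sdap_connect_host_resolv_done message, e.g.
# "... Connecting to ldap://bsod2-bdc.sssdad2012r2.com:389"
CONNECT_RE = re.compile(r"Connecting to ldap://([^:/\s]+)")


def contacted_hosts(log_lines: Iterable[str]) -> List[str]:
    """Return the LDAP hosts SSSD tried to connect to, in log order."""
    hosts = []
    for line in log_lines:
        m = CONNECT_RE.search(line)
        if m:
            hosts.append(m.group(1))
    return hosts


def out_of_site(log_lines: Iterable[str], in_site: Set[str]) -> List[str]:
    """Return contacted hosts that are not in the client's site."""
    return [h for h in contacted_hosts(log_lines) if h not in in_site]


log = [
    "(Thu Dec 14 12:40:08 2017) [sssd[be[example.com]]] "
    "[sdap_connect_host_resolv_done] (0x0400): "
    "Connecting to ldap://dc1.example.com:389",
    "(Thu Dec 14 12:40:14 2017) [sssd[be[example.com]]] "
    "[sdap_connect_host_resolv_done] (0x0400): "
    "Connecting to ldap://remote-dc.example.com:389",
]
print(out_of_site(log, {"dc1.example.com"}))  # ['remote-dc.example.com']
```

With the patched version and a populated cache, the expectation from the steps above is an empty list after the restart.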

Comment 10 Jakub Hrozek 2017-11-02 11:56:32 UTC

* master:
 * fb0431b13a9fcd8ac31e622503acbd10d2b73ac9
 * e16539779668dacff868999bd59dbf33e3eab872
 * f54d202db528207d7794870aabef0656b20369f1

Comment 15 Dan Lavu 2017-12-14 10:43:58 UTC

Verified against sssd-1.16.0-11.el7.x86_64

Logs from the first start, with nothing cached.

=========
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [ad_get_dc_servers_done] (0x0400): Found 2 domain controllers in domain sssdad2012r2.com
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [ad_srv_plugin_dcs_done] (0x0400): About to locate suitable site
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [sdap_connect_host_send] (0x0400): Resolving host bsod2-bdc.sssdad2012r2.com
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [resolv_is_address] (0x4000): [bsod2-bdc.sssdad2012r2.com] does not look like an IP address
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [resolv_gethostbyname_step] (0x2000): Querying files
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [resolv_gethostbyname_files_send] (0x0100): Trying to resolve A record of 'bsod2-bdc.sssdad2012r2.com' in files
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [resolv_gethostbyname_step] (0x2000): Querying files
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [resolv_gethostbyname_files_send] (0x0100): Trying to resolve AAAA record of 'bsod2-bdc.sssdad2012r2.com' in files
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [resolv_gethostbyname_next] (0x0200): No more address families to retry
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [resolv_gethostbyname_step] (0x2000): Querying DNS
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [resolv_gethostbyname_dns_query] (0x0100): Trying to resolve A record of 'bsod2-bdc.sssdad2012r2.com' in DNS
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [schedule_request_timeout] (0x2000): Scheduling a timeout of 6 seconds
(Thu Dec 14 12:40:07 2017) [sssd[be[sssdad2012r2.com]]] [schedule_timeout_watcher] (0x2000): Scheduling DNS timeout watcher
(Thu Dec 14 12:40:08 2017) [sssd[be[sssdad2012r2.com]]] [unschedule_timeout_watcher] (0x4000): Unscheduling DNS timeout watcher
(Thu Dec 14 12:40:08 2017) [sssd[be[sssdad2012r2.com]]] [resolv_gethostbyname_dns_parse] (0x1000): Parsing an A reply
(Thu Dec 14 12:40:08 2017) [sssd[be[sssdad2012r2.com]]] [request_watch_destructor] (0x0400): Deleting request watch
(Thu Dec 14 12:40:08 2017) [sssd[be[sssdad2012r2.com]]] [sdap_connect_host_resolv_done] (0x0400): Connecting to ldap://bsod2-bdc.sssdad2012r2.com:389
(Thu Dec 14 12:40:08 2017) [sssd[be[sssdad2012r2.com]]] [sssd_async_socket_init_send] (0x4000): Using file descriptor [23] for the connection.
(Thu Dec 14 12:40:08 2017) [sssd[be[sssdad2012r2.com]]] [sssd_async_socket_init_send] (0x0400): Setting 6 seconds timeout for connecting
(Thu Dec 14 12:40:08 2017) [sssd[be[sssdad2012r2.com]]] [sdap_ldap_connect_callback_add] (0x1000): New LDAP connection to [ldap://bsod2-bdc.sssdad2012r2.com:389/??base] with fd [23].
(Thu Dec 14 12:40:08 2017) [sssd[be[sssdad2012r2.com]]] [sdap_connect_host_done] (0x0400): Successful connection to ldap://bsod2-bdc.sssdad2012r2.com:389
=========

Logs from a restarted session

=========
(Thu Dec 14 12:41:04 2017) [sssd[be[sssdad2012r2.com]]] [child_sig_handler] (0x1000): Waiting for child [629].
(Thu Dec 14 12:41:04 2017) [sssd[be[sssdad2012r2.com]]] [child_sig_handler] (0x0100): child [629] finished successfully.
(Thu Dec 14 12:41:04 2017) [sssd[be[sssdad2012r2.com]]] [read_pipe_handler] (0x0400): EOF received, client finished
(Thu Dec 14 12:41:04 2017) [sssd[be[sssdad2012r2.com]]] [sdap_get_tgt_recv] (0x0400): Child responded: 0 [FILE:/var/lib/sss/db/ccache_SSSDAD2012R2.COM], expired on [1513271463]
(Thu Dec 14 12:41:04 2017) [sssd[be[sssdad2012r2.com]]] [sdap_cli_auth_step] (0x0100): expire timeout is 900
(Thu Dec 14 12:41:04 2017) [sssd[be[sssdad2012r2.com]]] [sdap_cli_auth_step] (0x1000): the connection will expire at 1513236364
(Thu Dec 14 12:41:04 2017) [sssd[be[sssdad2012r2.com]]] [sasl_bind_send] (0x0100): Executing sasl bind mech: gssapi, user: VM-IDM-013$
(Thu Dec 14 12:41:06 2017) [sssd[be[sssdad2012r2.com]]] [sdap_cli_connect_recv] (0x0400): Connection established.
(Thu Dec 14 12:41:06 2017) [sssd[be[sssdad2012r2.com]]] [_be_fo_set_port_status] (0x8000): Setting status: PORT_WORKING. Called from: src/providers/ldap/sdap_async_connection.c: sdap_cli_connect_recv: 2067
(Thu Dec 14 12:41:06 2017) [sssd[be[sssdad2012r2.com]]] [fo_set_port_status] (0x0100): Marking port 389 of server 'bsod2-bdc.sssdad2012r2.com' as 'working'
(Thu Dec 14 12:41:06 2017) [sssd[be[sssdad2012r2.com]]] [set_server_common_status] (0x0100): Marking server 'bsod2-bdc.sssdad2012r2.com' as 'working'
<---snip---->
(Thu Dec 14 12:41:06 2017) [sssd[be[sssdad2012r2.com]]] [sdap_id_op_connect_step] (0x4000): reusing cached connection
(Thu Dec 14 12:41:06 2017) [sssd[be[sssdad2012r2.com]]] [check_ipv6_addr] (0x0200): Link local IPv6 address fe80::5054:ff:fea0:7bb8
(Thu Dec 14 12:41:06 2017) [sssd[be[sssdad2012r2.com]]] [sdap_id_op_destroy] (0x4000): releasing operation connection
=========

Comment 18 errata-xmlrpc 2018-04-10 17:09:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0929