Bug 1819012

Summary: [RFE] Improve AD site discovery process
Product: Red Hat Enterprise Linux 8
Reporter: toasty <wrydberg>
Component: sssd
Assignee: Pavel Březina <pbrezina>
Status: CLOSED ERRATA
QA Contact: Dan Lavu <dlavu>
Severity: unspecified
Docs Contact: lmcgarry
Priority: unspecified
Version: 8.2
CC: apeddire, atikhono, dave, dlavu, grajaiya, jentrena, jhrozek, lslebodn, mzidek, pbrezina, thalman, tscherf
Target Milestone: rc
Keywords: FutureFeature, Triaged
Target Release: ---
Flags: pm-rhel: mirror+
Hardware: Unspecified   
OS: Linux   
Whiteboard: sync-to-jira qetodo tested
Fixed In Version: sssd-2.4.0-1.el8
Doc Type: Enhancement
Doc Text:
.Improved Active Directory site discovery process
The SSSD service now discovers Active Directory sites in parallel over connection-less LDAP (CLDAP) to multiple domain controllers to speed up site discovery in situations where some domain controllers are unreachable. Previously, site discovery was performed sequentially and, in situations where domain controllers were unreachable, a timeout eventually occurred and SSSD went offline.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-05-18 15:03:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1881992    
Bug Blocks:    

Description toasty 2020-03-30 23:40:31 UTC
Description of problem:
-----------------------
(Much of the following text has been copied directly from https://pagure.io/SSSD/sssd/issue/2702, which this BZ is being raised specifically to be linked to)

Currently, AD site discovery as it is implemented in SSSD follows this document:
https://docs.pagure.org/SSSD.sssd/design_pages/active_directory_dns_sites.html 

The problem with this process is that it assumes all domain controllers are accessible from the client. If they are not (for example, because some are in a different geographical region behind a VPN), we are in danger of hitting a timeout, even after determining the site, since we cannot guarantee that all DCs registered for a particular site are responsive.

The solution (proposed in 2016 by Ondrej Valousek in https://pagure.io/SSSD/sssd/issue/2702) is:

1. Run the _ldap._tcp.domain.name SRV query to obtain the list of all DCs for the domain (no change)
2. Send an LDAP ping to ALL DCs in parallel (instead of sequentially, as currently happens)
3. Attempt to extract the client site from whichever DC response is received first
4. Proceed with service discovery, only communicating with DCs that responded to the LDAP ping
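
Steps 2 and 3 above can be sketched as follows. This is only an illustration of the "probe everything at once, take the first answer" idea and not the SSSD implementation: the `ldap_ping` function, the `SIMULATED_DCS` table, the host names and the site name are all hypothetical stand-ins, and the network round trip is simulated with a sleep instead of a real CLDAP (UDP) search.

```python
import asyncio
from typing import Optional, Sequence, Tuple

# Hypothetical test data: DC name -> (reply delay in seconds, reported site).
# A delay of None simulates a DC that never answers (e.g. behind a firewall).
SIMULATED_DCS = {
    "dc1.example.com": (None, None),
    "dc2.example.com": (0.05, "Site-A"),
    "dc3.example.com": (0.01, "Site-A"),
}


async def ldap_ping(dc: str) -> Tuple[str, str]:
    """Stand-in for a CLDAP ping; a real client would send a UDP LDAP search."""
    delay, site = SIMULATED_DCS[dc]
    if delay is None:
        await asyncio.Event().wait()  # never set: this DC stays silent
    await asyncio.sleep(delay)
    return dc, site


async def discover_site(dcs: Sequence[str],
                        timeout: float = 1.0) -> Optional[Tuple[str, str]]:
    """Ping all DCs in parallel and return (dc, site) from the first reply."""
    tasks = [asyncio.create_task(ldap_ping(dc)) for dc in dcs]
    done, pending = await asyncio.wait(
        tasks, timeout=timeout, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop waiting on slow or unreachable DCs
    if not done:
        return None  # nobody answered within the timeout
    return next(iter(done)).result()


if __name__ == "__main__":
    print(asyncio.run(discover_site(list(SIMULATED_DCS))))
```

With this approach a single unreachable DC costs nothing as long as at least one other DC answers; the overall timeout only applies when every DC is silent.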

This would cover the scenario where only a couple of DCs are accessible to the client, and would also be more in line with the AD locator process as described here: https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-2000-server/cc978011(v=technet.10) :

Quote:

4. The Net Logon service sends a datagram to the discovered domain controllers ("pings" the computers) that register the name. For NetBIOS domain names, the datagram is implemented as a mailslot message. For DNS domain names, the datagram is implemented as an LDAP UDP search.
5. Each available domain controller responds to the datagram to indicate that it is currently operational and then returns the information to DsGetDcName.
6. The Net Logon service returns the information to the client from the domain controller that responds first.

Also, as Ondrej observes in https://pagure.io/SSSD/sssd/issue/2702#comment-223568, the site discovery process is run periodically, which is unnecessary since the site is almost guaranteed not to have changed if the client IP address is unchanged. Site discovery should only be run during startup or after a network reconfiguration event.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
1.16.4-21.el7_7.1

How reproducible:
------------------
Every time

Steps to Reproduce:
--------------------
1. Join domain and configure sssd-ad (without using ad_site parameter)
2. Enable debug logging of the domain (0x7770) 
3. Monitor for a period of 24 hours or so

Actual results:
--------------
At startup and on a periodic schedule (every 12 hours) the AD backend will:
1. Query DNS for a list of all DCs in the domain
2. Send an 'LDAP ping'/netlogon query to each DC in turn to determine the client site, until one sensible response is received. (If the currently selected DC is unavailable or behind a firewall, the process is subject to certain pre-defined timeouts, which are generally on the order of whole seconds.)
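
To put rough numbers on the cost of the sequential probing described above, here is a small worked example. The 6000 ms per-DC timeout and the 20 ms reply time are illustrative assumptions, not values taken from SSSD; the function names are hypothetical.

```python
from typing import Optional, Sequence


def sequential_latency_ms(rtts: Sequence[Optional[int]],
                          timeout_ms: int) -> Optional[int]:
    """Time until the first answer when DCs are probed one at a time.

    Each entry in rtts is a DC's reply time in ms, or None if it never replies.
    """
    elapsed = 0
    for rtt in rtts:
        if rtt is not None and rtt <= timeout_ms:
            return elapsed + rtt
        elapsed += timeout_ms  # wait out the full timeout, then move on
    return None


def parallel_latency_ms(rtts: Sequence[Optional[int]],
                        timeout_ms: int) -> Optional[int]:
    """Time until the first answer when all DCs are probed at once."""
    answers = [r for r in rtts if r is not None and r <= timeout_ms]
    return min(answers) if answers else None


if __name__ == "__main__":
    # Two unreachable DCs listed ahead of one that replies in 20 ms.
    rtts = [None, None, 20]
    print(sequential_latency_ms(rtts, 6000))  # 12020
    print(parallel_latency_ms(rtts, 6000))    # 20
```

Under these assumptions the sequential walk pays one full timeout for every unreachable DC that happens to be listed first, while the parallel probe is bounded by the fastest responder.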

Expected results:
-------------------
At startup and after a network reconfiguration event the AD backend should:
1. Query DNS for a list of all DCs in the domain
2. Send an LDAP ping to all DCs simultaneously
3. Attempt to extract the client site from whichever DC response is received first
4. Proceed with service discovery, only communicating with DCs that responded to the LDAP ping

Additional info:
---------------
The only known workaround for this behaviour is to disable site discovery altogether by manually specifying the site using the ad_site parameter, but this is inconvenient when one is attempting to engineer an automated service capable of deploying into multiple sites, regions, networks and AD domains.
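
For reference, the workaround amounts to something like the following sssd.conf fragment. The domain name and site name are placeholders chosen for illustration, and a real configuration will contain other options as well.

```ini
[domain/ad.example.com]
id_provider = ad
ad_domain = ad.example.com
# Placeholder site name: pin the client to a known AD site instead of
# relying on automatic site discovery.
ad_site = Default-First-Site-Name
```

Because the site has to be hard-coded per deployment, this fragment would need to be templated separately for every site the service is rolled out to.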

Comment 2 Pavel Březina 2020-03-31 09:36:56 UTC
Upstream ticket:
https://pagure.io/SSSD/sssd/issue/2702

Comment 3 dave 2020-05-04 15:12:00 UTC
Upstream ticket migrated to https://github.com/SSSD/sssd/issues/3743

Comment 6 Pavel Březina 2020-08-26 11:06:52 UTC
Upstream PR:
https://github.com/SSSD/sssd/pull/5300

Dave, would you be willing to test a scratch build?

Comment 7 dave 2020-08-27 15:37:40 UTC
Sure Pavel, but it might be a few weeks before I can turn that round because someone recently decided our lab environment needs to be isolated from the rest of our corporate network, so we're still in the process of arguing about what that isolation will look like!

In the meantime, would you mind sending over any details I'll need to get the scratch build running? (Do I just follow https://sssd.io/docs/developers/contribute.html#building-sssd ? Do I need to run on el8? Anything else?)

Thanks

Comment 8 Pavel Březina 2020-09-03 12:47:11 UTC
Which el8 version do you run? I will give you a scratch build.

Or if you prefer to build it on your own you can get the source here and then follow the link you found:
https://github.com/pbrezina/sssd/tree/adsite-cldap-parallel

Comment 10 Pavel Březina 2020-10-02 10:19:40 UTC
Pushed PR: https://github.com/SSSD/sssd/pull/5300

* `master`
    * f0d650799d4390f90890d17c56a4e395e931d8cb - tevent: correctly handle req timeout error
    * 9fdf5cfacd1a425691d44db53897096887bb3e6f - ad: renew site information only when SSSD was previously offline
    * a62a13ae61d4e08b21e706df6ca266c38891f430 - man: fix typo in failover description
    * fcfd834c9d80d7690f938582335d81231a5f6e60 - ad: if all in-site dc are unreachable try off-site controllers
    * 1889ca60a9c642f0cca60b20a5b94de7a66924f6 - ad: connect to the first available server for cldap ping
    * 8265674a055e5cdb57acebad72d935356408540a - ad: use cldap for site and forrest discover (perform CLDAP ping)
    * 414593cca65ed09fe4659e2786370a4553664cd0 - ldap: add support for cldap and udp connections

Comment 11 Pavel Březina 2020-10-06 09:55:20 UTC
Pushed PR: https://github.com/SSSD/sssd/pull/5345

* `master`
    * 37ba37a425453d8222584176ae5975a795422091 - ad: fix handling of current site and forest in cldap ping

Comment 22 errata-xmlrpc 2021-05-18 15:03:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (sssd bug fix and enhancement update), and
where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1666