Bug 1851112

Summary: LDAP bind can fail due to unconfigurable DNS server timeouts that inhibit SSSD failover [rhel-7.9.z]
Product: Red Hat Enterprise Linux 7 Reporter: David Ward <david.ward>
Component: sssdAssignee: SSSD Maintainers <sssd-maint>
Status: CLOSED ERRATA QA Contact: Dan Lavu <dlavu>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.9CC: atikhono, dlavu, g63it, grajaiya, jhrozek, jreznik, lslebodn, mzidek, pbrezina, sgoveas, thalman, tscherf
Target Milestone: rcKeywords: Triaged, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: sync-to-jira qetodo
Fixed In Version: sssd-1.16.5-10.el7_9.6 Doc Type: If docs needed, set a value
Doc Text:
If this bug requires documentation, please select an appropriate Doc Type value.sssd-1.16.5-10.el7_9.5.x86_64
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-12-15 11:22:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description David Ward 2020-06-25 15:54:24 UTC
See the linked Bugzilla issues for RHEL 8. The same issue affects RHEL 7.


The fix for RHEL 8 (applied upstream) partly involved making two previously hard-coded DNS timeouts configurable:
 - 'dns_resolver_server_timeout' (response to a query by a single DNS server)
 - 'dns_resolver_op_timeout' (resolution of a query by any DNS server)

These same options need to be exposed in RHEL 7, so that the administrator can adjust them as necessary, in a way that ensures the ordering in [1] is met. Otherwise SSSD failover will be inhibited at certain levels.


The following upstream commits from the sssd-1-16 branch [2], which are very simple, can be added directly to the RHEL 7 sssd package to resolve this:
 - c317452 (config: add dns_resolver_op_timeout to option list)
 - f768f73 (man: fix description of dns_resolver_op_timeout)
 - 9b1f6f8 (man: fix description of dns_resolver_timeout)
 - 2c7c694 (failover: add dns_resolver_server_timeout option)

Please note that this does *not* change any default timeout values; this is backwards compatible. This simply makes it possible for the administrator to set these timeout values when it is necessary, especially in an AD environment.


[1] https://github.com/SSSD/sssd/blob/2c7c69485883ab0f408cf2f8cdcc4617462d68ec/src/man/include/failover.xml#L123-L129
[2] https://github.com/SSSD/sssd/pull/5207

Comment 2 Alexey Tikhonov 2020-06-25 16:57:02 UTC
* `sssd-1-16`
  * 2c7c69485883ab0f408cf2f8cdcc4617462d68ec - failover: add dns_resolver_server_timeout option
  * 9b1f6f88b1245bc8737a4b2616299a8427d822e0 - man: fix description of dns_resolver_timeout
  * f768f73903bf0c3d0d1b4361ebff6b111a918470 - man: fix description of dns_resolver_op_timeout



(In reply to David Ward from comment #0)
> 
> The following upstream commits from the sssd-1-16 branch [2], which are very
> simple, can be added directly to the RHEL 7 sssd package to resolve this:
>  - c317452 (config: add dns_resolver_op_timeout to option list)

Can't find this ^^ hash.
Did you mean https://github.com/SSSD/sssd/commit/7cb6a9f70dd65c88e0f9be97c79702566ab37afb ? This one was included with bz 1719718

>  - f768f73 (man: fix description of dns_resolver_op_timeout)
>  - 9b1f6f8 (man: fix description of dns_resolver_timeout)
>  - 2c7c694 (failover: add dns_resolver_server_timeout option)

Comment 3 David Ward 2020-06-25 17:08:46 UTC
(In reply to Alexey Tikhonov from comment #2)
> >  - c317452 (config: add dns_resolver_op_timeout to option list)
> 
> Can't find this ^^ hash.
> Did you mean
> https://github.com/SSSD/sssd/commit/7cb6a9f70dd65c88e0f9be97c79702566ab37afb
> ? This one was included with bz 1719718

Yes, my apologies. Please disregard that commit since it's covered in RHBZ#1719718.

Comment 6 Alexey Tikhonov 2020-07-02 15:21:29 UTC
Hi David,

could you please open a customer case to back up this request?

This would increase approval chances.

Comment 17 Dan Lavu 2020-12-07 17:02:45 UTC
Verified against sssd-ad-1.16.5-10.el7_9.6.x86_64

[root@rhel79 x86_64]# cat /etc/sssd/sssd.conf 

[sssd]
domains = example.com
config_file_version = 2
services = nss, pam

[domain/example.com]
debug_level = 9
ad_domain = example.com
krb5_realm = EXAMPLE.COM
realmd_tags = manages-system joined-with-samba 
cache_credentials = True
id_provider = ad
krb5_store_password_if_offline = True
default_shell = /bin/bash
ldap_id_mapping = True
use_fully_qualified_names = True
fallback_homedir = /home/%u@%d
access_provider = ad
dns_resolver_server_timeout = 300
dns_resolver_op_timeout = 5


The sssd-ad-1.16.5-10.el7_9.5.x86_64 complained about the new parameters
(2020-12-07  9:03:50:673718): [sssd] [sss_ini_call_validators] (0x0020): [rule/allowed_domain_options]: Attribute 'dns_resolver_server_timeout' is not allowed in section 'domain/example.com'. Check for typos.

ad1 is offline, logs containing dns lookups. 


(2020-12-07 10:14:55): [be[example.com]] [sdap_connect_host_resolv_done] (0x0400): Connecting to ldap://ad2.example.com:389
(2020-12-07 10:14:55): [be[example.com]] [sssd_async_socket_init_send] (0x4000): Using file descriptor [25] for the connection.
(2020-12-07 10:14:55): [be[example.com]] [sssd_async_socket_init_send] (0x0400): Setting 6 seconds timeout for connecting
(2020-12-07 10:14:55): [be[example.com]] [sdap_ldap_connect_callback_add] (0x1000): New LDAP connection to [ldap://ad2.example.com:389/??base] with fd [25].
(2020-12-07 10:14:55): [be[example.com]] [sdap_connect_host_done] (0x0400): Successful connection to ldap://ad2.example.com:389
(2020-12-07 10:14:55): [be[example.com]] [sdap_print_server] (0x2000): Searching 192.168.75.5:389
(2020-12-07 10:14:55): [be[example.com]] [sdap_get_generic_ext_step] (0x0400): calling ldap_search_ext with [(&(DnsDomain=example.com)(NtVer=\14\00\00\00))][].
(2020-12-07 10:14:55): [be[example.com]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [netlogon]
(2020-12-07 10:14:55): [be[example.com]] [sdap_get_generic_ext_step] (0x2000): ldap_search_ext called, msgid = 1
(2020-12-07 10:14:55): [be[example.com]] [sdap_op_add] (0x2000): New operation 1 timeout 6
(2020-12-07 10:14:55): [be[example.com]] [sdap_process_result] (0x2000): Trace: sh[0x55f2720bc9c0], connected[1], ops[0x55f2720b8f90], ldap[0x55f2720c02e0]
(2020-12-07 10:14:55): [be[example.com]] [sdap_process_message] (0x4000): Message type: [LDAP_RES_SEARCH_ENTRY]
(2020-12-07 10:14:55): [be[example.com]] [sdap_parse_entry] (0x1000): OriginalDN: [].
(2020-12-07 10:14:55): [be[example.com]] [sdap_parse_range] (0x2000): No sub-attributes for [netlogon]
(2020-12-07 10:14:55): [be[example.com]] [sdap_process_result] (0x2000): Trace: sh[0x55f2720bc9c0], connected[1], ops[0x55f2720b8f90], ldap[0x55f2720c02e0]
(2020-12-07 10:14:55): [be[example.com]] [sdap_process_message] (0x4000): Message type: [LDAP_RES_SEARCH_RESULT]
(2020-12-07 10:14:55): [be[example.com]] [sdap_get_generic_op_finished] (0x0400): Search result: Success(0), no errmsg set
(2020-12-07 10:14:55): [be[example.com]] [sdap_op_destructor] (0x2000): Operation 1 finished
(2020-12-07 10:14:55): [be[example.com]] [sdap_handle_release] (0x2000): Trace: sh[0x55f2720bc9c0], connected[1], ops[(nil)], ldap[0x55f2720c02e0], destructor_lock[0], release_memory[0]
(2020-12-07 10:14:55): [be[example.com]] [remove_connection_callback] (0x4000): Successfully removed connection callback.
(2020-12-07 10:14:55): [be[example.com]] [ad_get_client_site_done] (0x0400): Found site: Default-First-Site-Name
(2020-12-07 10:14:55): [be[example.com]] [ad_get_client_site_done] (0x0400): Found forest: example.com
(2020-12-07 10:14:55): [be[example.com]] [ad_srv_plugin_site_done] (0x0400): About to discover primary and backup servers
(2020-12-07 10:14:55): [be[example.com]] [fo_discover_servers_send] (0x0400): Looking up primary servers
(2020-12-07 10:14:55): [be[example.com]] [resolv_discover_srv_next_domain] (0x0400): SRV resolution of service 'ldap'. Will use DNS discovery domain 'Default-First-Site-Name._sites.example.com'
(2020-12-07 10:14:55): [be[example.com]] [resolv_getsrv_send] (0x0100): Trying to resolve SRV record of '_ldap._tcp.Default-First-Site-Name._sites.example.com'
(2020-12-07 10:14:55): [be[example.com]] [schedule_request_timeout] (0x2000): Scheduling a timeout of 5 seconds
(2020-12-07 10:14:55): [be[example.com]] [schedule_timeout_watcher] (0x2000): Scheduling DNS timeout watcher
(2020-12-07 10:14:56): [be[example.com]] [check_fd_timeouts] (0x4000): Checking for DNS timeouts
(2020-12-07 10:14:56): [be[example.com]] [unschedule_timeout_watcher] (0x4000): Unscheduling DNS timeout watcher
(2020-12-07 10:14:56): [be[example.com]] [resolv_getsrv_done] (0x1000): Using TTL [600]
(2020-12-07 10:14:56): [be[example.com]] [request_watch_destructor] (0x0400): Deleting request watch
(2020-12-07 10:14:56): [be[example.com]] [fo_discover_srv_done] (0x0400): Got answer. Processing...
(2020-12-07 10:14:56): [be[example.com]] [fo_discover_srv_done] (0x0400): Got 3 servers
(2020-12-07 10:14:56): [be[example.com]] [fo_discover_servers_primary_done] (0x0400): Looking up backup servers
(2020-12-07 10:14:56): [be[example.com]] [resolv_discover_srv_next_domain] (0x0400): SRV resolution of service 'ldap'. Will use DNS discovery domain 'example.com'
(2020-12-07 10:14:56): [be[example.com]] [resolv_getsrv_send] (0x0100): Trying to resolve SRV record of '_ldap._tcp.example.com'
(2020-12-07 10:14:56): [be[example.com]] [schedule_request_timeout] (0x2000): Scheduling a timeout of 5 seconds
(2020-12-07 10:14:56): [be[example.com]] [schedule_timeout_watcher] (0x2000): Scheduling DNS timeout watcher
(2020-12-07 10:14:57): [be[example.com]] [fo_resolve_service_timeout] (0x0080): Service resolving timeout reached
(2020-12-07 10:14:57): [be[example.com]] [request_watch_destructor] (0x0400): Deleting request watch
(2020-12-07 10:14:57): [be[example.com]] [be_resolve_server_done] (0x1000): Server resolution failed: [14]: Bad address
(2020-12-07 10:14:57): [be[example.com]] [sdap_id_op_connect_done] (0x0020): Failed to connect, going offline (5 [Input/output error])
(2020-12-07 10:14:57): [be[example.com]] [be_mark_offline] (0x2000): Going offline!
(2020-12-07 10:14:57): [be[example.com]] [be_mark_offline] (0x2000): Initialize check_if_online_ptask.
(2020-12-07 10:14:57): [be[example.com]] [be_ptask_create] (0x0400): Periodic task [Check if online (periodic)] was created
(2020-12-07 10:14:57): [be[example.com]] [be_ptask_schedule] (0x0400): Task [Check if online (periodic)]: scheduling task 69 seconds from now [1607354166]
(2020-12-07 10:14:57): [be[example.com]] [be_run_offline_cb] (0x0080): Going offline. Running callbacks.
(2020-12-07 10:14:57): [be[example.com]] [sdap_id_op_connect_done] (0x4000): notify offline to op #1
(2020-12-07 10:14:57): [be[example.com]] [ad_subdomains_refresh_connect_done] (0x0020): Unable to connect to LDAP [11]: Resource temporarily unavailable
(2020-12-07 10:14:57): [be[example.com]] [ad_subdomains_refresh_connect_done] (0x0080): No AD server is available, cannot get the subdomain list while offline
(2020-12-07 10:14:57): [be[example.com]] [be_ptask_done] (0x0040): Task [Subdomains Refresh]: failed with [1432158212]: SSSD is offline
(2020-12-07 10:14:57): [be[example.com]] [be_ptask_schedule] (0x0400): Task [Subdomains Refresh]: scheduling task 14400 seconds from now [1607368497]
(2020-12-07 10:14:57): [be[example.com]] [sdap_id_op_connect_done] (0x4000): notify offline to op #2
(2020-12-07 10:14:57): [be[example.com]] [ad_subdomains_refresh_connect_done] (0x0020): Unable to connect to LDAP [11]: Resource temporarily unavailable
(2020-12-07 10:14:57): [be[example.com]] [ad_subdomains_refresh_connect_done] (0x0080): No AD server is available, cannot get the subdomain list while offline
(2020-12-07 10:14:57): [be[example.com]] [dp_req_done] (0x0400): DP Request [Subdomains #0]: Request handler finished [0]: Success
(2020-12-07 10:14:57): [be[example.com]] [_dp_req_recv] (0x0400): DP Request [Subdomains #0]: Receiving request data.
(2020-12-07 10:14:57): [be[example.com]] [dp_req_reply_list_success] (0x0400): DP Request [Subdomains #0]: Finished. Success.
(2020-12-07 10:14:57): [be[example.com]] [dp_req_reply_std] (0x1000): DP Request [Subdomains #0]: Returning [Provider is Offline]: 1,1432158212,Offline
(2020-12-07 10:14:57): [be[example.com]] [dp_req_reply_std] (0x1000): DP Request [Subdomains #0]: Returning [Provider is Offline]: 1,1432158212,Offline
(2020-12-07 10:14:57): [be[example.com]] [dp_table_value_destructor] (0x0400): Removing [8:7:0000:<ALL>] from reply table
(2020-12-07 10:14:57): [be[example.com]] [dp_req_destructor] (0x0400): DP Request [Subdomains #0]: Request removed.
(2020-12-07 10:14:57): [be[example.com]] [dp_req_destructor] (0x0400): Number of active DP request: 0
(2020-12-07 10:14:57): [be[example.com]] [sdap_id_release_conn_data] (0x4000): releasing unused connection

Comment 22 errata-xmlrpc 2020-12-15 11:22:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (sssd bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5459