Bug 1283798

Summary: sssd failover does not work on connecting to non-responsive ldaps:// server
Product: Red Hat Enterprise Linux 8 Reporter: Striker Leggette <striker>
Component: sssdAssignee: Tomas Halman <thalman>
Status: CLOSED ERRATA QA Contact: sssd-qe <sssd-qe>
Severity: high Docs Contact:
Priority: high    
Version: 8.0CC: cww, dbula, dpal, grajaiya, jgalipea, jhrozek, lslebodn, mhonek, mkosek, mniranja, mzidek, pbrezina, raines, sgoveas, thalman, tscherf
Target Milestone: pre-dev-freezeKeywords: Reopened
Target Release: 8.1   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: sssd-2.2.0-5.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-11-05 22:33:53 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1310069, 1682305    
Bug Blocks: 1269194, 1461138, 1679810, 1689138    

Description Striker Leggette 2015-11-19 22:43:45 UTC
Description of problem:
 - SSSD should be able to detect when the LDAP service on a host goes down (process dies or becomes unresponsive)

Version-Release number of selected component (if applicable):


How reproducible:
 - Always

Steps to Reproduce:
1. Connect SSSD to two LDAP URI
2. Stop the LDAP process on the connected server
3. 

Actual results:
 - SSSD fails to failover to backup LDAP URI

Expected results:
 - SSSD fails over to second LDAP server

Additional info:

Comment 3 Jakub Hrozek 2015-11-20 09:34:16 UTC
I corrected the title to make it clear this only happens with ldaps://

The SSSD-side fix is available, but there is still some strange behaviour within OpenLDAP that needs to be investigated.

Comment 4 Jakub Hrozek 2015-11-20 09:49:37 UTC
Upstream ticket:
https://fedorahosted.org/sssd/ticket/2878

Comment 5 Jakub Hrozek 2015-11-20 09:51:12 UTC
btw an easy workaround might be to use StartTLS instead of ldaps://. Please note that SSSD will also automatically try TLS for authentication -- cleartext auth is never permitted with SSSD.

Comment 6 Jakub Hrozek 2015-12-02 09:09:50 UTC
Update: I verified the same happens also when openldap-libs is compiled against OpenSSL, so as a next step, I involved the OpenLDAP maintainer to see if this might be an issue in the openldap-libs TLS manipulation.

Comment 10 Jakub Hrozek 2016-08-10 13:12:44 UTC
I'm afraid I need to cond-nak this bug until the dependency bug in openldap-libs (#1310069) is fixed.

Comment 11 Jakub Hrozek 2016-11-02 11:22:09 UTC
Since this bug depends on bugzilla #1310069 which is still not fixed, I need to move the flags to track rhel-6.10, sorry..

Comment 13 Jakub Hrozek 2017-11-13 21:26:30 UTC
Since this bug depends on a fix for openldap which is not even ready for RHEL-7 and RHEL-6 is already in Production phase 3, I'm closing this bug report as CANTFIX.

Comment 14 Paul Raines 2018-03-02 13:23:28 UTC
I had this happen to me last night.  This is a highly critical bug for sssd to not be able to failover.  I cannot believe it has been 2 years with no fix.  The only workaround for now is to use insecure ldap instead of ldaps?   sssd does not even recover once the primary LDAP server is back online.  One has to manually restart sssd on the client.

Did not affect all clients.  On most I would see the error:

sssd[be[default]]: Could not start TLS encryption. TLS
error -5938:Encountered end of file

but LDAP lookups still worked to the backup server.  Maybe only ones in the middle of a lookup when the server went down got stuck.

I only had CentOS7 clients be affected too.

Comment 15 Jakub Hrozek 2018-03-02 14:58:34 UTC
(In reply to Paul Raines from comment #14)
> I had this happen to me last night.  This is a highly critical bug for sssd
> to not be able to failover.  I cannot believe it has been 2 years with no
> fix.  The only workaround for now is to use insecure ldap instead of ldaps? 

StartTLS does not suffer from this issue. Only ldaps://

> sssd does not even recover once the primary LDAP server is back online.  One
> has to manually restart sssd on the client.
> 
> Did not affect all clients.  On most I would see the error:
> 
> sssd[be[default]]: Could not start TLS encryption. TLS
> error -5938:Encountered end of file
> 
> but LDAP lookups still worked to the backup server.  Maybe only ones in the
> middle of a lookup when the server went down got stuck.
> 
> I only had CentOS7 clients be affected too.

Since this bugzilla was filed against RHEL-6 where we definitely won't be addressing the issue, I would recommend to track the RHEL-7 version of the OpenLDAP bug, which is a dependency of fixing sssd: https://bugzilla.redhat.com/show_bug.cgi?id=1310069

Comment 23 Tomas Halman 2019-06-25 10:53:25 UTC
PR# https://github.com/SSSD/sssd/pull/835

Comment 24 Jakub Hrozek 2019-07-05 10:32:11 UTC
* master: 2d657df
* sssd-1-16: 442cd65

Comment 26 Niranjan Mallapadi Raghavender 2019-08-23 09:36:06 UTC
Versions:
sssd-client-2.2.0-16.el8.x86_64
sssd-dbus-2.2.0-16.el8.x86_64
sssd-krb5-common-2.2.0-16.el8.x86_64
sssd-winbind-idmap-2.2.0-16.el8.x86_64
sssd-tools-2.2.0-16.el8.x86_64
sssd-common-2.2.0-16.el8.x86_64
sssd-common-pac-2.2.0-16.el8.x86_64
python3-sssdconfig-2.2.0-16.el8.noarch
sssd-krb5-2.2.0-16.el8.x86_64
sssd-proxy-2.2.0-16.el8.x86_64
sssd-ad-2.2.0-16.el8.x86_64
sssd-2.2.0-16.el8.x86_64
sssd-nfs-idmap-2.2.0-16.el8.x86_64
sssd-kcm-2.2.0-16.el8.x86_64
sssd-ldap-2.2.0-16.el8.x86_64
sssd-libwbclient-2.2.0-16.el8.x86_64
sssd-ipa-2.2.0-16.el8.x86_64
[root@ci-vm-10-0-145-194 ~]# 

[Server]
Setup 2 directory server (389-ds) on 2 different systems serving
users foo0 to foo10 with TLS configured and configured with 
same suffix dc=example,dc=test

Directory servers were configured on 8.1 
Version: 389-ds-base-1.4.1.3-5.module+el8.1.0+3776+ece1ae4c.x86_64

[client configuration]

Cat /etc/sssd/sssd.conf as below:

[sssd]
config_file_version = 2
services = nss, pam
domains = example1

[domain/example1]
ldap_search_base = dc=example,dc=test
id_provider = ldap
auth_provider = ldap
ldap_user_home_directory = /home/%u
ldap_uri = ldaps://ci-vm-10-0-146-127.hosted.upshift.rdu2.redhat.com, ldaps://ci-vm-10-0-146-167.hosted.upshift.rdu2.redhat.com
ldap_tls_cacert = /etc/openldap/cacerts/cacert.pem
use_fully_qualified_names = True
debug_level = 9


Ran the below test cases:
1. Stop Directory server instance on DS server1(ci-vm-10-0-146-127.hosted.upshift.rdu2.redhat.com)
Verify getent foo0@example1 user can be queried

[root@ci-vm-10-0-145-194 sssd]# getent passwd foo0@example1
foo0@example1:*:14583100:14564100:foo0 User::/bin/bash


<debug logs>
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sdap_handle_release] (0x2000): Trace: sh[0x5572b5a19b80], connected[0], ops[(nil)], ldap[(nil)], destructor_lock[0], release_memory[0]
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [_be_fo_set_port_status] (0x8000): Setting status: PORT_NOT_WORKING. Called from: src/providers/ldap/sdap_async_connection.c: sdap_cli_connect_done: 1591
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [fo_set_port_status] (0x0100): Marking port 636 of server 'ci-vm-10-0-146-127.hosted.upshift.rdu2.redhat.com' as 'not working'
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [fo_set_port_status] (0x0400): Marking port 636 of duplicate server 'ci-vm-10-0-146-127.hosted.upshift.rdu2.redhat.com' as 'not working'
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP' 
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [get_server_status] (0x1000): Status of server 'ci-vm-10-0-146-127.hosted.upshift.rdu2.redhat.com' is 'working'
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [get_port_status] (0x1000): Port status of port 636 for server 'ci-vm-10-0-146-127.hosted.upshift.rdu2.redhat.com' is 'not working'
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [get_port_status] (0x0080): SSSD is unable to complete the full connection request, this internal status does not necessarily indicate network port issues.
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [get_server_status] (0x1000): Status of server 'ci-vm-10-0-146-167.hosted.upshift.rdu2.redhat.com' is 'name not resolved'
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [get_port_status] (0x1000): Port status of port 636 for server 'ci-vm-10-0-146-167.hosted.upshift.rdu2.redhat.com' is 'neutral'
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6 seconds
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [get_server_status] (0x1000): Status of server 'ci-vm-10-0-146-167.hosted.upshift.rdu2.redhat.com' is 'name not resolved'
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [resolv_is_address] (0x4000): [ci-vm-10-0-146-167.hosted.upshift.rdu2.redhat.com] does not look like an IP address
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [resolv_gethostbyname_step] (0x2000): Querying files
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [resolv_gethostbyname_files_send] (0x0100): Trying to resolve A record of 'ci-vm-10-0-146-167.hosted.upshift.rdu2.redhat.com' in files
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [set_server_common_status] (0x0100): Marking server 'ci-vm-10-0-146-167.hosted.upshift.rdu2.redhat.com' as 'resolving name'
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [resolv_gethostbyname_step] (0x2000): Querying files
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [resolv_gethostbyname_files_send] (0x0100): Trying to resolve AAAA record of 'ci-vm-10-0-146-167.hosted.upshift.rdu2.redhat.com' in files
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [resolv_gethostbyname_next] (0x0200): No more address families to retry
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [resolv_gethostbyname_step] (0x2000): Querying DNS
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [resolv_gethostbyname_dns_query] (0x0100): Trying to resolve A record of 'ci-vm-10-0-146-167.hosted.upshift.rdu2.redhat.com' in DNS
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [schedule_request_timeout] (0x2000): Scheduling a timeout of 3 seconds
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [schedule_timeout_watcher] (0x2000): Scheduling DNS timeout watcher
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [unschedule_timeout_watcher] (0x4000): Unscheduling DNS timeout watcher
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [resolv_gethostbyname_dns_parse] (0x1000): Parsing an A reply
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [request_watch_destructor] (0x0400): Deleting request watch
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [set_server_common_status] (0x0100): Marking server 'ci-vm-10-0-146-167.hosted.upshift.rdu2.redhat.com' as 'name resolved'
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [be_resolve_server_process] (0x0200): Found address for server ci-vm-10-0-146-167.hosted.upshift.rdu2.redhat.com: [10.0.146.167] TTL 25948
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sdap_uri_callback] (0x0400): Constructed uri 'ldaps://ci-vm-10-0-146-167.hosted.upshift.rdu2.redhat.com'
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sssd_async_socket_init_send] (0x4000): Using file descriptor [22] for the connection.
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sssd_async_socket_init_send] (0x0400): Setting 6 seconds timeout for connecting
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sdap_ldap_connect_callback_add] (0x1000): New LDAP connection to [ldaps://ci-vm-10-0-146-167.hosted.upshift.rdu2.redhat.com:636/??base] with fd [22].
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sdap_get_rootdse_send] (0x4000): Getting rootdse
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sdap_print_server] (0x2000): Searching 10.0.146.167:636
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sdap_get_generic_ext_step] (0x0400): calling ldap_search_ext with [(objectclass=*)][].
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [*]
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [altServer]
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [namingContexts]
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [supportedControl]
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [supportedExtension]
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [supportedFeatures]
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [supportedLDAPVersion]
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [supportedSASLMechanisms]
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [domainControllerFunctionality]
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [defaultNamingContext]
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [lastUSN]
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sdap_get_generic_ext_step] (0x1000): Requesting attrs: [highestCommittedUSN]
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sdap_get_generic_ext_step] (0x2000): ldap_search_ext called, msgid = 1
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sdap_op_add] (0x2000): New operation 1 timeout 6
(Fri Aug 23 05:28:06 2019) [sssd[be[example1]]] [sdap_process_result] (0x2000): Trace: sh[0x5572b5abda90], connected[1
</debug logs>



2. Verify ssh logins are working
Login as another user (user which is not in cache yet):

[root@ci-vm-10-0-145-194 sssd]# ssh foo4@example1@localhost
foo4@example1@localhost's password: 
Activate the web console with: systemctl enable --now cockpit.socket

Could not chdir to home directory : No such file or directory
[foo4@example1@ci-vm-10-0-145-194 /]$ 

Login was successfull


3. Start the first Directory server and stop the second directory server
and verify login are successfull (Stop ci-vm-10-0-146-167.hosted.upshift.rdu2.redhat.com). 

login with another user who is not in cache. (foo5@example1)

[root@ci-vm-10-0-145-194 sssd]# ssh foo5@example1@localhost
foo5@example1@localhost's password: 
Activate the web console with: systemctl enable --now cockpit.socket

Last login: Fri Aug 23 04:41:09 2019 from ::1
Could not chdir to home directory : No such file or directory
[foo5@example1@ci-vm-10-0-145-194 /]$ pwd
/
[foo5@example1@ci-vm-10-0-145-194 /]$ id
uid=14583105(foo5@example1) gid=14564100(ldapusers@example1) groups=14564100(ldapusers@example1) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
[foo5@example1@ci-vm-10-0-145-194 /]$ 

With above results. Marking bug verified.

Comment 28 errata-xmlrpc 2019-11-05 22:33:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:3651