Bug 1645461
| Summary: | Slow ldb search causes blocking during startup which might cause the registration to time out | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | afox <afox> |
| Component: | sssd | Assignee: | SSSD Maintainers <sssd-maint> |
| Status: | CLOSED ERRATA | QA Contact: | Amith <apeetham> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.4 | CC: | aborah, afox, apeetham, atikhono, grajaiya, jhrozek, lslebodn, mzidek, pbrezina, rvdwees, sbose, sgoveas, tmihinto, tscherf, yann.lopez |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | sssd-1.16.4-8.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-08-06 13:02:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (afox@redhat.com, 2018-11-02 09:46:55 UTC)
The debug logs say:

```
(Wed Oct 31 13:30:07 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [ad_get_dc_servers_send] (0x0400): Looking up domain controllers in domain cmpd1.metoffice.gov.uk and site Exeter
(Wed Oct 31 13:30:07 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [resolv_discover_srv_next_domain] (0x0400): SRV resolution of service 'ldap'. Will use DNS discovery domain 'Exeter._sites.cmpd1.metoffice.gov.uk'
(Wed Oct 31 13:30:07 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [resolv_getsrv_send] (0x0100): Trying to resolve SRV record of '_ldap._tcp.Exeter._sites.cmpd1.metoffice.gov.uk'
(Wed Oct 31 13:30:07 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [be_ptask_done] (0x0400): Task [Check if online (periodic)]: finished successfully
(Wed Oct 31 13:30:07 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [be_ptask_schedule] (0x0400): Task [Check if online (periodic)]: scheduling task 146 seconds from last execution time [1540992753]
(Wed Oct 31 13:30:07 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [schedule_request_timeout] (0x2000): Scheduling a timeout of 6 seconds
(Wed Oct 31 13:30:07 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [schedule_timeout_watcher] (0x2000): Scheduling DNS timeout watcher
(Wed Oct 31 13:30:07 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [unschedule_timeout_watcher] (0x4000): Unscheduling DNS timeout watcher
(Wed Oct 31 13:30:07 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [request_watch_destructor] (0x0400): Deleting request watch
(Wed Oct 31 13:30:07 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [resolv_discover_srv_done] (0x0040): SRV query failed [11]: Could not contact DNS servers
```

And the syslog says:

```
2018-10-31T13:27:40.900803+00:00 eld115 sssd[be[cmpd1.metoffice.gov.uk]]: Backend is offline
```

but only afterwards do I see:

```
2018-10-31T13:27:41.013067+00:00 eld115 network: Bringing up loopback interface: [ OK ]
2018-10-31T13:27:41.046843+00:00 eld115 network: Bringing up interface enp0s25: [ OK ]
2018-10-31T13:27:41.068487+00:00 eld115 systemd: Started LSB: Bring up/down networking.
2018-10-31T13:27:41.068692+00:00 eld115 systemd: Reached target Network.
```

SSSD should be listening to messages from netlink and retry the network connection once the network is up. As an immediate workaround, you can make the sssd service start after the network is up, e.g.:

```
sh# cat /etc/systemd/system/sssd.service.d/network.conf
[Unit]
Before=systemd-user-sessions.service nss-user-lookup.target network-online.target
Wants=nss-user-lookup.target network-online.target

sh# systemctl daemon-reload
sh# systemctl cat sssd
```

But it would also be good to see why sssd did not detect the change in networking. For that, can we see debug logs with debug_level=10 in the [sssd] section, please?

Can we get the logs with debug_microseconds=true in both the monitor and the backend? I wonder if there is some race condition; perhaps the backend writes to the socket before the monitor sets up the tevent integration on the connection.

So, what is happening is the following. The sssd_be initialization codepath is:

`be_process_init()` -> `dp_init()` -> `dp_init_targets()` -> `dp_load_targets()` -> `dp_target_init()` -> `dp_load_module()` -> `dp_module_create()` -> `dp_module_run_constructor()` ... -> `dp_target_run_constructor()`

In the log:

> (Mon Nov 12 11:02:30 2018) [sssd[be[...]]] [dp_target_init] (0x0400): Initializing target [id] with module [ad]
> (Mon Nov 12 11:02:31 2018) [sssd[be[...]]] [dp_module_run_constructor] (0x0400): Executing module [ad] constructor.
> ...
> (Mon Nov 12 11:02:40 2018) [sssd[be[...]]] [dp_target_run_constructor] (0x0400): Executing target [id] constructor

So it *seems* like `dp_module_run_constructor()` [i.e. calling `sssm_%s_init()` from the module] for the `ad` module hangs and is probably reset by the watchdog (the timings correlate, but I can't be sure without strace logs). So initialization is completed

> (Mon Nov 12 11:02:40 2018) [sssd[be[...]]] [main] (0x0400): Backend provider (...) started!

and the tevent main loop is entered only at 11:02:40. Before that, the internal dbus "OK ..." response to the "AUTH EXTERNAL" request sent from the Monitor sbus server is not even read, the dbus handshake is not completed, and "Sending ID: " (despite how it looks in the log) is *not* actually sent.

Now, the Monitor allows exactly 10 seconds for a service to start:

> /* 10 seconds should be plenty */
> tv = tevent_timeval_current_ofs(10, 0);
> mini->timeout = tevent_add_timer(ctx->ev, mini, tv, init_timeout, mini);

and the service was started at 11:02:29:

> (Mon Nov 12 11:02:29 2018) [sssd] [start_service] (0x0100): Queueing service ... for startup

So I guess the Monitor handles the timeout and closes the connection at roughly the same moment the provider is almost ready to proceed. This is somewhat similar to BZ 1666819, but a different flavor.

Btw, I do not understand why the Monitor allows every service (including providers) 10 seconds to start, but starts the responders earlier:

> (Mon Nov 12 11:02:34 2018) [sssd] [services_startup_timeout] (0x0020): Providers did not start in time, forcing services startup!

So:

(1) What is the `ad` module doing at construction (if I am correct in spotting the "hang point")? Does the `ad` module have anything to do with `winbindd` (as in BZ 1666819)?
(2) How is this problem related to the sssd upgrade?
(3) Would it make sense to synchronize the two Monitor timeouts (to prevent the case when a responder connects to a starting provider)?

(I am clearing "needinfo" because, from my point of view, logs with "debug_microseconds" will not add anything useful.)

Let's dig deeper.

(In reply to Alexey Tikhonov from comment #25)
> So it *seems* like `dp_module_run_constructor()` [i.e. calling
> `sssm_%s_init()` from module] for `ad` module hangs
>
> (1) What is `ad` module doing at construction?
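As an aside, the race created by the fixed 10-second Monitor deadline quoted above can be sketched with a small simulation. This is hypothetical Python, not SSSD code; only the 10-second constant comes from the quoted init_timeout snippet, and the other timings are illustrative:

```python
# Minimal sketch of the Monitor/provider startup race (hypothetical, not SSSD
# code). The Monitor gives a service a fixed window to complete its sbus
# handshake; any synchronous work in the provider's constructors eats into it.

MONITOR_DEADLINE = 10.0  # seconds; from the quoted init_timeout code


def registration_outcome(constructor_seconds, handshake_seconds=0.5):
    """Return whether the provider registers before the Monitor's deadline.

    The provider can only service the dbus handshake once its (blocking)
    constructors return and the tevent main loop is entered, so a slow
    constructor delays registration by exactly its own runtime.
    """
    ready_at = constructor_seconds + handshake_seconds
    return "registered" if ready_at <= MONITOR_DEADLINE else "timed out"


print(registration_outcome(1.0))  # small cache, fast init -> registered
print(registration_outcome(9.8))  # ~9 s blocking ldb scan -> timed out
```

With a small cache the constructor finishes well inside the window; with the ~9-second ldb scan seen in the logs, the handshake lands just past the deadline, matching the observed timeout.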
`sssm_ad_init()` -> `ad_init_auth_ctx()` -> `ad_get_auth_options()`

> (Mon Nov 12 11:02:31 2018) [sssd[be[...]]] [ad_get_auth_options] (0x0100): Option krb5_use_kdcinfo set to true

-> `krb5_child_init()` -> `init_renew_tgt()` -> `check_ccache_files()`

> (Mon Nov 12 11:02:40 2018) [sssd[be[...]]] [check_ccache_files] (0x4000): Found [3] entries with ccache file in cache.

So it is busy somewhere inside `krb5_child_init()`.

> Does `ad` module have anything to do with `winbindd` (as in BZ 1666819)?

Seems not.

More precisely, it *looks* like it is stuck inside `check_ccache_files()` -> `sysdb_search_entry()`:

> (Mon Nov 12 11:02:31 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x56266ed45190
> (Mon Nov 12 11:02:31 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x56266ed45250
> (Mon Nov 12 11:02:31 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [ldb] (0x4000): Running timer event 0x56266ed45190 "ltdb_callback"
> ...

Take note of the timestamps (a 9-second pause):

> (Mon Nov 12 11:02:40 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [ldb] (0x4000): Destroying timer event 0x56266ed45250 "ltdb_timeout"
> (Mon Nov 12 11:02:40 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [ldb] (0x4000): Ending timer event 0x56266ed45190 "ltdb_callback"

(In reply to Alexey Tikhonov from comment #27)
> It *looks* like it is stuck inside `check_ccache_files()` ->
> `sysdb_search_entry()`:

The search in check_ccache_files(), '"(&("SYSDB_CCACHE_FILE"=*)("SYSDB_UC"))"', will be an unindexed search, so the whole cache file must be read and parsed. Do you have any information about how many entries might be in the cache?

bye,
Sumit

(In reply to Sumit Bose from comment #28)
> The search in check_ccache_files()
> '"(&("SYSDB_CCACHE_FILE"=*)("SYSDB_UC"))"' will be an unindexed search, so
> the whole cache file must be read and parsed.

It does not help that the base DN is the root DN, not the user container:

```
base_dn = sysdb_base_dn(renew_tgt_ctx->be_ctx->domain->sysdb, tmp_ctx);
```

so maybe a minimal change would be to use sysdb_user_base_dn() here instead and drop the objectcategory=user clause from the filter (that will be implied).
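The effect of narrowing the base DN on an unindexed search can be illustrated with a toy model. This is hypothetical Python, not the sysdb API; the DNs, attribute names, and entry counts are made up for illustration:

```python
# Toy model of an unindexed ldb search (hypothetical; not the sysdb API).
# Without an index the search must visit every entry under the base DN, so
# searching from the root DN scans the whole cache, while searching from the
# user container visits only user entries and still finds the same matches.

cache = [  # made-up entries standing in for a sysdb cache
    {"dn": "name=alice,cn=users,cn=example.com,cn=sysdb", "ccacheFile": "/tmp/krb5cc_1000"},
    {"dn": "name=bob,cn=users,cn=example.com,cn=sysdb"},
    {"dn": "name=admins,cn=groups,cn=example.com,cn=sysdb"},
    {"dn": "cn=ranges,cn=sysdb"},
]


def unindexed_search(base_dn, predicate):
    """Visit every entry under base_dn; return (matches, entries_scanned)."""
    in_scope = [e for e in cache if e["dn"].endswith(base_dn)]
    return [e for e in in_scope if predicate(e)], len(in_scope)


def has_ccache(entry):
    return "ccacheFile" in entry


# Root DN as base: the whole cache is scanned for a single hit.
root_hits, root_scanned = unindexed_search("cn=sysdb", has_ccache)

# User container as base: the same hit, but only user entries are scanned.
user_hits, user_scanned = unindexed_search("cn=users,cn=example.com,cn=sysdb", has_ccache)

print(root_scanned, user_scanned)  # 4 2
```

In a real cache with thousands of group and miscellaneous entries, the difference between "scan everything" and "scan only the user container" is what turns a sub-second lookup into the multi-second stall seen in the logs.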
Or even just search for all the users and filter those with no ccache on the sssd side.

(In reply to Alexey Tikhonov from comment #27)
> It *looks* like it is stuck inside `check_ccache_files()` -> `sysdb_search_entry()`:

So it seems the issue might be caused by an unindexed (slow) search in a large cache.

(In reply to Alexey Tikhonov from comment #25)
> (2) How is this problem related with sssd upgrade?

As Sumit pointed out, this could be triggered by the upgrade, since in 1.16.1 `objectclass` was changed to `objectcategory` for users (and groups) and hence users were removed from the index.

Upstream ticket with possible solutions: https://pagure.io/SSSD/sssd/issue/3968

> (3) Would it make sense to synchronize two Monitor timeouts (to prevent case
> when responder connects to starting provider)?

Upstream ticket: https://pagure.io/SSSD/sssd/issue/3969

master: e474c2d 96013bb 22fc051
sssd-1-16: e4dd284 7d8b28a 23fb7ea

Verified the bug on SSSD version sssd-1.16.4-13.el7.x86_64.

Steps followed during verification:

1. Install the old SSSD build, sssd-1.16.0-19.el7.x86_64, on a RHEL-7.7 system.

2. Add hundreds of users and groups in the LDAP server and enumerate them on the SSSD client to ensure a large cache size.

```
# ls -lh /var/lib/sss/db/
total 16.0M
-rw-------. 1 root root 11.0M May 20 09:37 cache_LDAP.ldb
-rw-------. 1 root root  1.3M May 20 09:27 config.ldb
-rw-------. 1 root root  1.3M May 20 09:27 sssd.ldb
-rw-------. 1 root root  3.6M May 20 09:32 timestamps_LDAP.ldb
```

3. Execute simultaneous user logins by running a script in the background. This can be done from different terminals in such a way that at least 50 different users attempt to log in within a span of 10 minutes.

OBSERVATION: With the old build, all the user logins were successful and the duration of an individual user login was in the range of 0.5-1.2 seconds.

4. Now upgrade the SSSD build to the latest version: sssd-1.16.4-13.el7.x86_64.

5. Reboot the system.

6. Repeat step 3.

OBSERVATION: With the latest build, all the user login attempts were successful.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2177