Bug 1645461
| Summary: | Slow ldb search causes blocking during startup which might cause the registration to time out | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | afox <afox> |
| Component: | sssd | Assignee: | SSSD Maintainers <sssd-maint> |
| Status: | CLOSED ERRATA | QA Contact: | Amith <apeetham> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.4 | CC: | aborah, afox, apeetham, atikhono, grajaiya, jhrozek, lslebodn, mzidek, pbrezina, rvdwees, sbose, sgoveas, tmihinto, tscherf, yann.lopez |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | sssd-1.16.4-8.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-08-06 13:02:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (afox@redhat.com, 2018-11-02 09:46:55 UTC)
The debug logs say:

```
(Wed Oct 31 13:30:07 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [ad_get_dc_servers_send] (0x0400): Looking up domain controllers in domain cmpd1.metoffice.gov.uk and site Exeter
(Wed Oct 31 13:30:07 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [resolv_discover_srv_next_domain] (0x0400): SRV resolution of service 'ldap'. Will use DNS discovery domain 'Exeter._sites.cmpd1.metoffice.gov.uk'
(Wed Oct 31 13:30:07 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [resolv_getsrv_send] (0x0100): Trying to resolve SRV record of '_ldap._tcp.Exeter._sites.cmpd1.metoffice.gov.uk'
(Wed Oct 31 13:30:07 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [be_ptask_done] (0x0400): Task [Check if online (periodic)]: finished successfully
(Wed Oct 31 13:30:07 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [be_ptask_schedule] (0x0400): Task [Check if online (periodic)]: scheduling task 146 seconds from last execution time [1540992753]
(Wed Oct 31 13:30:07 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [schedule_request_timeout] (0x2000): Scheduling a timeout of 6 seconds
(Wed Oct 31 13:30:07 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [schedule_timeout_watcher] (0x2000): Scheduling DNS timeout watcher
(Wed Oct 31 13:30:07 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [unschedule_timeout_watcher] (0x4000): Unscheduling DNS timeout watcher
(Wed Oct 31 13:30:07 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [request_watch_destructor] (0x0400): Deleting request watch
(Wed Oct 31 13:30:07 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [resolv_discover_srv_done] (0x0040): SRV query failed [11]: Could not contact DNS servers
```

And the syslog says:

```
2018-10-31T13:27:40.900803+00:00 eld115 sssd[be[cmpd1.metoffice.gov.uk]]: Backend is offline
```

but only afterwards do I see:

```
2018-10-31T13:27:41.013067+00:00 eld115 network: Bringing up loopback interface: [ OK ]
2018-10-31T13:27:41.046843+00:00 eld115 network: Bringing up interface enp0s25: [ OK ]
2018-10-31T13:27:41.068487+00:00 eld115 systemd: Started LSB: Bring up/down networking.
2018-10-31T13:27:41.068692+00:00 eld115 systemd: Reached target Network.
```

SSSD should be listening to messages from netlink and retry the network connection once the network is up. As an immediate workaround, you can make the sssd service start after the network is up, e.g.:

```
sh# cat /etc/systemd/system/sssd.service.d/network.conf
[Unit]
Before=systemd-user-sessions.service nss-user-lookup.target network-online.target
Wants=nss-user-lookup.target network-online.target

sh# systemctl daemon-reload
sh# systemctl cat sssd
```

But it would also be good to see why sssd did not detect the change in networking. For that, can we see debug logs with debug_level=10 in the [sssd] section, please?

Can we get the logs with debug_microseconds=true in both the monitor and the backend? I wonder if there is some race condition; perhaps the backend writes to the socket before the monitor sets up the tevent integration on the connection.

So, what is happening is the following. The sssd_be initialization codepath is:

`be_process_init()` -> `dp_init()` -> `dp_init_targets()` -> `dp_load_targets()` -> `dp_target_init()` -> `dp_load_module()` -> `dp_module_create()` -> `dp_module_run_constructor()` ... -> `dp_target_run_constructor()`

In the log:

> (Mon Nov 12 11:02:30 2018) [sssd[be[...]]] [dp_target_init] (0x0400): Initializing target [id] with module [ad]
> (Mon Nov 12 11:02:31 2018) [sssd[be[...]]] [dp_module_run_constructor] (0x0400): Executing module [ad] constructor.
> ...
> (Mon Nov 12 11:02:40 2018) [sssd[be[...]]] [dp_target_run_constructor] (0x0400): Executing target [id] constructor

So it *seems* like `dp_module_run_constructor()` [i.e. calling `sssm_%s_init()` from the module] for the `ad` module hangs and is probably reset by the watchdog (the timings correlate, but I can't be sure without strace logs). So initialization is completed

> (Mon Nov 12 11:02:40 2018) [sssd[be[...]]] [main] (0x0400): Backend provider (...) started!

and the tevent main loop is entered only at 11:02:40. Before that, the internal dbus "OK ..." response to the "AUTH EXTERNAL" request sent from the Monitor sbus server is not even read, the dbus handshake is not completed, and "Sending ID: " (despite how it looks in the log) is *not* actually sent.

Now, the Monitor allows exactly 10 seconds for a service to start:

> /* 10 seconds should be plenty */
> tv = tevent_timeval_current_ofs(10, 0);
> mini->timeout = tevent_add_timer(ctx->ev, mini, tv, init_timeout, mini);

and the service was started at 11:02:29:

> (Mon Nov 12 11:02:29 2018) [sssd] [start_service] (0x0100): Queueing service ... for startup

So I guess the Monitor handles the timeout and closes the connection at roughly the same moment the provider is almost ready to proceed. This is somewhat similar to BZ 1666819, but a different flavor.

Btw, I do not understand why the Monitor allows every service (including providers) 10 seconds to start, but starts the responders earlier:

> (Mon Nov 12 11:02:34 2018) [sssd] [services_startup_timeout] (0x0020): Providers did not start in time, forcing services startup!

So:

(1) What is the `ad` module doing at construction (if I am correct in spotting the "hang point")? Does the `ad` module have anything to do with `winbindd` (as in BZ 1666819)?
(2) How is this problem related to the sssd upgrade?
(3) Would it make sense to synchronize the two Monitor timeouts (to prevent the case when a responder connects to a starting provider)?

(I am clearing "needinfo" because, from my point of view, logs with "debug_microseconds" will not add anything useful.)

Let's dig deeper.

(In reply to Alexey Tikhonov from comment #25)
> So it *seems* like `dp_module_run_constructor()` [i.e. calling
> `sssm_%s_init()` from module] for `ad` module hangs
>
> (1) What is `ad` module doing at construction?
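As an aside, the race created by the fixed 10-second Monitor deadline quoted above can be sketched with a small simulation. This is hypothetical Python, not SSSD code; only the 10-second constant comes from the quoted init_timeout snippet, and the other timings are illustrative:

```python
# Minimal sketch of the Monitor/provider startup race (hypothetical, not SSSD
# code). The Monitor gives a service a fixed window to complete its sbus
# handshake; any synchronous work in the provider's constructors eats into it.

MONITOR_DEADLINE = 10.0  # seconds; from the quoted init_timeout code


def registration_outcome(constructor_seconds, handshake_seconds=0.5):
    """Return whether the provider registers before the Monitor's deadline.

    The provider can only service the dbus handshake once its (blocking)
    constructors return and the tevent main loop is entered, so a slow
    constructor delays registration by exactly its own runtime.
    """
    ready_at = constructor_seconds + handshake_seconds
    return "registered" if ready_at <= MONITOR_DEADLINE else "timed out"


print(registration_outcome(1.0))  # small cache, fast init -> registered
print(registration_outcome(9.8))  # ~9 s blocking ldb scan -> timed out
```

With a small cache the constructor finishes well inside the window; with the ~9-second ldb scan seen in the logs, the handshake lands just past the deadline, matching the observed timeout.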
`sssm_ad_init()` -> `ad_init_auth_ctx()` -> `ad_get_auth_options()`

> (Mon Nov 12 11:02:31 2018) [sssd[be[...]]] [ad_get_auth_options] (0x0100): Option krb5_use_kdcinfo set to true

-> `krb5_child_init()` -> `init_renew_tgt()` -> `check_ccache_files()`

> (Mon Nov 12 11:02:40 2018) [sssd[be[...]]] [check_ccache_files] (0x4000): Found [3] entries with ccache file in cache.

So it is busy somewhere inside `krb5_child_init()`.

> Does `ad` module have anything to do with `winbindd` (as in BZ 1666819)?

Seems not.

More precisely, it *looks* like it is stuck inside `check_ccache_files()` -> `sysdb_search_entry()`:

> (Mon Nov 12 11:02:31 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x56266ed45190
> (Mon Nov 12 11:02:31 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x56266ed45250
> (Mon Nov 12 11:02:31 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [ldb] (0x4000): Running timer event 0x56266ed45190 "ltdb_callback"
> ...

Take note of the timestamps (a 9-second pause):

> (Mon Nov 12 11:02:40 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [ldb] (0x4000): Destroying timer event 0x56266ed45250 "ltdb_timeout"
> (Mon Nov 12 11:02:40 2018) [sssd[be[cmpd1.metoffice.gov.uk]]] [ldb] (0x4000): Ending timer event 0x56266ed45190 "ltdb_callback"

(In reply to Alexey Tikhonov from comment #27)
> It *looks* like it is stuck inside `check_ccache_files()` ->
> `sysdb_search_entry()`:

The search in check_ccache_files(), '"(&("SYSDB_CCACHE_FILE"=*)("SYSDB_UC"))"', will be an unindexed search, so the whole cache file must be read and parsed. Do you have any information about how many entries might be in the cache?

bye,
Sumit

(In reply to Sumit Bose from comment #28)
> The search in check_ccache_files()
> '"(&("SYSDB_CCACHE_FILE"=*)("SYSDB_UC"))"' will be an unindexed search, so
> the whole cache file must be read and parsed.

It does not help that the base DN is the root DN, not the user container:

```
base_dn = sysdb_base_dn(renew_tgt_ctx->be_ctx->domain->sysdb, tmp_ctx);
```

so maybe a minimal change would be to use sysdb_user_base_dn() here instead and drop the objectcategory=user clause from the filter (that will be implied).
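The effect of narrowing the base DN on an unindexed search can be illustrated with a toy model. This is hypothetical Python, not the sysdb API; the DNs, attribute names, and entry counts are made up for illustration:

```python
# Toy model of an unindexed ldb search (hypothetical; not the sysdb API).
# Without an index the search must visit every entry under the base DN, so
# searching from the root DN scans the whole cache, while searching from the
# user container visits only user entries and still finds the same matches.

cache = [  # made-up entries standing in for a sysdb cache
    {"dn": "name=alice,cn=users,cn=example.com,cn=sysdb", "ccacheFile": "/tmp/krb5cc_1000"},
    {"dn": "name=bob,cn=users,cn=example.com,cn=sysdb"},
    {"dn": "name=admins,cn=groups,cn=example.com,cn=sysdb"},
    {"dn": "cn=ranges,cn=sysdb"},
]


def unindexed_search(base_dn, predicate):
    """Visit every entry under base_dn; return (matches, entries_scanned)."""
    in_scope = [e for e in cache if e["dn"].endswith(base_dn)]
    return [e for e in in_scope if predicate(e)], len(in_scope)


def has_ccache(entry):
    return "ccacheFile" in entry


# Root DN as base: the whole cache is scanned for a single hit.
root_hits, root_scanned = unindexed_search("cn=sysdb", has_ccache)

# User container as base: the same hit, but only user entries are scanned.
user_hits, user_scanned = unindexed_search("cn=users,cn=example.com,cn=sysdb", has_ccache)

print(root_scanned, user_scanned)  # 4 2
```

In a real cache with thousands of group and miscellaneous entries, the difference between "scan everything" and "scan only the user container" is what turns a sub-second lookup into the multi-second stall seen in the logs.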
Or even just search for all the users and filter those with no ccache on the sssd side.

(In reply to Alexey Tikhonov from comment #27)
> It *looks* like it is stuck inside `check_ccache_files()` -> `sysdb_search_entry()`:

So it seems the issue might be caused by an unindexed (slow) search in a large cache.

(In reply to Alexey Tikhonov from comment #25)
> (2) How is this problem related with sssd upgrade?

As Sumit pointed out, this could be triggered by the upgrade, since in 1.16.1 `objectclass` was changed to `objectcategory` for users (and groups) and hence users were removed from the index.

Upstream ticket with possible solutions: https://pagure.io/SSSD/sssd/issue/3968

> (3) Would it make sense to synchronize two Monitor timeouts (to prevent case
> when responder connects to starting provider)?

Upstream ticket: https://pagure.io/SSSD/sssd/issue/3969

master: e474c2d 96013bb 22fc051
sssd-1-16: e4dd284 7d8b28a 23fb7ea

Verified the bug on SSSD version sssd-1.16.4-13.el7.x86_64.

Steps followed during verification:

1. Install the old SSSD build, sssd-1.16.0-19.el7.x86_64, on a RHEL-7.7 system.

2. Add hundreds of users and groups in the LDAP server and enumerate them on the SSSD client to ensure a large cache size.

```
# ls -lh /var/lib/sss/db/
total 16.0M
-rw-------. 1 root root 11.0M May 20 09:37 cache_LDAP.ldb
-rw-------. 1 root root  1.3M May 20 09:27 config.ldb
-rw-------. 1 root root  1.3M May 20 09:27 sssd.ldb
-rw-------. 1 root root  3.6M May 20 09:32 timestamps_LDAP.ldb
```

3. Execute simultaneous user logins by running a script in the background. This can be done from different terminals in such a way that at least 50 different users attempt to log in within a span of 10 minutes.

OBSERVATION: With the old build, all the user logins were successful and the duration of an individual user login was in the range of 0.5-1.2 seconds.

4. Now upgrade the SSSD build to the latest version: sssd-1.16.4-13.el7.x86_64.

5. Reboot the system.

6. Repeat step 3.

OBSERVATION: With the latest build, all the user login attempts were successful.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2177