Bug 1022242 - ldap enumeration stuck in an unadjustable 110 second restart loop
ldap enumeration stuck in an unadjustable 110 second restart loop
Status: CLOSED DUPLICATE of bug 888739
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: sssd (Show other bugs)
7.0
Unspecified Linux
unspecified Severity unspecified
: rc
: 7.1
Assigned To: SSSD Maintainers
Namita Soman
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2013-10-22 16:02 EDT by John Brunelle
Modified: 2016-01-08 06:21 EST (History)
7 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-01-08 06:21:56 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description John Brunelle 2013-10-22 16:02:22 EDT
Description of problem:

With ldap enumeration enabled on a domain with a large number of accounts, the enumeration loop never finishes before being killed and restarted, which happens every 110 seconds.  Furthermore, the various *enum* parameters appear to have no effect.

Version-Release number of selected component (if applicable):

sssd-1.9.2-82.10.el6_4.x86_64

How reproducible:

always

Steps to Reproduce:

1. enable enumeration on an sssd ldap domain with a large number of accounts

2. set these parameters:

    [nss]
    enum_cache_timeout = 9999
 
    [domain/rc.domain]
    ldap_enumeration_search_timeout = 9999
    ldap_enumeration_refresh_timeout = 9999
  
3. clean out the cache:

   # rm -fr /var/lib/sss/db/*

4. start sssd:

   # sssd -i -d 0x0480 2>&1 | tee sssd.out

Actual results:

Enumeration restarts every 110 seconds.  The sssd_be process is replaced each time.  The sssd_be processes use ~100% cpu.

Expected results:

The parameters would allow us to extend the timeouts so that a full enumeration could run.

Additional info:

Here are some relevant lines from the output:

# cat sssd.out | grep -iP 'enum|sched'
(Tue Oct 22 13:52:14:804201 2013) [sssd] [confdb_create_ldif] (0x0400): Processing attribute [enum_cache_timeout]
(Tue Oct 22 13:52:14:804266 2013) [sssd] [confdb_create_ldif] (0x0400): Processing attribute [ldap_enumeration_search_timeout]
(Tue Oct 22 13:52:14:804280 2013) [sssd] [confdb_create_ldif] (0x0400): Processing attribute [ldap_enumeration_refresh_timeout]
(Tue Oct 22 13:52:14:804306 2013) [sssd] [confdb_create_ldif] (0x0400): Processing attribute [enumerate]
(Tue Oct 22 13:52:15 2013) [sssd[be[rc.domain]]] [dp_get_options] (0x0400): Option ldap_enumeration_refresh_timeout has value 9999
(Tue Oct 22 13:52:15 2013) [sssd[be[rc.domain]]] [dp_get_options] (0x0400): Option ldap_enumeration_search_timeout has value 9999
(Tue Oct 22 13:52:15 2013) [sssd[be[rc.domain]]] [ldap_id_enumerate_set_timer] (0x0400): Scheduling next enumeration at 1382464335.782098
(Tue Oct 22 13:54:05 2013) [sssd] [mt_svc_restart] (0x0400): Scheduling service rc.domain for restart 1
(Tue Oct 22 13:54:05 2013) [sssd[be[rc.domain]]] [dp_get_options] (0x0400): Option ldap_enumeration_refresh_timeout has value 9999
(Tue Oct 22 13:54:05 2013) [sssd[be[rc.domain]]] [dp_get_options] (0x0400): Option ldap_enumeration_search_timeout has value 9999
(Tue Oct 22 13:54:05 2013) [sssd[be[rc.domain]]] [ldap_id_enumerate_set_timer] (0x0400): Scheduling next enumeration at 1382464445.44580
(Tue Oct 22 13:55:55 2013) [sssd] [mt_svc_restart] (0x0400): Scheduling service rc.domain for restart 1

In daemon mode, it logs messages like this each time the enumeration is killed and restarted:

(Tue Oct 22 12:14:34 2013) [sssd] [mt_svc_sigkill] (0x0010): [rc.domain][31476] is not responding to SIGTERM. Sending SIGKILL.
Comment 2 Lukas Slebodnik 2013-10-22 16:45:53 EDT
What do you exactly mean by "large number of accounts"?
I have approximately 23000 users in my test environment. In my case, it took 90 seconds to cache all users.

You can also try to increase value of parameters timeout and force_timeout.
Detail description and default values are described in manual page
"man sssd.conf"

I would not recommend very big values (9999). 60 seconds can be good starting point.
Comment 3 Jakub Hrozek 2013-10-22 17:41:31 EDT
Hello John,

the option you're looking for is "ldap_enumeration_refresh_timeout". The SSSD sets a timer that, if previous enumeration cannot be completed in time, kills the previous request. In the debug logs, this should be indicated by a log message saying ""Enumeration timed out! Timeout too small?". Do you see such debug message in the logs? (You'd need to set debug_level=1 or higher in the domain section).

According to your opening comment, the problem is not downloading the entries per se, but rather storing them to the cache. You can test easily if it's the case by symlinking the cache that normally resides in /var/lib/sss/db/*.ldb to /dev/shm or similar. But this experiment is only suitable for production systems.

The reason the sssd_be is getting killed is that the worker processes are "pinged" by the main sssd process and in case they don't "pong" in time, they are killed. The processes usually don't pong when they are genuinely stuck as a result of a bug or when there is a long blocking operation, such as saving a huge number of entries to the cache.

In case symlinking the cache would help, we can help you further by raising the time between pings (there is an undocumented option), but I would prefer to find the root cause first.
Comment 4 John Brunelle 2013-10-22 18:41:27 EDT
Aha!  It is indeed timeout and force_timeout that are limiting the enumeration.  By playing with the values, I see it's getting killed and restarted every 5*timeout+force_timeout seconds (5+10+60=110 by default).  So ldap_enumeration_refresh_timeout has no effect on here -- the default value of 300 is longer than 110, and my attempts above at setting it did nothing.  Also, that corresponding message is not in the output.

With huge values for all the timeout and ldap_enumeration parameters, the enumeration takes 6m25s.  This is for 8k users and 2.5k groups.  Okay, so not that large Lukas, just slow :)  Seems it's not the disk though -- I emptied and symlinked /var/lib/sss/db -> /dev/shm/db, reran, and it was essentially no different (6m11s).

To the question of why sssd_be would is so unresponsive that it get's killed, which seems to be the real bug here:  In the log, there is just a steady state of messages about adding users, followed by a repeat of the startup messages:

(Tue Oct 22 14:27:04 2013) [sssd[be[rc.domain]]] [sysdb_search_user_by_name] (0x0400): No such entry
(Tue Oct 22 14:27:04 2013) [sssd[be[rc.domain]]] [sysdb_search_user_by_uid] (0x0400): No such entry
(Tue Oct 22 14:27:04 2013) [sssd[be[rc.domain]]] [sdap_save_user] (0x0400): Storing info for user REDACTED_USERNAME1
(Tue Oct 22 14:27:04 2013) [sssd[be[rc.domain]]] [sysdb_search_user_by_name] (0x0400): No such entry
(Tue Oct 22 14:27:04 2013) [sssd[be[rc.domain]]] [sysdb_search_user_by_uid] (0x0400): No such entry
(Tue Oct 22 14:27:04 2013) [sssd[be[rc.domain]]] [sdap_save_user] (0x0400): Storing info for user REDACTED_USERNAME2
(Tue Oct 22 14:27:04 2013) [sssd[be[rc.domain]]] [sysdb_search_user_by_name] (0x0400): No such entry
(Tue Oct 22 14:27:04 2013) [sssd[be[rc.domain]]] [sysdb_search_user_by_uid] (0x0400): No such entry
(Tue Oct 22 14:27:05 2013) [sssd] [sbus_dispatch] (0x0080): Connection is not open for dispatching.
(Tue Oct 22 14:27:05 2013) [sssd] [mt_svc_restart] (0x0400): Scheduling service rc.domain for restart 1
(Tue Oct 22 14:27:05:020404 2013) [sssd[be[rc.domain]]] [ldb] (0x0400): server_sort:Unable to register control with rootdse!
(Tue Oct 22 14:27:05 2013) [sssd[be[rc.domain]]] [server_setup] (0x0400): CONFDB: /var/lib/sss/db/config.ldb
(Tue Oct 22 14:27:05 2013) [sssd[be[rc.domain]]] [fo_context_init] (0x0400): Created new fail over context, retry timeout is 30

So it appears to just be doing its thing and then gets killed.  I've straced sssd_be, and it's just alternating memory management and locks on the /var/lib/sss/db/cache_rc.domain.ldb file, e.g.:

brk(0x9207000)                          = 0x9207000
brk(0x9228000)                          = 0x9228000
brk(0x9249000)                          = 0x9249000
fcntl(14, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=48, len=1}) = 0
fcntl(14, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=48, len=1}) = 0
fcntl(14, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=48, len=1}) = 0

Thank you very much for the prompt replies, Lukas and Jakub.
Comment 5 Jakub Hrozek 2013-10-22 18:58:34 EDT
I think we might bee seeing the memberof plugin acting up.. The memberof plugin is a cache plugin that automatically builds transitive links between a user or a group and its LDAP parent(s) (if B is a member of C already and you assing A as member of B, the links are autogenerated all the way through to C). The memberof plugin was known to be a culprit already, albeit in much larger environments than yours. Do you see CPU spiking up when the SSSD runs the enumeration? That might point to the memberof plugin..

The number of users and groups in your environment is not so large, so I'm surprised a bit about the bad performance.

What LDAP server does your organization use? Many recent LDAP servers support receiving only "deltas" since the last update by checking the lastUSN attribute, but since the attribute is per-server, the deltas only have effect after the initial full enumeration.

As per why the sssd is unresponsive when saving the users or groups -- the data is only ever written to disk during a cache transaction. Because transactions are a bit costly, we try to save all users and groups in one large go (IIRC also the memberof plugin is only invoked during the transaction commit). The single large transaction is a synchronous action during which the SSSD does not return control to the mainloop which would otherwise "pong" the main sssd process until all data is written to the disk and all memberships are build. This is a known shortcoming but not resolved in the current SSSD versions.

May I ask what is the reason to run enumeration at all? Does any legacy application depend on having all the users available? If possible, I think the easiest solution would be to disable enumeration completely..
Comment 6 John Brunelle 2013-10-23 08:07:48 EDT
Yes, sssd_be cpu is ~100% during the enumeration.  We don't have any intentions one way or the other with memberof (our config file doesn't mention it).

The LDAP server is Microsoft Active Directory.

Enumeration is just useful, with several scripts and utilities here and there expecting it to work.  Nothing we can't arrange to run differently, just sticking with it until it really doesn't work.

If the sssd_be unresponsivness and killing is understood, even if not desirable, I suppose the only issue here was that the associated timeouts are less than ldap_enumeration_refresh_timeout, making the latter moot in our case.  So, just confusion about the parameters; no actual bug.

Given we can raise those timeouts to get a full enumeration, and everything (including enumeration) works with default timeouts if we pre-seed sssd with a recent cache_DOMAIN.ldb file, we can work around this, with an ultimate goal of moving away from enumeration.

Thanks again!
Comment 7 Jakub Hrozek 2013-10-24 04:42:30 EDT
I'm glad the setup works for you now. We are planning on enhancing long-running requests in general as part of upstream version 1.13. For now, I'm going to link this bugzilla with the appropriate upstream ticket.

Thanks for reporting the bugzilla and working with us to resolve your issue!
Comment 8 Jakub Hrozek 2013-10-24 04:46:14 EDT
Upstream ticket:
https://fedorahosted.org/sssd/ticket/1829
Comment 9 Jakub Hrozek 2013-10-24 04:48:35 EDT
Upstream ticket:
https://fedorahosted.org/sssd/ticket/1729
Comment 10 Jakub Hrozek 2016-01-08 06:21:56 EST
There were two bugzillas tracking the same problem, marking as a duplicate.

*** This bug has been marked as a duplicate of bug 888739 ***

Note You need to log in before you can comment on or make changes to this bug.