Bug 1022242
Summary: | ldap enumeration stuck in an unadjustable 110 second restart loop | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | John Brunelle <john_brunelle> |
Component: | sssd | Assignee: | SSSD Maintainers <sssd-maint> |
Status: | CLOSED DUPLICATE | QA Contact: | Namita Soman <nsoman> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 7.0 | CC: | grajaiya, jgalipea, jhrozek, jwm, lslebodn, mkosek, pbrezina |
Target Milestone: | rc | ||
Target Release: | 7.1 | ||
Hardware: | Unspecified | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2016-01-08 11:21:56 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
John Brunelle
2013-10-22 20:02:22 UTC
What do you exactly mean by "large number of accounts"? I have approximately 23000 users in my test environment. In my case, it took 90 seconds to cache all users. You can also try to increase value of parameters timeout and force_timeout. Detail description and default values are described in manual page "man sssd.conf" I would not recommend very big values (9999). 60 seconds can be good starting point. Hello John, the option you're looking for is "ldap_enumeration_refresh_timeout". The SSSD sets a timer that, if previous enumeration cannot be completed in time, kills the previous request. In the debug logs, this should be indicated by a log message saying ""Enumeration timed out! Timeout too small?". Do you see such debug message in the logs? (You'd need to set debug_level=1 or higher in the domain section). According to your opening comment, the problem is not downloading the entries per se, but rather storing them to the cache. You can test easily if it's the case by symlinking the cache that normally resides in /var/lib/sss/db/*.ldb to /dev/shm or similar. But this experiment is only suitable for production systems. The reason the sssd_be is getting killed is that the worker processes are "pinged" by the main sssd process and in case they don't "pong" in time, they are killed. The processes usually don't pong when they are genuinely stuck as a result of a bug or when there is a long blocking operation, such as saving a huge number of entries to the cache. In case symlinking the cache would help, we can help you further by raising the time between pings (there is an undocumented option), but I would prefer to find the root cause first. Aha! It is indeed timeout and force_timeout that are limiting the enumeration. By playing with the values, I see it's getting killed and restarted every 5*timeout+force_timeout seconds (5+10+60=110 by default). So ldap_enumeration_refresh_timeout has no effect on here -- the default value of 300 is longer than 110, and my attempts above at setting it did nothing. Also, that corresponding message is not in the output. With huge values for all the timeout and ldap_enumeration parameters, the enumeration takes 6m25s. This is for 8k users and 2.5k groups. Okay, so not that large Lukas, just slow :) Seems it's not the disk though -- I emptied and symlinked /var/lib/sss/db -> /dev/shm/db, reran, and it was essentially no different (6m11s). To the question of why sssd_be would is so unresponsive that it get's killed, which seems to be the real bug here: In the log, there is just a steady state of messages about adding users, followed by a repeat of the startup messages: (Tue Oct 22 14:27:04 2013) [sssd[be[rc.domain]]] [sysdb_search_user_by_name] (0x0400): No such entry (Tue Oct 22 14:27:04 2013) [sssd[be[rc.domain]]] [sysdb_search_user_by_uid] (0x0400): No such entry (Tue Oct 22 14:27:04 2013) [sssd[be[rc.domain]]] [sdap_save_user] (0x0400): Storing info for user REDACTED_USERNAME1 (Tue Oct 22 14:27:04 2013) [sssd[be[rc.domain]]] [sysdb_search_user_by_name] (0x0400): No such entry (Tue Oct 22 14:27:04 2013) [sssd[be[rc.domain]]] [sysdb_search_user_by_uid] (0x0400): No such entry (Tue Oct 22 14:27:04 2013) [sssd[be[rc.domain]]] [sdap_save_user] (0x0400): Storing info for user REDACTED_USERNAME2 (Tue Oct 22 14:27:04 2013) [sssd[be[rc.domain]]] [sysdb_search_user_by_name] (0x0400): No such entry (Tue Oct 22 14:27:04 2013) [sssd[be[rc.domain]]] [sysdb_search_user_by_uid] (0x0400): No such entry (Tue Oct 22 14:27:05 2013) [sssd] [sbus_dispatch] (0x0080): Connection is not open for dispatching. (Tue Oct 22 14:27:05 2013) [sssd] [mt_svc_restart] (0x0400): Scheduling service rc.domain for restart 1 (Tue Oct 22 14:27:05:020404 2013) [sssd[be[rc.domain]]] [ldb] (0x0400): server_sort:Unable to register control with rootdse! (Tue Oct 22 14:27:05 2013) [sssd[be[rc.domain]]] [server_setup] (0x0400): CONFDB: /var/lib/sss/db/config.ldb (Tue Oct 22 14:27:05 2013) [sssd[be[rc.domain]]] [fo_context_init] (0x0400): Created new fail over context, retry timeout is 30 So it appears to just be doing its thing and then gets killed. I've straced sssd_be, and it's just alternating memory management and locks on the /var/lib/sss/db/cache_rc.domain.ldb file, e.g.: brk(0x9207000) = 0x9207000 brk(0x9228000) = 0x9228000 brk(0x9249000) = 0x9249000 fcntl(14, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=48, len=1}) = 0 fcntl(14, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=48, len=1}) = 0 fcntl(14, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=48, len=1}) = 0 Thank you very much for the prompt replies, Lukas and Jakub. I think we might bee seeing the memberof plugin acting up.. The memberof plugin is a cache plugin that automatically builds transitive links between a user or a group and its LDAP parent(s) (if B is a member of C already and you assing A as member of B, the links are autogenerated all the way through to C). The memberof plugin was known to be a culprit already, albeit in much larger environments than yours. Do you see CPU spiking up when the SSSD runs the enumeration? That might point to the memberof plugin.. The number of users and groups in your environment is not so large, so I'm surprised a bit about the bad performance. What LDAP server does your organization use? Many recent LDAP servers support receiving only "deltas" since the last update by checking the lastUSN attribute, but since the attribute is per-server, the deltas only have effect after the initial full enumeration. As per why the sssd is unresponsive when saving the users or groups -- the data is only ever written to disk during a cache transaction. Because transactions are a bit costly, we try to save all users and groups in one large go (IIRC also the memberof plugin is only invoked during the transaction commit). The single large transaction is a synchronous action during which the SSSD does not return control to the mainloop which would otherwise "pong" the main sssd process until all data is written to the disk and all memberships are build. This is a known shortcoming but not resolved in the current SSSD versions. May I ask what is the reason to run enumeration at all? Does any legacy application depend on having all the users available? If possible, I think the easiest solution would be to disable enumeration completely.. Yes, sssd_be cpu is ~100% during the enumeration. We don't have any intentions one way or the other with memberof (our config file doesn't mention it). The LDAP server is Microsoft Active Directory. Enumeration is just useful, with several scripts and utilities here and there expecting it to work. Nothing we can't arrange to run differently, just sticking with it until it really doesn't work. If the sssd_be unresponsivness and killing is understood, even if not desirable, I suppose the only issue here was that the associated timeouts are less than ldap_enumeration_refresh_timeout, making the latter moot in our case. So, just confusion about the parameters; no actual bug. Given we can raise those timeouts to get a full enumeration, and everything (including enumeration) works with default timeouts if we pre-seed sssd with a recent cache_DOMAIN.ldb file, we can work around this, with an ultimate goal of moving away from enumeration. Thanks again! I'm glad the setup works for you now. We are planning on enhancing long-running requests in general as part of upstream version 1.13. For now, I'm going to link this bugzilla with the appropriate upstream ticket. Thanks for reporting the bugzilla and working with us to resolve your issue! Upstream ticket: https://fedorahosted.org/sssd/ticket/1829 Upstream ticket: https://fedorahosted.org/sssd/ticket/1729 There were two bugzillas tracking the same problem, marking as a duplicate. *** This bug has been marked as a duplicate of bug 888739 *** |