Bug 1886492

Summary:	Lookups in SSSD cache of AD accounts entries took long time if cache size is above 100MB
Product:	Red Hat Enterprise Linux 8	Reporter:	PALLAVI <palsoni>
Component:	sssd	Assignee:	Alexey Tikhonov <atikhono>
Status:	CLOSED WONTFIX	QA Contact:	sssd-qe
Severity:	medium	Docs Contact:
Priority:	low
Version:	8.3	CC:	aboscatt, atikhono, grajaiya, jhrozek, lslebodn, mzidek, pbrezina, sbose, tscherf
Target Milestone:	rc	Keywords:	Triaged
Target Release:	8.0	Flags:	pm-rhel: mirror+
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	sync-to-jira
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2023-05-31 16:48:58 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Comment 6 Alexey Tikhonov 2020-10-08 18:24:18 UTC

The user is a member of 678 groups, hence `id` triggers 678 "Group by ID" requests.
There are no gaps in the logs; all those requests are processed evenly.
So it takes ~120 ms to process one group. That is not fast, given that "ignore_group_members=true" is set and the ldb cache is mounted on tmpfs.


It seems what happens is the following:
 - despite "ignore_group_members=true", groups in the cache are populated with users via repeated `id` (initgroups) calls (as a result, every group in the cache has 160 users on *average*)
 - sysdb_add_group_member_overrides() iterates over all those users (the check for "ignore_group_members" happens later)
 - it takes less than 0.7 ms per user on average, but over 678 groups this still adds up to about 80 seconds
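
The figures above can be cross-checked with a few lines of arithmetic. This is only a sketch: the constants are the numbers quoted in this comment, and the function names are made up for illustration.

```python
# Figures quoted in this comment; nothing here is measured by the script itself.
GROUPS = 678            # "Group by ID" requests triggered by one `id` call
MS_PER_GROUP = 120.0    # observed processing time per group, in ms
USERS_PER_GROUP = 160   # average number of cached members per group
MS_PER_USER = 0.7       # upper bound on the per-member cost, in ms


def total_seconds(groups: int, ms_per_group: float) -> float:
    """Total time for one `id` call if every group costs ms_per_group."""
    return groups * ms_per_group / 1000.0


def per_group_ms(users_per_group: int, ms_per_user: float) -> float:
    """Per-group cost if it is dominated by iterating over cached members."""
    return users_per_group * ms_per_user


if __name__ == "__main__":
    observed = total_seconds(GROUPS, MS_PER_GROUP)
    bound = per_group_ms(USERS_PER_GROUP, MS_PER_USER)
    print(f"observed total: {observed:.1f} s")
    print(f"per-group bound from member iteration: {bound:.0f} ms")
    print(f"total implied by that bound: {total_seconds(GROUPS, bound):.1f} s")
```

The member-iteration bound per group comes out slightly below the observed ~120 ms, which is consistent with "less than 0.7 ms" per user and a total of roughly 80 seconds.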


What can be considered for improvement:
  - perhaps `sysdb_add_group_member_overrides()` could be optimized a little bit
  - perhaps `sysdb_add_group_member_overrides()` could be skipped completely in case "ignore_group_members=true"

But to be realistic, and taking into account the status of RHEL7, I doubt this will or can be addressed in RHEL7.

Let's see if we can come up with a workaround.

Taking into account that the user is fine with requesting info from LDAP on every lookup (to quote: "expected is less then 30s, but optimaly proven as 2-3s"), I think we can propose:
 - either setting a very low value of `entry_cache_timeout` (or perhaps `entry_cache_group_timeout`)
 - or destroying the ldb cache periodically via a call to `sss_cache`.
Sumit, what would be your opinion on this?

Comment 7 PALLAVI 2020-10-09 12:14:32 UTC

Hi, regarding your query below:

---------------

Created By:  (10/9/2020 12:03 AM) Last Modified By: Alexey Tikhonov  (10/9/2020 12:03 AM)
Hi,

>  Whatever SSSD configuration was used the results after 500 user lookups are always same, taking more than minute

Does lookup time grow gradually (i.e. a little bit slower with every new user) or does it happen abruptly (i.e. it is fast until ~500 users and then suddenly slow)?

(I suspect it grows gradually.)

---------------

The customer has replied:

-----------------------------------------------------------
Yes, the bigger the cache, the slower the lookups are.

The size of the cache grows as more distinct users log on, and we need to clear the cache from time to time. Systems with many customer contacts require this cleanup every week or even earlier.

When we used 'ldap_purge_cache_timeout = 10800', SSSD stopped working properly after 3-6 hours, because the purge operation drove sssd_be to 100% CPU consumption and lookups became very slow, even slower than when no cache purge was done.


Technically, on idle systems this worked fine. I ran a test with 500 users: after 3 hours the cache was purged and lookups were fine. We implemented it on 300 servers and it looked promising. But on the 20 most heavily utilized systems, SSSD authentication was almost not working. So we rolled back this change; the situation has stabilized but is still not good.

-------------------------------------------------------

Thanks & Regards,
Pallavi Soni

Comment 9 Sumit Bose 2020-10-12 06:44:41 UTC

Hi,

Given the correlation between cache size and delay, the cause might be a missing index. I would suggest setting the environment variables

    LDB_WARN_UNINDEXED=1
    LDB_WARN_REINDEX=1

which will add log messages like:

    ... ldb FULL SEARCH: (|(objectClass=*)(distinguishedName=*)) SCOPE: sub DN: cn=config ...

or

    ... [sssd] [ldb] (0x0020): Reindexing /var/lib/sss/db/config.ldb due to modification on ...

respectively.

Some of these messages are expected, but it would be good to have debug logs, captured while the system is slow and with those messages enabled, to understand whether adding an index might help to speed things up.
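
Once such logs are collected, a small script can tally which filters trigger full searches and which files get reindexed. The following is a minimal sketch: the two regular expressions are modeled only on the sample messages above, so real log lines may need adjusted patterns, and `summarize` is a hypothetical helper, not part of SSSD or ldb.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Patterns derived from the two sample messages above (an assumption:
# actual log lines may carry different prefixes or spacing).
FULL_SEARCH = re.compile(
    r"ldb FULL SEARCH: (?P<filter>.+?) SCOPE: (?P<scope>\S+) DN: (?P<dn>\S+)"
)
REINDEX = re.compile(r"Reindexing (?P<path>\S+) due to modification")


def summarize(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count unindexed searches per (filter, base DN) and reindex events per file."""
    full_searches: Counter = Counter()
    reindexes: Counter = Counter()
    for line in lines:
        match = FULL_SEARCH.search(line)
        if match:
            full_searches[(match["filter"], match["dn"])] += 1
            continue
        match = REINDEX.search(line)
        if match:
            reindexes[match["path"]] += 1
    return full_searches, reindexes


if __name__ == "__main__":
    import sys

    # Usage: python summarize_ldb.py < some_sssd_debug.log
    searches, reindexed = summarize(sys.stdin)
    for (ldb_filter, dn), count in searches.most_common(20):
        print(f"{count:6d}  FULL SEARCH  {ldb_filter}  (base: {dn})")
    for path, count in reindexed.most_common():
        print(f"{count:6d}  REINDEX      {path}")
```

Filters that dominate the full-search count while the system is slow would be the first candidates to check for a missing index.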

bye,
Sumit

Comment 12 Alexey Tikhonov 2020-10-23 10:12:10 UTC

Hi,

As a workaround, please try setting lower values for `entry_cache_timeout` and `ldap_purge_cache_timeout`.
The idea behind this workaround is that, while you can't disable the cache completely, you can set up its expiration and purging to prevent the group cache growth that results in poor performance.

The specific values for those options should be tuned depending on machine load. I would start by setting a fairly low `entry_cache_timeout` value (it depends on how often users log in) and see if this already helps. If that's not enough, then I would add `ldap_purge_cache_timeout` with a value comparable to that of `entry_cache_timeout` to actually remove expired entries from the cache.

Please note that this is not a proper fix (which is described in comment 6) but merely a workaround.
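
To make the shape of that workaround concrete, the sketch below renders the two options as an sssd.conf domain-section fragment. The domain name `example.com` and the value 600 are placeholders chosen for illustration only, not recommended settings, and `render_workaround` is a hypothetical helper. The fragment would still have to be merged by hand into the existing domain section.

```python
import configparser
import io


def render_workaround(domain: str, entry_timeout: int, purge_timeout: int) -> str:
    """Render the two workaround options as an sssd.conf-style fragment."""
    cfg = configparser.ConfigParser()
    cfg[f"domain/{domain}"] = {
        # How long cached entries are considered valid, in seconds.
        "entry_cache_timeout": str(entry_timeout),
        # How often expired entries are removed from the cache, in seconds.
        "ldap_purge_cache_timeout": str(purge_timeout),
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    # Placeholder domain and values; tune per machine load as described above.
    print(render_workaround("example.com", entry_timeout=600, purge_timeout=600))
```

Given the report in comment 7 that purging drove sssd_be to 100% CPU on the busiest hosts, it seems prudent to try any chosen values on a loaded system before a wide rollout.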