Bug 1886492 - Lookups of AD account entries in the SSSD cache take a long time if the cache size is above 100 MB
Summary: Lookups of AD account entries in the SSSD cache take a long time if the cache size is above 100 MB
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: sssd
Version: 8.3
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: rc
Target Release: 8.0
Assignee: Alexey Tikhonov
QA Contact: sssd-qe
URL:
Whiteboard: sync-to-jira
Depends On:
Blocks:
 
Reported: 2020-10-08 15:03 UTC by PALLAVI
Modified: 2024-03-25 16:40 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-05-31 16:48:58 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+



Comment 6 Alexey Tikhonov 2020-10-08 18:24:18 UTC
User is a member of 678 groups, hence `id` triggers 678 "Group by ID" requests.
There are no gaps; all those requests are processed evenly.
So it takes ~120 ms to process one group. Not blazing fast, considering that "ignore_group_members=true" is set and the ldb cache is mounted on tmpfs...


It seems the following is happening:
 - despite "ignore_group_members=true", groups in the cache are populated with users via repeated `id` (initgroups) calls (as a result, every group in the cache has 160 users on *average*)
 - sysdb_add_group_member_overrides() iterates over all those users (the check for "ignore_group_members" happens later)
 - it takes less than 0.7 ms per user on average, but it still adds up to ~80 seconds
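
(Roughly: 678 groups × ~160 members per group × ~0.7 ms per member ≈ 76 seconds, which is consistent with both the ~120 ms per group and the ~80 seconds total noted above.)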


What can be considered for improvement:
  - perhaps `sysdb_add_group_member_overrides()` could be optimized a little bit
  - perhaps `sysdb_add_group_member_overrides()` could be skipped completely in case "ignore_group_members=true"

But to be realistic, and taking into account the status of RHEL 7, I doubt this will or can be addressed in RHEL 7.

Let's see if we can come up with a workaround.

Taking into account that the user is fine with requesting info from LDAP on every lookup (to quote: "expected is less then 30s, but optimaly proven as 2-3s"), I think we can propose:
 - either setting a very low value of `entry_cache_timeout` (or perhaps `entry_cache_group_timeout`),
 - or destroying the ldb cache periodically via a call to `sss_cache` (a rough sketch of both options follows below).
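
For illustration only, a minimal sketch of what these two workarounds could look like. `entry_cache_group_timeout` and `sss_cache -G` (invalidate all cached group entries) are existing SSSD options/commands; the domain name, timeout value, and cron schedule below are placeholder assumptions to be tuned per environment:

    # Option 1: /etc/sssd/sssd.conf -- expire cached group entries quickly
    [domain/example.com]
    entry_cache_group_timeout = 300

    # Option 2: periodically invalidate cached groups, e.g. from a weekly cron job
    # /etc/cron.weekly/sssd-purge-groups  (hypothetical script name)
    #!/bin/sh
    /usr/sbin/sss_cache -G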
Sumit, what would be your opinion on this?

Comment 7 PALLAVI 2020-10-09 12:14:32 UTC
Hi, to your query below,

---------------

Created By:  (10/9/2020 12:03 AM) Last Modified By: Alexey Tikhonov  (10/9/2020 12:03 AM)
Hi,

>  Whatever SSSD configuration was used the results after 500 user lookups are always same, taking more than minute

Does lookup time grow gradually (i.e. a little bit slower with every new user) or does it happen abruptly (i.e. it is fast until ~500 users and then suddenly slow)?

(I suspect it grows gradually.)

---------------

The customer has replied,

-----------------------------------------------------------
Yes, the bigger the cache is, the slower the lookups are.

The size of the cache grows as more distinct users log on, and we need to clear the cache from time to time. Systems with many customer contacts require this cleanup every week or even earlier.

When we used 'ldap_purge_cache_timeout = 10800', SSSD stopped working properly after 3-6 hours: the cache purge operation caused sssd_be to consume 100% CPU, so lookups were very slow, even slower than when no cache purge was done.


Technically, on idle systems this worked fine. I ran a test with 500 users, and after 3 hours the cache was purged and lookups were fine. We implemented it on 300 servers and it looked promising. But the last 20, most heavily utilized, systems had almost non-working SSSD authentication, so we rolled back this change; the situation has stabilized but is still not good.

-------------------------------------------------------

Thanks & Regards,
Pallavi Soni

Comment 9 Sumit Bose 2020-10-12 06:44:41 UTC
Hi,

due to the correlation between cache size and delay, it might be a missing index. I would suggest adding

    LDB_WARN_UNINDEXED=1
    LDB_WARN_REINDEX=1

which will add log messages like:

    ... ldb FULL SEARCH: (|(objectClass=*)(distinguishedName=*)) SCOPE: sub DN: cn=config ...

or

    ... [sssd] [ldb] (0x0020): Reindexing /var/lib/sss/db/config.ldb due to modification on ...

respectively.

Some of these messages are expected, but it would be nice to have debug logs with those messages enabled from a system while it is slow, to understand whether adding an index might help to speed things up.
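
For reference, one way these variables could be enabled, assuming the sssd service reads /etc/sysconfig/sssd as an environment file (that file path is an assumption; adjust to how sssd is started on the affected system), followed by a restart of sssd (e.g. `systemctl restart sssd`):

    # /etc/sysconfig/sssd  (assumed environment file for the sssd service)
    LDB_WARN_UNINDEXED=1
    LDB_WARN_REINDEX=1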

bye,
Sumit

Comment 12 Alexey Tikhonov 2020-10-23 10:12:10 UTC
Hi,

as a workaround, please try setting lower values of `entry_cache_timeout` and `ldap_purge_cache_timeout`.
The idea behind this workaround is that, while you can't disable the cache completely, you may try to set up its expiration and purging to prevent the group-cache growth that results in poor performance.

The specific values for those options should be tuned depending on the machine's workload. I would start by setting a fairly low `entry_cache_timeout` value (depending on how often users log in) and see if that already helps. If that's not enough, then I would add `ldap_purge_cache_timeout` with a value comparable to `entry_cache_timeout` to actually remove expired entries from the cache. A configuration sketch follows below.
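
For illustration, a minimal sssd.conf sketch of this tuning; both options are existing SSSD domain options, but the domain name and the numeric values below are placeholder assumptions, not recommendations:

    # /etc/sssd/sssd.conf
    [domain/example.com]
    # expire cached entries quickly so stale group data does not accumulate
    entry_cache_timeout = 600
    # periodically remove expired entries from the ldb cache
    ldap_purge_cache_timeout = 900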

Please note that this is not a proper fix (a proper fix is described in comment 6) but merely a workaround.

