Bug 888739 - Enumerating a large number of users makes sssd_be hog the CPU for a long time.
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: sssd
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: rc
Assignee: SSSD Maintainers
QA Contact: Amith
Aneta Šteflová Petrová
URL:
Whiteboard:
Keywords:
Duplicates: 1022242
Depends On:
Blocks: 1420851 1467835 1472344
 
Reported: 2012-12-19 11:34 UTC by Kaushik Banerjee
Modified: 2018-05-02 07:20 UTC (History)
21 users (show)

Fixed In Version: sssd-1.16.0-7.el7
Doc Type: Known Issue
Doc Text:
Enumerating a large number of users results in high CPU load and slows down other operations.

When `enumerate=true` is set in the `/etc/sssd/sssd.conf` file and a large number of users (for example, 30,000) are present in the LDAP server, the following performance problems occur:

* the *sssd_be* process consumes almost 99% of CPU resources
* certain operations, such as logging in as a local user or logging out, take unexpectedly long to complete
* running the *ldbsearch* operation on the *sysdb* and *timestamp* caches fails with an error reporting that the indexed and full searches both failed

Note that this is not a new known issue; these problems occurred in previous releases of SSSD as well.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-10 17:09:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
SSSD Domain log file, refer comment 27 (6.63 MB, application/x-gzip)
2016-09-23 19:39 UTC, Amith


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2018:0929 None None None 2018-04-10 17:10 UTC
Red Hat Knowledge Base (Solution) 2162321 None None None 2016-02-16 20:13 UTC

Description Kaushik Banerjee 2012-12-19 11:34:06 UTC
Description of problem:
Enumerating a large number of users makes sssd_be hog the CPU for a long time.

Version-Release number of selected component (if applicable):
1.9.2-46

How reproducible:
Always

Steps to Reproduce:
1. I have 30k users in the LDAP server.
2. Set enumerate=true in sssd.conf and restart sssd.
3. Check the CPU usage and try to log in as a local user.
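The setup in the steps above can be sketched as a minimal sssd.conf domain section; the server URI and search base here are placeholders, not taken from the reporter's environment:

```ini
; /etc/sssd/sssd.conf -- minimal sketch for reproducing the issue
[sssd]
services = nss, pam
domains = LDAP

[domain/LDAP]
id_provider = ldap
ldap_uri = ldap://ldap.example.com        ; placeholder server
ldap_search_base = dc=example,dc=com      ; placeholder base DN
enumerate = true                          ; triggers the full-directory download
```

Restart SSSD after editing (systemctl restart sssd, or service sssd restart on RHEL 6) and watch sssd_be in top.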
  
Actual results:
sssd_be consumes almost 99% of CPU indefinitely (I observed it for 20 minutes):
 4850 root      20   0  387m 172m 3464 R 99.3 17.3   1:22.40 sssd_be           

During this time, a simple login/logout of a local user took almost 2 mins:

# time su - testlocal
[testlocal@dhcp201-200 ~]$ logout

real	1m46.156s
user	0m0.007s
sys	0m0.013s


Expected results:
Other operations should not be slowed down during enumeration.

Additional info:

Comment 2 Pavel Březina 2012-12-19 11:56:14 UTC
Upstream ticket:
https://fedorahosted.org/sssd/ticket/1729

Comment 3 Jakub Hrozek 2012-12-19 15:47:05 UTC
Kaushik, have you had a chance to test 6.3 packages on exactly the same setup? Do they also eat CPU?

Comment 4 Kaushik Banerjee 2012-12-19 16:51:13 UTC
(In reply to comment #3)
> Kaushik, have you had a chance to test 6.3 packages on exactly the same
> setup? Do they also eat CPU?

Just did a test on 6.3.
Almost the same result. sssd_be CPU usage has been at 100% for the last 30 minutes, and login as a local user took 10 minutes.

Comment 8 Gary Molenkamp 2013-03-26 17:28:18 UTC
I have observed this behaviour after updating from 6.2 to 6.3, and it only occurs when enumerate is set to true.

When running with debug_level=8, I observed sssd_be loading our openldap db, then immediately doing so again, until I killed sssd_be with SIGKILL.

i.e.:
(Tue Mar 26 13:20:42 2013) [sssd[be[default]]] [sdap_save_user] (0x2000): Adding originalDN [uid=xxxxxxx] to attributes of [gary].
(Tue Mar 26 13:20:42 2013) [sssd[be[default]]] [sdap_save_user] (0x1000): Original memberOf is not available for [gary].
(Tue Mar 26 13:20:42 2013) [sssd[be[default]]] [sdap_attrs_add_ldap_attr] (0x2000): Adding original mod-Timestamp [20130221151906Z] to attributes of [gary].
(Tue Mar 26 13:20:42 2013) [sssd[be[default]]] [sdap_save_user] (0x1000): Original USN value is not available for [gary].
(Tue Mar 26 13:20:42 2013) [sssd[be[default]]] [sdap_save_user] (0x1000): User principal is not available for [gary].

This repeats ~10 times for all of our users in 15 minutes.  

Turning enumerate off returns sssd to normal behaviour.

Comment 9 Jakub Hrozek 2013-03-26 18:28:31 UTC
(In reply to comment #8)
> I have observed this behaviour after updating from 6.2 to 6.3, and it only
> occurs when enumerate is set to true.
> 

Yes, this is quite expected and not really easily solvable with the current SSSD architecture. We are tracking an enhancement upstream to handle long-running requests better, but it's not going to happen soon.

Keep in mind that enumerate=true can in some cases amount to saving a copy of your directory locally.

> When running with debug_level=8, I observed sssd_be loading our openldap db,
> then immediately doing so again, until I killed the sssd_be with sigkill.
> 
> ie:
> (Tue Mar 26 13:20:42 2013) [sssd[be[default]]] [sdap_save_user] (0x2000):
> Adding originalDN [uid=xxxxxxx] to attributes of [gary].
> (Tue Mar 26 13:20:42 2013) [sssd[be[default]]] [sdap_save_user] (0x1000):
> Original memberOf is not available for [gary].
> (Tue Mar 26 13:20:42 2013) [sssd[be[default]]] [sdap_attrs_add_ldap_attr]
> (0x2000): Adding original mod-Timestamp [20130221151906Z] to attributes of
> [gary].
> (Tue Mar 26 13:20:42 2013) [sssd[be[default]]] [sdap_save_user] (0x1000):
> Original USN value is not available for [gary].
> (Tue Mar 26 13:20:42 2013) [sssd[be[default]]] [sdap_save_user] (0x1000):
> User principal is not available for [gary].
> 
> This repeats ~10 times for all of our users in 15 minutes.  
> 
> Turning enumerate off return sssd to normal behaviour.

enumerate=false is the recommended setting. It's currently not reflected in the man pages, I think, but it's going to be in the next release.

Comment 10 Serigne Diagne 2013-10-11 05:57:38 UTC
Hi all, 

We are experiencing the same on a Red Hat Enterprise Linux Server release 5.10 (Tikanga) machine. Does the fix apply to that release?

Thanks in advance for your feedback. 

regards, 
Serigne.

Comment 11 Jakub Hrozek 2013-10-11 07:34:22 UTC
(In reply to Serigne Diagne from comment #10)
> Hi all, 
> 
> We are experiencing the same on Red Hat Enterprise Linux Server release 5.10
> (Tikanga) machine, does the fix apply for that release?
> 
> Thanks in advance for your feedback. 
> 
> regards, 
> Serigne.

There is no fix yet; this bugzilla is tracking a future improvement that is not implemented yet. Currently we recommend not using enumeration at all; in most cases it can be avoided.

Comment 12 Serigne Diagne 2013-10-11 08:37:08 UTC
(In reply to Jakub Hrozek from comment #11)
> There is no fix yet; this bugzilla is tracking a future improvement that is
> not implemented yet. Currently we recommend not using enumeration at all; in
> most cases it can be avoided.

Hi Jakub, 

Thanks for the prompt feedback. 
Just to make sure we are on the same page: I have checked the sssd.conf on the server, and enumeration is set to false. Is this what you mean by not using enumeration, or do I need to take additional actions to exclude it?

Serigne.

Comment 13 Dmitri Pal 2013-10-11 22:00:47 UTC
The term "server" might be a bit confusing here.
We are talking about enumeration in the sssd.conf on the client machine.
The enumeration should be turned off.
If it is turned off and you still see some issues, then we would need your sssd.conf and SSSD logs with debug_level set to 7.
Please open a support case and provide this information.

Comment 14 Jakub Hrozek 2013-10-15 18:53:07 UTC
Sorry for the late response, I was on PTO for a couple of days. Dmitri's questions are valid, I will just re-iterate one point -- enumeration should ideally be off (enumerate=false in the domain section of sssd.conf) on all clients unless there is an application that needs to see the whole list of users (i.e. calls setpwent/getpwent/endpwent) for one reason or another.
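An "application that needs to see the whole list of users" is anything that walks the NSS passwd database end to end. From the shell, getent with no key does exactly that (it calls setpwent/getpwent/endpwent under the hood), so it is a quick way to see whether enumeration matters on a given client:

```shell
# Enumerate the whole passwd database via NSS. With enumerate=true, SSSD
# contributes its cached LDAP users to this listing; with enumerate=false
# (the recommended setting), SSSD returns nothing for enumeration and only
# local /etc/passwd users appear.
getent passwd | wc -l

# A single-key lookup works either way -- enumeration is not needed for it:
getent passwd root
```

This is why enumerate=false is usually safe: most software looks up users by name or UID rather than walking the whole database.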

Comment 15 Jakub Hrozek 2016-01-08 11:21:56 UTC
*** Bug 1022242 has been marked as a duplicate of this bug. ***

Comment 16 Jakub Hrozek 2016-01-11 15:05:28 UTC
We are planning performance improvements for 7.3, however mostly aimed at the default, non-enumerating case. I'm marking this request as 'low'; chances are the changes will also help the enumerate case, but I don't think we should actively pursue enumeration improvements.

Comment 25 Amith 2016-09-22 13:01:35 UTC
Tested the BZ on SSSD Version: sssd-1.14.0-43.el7.x86_64

In this case, we have more than 15,000 users in the 389-ds LDAP server. Enumeration was turned ON, and SSSD is expected to fetch all the users. The following are the observations from the test:

1) Execution time for the initial (local) user login, before enumeration was turned ON: 11 secs (0.183 mins)

2) Execution time for a (local) user login, 20 secs after enumeration: 10 secs (0.166 mins)

3) Monitored CPU usage every 5 secs during the run:
    - 100% for the initial 5 secs
    - 0.0% after 20 secs

4) Total number of enumerated records: 4991
   - This is strange, as we expect all the users to be fetched and cached. Are there new changes in SSSD related to this? Does SSSD fetch users in batches of thousands of records? Does it have anything to do with paging in the 389-DS server?

5) Manually executed a for loop to fetch all the users:
   - for ((i=10000 ; i<=25000 ; i++)); do getent passwd -s sss puser$i; done
   - All the users were fetched; the SYSDB cache shows over 15000 records. So I believe user lookup works fine manually, but enumeration is not done properly by default.

6) To test the performance in terms of CPU %, I invalidated all the users in the cache (sss_cache -E) and then restarted the sssd service. Monitored CPU usage every 5 secs during the run; observed 100% usage for over 40 secs. However, strangely, the cache now holds only a little over 1000 records.

   - The question is: where have all the records gone that existed before the sssd service restart? Does invalidation have anything to do with it? Looks like a different bug to me!

Bottom line: the user login case and CPU % show much better performance than Kaushik reported on RHEL-6.4. However, I believe this is because enumeration is not done properly in RHEL-7.3.

Please advise.
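The per-user loop in point 5 can be wrapped in a small helper that counts how many users in a range actually resolve through the sss NSS source. The puser&lt;N&gt; naming follows the test setup above; running it for real requires a live SSSD with the sss module configured:

```shell
# Count how many users named puser<start>..puser<end> resolve via the
# sss NSS source. A complete cache should make the count equal to
# end - start + 1.
count_resolved() {
  local start=$1 end=$2 found=0
  for ((i = start; i <= end; i++)); do
    getent passwd -s sss "puser$i" > /dev/null && found=$((found + 1))
  done
  echo "$found"
}

# e.g.: count_resolved 10000 25000   # expect 15001 when fully cached
```

Comparing this count against the enumerated-record count from the sysdb cache makes the "4991 vs 15000+" discrepancy in points 4 and 5 easy to demonstrate.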

Comment 28 Amith 2016-09-23 19:39 UTC
Created attachment 1204257 [details]
SSSD Domain log file, refer comment 27

Comment 29 Amith 2016-09-26 19:23:16 UTC
Retested with timeout = 30; I still see the sssd_be process restarting. The sysdb cache looks updated, its size being 44 MB. However, when I run ldbsearch on it, the command fails with: search error - Indexed and full searches both failed!
Also, the timestamp cache, after returning approx. 30,000 records, fails with the same: search error - Indexed and full searches both failed!
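For reference, the caches in question live under /var/lib/sss/db and can be inspected with ldbsearch from the ldb-tools package. The failing commands were of this shape (domain name LDAP as configured above; stop sssd first to avoid lock contention):

```shell
# Build the on-disk cache paths for a given SSSD domain.
sysdb_cache()     { echo "/var/lib/sss/db/cache_$1.ldb"; }
timestamp_cache() { echo "/var/lib/sss/db/timestamps_$1.ldb"; }

# e.g., count records in each cache (requires ldb-tools):
#   ldbsearch -H "$(sysdb_cache LDAP)" | grep -c '^dn:'
#   ldbsearch -H "$(timestamp_cache LDAP)" | grep -c '^dn:'
```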

# service sssd status
Redirecting to /bin/systemctl status  sssd.service
Main PID: 32258 (sssd)
   CGroup: /system.slice/sssd.service
           ├─ 2784 /usr/libexec/sssd/sssd_be --domain LDAP --uid 0 --gid 0 --debug-to-files
           ├─32258 /usr/sbin/sssd -D -f
           ├─32260 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --debug-to-files
           └─32261 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --debug-to-files

Sep 26 06:23:28 auto-hv-02-guest01.idmqe.lab.eng.bos.redhat.com sssd[32258]: Starting up
Sep 26 06:23:28 auto-hv-02-guest01.idmqe.lab.eng.bos.redhat.com sssd[be[LDAP]][32259]: Starting up
Sep 26 06:23:28 auto-hv-02-guest01.idmqe.lab.eng.bos.redhat.com sssd[nss][32260]: Starting up
Sep 26 06:23:28 auto-hv-02-guest01.idmqe.lab.eng.bos.redhat.com sssd[pam][32261]: Starting up
Sep 26 06:23:28 auto-hv-02-guest01.idmqe.lab.eng.bos.redhat.com systemd[1]: Started System Security Services Daemon.
Sep 26 06:25:30 auto-hv-02-guest01.idmqe.lab.eng.bos.redhat.com sssd[be[LDAP]][342]: Starting up
Sep 26 06:27:31 auto-hv-02-guest01.idmqe.lab.eng.bos.redhat.com sssd[be[LDAP]][914]: Starting up
Sep 26 06:29:32 auto-hv-02-guest01.idmqe.lab.eng.bos.redhat.com sssd[be[LDAP]][1468]: Starting up
Sep 26 06:31:33 auto-hv-02-guest01.idmqe.lab.eng.bos.redhat.com sssd[be[LDAP]][1984]: Starting up
Sep 26 06:35:04 auto-hv-02-guest01.idmqe.lab.eng.bos.redhat.com sssd[be[LDAP]][2784]: Starting up

Comment 36 Abhinay Reddy Peddireddy 2016-10-16 06:00:22 UTC
One of my customers, who has enumeration enabled, has a similar issue. He is requesting an RFE for a new feature where one could enable enumeration for only specific groups (with full members) and use "ignore_group_members = True" for the other groups.

Can I expect a suggestion or reply for this request on this bugzilla itself, or should I raise a new RFE?

Comment 37 Jakub Hrozek 2016-10-16 15:53:51 UTC
(In reply to Abhinay Reddy Peddireddy from comment #36)
> There is a similar issue with one of my customer who has enumeration
> enabled. He is requesting to raise an RFE that ito  would be great to be
> able to have similar new feature where one could simply set enumeration for
> only specific groups (with full members) and for other groups use
> "ignore_group_members = True".
> 
> Can I expect any suggestion or reply for this request on this bugzilla
> itself or should I raise a new RFE ?

I don't like this feature to be honest. I would prefer to make sssd perform well by default and not add an option for everyone's particular case. I would suggest they try 7.3 first and see if using 7.3 makes a difference. Mounting the cache to tmpfs is a workaround that should also give the customer better performance without having to add new options.

btw this bugzilla is about enumerate=true performing badly. In this context, enumerate means downloading all the users and groups from the directory. This will always be quite slow and should be discouraged.
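The tmpfs workaround Jakub mentions can be sketched as an fstab entry; the size and SELinux context below are illustrative, not prescribed values. Since the cache is rebuilt from the server, nothing persistent is lost across reboots (stop sssd before mounting, then start it again):

```
# /etc/fstab -- keep the SSSD cache on tmpfs for faster cache writes
tmpfs  /var/lib/sss/db  tmpfs  size=300M,mode=0700,rootcontext=system_u:object_r:sssd_var_lib_t:s0  0 0
```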

Comment 41 Jakub Hrozek 2017-03-28 08:07:16 UTC
On one hand, I still agree with comment #40 in the sense that we did improve SSSD's performance in 7.3.0 and fixed issues related to the watchdog in 7.3.z.

However, we are still seeing performance issues with enumeration enabled in the current release and plan to address them in the next upstream release; therefore I'm reproposing the bugzilla for 7.5.0.

Comment 43 Jakub Hrozek 2017-08-10 15:02:32 UTC
QE: To verify the bug, enable enumeration in a large domain. Observe the sssd_be behaviour when the enumeration is running. It will probably spike to full CPU usage, but should return to idle much faster than in 7.4.

Also, large domains that would otherwise trigger the watchdog (thousands or tens of thousands of entries) should no longer trigger the watchdog.

When you get around to verifying the bug, please also ping us about the systemtap scripts; we should have merged those by then, and the scripts should be usable to verify that the parsing and saving is faster.

Comment 47 Amith 2018-01-31 20:51:59 UTC
Verified the bug on SSSD Version : sssd-1.16.0-14.el7.x86_64

Test Steps:
1. Configure sssd.conf with enumeration ON
2. Start SSSD service and capture CPU usage during initial enumeration.
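Step 2 can be scripted rather than eyeballed in top. A small sampler like the one below (interval and process name as used in this test) prints the %CPU of sssd_be at a fixed interval and produces readings in the ITR-xx style used in the tables that follow:

```shell
# Print the current %CPU of a process, as reported by ps.
# Usage: sample_cpu <pid>
sample_cpu() {
  ps -o %cpu= -p "$1" | tr -d ' '
}

# e.g., sample sssd_be every 20 seconds for 20 iterations:
#   pid=$(pgrep -f sssd_be | head -n1)
#   for i in $(seq 1 20); do
#     echo "ITR-$i: $(sample_cpu "$pid")"
#     sleep 20
#   done
```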

This bug was tested on RHEL-7.4 and RHEL-7.5.
The LDAP server was set up with over 25000 users and a large number of groups. The size limit and look-through limit on the LDAP server were set to "Unlimited".
The SSSD client was set up with the following configuration, with enumeration ON:

[sssd]
config_file_version = 2
services = nss, pam
domains = LDAP, LOCAL

[domain/LOCAL]
id_provider = local

[pam]
timeout = 60

[nss]
timeout = 60

[domain/LDAP]
id_provider = ldap
ldap_uri = ldap://SERVER
ldap_tls_cacert = /etc/openldap/certs/cacert.asc
enumerate = true
ldap_schema = rfc2307bis
timeout = 60

In the case of RHEL-7.4: 
------------------------------------------------------------------------------------------------------------------
1) The SSSD Version under test was sssd-1.15.2-50.el7_4.8.x86_64.
2) After enumeration, the total number of records returned from sysdb cache : 32020
3) Size of LDB files after enumeration:
total 195M
-rw-------. 1 root root 162M Jan 31 11:18 cache_LDAP.ldb
-rw-------. 1 root root 1.3M Jan 31 10:45 config.ldb
-rw-------. 1 root root 1.3M Jan 31 10:45 sssd.ldb
-rw-------. 1 root root  31M Jan 31 11:21 timestamps_LDAP.ldb

4) During enumeration, performance statistics were captured with the top utility every 20 seconds:

--------------------------------------------------------------------------------------
Tests:    PID  USER      PR   NI  VIRT    RES    SHR  S   %CPU   %MEM    TIME+  COMMAND
--------------------------------------------------------------------------------------
ITR-01:   7753 root      20   0  438264 150844   9140 R   93.8   3.9   0:08.83 sssd_be
 
ITR-02:   7753 root      20   0  525700 227136   9280 R   93.3   5.9   0:28.98 sssd_be
 
ITR-03:   7753 root      20   0  614156 298040  14048 R  100.0   7.7   0:49.14 sssd_be
 
ITR-04:   7753 root      20   0  672536 349816  11148 R  100.0   9.0   1:09.29 sssd_be
 
ITR-05:   7753 root      20   0  794700 429144  97728 R   93.8  11.1   1:25.48 sssd_be
 
ITR-06:   7753 root      20   0  794700 431024  99608 R  100.0  11.1   1:45.59 sssd_be
 
ITR-07:   7753 root      20   0  826108 423216  91800 R  100.0  10.9   2:05.69 sssd_be
 
ITR-08:   7753 root      20   0  826108 424380  92964 R  100.0  10.9   2:25.82 sssd_be
 
ITR-09:   7753 root      20   0  826108 436236 104820 R  100.0  11.2   2:45.97 sssd_be
 
ITR-10:   7753 root      20   0  842144 447932 100608 R   93.8  11.5   3:06.11 sssd_be
 
ITR-11:   7753 root      20   0  842144 449708 102384 R  100.0  11.6   3:26.26 sssd_be
 
ITR-12:   7753 root      20   0  842144 449252 101928 R  100.0  11.6   3:46.40 sssd_be
 
ITR-13:   7753 root      20   0  842144 447932 100608 R   93.8  11.5   4:06.53 sssd_be
 
ITR-14:   7753 root      20   0  856160 425356  77936 R  100.0  11.0   4:26.67 sssd_be
 
ITR-15:   7753 root      20   0  931776 414012  86392 S   0.0   10.7   4:31.27 sssd_be
 
ITR-16:   7753 root      20   0  931776 413104  85484 S   0.0   10.6   4:31.28 sssd_be
 
ITR-17:   7753 root      20   0  931776 414136  86516 S   0.0   10.7   4:31.28 sssd_be
 
ITR-18:   7753 root      20   0  931776 414136  86516 S   0.0   10.7   4:31.28 sssd_be
 
ITR-19:   7753 root      20   0  931776 414136  86516 S   0.0   10.7   4:31.28 sssd_be
 
ITR-20:   7753 root      20   0  931776 414136  86516 S   0.0   10.7   4:31.28 sssd_be
---------------------------------------------------------------------------------------


In the case of RHEL-7.5: 
---------------------------------------------------------------------------------------------------------------------
1) The SSSD Version under test was sssd-1.16.0-14.el7.x86_64.
2) After enumeration, the total number of records returned from sysdb cache : 32020
3) Size of LDB files after enumeration:
total 175M
-rw-------. 1 root root 148M Jan 31 13:48 cache_LDAP.ldb
-rw-------. 1 root root 1.3M Jan 31 13:38 config.ldb
-rw-------. 1 root root 1.3M Jan 31 13:48 sssd.ldb
-rw-------. 1 root root  24M Jan 31 13:40 timestamps_LDAP.ldb

4) During enumeration, performance statistics were captured with the top utility every 20 seconds:

--------------------------------------------------------------------------------------
Tests:    PID  USER      PR   NI  VIRT    RES    SHR  S   %CPU  %MEM    TIME+  COMMAND
--------------------------------------------------------------------------------------
ITR-01:  14196 root      20   0  446124 147280   8368 R   94.1  0.9   0:04.98 sssd_be
 
ITR-02:  14196 root      20   0  590520 276680  12384 R  100.0  1.7   0:25.13 sssd_be
 
ITR-03:  14196 root      20   0  684504 355828  16968 R  100.0  2.2   0:45.28 sssd_be
 
ITR-04:  14196 root      20   0  592928 230308  83992 R   87.5  1.4   1:01.86 sssd_be
 
ITR-05:  14196 root      20   0  680724 298164  86932 R  100.0  1.8   1:22.03 sssd_be
 
ITR-06:  14196 root      20   0  816864 438300  90920 R  100.0  2.7   1:42.19 sssd_be
 
ITR-07:  14196 root      20   0  816864 435680  88300 R  100.0  2.7   2:02.36 sssd_be
 
ITR-08:  14196 root      20   0  831444 441332  93952 R  100.0  2.7   2:22.51 sssd_be
 
ITR-09:  14196 root      20   0  845460 417560  70180 R  100.0  2.6   2:42.68 sssd_be
 
ITR-10:  14196 root      20   0  836200 334132  77316 S   0.0   2.1   2:55.37 sssd_be
 
ITR-11:  14196 root      20   0  836200 334132  77316 S   0.0   2.1   2:55.37 sssd_be
 
ITR-12:  14196 root      20   0  836200 334132  77316 S   0.0   2.1   2:55.37 sssd_be
 
ITR-13:  14196 root      20   0  836200 334132  77316 S   0.0   2.1   2:55.37 sssd_be
 
ITR-14:  14196 root      20   0  836200 334132  77316 S   0.0   2.1   2:55.37 sssd_be
 
ITR-15:  14196 root      20   0  836200 334132  77316 S   0.0   2.1   2:55.37 sssd_be
 
ITR-16:  14196 root      20   0  836200 334096  77280 S   0.0   2.1   2:55.38 sssd_be
 
ITR-17:  14196 root      20   0  836200 334096  77280 S   0.0   2.1   2:55.38 sssd_be
 
ITR-18:  14196 root      20   0  836200 334096  77280 S   0.0   2.1   2:55.38 sssd_be
 
ITR-19:  14196 root      20   0  836200 334096  77280 S   0.0   2.1   2:55.38 sssd_be
 
ITR-20:  14196 root      20   0  836200 334096  77280 S   0.0   2.1   2:55.38 sssd_be
---------------------------------------------------------------------------------------


Observations:
1. In the case of RHEL-7.4, the sssd_be process showed spiked CPU usage for approximately 300 seconds; with the latest build in RHEL-7.5 it took approximately 200 seconds to settle down.
2. %MEM usage of the sssd_be process is much lower in RHEL-7.5 than in RHEL-7.4.
3. The LDB files are smaller in RHEL-7.5 than in RHEL-7.4; the difference is almost 20 MB for the same set of records on the LDAP server.
4. Overall, the SSSD build in RHEL-7.5 shows better performance than RHEL-7.4.

Comment 50 errata-xmlrpc 2018-04-10 17:09:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0929

