Bug 888739
| Summary: | Enumerating large number of users makes sssd_be hog the cpu for a long time. | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Kaushik Banerjee <kbanerje> | ||||
| Component: | sssd | Assignee: | SSSD Maintainers <sssd-maint> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Amith <apeetham> | ||||
| Severity: | low | Docs Contact: | Aneta Šteflová Petrová <apetrova> | ||||
| Priority: | low | ||||||
| Version: | 7.0 | CC: | apeddire, apeetham, apetrova, ccoursey, dpal, fjayalat, gary, grajaiya, jgalipea, jhrozek, john_brunelle, kperrier, lslebodn, mkosek, mniranja, mpanaous, ngaywood, pbrezina, serigne.gueye.diagne, sgoveas, sssd-maint, tscherf | ||||
| Target Milestone: | rc | ||||||
| Target Release: | --- | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | sssd-1.16.0-7.el7 | Doc Type: | Known Issue | ||||
| Doc Text: |
Enumerating a large number of users results in high CPU load and slows down other operations
When `enumerate=true` is set in the `etc/sssd/sssd.conf` file and a large number of users (for example, 30,000 users) are present in the LDAP server, certain performance problems occur:
* the *sssd_be* process consumes almost 99% of CPU resources
* certain operations, such as logging in as a local user or logging out, take unexpectedly long to complete
* running the *ldbsearch* operation on the *sysdb* and *timestamp* caches fails with an error reporting that the indexed and full searches both failed
Note that this is not a new known issue, as these problems occurred in previous releases of SSSD as well.
|
Story Points: | --- | ||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2018-04-10 17:09:10 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 1420851, 1467835, 1472344 | ||||||
| Attachments: |
|
||||||
|
Description
Kaushik Banerjee
2012-12-19 11:34:06 UTC
Upstream ticket: https://fedorahosted.org/sssd/ticket/1729 Kaushik, have you had a chance to test 6.3 packages on exactly the same setup? Do they also eat CPU? (In reply to comment #3) > Kaushik, have you had a chance to test 6.3 packages on exactly the same > setup? Do they also eat CPU? Just did a test on 6.3. Almost the same result. sssd_be CPU usage is at 100% since last 30mins and login as a local user took 10mins. I have observed this behaviour after updating from 6.2 to 6.3, and it only occurs when enumerate is set to true. When running with debug_level=8, I observed sssd_be loading our openldap db, then immediately doing so again, until I killed the sssd_be with sigkill. ie: (Tue Mar 26 13:20:42 2013) [sssd[be[default]]] [sdap_save_user] (0x2000): Adding originalDN [uid=xxxxxxx] to attributes of [gary]. (Tue Mar 26 13:20:42 2013) [sssd[be[default]]] [sdap_save_user] (0x1000): Original memberOf is not available for [gary]. (Tue Mar 26 13:20:42 2013) [sssd[be[default]]] [sdap_attrs_add_ldap_attr] (0x2000): Adding original mod-Timestamp [20130221151906Z] to attributes of [gary]. (Tue Mar 26 13:20:42 2013) [sssd[be[default]]] [sdap_save_user] (0x1000): Original USN value is not available for [gary]. (Tue Mar 26 13:20:42 2013) [sssd[be[default]]] [sdap_save_user] (0x1000): User principal is not available for [gary]. This repeats ~10 times for all of our users in 15 minutes. Turning enumerate off return sssd to normal behaviour. (In reply to comment #8) > I have observed this behaviour after updating from 6.2 to 6.3, and it only > occurs when enumerate is set to true. > Yes, this is quite expected and not really easily solvable with the current SSSD architecture. We are tracking an enhancement upstream to handle long-running requests better, but it's not going to happen soon. Keep in mind that enumerate=true can in some cases amount to saving a copy of your directory locally. > When running with debug_level=8, I observed sssd_be loading our openldap db, > then immediately doing so again, until I killed the sssd_be with sigkill. > > ie: > (Tue Mar 26 13:20:42 2013) [sssd[be[default]]] [sdap_save_user] (0x2000): > Adding originalDN [uid=xxxxxxx] to attributes of [gary]. > (Tue Mar 26 13:20:42 2013) [sssd[be[default]]] [sdap_save_user] (0x1000): > Original memberOf is not available for [gary]. > (Tue Mar 26 13:20:42 2013) [sssd[be[default]]] [sdap_attrs_add_ldap_attr] > (0x2000): Adding original mod-Timestamp [20130221151906Z] to attributes of > [gary]. > (Tue Mar 26 13:20:42 2013) [sssd[be[default]]] [sdap_save_user] (0x1000): > Original USN value is not available for [gary]. > (Tue Mar 26 13:20:42 2013) [sssd[be[default]]] [sdap_save_user] (0x1000): > User principal is not available for [gary]. > > This repeats ~10 times for all of our users in 15 minutes. > > Turning enumerate off return sssd to normal behaviour. enumerate=false is the recommended setting. It's currently not reflected in the man pages, I think, but it's going to be in the next release. Hi all, We are experiencing the same on Red Hat Enterprise Linux Server release 5.10 (Tikanga) machine, does the fix apply for that release? Thanks in advance for your feedback. regards, Serigne. (In reply to Serigne Diagne from comment #10) > Hi all, > > We are experiencing the same on Red Hat Enterprise Linux Server release 5.10 > (Tikanga) machine, does the fix apply for that release? > > Thanks in advance for your feedback. > > regards, > Serigne. There is no fix yet, this bugzilla is tracking a future improvement that is not implemented yet. Currently we recommend to not use enumeration at all, in most cases it can be avoided. Hi(In reply to Jakub Hrozek from comment #11) > (In reply to Serigne Diagne from comment #10) > Hi all, > > We are > experiencing the same on Red Hat Enterprise Linux Server release 5.10 > > (Tikanga) machine, does the fix apply for that release? > > Thanks in > advance for your feedback. > > regards, > Serigne. There is no fix yet, > this bugzilla is tracking a future improvement that is not implemented yet. > Currently we recommend to not use enumeration at all, in most cases it can > be avoided. Hi Jakub, Thanks for the prompt feedback. Just to make sure we are on the same page: I hace checked the sssd.conf on the server the enumeration is set to false. Is this what you mean by not using the enumeration or do I need some additional actions to exclude the enumeration? Serigne. The term "server" might be a bit confusing here. We are talking about enumeration in the sssd.conf on the client machine. The enumeration should be turned off. If it is turned off and you still see some issues the we would need your sssd.conf and SSSD logs with a debug_level at 7. Please open a support case and provide this information. Sorry for the late response, I was on PTO for a couple of days. Dmitri's questions are valid, I will just re-iterate one point -- enumeration should ideally be off (enumerate=true in the domain section of the sssd.conf) on all clients unless there is an application that needs to see the whole list of users (aka call setpwent/getpwent/endpwent) for one reason or another. *** Bug 1022242 has been marked as a duplicate of this bug. *** We are planning performance improvements for 7.3, however mostly aimed at the default case which is non-enumerate. I'm marking this request as 'low', chances the changes would also help the enumerate case, but I don't think we should actively pursue enumeration improvements.
Tested the BZ on SSSD Version: sssd-1.14.0-43.el7.x86_64
In this case, we have more than 15,000 users in 389-ds ldap server. Enumeration turned ON and it is expected that SSSD fetches all the users. Following are the observations from the test:
1) Execution time for initial user (local) login, before enumeration turned ON : 11 secs OR .183 mins
2) Execution time for user (local) login, 20 secs after enumeration : 10 secs OR .166 mins
3) Monitored CPU usage every 5 secs during the run:
- 100 % for the initial 5 secs
- 0,0 % after 20 secs
4) Total number of enumerated records: 4991 records
- This is strange as we expect all the users to be fetched and cached. Do we have new changes in SSSD related to this? Does SSSD fetches users in batches of 1000's of records? Does it has anything to do with paging in 389-DS server?
5) Manually executed a for loop to fetch all the users:
- for ((i=10000 ; i<=25000 ; i++)); do getent passwd -s sss puser$i; done
- All the users were fetched. SYSDB cache shows over 15000 records. So, i believe user look works fine manually but it is not enumerated properly by default.
6) In order to test the performance in terms of CPU%, I invalidated all the users in cache, sss_cache -E.
and then restarted sssd service. Monitored CPU usage every 5 secs during the run. Observed 100% usage for over 40 secs. However, strangely the cache has only over 1000 records and updated.
- Question is where has all the records gone, which existed before sssd service restart ? Does invalidation has anything to do with it ? Looks like a different bug to me !!
Bottom line is user login case and CPU % shows much better performance than reported in rhel-6.4 by kaushik. However i believe it has to do with enumeration not properly done in rhel-7.3.
Please advice.
Created attachment 1204257 [details] SSSD Domain log file, refer comment 27
Retested with timeout = 30, I still see sssd_be process restarting. Looks like the sysdb cache is updated, the size being 44 Mb. However when i run ldbsearh on it, command fails with search error - Indexed and full searches both failed!
Also, timestamp cache after returning approx. 30000 records fails with : search error - Indexed and full searches both failed!
# service sssd status
Redirecting to /bin/systemctl status sssd.service
Main PID: 32258 (sssd)
CGroup: /system.slice/sssd.service
├─ 2784 /usr/libexec/sssd/sssd_be --domain LDAP --uid 0 --gid 0 --debug-to-files
├─32258 /usr/sbin/sssd -D -f
├─32260 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --debug-to-files
└─32261 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --debug-to-files
Sep 26 06:23:28 auto-hv-02-guest01.idmqe.lab.eng.bos.redhat.com sssd[32258]: Starting up
Sep 26 06:23:28 auto-hv-02-guest01.idmqe.lab.eng.bos.redhat.com sssd[be[LDAP]][32259]: Starting up
Sep 26 06:23:28 auto-hv-02-guest01.idmqe.lab.eng.bos.redhat.com sssd[nss][32260]: Starting up
Sep 26 06:23:28 auto-hv-02-guest01.idmqe.lab.eng.bos.redhat.com sssd[pam][32261]: Starting up
Sep 26 06:23:28 auto-hv-02-guest01.idmqe.lab.eng.bos.redhat.com systemd[1]: Started System Security Services Daemon.
Sep 26 06:25:30 auto-hv-02-guest01.idmqe.lab.eng.bos.redhat.com sssd[be[LDAP]][342]: Starting up
Sep 26 06:27:31 auto-hv-02-guest01.idmqe.lab.eng.bos.redhat.com sssd[be[LDAP]][914]: Starting up
Sep 26 06:29:32 auto-hv-02-guest01.idmqe.lab.eng.bos.redhat.com sssd[be[LDAP]][1468]: Starting up
Sep 26 06:31:33 auto-hv-02-guest01.idmqe.lab.eng.bos.redhat.com sssd[be[LDAP]][1984]: Starting up
Sep 26 06:35:04 auto-hv-02-guest01.idmqe.lab.eng.bos.redhat.com sssd[be[LDAP]][2784]: Starting up
There is a similar issue with one of my customer who has enumeration enabled. He is requesting to raise an RFE that ito would be great to be able to have similar new feature where one could simply set enumeration for only specific groups (with full members) and for other groups use "ignore_group_members = True". Can I expect any suggestion or reply for this request on this bugzilla itself or should I raise a new RFE ? (In reply to Abhinay Reddy Peddireddy from comment #36) > There is a similar issue with one of my customer who has enumeration > enabled. He is requesting to raise an RFE that ito would be great to be > able to have similar new feature where one could simply set enumeration for > only specific groups (with full members) and for other groups use > "ignore_group_members = True". > > Can I expect any suggestion or reply for this request on this bugzilla > itself or should I raise a new RFE ? I don't like this feature to be honest. I would prefer to make sssd perform well by default and not add an option for everyone's particular case. I would suggest they try 7.3 first and see if using 7.3 makes a difference. Mounting the cache to tmpfs is a workaround that should also give the customer better performance without having to add new options. btw this bugzilla is about enumerate=true performing badly. In this context, enumerate means downloading all the users and groups from the directory. This will always be quite slow and should be discouraged. On one hand I still agree with comment #40 in the sense that we did improve performance of SSSD in 7.3.0 and fixed issues related to watchdog in 7.3.z. However, we are still seeing performance issues with enumeration enabled with the current release and have plans on addressing in the next upstream release therefore I'm reproposing the bugzilla to 7.5.0. QE: To verify the bug, enable enumeration in a large domain. Observe the sssd_be behaviour when the enumeration is running. It will probably spike to full CPU usage, but should return to idle much faster than in 7.4. Also, large domains that would otherwise trigger the watchdog (thousands or tens of thousands of entries) should no longer trigger the watchdog. When you get around to verifying the bug, please also ping us about systemtap scripts, we should merge those by then and the scripts should be usable to verify that the parsing and saving is faster. Verified the bug on SSSD Version : sssd-1.16.0-14.el7.x86_64 Test Steps: 1. Configure sssd.conf with enumeration ON 2. Start SSSD service and capture CPU usage during initial enumeration. This bug was tested on RHEL-7.4 and RHEL-7.5. LDAP server was setup with over 25000 users and large number of groups. Size limit and look through limit in ldap server were modified to "Unlimited". SSSD Client was setup with following configuration with enumeration ON: [sssd] config_file_version = 2 services = nss, pam domains = LDAP, LOCAL [domain/LOCAL] id_provider = local [pam] timeout = 60 [nss] timeout = 60 [domain/LDAP] id_provider = ldap ldap_uri = ldap://SERVER ldap_tls_cacert = /etc/openldap/certs/cacert.asc enumerate = true ldap_schema = rfc2307bis timeout = 60 In the case of RHEL-7.4: ------------------------------------------------------------------------------------------------------------------ 1) The SSSD Version under test was sssd-1.15.2-50.el7_4.8.x86_64. 2) After enumeration, the total number of records returned from sysdb cache : 32020 3) Size of LDB files after enumeration: total 195M -rw-------. 1 root root 162M Jan 31 11:18 cache_LDAP.ldb -rw-------. 1 root root 1.3M Jan 31 10:45 config.ldb -rw-------. 1 root root 1.3M Jan 31 10:45 sssd.ldb -rw-------. 1 root root 31M Jan 31 11:21 timestamps_LDAP.ldb 4) During enumeration, the performance statistics captured using TOP utility every 20 seconds (iteration of 20 secs): -------------------------------------------------------------------------------------- Tests: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND -------------------------------------------------------------------------------------- ITR-01: 7753 root 20 0 438264 150844 9140 R 93.8 3.9 0:08.83 sssd_be ITR-02: 7753 root 20 0 525700 227136 9280 R 93.3 5.9 0:28.98 sssd_be ITR-03: 7753 root 20 0 614156 298040 14048 R 100.0 7.7 0:49.14 sssd_be ITR-04: 7753 root 20 0 672536 349816 11148 R 100.0 9.0 1:09.29 sssd_be ITR-05: 7753 root 20 0 794700 429144 97728 R 93.8 11.1 1:25.48 sssd_be ITR-06: 7753 root 20 0 794700 431024 99608 R 100.0 11.1 1:45.59 sssd_be ITR-07: 7753 root 20 0 826108 423216 91800 R 100.0 10.9 2:05.69 sssd_be ITR-08: 7753 root 20 0 826108 424380 92964 R 100.0 10.9 2:25.82 sssd_be ITR-09: 7753 root 20 0 826108 436236 104820 R 100.0 11.2 2:45.97 sssd_be ITR-10: 7753 root 20 0 842144 447932 100608 R 93.8 11.5 3:06.11 sssd_be ITR-11: 7753 root 20 0 842144 449708 102384 R 100.0 11.6 3:26.26 sssd_be ITR-12: 7753 root 20 0 842144 449252 101928 R 100.0 11.6 3:46.40 sssd_be ITR-13: 7753 root 20 0 842144 447932 100608 R 93.8 11.5 4:06.53 sssd_be ITR-14: 7753 root 20 0 856160 425356 77936 R 100.0 11.0 4:26.67 sssd_be ITR-15: 7753 root 20 0 931776 414012 86392 S 0.0 10.7 4:31.27 sssd_be ITR-16: 7753 root 20 0 931776 413104 85484 S 0.0 10.6 4:31.28 sssd_be ITR-17: 7753 root 20 0 931776 414136 86516 S 0.0 10.7 4:31.28 sssd_be ITR-18: 7753 root 20 0 931776 414136 86516 S 0.0 10.7 4:31.28 sssd_be ITR-19: 7753 root 20 0 931776 414136 86516 S 0.0 10.7 4:31.28 sssd_be ITR-20: 7753 root 20 0 931776 414136 86516 S 0.0 10.7 4:31.28 sssd_be --------------------------------------------------------------------------------------- In the case of RHEL-7.5: --------------------------------------------------------------------------------------------------------------------- 1) The SSSD Version under test was sssd-1.16.0-14.el7.x86_64. 2) After enumeration, the total number of records returned from sysdb cache : 32020 3) Size of LDB files after enumeration: total 175M -rw-------. 1 root root 148M Jan 31 13:48 cache_LDAP.ldb -rw-------. 1 root root 1.3M Jan 31 13:38 config.ldb -rw-------. 1 root root 1.3M Jan 31 13:48 sssd.ldb -rw-------. 1 root root 24M Jan 31 13:40 timestamps_LDAP.ldb 4) During enumeration, the performance statistics captured using TOP utility every 20 seconds (iteration of 20 secs): -------------------------------------------------------------------------------------- Tests: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND -------------------------------------------------------------------------------------- ITR-01: 14196 root 20 0 446124 147280 8368 R 94.1 0.9 0:04.98 sssd_be ITR-02: 14196 root 20 0 590520 276680 12384 R 100.0 1.7 0:25.13 sssd_be ITR-03: 14196 root 20 0 684504 355828 16968 R 100.0 2.2 0:45.28 sssd_be ITR-04: 14196 root 20 0 592928 230308 83992 R 87.5 1.4 1:01.86 sssd_be ITR-05: 14196 root 20 0 680724 298164 86932 R 100.0 1.8 1:22.03 sssd_be ITR-06: 14196 root 20 0 816864 438300 90920 R 100.0 2.7 1:42.19 sssd_be ITR-07: 14196 root 20 0 816864 435680 88300 R 100.0 2.7 2:02.36 sssd_be ITR-08: 14196 root 20 0 831444 441332 93952 R 100.0 2.7 2:22.51 sssd_be ITR-09: 14196 root 20 0 845460 417560 70180 R 100.0 2.6 2:42.68 sssd_be ITR-10: 14196 root 20 0 836200 334132 77316 S 0.0 2.1 2:55.37 sssd_be ITR-11: 14196 root 20 0 836200 334132 77316 S 0.0 2.1 2:55.37 sssd_be ITR-12: 14196 root 20 0 836200 334132 77316 S 0.0 2.1 2:55.37 sssd_be ITR-13: 14196 root 20 0 836200 334132 77316 S 0.0 2.1 2:55.37 sssd_be ITR-14: 14196 root 20 0 836200 334132 77316 S 0.0 2.1 2:55.37 sssd_be ITR-15: 14196 root 20 0 836200 334132 77316 S 0.0 2.1 2:55.37 sssd_be ITR-16: 14196 root 20 0 836200 334096 77280 S 0.0 2.1 2:55.38 sssd_be ITR-17: 14196 root 20 0 836200 334096 77280 S 0.0 2.1 2:55.38 sssd_be ITR-18: 14196 root 20 0 836200 334096 77280 S 0.0 2.1 2:55.38 sssd_be ITR-19: 14196 root 20 0 836200 334096 77280 S 0.0 2.1 2:55.38 sssd_be ITR-20: 14196 root 20 0 836200 334096 77280 S 0.0 2.1 2:55.38 sssd_be --------------------------------------------------------------------------------------- Observation: 1. In the case of RHEL-7.4, the sssd_be process shows spiked CPU usage for approximately 300 seconds, however in the case of latest build in RHEL-7.5 it took approx. 200 seconds to settle down. 2. %MEM usage for sssd_be process is much less in RHEL-7.5, when compared with RHEL-7.4. 3. Size of ldb files is less in RHEL-7.5 than RHEL-7.4. Difference is almost 20 MB for the same set of records in LDAP server. 4. Overall SSSD build in RHEL-7.5 shows better performance than RHEL-7.4. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:0929 |