1375182 – SSSD goes offline when the LDAP server returns sizelimit exceeded

Summary: SSSD goes offline when the LDAP server returns sizelimit exceeded

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	sssd
Sub Component:
Version:	7.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	SSSD Maintainers
QA Contact:	Steeve Goveas
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-09-12 12:25 UTC by Amith
Modified:	2020-05-02 18:29 UTC
CC List:	8 users
Fixed In Version:	sssd-1.14.0-41.el7
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-11-04 07:21:29 UTC
Target Upstream Version:
Embargoed:

Attachments
SSSD Domain log for this issue. (4.53 MB, text/plain) 2016-09-12 12:25 UTC, Amith

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	SSSD sssd issues 4218	0	None	closed	SSSD goes offline when the LDAP server returns sizelimit exceeded	2021-02-12 13:12:50 UTC
Red Hat Product Errata	RHEA-2016:2476	0	normal	SHIPPED_LIVE	sssd bug fix and enhancement update	2016-11-03 14:08:11 UTC

Description Amith 2016-09-12 12:25:05 UTC

Created attachment 1200199
SSSD Domain log for this issue.

Description of problem:
This issue was observed during a regression round of the existing performance suite. The test fails at a reproduction step taken from https://bugzilla.redhat.com/show_bug.cgi?id=889182#c2:

for i in `getent group someverylargegroup | tr ',' ' '`; do id $i; done

The test environment has approximately 16000 users (puser10000 to puser26000) spread across 3 large groups. Individual user lookups are also slow, for example for puser15677 or puser25788.

The test retrieves the large group and then runs the "id" command on each member. Group retrieval works fine, but the id command succeeds only for the first user and fails for all the others. The workaround, which I applied manually, is to first run "getent passwd -s sss <user>" and then run the id command.

Version-Release number of selected component (if applicable):
sssd-1.14.0-36.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Set up a 389-DS server with a large group, for example 5000 users in one bulk group.

2. Set up an SSSD client with the LDAP provider, using the sssd.conf below:

[sssd]
config_file_version = 2
services = nss, pam
domains = LDAP

[nss]
filter_groups = root
filter_users = root

[pam]

[domain/LDAP]
debug_level = 0xFFF0
id_provider = ldap
ldap_uri = ldap://<SERVER>
ldap_tls_cacert = /etc/openldap/certs/cacert.asc

3. Run the following in a script:
for i in `getent group bulkgroup1 | tr ',' ' '`; do id $i; done
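
A note on the loop above: it splits the entire `getent group` line on commas, so the first token still carries the `name:password:gid:` prefix (the Beaker log in comment 8 shows the loop beginning with `bulkgroup1:*:9997:puser11000`). Below is a minimal Python sketch of extracting only the member names. `group_members` is a hypothetical helper, not part of SSSD or the test suite, and the sample line is assembled from names that appear in this report.

```python
from typing import List


def group_members(getent_line: str) -> List[str]:
    """Return the member names from one `getent group` output line.

    The line is expected to look like: name:password:gid:member1,member2,...
    """
    # Split into at most four fields so the member list stays in one piece.
    fields = getent_line.rstrip("\n").split(":", 3)
    if len(fields) != 4:
        raise ValueError(f"not a group entry: {getent_line!r}")
    # An empty member field yields an empty list rather than [''].
    return [member for member in fields[3].split(",") if member]


# Illustrative input only; the real group has thousands of members.
line = "bulkgroup1:*:9997:puser11000,puser11001,puser11002"
print(group_members(line))
```

Each returned name could then be passed to `id` on its own, which keeps the group header out of the lookups.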

Actual results:
The id command fails for all users except the first one.

Expected results:
The id command should succeed for every user.

Additional info:
SSSD domain log attached.

Comment 1 Jakub Hrozek 2016-09-12 12:55:54 UTC

Well, SSSD seems to be going offline, because there's too much data being returned:
(Mon Sep 12 15:25:59 2016) [sssd[be[LDAP]]] [sdap_get_generic_op_finished] (0x0040): Unexpected result from ldap: Administrative limit exceeded(11), no errmsg set
(Mon Sep 12 15:25:59 2016) [sssd[be[LDAP]]] [sdap_op_destructor] (0x2000): Operation 4 finished
(Mon Sep 12 15:25:59 2016) [sssd[be[LDAP]]] [generic_ext_search_handler] (0x0040): sdap_get_generic_ext_recv failed [5]: Input/output error
(Mon Sep 12 15:25:59 2016) [sssd[be[LDAP]]] [sdap_get_users_done] (0x0040): Failed to retrieve users [5][Input/output error].
(Mon Sep 12 15:25:59 2016) [sssd[be[LDAP]]] [sdap_id_op_done] (0x0200): communication error on cached connection, moving to next server
(Mon Sep 12 15:25:59 2016) [sssd[be[LDAP]]] [sdap_id_op_done] (0x4000): advising for connection retry #1
(Mon Sep 12 15:25:59 2016) [sssd[be[LDAP]]] [sdap_id_op_done] (0x4000): releasing operation connection
(Mon Sep 12 15:25:59 2016) [sssd[be[LDAP]]] [sdap_id_release_conn_data] (0x4000): releasing unused connection
(Mon Sep 12 15:25:59 2016) [sssd[be[LDAP]]] [sdap_handle_release] (0x2000): Trace: sh[0x7f460fc81d20], connected[1], ops[(nil)], ldap[0x7f460fc75570], destructor_lock[0], release_memory[0]
(Mon Sep 12 15:25:59 2016) [sssd[be[LDAP]]] [remove_connection_callback] (0x4000): Successfully removed connection callback.
(Mon Sep 12 15:25:59 2016) [sssd[be[LDAP]]] [sdap_id_op_connect_step] (0x4000): beginning to connect
(Mon Sep 12 15:25:59 2016) [sssd[be[LDAP]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'LDAP'
(Mon Sep 12 15:25:59 2016) [sssd[be[LDAP]]] [get_server_status] (0x1000): Status of server 'hp-dl380pgen8-02-vm-16.lab.bos.redhat.com' is 'working'
(Mon Sep 12 15:25:59 2016) [sssd[be[LDAP]]] [get_port_status] (0x1000): Port status of port 389 for server 'hp-dl380pgen8-02-vm-16.lab.bos.redhat.com' is 'not working'
(Mon Sep 12 15:25:59 2016) [sssd[be[LDAP]]] [fo_resolve_service_send] (0x0020): No available servers for service 'LDAP'
(Mon Sep 12 15:25:59 2016) [sssd[be[LDAP]]] [be_resolve_server_done] (0x1000): Server resolution failed: [5]: Input/output error
(Mon Sep 12 15:25:59 2016) [sssd[be[LDAP]]] [sdap_id_op_connect_done] (0x0020): Failed to connect, going offline (5 [Input/output error])

Is the server set up with paging?
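
For anyone checking their own domain log for the same sequence, here is a rough Python sketch that looks for a "limit exceeded" result followed later by a "going offline" message. The line layout in the regular expression is inferred from the excerpt above and may not match every SSSD version or debug level. `parse` and `offline_after_limit` are hypothetical names, not SSSD tooling.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Assumed layout, inferred from the excerpt in this comment:
# (timestamp) [sssd[be[DOMAIN]]] [function] (0xLEVEL): message
LOG_LINE = re.compile(
    r"^\((?P<ts>[^)]*)\) \[sssd\[be\[(?P<domain>[^\]]+)\]\]\] "
    r"\[(?P<func>[^\]]+)\] \((?P<level>0x[0-9a-fA-F]+)\): (?P<msg>.*)$"
)


class Entry(NamedTuple):
    ts: str
    domain: str
    func: str
    level: int
    msg: str


def parse(line: str) -> Optional[Entry]:
    """Parse one backend log line, or return None if it does not match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    return Entry(m["ts"], m["domain"], m["func"], int(m["level"], 16), m["msg"])


def offline_after_limit(lines: Iterable[str]) -> bool:
    """True if a 'limit exceeded' result is later followed by 'going offline'."""
    limit_seen = False
    for line in lines:
        entry = parse(line)
        if entry is None:
            continue  # skip lines that do not follow the assumed layout
        if "limit exceeded" in entry.msg:
            limit_seen = True
        elif limit_seen and "going offline" in entry.msg:
            return True
    return False


# Two lines copied from the excerpt above.
sample = [
    "(Mon Sep 12 15:25:59 2016) [sssd[be[LDAP]]] [sdap_get_generic_op_finished] "
    "(0x0040): Unexpected result from ldap: Administrative limit exceeded(11), no errmsg set",
    "(Mon Sep 12 15:25:59 2016) [sssd[be[LDAP]]] [sdap_id_op_connect_done] "
    "(0x0020): Failed to connect, going offline (5 [Input/output error])",
]
print(offline_after_limit(sample))
```

In practice the lines would come from the domain log file rather than a hard-coded list.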

Comment 5 Lukas Slebodnik 2016-09-13 06:36:52 UTC

Upstream ticket:
https://fedorahosted.org/sssd/ticket/3185

Comment 6 Jakub Hrozek 2016-09-14 09:15:19 UTC

* master: 3319d964721396c07daba383ded6aaaf33ed6e3b

Comment 8 Amith 2016-09-21 15:25:46 UTC

Verified the bug on SSSD Version: sssd-1.14.0-42.el7.x86_64

This bug was logged because of failures in the existing "SSSD Performance" test suite during the regression rounds. The fix was verified successfully with the latest SSSD build. See the Beaker job log below:

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [   LOG    ] :: performance_01: Bz617623 - SSSD suffers from serious performance issues on initgroup calls
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

:: [   LOG    ] :: Sleeping for 5 seconds
:: [   PASS   ] :: Command '( time ./initgroups.test puser10999 ) > /tmp/output 2>&1' (Expected 0, got 0)
:: [   PASS   ] :: User puser10999 returned in less than 5 seconds 
:: [   PASS   ] :: Command '( time ./initgroups.test puser15999 ) > /tmp/output 2>&1' (Expected 0, got 0)
:: [   PASS   ] :: User puser15999 returned in less than 5 seconds 
:: [   PASS   ] :: Command 'getent -s sss passwd puser15999' (Expected 0, got 0)
:: [   PASS   ] :: Command '( time ./initgroups.test puser25999 ) > /tmp/output 2>&1' (Expected 0, got 0)
:: [   PASS   ] :: User puser25999 returned in less than 5 seconds 
:: [   PASS   ] :: Command 'getent -s sss passwd puser25999' (Expected 0, got 0)
:: [   LOG    ] :: Duration: 8s
:: [   LOG    ] :: Assertions: 8 good, 0 bad
:: [   PASS   ] :: RESULT: performance_01: Bz617623 - SSSD suffers from serious performance issues on initgroup calls

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [   LOG    ] :: performance_02: bz889182 and 888800 - crash in memory cache
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

:: [   LOG    ] :: Sleeping for 5 seconds
:: [   PASS   ] :: Command 'for i in bulkgroup1:*:9997:puser11000
.
.
.
puser10930 puser10931 puser10932 puser10933 puser10934 puser10935 puser10936 puser10937 puser10938 puser10939 puser10940 puser10941 puser10942 puser10943 puser10944 puser10945 puser10946 puser10947 puser10948 puser10949 puser10950 puser10951 puser10952 puser10953 puser10954 puser10955 puser10956 puser10957 puser10958 puser10959 puser10960 puser10961 puser10962 puser10963 puser10964 puser10965 puser10966 puser10967 puser10968 puser10969 puser10970 puser10971 puser10972 puser10973 puser10974 puser10975 puser10976 puser10977 puser10978 puser10979 puser10980 puser10981 puser10982 puser10983 puser10984 puser10985 puser10986 puser10987 puser10988 puser10989 puser10990 puser10991 puser10992 puser10993 puser10994 puser10995 puser10996 puser10997 puser10998 puser10999; do id $i; done' (Expected 0, got 0)
:: [   PASS   ] :: File '/var/log/messages' should not contain 'segfault' 
:: [   PASS   ] :: Looking up 1000 users increased sssd_nss memory usage by 3200 kB 
:: [   LOG    ] :: Sleeping for 5 seconds
:: [   PASS   ] :: Command 'getent group bulkgroup2' (Expected 0, got 0)
:: [   PASS   ] :: Command 'id puser11001' (Expected 0, got 0)
:: [   PASS   ] :: Command 'id puser11002' (Expected 0, got 0)
.
.
.
:: [   PASS   ] :: Command 'id puser15995' (Expected 0, got 0)
:: [   PASS   ] :: Command 'id puser15996' (Expected 0, got 0)
:: [   PASS   ] :: Command 'id puser15997' (Expected 0, got 0)
:: [   PASS   ] :: Command 'id puser15998' (Expected 0, got 0)
:: [   PASS   ] :: Command 'id puser15999' (Expected 0, got 0)
:: [   PASS   ] :: Command 'id puser16000' (Expected 0, got 0)
:: [   PASS   ] :: id lookup on next 1000 users increased sssd_nss memory usage by 6564 kB 
:: [   PASS   ] :: File '/var/log/messages' should not contain 'segfault' 
:: [   LOG    ] :: Duration: 1h 33m 59s
:: [   LOG    ] :: Assertions: 5010 good, 0 bad
:: [   PASS   ] :: RESULT: performance_02: bz889182 and 888800 - crash in memory cache

Comment 10 errata-xmlrpc 2016-11-04 07:21:29 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2476.html