Bug 1913199

Summary:	IPA server (389ds) is very slow in execution of some searches (`&(memberOf=...)(objectClass=ipaHost)` in particular)
Product:	Red Hat Enterprise Linux 8	Reporter:	Akshay Sakure <asakure>
Component:	389-ds-base	Assignee:	LDAP Maintainers <ldap-maint>
Status:	CLOSED ERRATA	QA Contact:	RHDS QE <ds-qe-bugs>
Severity:	urgent	Docs Contact:
Priority:	high
Version:	8.4	CC:	adumitru, atikhono, dminnich, dwysocha, grajaiya, jhrozek, ldap-maint, lslebodn, mreynolds, msauton, mzidek, pbrezina, pkulkarn, ppolawsk, rcritten, sbose, sgouvern, spichugi, tbordaz, tmihinto, toneata, tscherf, vvasilev
Target Milestone:	rc	Keywords:	Performance, ZStream
Target Release:	---	Flags:	toneata: needinfo+
Hardware:	x86_64
OS:	Linux
Whiteboard:	sync-to-jira
Fixed In Version:	389-ds-1.4-8060020211118202403.ce3e8c9c	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	2010976 2021998 2066801 (view as bug list)		Environment:
Last Closed:	2022-05-10 13:43:00 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2010976, 2021998, 2066801

Description Akshay Sakure 2021-01-06 09:00:17 UTC

Description of problem:
Improve performance in sssd to avoid slowness in sudo on IPA client

Version-Release number of selected component (if applicable):
RHEL 7.9
sssd-1.16.5-10.el7_9.5.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create sudo rules in IPA 
2. Create hostgroups in IPA with a large number of IPA clients (~500)
3. Run sudo on IPA client, it takes close to 10-15 seconds for execution.

Actual results:
Sudo taking longer time for execution on IPA clients.
Ansible Tower jobs are failing because sudo prompt takes a lot of time.
This mostly happens for the first time when information is not cached. 

Expected results:
Sudo should not take a longer time for execution on IPA clients.

Additional info:
This issue has already been discussed with IPA/SSSD developers, and this BZ is filed to keep track of this issue for future.

Comment 3 Alexey Tikhonov 2021-01-06 11:18:10 UTC

Quotes from the thread:

> Alexey Tikhonov:
  -- step:
(11:56:07:952043): [sdap_process_result] (0x2000): Trace: sh[0x561d901854e0], connected[1], ops[0x561d90179850], ldap[0x561d90184990]
(11:56:07:952047): [sdap_process_result] (0x2000): going to execute ldap_result()...
(11:56:07:986317): [sdap_process_result] (0x2000): ldap_result() was executed
(11:56:07:986334): [sdap_process_result] (0x2000): memset() was executed
...
(11:56:07:986465): [sdap_process_result] (0x2000): sdap_process_message() was executed, returning
  --  next step:
(11:56:07:986483): [sdap_process_result] (0x2000): Trace: sh[0x561d901854e0], connected[1], ops[0x561d90179850], ldap[0x561d90184990]
```
  --  most of the time is spent in `ldap_result()`  ~  almost all 34 msecs
This function is called with ` timeout == {0, 0}`. According to the man page this means polling.

Moreover, other invocations of `ldap_result()` in the same log are fast (within 0.1 ms or less). So it seems it depends on the content in the socket (or other internal openldap structures).

Frankly this seems like an issue in openldap lib. I think chances this could be addressed there are low, but... Simon, what would you think?

Probably another option could be to rework sssd code to use non-zero `all` parameter of `ldap_result()`...


> Sumit Bose:
If I understand the case correctly most of the time is spend in processing the results
in SSSD code. Given the number of results 34 msecs sounds reasonable to
me for decoding the LDAP reply and preparing the libldap structs holding
the results.


> Alexey Tikhonov:
This is 34 msec for every entry in 450 total. This results in 12..14 secs for processing BE_REQ_NETGROUP results.
That's why I say reading everything at once (`all` param of ldap_result()) might help (but not sure)


> Sumit Bose:
thanks for the explanation. I didn't realize that the ldap_result() call
was in a loop.

Yes, it might be more efficient to use LDAP_MSG_ALL at least for the
case where expect many results. But this requires an additional
ldap_first_message(), ldap_next_message() processing which we currently
can avoid because of LDAP_MSG_ONE. Maybe this can be integrated into
sdap_process_result() by adding a LDAPMessage member to struct
sdap_handle and iterate over the result return by ldap_next_message()
before calling ldap_result() again.

I guess it has to be tested if LDAP_MSG_ALL or LDAP_MSG_RECEIVED is
better suited for out use case. If I understand the latter correctly it
would return as many results as there are received at the time
ldap_result() is called. So I guess it might not block as long as
LDAP_MSG_ALL which should wait until everything is received. But it has
to be tested.

Comment 32 Florence Blanc-Renaud 2021-08-23 15:33:23 UTC

1. I checked the LDAP searches done when a sudo command is executed, and none is showing unindexed attributes (they would appear with notes=U or notes=A in the access log). This means we can eliminate unindexed searches as the root cause of the performance issue.

2. the LDAP searches are bound with fqdn=client.ipa.test,cn=computers,cn=accounts,dc=ipa,dc=test
Indeed we can see that the group membership takes significantly more time with this bind dn, compared to similar search bound with cn=directory manager:

(for a group with ~700 hosts):
Search authenticated as fqdn=host: RESULT err=0 tag=101 nentries=708 etime=0.184823245
----------------------------------
echo $PASSWORD | kinit -kt /etc/krb5.keytab; time ldapsearch -Y GSSAPI -b "cn=accounts,dc=ipa,dc=test" "(&(|(memberOf=cn=myhostgroup,cn=ng,cn=alt,dc=ipa,dc=test))(objectClass=ipaHost))" objectClass cn fqdn serverHostName memberOf ipaSshPubKey ipaUniqueID > /dev/null

real	0m0.251s
user	0m0.045s
sys	0m0.012s


Search authenticated as Dir Mgr: RESULT err=0 tag=101 nentries=708 etime=0.021605313
--------------------------------
time ldapsearch -D cn=directory\ manager -w Secret123 -b "cn=accounts,dc=ipa,dc=test" "(&(|(memberOf=cn=myhostgroup,cn=ng,cn=alt,dc=ipa,dc=test))(objectClass=ipaHost))" objectClass cn fqdn serverHostName memberOf ipaSshPubKey ipaUniqueID > /dev/null

real	0m0.073s
user	0m0.042s
sys	0m0.008s

The results are really far from the customer's observations and IMO it may be explained by the permissions added by the customer. Are the custom ACIs/permissions available somewhere in the case?

Comment 74 sgouvern 2021-12-15 11:05:12 UTC

With 389-ds-base-1.4.3.28-2.module+el8.6.0+13331+9eba6551.x86_64

Tested with the steps from https://bugzilla.redhat.com/show_bug.cgi?id=1913199#c54 :

With targetfilter cache enabled :

# time ldapsearch -LLL -h localhost -p 389 -D "uid=demo_user,ou=people,dc=example,dc=com" -w password -b "ou=aci,dc=example,dc=com"
...
real	0m0.436s
user	0m0.012s
sys	0m0.013s

In the access log, etime for the operation :0.395336762

[15/Dec/2021:05:51:34.156167709 -0500] conn=505 op=1 SRCH base="ou=aci,dc=example,dc=com" scope=2 filter="(objectClass=*)" attrs=ALL
[15/Dec/2021:05:51:34.551444685 -0500] conn=505 op=1 RESULT err=0 tag=101 nentries=501 wtime=0.000062420 optime=0.395276718 etime=0.395336762 notes=U details="Partially Unindexed Filter"


With targetfilter cache disabled :

# time ldapsearch -LLL -h localhost -p 389 -D "uid=demo_user,ou=people,dc=example,dc=com" -w password -b "ou=aci,dc=example,dc=com"
...
real	0m3.922s
user	0m0.032s
sys	0m0.006s


In the access log, etime for the operation : 3.881943696

[15/Dec/2021:05:53:52.076649623 -0500] conn=508 op=1 SRCH base="ou=aci,dc=example,dc=com" scope=2 filter="(objectClass=*)" attrs=ALL
[15/Dec/2021:05:53:55.958511049 -0500] conn=508 op=1 RESULT err=0 tag=101 nentries=501 wtime=0.000079718 optime=3.881866800 etime=3.881943696 notes=U details="Partially Unindexed Filter"

The operation is much faster when targetfilter cache is enabled.
Marking as verified:tested / VERIFIED

Comment 79 errata-xmlrpc 2022-05-10 13:43:00 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (389-ds:1.4 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1815