Bug 1625937

Summary: sssd restarts unexpectedly when enumeration is set to true unless the timeout is raised from its default value
Product: Red Hat Enterprise Linux 8 Reporter: German Parente <gparente>
Component: sssdAssignee: Alexey Tikhonov <atikhono>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Jakub Vavra <jvavra>
Severity: low Docs Contact:
Priority: low    
Version: 8.2CC: aboscatt, atikhono, grajaiya, jhrozek, lslebodn, mkosek, mpanaous, mzidek, pbrezina, sbose, sssd-maint, thalman, tmihinto, tscherf
Target Milestone: rcKeywords: Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-05-09 19:43:33 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description German Parente 2018-09-06 09:13:04 UTC
Description of problem:

we have a customer case with sssd-ad integration with enumeration set to true.
The process is restarted unexpectedly when performing enumeration after the whole set of users is saved into the cache and before processing the groups.

The process does not restart when enumeration is set to false, or with a greater timeout.

I will attach internally the logs.



Version-Release number of selected component (if applicable): sssd-1.16.0-19.el7_5.5.x86_64



How reproducible: always (customer site)

Comment 11 Alexey Tikhonov 2019-09-27 17:57:18 UTC
There are two issues can be seen in those logs:


(I) Performance degradation of enumeration task in RHEL7 vs RHEL6 SSSD version (if customer claim of the same env is valid)

(1) Fetching data:

RHEL6:
> (Wed Sep  5 12:42:02 2018) [sssd[be[AUDI.VWG]]] [be_ptask_execute] (0x0400): Task [enumeration]: executing task, timeout 300 seconds
> ...
> (Wed Sep  5 12:42:22 2018) [sssd[be[AUDI.VWG]]] [sdap_search_user_process] (0x0400): Search for users, returned 12012 results.
  --  fetching data takes 20 seconds

RHEL7:
> (Wed Sep  5 12:42:30 2018) [sssd[be[AUDI.VWG]]] [be_ptask_execute] (0x0400): Task [enumeration]: executing task, timeout 300 seconds
> ...
> (Wed Sep  5 12:42:56 2018) [sssd[be[AUDI.VWG]]] [sdap_search_user_process] (0x0400): Search for users, returned 12012 results.
  --  fetching data takes 26 seconds

So RHEL7 is 30% slower than RHEL6 version. That not that terrible yet.


(2) Saving data:

RHEL6:
> (Wed Sep  5 12:42:22 2018) [sssd[be[AUDI.VWG]]] [sdap_save_users] (0x4000): User 0 processed!
> ...
> (Wed Sep  5 12:42:38 2018) [sssd[be[AUDI.VWG]]] [sdap_get_users_done] (0x4000): Saving 12012 Users - Done
  --  16 seconds

RHEL7:
> (Wed Sep  5 12:42:56 2018) [sssd[be[AUDI.VWG]]] [sdap_save_users] (0x4000): User 0 processed!
> ...
> (Wed Sep  5 12:43:42 2018) [sssd[be[AUDI.VWG]]] [sdap_get_users_done] (0x4000): Saving 12012 Users - Done
  --  46 seconds.

RHEL7 is 2.8 times slower that RHEL6 version. And that *is* terrible.


I do not think reading two additional attributes from AD (that are not set anyway) and updating 6 additional search indexes (in addition to existing 16) explain this huge difference. But not sure.
In RHEL7 there is additionally update of timestamp ldb. But again this should not be that slow as this ldb is NOSYNC.

It would be good to have logs with milliseconds (RHEL7 version spends about 4 ms for every user).

I will keep this ticket open to track investigation of this performance degradation. But customer case is closed (so investigation would require setting up own reproducer), so it seems to be very low prio.



(II) Watchdog doesn't work properly.

> (Wed Sep  5 12:42:56 2018) [sssd[be[AUDI.VWG]]] [sdap_save_users] (0x4000): User 0 processed!
> ...
> (Wed Sep  5 12:43:42 2018) [sssd[be[AUDI.VWG]]] [sdap_get_users_done] (0x4000): Saving 12012 Users - Done
  --  46 seconds.

With default watchdog timeout 10 secs be-process should be terminated during execution of sdap_save_users()  --  not later than in (10+1)*3 seconds. But process is terminated only after task execution.


This appears to be a bug and I opened upstream tickets: https://pagure.io/SSSD/sssd/issue/4089
(and somewhat related https://pagure.io/SSSD/sssd/issue/4088)

I will not link this bz with those upstream tickets explicitly as those bugs was not cause of customer case (logs just helped to spot it indirectly).

Comment 16 Alexey Tikhonov 2022-05-09 19:43:33 UTC
(In reply to Alexey Tikhonov from comment #11)
> 
> (2) Saving data:
> 
> RHEL6:
> > (Wed Sep  5 12:42:22 2018) [sssd[be[AUDI.VWG]]] [sdap_save_users] (0x4000): User 0 processed!
> > ...
> > (Wed Sep  5 12:42:38 2018) [sssd[be[AUDI.VWG]]] [sdap_get_users_done] (0x4000): Saving 12012 Users - Done
>   --  16 seconds
> 
> RHEL7:
> > (Wed Sep  5 12:42:56 2018) [sssd[be[AUDI.VWG]]] [sdap_save_users] (0x4000): User 0 processed!
> > ...
> > (Wed Sep  5 12:43:42 2018) [sssd[be[AUDI.VWG]]] [sdap_get_users_done] (0x4000): Saving 12012 Users - Done
>   --  46 seconds.
> 
> RHEL7 is 2.8 times slower that RHEL6 version. And that *is* terrible.
> 
> 
> I do not think reading two additional attributes from AD (that are not set
> anyway) and updating 6 additional search indexes (in addition to existing
> 16) explain this huge difference. But not sure.
> In RHEL7 there is additionally update of timestamp ldb. But again this
> should not be that slow as this ldb is NOSYNC.


Looking at the logs, this - update of timestamp ldb - is a most significant difference.
Another (I guess, minor) things are:
 - `[sss_domain_get_state] (0x1000): Domain ... is Active`
 - `[sdap_attrs_add_ldap_attr] (0x2000): authorizedRHost is not available for [...].`
 - `[sdap_attrs_add_ldap_attr] (0x2000): mail is not available for [...].`
 - `Removing attribute [userCertificate]`

I think further comparison of RHEL7 with RHEL6 doesn't make sense.
What make sense is to profile current implementation of `sdap_save_user()`.