Bug 1421194

Summary: SSSD doesn't use AD global catalog for gidnumber lookup, resulting in unacceptable delay for large forests
Product: Red Hat Enterprise Linux 7 Reporter: afox <afox>
Component: sssdAssignee: SSSD Maintainers <sssd-maint>
Status: CLOSED ERRATA QA Contact: Dan Lavu <dlavu>
Severity: high Docs Contact:
Priority: unspecified    
Version: 7.3CC: dominik.mierzejewski, fidencio, grajaiya, jhrozek, lslebodn, mkosek, mzidek, pbrezina
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: sssd-1.16.0-10.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-04-10 17:09:10 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1420851    

Description afox@redhat.com 2017-02-10 15:10:14 UTC
Description of problem:
Customer has a large number of users on RHEL7 SSSD clients that are complaining about extreme slowness in name service lookups. Investigations indicate that SSSD does not appear to be using the global catalog for gidnumber lookups.

SSSD is trawling through all the domains in the forest (and other trusted domains), rather that doing a global catalog search.

This is causing serious performance problems on RHEL7 clients in the production AD forest.

Please can you explain under what circumstances the SSSD client is supposed to use the GC and confirm whether the behaviour seen in my testing is as expected or not.  Specifically could you provide information on what attributes are searched for in the GC, whether this is effective for forward (eg user->uidnumber) and reverse (eg uidnumber->user) lookups, and what is the expected behaviour if a value for the attribute is not found in the GC.

Version-Release number of selected component (if applicable):
python-sssdconfig-1.14.0-43.el7.noarch
sssd-common-1.14.0-43.el7.x86_64
sssd-ipa-1.14.0-43.el7.x86_64
sssd-common-pac-1.14.0-43.el7.x86_64
sssd-ldap-1.14.0-43.el7.x86_64
sssd-1.14.0-43.el7.x86_64
sssd-krb5-common-1.14.0-43.el7.x86_64
sssd-krb5-1.14.0-43.el7.x86_64
sssd-ad-1.14.0-43.el7.x86_64
sssd-client-1.14.0-43.el7.x86_64
sssd-proxy-1.14.0-43.el7.x86_64

How reproducible:
Reliably. 

Steps to Reproduce:

Test Performed:

# Stop SSSD, clear cache and logs, and restart

[root@oo-vm5 ~]# service sssd stop
Redirecting to /bin/systemctl stop  sssd.service
[root@oo-vm5 ~]# rm -rf /var/lib/sss/db/* /var/log/sssd/*
[root@oo-vm5 ~]# service sssd start
Redirecting to /bin/systemctl start  sssd.service

# Do a ls of a directory containing files with group owners from local domain, trusted domains, and not existing at all.

[jw83113@oo-vm5 test]$ ls -lg
total 0
-rw-rw-r--. 1 unixeng-access-grp.net 0 Feb  3 04:12 eurgroup
-rw-rw-r--. 1                                 9999999 0 Feb  3 04:12 groupnotfound
-rw-rw-r--. 1 jwlacgrp2.net          0 Feb  3 09:53 lacgroup
-rw-rw-r--. 1 jw83113                                 0 Feb  3 04:12 namgroup

# Look in sssd_nss.log file to see what domains are being queried (log level 0x01F0)
#  It can be observed that SSSD is trawling domains until it finds the gidnumber or worse when it does not exist, it tries every domain rather than checking the GC.  (SSSD should know that gidnumber is replicated to GC from the schema objects)

[root@oo-vm5 ~]# tail -f  /var/log/sssd/sssd_nss.log
[…]

(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [1200955690.NET]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [1200955690.NET]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [9999999.NET]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [9999999.NET]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [9999999]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [9999999]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [9999999.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [9999999.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [9999999.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [9999999.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [9999999.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [9999999.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0080): No matching domain found for [9999999]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [1250152112.NET]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [1250152112.NET]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [1250152112]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [1250152112]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [1250152112.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [1250152112.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [1290120733.NET]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [1290120733.NET]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [1290120733]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [1290120733]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [1290120733.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [1290120733.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [1290120733.net]
(Mon Feb  6 13:14:55 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [1290120733.net]

# A TCP dump shows that SSSD does communicate with the GC but then goes to each domain anyway.

13:14:55.561889 IP oo-vm5.nam.nsroot.net.56466 > namdev003.nam.nsroot.net.msft-gc: Flags [P.], seq 2020:2222, ack 3596, win 321, options [nop,nop,TS val 2718168430 ecr 113087555], length 202
13:14:55.580684 IP namdev003.nam.nsroot.net.msft-gc > oo-vm5.nam.nsroot.net.56466: Flags [P.], seq 3596:3868, ack 2222, win 258, options [nop,nop,TS val 113087586 ecr 2718168430], length 272
13:14:55.580698 IP oo-vm5.nam.nsroot.net.56466 > namdev003.nam.nsroot.net.msft-gc: Flags [.], ack 3868, win 344, options [nop,nop,TS val 2718168448 ecr 113087586], length 0
13:14:55.590267 IP oo-vm5.nam.nsroot.net.32812 > namdev003.nam.nsroot.net.ldap: Flags [P.], seq 5116:5416, ack 31133, win 836, options [nop,nop,TS val 2718168458 ecr 113087583], length 300
13:14:55.590811 IP oo-vm5.nam.nsroot.net.32812 > namdev003.nam.nsroot.net.ldap: Flags [P.], seq 5416:5716, ack 31219, win 836, options [nop,nop,TS val 2718168459 ecr 113087587], length 300
13:14:55.591781 IP oo-vm5.nam.nsroot.net.35076 > nsrootdev001.nam.nsroot.net.ldap: Flags [P.], seq 3455:3745, ack 9334, win 406, options [nop,nop,TS val 2718168460 ecr 247010140], length 290
13:14:55.592662 IP oo-vm5.nam.nsroot.net.35076 > nsrootdev001.nam.nsroot.net.ldap: Flags [P.], seq 3745:4035, ack 9970, win 416, options [nop,nop,TS val 2718168460 ecr 247010144], length 290
13:14:55.593851 IP oo-vm5.nam.nsroot.net.46548 > eurdev003.nam.nsroot.net.ldap: Flags [P.], seq 2686:2986, ack 4045, win 321, options [nop,nop,TS val 2718168462 ecr 246862079], length 300
13:14:55.594750 IP oo-vm5.nam.nsroot.net.46770 > ausdev001.nam.nsroot.net.ldap: Flags [.], ack 3466, win 299, options [nop,nop,TS val 2718168463 ecr 272910065], length 0
13:14:55.595372 IP oo-vm5.nam.nsroot.net.46548 > eurdev003.nam.nsroot.net.ldap: Flags [P.], seq 2986:3286, ack 4131, win 321, options [nop,nop,TS val 2718168463 ecr 246862082], length 300
13:14:55.596562 IP oo-vm5.nam.nsroot.net.46992 > lacdev001.nam.nsroot.net.ldap: Flags [P.], seq 2386:2686, ack 3466, win 299, options [nop,nop,TS val 2718168464 ecr 272909609], length 300
13:14:55.598262 IP oo-vm5.nam.nsroot.net.46992 > lacdev001.nam.nsroot.net.ldap: Flags [.], ack 3948, win 321, options [nop,nop,TS val 2718168466 ecr 272909622], length 0
13:14:55.630765 IP oo-vm5.nam.nsroot.net.32812 > namdev003.nam.nsroot.net.ldap: Flags [.], ack 31305, win 836, options [nop,nop,TS val 2718168499 ecr 113087587], length 0
13:14:55.632766 IP oo-vm5.nam.nsroot.net.35076 > nsrootdev001.nam.nsroot.net.ldap: Flags [.], ack 10606, win 426, options [nop,nop,TS val 2718168501 ecr 247010144], length 0
13:14:55.635786 IP oo-vm5.nam.nsroot.net.46548 > eurdev003.nam.nsroot.net.ldap: Flags [.], ack 4217, win 321, options [nop,nop,TS val 2718168504 ecr 246862082], length 0

In the production AD forest this takes 18 seconds to search for a  single non-existent group.  Once this goes into the negative entry cache (which I have increased the time to live), the lookup is fast because it’s fulfilled from the local cache but this initial lookup is unacceptably slow.  SSSD should do one single search against the GC not trawl around all the domains.

(Mon Feb  6 13:20:24 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [500008636.NET]
(Mon Feb  6 13:20:24 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [500008636.NET]
(Mon Feb  6 13:20:24 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [500008636]
(Mon Feb  6 13:20:25 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [500008636]
(Mon Feb  6 13:20:25 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [500008636.citicorp.com]
(Mon Feb  6 13:20:27 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [500008636.citicorp.com]
(Mon Feb  6 13:20:27 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [500008636.net]
(Mon Feb  6 13:20:27 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [500008636.net]
(Mon Feb  6 13:20:27 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [500008636.narb.citicorp.com]
(Mon Feb  6 13:20:29 2017) [sssd[nss]] [nss_cmd_getby_dp_callback] (0x0040): Unable to get information from Data Provider
Error: 3, 1432158270, Group lookup failed
Will try to return what we have in cache
(Mon Feb  6 13:20:29 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [500008636.net]
(Mon Feb  6 13:20:31 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [500008636.net]
(Mon Feb  6 13:20:31 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [500008636.net]
(Mon Feb  6 13:20:33 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [500008636.net]
(Mon Feb  6 13:20:33 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [500008636.dyn.nsroot.net]
(Mon Feb  6 13:20:35 2017) [sssd[nss]] [nss_cmd_getby_dp_callback] (0x0040): Unable to get information from Data Provider
Error: 3, 1432158270, Group lookup failed
Will try to return what we have in cache
(Mon Feb  6 13:20:35 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [500008636]
(Mon Feb  6 13:20:38 2017) [sssd[nss]] [nss_cmd_getby_dp_callback] (0x0040): Unable to get information from Data Provider
Error: 3, 1432158270, Group lookup failed
Will try to return what we have in cache
(Mon Feb  6 13:20:38 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [500008636.net]
(Mon Feb  6 13:20:41 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [500008636.net]
(Mon Feb  6 13:20:41 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [500008636.narb.citicorp.com]
(Mon Feb  6 13:20:42 2017) [sssd[nss]] [nss_cmd_getby_dp_callback] (0x0040): Unable to get information from Data Provider
Error: 3, 1432158270, Group lookup failed
Will try to return what we have in cache
(Mon Feb  6 13:20:42 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [500008636.net]
(Mon Feb  6 13:20:45 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0100): Requesting info for [500008636.net]
(Mon Feb  6 13:20:45 2017) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0080): No matching domain found for [500008636]


Actual results:
SSSD goes to global catalog, but then trawls subdomains any way. 

Expected results:
SSSD should always use global catalog first, and only trawl subdomains if data cannot be found. 

Additional info:
The performance problem is occuring in a production AD environment which is very large (approx 400000 users), however the behaviour has been replicated in a development AD environment too.  The delays in development are much smaller because there are a lot fewer entries however the effects are the same in that many domains are being searched individually rather than a single GC query.

When does the behavior occur? Frequently?  Repeatedly?   At certain times?

Always when an entry is not in the local SSSD cache.  It's repeatable.

What information can you provide around timeframes and the business impact?

This is an important case for the customer as it has visibility to production end users.  They are managing the impact by increasing the SSSD negative entry cache time to live to reduce the frequency of these lookups but it's not a root cause fix.  They need a root cause fix for this problem.

Comment 4 Jakub Hrozek 2017-08-10 11:13:03 UTC
Upstream ticket:
https://pagure.io/SSSD/sssd/issue/3468

Comment 5 Jakub Hrozek 2017-08-10 18:15:42 UTC
QE: to test this, prepare an AD forest with overlapping search bases, e.g.
 dc=foo,dc=bar
 dc=child,dc=foo,dc=bar

Have POSIX attributes replicated into the Global Catalog or use ID mapping.

Request an ID that doesn't exist in either domain. With the current code, SSSD would iterate over all domains. With the patched code, SSSD should run only one search against the global catalog.

Please note that the behaviour should also be tested in an environment where the POSIX attributes are not replicated to the global catalog. There, sssd should behave as it did before, so just iterate over all the domains.

Comment 6 Lukas Slebodnik 2017-12-06 21:21:52 UTC
master:
* a72919af8347b5bbc65a3b1fb3e5d31447240b24
* f2a5e29f063f9d623c1336d76f4b2bc500c1a5e2
* a6eb9c4c3ff68d134bc745e8374f182737e9696b
* 0a0b34f5fbe8f4a8c533a7d65f0f2961ee264054
* 2856dac5818265a6b4e42d768b73c65e333d14ff
* 800b1a27543fa83bc6cd73d8e2789f3cdbaf584a
* 6cd367da68ff56eb48b8b4167dbdd5e53992d194
* 07452697a67902dc6876d2f40d364cf1eadf2431
* 95fd82a4d7b50e64fed6906bc5345f271e8247d9
* 095844d6b48aef483c33e5a369a405ae686e044d
* c0f9f5a0f6d71a1596ee3cef549b4b02295313c3
* ba8a92bbd59f189bd1323dd0c4010cdfc694be35
* 6ae22d9adc0b075361defc99b8f14480ba8e7b46
* dacfe74113dde62ddaaa7f9abf9d2b6448d89db6
* 8e93ebb2a6f7644c389c1d1f4e92a21c4d0b2b45
* 37fdd9dc1ad5968067f8e3c43a51ed2ac9f3b104

Comment 10 Dan Lavu 2017-12-15 13:32:42 UTC
Verified against sssd-1.16.0-11.el7.x86_64

==== logs output  ====

(Fri Dec 15 17:17:37 2017) [sssd[nss]] [accept_fd_handler] (0x0400): Client connected!
(Fri Dec 15 17:17:37 2017) [sssd[nss]] [sss_cmd_get_version] (0x0200): Received client version [1].
(Fri Dec 15 17:17:37 2017) [sssd[nss]] [sss_cmd_get_version] (0x0200): Offered version [1].
(Fri Dec 15 17:17:37 2017) [sssd[nss]] [nss_getby_name] (0x0400): Input name: unknown.com
(Fri Dec 15 17:17:37 2017) [sssd[nss]] [cache_req_set_plugin] (0x2000): CR #5: Setting "User by name" plugin
(Fri Dec 15 17:17:37 2017) [sssd[nss]] [cache_req_send] (0x0400): CR #5: New request 'User by name'
(Fri Dec 15 17:17:37 2017) [sssd[nss]] [cache_req_process_input] (0x0400): CR #5: Parsing input name [unknown.com]
(Fri Dec 15 17:17:37 2017) [sssd[nss]] [sss_domain_get_state] (0x1000): Domain sssdad.com is Active
(Fri Dec 15 17:17:37 2017) [sssd[nss]] [sss_domain_get_state] (0x1000): Domain child1.sssdad.com is Active
(Fri Dec 15 17:17:37 2017) [sssd[nss]] [sss_parse_name_for_domains] (0x0200): name 'unknown.com' matched expression for domain 'child1.sssdad.com', user is unknown
(Fri Dec 15 17:17:37 2017) [sssd[nss]] [cache_req_set_name] (0x0400): CR #5: Setting name [unknown]
(Fri Dec 15 17:17:37 2017) [sssd[nss]] [cache_req_select_domains] (0x0400): CR #5: Performing a single domain search
(Fri Dec 15 17:17:37 2017) [sssd[nss]] [sss_domain_get_state] (0x1000): Domain sssdad.com is Active
(Fri Dec 15 17:17:37 2017) [sssd[nss]] [sss_domain_get_state] (0x1000): Domain child1.sssdad.com is Active
(Fri Dec 15 17:17:37 2017) [sssd[nss]] [cache_req_search_domains] (0x0400): CR #5: Search will check the cache and check the data provider
(Fri Dec 15 17:17:37 2017) [sssd[nss]] [cache_req_validate_domain_type] (0x2000): Request type POSIX-only for domain child1.sssdad.com type POSIX is valid


posix enabled searching only the global catalog

id: unknown.com: no such user

real	0m0.926s
user	0m0.000s
sys	0m0.003s



id mapping enabled iterating through all the domains

[root@vm-idm-013 db]# time id unknown.com
id: unknown.com: no such user

real	0m23.515s
user	0m0.000s
sys	0m0.004s

Comment 13 errata-xmlrpc 2018-04-10 17:09:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0929