Bug 1660461

Summary: responders chain requests that were issued before reconnection to sssd_be
Product: Red Hat Enterprise Linux 8 Reporter: Madhuri <mupadhye>
Component: sssd Assignee: Pavel Březina <pbrezina>
Status: CLOSED ERRATA QA Contact: sssd-qe <sssd-qe>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 8.0 CC: dbula, grajaiya, jhrozek, lslebodn, mzidek, pbrezina, sgoveas, tscherf
Target Milestone: rc Keywords: Regression
Target Release: 8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: sssd-2.1.0-1.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-11-05 22:33:53 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1682305    
Bug Blocks:    

Description Madhuri 2018-12-18 11:30:30 UTC
Description of problem:
sssd_be takes too long to reconnect (more than 80 seconds) after the sssd_be process is killed.

Version-Release number of selected component (if applicable):
sssd-2.0.0-27.el8.x86_64

How reproducible:
Always


Steps to Reproduce:
1. Configure the sssd client.
2. Clear the sss cache.
3. Find the PID of sssd_be and run kill -STOP `pidof sssd_be`
4. Run getent passwd username in another terminal, or run it as a background process.
5. Run kill -9 `pidof sssd_be`
6. Run the user lookup again.
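
To put a number on the delay seen in step 6, the lookup can be timed. The following is a minimal sketch, not part of the original report: the helper name timed_lookup and the 120-second default timeout are assumptions chosen for illustration.

```python
import subprocess
import time
from typing import List, Tuple


def timed_lookup(argv: List[str], timeout: float = 120.0) -> Tuple[int, float]:
    """Run a lookup command and return (exit code, elapsed seconds).

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    start = time.monotonic()
    proc = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,  # raises subprocess.TimeoutExpired if exceeded
    )
    return proc.returncode, time.monotonic() - start
```

For example, timed_lookup(["getent", "passwd", "username"]) run right after step 5 would be expected to report an elapsed time of around 80 seconds on an affected build.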

Actual results:
sssd_be takes too long to reconnect (more than 80 seconds).

Expected results:
Reconnecting should not take this long.


Additional info:

# cat sssd.conf
    [sssd]
    config_file_version = 2
    services = nss, pam
    domains = LDAP

    [nss]
    filter_groups = root
    filter_users = root

    [pam]

    [domain/LDAP]
    debug_level=0xFFF0
    id_provider = ldap
    ldap_uri = ldap://server.example.com
    ldap_search_base = dc=example,dc=com
    ldap_tls_cacert = /etc/openldap/certs/cacert.asc

Comment 2 Jakub Hrozek 2018-12-18 11:36:04 UTC
I debugged the issue with Madhuri and this is what happens:
 - the first request is created
 - sssd_be dies due to timeout and is restarted. This of course kills the first request
 - sssd_nss reconnects to the new sssd_be instance
 - a second request is created and chained with the first request, which never had a chance to finish
 - the client waits until the client timeout passes

I guess we need to drop the pending requests table when reconnecting.
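
The sequence above can be modelled with a small sketch. This is not SSSD code: the class PendingRequests, its method names, and the choice to fail stale waiters with an error string are all assumptions made only to illustrate how a stale entry chains new requests and how dropping the table on reconnect avoids it.

```python
from typing import Callable, Dict, List

Callback = Callable[[str], None]


class PendingRequests:
    """Hypothetical responder-side table of in-flight backend requests."""

    def __init__(self) -> None:
        # request key -> callbacks waiting for that request's reply
        self._waiters: Dict[str, List[Callback]] = {}

    def submit(self, key: str, callback: Callback) -> bool:
        """Register a waiter.

        Returns True if a new backend request must be sent, False if the
        caller was chained to an identical request already in progress.
        """
        if key in self._waiters:
            self._waiters[key].append(callback)
            return False
        self._waiters[key] = [callback]
        return True

    def complete(self, key: str, result: str) -> None:
        """Deliver a backend reply to every waiter chained on this key."""
        for cb in self._waiters.pop(key, []):
            cb(result)

    def drop_all(self, error: str = "backend restarted") -> None:
        """On reconnect, fail every waiter and forget the stale requests."""
        stale, self._waiters = self._waiters, {}
        for callbacks in stale.values():
            for cb in callbacks:
                cb(error)
```

Without the drop_all() call on reconnect, a second submit() for the same key returns False and waits on a reply that the killed backend will never send, which matches the client-timeout behaviour described above.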

Comment 3 Jakub Hrozek 2018-12-18 13:11:31 UTC
Upstream ticket:
https://pagure.io/SSSD/sssd/issue/3907

Comment 4 Pavel Březina 2019-02-14 12:32:23 UTC
Upstream PR:
https://github.com/SSSD/sssd/pull/752

Comment 5 Jakub Hrozek 2019-02-27 18:46:04 UTC
* master: ffd7536dfa402a6d0dec2fb0bb3e3a221f5f9aab

Comment 7 Madhuri 2019-08-23 11:06:15 UTC
Verified with:
sssd-2.2.0-11.el8.x86_64

Output from automation:

:: [ 16:19:12 ] :: [  BEGIN   ] :: Running '> /var/log/sssd/sssd_nss.log'
:: [ 16:19:12 ] :: [   PASS   ] :: Command '> /var/log/sssd/sssd_nss.log' (Expected 0, got 0)
:: [ 16:19:12 ] :: [  BEGIN   ] :: Running 'restart_clearing_cache'
Redirecting to /bin/systemctl stop sssd.service
Redirecting to /bin/systemctl start sssd.service
:: [ 16:19:13 ] :: [   LOG    ] :: Sleeping for 15 seconds
:: [ 16:19:28 ] :: [   PASS   ] :: Command 'restart_clearing_cache' (Expected 0, got 0)
:: [ 16:19:28 ] :: [  BEGIN   ] :: Running 'kill -STOP 31891 31890'
:: [ 16:19:28 ] :: [   PASS   ] :: Command 'kill -STOP 31891 31890' (Expected 0, got 0)
:: [ 16:19:28 ] :: [  BEGIN   ] :: Running 'sleep 10'
:: [ 16:19:38 ] :: [   PASS   ] :: Command 'sleep 10' (Expected 0, got 0)
:: [ 16:19:38 ] :: [  BEGIN   ] :: Running 'getent -s sss passwd usr1 &'
:: [ 16:19:38 ] :: [   PASS   ] :: Command 'getent -s sss passwd usr1 &' (Expected 0, got 0)
:: [ 16:19:39 ] :: [  BEGIN   ] :: Running 'kill -9 31891 31890'
:: [ 16:19:39 ] :: [   PASS   ] :: Command 'kill -9 31891 31890' (Expected 0, got 0)
:: [ 16:19:39 ] :: [  BEGIN   ] :: Running 'sleep 10'
:: [ 16:19:49 ] :: [   PASS   ] :: Command 'sleep 10' (Expected 0, got 0)
:: [ 16:19:49 ] :: [  BEGIN   ] :: Running 'getent -s sss passwd usr1'
usr1:*:111111:111111:usr1:/home/usr1:
:: [ 16:19:49 ] :: [   PASS   ] :: Command 'getent -s sss passwd usr1' (Expected 0, got 0)
:: [ 16:19:49 ] :: [   PASS   ] :: File '/var/log/sssd/sssd_nss.log' should not contain 'Identical request in progress' 


Based on the output above, marking this bug as Verified.
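
The final assertion in the automation output checks that the responder log does not contain the chaining message. A minimal sketch of an equivalent check follows; the function name log_has_chained_request is hypothetical, and only the log path and the message text come from the output above.

```python
from pathlib import Path

# Message text taken from the automation output above.
MARKER = "Identical request in progress"


def log_has_chained_request(log_path: str, marker: str = MARKER) -> bool:
    """Return True if the log file contains the chaining message.

    Hypothetical helper; undecodable bytes are replaced rather than raising.
    """
    text = Path(log_path).read_text(errors="replace")
    return marker in text
```

Calling log_has_chained_request("/var/log/sssd/sssd_nss.log") after the second lookup would be expected to return False on a fixed build.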

Comment 10 errata-xmlrpc 2019-11-05 22:33:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:3651