Bug 1660461

Summary:	responders chain requests that were issued before reconnection to sssd_be
Product:	Red Hat Enterprise Linux 8	Reporter:	Madhuri <mupadhye>
Component:	sssd	Assignee:	Pavel Březina <pbrezina>
Status:	CLOSED ERRATA	QA Contact:	sssd-qe <sssd-qe>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	8.0	CC:	dbula, grajaiya, jhrozek, lslebodn, mzidek, pbrezina, sgoveas, tscherf
Target Milestone:	rc	Keywords:	Regression
Target Release:	8.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	sssd-2.1.0-1.el8	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-11-05 22:33:53 UTC	Type:	Bug
Bug Depends On:	1682305
Bug Blocks:

Description Madhuri 2018-12-18 11:30:30 UTC

Description of problem:
sssd_be takes too long to reconnect (roughly 80 seconds or more) after the sssd_be process is killed.

Version-Release number of selected component (if applicable):
sssd-2.0.0-27.el8.x86_64

How reproducible:
Always


Steps to Reproduce:
1. Configure sssd client
2. Clear the sss cache
3. Find pid of sssd_be and do a kill -STOP `pidof sssd_be`
4. Run a new getent passwd username in another terminal, or run getent passwd as a background process.
5. Now kill -9 `pidof sssd_be`
6. Check the user lookup again.
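
The steps above can be scripted roughly as follows. This is only a sketch, not the QE test: the helper names (pids_of, signal_all, timed_run, reproduce) and the 10-second settle time are assumptions made for illustration (the settle time is borrowed from the automation log in Comment 7), and it assumes root privileges plus an already configured client with a cleared cache (steps 1 and 2).

```python
import os
import signal
import subprocess
import time


def pids_of(name):
    """Return the PIDs printed by `pidof name` (empty list if there are none)."""
    result = subprocess.run(["pidof", name], capture_output=True, text=True)
    return [int(pid) for pid in result.stdout.split()]


def signal_all(pids, sig):
    """Send the same signal to every PID in the list."""
    for pid in pids:
        os.kill(pid, sig)


def timed_run(cmd):
    """Run a command and return (exit code, elapsed seconds)."""
    start = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, time.monotonic() - start


def reproduce(user, settle=10):
    """Walk through steps 3 to 6 and time the final lookup."""
    pids = pids_of("sssd_be")
    if not pids:
        raise RuntimeError("sssd_be is not running")
    signal_all(pids, signal.SIGSTOP)             # step 3: freeze the back end
    background = subprocess.Popen(               # step 4: lookup in the background
        ["getent", "passwd", user], stdout=subprocess.DEVNULL
    )
    time.sleep(settle)
    signal_all(pids, signal.SIGKILL)             # step 5: kill -9 the back end
    time.sleep(settle)
    result = timed_run(["getent", "passwd", user])  # step 6: look up again
    background.kill()                            # clean up the background lookup
    background.wait()
    return result
```

Calling reproduce("username") returns the exit code and duration of the final lookup; on an affected build the duration is where the long delay would be expected to show up.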

Actual results:
sssd_be takes too long to reconnect (roughly 80 seconds or more).

Expected results:
Reconnecting should not take this long.


Additional info:

# cat sssd.conf
    [sssd]
    config_file_version = 2
    services = nss, pam
    domains = LDAP

    [nss]
    filter_groups = root
    filter_users = root

    [pam]

    [domain/LDAP]
    debug_level=0xFFF0
    id_provider = ldap
    ldap_uri = ldap://server.example.com
    ldap_search_base = dc=example,dc=com
    ldap_tls_cacert = /etc/openldap/certs/cacert.asc

Comment 2 Jakub Hrozek 2018-12-18 11:36:04 UTC

I debugged the issue with Madhuri and this is what happens:
 - the first request is created
 - sssd_be dies due to timeout and is restarted. This of course kills the first request
 - sssd_nss reconnects to the new sssd_be instance
 - a second request is created and chained with the first request, which never had a chance to finish
 - the client waits until the client timeout passes

I guess we need to drop the pending requests table when reconnecting.
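
To make the sequence above concrete, here is a toy model of a responder that chains identical in-flight requests. It is not SSSD code: the Responder class, its method names and the drop_on_reconnect switch are all hypothetical, and it only illustrates the idea of dropping the pending requests table on reconnect. Whether the eventual fix takes exactly this shape is not established here.

```python
class Responder:
    """Toy model: identical requests are chained onto one in-flight request."""

    def __init__(self, drop_on_reconnect):
        self.drop_on_reconnect = drop_on_reconnect
        self.pending = {}   # request key -> clients waiting for the reply
        self.sent = []      # keys actually sent to the current back end

    def lookup(self, key, client):
        """Register a client lookup; send it only if no identical one is pending."""
        if key in self.pending:
            # An identical request is already in flight, so just wait for it.
            self.pending[key].append(client)
            return "chained"
        self.pending[key] = [client]
        self.sent.append(key)
        return "sent"

    def backend_reply(self, key, result):
        """Deliver one back end reply to every client chained on that key."""
        waiters = self.pending.pop(key, [])
        return {client: result for client in waiters}

    def reconnect(self):
        """Model a reconnect to a freshly started back end."""
        self.sent = []  # the new back end instance has seen no requests
        if not self.drop_on_reconnect:
            return {}   # stale entries stay in the table
        orphaned, self.pending = self.pending, {}
        return orphaned  # requests whose clients must be failed or retried
```

With drop_on_reconnect=False, a lookup issued after reconnect() is reported as "chained" although nothing was sent to the new back end, so the client can only wait for its own timeout. With drop_on_reconnect=True, the same lookup is "sent" again.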

Comment 3 Jakub Hrozek 2018-12-18 13:11:31 UTC

Upstream ticket:
https://pagure.io/SSSD/sssd/issue/3907

Comment 4 Pavel Březina 2019-02-14 12:32:23 UTC

Upstream PR:
https://github.com/SSSD/sssd/pull/752

Comment 5 Jakub Hrozek 2019-02-27 18:46:04 UTC

* master: ffd7536dfa402a6d0dec2fb0bb3e3a221f5f9aab

Comment 7 Madhuri 2019-08-23 11:06:15 UTC

Verified with:
sssd-2.2.0-11.el8.x86_64

From automation,

:: [ 16:19:12 ] :: [  BEGIN   ] :: Running '> /var/log/sssd/sssd_nss.log'
:: [ 16:19:12 ] :: [   PASS   ] :: Command '> /var/log/sssd/sssd_nss.log' (Expected 0, got 0)
:: [ 16:19:12 ] :: [  BEGIN   ] :: Running 'restart_clearing_cache'
Redirecting to /bin/systemctl stop sssd.service
Redirecting to /bin/systemctl start sssd.service
:: [ 16:19:13 ] :: [   LOG    ] :: Sleeping for 15 seconds
:: [ 16:19:28 ] :: [   PASS   ] :: Command 'restart_clearing_cache' (Expected 0, got 0)
:: [ 16:19:28 ] :: [  BEGIN   ] :: Running 'kill -STOP 31891 31890'
:: [ 16:19:28 ] :: [   PASS   ] :: Command 'kill -STOP 31891 31890' (Expected 0, got 0)
:: [ 16:19:28 ] :: [  BEGIN   ] :: Running 'sleep 10'
:: [ 16:19:38 ] :: [   PASS   ] :: Command 'sleep 10' (Expected 0, got 0)
:: [ 16:19:38 ] :: [  BEGIN   ] :: Running 'getent -s sss passwd usr1 &'
:: [ 16:19:38 ] :: [   PASS   ] :: Command 'getent -s sss passwd usr1 &' (Expected 0, got 0)
:: [ 16:19:39 ] :: [  BEGIN   ] :: Running 'kill -9 31891 31890'
:: [ 16:19:39 ] :: [   PASS   ] :: Command 'kill -9 31891 31890' (Expected 0, got 0)
:: [ 16:19:39 ] :: [  BEGIN   ] :: Running 'sleep 10'
:: [ 16:19:49 ] :: [   PASS   ] :: Command 'sleep 10' (Expected 0, got 0)
:: [ 16:19:49 ] :: [  BEGIN   ] :: Running 'getent -s sss passwd usr1'
usr1:*:111111:111111:usr1:/home/usr1:
:: [ 16:19:49 ] :: [   PASS   ] :: Command 'getent -s sss passwd usr1' (Expected 0, got 0)
:: [ 16:19:49 ] :: [   PASS   ] :: File '/var/log/sssd/sssd_nss.log' should not contain 'Identical request in progress' 


Based on the above, marking this bug as Verified.
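
For anyone re-checking by hand, the last assertion in the log above amounts to a plain substring search of the responder log. A minimal sketch, assuming the log path and message text shown in the automation output; the function name log_contains is made up for this example:

```python
from pathlib import Path

# Message the automation expects to be absent from the NSS responder log.
MARKER = "Identical request in progress"


def log_contains(path, marker=MARKER):
    """Return True if the marker text appears anywhere in the log file."""
    text = Path(path).read_text(errors="replace")
    return marker in text
```

After the test sequence, log_contains("/var/log/sssd/sssd_nss.log") returning False corresponds to the final PASS line.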

Comment 10 errata-xmlrpc 2019-11-05 22:33:53 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:3651