Bug 801407

Summary: sssd_nss gets hung processing identical search requests
Product: Red Hat Enterprise Linux 6
Reporter: Stephen Gallagher <sgallagh>
Component: sssd
Assignee: Stephen Gallagher <sgallagh>
Status: CLOSED ERRATA
QA Contact: IDM QE LIST <seceng-idm-qe-list>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 6.3
CC: grajaiya, jgalipea, jhrozek, kbanerje, prc
Target Milestone: rc
Keywords: Regression
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: sssd-1.8.0-12.el6
Doc Type: Bug Fix
Doc Text:
Cause: The function that handled pending requests on reconnect was checking an orphaned global variable that was never used.
Consequence: If there was a request SSSD never received an answer for, the request was left hanging and all subsequent requests for the same information never finished.
Fix: The correct hash table is used now.
Result: Identical requests are now processed correctly even if the original request fails.
Last Closed: 2012-06-20 11:55:46 UTC

Description Stephen Gallagher 2012-03-08 13:17:53 UTC
This bug is created as a clone of upstream ticket:
https://fedorahosted.org/sssd/ticket/1229

There is a recurring, intermittent issue that I have been experiencing with sssd 1.7: during or after a failure and recovery of a FreeIPA server, sssd fails to finish processing a user and forgets to mark the status. As a result, when the FreeIPA server comes back, an affected user cannot log in. This is particularly painful when the user is listed in pam_access as a user who is not allowed to log in, because PAM stops processing while it waits for sssd to finish processing the pam_access user and never gets to the user who is actually logging in. The issue seems to occur only in certain timing-dependent situations, when the user is being accessed during an outage of the FreeIPA server.

(I am updating to 1.8 in the hope that it solves this problem.)

Here is a snippet from sssd_nss.log:
(Mon Mar  5 14:43:08 2012) [sssd[nss]] [accept_fd_handler] (0x0100): Client connected!
(Mon Mar  5 14:43:08 2012) [sssd[nss]] [sss_cmd_get_version] (0x0200): Received client version [1].
(Mon Mar  5 14:43:08 2012) [sssd[nss]] [sss_cmd_get_version] (0x0200): Offered version [1].
(Mon Mar  5 14:43:08 2012) [sssd[nss]] [nss_cmd_getpwnam] (0x0100): Requesting info for [brokenuser] from [<ALL>]
(Mon Mar  5 14:43:08 2012) [sssd[nss]] [sss_ncache_check_str] (0x2000): Checking negative cache for [NCE/USER/example.com/brokenuser]
(Mon Mar  5 14:43:08 2012) [sssd[nss]] [nss_cmd_getpwnam_search] (0x0100): Requesting info for [brokenuser]
(Mon Mar  5 14:43:08 2012) [sssd[nss]] [sss_dp_get_account_send] (0x0400): Identical request in progress: [1:brokenuser]

Comment 1 Stephen Gallagher 2012-03-08 13:23:56 UTC
To reproduce, do the following (provided by Simo in the upstream ticket):

Clear your caches (or expire them all) and start sssd.

ps xa|grep sssd
Find the backend PID and do a kill -STOP <pid>.

Now do a getent passwd username. The sssd_nss responder will send a message to the backend, but the backend is blocked, so it will not act.

Now kill -9 <pid> the backend. Do a new getent passwd username in another terminal.

Look at the nss logs (level 8) and you'll see that the call is stalled waiting for a reply that will never come.

With the patch, the second name resolution will cause the cleanup function to fire and a brand new call to the backend is issued. No more stalling.
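
For repeated testing, the manual steps above could be scripted roughly as follows. This is only a sketch in Python and is not part of the upstream ticket. It assumes the backend process is named sssd_be, that only one backend is of interest (the first one found is used), that it runs as root on a test machine whose caches were already cleared, and that the sleep and timeout values (chosen arbitrarily here) suit the machine. The function names backend_pids and second_lookup_stalls are made up for this illustration.

```python
import os
import signal
import subprocess
import time


def backend_pids(ps_output):
    """Return the PIDs of sssd_be processes listed in `ps xa` output."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 4)  # PID TTY STAT TIME COMMAND
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # header line or something unexpected
        argv0 = fields[4].split()[0]
        if os.path.basename(argv0) == "sssd_be":  # assumed backend binary name
            pids.append(int(fields[0]))
    return pids


def second_lookup_stalls(username, timeout=15.0):
    """Run the reproducer; True means the second getent did not return in time."""
    ps = subprocess.run(["ps", "xa"], stdout=subprocess.PIPE,
                        universal_newlines=True, check=True)
    pid = backend_pids(ps.stdout)[0]  # raises IndexError if no backend runs

    os.kill(pid, signal.SIGSTOP)  # backend stops answering
    first = subprocess.Popen(["getent", "passwd", username],
                             stdout=subprocess.DEVNULL)
    time.sleep(2)  # arbitrary: give sssd_nss time to queue the request
    os.kill(pid, signal.SIGKILL)  # the kill -9 step

    try:
        # Second, identical lookup; a timeout is treated as "stalled".
        subprocess.run(["getent", "passwd", username],
                       stdout=subprocess.DEVNULL, timeout=timeout)
        return False
    except subprocess.TimeoutExpired:
        return True
    finally:
        first.kill()
        first.wait()
```

Calling second_lookup_stalls("username") on an affected build would be expected to return True, but the nss logs at level 8 remain the authoritative check.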

Comment 4 Jakub Hrozek 2012-04-03 18:39:02 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: The function that handled pending requests on reconnect was checking an orphaned global variable that was never used.

Consequence: If there was a request SSSD never received an answer for, the request was left hanging and all subsequent requests for the same information never finished.

Fix: The correct hash table is used now.

Result: Identical requests are now processed correctly even if the original request fails.
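
The cause and fix can be illustrated with a toy model. The Python sketch below is not SSSD code: the class, its attributes, and the use_correct_table switch are hypothetical, and only the request key 1:brokenuser is taken from the log above. It assumes identical in-flight requests are de-duplicated through a table of pending requests, and that the reconnect handler is supposed to flush that table.

```python
class Responder:
    """Toy responder that de-duplicates identical in-flight backend requests."""

    def __init__(self, use_correct_table):
        self.use_correct_table = use_correct_table
        self.pending = {}    # request key -> callbacks waiting for the reply
        self.orphaned = {}   # stand-in for the unused global: never filled
        self.sent = []       # keys actually sent to the backend

    def lookup(self, key, callback):
        if key in self.pending:
            # Corresponds to the "Identical request in progress" log line.
            self.pending[key].append(callback)
            return
        self.pending[key] = [callback]
        self.sent.append(key)

    def backend_reply(self, key, result):
        for callback in self.pending.pop(key, []):
            callback(result)

    def backend_reconnected(self):
        # The buggy variant walks the table that nothing ever populates.
        table = self.pending if self.use_correct_table else self.orphaned
        stale = dict(table)
        table.clear()
        for callbacks in stale.values():
            for callback in callbacks:
                callback(None)  # tell waiters that their request failed


def run_scenario(use_correct_table):
    responder = Responder(use_correct_table)
    results = []
    responder.lookup("1:brokenuser", results.append)  # backend dies, no reply
    responder.backend_reconnected()
    responder.lookup("1:brokenuser", results.append)  # identical request
    return len(responder.sent), results
```

With use_correct_table=False the second lookup is only queued behind the dead request and nothing new is sent to the backend. With use_correct_table=True the stale entry is flushed, the first waiter is told its request failed, and the second lookup is sent again.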

Comment 6 Kaushik Banerjee 2012-04-19 06:36:02 UTC
Verified in version:

# rpm -qi sssd | head
Name        : sssd                         Relocations: (not relocatable)
Version     : 1.8.0                             Vendor: Red Hat, Inc.
Release     : 22.el6                        Build Date: Mon 09 Apr 2012 07:40:33 PM IST
Install Date: Mon 16 Apr 2012 04:57:02 PM IST      Build Host: x86-003.build.bos.redhat.com
Group       : Applications/System           Source RPM: sssd-1.8.0-22.el6.src.rpm
Size        : 7870660                          License: GPLv3+
Signature   : (none)
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
URL         : http://fedorahosted.org/sssd/
Summary     : System Security Services Daemon

Comment 8 errata-xmlrpc 2012-06-20 11:55:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0747.html