Bug 645434

Summary: NSS responder dies if DP dies during a request
Product: [Fedora] Fedora
Reporter: Sumit Bose <sbose>
Component: sssd
Assignee: Stephen Gallagher <sgallagh>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: 14
CC: jhrozek, sbose, sgallagh, ssorce
Hardware: All
OS: All
Fixed In Version: sssd-1.4.1-1.fc14
Doc Type: Bug Fix
Clones: 645437, 645438
Last Closed: 2010-11-16 23:19:42 UTC
Bug Blocks: 645437, 645438

Description Sumit Bose 2010-10-21 13:44:33 UTC
Description of problem:
If the data provider dies during an NSS request, the NSS responder also dies once the timeout for the open, unhandled requests is reached.

Version-Release number of selected component (if applicable):
At least sssd-1.2 and above

How reproducible:
There is no known error in the LDAP provider that can be used to trigger this issue, so the sssd_be process must be killed manually.

Steps to Reproduce:
1. Configure sssd with id_provider=ldap.
2. Choose a slow LDAP server and a very large group, or find some other way to make the LDAP request take a long time.
3. getent group very_large_group
4. Kill sssd_be immediately after calling getent.
5. Wait until the timeout is reached (a couple of minutes).
  
Actual results:
NSS responder dies.

Expected results:
NSS responder returns an error to the client.

Additional info:
The upstream bug can be found here: https://fedorahosted.org/sssd/ticket/654

Comment 1 Sumit Bose 2010-10-22 11:45:59 UTC
Based on an idea from Jan Zelený <jzeleny> I found an easier way to reproduce this issue:

Steps to Reproduce:
1. Configure sssd with id_provider=ldap; any LDAP server will do.
2. Start sssd, preferably with an empty cache (rm -f /var/lib/sss/db/*)
3. Find the pid of sssd_nss
   pgrep sssd_nss
4. Add a delay on the interface used to contact the LDAP server. If the LDAP server runs locally, use lo, e.g.
   tc qdisc add dev lo root netem delay 3s
5. Run
   while /bin/true; do if pgrep getent 1> /dev/null; then killall -9 /usr/libexec/sssd/sssd_be; break; fi; sleep 1; done
   (this will kill sssd_be as soon as a getent command is running)
6. In a different shell, call
   getent group some_group_which_is_not_in_the_cache
7. Wait until the getent call returns; this can take up to 5 minutes.
8. Call
   pgrep sssd_nss
   again
9. Remove the delay
   tc qdisc del dev lo root

Actual results:
The two PIDs differ, i.e. sssd_nss died and was restarted

Expected results:
The two PIDs are the same, i.e. sssd_nss didn't die
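
The PID comparison from steps 3 and 8 can be scripted rather than checked by eye. The following is only a minimal sketch, not part of the original reproducer: the helper name responder_status and its output strings are hypothetical, and it assumes that pgrep sssd_nss returns a single PID.

```shell
#!/bin/sh
# Hypothetical helper (not part of sssd): classify the responder state
# from the PID seen before the test (step 3) and after it (step 8).
responder_status() {
    before=$1
    after=$2
    if [ -z "$after" ]; then
        # No sssd_nss process was found after the test.
        echo "not running"
    elif [ "$before" = "$after" ]; then
        # Same PID: the responder survived the death of sssd_be.
        echo "unchanged"
    else
        # Different PID: the responder died and was restarted.
        echo "restarted"
    fi
}

# Intended use around the reproduction steps (commented out because it
# needs a running sssd):
#   pid_before=$(pgrep sssd_nss)
#   ... steps 4 to 7 ...
#   pid_after=$(pgrep sssd_nss)
#   responder_status "$pid_before" "$pid_after"
```

With the bug present the helper would be expected to print "restarted"; with a fixed sssd it should print "unchanged".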

Comment 2 Fedora Update System 2010-11-05 18:34:42 UTC
sssd-1.4.1-1.fc14 has been submitted as an update for Fedora 14.
https://admin.fedoraproject.org/updates/sssd-1.4.1-1.fc14

Comment 3 Fedora Update System 2010-11-06 23:40:59 UTC
sssd-1.4.1-1.fc14 has been pushed to the Fedora 14 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
   su -c 'yum --enablerepo=updates-testing update sssd'
You can provide feedback for this update here: https://admin.fedoraproject.org/updates/sssd-1.4.1-1.fc14

Comment 4 Fedora Update System 2010-11-16 23:19:33 UTC
sssd-1.4.1-1.fc14 has been pushed to the Fedora 14 stable repository.  If problems still persist, please make note of it in this bug report.