Bug 783138

Summary: Backend occasionally goes offline under heavy load.
Product: Red Hat Enterprise Linux 6 Reporter: Kaushik Banerjee <kbanerje>
Component: sssdAssignee: Stephen Gallagher <sgallagh>
Status: CLOSED ERRATA QA Contact: IDM QE LIST <seceng-idm-qe-list>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.2CC: grajaiya, jgalipea, jhrozek, prc, samuel.kielek
Target Milestone: rcKeywords: Reopened
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: sssd-1.8.0-5.el6 Doc Type: Bug Fix
Doc Text:
No documentation needed
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-06-20 11:51:28 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Kaushik Banerjee 2012-01-19 14:02:54 UTC
Description of problem:
Backend(ldap auth provider) occasionally goes offline under heavy load.

Version-Release number of selected component (if applicable):
sssd-1.5.1-66.el6_2.3

How reproducible:
Occasionally

Steps to Reproduce:
1. sssd.conf domain section:
debug_level = 9
id_provider = ldap
ldap_uri = ldap://openldapsrv.example.com
ldap_search_base = dc=example,dc=com
ldap_tls_cacert = /etc/openldap/cacerts/server.pem

2. Use the reproducer from https://bugzilla.redhat.com/show_bug.cgi?id=725281 and run 15 concurrent instances of it.
  
Actual results:
Auth fails occasionally.

Expected results:
Auth should succeed.

Additional info:
1. This issue was noticed while verifying bug #782443

2. Domain log:

(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (6):
calling ldap_search_ext with
[(&(memberuid=kau1)(objectclass=posixGroup)(cn=*)(&(gidNumber=*)(!(gidNumber=0))))][dc=example,dc=com].
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (7):
Requesting attrs: [objectClass]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (7):
Requesting attrs: [cn]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (7):
Requesting attrs: [userPassword]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (7):
Requesting attrs: [gidNumber]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (7):
Requesting attrs: [memberuid]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (7):
Requesting attrs: [modifyTimestamp]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (7):
Requesting attrs: [modifyTimestamp]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (8):
ldap_search_ext called, msgid = 2092
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_process_result] (8):
Trace: sh[0x1e28b20], connected[1], ops[0x1ef6680], ldap[0x1e28d00]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_process_result] (8):
Trace: ldap_result found nothing!
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [auth_bind_user_done] (9):
Found ppolicy data, assuming LDAP password policies are active.
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_handle_release] (8):
Trace: sh[0x205de50], connected[1], ops[(nil)], ldap[0x204f7f0],
destructor_lock[0], release_memory[0]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [remove_connection_callback]
(9): Successfully removed connection callback.
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_mark_offline] (8): Going
offline!
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_run_offline_cb] (3):
Going offline. Running callbacks.
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback]
(4): Backend returned: (1, 9, <NULL>) [Provider is Offline
(Authentication service cannot retrieve authentication info)]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback]
(4): Sending result [9][LDAP]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback]
(4): Sent result [9][LDAP]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_process_result] (8):
Trace: sh[0x1fb8340], connected[1], ops[0x1f0f950], ldap[0x1f87d90]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [simple_bind_done] (5):
Server returned no controls.
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [simple_bind_done] (3): Bind
result: Success(0), (null)
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_handle_release] (8):
Trace: sh[0x1fb8340], connected[1], ops[(nil)], ldap[0x1f87d90],
destructor_lock[0], release_memory[0]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [remove_connection_callback]
(9): Successfully removed connection callback.
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback]
(4): Backend returned: (0, 0, <NULL>) [Success]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback]
(4): Sending result [0][LDAP]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback]
(4): Sent result [0][LDAP]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_process_result] (8):
Trace: sh[0x1e28b20], connected[1], ops[0x1ef6680], ldap[0x1e28d00]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_parse_entry] (9):
OriginalDN: [cn=kau1_grp1,dc=example,dc=com].
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_process_result] (8): Trace: sh[0x1e28b20], connected[1], ops[0x1ef6680], ldap[0x1e28d00]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_done] (6): Search result: Success(0), (null)
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_done] (7): Total count [0]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x20339b0

(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x1fb81d0

(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [ldb] (9): tevent: Destroying timer event 0x1fb81d0 "ltdb_timeout"

(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [ldb] (9): tevent: Ending timer event 0x20339b0 "ltdb_callback"

(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_initgr_common_store] (8): Updating memberships for kau1
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [ldb] (9): start ldb transaction (nesting: 0)
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [ldb] (9): commit ldb transaction (nesting: 0)
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_initgr_done] (9): Initgroups done
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_id_op_connect_step] (9): reusing cached connection
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_id_op_destroy] (9): releasing operation connection
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_id_op_done] (9): releasing operation connection
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [acctinfo_callback] (4): Request processed. Returned 0,0,Success
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_process_result] (8): Trace: sh[0x1e28b20], connected[1], ops[(nil)], ldap[0x1e28d00]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_process_result] (8): Trace: ldap_result found nothing!
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sbus_dispatch] (9): dbus conn: 1E23F20
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sbus_dispatch] (9): Dispatching.
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sbus_message_handler] (9): Received SBUS method [pamHandler]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler] (4): Got request with the following data
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): command: PAM_AUTHENTICATE
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): domain: LDAP
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): user: kau1
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): service: passwd
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): tty:
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): ruser:
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): rhost:
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): authtok type: 1
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): authtok size: 9
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): newauthtok type: 0
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): newauthtok size: 0
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): priv: 0
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): cli_pid: 18804
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_pam_auth_handler] (4): Backend is marked offline, retry later!
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback] (4): Backend returned: (1, 9, <NULL>) [Provider is Offline (Authentication service cannot retrieve authentication info)]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback] (4): Sending result [9][LDAP]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback] (4): Sent result [9][LDAP]

Comment 2 Stephen Gallagher 2012-01-19 16:22:40 UTC
Upstream ticket:
https://fedorahosted.org/sssd/ticket/1147

Comment 3 Jakub Hrozek 2012-02-22 19:30:50 UTC
I've been triaging this issue with Kaushik's help. Neither of us was able to reproduce the issue with 1.8, only with 1.5.x. Considering the amount of changes between 1.5 and 1.8, we are closing the ticket as closed in the current release.

In case the bug hits us again, here are the symptoms I was seeing with 1.5:

I instrumented SSSD code with getsockname() calls to be able to cross-reference the client machine port with logs on the LDAP server. The LDAP server logs indicated that the bind operation completed successfully, but SSSD never called the _done() callback for the operation. After the hardcoded sdap_op timeout passed, the bind operation timed out, marked the whole request with an error and marked the provider as offline.

Another suspicious thing was the *after* the request that later timed out was issued, several others were started and completed successfully.

Comment 6 Jakub Hrozek 2012-04-03 18:12:23 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
No documentation needed

Comment 7 Kaushik Banerjee 2012-04-05 07:42:29 UTC
This was observed in 1.5.x. Works fine in sssd version 1.8.x

Verified in version:
# rpm -qi sssd | head
Name        : sssd                         Relocations: (not relocatable)
Version     : 1.8.0                             Vendor: Red Hat, Inc.
Release     : 20.el6                        Build Date: Fri 30 Mar 2012 06:45:57 PM IST
Install Date: Mon 02 Apr 2012 05:36:37 PM IST      Build Host: x86-002.build.bos.redhat.com
Group       : Applications/System           Source RPM: sssd-1.8.0-20.el6.src.rpm
Size        : 7865577                          License: GPLv3+
Signature   : (none)
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
URL         : http://fedorahosted.org/sssd/
Summary     : System Security Services Daemon

Comment 9 errata-xmlrpc 2012-06-20 11:51:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0747.html