Bug 783138 - Backend occasionally goes offline under heavy load.
Summary: Backend occasionally goes offline under heavy load.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: sssd
Version: 6.2
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: rc
: ---
Assignee: Stephen Gallagher
QA Contact: IDM QE LIST
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2012-01-19 14:02 UTC by Kaushik Banerjee
Modified: 2020-05-02 16:43 UTC (History)
5 users (show)

Fixed In Version: sssd-1.8.0-5.el6
Doc Type: Bug Fix
Doc Text:
No documentation needed
Clone Of:
Environment:
Last Closed: 2012-06-20 11:51:28 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github SSSD sssd issues 2189 0 None closed LDAP Backend occasionally goes offline under heavy load. 2021-02-02 12:47:45 UTC
Red Hat Product Errata RHBA-2012:0747 0 normal SHIPPED_LIVE sssd bug fix and enhancement update 2012-06-19 19:31:43 UTC

Description Kaushik Banerjee 2012-01-19 14:02:54 UTC
Description of problem:
Backend(ldap auth provider) occasionally goes offline under heavy load.

Version-Release number of selected component (if applicable):
sssd-1.5.1-66.el6_2.3

How reproducible:
Occasionally

Steps to Reproduce:
1. sssd.conf domain section:
debug_level = 9
id_provider = ldap
ldap_uri = ldap://openldapsrv.example.com
ldap_search_base = dc=example,dc=com
ldap_tls_cacert = /etc/openldap/cacerts/server.pem

2. Use the reproducer from https://bugzilla.redhat.com/show_bug.cgi?id=725281 and run 15 concurrent instances of it.
  
Actual results:
Auth fails occasionally.

Expected results:
Auth should succeed.

Additional info:
1. This issue was noticed while verifying bug #782443

2. Domain log:

(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (6):
calling ldap_search_ext with
[(&(memberuid=kau1)(objectclass=posixGroup)(cn=*)(&(gidNumber=*)(!(gidNumber=0))))][dc=example,dc=com].
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (7):
Requesting attrs: [objectClass]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (7):
Requesting attrs: [cn]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (7):
Requesting attrs: [userPassword]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (7):
Requesting attrs: [gidNumber]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (7):
Requesting attrs: [memberuid]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (7):
Requesting attrs: [modifyTimestamp]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (7):
Requesting attrs: [modifyTimestamp]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (8):
ldap_search_ext called, msgid = 2092
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_process_result] (8):
Trace: sh[0x1e28b20], connected[1], ops[0x1ef6680], ldap[0x1e28d00]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_process_result] (8):
Trace: ldap_result found nothing!
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [auth_bind_user_done] (9):
Found ppolicy data, assuming LDAP password policies are active.
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_handle_release] (8):
Trace: sh[0x205de50], connected[1], ops[(nil)], ldap[0x204f7f0],
destructor_lock[0], release_memory[0]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [remove_connection_callback]
(9): Successfully removed connection callback.
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_mark_offline] (8): Going
offline!
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_run_offline_cb] (3):
Going offline. Running callbacks.
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback]
(4): Backend returned: (1, 9, <NULL>) [Provider is Offline
(Authentication service cannot retrieve authentication info)]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback]
(4): Sending result [9][LDAP]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback]
(4): Sent result [9][LDAP]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_process_result] (8):
Trace: sh[0x1fb8340], connected[1], ops[0x1f0f950], ldap[0x1f87d90]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [simple_bind_done] (5):
Server returned no controls.
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [simple_bind_done] (3): Bind
result: Success(0), (null)
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_handle_release] (8):
Trace: sh[0x1fb8340], connected[1], ops[(nil)], ldap[0x1f87d90],
destructor_lock[0], release_memory[0]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [remove_connection_callback]
(9): Successfully removed connection callback.
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback]
(4): Backend returned: (0, 0, <NULL>) [Success]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback]
(4): Sending result [0][LDAP]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback]
(4): Sent result [0][LDAP]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_process_result] (8):
Trace: sh[0x1e28b20], connected[1], ops[0x1ef6680], ldap[0x1e28d00]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_parse_entry] (9):
OriginalDN: [cn=kau1_grp1,dc=example,dc=com].
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_process_result] (8): Trace: sh[0x1e28b20], connected[1], ops[0x1ef6680], ldap[0x1e28d00]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_done] (6): Search result: Success(0), (null)
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_done] (7): Total count [0]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x20339b0

(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x1fb81d0

(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [ldb] (9): tevent: Destroying timer event 0x1fb81d0 "ltdb_timeout"

(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [ldb] (9): tevent: Ending timer event 0x20339b0 "ltdb_callback"

(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_initgr_common_store] (8): Updating memberships for kau1
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [ldb] (9): start ldb transaction (nesting: 0)
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [ldb] (9): commit ldb transaction (nesting: 0)
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_initgr_done] (9): Initgroups done
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_id_op_connect_step] (9): reusing cached connection
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_id_op_destroy] (9): releasing operation connection
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_id_op_done] (9): releasing operation connection
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [acctinfo_callback] (4): Request processed. Returned 0,0,Success
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_process_result] (8): Trace: sh[0x1e28b20], connected[1], ops[(nil)], ldap[0x1e28d00]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_process_result] (8): Trace: ldap_result found nothing!
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sbus_dispatch] (9): dbus conn: 1E23F20
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sbus_dispatch] (9): Dispatching.
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sbus_message_handler] (9): Received SBUS method [pamHandler]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler] (4): Got request with the following data
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): command: PAM_AUTHENTICATE
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): domain: LDAP
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): user: kau1
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): service: passwd
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): tty:
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): ruser:
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): rhost:
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): authtok type: 1
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): authtok size: 9
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): newauthtok type: 0
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): newauthtok size: 0
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): priv: 0
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): cli_pid: 18804
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_pam_auth_handler] (4): Backend is marked offline, retry later!
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback] (4): Backend returned: (1, 9, <NULL>) [Provider is Offline (Authentication service cannot retrieve authentication info)]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback] (4): Sending result [9][LDAP]
(Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback] (4): Sent result [9][LDAP]

Comment 2 Stephen Gallagher 2012-01-19 16:22:40 UTC
Upstream ticket:
https://fedorahosted.org/sssd/ticket/1147

Comment 3 Jakub Hrozek 2012-02-22 19:30:50 UTC
I've been triaging this issue with Kaushik's help. Neither of us was able to reproduce the issue with 1.8, only with 1.5.x. Considering the amount of changes between 1.5 and 1.8, we are closing the ticket as closed in the current release.

In case the bug hits us again, here are the symptoms I was seeing with 1.5:

I instrumented SSSD code with getsockname() calls to be able to cross-reference the client machine port with logs on the LDAP server. The LDAP server logs indicated that the bind operation completed successfully, but SSSD never called the _done() callback for the operation. After the hardcoded sdap_op timeout passed, the bind operation timed out, marked the whole request with an error and marked the provider as offline.

Another suspicious thing was the *after* the request that later timed out was issued, several others were started and completed successfully.

Comment 6 Jakub Hrozek 2012-04-03 18:12:23 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
No documentation needed

Comment 7 Kaushik Banerjee 2012-04-05 07:42:29 UTC
This was observed in 1.5.x. Works fine in sssd version 1.8.x

Verified in version:
# rpm -qi sssd | head
Name        : sssd                         Relocations: (not relocatable)
Version     : 1.8.0                             Vendor: Red Hat, Inc.
Release     : 20.el6                        Build Date: Fri 30 Mar 2012 06:45:57 PM IST
Install Date: Mon 02 Apr 2012 05:36:37 PM IST      Build Host: x86-002.build.bos.redhat.com
Group       : Applications/System           Source RPM: sssd-1.8.0-20.el6.src.rpm
Size        : 7865577                          License: GPLv3+
Signature   : (none)
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
URL         : http://fedorahosted.org/sssd/
Summary     : System Security Services Daemon

Comment 9 errata-xmlrpc 2012-06-20 11:51:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0747.html


Note You need to log in before you can comment on or make changes to this bug.