Hide Forgot
Description of problem: Backend(ldap auth provider) occasionally goes offline under heavy load. Version-Release number of selected component (if applicable): sssd-1.5.1-66.el6_2.3 How reproducible: Occasionally Steps to Reproduce: 1. sssd.conf domain section: debug_level = 9 id_provider = ldap ldap_uri = ldap://openldapsrv.example.com ldap_search_base = dc=example,dc=com ldap_tls_cacert = /etc/openldap/cacerts/server.pem 2. Use the reproducer from https://bugzilla.redhat.com/show_bug.cgi?id=725281 and run 15 concurrent instances of it. Actual results: Auth fails occasionally. Expected results: Auth should succeed. Additional info: 1. This issue was noticed while verifying bug #782443 2. Domain log: (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (6): calling ldap_search_ext with [(&(memberuid=kau1)(objectclass=posixGroup)(cn=*)(&(gidNumber=*)(!(gidNumber=0))))][dc=example,dc=com]. (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (7): Requesting attrs: [objectClass] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (7): Requesting attrs: [cn] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (7): Requesting attrs: [userPassword] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (7): Requesting attrs: [gidNumber] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (7): Requesting attrs: [memberuid] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (7): Requesting attrs: [modifyTimestamp] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (7): Requesting attrs: [modifyTimestamp] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_step] (8): ldap_search_ext called, msgid = 2092 (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_process_result] (8): Trace: sh[0x1e28b20], connected[1], ops[0x1ef6680], ldap[0x1e28d00] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_process_result] (8): Trace: ldap_result found nothing! (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [auth_bind_user_done] (9): Found ppolicy data, assuming LDAP password policies are active. (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_handle_release] (8): Trace: sh[0x205de50], connected[1], ops[(nil)], ldap[0x204f7f0], destructor_lock[0], release_memory[0] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [remove_connection_callback] (9): Successfully removed connection callback. (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_mark_offline] (8): Going offline! (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_run_offline_cb] (3): Going offline. Running callbacks. (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback] (4): Backend returned: (1, 9, <NULL>) [Provider is Offline (Authentication service cannot retrieve authentication info)] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback] (4): Sending result [9][LDAP] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback] (4): Sent result [9][LDAP] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_process_result] (8): Trace: sh[0x1fb8340], connected[1], ops[0x1f0f950], ldap[0x1f87d90] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [simple_bind_done] (5): Server returned no controls. (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [simple_bind_done] (3): Bind result: Success(0), (null) (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_handle_release] (8): Trace: sh[0x1fb8340], connected[1], ops[(nil)], ldap[0x1f87d90], destructor_lock[0], release_memory[0] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [remove_connection_callback] (9): Successfully removed connection callback. (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback] (4): Backend returned: (0, 0, <NULL>) [Success] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback] (4): Sending result [0][LDAP] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback] (4): Sent result [0][LDAP] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_process_result] (8): Trace: sh[0x1e28b20], connected[1], ops[0x1ef6680], ldap[0x1e28d00] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_parse_entry] (9): OriginalDN: [cn=kau1_grp1,dc=example,dc=com]. (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_process_result] (8): Trace: sh[0x1e28b20], connected[1], ops[0x1ef6680], ldap[0x1e28d00] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_done] (6): Search result: Success(0), (null) (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_generic_done] (7): Total count [0] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x20339b0 (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x1fb81d0 (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [ldb] (9): tevent: Destroying timer event 0x1fb81d0 "ltdb_timeout" (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [ldb] (9): tevent: Ending timer event 0x20339b0 "ltdb_callback" (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_initgr_common_store] (8): Updating memberships for kau1 (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [ldb] (9): start ldb transaction (nesting: 0) (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [ldb] (9): commit ldb transaction (nesting: 0) (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_get_initgr_done] (9): Initgroups done (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_id_op_connect_step] (9): reusing cached connection (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_id_op_destroy] (9): releasing operation connection (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_id_op_done] (9): releasing operation connection (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [acctinfo_callback] (4): Request processed. Returned 0,0,Success (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_process_result] (8): Trace: sh[0x1e28b20], connected[1], ops[(nil)], ldap[0x1e28d00] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_process_result] (8): Trace: ldap_result found nothing! (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sbus_dispatch] (9): dbus conn: 1E23F20 (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sbus_dispatch] (9): Dispatching. (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sbus_message_handler] (9): Received SBUS method [pamHandler] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler] (4): Got request with the following data (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): command: PAM_AUTHENTICATE (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): domain: LDAP (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): user: kau1 (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): service: passwd (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): tty: (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): ruser: (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): rhost: (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): authtok type: 1 (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): authtok size: 9 (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): newauthtok type: 0 (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): newauthtok size: 0 (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): priv: 0 (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [pam_print_data] (4): cli_pid: 18804 (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [sdap_pam_auth_handler] (4): Backend is marked offline, retry later! (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback] (4): Backend returned: (1, 9, <NULL>) [Provider is Offline (Authentication service cannot retrieve authentication info)] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback] (4): Sending result [9][LDAP] (Thu Jan 19 18:15:44 2012) [sssd[be[LDAP]]] [be_pam_handler_callback] (4): Sent result [9][LDAP]
Upstream ticket: https://fedorahosted.org/sssd/ticket/1147
I've been triaging this issue with Kaushik's help. Neither of us was able to reproduce the issue with 1.8, only with 1.5.x. Considering the amount of changes between 1.5 and 1.8, we are closing the ticket as closed in the current release. In case the bug hits us again, here are the symptoms I was seeing with 1.5: I instrumented SSSD code with getsockname() calls to be able to cross-reference the client machine port with logs on the LDAP server. The LDAP server logs indicated that the bind operation completed successfully, but SSSD never called the _done() callback for the operation. After the hardcoded sdap_op timeout passed, the bind operation timed out, marked the whole request with an error and marked the provider as offline. Another suspicious thing was the *after* the request that later timed out was issued, several others were started and completed successfully.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: No documentation needed
This was observed in 1.5.x. Works fine in sssd version 1.8.x Verified in version: # rpm -qi sssd | head Name : sssd Relocations: (not relocatable) Version : 1.8.0 Vendor: Red Hat, Inc. Release : 20.el6 Build Date: Fri 30 Mar 2012 06:45:57 PM IST Install Date: Mon 02 Apr 2012 05:36:37 PM IST Build Host: x86-002.build.bos.redhat.com Group : Applications/System Source RPM: sssd-1.8.0-20.el6.src.rpm Size : 7865577 License: GPLv3+ Signature : (none) Packager : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla> URL : http://fedorahosted.org/sssd/ Summary : System Security Services Daemon
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2012-0747.html