Bug 805729

Summary: sssd_nss dies and does not get restarted properly
Product: Red Hat Enterprise Linux 6 Reporter: Johann Agnarsson <jagnarsson>
Component: sssdAssignee: Stephen Gallagher <sgallagh>
Status: CLOSED ERRATA QA Contact: IDM QE LIST <seceng-idm-qe-list>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.2CC: grajaiya, jgalipea, kbanerje, prc
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: sssd-1.8.0-12.el6 Doc Type: Bug Fix
Doc Text:
No documentation required
Story Points: ---
Clone Of:
: 950156 (view as bug list) Environment:
Last Closed: 2012-06-20 11:56:25 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 950156    

Description Johann Agnarsson 2012-03-21 22:45:38 UTC
Description of problem:
When under heavy load, sssd experiences a timeout and nss doesn't start back up properly.

Version-Release number of selected component (if applicable):
1.5.1-66.el6_2.3

How reproducible:
Eventually, always.

Steps to Reproduce:
1. Set a low timeout in domain (timeout = 1)
2. Introduce a high amount of load on machine (disk IO/CPU)
3. Keep doing user lookups until sssd_nss dies
  
Actual results:
sssd_nss doesn't get restarted properly and eventually the sssd and sssd_pam processes die off.
getent passwd does not list any AD users.

Expected results:
sssd_nss gets restarted properly and sssd continues to function.
getent passwd lists our AD users.

Additional info:

While trying to reproduce the bug, I've seen a couple of different error messages:
(/var/log/sssd/sssd_nss.log)
(Wed Mar 21 15:04:01 2012) [sssd[nss]] [sss_dp_send_acct_req_create] (0): D-BUS send failed.
(Wed Mar 21 15:04:15 2012) [sssd[nss]] [nss_dp_reconnect_init] (0): Could not reconnect to my.domain.com provider.

and 
(/var/log/sssd/sssd_nss.log)
(Wed Mar 21 13:01:39 2012) [sssd[nss]] [set_unix_socket] (0): Unable to bind on socket '/var/lib/sss/pipes/nss'
(Wed Mar 21 13:01:39 2012) [sssd[nss]] [sss_process_init] (0): fatal error initializing socket

(/var/log/sssd/sssd.log)
(Wed Mar 21 13:01:40 2012) [sssd] [global_checks_handler] (0): Unknown child (457) did exit
(Wed Mar 21 13:01:41 2012) [sssd] [global_checks_handler] (0): Unknown child (7567) did exit
(Wed Mar 21 13:01:42 2012) [sssd] [global_checks_handler] (0): Unknown child (7921) did exit
(Wed Mar 21 14:54:45 2012) [sssd] [monitor_quit] (0): Monitor received Terminated: terminating children
(Wed Mar 21 14:58:29 2012) [sssd] [global_checks_handler] (0): Unknown child (14474) did exit
(Wed Mar 21 14:58:46 2012) [sssd] [global_checks_handler] (0): Unknown child (5079) did exit
(Wed Mar 21 15:04:02 2012) [sssd] [global_checks_handler] (0): Unknown child (6042) did exit
(Wed Mar 21 15:04:44 2012) [sssd] [global_checks_handler] (0): Unknown child (27162) did exit
(Wed Mar 21 15:05:01 2012) [sssd] [global_checks_handler] (0): Unknown child (27675) did exit
(Wed Mar 21 15:05:18 2012) [sssd] [global_checks_handler] (0): Unknown child (27764) did exit
(Wed Mar 21 15:05:36 2012) [sssd] [svc_try_restart] (0): Process [my.domain.com], definitely stopped!

(/var/log/sssd/sssd_my.domain.com.log)

(Wed Mar 21 15:05:33 2012) [sssd[be[my.domain.com]]] [server_setup] (3): CONFDB: /var/lib/sss/db/config.ldb
(Wed Mar 21 15:05:33 2012) [sssd[be[my.domain.com]]] [recreate_ares_channel] (4): Initializing new c-ares channel
(Wed Mar 21 15:05:33 2012) [sssd[be[my.domain.com]]] [be_fo_get_options] (7): Lookup order: ipv4_first
(Wed Mar 21 15:05:33 2012) [sssd[be[my.domain.com]]] [fo_context_init] (3): Created new fail over context, retry timeout is 30
(Wed Mar 21 15:05:33 2012) [sssd[be[my.domain.com]]] [sysdb_domain_init_internal] (5): DB File for my.domain.com: /var/lib/sss/db/cache_my.domain.com.ldb
(Wed Mar 21 15:06:46 2012) [sssd[be[my.domain.com]]] [ldb] (9): commit ldb transaction (nesting: 0)
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x1612b40
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x1612bf0
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Destroying timer event 0x1612bf0 "ltdb_timeout"
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Ending timer event 0x1612b40 "ltdb_callback"
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): trying to load memberof from /usr/lib64/ldb/memberof.so
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (6): asq: Unable to register control with rootdse!
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x1626510

(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x1610b00

(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Destroying timer event 0x1610b00 "ltdb_timeout"

(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Ending timer event 0x1626510 "ltdb_callback"

(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x1611f30

(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x1612050

(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Destroying timer event 0x1612050 "ltdb_timeout"

(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Ending timer event 0x1611f30 "ltdb_callback"

(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sbus_new_connection] (1): Failed to open connection: name=org.freedesktop.DBus.Error.NoServer, message=Failed to connect to socket /var/lib/sss/pipes/private/sbus-monitor: Connection refused
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [mon_cli_init] (0): Failed to connect to monitor services.
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [be_process_init] (0): fatal error setting up monitor bus
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [main] (0): Could not initialize backend [5]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_get_users_process] (9): Saving 332 Users - Done
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [enum_users_op_done] (4): Users higher USN value: [56263675]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_get_generic_step] (6): calling ldap_search_ext with [(&(objectclass=group)(Name=*)(&(gidNumber=*)(!(gidNumber=0))))][OU=Groups,OU=Development,DC=my,DC=domain,DC=com].
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_get_generic_step] (7): Requesting attrs: [objectClass]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_get_generic_step] (7): Requesting attrs: [Name]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_get_generic_step] (7): Requesting attrs: [userPassword]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_get_generic_step] (7): Requesting attrs: [gidNumber]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_get_generic_step] (7): Requesting attrs: [member]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_get_generic_step] (7): Requesting attrs: [nsUniqueId]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_get_generic_step] (7): Requesting attrs: [modifyTimestamp]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_get_generic_step] (7): Requesting attrs: [uSNChanged]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_get_generic_step] (8): ldap_search_ext called, msgid = 4
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [remove_krb5_info_files] (5): Could not remove [/var/lib/sss/pubconf/kdcinfo.MY.DOMAIN.COM], [2][No such file or directory]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [remove_krb5_info_files] (5): Could not remove [/var/lib/sss/pubconf/kpasswdinfo.MY.DOMAIN.COM], [2][No such file or directory]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_id_op_destroy] (9): releasing operation connection
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_handle_release] (8): Trace: sh[0x1cfd210], connected[1], ops[(nil)], ldap[0x1cfd310], destructor_lock[0], release_memory[0]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [remove_connection_callback] (9): Successfully removed connection callback.
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sbus_remove_watch] (8): 0x1ced800/0x1ced9a0
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [remove_socket_symlink] (9): The symlink points to [/var/lib/sss/pipes/private/sbus-dp_my.domain.com.27892]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [remove_socket_symlink] (9): The path including our pid is [/var/lib/sss/pipes/private/sbus-dp_my.domain.com.27892]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [remove_socket_symlink] (9): Removed the symlink
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [be_client_destructor] (4): Removed PAM client
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [be_client_destructor] (4): Removed NSS client

(/var/log/messages)
Mar 21 15:04:01 IRV6A-407 sssd[be[my.domain.com]]: Shutting down
Mar 21 15:04:21 IRV6A-407 sssd[be[my.domain.com]]: Starting up
Mar 21 15:04:42 IRV6A-407 sssd[be[my.domain.com]]: Starting up
Mar 21 15:04:43 IRV6A-407 sssd[be[my.domain.com]]: Shutting down
Mar 21 15:04:59 IRV6A-407 sssd[be[my.domain.com]]: Starting up
Mar 21 15:05:00 IRV6A-407 sssd[be[my.domain.com]]: Shutting down
Mar 21 15:05:17 IRV6A-407 sssd[be[my.domain.com]]: Starting up
Mar 21 15:05:17 IRV6A-407 sssd[be[my.domain.com]]: Shutting down
Mar 21 15:05:33 IRV6A-407 sssd[be[my.domain.com]]: Starting up
Mar 21 15:05:37 IRV6A-407 sssd[nss]: Shutting down
Mar 21 15:05:37 IRV6A-407 sssd[pam]: Shutting down
Mar 21 15:06:59 IRV6A-407 sssd[be[my.domain.com]]: Shutting down

I realize this is very very similar to another bug (https://bugzilla.redhat.com/show_bug.cgi?id=740501)

bhis only happens on 6.2 with sssd-1.5.1-66 and sssd-1.5.1-66_2_3, 
but it does not happen on our 6.1 machines with sssd-1.5.1-34.

I will gladly provide any additional logs and info if needed.

Comment 2 Stephen Gallagher 2012-03-22 12:11:03 UTC
The monitor code was rewritten in SSSD 1.8, which is going to be included in RHEL 6.3. This issue should go away then.

Please test this when the RHEL 6.3 beta becomes available.

Comment 4 Kaushik Banerjee 2012-05-07 10:16:02 UTC
Verified with sssd-1.8.0-25


Tested with the following scenarios:
1. Introduce a high amount of load + set timeout=1 + keep doing user lookups.
sssd backend got restarted and I was still able to lookup users after the restart.

2. Manually killing sssd_nss process.
The process started back again and I was able to enumerate/lookup users again.

Comment 5 Stephen Gallagher 2012-06-12 13:47:44 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
No documentation required

Comment 7 errata-xmlrpc 2012-06-20 11:56:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0747.html