RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 805729 - sssd_nss dies and does not get restarted properly
Summary: sssd_nss dies and does not get restarted properly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: sssd
Version: 6.2
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: rc
: ---
Assignee: Stephen Gallagher
QA Contact: IDM QE LIST
URL:
Whiteboard:
Depends On:
Blocks: 950156
TreeView+ depends on / blocked
 
Reported: 2012-03-21 22:45 UTC by Johann Agnarsson
Modified: 2013-04-09 18:36 UTC (History)
4 users (show)

Fixed In Version: sssd-1.8.0-12.el6
Doc Type: Bug Fix
Doc Text:
No documentation required
Clone Of:
: 950156 (view as bug list)
Environment:
Last Closed: 2012-06-20 11:56:25 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2012:0747 0 normal SHIPPED_LIVE sssd bug fix and enhancement update 2012-06-19 19:31:43 UTC

Description Johann Agnarsson 2012-03-21 22:45:38 UTC
Description of problem:
When under heavy load, sssd experiences a timeout and nss doesn't start back up properly.

Version-Release number of selected component (if applicable):
1.5.1-66.el6_2.3

How reproducible:
Eventually, always.

Steps to Reproduce:
1. Set a low timeout in domain (timeout = 1)
2. Introduce a high amount of load on machine (disk IO/CPU)
3. Keep doing user lookups until sssd_nss dies
  
Actual results:
sssd_nss doesn't get restarted properly and eventually the sssd and sssd_pam processes die off.
getent passwd does not list any AD users.

Expected results:
sssd_nss gets restarted properly and sssd continues to function.
getent passwd lists our AD users.

Additional info:

While trying to reproduce the bug, I've seen a couple of different error messages:
(/var/log/sssd/sssd_nss.log)
(Wed Mar 21 15:04:01 2012) [sssd[nss]] [sss_dp_send_acct_req_create] (0): D-BUS send failed.
(Wed Mar 21 15:04:15 2012) [sssd[nss]] [nss_dp_reconnect_init] (0): Could not reconnect to my.domain.com provider.

and 
(/var/log/sssd/sssd_nss.log)
(Wed Mar 21 13:01:39 2012) [sssd[nss]] [set_unix_socket] (0): Unable to bind on socket '/var/lib/sss/pipes/nss'
(Wed Mar 21 13:01:39 2012) [sssd[nss]] [sss_process_init] (0): fatal error initializing socket

(/var/log/sssd/sssd.log)
(Wed Mar 21 13:01:40 2012) [sssd] [global_checks_handler] (0): Unknown child (457) did exit
(Wed Mar 21 13:01:41 2012) [sssd] [global_checks_handler] (0): Unknown child (7567) did exit
(Wed Mar 21 13:01:42 2012) [sssd] [global_checks_handler] (0): Unknown child (7921) did exit
(Wed Mar 21 14:54:45 2012) [sssd] [monitor_quit] (0): Monitor received Terminated: terminating children
(Wed Mar 21 14:58:29 2012) [sssd] [global_checks_handler] (0): Unknown child (14474) did exit
(Wed Mar 21 14:58:46 2012) [sssd] [global_checks_handler] (0): Unknown child (5079) did exit
(Wed Mar 21 15:04:02 2012) [sssd] [global_checks_handler] (0): Unknown child (6042) did exit
(Wed Mar 21 15:04:44 2012) [sssd] [global_checks_handler] (0): Unknown child (27162) did exit
(Wed Mar 21 15:05:01 2012) [sssd] [global_checks_handler] (0): Unknown child (27675) did exit
(Wed Mar 21 15:05:18 2012) [sssd] [global_checks_handler] (0): Unknown child (27764) did exit
(Wed Mar 21 15:05:36 2012) [sssd] [svc_try_restart] (0): Process [my.domain.com], definitely stopped!

(/var/log/sssd/sssd_my.domain.com.log)

(Wed Mar 21 15:05:33 2012) [sssd[be[my.domain.com]]] [server_setup] (3): CONFDB: /var/lib/sss/db/config.ldb
(Wed Mar 21 15:05:33 2012) [sssd[be[my.domain.com]]] [recreate_ares_channel] (4): Initializing new c-ares channel
(Wed Mar 21 15:05:33 2012) [sssd[be[my.domain.com]]] [be_fo_get_options] (7): Lookup order: ipv4_first
(Wed Mar 21 15:05:33 2012) [sssd[be[my.domain.com]]] [fo_context_init] (3): Created new fail over context, retry timeout is 30
(Wed Mar 21 15:05:33 2012) [sssd[be[my.domain.com]]] [sysdb_domain_init_internal] (5): DB File for my.domain.com: /var/lib/sss/db/cache_my.domain.com.ldb
(Wed Mar 21 15:06:46 2012) [sssd[be[my.domain.com]]] [ldb] (9): commit ldb transaction (nesting: 0)
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x1612b40
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x1612bf0
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Destroying timer event 0x1612bf0 "ltdb_timeout"
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Ending timer event 0x1612b40 "ltdb_callback"
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): trying to load memberof from /usr/lib64/ldb/memberof.so
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (6): asq: Unable to register control with rootdse!
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x1626510

(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x1610b00

(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Destroying timer event 0x1610b00 "ltdb_timeout"

(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Ending timer event 0x1626510 "ltdb_callback"

(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Added timed event "ltdb_callback": 0x1611f30

(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Added timed event "ltdb_timeout": 0x1612050

(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Destroying timer event 0x1612050 "ltdb_timeout"

(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [ldb] (9): tevent: Ending timer event 0x1611f30 "ltdb_callback"

(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sbus_new_connection] (1): Failed to open connection: name=org.freedesktop.DBus.Error.NoServer, message=Failed to connect to socket /var/lib/sss/pipes/private/sbus-monitor: Connection refused
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [mon_cli_init] (0): Failed to connect to monitor services.
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [be_process_init] (0): fatal error setting up monitor bus
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [main] (0): Could not initialize backend [5]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_get_users_process] (9): Saving 332 Users - Done
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [enum_users_op_done] (4): Users higher USN value: [56263675]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_get_generic_step] (6): calling ldap_search_ext with [(&(objectclass=group)(Name=*)(&(gidNumber=*)(!(gidNumber=0))))][OU=Groups,OU=Development,DC=my,DC=domain,DC=com].
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_get_generic_step] (7): Requesting attrs: [objectClass]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_get_generic_step] (7): Requesting attrs: [Name]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_get_generic_step] (7): Requesting attrs: [userPassword]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_get_generic_step] (7): Requesting attrs: [gidNumber]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_get_generic_step] (7): Requesting attrs: [member]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_get_generic_step] (7): Requesting attrs: [nsUniqueId]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_get_generic_step] (7): Requesting attrs: [modifyTimestamp]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_get_generic_step] (7): Requesting attrs: [uSNChanged]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_get_generic_step] (8): ldap_search_ext called, msgid = 4
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [remove_krb5_info_files] (5): Could not remove [/var/lib/sss/pubconf/kdcinfo.MY.DOMAIN.COM], [2][No such file or directory]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [remove_krb5_info_files] (5): Could not remove [/var/lib/sss/pubconf/kpasswdinfo.MY.DOMAIN.COM], [2][No such file or directory]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_id_op_destroy] (9): releasing operation connection
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sdap_handle_release] (8): Trace: sh[0x1cfd210], connected[1], ops[(nil)], ldap[0x1cfd310], destructor_lock[0], release_memory[0]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [remove_connection_callback] (9): Successfully removed connection callback.
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [sbus_remove_watch] (8): 0x1ced800/0x1ced9a0
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [remove_socket_symlink] (9): The symlink points to [/var/lib/sss/pipes/private/sbus-dp_my.domain.com.27892]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [remove_socket_symlink] (9): The path including our pid is [/var/lib/sss/pipes/private/sbus-dp_my.domain.com.27892]
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [remove_socket_symlink] (9): Removed the symlink
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [be_client_destructor] (4): Removed PAM client
(Wed Mar 21 15:06:59 2012) [sssd[be[my.domain.com]]] [be_client_destructor] (4): Removed NSS client

(/var/log/messages)
Mar 21 15:04:01 IRV6A-407 sssd[be[my.domain.com]]: Shutting down
Mar 21 15:04:21 IRV6A-407 sssd[be[my.domain.com]]: Starting up
Mar 21 15:04:42 IRV6A-407 sssd[be[my.domain.com]]: Starting up
Mar 21 15:04:43 IRV6A-407 sssd[be[my.domain.com]]: Shutting down
Mar 21 15:04:59 IRV6A-407 sssd[be[my.domain.com]]: Starting up
Mar 21 15:05:00 IRV6A-407 sssd[be[my.domain.com]]: Shutting down
Mar 21 15:05:17 IRV6A-407 sssd[be[my.domain.com]]: Starting up
Mar 21 15:05:17 IRV6A-407 sssd[be[my.domain.com]]: Shutting down
Mar 21 15:05:33 IRV6A-407 sssd[be[my.domain.com]]: Starting up
Mar 21 15:05:37 IRV6A-407 sssd[nss]: Shutting down
Mar 21 15:05:37 IRV6A-407 sssd[pam]: Shutting down
Mar 21 15:06:59 IRV6A-407 sssd[be[my.domain.com]]: Shutting down

I realize this is very very similar to another bug (https://bugzilla.redhat.com/show_bug.cgi?id=740501)

bhis only happens on 6.2 with sssd-1.5.1-66 and sssd-1.5.1-66_2_3, 
but it does not happen on our 6.1 machines with sssd-1.5.1-34.

I will gladly provide any additional logs and info if needed.

Comment 2 Stephen Gallagher 2012-03-22 12:11:03 UTC
The monitor code was rewritten in SSSD 1.8, which is going to be included in RHEL 6.3. This issue should go away then.

Please test this when the RHEL 6.3 beta becomes available.

Comment 4 Kaushik Banerjee 2012-05-07 10:16:02 UTC
Verified with sssd-1.8.0-25


Tested with the following scenarios:
1. Introduce a high amount of load + set timeout=1 + keep doing user lookups.
sssd backend got restarted and I was still able to lookup users after the restart.

2. Manually killing sssd_nss process.
The process started back again and I was able to enumerate/lookup users again.

Comment 5 Stephen Gallagher 2012-06-12 13:47:44 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
No documentation required

Comment 7 errata-xmlrpc 2012-06-20 11:56:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0747.html


Note You need to log in before you can comment on or make changes to this bug.