Description of problem:
SSSD will intermittently have logins time out and needs to be restarted to recover. It is suspected that a crash is happening.

Version-Release number of selected component (if applicable):
sssd-1.9.2-82.10.el6_4.x86_64

How reproducible:
Intermittent

Steps to Reproduce:
1. Attempt a typical SSSD login.

Actual results:
Occasionally logins will time out.

Expected results:
Successful logins.

Additional info:
I do not know if these messages are useful, but I see this pattern repeated:
~~~
(Thu Feb 6 08:56:56 2014) [sssd[pam]] [sss_dp_issue_request] (0x0400): Issuing request for [0x41b4e0:3:u1escpm@LDAP]
(Thu Feb 6 08:56:56 2014) [sssd[pam]] [sss_dp_get_account_msg] (0x0400): Creating request for [LDAP][3][1][name=u1escpm]
(Thu Feb 6 08:56:56 2014) [sssd[pam]] [sss_dp_internal_get_send] (0x0400): Entering request [0x41b4e0:3:u1escpm@LDAP]
(Thu Feb 6 08:57:51 2014) [sssd[pam]] [sbus_dispatch] (0x0020): Performing auto-reconnect
(Thu Feb 6 08:57:51 2014) [sssd[pam]] [sbus_dispatch] (0x0400): SBUS is reconnecting. Deferring.
(Thu Feb 6 08:57:51 2014) [sssd[pam]] [sbus_dispatch] (0x0400): SBUS is reconnecting. Deferring.
(Thu Feb 6 08:57:51 2014) [sssd[pam]] [sbus_dispatch] (0x0400): SBUS is reconnecting. Deferring.
(Thu Feb 6 08:57:51 2014) [sssd[pam]] [sbus_dispatch] (0x0400): SBUS is reconnecting. Deferring.
(Thu Feb 6 08:57:51 2014) [sssd[pam]] [sbus_dispatch] (0x0400): SBUS is reconnecting. Deferring.
(Thu Feb 6 08:57:51 2014) [sssd[pam]] [sbus_dispatch] (0x0400): SBUS is reconnecting. Deferring.
~~~
*** Bug 1071448 has been marked as a duplicate of this bug. ***
Something strange happened when I created this bug. I have some logs that are too big, so I have to find some place to host them.

--cliff
(In reply to Clifton Coursey from comment #3)
> Something strange happened when I created this bug. I have some logs
> that are too big, so I have to find some place to host them.
>
> --cliff

Actually I did not need to host them; here is the link to the file in SFDC. These should be the logs from debug_level 6/7 in the default and pam sections.

https://api.access.redhat.com/rs/cases/00921473/attachments/8c4ec718-492d-3d11-b716-c9d8ab33fac4

--cliff
I analyzed the logs and it seems sssd_be is getting killed by the monitor:
---------
(Thu Feb 6 05:26:31 2014) [sssd] [mt_svc_sigkill] (0x0010): [LDAP][19055] is not responding to SIGTERM. Sending SIGKILL.
(Thu Feb 6 05:29:50 2014) [sssd] [mt_svc_sigkill] (0x0010): [LDAP][14971] is not responding to SIGTERM. Sending SIGKILL.
(Thu Feb 6 07:40:52 2014) [sssd] [mt_svc_sigkill] (0x0010): [LDAP][2473] is not responding to SIGTERM. Sending SIGKILL.
(Thu Feb 6 07:43:27 2014) [sssd] [mt_svc_sigkill] (0x0010): [LDAP][15650] is not responding to SIGTERM. Sending SIGKILL.
(Thu Feb 6 08:40:45 2014) [sssd] [mt_svc_sigkill] (0x0010): [LDAP][15853] is not responding to SIGTERM. Sending SIGKILL.
(Thu Feb 6 08:47:59 2014) [sssd] [mt_svc_sigkill] (0x0010): [LDAP][30334] is not responding to SIGTERM. Sending SIGKILL.
---------
From the sssd_be logs it seems that sssd_be is getting killed when enumerating a large number of users or groups while a transaction is active.

Please advise the customer to put 'timeout=30' into the domain section of sssd.conf, restart sssd, and retry.
Alternatively, suggest setting enumerate=false in the domain section unless they absolutely need enumeration.
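For reference, the two suggestions above would land in the domain section of /etc/sssd/sssd.conf. A minimal sketch, assuming the domain is named LDAP as in the logs; the provider lines are placeholders, not the customer's actual configuration:

```ini
[domain/LDAP]
id_provider = ldap
auth_provider = ldap

; Give the backend more time to answer the monitor's ping before
; it is sent SIGTERM/SIGKILL (the default is 10 seconds).
timeout = 30

; Or, if enumeration is not strictly required, disable it:
enumerate = false
```

After editing, restart sssd (service sssd restart) for the change to take effect.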
Customer is unable to set enumerate=false as they and their customers depend on tools (mostly homegrown) that rely on calls that use enumeration.

Unfortunately, the sssd logs don't appear to have logged anything during the reported time of the issue (which was between Feb 5 2-3 pm), and sssd appears to have been restarted on Feb 5 15:13. I have a question out to the customer to verify the time windows, but the last thing I see in sssd.log is a PAM error.

(Mon Feb 3 22:10:24 2014) [sssd] [mt_svc_sigkill] (0x0010): [LDAP][8250] is not responding to SIGTERM. Sending SIGKILL.
(Mon Feb 3 22:23:45 2014) [sssd] [mt_svc_sigkill] (0x0010): [pam][21163] is not responding to SIGTERM. Sending SIGKILL.
(Mon Feb 3 22:26:11 2014) [sssd] [mt_svc_exit_handler] (0x0010): Process [pam], definitely stopped!

(After the sssd restart)

(Thu Feb 6 05:26:31 2014) [sssd] [mt_svc_sigkill] (0x0010): [LDAP][19055] is not responding to SIGTERM. Sending SIGKILL.
(Thu Feb 6 05:29:50 2014) [sssd] [mt_svc_sigkill] (0x0010): [LDAP][14971] is not responding to SIGTERM. Sending SIGKILL.
(Thu Feb 6 07:40:52 2014) [sssd] [mt_svc_sigkill] (0x0010): [LDAP][2473] is not responding to SIGTERM. Sending SIGKILL.
(Thu Feb 6 07:43:27 2014) [sssd] [mt_svc_sigkill] (0x0010): [LDAP][15650] is not responding to SIGTERM. Sending SIGKILL.

Just prior to the "Process [pam], definitely stopped!" error we see the following in sssd_pam.log-20140204. I have a question out to the customer to clarify whether there were successful logins at any time between these messages and the observed failed login attempts on 2/5/2014 2-3 pm.
~~~
(Mon Feb 3 22:24:28 2014) [sssd[pam]] [server_setup] (0x0400): CONFDB: /var/lib/sss/db/config.ldb
(Mon Feb 3 22:24:28 2014) [sssd[pam]] [sbus_init_connection] (0x0200): Adding connection 252EF50
(Mon Feb 3 22:24:28 2014) [sssd[pam]] [monitor_common_send_id] (0x0100): Sending ID: (pam,1)
(Mon Feb 3 22:24:28 2014) [sssd[pam]] [sss_names_init] (0x0100): Using re [(?P<name>[^@]+)@?(?P<domain>[^@]*$)].
(Mon Feb 3 22:24:28 2014) [sssd[pam]] [check_file] (0x0400): lstat for [/var/lib/sss/pipes/private/sbus-dp_LDAP] failed: [2][No such file or directory].
(Mon Feb 3 22:24:28 2014) [sssd[pam]] [sbus_client_init] (0x0020): check_file failed for [/var/lib/sss/pipes/private/sbus-dp_LDAP].
(Mon Feb 3 22:24:28 2014) [sssd[pam]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Mon Feb 3 22:24:28 2014) [sssd[pam]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Mon Feb 3 22:24:28 2014) [sssd[pam]] [sss_responder_ctx_destructor] (0x0400): Responder is being shut down
(Mon Feb 3 22:26:02 2014) [sssd[pam]] [server_setup] (0x0400): CONFDB: /var/lib/sss/db/config.ldb
(Mon Feb 3 22:26:02 2014) [sssd[pam]] [sbus_init_connection] (0x0200): Adding connection 1E60F50
(Mon Feb 3 22:26:02 2014) [sssd[pam]] [monitor_common_send_id] (0x0100): Sending ID: (pam,1)
(Mon Feb 3 22:26:02 2014) [sssd[pam]] [sss_names_init] (0x0100): Using re [(?P<name>[^@]+)@?(?P<domain>[^@]*$)].
(Mon Feb 3 22:26:02 2014) [sssd[pam]] [check_file] (0x0400): lstat for [/var/lib/sss/pipes/private/sbus-dp_LDAP] failed: [2][No such file or directory].
(Mon Feb 3 22:26:02 2014) [sssd[pam]] [sbus_client_init] (0x0020): check_file failed for [/var/lib/sss/pipes/private/sbus-dp_LDAP].
(Mon Feb 3 22:26:02 2014) [sssd[pam]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Mon Feb 3 22:26:02 2014) [sssd[pam]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Mon Feb 3 22:26:02 2014) [sssd[pam]] [sss_responder_ctx_destructor] (0x0400): Responder is being shut down
(Mon Feb 3 22:26:05 2014) [sssd[pam]] [server_setup] (0x0400): CONFDB: /var/lib/sss/db/config.ldb
(Mon Feb 3 22:26:05 2014) [sssd[pam]] [sbus_init_connection] (0x0200): Adding connection 706F50
(Mon Feb 3 22:26:05 2014) [sssd[pam]] [monitor_common_send_id] (0x0100): Sending ID: (pam,1)
(Mon Feb 3 22:26:05 2014) [sssd[pam]] [sss_names_init] (0x0100): Using re [(?P<name>[^@]+)@?(?P<domain>[^@]*$)].
(Mon Feb 3 22:26:05 2014) [sssd[pam]] [check_file] (0x0400): lstat for [/var/lib/sss/pipes/private/sbus-dp_LDAP] failed: [2][No such file or directory].
(Mon Feb 3 22:26:05 2014) [sssd[pam]] [sbus_client_init] (0x0020): check_file failed for [/var/lib/sss/pipes/private/sbus-dp_LDAP].
(Mon Feb 3 22:26:05 2014) [sssd[pam]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Mon Feb 3 22:26:05 2014) [sssd[pam]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Mon Feb 3 22:26:05 2014) [sssd[pam]] [sss_responder_ctx_destructor] (0x0400): Responder is being shut down
(Mon Feb 3 22:26:10 2014) [sssd[pam]] [server_setup] (0x0400): CONFDB: /var/lib/sss/db/config.ldb
(Mon Feb 3 22:26:10 2014) [sssd[pam]] [sbus_init_connection] (0x0200): Adding connection 1AF9F50
(Mon Feb 3 22:26:10 2014) [sssd[pam]] [monitor_common_send_id] (0x0100): Sending ID: (pam,1)
(Mon Feb 3 22:26:10 2014) [sssd[pam]] [sss_names_init] (0x0100): Using re [(?P<name>[^@]+)@?(?P<domain>[^@]*$)].
(Mon Feb 3 22:26:10 2014) [sssd[pam]] [check_file] (0x0400): lstat for [/var/lib/sss/pipes/private/sbus-dp_LDAP] failed: [2][No such file or directory].
(Mon Feb 3 22:26:10 2014) [sssd[pam]] [sbus_client_init] (0x0020): check_file failed for [/var/lib/sss/pipes/private/sbus-dp_LDAP].
(Mon Feb 3 22:26:10 2014) [sssd[pam]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Mon Feb 3 22:26:10 2014) [sssd[pam]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Mon Feb 3 22:26:10 2014) [sssd[pam]] [sss_responder_ctx_destructor] (0x0400): Responder is being shut down
~~~
The file /var/lib/sss/pipes/private/sbus-dp_LDAP is a UNIX pipe that the PAM process uses to communicate with the DP process. What you see in the logs is that sssd_be was terminated and sssd_pam attempted to reconnect, which failed because sssd_be did not re-create the pipe.

I'm not really sure what happened; the backend log file (sssd_LDAP.log-20140204.gz) has its last entry at "Mon Feb 3 22:24:09", while the PAM responder goes south at "22:24:28". You can see that "server_setup" is called, which is the very first function the process calls.

For the time being, I would recommend raising the timeout parameter in the domain section to prevent the backend from being restarted in the first place. Some users have also symlinked the cache to /dev/shm, but that makes the cache non-persistent, as /dev/shm is a ramdisk.
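To illustrate the failure mode in the logs above: the responder essentially lstat()s the backend's communication endpoint before connecting, and gets ENOENT because the dead backend never re-created it. A minimal Python sketch of that idea; the helper name and exact logic are illustrative, not SSSD's actual check_file implementation:

```python
import os
import stat

def check_sbus_path(path):
    """lstat the communication endpoint, as the responder does before
    connecting. Returns False if the path is missing (the ENOENT case
    seen in the logs) or is not a special file (socket/FIFO)."""
    try:
        st = os.lstat(path)
    except FileNotFoundError:
        # This corresponds to the logged failure:
        # "lstat for [...] failed: [2][No such file or directory]"
        return False
    # A regular file at this path would also be wrong.
    return stat.S_ISSOCK(st.st_mode) or stat.S_ISFIFO(st.st_mode)
```

When this check fails, the responder gives up with "Failed to connect to monitor services" and shuts down, which matches the repeated server_setup / check_file / shutdown cycle in the PAM log.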
Ping Clifton, did raising the timeout help?
I spoke with the customer a few minutes ago, and they have not yet noticed any recurrence of the issue in the population of servers with the increased timeout. However, they also haven't seen it in the population of servers that *did not* increase the timeout. They've previously seen it go a few weeks between occurrences, so they're giving it two more weeks before their confidence in the increased timeout rises.
Hi Dylan, did the bug ever occur again for the customer?
Jakub, I just spoke with the customer. As of now, there haven't been any recurrences of the bug in the population of servers that had the timeout increased a little over a month ago. (There also haven't been any recurrences in the population that did *not* have the timeout increased, however.) They're going to increase the timeout on a larger population of servers. At this point, however, there's no further information to share from the customer.
Upstream ticket: https://fedorahosted.org/sssd/ticket/1729
We're seeing this same issue intermittently at OpenNMS on our servers connecting to a remote FreeIPA host. Clients are Debian (Wheezy) and CentOS6 for the most part.
There were no further comments on this bugzilla for quite some time, so I'm closing it as WORKSFORME. Kindly reopen if you can reproduce with a recent version.
Oh, one more important detail -- 7.3 performance changes would make this scenario less probable even with the default timeout setting. But the performance enhancements are better tracked separately.