Description of problem:
SSSD will intermittently have logins time out and needs to be restarted to recover. It is suspected that a crash is happening.

Version-Release number of selected component (if applicable):
sssd-1.9.2-82.10.el6_4.x86_64

How reproducible:
Intermittent

Steps to Reproduce:
1. Attempt a typical SSSD login.

Actual results:
Occasionally logins will time out.

Expected results:
Successful logins.

Additional info:
I do not know if these messages are useful, but I see this pattern repeated:
~~~
(Thu Feb 6 08:56:56 2014) [sssd[pam]] [sss_dp_issue_request] (0x0400): Issuing request for [0x41b4e0:3:u1escpm@LDAP]
(Thu Feb 6 08:56:56 2014) [sssd[pam]] [sss_dp_get_account_msg] (0x0400): Creating request for [LDAP][3][1][name=u1escpm]
(Thu Feb 6 08:56:56 2014) [sssd[pam]] [sss_dp_internal_get_send] (0x0400): Entering request [0x41b4e0:3:u1escpm@LDAP]
(Thu Feb 6 08:57:51 2014) [sssd[pam]] [sbus_dispatch] (0x0020): Performing auto-reconnect
(Thu Feb 6 08:57:51 2014) [sssd[pam]] [sbus_dispatch] (0x0400): SBUS is reconnecting. Deferring.
(Thu Feb 6 08:57:51 2014) [sssd[pam]] [sbus_dispatch] (0x0400): SBUS is reconnecting. Deferring.
(Thu Feb 6 08:57:51 2014) [sssd[pam]] [sbus_dispatch] (0x0400): SBUS is reconnecting. Deferring.
(Thu Feb 6 08:57:51 2014) [sssd[pam]] [sbus_dispatch] (0x0400): SBUS is reconnecting. Deferring.
(Thu Feb 6 08:57:51 2014) [sssd[pam]] [sbus_dispatch] (0x0400): SBUS is reconnecting. Deferring.
(Thu Feb 6 08:57:51 2014) [sssd[pam]] [sbus_dispatch] (0x0400): SBUS is reconnecting. Deferring.
~~~
*** Bug 1071448 has been marked as a duplicate of this bug. ***
Something strange happened when I created this bug. I have some logs that are too big, so I have to find some place to host them.

--cliff
(In reply to Clifton Coursey from comment #3)
> Something strange happened when I created this bug. I have some logs
> that are too big, so I have to find some place to host them.
>
> --cliff

Actually I did not need to host them; here is the link to the file in SFDC. These should be the logs from debug_level 6/7 in the default and pam sections.

https://api.access.redhat.com/rs/cases/00921473/attachments/8c4ec718-492d-3d11-b716-c9d8ab33fac4

--cliff
I analyzed the logs and it seems sssd_be is getting killed by the monitor:
---------
(Thu Feb 6 05:26:31 2014) [sssd] [mt_svc_sigkill] (0x0010): [LDAP][19055] is not responding to SIGTERM. Sending SIGKILL.
(Thu Feb 6 05:29:50 2014) [sssd] [mt_svc_sigkill] (0x0010): [LDAP][14971] is not responding to SIGTERM. Sending SIGKILL.
(Thu Feb 6 07:40:52 2014) [sssd] [mt_svc_sigkill] (0x0010): [LDAP][2473] is not responding to SIGTERM. Sending SIGKILL.
(Thu Feb 6 07:43:27 2014) [sssd] [mt_svc_sigkill] (0x0010): [LDAP][15650] is not responding to SIGTERM. Sending SIGKILL.
(Thu Feb 6 08:40:45 2014) [sssd] [mt_svc_sigkill] (0x0010): [LDAP][15853] is not responding to SIGTERM. Sending SIGKILL.
(Thu Feb 6 08:47:59 2014) [sssd] [mt_svc_sigkill] (0x0010): [LDAP][30334] is not responding to SIGTERM. Sending SIGKILL.
---------
From the sssd_be logs it seems that sssd_be is getting killed when enumerating a large number of users or groups while a transaction is active.

Please advise the customer to put 'timeout=30' into the domain section of sssd.conf, restart sssd, and retry.
Alternatively, suggest setting enumerate=false in the domain section unless they absolutely need enumeration.
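For reference, the two suggestions above would land in the domain section of /etc/sssd/sssd.conf. A minimal sketch, assuming the domain is named LDAP as in the logs; the provider lines are placeholders, not the customer's actual configuration:

```ini
[domain/LDAP]
id_provider = ldap
auth_provider = ldap

; Give the backend more time to answer the monitor's ping before
; it is sent SIGTERM/SIGKILL (the default is 10 seconds).
timeout = 30

; Or, if enumeration is not strictly required, disable it:
enumerate = false
```

After editing, restart sssd (service sssd restart) for the change to take effect.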
Customer is unable to set enumerate=false as they and their customers depend on tools (mostly homegrown) that rely on calls that use enumeration.

Unfortunately, the sssd logs don't appear to have logged anything during the reported time of the issue (which was between Feb 5 2-3 pm), and sssd appears to have been restarted on Feb 5 15:13. I have a question out to the customer to verify the time windows, but the last thing I see in sssd.log is a PAM error.

(Mon Feb 3 22:10:24 2014) [sssd] [mt_svc_sigkill] (0x0010): [LDAP][8250] is not responding to SIGTERM. Sending SIGKILL.
(Mon Feb 3 22:23:45 2014) [sssd] [mt_svc_sigkill] (0x0010): [pam][21163] is not responding to SIGTERM. Sending SIGKILL.
(Mon Feb 3 22:26:11 2014) [sssd] [mt_svc_exit_handler] (0x0010): Process [pam], definitely stopped!

(After the sssd restart)

(Thu Feb 6 05:26:31 2014) [sssd] [mt_svc_sigkill] (0x0010): [LDAP][19055] is not responding to SIGTERM. Sending SIGKILL.
(Thu Feb 6 05:29:50 2014) [sssd] [mt_svc_sigkill] (0x0010): [LDAP][14971] is not responding to SIGTERM. Sending SIGKILL.
(Thu Feb 6 07:40:52 2014) [sssd] [mt_svc_sigkill] (0x0010): [LDAP][2473] is not responding to SIGTERM. Sending SIGKILL.
(Thu Feb 6 07:43:27 2014) [sssd] [mt_svc_sigkill] (0x0010): [LDAP][15650] is not responding to SIGTERM. Sending SIGKILL.

Just prior to the "Process [pam], definitely stopped!" error we see the following in sssd_pam.log-20140204. I have a question out to the customer to clarify whether there were successful logins at any time between these messages and the observed failed login attempts on 2/5/2014 2-3 pm.
~~~
(Mon Feb 3 22:24:28 2014) [sssd[pam]] [server_setup] (0x0400): CONFDB: /var/lib/sss/db/config.ldb
(Mon Feb 3 22:24:28 2014) [sssd[pam]] [sbus_init_connection] (0x0200): Adding connection 252EF50
(Mon Feb 3 22:24:28 2014) [sssd[pam]] [monitor_common_send_id] (0x0100): Sending ID: (pam,1)
(Mon Feb 3 22:24:28 2014) [sssd[pam]] [sss_names_init] (0x0100): Using re [(?P<name>[^@]+)@?(?P<domain>[^@]*$)].
(Mon Feb 3 22:24:28 2014) [sssd[pam]] [check_file] (0x0400): lstat for [/var/lib/sss/pipes/private/sbus-dp_LDAP] failed: [2][No such file or directory].
(Mon Feb 3 22:24:28 2014) [sssd[pam]] [sbus_client_init] (0x0020): check_file failed for [/var/lib/sss/pipes/private/sbus-dp_LDAP].
(Mon Feb 3 22:24:28 2014) [sssd[pam]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Mon Feb 3 22:24:28 2014) [sssd[pam]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Mon Feb 3 22:24:28 2014) [sssd[pam]] [sss_responder_ctx_destructor] (0x0400): Responder is being shut down
(Mon Feb 3 22:26:02 2014) [sssd[pam]] [server_setup] (0x0400): CONFDB: /var/lib/sss/db/config.ldb
(Mon Feb 3 22:26:02 2014) [sssd[pam]] [sbus_init_connection] (0x0200): Adding connection 1E60F50
(Mon Feb 3 22:26:02 2014) [sssd[pam]] [monitor_common_send_id] (0x0100): Sending ID: (pam,1)
(Mon Feb 3 22:26:02 2014) [sssd[pam]] [sss_names_init] (0x0100): Using re [(?P<name>[^@]+)@?(?P<domain>[^@]*$)].
(Mon Feb 3 22:26:02 2014) [sssd[pam]] [check_file] (0x0400): lstat for [/var/lib/sss/pipes/private/sbus-dp_LDAP] failed: [2][No such file or directory].
(Mon Feb 3 22:26:02 2014) [sssd[pam]] [sbus_client_init] (0x0020): check_file failed for [/var/lib/sss/pipes/private/sbus-dp_LDAP].
(Mon Feb 3 22:26:02 2014) [sssd[pam]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Mon Feb 3 22:26:02 2014) [sssd[pam]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Mon Feb 3 22:26:02 2014) [sssd[pam]] [sss_responder_ctx_destructor] (0x0400): Responder is being shut down
(Mon Feb 3 22:26:05 2014) [sssd[pam]] [server_setup] (0x0400): CONFDB: /var/lib/sss/db/config.ldb
(Mon Feb 3 22:26:05 2014) [sssd[pam]] [sbus_init_connection] (0x0200): Adding connection 706F50
(Mon Feb 3 22:26:05 2014) [sssd[pam]] [monitor_common_send_id] (0x0100): Sending ID: (pam,1)
(Mon Feb 3 22:26:05 2014) [sssd[pam]] [sss_names_init] (0x0100): Using re [(?P<name>[^@]+)@?(?P<domain>[^@]*$)].
(Mon Feb 3 22:26:05 2014) [sssd[pam]] [check_file] (0x0400): lstat for [/var/lib/sss/pipes/private/sbus-dp_LDAP] failed: [2][No such file or directory].
(Mon Feb 3 22:26:05 2014) [sssd[pam]] [sbus_client_init] (0x0020): check_file failed for [/var/lib/sss/pipes/private/sbus-dp_LDAP].
(Mon Feb 3 22:26:05 2014) [sssd[pam]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Mon Feb 3 22:26:05 2014) [sssd[pam]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Mon Feb 3 22:26:05 2014) [sssd[pam]] [sss_responder_ctx_destructor] (0x0400): Responder is being shut down
(Mon Feb 3 22:26:10 2014) [sssd[pam]] [server_setup] (0x0400): CONFDB: /var/lib/sss/db/config.ldb
(Mon Feb 3 22:26:10 2014) [sssd[pam]] [sbus_init_connection] (0x0200): Adding connection 1AF9F50
(Mon Feb 3 22:26:10 2014) [sssd[pam]] [monitor_common_send_id] (0x0100): Sending ID: (pam,1)
(Mon Feb 3 22:26:10 2014) [sssd[pam]] [sss_names_init] (0x0100): Using re [(?P<name>[^@]+)@?(?P<domain>[^@]*$)].
(Mon Feb 3 22:26:10 2014) [sssd[pam]] [check_file] (0x0400): lstat for [/var/lib/sss/pipes/private/sbus-dp_LDAP] failed: [2][No such file or directory].
(Mon Feb 3 22:26:10 2014) [sssd[pam]] [sbus_client_init] (0x0020): check_file failed for [/var/lib/sss/pipes/private/sbus-dp_LDAP].
(Mon Feb 3 22:26:10 2014) [sssd[pam]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
(Mon Feb 3 22:26:10 2014) [sssd[pam]] [sss_process_init] (0x0010): fatal error setting up backend connector
(Mon Feb 3 22:26:10 2014) [sssd[pam]] [sss_responder_ctx_destructor] (0x0400): Responder is being shut down
~~~
The file /var/lib/sss/pipes/private/sbus-dp_LDAP is a UNIX pipe that the PAM process uses to communicate with the DP process. What you see in the logs is that sssd_be was terminated and sssd_pam attempted to reconnect, which failed because sssd_be did not re-create the pipe.

I'm not really sure what happened; the backend log file (sssd_LDAP.log-20140204.gz) has its last entry at "Mon Feb 3 22:24:09", while the PAM responder goes south at "22:24:28". You can see that "server_setup" is called, which is the very first function the process calls.

For the time being, I would recommend raising the timeout parameter in the domain section to prevent the backend from being restarted in the first place. Some users have also symlinked the cache to /dev/shm, but that makes the cache non-persistent, as /dev/shm is a ramdisk.
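To illustrate the failure mode in the logs above: the responder essentially lstat()s the backend's communication endpoint before connecting, and gets ENOENT because the dead backend never re-created it. A minimal Python sketch of that idea; the helper name and exact logic are illustrative, not SSSD's actual check_file implementation:

```python
import os
import stat

def check_sbus_path(path):
    """lstat the communication endpoint, as the responder does before
    connecting. Returns False if the path is missing (the ENOENT case
    seen in the logs) or is not a special file (socket/FIFO)."""
    try:
        st = os.lstat(path)
    except FileNotFoundError:
        # This corresponds to the logged failure:
        # "lstat for [...] failed: [2][No such file or directory]"
        return False
    # A regular file at this path would also be wrong.
    return stat.S_ISSOCK(st.st_mode) or stat.S_ISFIFO(st.st_mode)
```

When this check fails, the responder gives up with "Failed to connect to monitor services" and shuts down, which matches the repeated server_setup / check_file / shutdown cycle in the PAM log.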
Ping Clifton, did raising the timeout help?
I spoke with the customer a few minutes ago, and they have not yet noticed any recurrence of the issue in the population of servers with the increased timeout. However, they also haven't seen it in the population of servers that *did not* increase the timeout. They've previously seen it go a few weeks between occurrences, so they're giving it two more weeks before their confidence in the increased timeout rises.
Hi Dylan, did the bug ever occur again for the customer?
Jakub, I just spoke with the customer. As of now, there haven't been any recurrences of the bug in the population of servers that had the timeout increased a little over a month ago. (There also haven't been any recurrences in the population that did *not* have the timeout increased, however.) They're going to increase the timeout on a larger population of servers. At this point, however, there's no further information to share from the customer.
Upstream ticket: https://fedorahosted.org/sssd/ticket/1729
We're seeing this same issue intermittently at OpenNMS on our servers connecting to a remote FreeIPA host. Clients are Debian (Wheezy) and CentOS6 for the most part.
There were no further comments on this bugzilla for quite some time, so I'm closing it as WORKSFORME. Kindly reopen if you can reproduce with a recent version.
Oh, one more important detail -- 7.3 performance changes would make this scenario less probable even with the default timeout setting. But the performance enhancements are better tracked separately.